Openfire Java VM issues - 99% CPU

I’ve been having serious problems with my Openfire 3.5.2 install (Red Hat) for a couple of weeks now. I’ve posted a few questions, but they seem to have drawn a blank. I’ve closed the original questions, as I think I was on the wrong track, and opened this one in the hope that someone can assist.

The problem in summary - Java sporadically uses 95%+ CPU, performance falls on the floor, and no new users can log in. Then, for no apparent reason, CPU drops right back down to normal (approx 5%) and everything is OK again. Java Memory is set to 512MB, the server itself has 2GB RAM available and a 2GHz CPU, and there is an absolute maximum of 75 users logged in via Spark at any one time. I have followed the guidelines in the JVM Settings & Debugging document on this site, setting the initial heap to 64MB and the maximum heap to 512MB; I added these parameters to OPENFIRE_OPTS in /etc/sysconfig/openfire. When CPU is at the max, Java Memory in the Openfire web console is still sitting at around 10% - it doesn’t appear to be filling up at all.
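For reference, the line in /etc/sysconfig/openfire ends up looking something like this (standard -Xms/-Xmx JVM flags - the exact values are just what I’m using, so adjust to suit your own box):

```sh
# /etc/sysconfig/openfire
# -Xms = initial heap, -Xmx = maximum heap
OPENFIRE_OPTS="-Xms64m -Xmx512m"
```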

I then traced the individual threads within the Java process that were using up the CPU. I did this by first getting the thread IDs using the “top” command in Red Hat, converting the values to hex, running kill -3 on the Openfire PID, and looking at the output in /opt/openfire/logs/nohup.out. From there I was able to see that, in this case, three threads were using up all the CPU. They are all “client-x” connections, and they all look very different to the “client-x” connections that are behaving normally - output below, the “bad” thread first:
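For anyone who wants to repeat the tracing steps, the commands were roughly as follows (the PID 1234 and thread ID 5104 are just examples - substitute the values from your own box):

```sh
# show per-thread CPU usage inside the Openfire Java process (example PID 1234)
top -H -p 1234

# convert the decimal thread ID reported by top to hex, e.g. 5104 -> 13f0
printf '%x\n' 5104

# send SIGQUIT to the Java process to dump all thread stacks to nohup.out
kill -3 1234

# find the stack whose nid matches the hex thread ID
grep -A 20 'nid=0x13f0' /opt/openfire/logs/nohup.out
```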

    "client-13" daemon prio=10 tid=0x09e6f800 nid=0x13f0 in Object.wait() [0x8f69b000…0x8f69c130]
       java.lang.Thread.State: TIMED_WAITING (on object monitor)
            at java.lang.Object.wait(Native Method)
            at com.sun.jndi.ldap.Connection.readReply(Unknown Source)
            - locked <0x915125e0> (a com.sun.jndi.ldap.LdapRequest)
            at com.sun.jndi.ldap.LdapClient.getSearchReply(Unknown Source)
            at com.sun.jndi.ldap.LdapClient.search(Unknown Source)
            at com.sun.jndi.ldap.LdapCtx.doSearch(Unknown Source)
            at com.sun.jndi.ldap.LdapCtx.searchAux(Unknown Source)
            at com.sun.jndi.ldap.LdapCtx.c_search(Unknown Source)
            at com.sun.jndi.toolkit.ctx.ComponentDirContext.p_search(Unknown Source)
            at com.sun.jndi.toolkit.ctx.PartialCompositeDirContext.search(Unknown Source)
            at com.sun.jndi.toolkit.ctx.PartialCompositeDirContext.search(Unknown Source)
            at javax.naming.directory.InitialDirContext.search(Unknown Source)
            at org.jivesoftware.openfire.ldap.LdapGroupProvider.populateGroups(LdapGroupProvider.java:671)
            at org.jivesoftware.openfire.ldap.LdapGroupProvider.getGroup(LdapGroupProvider.java:100)
            at org.jivesoftware.openfire.group.GroupManager.getGroup(GroupManager.java:205)
            - locked <0xb2d144b8> (a java.lang.String)
            at org.jivesoftware.openfire.group.GroupCollection$UserIterator.getNextElement(GroupCollection.java:103)
            at org.jivesoftware.openfire.group.GroupCollection$UserIterator.hasNext(GroupCollection.java:66)
            at org.jivesoftware.openfire.roster.RosterManager.hasMutualVisibility(RosterManager.java:877)
            at org.jivesoftware.openfire.roster.Roster.<init>(Roster.java:144)
            at org.jivesoftware.openfire.roster.RosterManager.getRoster(RosterManager.java:86)
            - locked <0xb291b7e8> (a java.lang.String)
            at org.jivesoftware.openfire.user.User.getRoster(User.java:369)
            at org.jivesoftware.openfire.handler.IQRosterHandler.manageRoster(IQRosterHandler.java:201)
            at org.jivesoftware.openfire.handler.IQRosterHandler.handleIQ(IQRosterHandler.java:106)
            at org.jivesoftware.openfire.handler.IQHandler.process(IQHandler.java:49)
            at org.jivesoftware.openfire.IQRouter.handle(IQRouter.java:349)
            at org.jivesoftware.openfire.IQRouter.route(IQRouter.java:101)
            at org.jivesoftware.openfire.spi.PacketRouterImpl.route(PacketRouterImpl.java:68)
            at org.jivesoftware.openfire.net.StanzaHandler.processIQ(StanzaHandler.java:311)
            at org.jivesoftware.openfire.net.ClientStanzaHandler.processIQ(ClientStanzaHandler.java:79)
            at org.jivesoftware.openfire.net.StanzaHandler.process(StanzaHandler.java:276)
            at org.jivesoftware.openfire.net.StanzaHandler.process(StanzaHandler.java:175)
            at org.jivesoftware.openfire.nio.ConnectionHandler.messageReceived(ConnectionHandler.java:133)
            at org.apache.mina.common.support.AbstractIoFilterChain$TailFilter.messageReceived(AbstractIoFilterChain.java:570)
            at org.apache.mina.common.support.AbstractIoFilterChain.callNextMessageReceived(AbstractIoFilterChain.java:299)
            at org.apache.mina.common.support.AbstractIoFilterChain.access$1100(AbstractIoFilterChain.java:53)
            at org.apache.mina.common.support.AbstractIoFilterChain$EntryImpl$1.messageReceived(AbstractIoFilterChain.java:648)
            at org.apache.mina.common.IoFilterAdapter.messageReceived(IoFilterAdapter.java:80)
            at org.apache.mina.common.support.AbstractIoFilterChain.callNextMessageReceived(AbstractIoFilterChain.java:299)
            at org.apache.mina.common.support.AbstractIoFilterChain.access$1100(AbstractIoFilterChain.java:53)
            at org.apache.mina.common.support.AbstractIoFilterChain$EntryImpl$1.messageReceived(AbstractIoFilterChain.java:648)
            at org.apache.mina.filter.codec.support.SimpleProtocolDecoderOutput.flush(SimpleProtocolDecoderOutput.java:58)
            at org.apache.mina.filter.codec.ProtocolCodecFilter.messageReceived(ProtocolCodecFilter.java:185)
            at org.apache.mina.common.support.AbstractIoFilterChain.callNextMessageReceived(AbstractIoFilterChain.java:299)
            at org.apache.mina.common.support.AbstractIoFilterChain.access$1100(AbstractIoFilterChain.java:53)
            at org.apache.mina.common.support.AbstractIoFilterChain$EntryImpl$1.messageReceived(AbstractIoFilterChain.java:648)
            at org.apache.mina.filter.executor.ExecutorFilter.processEvent(ExecutorFilter.java:239)
            at org.apache.mina.filter.executor.ExecutorFilter$ProcessEventsRunnable.run(ExecutorFilter.java:283)
            at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(Unknown Source)
            at java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
            at org.apache.mina.util.NamePreservingRunnable.run(NamePreservingRunnable.java:51)
            at java.lang.Thread.run(Unknown Source)

And here’s one of the normal ones…

    "client-12" daemon prio=10 tid=0x0a895800 nid=0x13ef waiting on condition [0x8f6bc000…0x8f6bcdb0]
       java.lang.Thread.State: WAITING (parking)
            at sun.misc.Unsafe.park(Native Method)
            - parking to wait for <0x947dad28> (a java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject)
            at java.util.concurrent.locks.LockSupport.park(Unknown Source)
            at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(Unknown Source)
            at java.util.concurrent.LinkedBlockingQueue.take(Unknown Source)
            at java.util.concurrent.ThreadPoolExecutor.getTask(Unknown Source)
            at java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
            at org.apache.mina.util.NamePreservingRunnable.run(NamePreservingRunnable.java:51)
            at java.lang.Thread.run(Unknown Source)

The only other piece of info I have is a hunch that the bad threads are client connections that have timed out in some way. I have a feeling they could relate to Spark sessions that have been left idle for a period of time - for example, a user who has gone away for a few days but left his PC logged in and Spark still active. This is just a hunch - I don’t have any direct evidence for it.

So - has anyone got any ideas from this new info? Is there a way perhaps that I can tell which user “client-13” actually is? I would really appreciate one of the experts taking a look at this as it’s really hurting us - and people are losing confidence in the product - which is a shame as it’s potentially so powerful.

Nick

Hi Nick,

I’m not an expert on reading trace logs, but the top few lines seem to indicate that threads are timing out with a TIMED_WAITING state and aren’t freeing themselves up. The cause seems to be:

            at com.sun.jndi.ldap.Connection.readReply(Unknown Source)
            - locked <0x915125e0> (a com.sun.jndi.ldap.LdapRequest)
            at com.sun.jndi.ldap.LdapClient.getSearchReply(Unknown Source)

which seems to indicate that it’s something to do with communication with the LDAP server - it might be that the LDAP server is down, unreachable, or something else along those lines.
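One way to sanity-check that side of things would be to run an ldapsearch from the Openfire box against the same server and base DN that openfire.xml points at - something along these lines (the host, bind account and base DN below are placeholders, not your real values):

```sh
# simple-bind group search against the same LDAP server Openfire uses
# (example host and DNs - substitute the values from your openfire.xml)
ldapsearch -x -H ldap://dc1.example.com:389 \
  -D "openfire-svc@example.com" -W \
  -b "dc=example,dc=com" \
  "(objectClass=group)" cn
```

If that hangs or times out, the problem is between the server and LDAP rather than inside Openfire itself.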

I found some other info that might help, though none of it seems to provide a solution either:

http://forums.sun.com/thread.jspa?messageID=10298964

http://forums.sun.com/thread.jspa?threadID=5250678&messageID=10036697

http://forge.novell.com/pipermail/extjndi-dev/2004-November/000032.html

http://java.sun.com/j2se/1.5.0/docs/api/java/lang/Thread.State.html

http://forums.sun.com/thread.jspa?messageID=10176508

Unfortunately I haven’t played around with LDAP/AD SSO integration, so I don’t know what to actually expect.

Cheers.

Cheers Puff, I think you’re on the right track… I don’t want to speak too soon, but last night I made a change to my LDAP filters in openfire.xml, as a number of people had pointed out the LDAP problems in the trace logs. Basically, I massively narrowed down the number of LDAP groups that Openfire has to search and process - from about 5000 to, er, 8… :slight_smile: This morning, with 50 users now logged on, the CPU is behaving perfectly. I’ll monitor throughout the day, but it’s looking good - if it holds up, I’ll post the actual changes I made later on in case they help anybody else.

Nick

OK, this is solved. As expected from the last few days, it was LDAP searches / lookups that were hammering the CPU. My installation has to use the domain as the base DN, as we have users and groups scattered all over the place. Our AD is also a mess, to say the least. So without an LDAP filter on the groups, Openfire was having to deal with thousands of groups, some of which contained various dubious objects, some of which were empty, some of which contained groups within groups etc.

To fix it, I created fresh groups for Openfire with a standard naming structure - “Openfire Shared Groups - Department Name”. Only users were added to these groups, no nested groups. I then put an LDAP filter in openfire.xml -

(&(objectClass=group)(cn=Openfire*))
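In openfire.xml that goes in as the group search filter under the <ldap> section - something like the snippet below (the other <ldap> settings such as host and baseDN are omitted here, and note that the & has to be escaped as &amp; inside the XML):

```xml
<!-- inside the <ldap> section of openfire.xml; other settings (host, baseDN, etc.) omitted -->
<ldap>
  <!-- only pick up AD groups whose name starts with "Openfire" -->
  <groupSearchFilter>(&amp;(objectClass=group)(cn=Openfire*))</groupSearchFilter>
</ldap>
```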

On restart of the Openfire service, this had an immediate effect. Average CPU was down from 80%+ to 5%.
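The restart itself was just the usual init script (assuming the stock RPM install):

```sh
# restart Openfire so it picks up the openfire.xml change
service openfire restart
```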

So the moral of this thread - either tidy up your AD, or use LDAP filters!

Nick

PS - creating new groups for Openfire in AD is not that bad a task. The way I did it for large groups that already existed was just to type the group name into Outlook, expand it (including all nested groups), and then copy and paste the users into a new group in ADUC.