I’ve been having serious problems with my Openfire 3.5.2 install (Red Hat) for a couple of weeks now. I’ve posted a few questions, but they seem to have drawn a blank. I’ve closed the original questions, as I think I was on the wrong track, and opened this one in the hope that someone can assist.
The problem, in summary: Java sporadically uses 95%+ CPU, performance falls through the floor, and no new users can log in. Then, for no apparent reason, CPU drops right back down to normal (around 5%) and everything is fine again. The Java heap is capped at 512MB, the server itself has 2GB RAM and a 2GHz CPU, and there is an absolute maximum of 75 users logged in via Spark at any one time. I have followed the guidelines in the JVM Settings & Debugging document on this site, setting the initial heap to 64MB and the maximum heap to 512MB, and added these parameters to OPENFIRE_OPTS in /etc/sysconfig/openfire. Even when CPU is maxed out, Java memory in the Openfire web console sits at around 10% - it doesn’t appear to be filling up at all.
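For reference, this is what the relevant line in /etc/sysconfig/openfire looks like - the flags are just the standard JVM heap switches, so I assume this is the right way to pass them:

```sh
# /etc/sysconfig/openfire
# Standard JVM heap flags: 64MB initial heap, 512MB maximum heap.
OPENFIRE_OPTS="-Xms64m -Xmx512m"
```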
I then traced the individual threads within the Java process that were using up the CPU. I did this by first getting the per-thread IDs using the “top” command in Red Hat, converting those values to hex, running kill -3 on the Openfire PID, and matching the hex IDs against the “nid=” values in the thread dump written to /opt/openfire/logs/nohup.out. From there I was able to see that, in this case, three threads were using up all the CPU%. They are all “client-x” connections, and they all look very different from the “client-x” connections that are behaving normally. A rough sketch of the commands I used is below, followed by the dumps themselves, the “bad” thread first.
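Roughly what I ran, for anyone who wants to reproduce - the PID (2345 here) and thread IDs are from my session and will obviously differ on your box:

```sh
# Find the Openfire Java process ID (assume it turns out to be 2345).
ps -ef | grep -i openfire

# Show the individual threads of that process, with per-thread CPU usage.
top -H -p 2345

# Convert a hot thread's decimal ID (say 5104) to hex, to match "nid=" in the dump.
printf '0x%x\n' 5104    # -> 0x13f0

# Ask the JVM for a full thread dump; it is appended to nohup.out and the JVM keeps running.
kill -3 2345

# Locate the matching thread in the dump.
grep 'nid=0x13f0' /opt/openfire/logs/nohup.out
```

And the dumps - the “bad” thread first: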
```
"client-13" daemon prio=10 tid=0x09e6f800 nid=0x13f0 in Object.wait() [0x8f69b000..0x8f69c130]
   java.lang.Thread.State: TIMED_WAITING (on object monitor)
    at java.lang.Object.wait(Native Method)
    at com.sun.jndi.ldap.Connection.readReply(Unknown Source)
    - locked <0x915125e0> (a com.sun.jndi.ldap.LdapRequest)
    at com.sun.jndi.ldap.LdapClient.getSearchReply(Unknown Source)
    at com.sun.jndi.ldap.LdapClient.search(Unknown Source)
    at com.sun.jndi.ldap.LdapCtx.doSearch(Unknown Source)
    at com.sun.jndi.ldap.LdapCtx.searchAux(Unknown Source)
    at com.sun.jndi.ldap.LdapCtx.c_search(Unknown Source)
    at com.sun.jndi.toolkit.ctx.ComponentDirContext.p_search(Unknown Source)
    at com.sun.jndi.toolkit.ctx.PartialCompositeDirContext.search(Unknown Source)
    at com.sun.jndi.toolkit.ctx.PartialCompositeDirContext.search(Unknown Source)
    at javax.naming.directory.InitialDirContext.search(Unknown Source)
    at org.jivesoftware.openfire.ldap.LdapGroupProvider.populateGroups(LdapGroupProvider.java:671)
    at org.jivesoftware.openfire.ldap.LdapGroupProvider.getGroup(LdapGroupProvider.java:100)
    at org.jivesoftware.openfire.group.GroupManager.getGroup(GroupManager.java:205)
    - locked <0xb2d144b8> (a java.lang.String)
    at org.jivesoftware.openfire.group.GroupCollection$UserIterator.getNextElement(GroupCollection.java:103)
    at org.jivesoftware.openfire.group.GroupCollection$UserIterator.hasNext(GroupCollection.java:66)
    at org.jivesoftware.openfire.roster.RosterManager.hasMutualVisibility(RosterManager.java:877)
    at org.jivesoftware.openfire.roster.Roster.<init>(Roster.java:144)
    at org.jivesoftware.openfire.roster.RosterManager.getRoster(RosterManager.java:86)
    - locked <0xb291b7e8> (a java.lang.String)
    at org.jivesoftware.openfire.user.User.getRoster(User.java:369)
    at org.jivesoftware.openfire.handler.IQRosterHandler.manageRoster(IQRosterHandler.java:201)
    at org.jivesoftware.openfire.handler.IQRosterHandler.handleIQ(IQRosterHandler.java:106)
    at org.jivesoftware.openfire.handler.IQHandler.process(IQHandler.java:49)
    at org.jivesoftware.openfire.IQRouter.handle(IQRouter.java:349)
    at org.jivesoftware.openfire.IQRouter.route(IQRouter.java:101)
    at org.jivesoftware.openfire.spi.PacketRouterImpl.route(PacketRouterImpl.java:68)
    at org.jivesoftware.openfire.net.StanzaHandler.processIQ(StanzaHandler.java:311)
    at org.jivesoftware.openfire.net.ClientStanzaHandler.processIQ(ClientStanzaHandler.java:79)
    at org.jivesoftware.openfire.net.StanzaHandler.process(StanzaHandler.java:276)
    at org.jivesoftware.openfire.net.StanzaHandler.process(StanzaHandler.java:175)
    at org.jivesoftware.openfire.nio.ConnectionHandler.messageReceived(ConnectionHandler.java:133)
    at org.apache.mina.common.support.AbstractIoFilterChain$TailFilter.messageReceived(AbstractIoFilterChain.java:570)
    at org.apache.mina.common.support.AbstractIoFilterChain.callNextMessageReceived(AbstractIoFilterChain.java:299)
    at org.apache.mina.common.support.AbstractIoFilterChain.access$1100(AbstractIoFilterChain.java:53)
    at org.apache.mina.common.support.AbstractIoFilterChain$EntryImpl$1.messageReceived(AbstractIoFilterChain.java:648)
    at org.apache.mina.common.IoFilterAdapter.messageReceived(IoFilterAdapter.java:80)
    at org.apache.mina.common.support.AbstractIoFilterChain.callNextMessageReceived(AbstractIoFilterChain.java:299)
    at org.apache.mina.common.support.AbstractIoFilterChain.access$1100(AbstractIoFilterChain.java:53)
    at org.apache.mina.common.support.AbstractIoFilterChain$EntryImpl$1.messageReceived(AbstractIoFilterChain.java:648)
    at org.apache.mina.filter.codec.support.SimpleProtocolDecoderOutput.flush(SimpleProtocolDecoderOutput.java:58)
    at org.apache.mina.filter.codec.ProtocolCodecFilter.messageReceived(ProtocolCodecFilter.java:185)
    at org.apache.mina.common.support.AbstractIoFilterChain.callNextMessageReceived(AbstractIoFilterChain.java:299)
    at org.apache.mina.common.support.AbstractIoFilterChain.access$1100(AbstractIoFilterChain.java:53)
    at org.apache.mina.common.support.AbstractIoFilterChain$EntryImpl$1.messageReceived(AbstractIoFilterChain.java:648)
    at org.apache.mina.filter.executor.ExecutorFilter.processEvent(ExecutorFilter.java:239)
    at org.apache.mina.filter.executor.ExecutorFilter$ProcessEventsRunnable.run(ExecutorFilter.java:283)
    at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(Unknown Source)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
    at org.apache.mina.util.NamePreservingRunnable.run(NamePreservingRunnable.java:51)
    at java.lang.Thread.run(Unknown Source)
```
And here’s one of the normal ones…
```
"client-12" daemon prio=10 tid=0x0a895800 nid=0x13ef waiting on condition [0x8f6bc000..0x8f6bcdb0]
   java.lang.Thread.State: WAITING (parking)
    at sun.misc.Unsafe.park(Native Method)
    - parking to wait for <0x947dad28> (a java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject)
    at java.util.concurrent.locks.LockSupport.park(Unknown Source)
    at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(Unknown Source)
    at java.util.concurrent.LinkedBlockingQueue.take(Unknown Source)
    at java.util.concurrent.ThreadPoolExecutor.getTask(Unknown Source)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
    at org.apache.mina.util.NamePreservingRunnable.run(NamePreservingRunnable.java:51)
    at java.lang.Thread.run(Unknown Source)
```
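One thing that strikes me about the bad thread: it is sitting in com.sun.jndi.ldap.Connection.readReply - i.e. waiting on a reply from our LDAP server - while holding locks in GroupManager and RosterManager (the “locked <0x…>” lines), which would presumably stall anyone else whose login needs those locks. To check whether the directory itself is responding slowly, I’m planning to time an equivalent query from the Openfire box with ldapsearch. All the values below are placeholders for our real host, bind DN, base DN and group filter:

```sh
# Time a group lookup against the same LDAP server Openfire uses.
# Host, bind DN, base DN and filter are placeholders -- substitute your own.
time ldapsearch -x -H ldap://ldap.example.com:389 \
    -D "cn=openfire,ou=services,dc=example,dc=com" -W \
    -b "ou=groups,dc=example,dc=com" "(cn=staff)" member
```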
The only other piece of information I have is a hunch that the bad threads are client connections that have timed out in some way. I have a feeling they could relate to Spark sessions that have been left idle for a long period - for example, a user who has gone away for a few days but left his PC logged in with Spark still active. This is just a hunch - I don’t have any direct evidence for it.
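If the idle-session hunch turns out to be right, one thing I may experiment with is Openfire’s idle-connection timeout. As I understand it there is an xmpp.client.idle system property (set via Admin Console > System Properties) that controls how long a client connection may sit idle before the server drops it - though I’m not certain of its exact behaviour in 3.5.2, so treat this as a guess and please correct me:

```sh
# Openfire system property (Admin Console > System Properties).
# My understanding only -- not verified against 3.5.2.
xmpp.client.idle = 360000    # milliseconds; drop client connections idle for 6 minutes
```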
So - does anyone have any ideas based on this new information? Is there a way, perhaps, to tell which user “client-13” actually is? I would really appreciate one of the experts taking a look at this, as it’s really hurting us - people are losing confidence in the product, which is a shame as it’s potentially so powerful.
Nick