200 concurrent users using 350MB and 10% CPU?

We have Openfire 3.5.2 running with currently approx. 200 concurrent connections with most of them being SSL connections. After a few hours the memory and CPU consumption rises from 20 MB / 1% to 350 MB / 10%. After some days it sometimes even jumps to about 900MB. At this point connections become unreliable and we have to restart the server.

ps -T -p 5779 -o pid,tid,pri,time | grep -v ‘00:00:00’ shows 1300 threads.

The most active thread is

5779 5790 24 00:23:02 XX

In nohup.out this is the “VM Thread”.

“VM Thread” prio=10 tid=0x08119400 nid=0x169e runnable

“GC task thread#0 (ParallelGC)” prio=10 tid=0x0805e400 nid=0x169a runnable

“GC task thread#1 (ParallelGC)” prio=10 tid=0x0805f400 nid=0x169b runnable

“VM Periodic Task Thread” prio=10 tid=0x08134400 nid=0x16a5 waiting on condition

JNI global references: 807

Heap

PSYoungGen total 169280K, used 110135K [0xe68f0000, 0xf3850000, 0xf4c70000)

eden space 134208K, 61% used [0xe68f0000,0xeb9e3d78,0xeec00000)

from space 35072K, 77% used [0xeec00000,0xf0699f00,0xf0e40000)

to space 34496K, 0% used [0xf16a0000,0xf16a0000,0xf3850000)

PSOldGen total 440576K, used 231752K [0x74c70000, 0x8fab0000, 0xe68f0000)

object space 440576K, 52% used [0x74c70000,0x82ec21f8,0x8fab0000)

PSPermGen total 30208K, used 29287K [0x70c70000, 0x729f0000, 0x74c70000)

object space 30208K, 96% used [0x70c70000,0x72909e40,0x729f0000)

We also see several hundred “Thread-n” threads.

“Timer-9801” daemon prio=10 tid=0x4e895800 nid=0x7db9 in Object.wait() http://0x32158000…0x321590b0

java.lang.Thread.State: WAITING (on object monitor)

at java.lang.Object.wait(Native Method)

at java.lang.Object.wait(Object.java:485)

at java.util.TimerThread.mainLoop(Unknown Source)

  • locked <0x7fc03a20> (a java.util.TaskQueue)

at java.util.TimerThread.run(Unknown Source)

and a lot of “Smack Reconnection Manager” threads.

“Smack Reconnection Manager” daemon prio=10 tid=0x09674c00 nid=0x3fb3 waiting on condition http://0x2e2b2000…0x2e2b2e30

java.lang.Thread.State: WAITING (parking)

at sun.misc.Unsafe.park(Native Method)

  • parking to wait for <0x775b05e0> (a java.util.concurrent.Semaphore$NonfairSync)

at java.util.concurrent.locks.LockSupport.park(Unknown Source)

at java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(Unk nown Source)

at java.util.concurrent.locks.AbstractQueuedSynchronizer.doAcquireSharedInterrupti bly(Unknown Source)

at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireSharedInterruptibl y(Unknown Source)

at java.util.concurrent.Semaphore.acquire(Unknown Source)

at org.jivesoftware.smack.PacketReader.startup(PacketReader.java:151)

at org.jivesoftware.smack.XMPPConnection.initConnection(XMPPConnection.java:1066)

at org.jivesoftware.smack.XMPPConnection.connectUsingConfiguration(XMPPConnection. java:1025)

at org.jivesoftware.smack.XMPPConnection.connect(XMPPConnection.java:1536)

at org.jivesoftware.smack.ReconnectionManager$2.run(ReconnectionManager.java:174)

The beginning of nohup.opt also shows a few hundred of the following exceptions:

java.io.EOFException: no more data available - expected end tag </stream:stream> to close start tag stream:stream from line 1, parser stopped on END_TAG seen …</stream:features> … @1:343

at org.xmlpull.mxp1.MXParser.fillBuf(MXParser.java:3035)

at org.xmlpull.mxp1.MXParser.more(MXParser.java:3046)

at org.xmlpull.mxp1.MXParser.nextImpl(MXParser.java:1384)

at org.xmlpull.mxp1.MXParser.next(MXParser.java:1093)

at org.jivesoftware.smack.PacketReader.parsePackets(PacketReader.java:368)

at org.jivesoftware.smack.PacketReader.access$000(PacketReader.java:44)

at org.jivesoftware.smack.PacketReader$1.run(PacketReader.java:76)

java.net.SocketException: Connection closed by remote host

at com.sun.net.ssl.internal.ssl.SSLSocketImpl.checkWrite(Unknown Source)

at com.sun.net.ssl.internal.ssl.AppOutputStream.write(Unknown Source)

at sun.nio.cs.StreamEncoder.writeBytes(Unknown Source)

at sun.nio.cs.StreamEncoder.implFlushBuffer(Unknown Source)

at sun.nio.cs.StreamEncoder.implFlush(Unknown Source)

at sun.nio.cs.StreamEncoder.flush(Unknown Source)

at java.io.OutputStreamWriter.flush(Unknown Source)

at java.io.BufferedWriter.flush(Unknown Source)

at org.jivesoftware.smack.PacketWriter.writePackets(PacketWriter.java:274)

at org.jivesoftware.smack.PacketWriter.access$000(PacketWriter.java:40)

java.io.EOFException: input contained no data

at org.xmlpull.mxp1.MXParser.fillBuf(MXParser.java:3003)

at org.xmlpull.mxp1.MXParser.more(MXParser.java:3046)

at org.xmlpull.mxp1.MXParser.parseProlog(MXParser.java:1410)

at org.xmlpull.mxp1.MXParser.nextImpl(MXParser.java:1395)

at org.xmlpull.mxp1.MXParser.next(MXParser.java:1093)

at org.jivesoftware.smack.PacketReader.parsePackets(PacketReader.java:368)

at org.jivesoftware.smack.PacketReader.access$000(PacketReader.java:44)

at org.jivesoftware.smack.PacketReader$1.run(PacketReader.java:76)

java.io.EOFException: no more data available - expected end tag </stream:stream> to close start tag stream:stream from line 1, parser stopped on END_TAG seen …</stream:features>… @1:342

at org.xmlpull.mxp1.MXParser.fillBuf(MXParser.java:3035)

at org.xmlpull.mxp1.MXParser.more(MXParser.java:3046)

at org.xmlpull.mxp1.MXParser.nextImpl(MXParser.java:1144)

at org.xmlpull.mxp1.MXParser.next(MXParser.java:1093)

at org.jivesoftware.smack.PacketReader.parsePackets(PacketReader.java:368)

at org.jivesoftware.smack.PacketReader.access$000(PacketReader.java:44)

at org.jivesoftware.smack.PacketReader$1.run(PacketReader.java:76)

java.net.SocketException: Broken pipe

at java.net.SocketOutputStream.socketWrite0(Native Method)

at java.net.SocketOutputStream.socketWrite(Unknown Source)

at java.net.SocketOutputStream.write(Unknown Source)

at com.sun.net.ssl.internal.ssl.OutputRecord.writeBuffer(Unknown Source)

at com.sun.net.ssl.internal.ssl.OutputRecord.write(Unknown Source)

at com.sun.net.ssl.internal.ssl.SSLSocketImpl.writeRecordInternal(Unknown Source)

at com.sun.net.ssl.internal.ssl.SSLSocketImpl.writeRecord(Unknown Source)

at com.sun.net.ssl.internal.ssl.AppOutputStream.write(Unknown Source)

at sun.nio.cs.StreamEncoder.writeBytes(Unknown Source)

at sun.nio.cs.StreamEncoder.implFlushBuffer(Unknown Source)

at sun.nio.cs.StreamEncoder.implFlush(Unknown Source)

at sun.nio.cs.StreamEncoder.flush(Unknown Source)

at java.io.OutputStreamWriter.flush(Unknown Source)

at java.io.BufferedWriter.flush(Unknown Source)

at org.jivesoftware.smack.PacketWriter.writePackets(PacketWriter.java:274)

at org.jivesoftware.smack.PacketWriter.access$000(PacketWriter.java:40)

at org.jivesoftware.smack.PacketWriter$1.run(PacketWriter.java:87)

The server is only using the Gateway 1.2.3 plugin.

Any idea what could be causing this? Is there a problem with the connection cleanup?

Any help would be appreciated.

Thanks,

Henrik

Hm, it seems like that you are running into the same problem that i’ve mentioned in my post weeks ago. See: http://www.igniterealtime.org/community/thread/33034?tstart=0

Sorry that i can’t help you out because i am looking also for an solution to fix that.

I had a nagios plugin check Openfire by doing a TCP connect on the XMPP port every 30 minutes. I now disabled this check and restarted Openfire. After about 10 hours the memory usage is still below 100MB. I will observe this further, but this may have caused the memory problem.

The CPU usage is still between 5% and 10%, but what concerns me more is that I have a constantly very high load on the machine. The 15 min average never goes below 20.0. The 1 min average sometimes drops to 0.1 just to increase back up. I have already seen values around 100.0.

jabber:/ # uptime

9:50pm up 1 day 20:29, 1 user, load average: 16.14, 35.15, 26.33

jabber:/ # top

top - 21:55:18 up 1 day, 20:34, 1 user, load average: 0.87, 14.35, 19.75

Tasks: 69 total, 1 running, 68 sleeping, 0 stopped, 0 zombie

Cpu(s): 2.7%us, 0.2%sy, 0.0%ni, 97.0%id, 0.2%wa, 0.0%hi, 0.0%si, 0.0%st

Mem: 4020492k total, 647752k used, 3372740k free, 137980k buffers

Swap: 2104504k total, 0k used, 2104504k free, 132148k cached

PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND

19930 daemon 18 0 2948m 276m 6940 S 5 7.0 43:11.68 java

I have to correct myself. Seems it was not the nagios plugin.

Openfire is now going from less than 100MB to 300MB memory consumption within 5 seconds and then drops back to 100MB (probably because GC kicks in) and then goes back to 300MB in 5 seconds.

error.log, warn.log and info.log do not show anything that would look suspicous, but debug.log is full of the following:

2008.06.19 23:28:11 609680 (01/15/00) - #2 registered a statement as closed which wasn’t known to be open. This could happen if you close a statement twice.

2008.06.19 23:28:11 609680 (01/15/00) - Connection #2 tested: OK

2008.06.19 23:28:11 609680 (02/15/00) - Connection #1 tested: OK

2008.06.19 23:28:11 609681 (02/15/00) - Connection #3 tested: OK

2008.06.19 23:28:11 609682 (02/15/00) - #1 registered a statement as closed which wasn’t known to be open. This could happen if you close a statement twice.

2008.06.19 23:28:11 609682 (02/15/00) - #3 registered a statement as closed which wasn’t known to be open. This could happen if you close a statement twice.

2008.06.19 23:28:11 609682 (02/15/00) - Connection #1 tested: OK

2008.06.19 23:28:11 609682 (01/15/00) - Connection #3 tested: OK

2008.06.19 23:28:11 609682 (01/15/00) - Connection #5 tested: OK

2008.06.19 23:28:11 609683 (02/15/00) - Connection #7 tested: OK

2008.06.19 23:28:11 609684 (02/15/00) - #5 registered a statement as closed which wasn’t known to be open. This could happen if you close a statement twice.

2008.06.19 23:28:11 609684 (02/15/00) - #7 registered a statement as closed which wasn’t known to be open. This could happen if you close a statement twice.

2008.06.19 23:28:11 609684 (02/15/00) - Connection #5 tested: OK

2008.06.19 23:28:11 609684 (01/15/00) - Connection #7 tested: OK

2008.06.19 23:28:11 609684 (01/15/00) - Connection #11 tested: OK

2008.06.19 23:28:11 609685 (01/15/00) - #11 registered a statement as closed which wasn’t known to be open. This could happen if you close a statement twice.

2008.06.19 23:28:11 609685 (01/15/00) - Connection #11 tested: OK

2008.06.19 23:28:11 609685 (01/15/00) - Connection #10 tested: OK

2008.06.19 23:28:11 609686 (03/15/00) - #10 registered a statement as closed which wasn’t known to be open. This could happen if you close a statement twice.

2008.06.19 23:28:11 609686 (03/15/00) - Connection #9 tested: OK

2008.06.19 23:28:11 609687 (03/15/00) - Connection #13 tested: OK

2008.06.19 23:28:11 609688 (03/15/00) - Connection #10 tested: OK

2008.06.19 23:28:11 609688 (02/15/00) - #9 registered a statement as closed which wasn’t known to be open. This could happen if you close a statement twice.

2008.06.19 23:28:11 609688 (02/15/00) - #13 registered a statement as closed which wasn’t known to be open. This could happen if you close a statement twice.

2008.06.19 23:28:11 609688 (02/15/00) - Connection #9 tested: OK

2008.06.19 23:28:11 609688 (01/15/00) - Connection #13 tested: OK

2008.06.19 23:28:11 609688 (01/15/00) - Connection #14 tested: OK

2008.06.19 23:28:11 609689 (01/15/00) - #14 registered a statement as closed which wasn’t known to be open. This could happen if you close a statement twice.

2008.06.19 23:28:11 609689 (01/15/00) - Connection #14 tested: OK

2008.06.19 23:28:11 609689 (01/15/00) - Connection #15 tested: OK

2008.06.19 23:28:11 609690 (02/15/00) - #15 registered a statement as closed which wasn’t known to be open. This could happen if you close a statement twice.

2008.06.19 23:28:11 609690 (02/15/00) - Connection #16 tested: OK

2008.06.19 23:28:11 609691 (02/15/00) - Connection #15 tested: OK

2008.06.19 23:28:11 609691 (02/15/00) - #16 registered a statement as closed which wasn’t known to be open. This could happen if you close a statement twice.

2008.06.19 23:28:11 609691 (02/15/00) - Connection #17 tested: OK

2008.06.19 23:28:11 609692 (02/15/00) - Connection #16 tested: OK

2008.06.19 23:28:11 609692 (02/15/00) - #17 registered a statement as closed which wasn’t known to be open. This could happen if you close a statement twice.

2008.06.19 23:28:11 609692 (02/15/00) - Connection #2 tested: OK

2008.06.19 23:28:11 609693 (03/15/00) - Connection #17 tested: OK

2008.06.19 23:28:11 609693 (02/15/00) - #2 registered a statement as closed which wasn’t known to be open. This could happen if you close a statement twice.

2008.06.19 23:28:11 609693 (02/15/00) - Connection #1 tested: OK

2008.06.19 23:28:11 609694 (02/15/00) - Connection #2 tested: OK

2008.06.19 23:28:12 609694 (01/15/00) - #1 registered a statement as closed which wasn’t known to be open. This could happen if you close a statement twice.

2008.06.19 23:28:12 609694 (02/15/00) - Connection #1 tested: OK

2008.06.19 23:28:12 609694 (01/15/00) - Connection #3 tested: OK

2008.06.19 23:28:12 609695 (03/15/00) - #3 registered a statement as closed which wasn’t known to be open. This could happen if you close a statement twice.

2008.06.19 23:28:12 609695 (03/15/00) - Connection #4 tested: OK

2008.06.19 23:28:12 609696 (04/15/00) - Connection #5 tested: OK

2008.06.19 23:28:12 609696 (04/15/00) - Connection #3 tested: OK

2008.06.19 23:28:12 609697 (03/15/00) - #4 registered a statement as closed which wasn’t known to be open. This could happen if you close a statement twice.

2008.06.19 23:28:12 609697 (03/15/00) - Connection #7 tested: OK

2008.06.19 23:28:12 609698 (03/15/00) - #5 registered a statement as closed which wasn’t known to be open. This could happen if you close a statement twice.

2008.06.19 23:28:12 609698 (03/15/00) - Connection #4 tested: OK

2008.06.19 23:28:12 609698 (02/15/00) - #7 registered a statement as closed which wasn’t known to be open. This could happen if you close a statement twice.

2008.06.19 23:28:12 609698 (02/15/00) - Connection #5 tested: OK

2008.06.19 23:28:12 609698 (01/15/00) - Connection #7 tested: OK

2008.06.19 23:28:12 609698 (01/15/00) - Connection #10 tested: OK

2008.06.19 23:28:12 609699 (02/15/00) - #10 registered a statement as closed which wasn’t known to be open. This could happen if you close a statement twice.

2008.06.19 23:28:12 609699 (02/15/00) - Connection #9 tested: OK

2008.06.19 23:28:12 609700 (03/15/00) - Connection #10 tested: OK

2008.06.19 23:28:12 609700 (02/15/00) - #9 registered a statement as closed which wasn’t known to be open. This could happen if you close a statement twice.

2008.06.19 23:28:12 609700 (02/15/00) - Connection #13 tested: OK

2008.06.19 23:28:12 609701 (03/15/00) - Connection #9 tested: OK

2008.06.19 23:28:12 609701 (02/15/00) - #13 registered a statement as closed which wasn’t known to be open. This could happen if you close a statement twice.

2008.06.19 23:28:12 609701 (02/15/00) - Connection #12 tested: OK

2008.06.19 23:28:12 609702 (03/15/00) - Connection #13 tested: OK

2008.06.19 23:28:12 609702 (02/15/00) - #12 registered a statement as closed which wasn’t known to be open. This could happen if you close a statement twice.

2008.06.19 23:28:12 609702 (02/15/00) - Connection #14 tested: OK

2008.06.19 23:28:12 609703 (03/15/00) - Connection #12 tested: OK

2008.06.19 23:28:12 609703 (03/15/00) - #14 registered a statement as closed which wasn’t known to be open. This could happen if you close a statement twice.

2008.06.19 23:28:12 609703 (03/15/00) - Connection #15 tested: OK

2008.06.19 23:28:12 609704 (03/15/00) - Connection #16 tested: OK

2008.06.19 23:28:12 609705 (03/15/00) - Connection #14 tested: OK

Do the connections refer to DB connections? We are using PostgreSQL as our backend DB.

More interesting information:

I turned on database query statistics and after approx. 5 minutes I see

SELECT Query Statistics

Query Count Total Time Avg. Time

SELECT DISTINCT serviceID FROM pubsubNode WHERE serviceID=? 8,146 73,686 9

SELECT xmppHash, legacyIdentifier, createDate, lastUpdate, imageType FROM gatewayAvatars WHERE jid=? 116 1,175 10

SELECT registrationID FROM gatewayRegistration WHERE jid=? AND transportType=? 96 1,001 10

In just 5 minutes the SELECT DISTINCT serviceID FROM pubsubNode WHERE serviceID=? has been executed about 8000 times.

Is there any known bug in the PubSub service? Where can I turn it off?

Hi,

are you using the embedded-db? This could explain why the memory increases so much and it seems that someone is using pubsub. You may want to enable the audit log to look for pubsub XMPP packages.

I wonder how one can disable pubsub - maybe one can block pubsub requests using a filter plugin.

LG

LG,

Support to disable pubsub was added in 3.5.0: JM-1262

xmpp.pubsub.enabled= false

daryl

Thanks daryl,

I always look in Openfire Properties for such properties and did add it now there.

LG

I found that flag and disabled PubSub but this did not seem to make any difference. The flag seems to only change the capabilities send to the client. If a client does a PubSub request it gets a response anyway.

By further investigating it seems like this problem does not come from the PubSub service but from the IM gateway. We had enabled the XMPP transport and allowed connection to any XMPP domain (i.e. entered wildcard as domain). There seems to be some kind of loopback when people try to connect back to our own Openfire server using the XMPP transport.

I am not sure whether this only happens when someone tries to register the transport to the same account where the transport is in (this would obviously lead to a loopback), or this even happens when a user registers to another account on the same server.

We turned XMPP off and now the server is running ok since nearly 3 days.

XMPP wildcard domain has been added just lately to the IM gateway. I think there should be some checking for these loopbacks as these could bring any Openfire easily down.

Henrik

for most XMPP servers you do not need to use the gateway plugin if you enable Server to Server. I connect to igniterealtime.org and another server in england this way.

I know that it is possible to have a contact in your account that resides on another server and Openfire will connect using server-to-server.

I want users to be able to add any Jabber account they may have on another server and import their contacts. Just like they are able to enter i.e. their ICQ account. Maybe I am missing something? Is there a way to do that without the transport?

Henrik

unfortunately no. I also do not think the gateway plugin was designed to work as you configured it. It was meant, I believe, for a single server config.

The really sad thing is when I saw the title of this thread I thought; great, someone has finally figured out how to tune Openfire down to a reasonable amount of RAM. I’d really love to use Openfire but I couldn’t keep the memory utilization below 600MB (RSS btw) with any amount of uptime. It’s in need of some serious profiling.

Howdy,

I share your concerns. The openfire devels are currently rapidly working on Clearspace integration and open sourcing lots of goodies. So there is currently some feature creep at the moment. I am sure once the dust settles, some performance work will be done like it was last summer I believe.

daryl