Openfire 3.5.1 / 100% CPU / DoS on itself

For a few days now I have been seeing an extremely high count of XMPP packets sent by Openfire in my statistics. There is no abnormal traffic on the network device, nothing in the logs, only 100% CPU load.

Today I wrote a small plugin that counts packets per user over time (a rough sketch follows the list below). I found out that Openfire sends 7000-8000 packets per second from a JID that is offline to another unavailable JID.

  • The first JID is a normal user account which is offline. I’m sure about this, because it’s my own account and I’m online with another resource.

  • The second JID is chat.yahoo.jabber.rwth-aachen.de, which belongs to our PyYIMt transport. This JID is also unavailable, because it is the Yahoo MUC, which does not work.
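
For reference, here is a rough sketch of such a counting interceptor (not my exact plugin code; it assumes Openfire’s PacketInterceptor API):

import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicLong;
import org.jivesoftware.openfire.interceptor.InterceptorManager;
import org.jivesoftware.openfire.interceptor.PacketInterceptor;
import org.jivesoftware.openfire.session.Session;
import org.xmpp.packet.Packet;

public class PacketCounter implements PacketInterceptor {

    // Full sender JID -> packets seen since the counters were last read.
    private final ConcurrentHashMap<String, AtomicLong> counts =
            new ConcurrentHashMap<String, AtomicLong>();

    public void interceptPacket(Packet packet, Session session,
                                boolean incoming, boolean processed) {
        // Count only the pre-processing pass; note that a routed packet may
        // still pass here twice (once inbound, once outbound).
        if (processed || packet.getFrom() == null) {
            return;
        }
        String from = packet.getFrom().toString();
        AtomicLong count = counts.get(from);
        if (count == null) {
            AtomicLong fresh = new AtomicLong();
            count = counts.putIfAbsent(from, fresh);
            if (count == null) {
                count = fresh;
            }
        }
        count.incrementAndGet();
    }

    // Called once from the plugin's initializePlugin().
    public void register() {
        InterceptorManager.getInstance().addInterceptor(this);
    }
}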

The server runs stably, so I will wait until tomorrow morning (08:00 CEST) before I try a restart.

I will try to investigate this weird bug further.

This is the packet:

<iq type="get" to="chat.yahoo.jabber.rwth-aachen.de" id="aad0a" from="martin.weusten *at* jabber.rwth-aachen.de/Psi-Laptop">
<query xmlns="http://jabber.org/protocol/disco#info"></query>
</iq>

I’m currently online with martin.weustenatjabber.rwth-aachen.de/Psi only, and Psi-Laptop is not shown in the sessions list.

I also tried to kill the Yahoo transport, which had no effect.

Hrm. Does the packet appear to be dancing back and forth between the two entities? (even though they aren’t available) I’ve seen this occur before, but I was never able to track down a way to reproduce it, and it vanished and never showed up for me again. Are you able to easily reproduce it?

Does the packet appear to be dancing back and forth between the two entities?

It doesn’t look like it, no. Only in this one direction.

Are you able to easily reproduce it?

I have not been able to stop it so far.

It’s possible that this has something to do with unstable network connections. The day it started, I was online with that account over an unstable WLAN connection (the connection dropped every 10 seconds).

I will now try to drop the packets explicitly using a filter rule.

=) But with a fresh restart, are you able to easily trigger it “going nuts” so to speak again?

So one way… so it’s as if it tries to route the packet, fails, tries again, fails, over and over again?

But with a fresh restart

I don’t want to restart the server. It’s my production server, currently with 170+ users online. I will wait until tomorrow morning, when far fewer users are connected.

it tries to route the packet, fails, tries again, fails, over and over again?

Seems so.

Ah, of course =) (not restarting right now) I mistakenly assumed this was a test server.

Throwing a PacketRejectedException at the packet was successful. It seems I was able to break this endless loop without a restart.
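
For reference, a minimal sketch of such a filter rule (not my exact code; it assumes Openfire’s PacketInterceptor API and hard-codes the JID from this thread):

import org.jivesoftware.openfire.interceptor.InterceptorManager;
import org.jivesoftware.openfire.interceptor.PacketInterceptor;
import org.jivesoftware.openfire.interceptor.PacketRejectedException;
import org.jivesoftware.openfire.session.Session;
import org.xmpp.packet.Packet;

public class LoopFilter implements PacketInterceptor {

    public void interceptPacket(Packet packet, Session session,
                                boolean incoming, boolean processed)
            throws PacketRejectedException {
        // Drop every packet addressed to the dead transport JID;
        // rejecting it breaks the endless routing loop.
        if (packet.getTo() != null
                && "chat.yahoo.jabber.rwth-aachen.de".equals(packet.getTo().toString())) {
            throw new PacketRejectedException();
        }
    }

    // Called once from the plugin's initializePlugin().
    public void register() {
        InterceptorManager.getInstance().addInterceptor(this);
    }
}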

The packet count per minute is back to normal, but the CPU load is still high.

Hi Coolcat,

if you are on Linux you may want to identify the thread that is causing the high CPU usage. See JVM Settings and Debugging.

Regards

Ok, I identified the TID of the thread. But what does

You should find somewhere in the javacore a line with “… nid=0x207A …” - this should be the looping thread.

mean? Do I need this stacktrace tool for that? According to the screenshots, the tool seems to need an X server. I have no X server installed, since this is a production server, not a desktop system.

I tried to find the thread with “jstack” (part of the JDK), but had no success. “jstack” seems to use other TIDs.

Hi,

you need a stack trace; kill -3 PID should produce one. Windows users may want to use the Stacktrace tool. One may also want to take a look at http://www.adaptj.com/main/tracehowtos#ht1
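
By the way, the nid in the dump is simply the OS thread ID in hexadecimal (0x207A = 8314), so it can be matched against the TIDs that top or ps report. Alternatively, one can skip the TID matching and ask the JVM itself, e.g. from a small plugin; a rough sketch, assuming the JVM supports per-thread CPU time measurement:

import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;
import java.lang.management.ThreadMXBean;

public class HotThreads {
    public static void dump() {
        ThreadMXBean bean = ManagementFactory.getThreadMXBean();
        if (!bean.isThreadCpuTimeSupported()) {
            System.err.println("Thread CPU time not supported on this JVM");
            return;
        }
        // Print the accumulated CPU time of every live thread.
        for (long id : bean.getAllThreadIds()) {
            ThreadInfo info = bean.getThreadInfo(id);
            long cpuNanos = bean.getThreadCpuTime(id);
            if (info != null && cpuNanos > 0) {
                System.out.printf("%-45s %10d ms%n",
                        info.getThreadName(), cpuNanos / 1000000L);
            }
        }
    }
}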

Regards

I tried it on my test server, but “kill -3 OpenfirePID” seems to have no effect?

(I’m using Java 6… googling for “kill -3” finds only Java 1.3, 1.4, …?)

Hi,

my JVM 1.6.0_02 writes the stack trace to standard out; Openfire will likely log this in nohup.out.

Regards

Openfire will likely log this in nohup.out.

ah…yes

Here are the first four threads, sorted by CPU time:

TIME      PID      TID     HexTID  PRI
03:33:24  17465  17467     443B    24
00:26:37  17465  17515     446B    24
00:04:59  17465  17530     447A    24
00:03:11  17465  17470     443E    21

Now the stack traces from nohup.out:

"VM Thread" prio=10 tid=0x09618c00 nid=0x443b runnable
"SocketAcceptorIoProcessor-1.0" prio=10 tid=0x09d6c800 nid=0x446b runnable [http://0x9f815000..0x9f8160b0|http://0x9f815000..0x9f8160b0]
   java.lang.Thread.State: RUNNABLE
        at sun.nio.ch.EPollArrayWrapper.epollWait(Native Method)
        at sun.nio.ch.EPollArrayWrapper.poll(EPollArrayWrapper.java:215)
        at sun.nio.ch.EPollSelectorImpl.doSelect(EPollSelectorImpl.java:65)
        at sun.nio.ch.SelectorImpl.lockAndDoSelect(SelectorImpl.java:69)
        - locked <0xa2e9a200> (a sun.nio.ch.Util$1)
        - locked <0xa2e9a1f0> (a java.util.Collections$UnmodifiableSet)
        - locked <0xa2e9a0a8> (a sun.nio.ch.EPollSelectorImpl)
        at sun.nio.ch.SelectorImpl.select(SelectorImpl.java:80)
        at org.apache.mina.transport.socket.nio.SocketIoProcessor$Worker.run(SocketIoProcessor.java:480)
        at org.apache.mina.util.NamePreservingRunnable.run(NamePreservingRunnable.java:51)
        at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:885)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:907)
        at java.lang.Thread.run(Thread.java:619)
"SocketAcceptorIoProcessor-2.0" prio=10 tid=0x0993d000 nid=0x447a runnable [http://0x9f37f000..0x9f37fe30|http://0x9f37f000..0x9f37fe30]
   java.lang.Thread.State: RUNNABLE
        at sun.nio.ch.EPollArrayWrapper.epollWait(Native Method)
        at sun.nio.ch.EPollArrayWrapper.poll(EPollArrayWrapper.java:215)
        at sun.nio.ch.EPollSelectorImpl.doSelect(EPollSelectorImpl.java:65)
        at sun.nio.ch.SelectorImpl.lockAndDoSelect(SelectorImpl.java:69)
        - locked <0xa2f0e8a0> (a sun.nio.ch.Util$1)
        - locked <0xa2f0e8b0> (a java.util.Collections$UnmodifiableSet)
        - locked <0xa2f0e860> (a sun.nio.ch.EPollSelectorImpl)
        at sun.nio.ch.SelectorImpl.select(SelectorImpl.java:80)
        at org.apache.mina.transport.socket.nio.SocketIoProcessor$Worker.run(SocketIoProcessor.java:480)
        at org.apache.mina.util.NamePreservingRunnable.run(NamePreservingRunnable.java:51)
        at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:885)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:907)
        at java.lang.Thread.run(Thread.java:619)
"Finalizer" daemon prio=10 tid=0x0961d000 nid=0x443e in Object.wait() [http://0xa08fe000..0xa08fefb0|http://0xa08fe000..0xa08fefb0]
   java.lang.Thread.State: WAITING (on object monitor)
        at java.lang.Object.wait(Native Method)
        - waiting on <0xa2a708e0> (a java.lang.ref.ReferenceQueue$Lock)
        at java.lang.ref.ReferenceQueue.remove(ReferenceQueue.java:116)
        - locked <0xa2a708e0> (a java.lang.ref.ReferenceQueue$Lock)
        at java.lang.ref.ReferenceQueue.remove(ReferenceQueue.java:132)
        at java.lang.ref.Finalizer$FinalizerThread.run(Finalizer.java:159)

I hope you can do something with that.

I should mention that the CPU load is slowly decreasing: approx. 0.5% per hour.

Hi,

do you have enough memory available? The VM Thread with 3:33 hours of CPU time seems much too high; that thread performs VM-internal work, including garbage collection, so this points towards a memory problem.

Regards

I supply the VM parameters “-Xms256m -Xmx256m” at startup. Currently 137 users are online. We have had no problems with that setting, even with 220 users online.

Hi,

do you write a GC log? I wonder if one thread is allocating and releasing objects in a loop, so that the garbage collector runs every second or even more often. Or if your PermSize is too small, which would also cause a lot of useless garbage collections.

Regards

do you write a GC log?

No, can I enable that without restarting the server?
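
On a HotSpot JVM this should be possible without a restart: PrintGC and PrintGCDetails are “manageable” flags, so they can be switched on at runtime with jinfo -flag +PrintGCDetails PID, or programmatically from inside the server, e.g. from a plugin. A sketch, assuming a Sun JDK 6 (the GC output goes to standard out, i.e. nohup.out here):

import java.lang.management.ManagementFactory;
import com.sun.management.HotSpotDiagnosticMXBean;

public class GcLogSwitch {
    public static void enable() throws java.io.IOException {
        // Proxy for the HotSpot diagnostic bean of the current JVM.
        HotSpotDiagnosticMXBean hotspot = ManagementFactory.newPlatformMXBeanProxy(
                ManagementFactory.getPlatformMBeanServer(),
                "com.sun.management:type=HotSpotDiagnostic",
                HotSpotDiagnosticMXBean.class);
        // Both flags are writable ("manageable") at runtime.
        hotspot.setVMOption("PrintGC", "true");
        hotspot.setVMOption("PrintGCDetails", "true");
    }
}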