Load Testing & java.lang.OutOfMemoryError

Folks,

I am stress testing our OPF 3.6.2 setup and finding that, at around 250 concurrent logged-in users in a chat room with Chat Room Logging and Message Archiving enabled, the server degrades over time and throws OutOfMemoryErrors as shown below. I increased the memory allocated to the Openfire Java process by adding the appropriate arguments to the startup (init.d) script, also shown below, but I get the same result. Any ideas how to fix this?

Thanks,

BEA

java.lang.OutOfMemoryError: Java heap space

2008.12.18 17:56:14 [org.jivesoftware.openfire.nio.ConnectionHandler.exceptionCaught(ConnectionHandler.java:110)]

java.lang.OutOfMemoryError: Java heap space

2008.12.18 17:57:24 [org.jivesoftware.openfire.nio.ConnectionHandler.exceptionCaught(ConnectionHandler.java:110)]

java.lang.OutOfMemoryError: GC overhead limit exceeded

# Prepare openfire command line

OPENFIRE_OPTS="${OPENFIRE_OPTS} -DopenfireHome=${OPENFIRE_HOME} -Dopenfire.lib.dir=${OPENFIRE_LIB} -Djava.net.preferIPv4Stack=true -Xms512m -Xmx2048m -Xss128k -Xoss128k -XX:ThreadStackSize=128"

Folks,

FYI, my load testing consisted of creating a Java client based on Smack. Each new instance of the client joined the same MUC and sent a message every 11 seconds, with chat session logging enabled as a server setting. I used three hosts to create three test scenarios with 100, 75, and 60 unique client connections per host. At 100 connections per host (300 concurrent) and 75 per host (225 concurrent), Java memory grew until the server failed. Scaling back to 60 per host (180 concurrent) is currently showing a stable load test. After researching similar posts in the forum, I ended up adjusting the JVM parameters in /etc/sysconfig/openfire as follows, and so far it is working fine.

OPENFIRE_OPTS="-Djava.net.preferIPv4Stack=true -Xss128k -Xms32m -XX:+HeapDumpOnOutOfMemoryError -XX:ThreadStackSize=128 -XX:MaxPermSize=128m -XX:+PrintGCDetails -Xloggc:/tmp/gc.log"

Thanks,

BEA

Hi,

this could be a queue that is growing faster than Openfire can read it. I think Gato posted about a plugin to monitor the queues and size them, but I can’t find the thread or document right now.

LG

Hi LG,

Indeed, what you posted here was very helpful. I found his post *1, in which he refers to “the load stats plugin to collect information about the MINA queues/buffers” as well as to analyzing the memory dump. Even with the reduced load, the server crashes within the hour. I have deleted all non-essential plugins and am retesting while reviewing this information.

Thanks,

BEA

*1 http://www.igniterealtime.org/community/message/184704#184704

Deleted the Search and Monitor plugins, now the memory profile is running normally. Will let it run and see how it holds up.

Thanks,

BEA

The server is not holding up under stress testing and is dropping clients as usage goes up. My latest test case shows that 150 concurrent clients sending messages at 11-second intervals will eventually degrade the server down to about 37 connections, with high memory usage (~75%). The client processes running on the external nodes are not exiting. Server log messages follow.

2008.12.19 18:43:08 No ACK was received when sending stanza to: org.jivesoftware.openfire.nio.NIOConnection@eb3e01 MINA Session: (SOCKET, R: /192.168.0.212:39288, L: /192.168.1.226:5222, S: 0.0.0.0/0.0.0.0:5222)

Hi BEA,

Out of curiosity, can you provide some more details on your Openfire setup? What sort of machine you’re running it on, which database you’re using, etc. Also, is there a particular reason you seem to be focusing your tests on MUCs (multi-user chats)? Having 150 users in a single room each sending a message every eleven seconds would generate 110,000+ message deliveries a minute, so if your machine isn’t able to process all those messages in a timely manner they’re going to back up and consume all of your machine’s memory.
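Roughly, the numbers work out like this (a back-of-the-envelope sketch; it simply assumes each message is fanned out to all 150 occupants of the room):

public class MucFanout {
    public static void main(String[] args) {
        int occupants = 150;          // clients sitting in the single room
        double intervalSec = 11.0;    // each client sends one message every 11 seconds

        double inboundPerMin = occupants * (60.0 / intervalSec);  // ~818 messages/min arriving at the room
        double outboundPerMin = inboundPerMin * occupants;        // each one is broadcast to every occupant

        // Prints roughly: inbound ~818/min, outbound ~122727/min, i.e. well over 110,000 deliveries a minute.
        System.out.printf("inbound ~%.0f/min, outbound ~%.0f/min%n", inboundPerMin, outboundPerMin);
    }
}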

Regards,

Ryan

Greets,

Sure, our application is for user support. Typically users create a troubleshooting ticket and log in to a MUC, where one or more support staff can then collaborate and message while the conversation is recorded for a log transcript. My aim is to be confident that many concurrent MUC sessions can be supported without crashing, to know how well OPF will scale performance-wise as more and more MUC sessions are initiated, and to know the upper limits that may cause the server to crash. Knowing these details will be very important. Honestly, I am a little mentally fatigued, and now that you have explained what my test case is actually doing, it makes sense to revise my load-testing scenario.

Server H/W Spec:

Dual Intel Xeon CPU E5430 @ 2.66GHz

3.3GB Ram

Adaptec RAID 5

CentOS 5.1 (32-bit)

S/W Spec:

OPF 3.6.2

MySQL

Thanks,

BEA

p.s. I’ll post the client tester based on smack.

Load test program. Note that it only sends messages and does not handle any incoming messages. Run with: java XMPPLoadTester <host> <uid> <pwd> <room>

import java.lang.*;
import java.util.Date;
import org.jivesoftware.smack.*;
import org.jivesoftware.smack.packet.*;
import org.jivesoftware.smackx.muc.*;
import org.jivesoftware.smack.PacketListener;
import org.jivesoftware.smack.filter.*;

public class XMPPLoadTester {

    // Dumps every packet received on the connection to stdout.
    static public class ChatListener implements PacketListener {
        public void processPacket(Packet packet) {
            System.out.println(packet.toXML());
        }
    }

    public static void main(String args[]) {
        try {
            // args[0] - host
            // args[1] - uid
            // args[2] - pwd
            // args[3] - chat room

            // Accept every packet, so the listener sees all incoming traffic.
            PacketFilter myFilter = new PacketFilter() {
                public boolean accept(Packet packet) {
                    return true;
                }
            };

            ConnectionConfiguration config = new ConnectionConfiguration(args[0], 5222);
            XMPPConnection conn1 = new XMPPConnection(config);
            ChatListener cl = new ChatListener();
            conn1.connect();
            conn1.addPacketListener(cl, myFilter);
            conn1.login(args[1], args[2]);
            MultiUserChat chat = new MultiUserChat(conn1, args[3] + "@conference.experts-exchange.com");
            chat.join(args[1]);

            String msg;
            for (int i = 0; i < 10000; ++i) {
                msg = (new Date()).toString() + " " + (new Double(Math.random())).toString();
                chat.sendMessage(msg);
                Thread.sleep(11000); // one message every 11 seconds
            }
            conn1.disconnect();

        } catch (Exception e) {
            e.printStackTrace();
        }
    }
}

[updated 12/30/08 to handle incoming messages]

Hi,

with this code you will have problems generating useful load.

First, you only send messages, so you cannot verify that the messages are actually received; the results are therefore of limited use.

Second, you will have problems scaling. What hardware do you use to start this program 100 times? Even if you reuse objects, you are still using a client library that does not scale very well, so you may have difficulty getting a Smack-based load test program to generate significant load on Openfire. Using one MUC rather than one-to-one chats is probably a way to generate significant load, but I wonder whether this is a typical setup.

You may also want to join the MUC with a standard client during your load test; with more than one message per second you may have trouble keeping track of the ongoing conversation.

LG

I took a look at Tsung load testing, and I was not thrilled with its Erlang dependency. I modified the Smack tester to add an incoming message sink, and that produced a useful load test without the server crashing. I just hope no malicious community member decides to write a client that impersonates a known good agent to crash production servers.

import java.lang.*;
import java.util.Date;
import org.jivesoftware.smack.*;
import org.jivesoftware.smack.packet.*;
import org.jivesoftware.smackx.muc.*;

public class XMPPLoadTester {

    // Incoming message sink: dumps every MUC message received to stdout.
    static public class ChatListener implements PacketListener {
        public void processPacket(Packet packet) {
            System.out.println(packet.toXML());
        }
    }

    public static void main(String args[]) {
        try {
            // args[0] - host
            // args[1] - uid
            // args[2] - pwd
            // args[3] - chat room

            ConnectionConfiguration config = new ConnectionConfiguration(args[0], 5222);
            XMPPConnection conn1 = new XMPPConnection(config);
            conn1.connect();
            conn1.login(args[1], args[2]);
            MultiUserChat chat = new MultiUserChat(conn1, args[3] + "@conference.mydomain.com");
            chat.join(args[1]);

            ChatListener cl = new ChatListener();
            chat.addMessageListener(cl);

            String msg;
            for (int i = 0; i < 10000; ++i) {
                msg = (new Date()).toString() + " " + (new Double(Math.random())).toString();
                chat.sendMessage(msg);
                Thread.sleep(11000); // one message every 11 seconds
            }
            conn1.disconnect();

        } catch (Exception e) {
            e.printStackTrace();
        }
    }
}

A modest P4 was making that many connections with good load numbers, but this unit is one of three and is the slowest of them. I only need to show that this setup will handle a decent load with long-term stability.

Thanks

OK, FYI: 100 connections on a P4, with load average 0.04, 0.21, 0.22. I’ll let it run for as long as possible and watch for load spikes or breakage.

Ah, load did spike up to 250+ … looks like I need a new game plan. [Follow-up note: reducing the number of clients to 50 per machine works. There is now a total of 144 connections from the three machines (e.g. 50, 50, 44).]


OK, I took a look at the JHat heap histogram resulting from the heap dump obtained via the -XX:+HeapDumpOnOutOfMemoryError Java flag, and found that turning on the Conversation Log for the MUC was the culprit, as follows.

Class                                                           Instance Count    Total Size
class [C                                                                476751      47412974
class org.jivesoftware.openfire.muc.spi.ConversationLogEntry           424379      11882612
class [B                                                                 16237      11616574
class java.lang.String                                                  477956       7647296

Where ‘class [C’ :

References to this object:

com.sun.jmx.mbeanserver.OpenConverter$IdentityConverter@0xacc31988 (20 bytes) : field openClass
java.util.WeakHashMap$Entry@0xacc31948 (36 bytes) : field referent
java.io.ObjectStreamField@0xacc84678 (29 bytes) : field type

Any ideas why this is not flushing or how to force a flush (e.g. garbage collection)?

Thanks,

BEA

Indeed, there is a way to flush the conversation log objects: Group Chat -> Group Chat Settings -> Other Settings -> see the Flush interval (seconds) and Batch size settings. I’m testing to see if this resolves the issue.

[Update: I’m seeing the following activity in the logs when the server gets to about 97% memory usage; afterwards, memory usage drops back to around 45%.]

5991.176: [Full GC [PSYoungGen: 400K->0K(13568K)] [PSOldGen: 116200K->42253K(116544K)] 116600K->42253K(130112K) [PSPermGen: 31620K->31620K(34560K)], 0.3613140 secs]

Latest update: OPF crashed reporting “java.lang.OutOfMemoryError: Java heap space” as before. The JHat heap dump reports 6,343,271 instances (consuming 50,746,168 bytes, or about 48 MB) of class java.util.concurrent.ConcurrentLinkedQueue$Node. The test case, based on the Smack library, ran approximately 4 hours with 144 connections across 50 MUCs, with approximately 2 to 3 users per room.

nohup.log:

java.lang.OutOfMemoryError: Java heap space
72677206 [client-18] WARN org.jivesoftware.openfire.nio.ClientConnectionHandler - [/10.0.0.2:57053] Unexpected exception from exceptionCaught handler.

java.lang.OutOfMemoryError: GC overhead limit exceeded
72725723 [client-18] WARN org.jivesoftware.openfire.nio.ClientConnectionHandler - [/10.0.0.2:57053] Unexpected exception from exceptionCaught handler.

What is this java.util.concurrent.ConcurrentLinkedQueue$Node class, and how can I configure OPF to flush it? See also the attached Java thread dump.
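From what I understand, ConcurrentLinkedQueue allocates one internal Node object per queued element, so millions of Node instances simply mean millions of items sitting unread in some queue, quite possibly the MINA queues/buffers mentioned earlier, rather than a leak in the queue class itself. A minimal sketch of that one-Node-per-element behaviour (plain JDK, nothing Openfire-specific assumed; the class name is just for illustration):

import java.util.concurrent.ConcurrentLinkedQueue;

public class NodeGrowthDemo {
    public static void main(String[] args) {
        ConcurrentLinkedQueue<String> queue = new ConcurrentLinkedQueue<String>();

        // Every offer() links a new ConcurrentLinkedQueue$Node onto the internal list.
        // If nothing ever drains the queue, the nodes (and the queued items they
        // reference) remain strongly reachable, so no amount of GC can reclaim them.
        for (int i = 0; i < 1000000; i++) {
            queue.offer("pending item " + i);
        }
        System.out.println("queued elements: " + queue.size());
    }
}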

Thanks
JavaThreadDump.txt (27117 Bytes)

OK, it looks like I need to try Tsung for load testing. I was a little stubborn, but my load tester program may be the issue.

Thanks

Bah … it seems that setting ‘Don’t Show History’ was necessary for my load testing. I am guessing that every time a new message was sent, the history queue ring needed updating, and this was eventually hosing the server. I am retesting to verify. The load, Java CPU, and memory usage have dropped considerably!
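For anyone else hitting this: the history suppression can also be requested from the client side. Smack 3.x lets each joining client ask the room for zero history stanzas via DiscussionHistory, which complements the server-side ‘Don’t Show History’ room setting. A minimal sketch, using the same host/uid/pwd/room argument convention as the testers above (the NoHistoryJoin class name is just for illustration):

import org.jivesoftware.smack.ConnectionConfiguration;
import org.jivesoftware.smack.SmackConfiguration;
import org.jivesoftware.smack.XMPPConnection;
import org.jivesoftware.smackx.muc.DiscussionHistory;
import org.jivesoftware.smackx.muc.MultiUserChat;

public class NoHistoryJoin {
    public static void main(String[] args) throws Exception {
        // args[0] - host, args[1] - uid, args[2] - pwd, args[3] - chat room
        XMPPConnection conn = new XMPPConnection(new ConnectionConfiguration(args[0], 5222));
        conn.connect();
        conn.login(args[1], args[2]);

        MultiUserChat chat = new MultiUserChat(conn, args[3] + "@conference.mydomain.com");

        // Ask the MUC service not to replay any room history on join.
        DiscussionHistory history = new DiscussionHistory();
        history.setMaxStanzas(0);

        // join(nickname, password, history, timeout); no room password here, so null.
        chat.join(args[1], null, history, SmackConfiguration.getPacketReplyTimeout());

        // ... send/receive exactly as in the load testers above ...
        conn.disconnect();
    }
}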

P.S. Still not using tsung, but using the Smack based test programs mentioned earlier.

Issue resolved by:

  • Flushing the conversation log objects: Group Chat -> Group Chat Settings -> Other Settings -> adjust the Flush interval (seconds) and Batch size settings.

  • Setting ‘Don’t Show History’.

  • Modifying the posted load tester into a threaded Smack Java app and piping stdout to a FIFO; a rough sketch of the threaded approach follows below. (See http://www.igniterealtime.org/community/thread/36560)
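A rough sketch of that threaded variant (this is an illustration based on the single-connection testers posted above, not the exact program from the linked thread; the FIFO piping is omitted, each simulated client simply runs in its own thread, and users named uid-prefix1 .. uid-prefixN are assumed to already exist on the server):

import java.util.Date;

import org.jivesoftware.smack.ConnectionConfiguration;
import org.jivesoftware.smack.PacketListener;
import org.jivesoftware.smack.XMPPConnection;
import org.jivesoftware.smack.packet.Packet;
import org.jivesoftware.smackx.muc.MultiUserChat;

public class ThreadedXMPPLoadTester {

    // One simulated client: connects, joins the MUC, sinks incoming messages, and sends on a timer.
    static class Client implements Runnable {
        private final String host, uid, pwd, room;

        Client(String host, String uid, String pwd, String room) {
            this.host = host;
            this.uid = uid;
            this.pwd = pwd;
            this.room = room;
        }

        public void run() {
            try {
                XMPPConnection conn = new XMPPConnection(new ConnectionConfiguration(host, 5222));
                conn.connect();
                conn.login(uid, pwd);

                MultiUserChat chat = new MultiUserChat(conn, room + "@conference.mydomain.com");
                chat.join(uid);

                // Incoming message sink, so traffic does not back up on the server side.
                chat.addMessageListener(new PacketListener() {
                    public void processPacket(Packet packet) {
                        System.out.println(packet.toXML());
                    }
                });

                for (int i = 0; i < 10000; ++i) {
                    chat.sendMessage(new Date() + " " + Math.random());
                    Thread.sleep(11000); // one message every 11 seconds, as in the single-client tester
                }
                conn.disconnect();
            } catch (Exception e) {
                e.printStackTrace();
            }
        }
    }

    // args[0] - host, args[1] - uid prefix, args[2] - pwd, args[3] - chat room, args[4] - client count
    public static void main(String[] args) throws Exception {
        int clients = Integer.parseInt(args[4]);
        for (int i = 1; i <= clients; i++) {
            new Thread(new Client(args[0], args[1] + i, args[2], args[3])).start();
            Thread.sleep(200); // stagger the logins slightly
        }
    }
}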
