Hazelcast Clustering Plugin Readme

Overview

The Hazelcast plugin adds support for running multiple redundant Openfire servers together in a cluster. By running Openfire as a cluster, you can distribute the connection load among several servers, while also providing failover in the event that one of your servers fails. This plugin is a drop-in replacement for the original Openfire clustering plugin, using the open source Hazelcast In-Memory Data Grid data distribution framework in lieu of an expensive proprietary third-party product.

The current version of the plugin uses Hazelcast release 5.5.0.

Clustering vs. Federation

XMPP is designed to scale in ways that are similar to email. Each Openfire installation supports a single XMPP domain, and a server-to-server (S2S) protocol as described in the specification is provided to link multiple XMPP domains together. This is known as federation. It represents a powerful way to "scale out" XMPP, as it allows an XMPP user to communicate securely with any user in any other such federated domain. These federations may be public or private as appropriate. Federated domains may exchange XMPP stanzas across the Internet (WAN) and may even discover one another using DNS-based service lookup and address resolution.

By contrast, clustering is a technique used to "scale up" a single XMPP domain. The server members within a cluster all share an identical configuration. Each member will allow any user within the domain to connect, authenticate, and exchange stanzas. Clustered servers all share a single database, and are also required to be resident within the same LAN-based (low latency) network infrastructure. This type of deployment is suitable to provide runtime redundancy and will support a larger number of users and connections (within a single domain) than a single server would be able to provide.

For very large Openfire deployments, a combination of federation and clustering will provide the best results. Whereas a single clustered XMPP domain will be able to support tens or even hundreds of thousands of users, a federated deployment will be needed to reach true Internet scale of millions of concurrent XMPP connections.

Installation

To create an Openfire cluster, you should have at least two Openfire servers, and each server must have the Hazelcast plugin installed. To install Hazelcast, simply drop the hazelcast.jar into $OPENFIRE_HOME/plugins along with any other plugins you may have installed. You may also use the Plugins page from the admin console to install the plugin. Note that all servers in a given cluster must be configured to share a single external database (not the Embedded DB).

By default during the Openfire startup/initialization process, the servers will discover each other by exchanging UDP (multicast) packets via a configurable IP address and port. However, be advised that many other initialization options are available and may be used if your network does not support multicast communication (see Configuration below).

After the Hazelcast plugin has been deployed to each of the servers, use the radio button controls located on the Clustering page in the admin console to activate/enable the cluster. You only need to enable clustering once; the change will be propagated to the other servers automatically. After refreshing the Clustering page you will be able to see all the servers that have successfully joined the cluster.

Note that Hazelcast and the earlier clustering plugins (clustering.jar and enterprise.jar) are mutually exclusive. You will need to remove any existing older clustering plugin(s) before installing Hazelcast into your Openfire server(s).

With your cluster up and running, you will now want some form of load balancer to distribute the connection load among the members of your Openfire cluster. There are several commercial and open source alternatives for this. For example, if you are using the HTTP/BOSH Openfire connector to connect to Openfire, the Apache web server (httpd) plus the corresponding proxy balancer module (mod_proxy_balancer) could provide a workable solution. Some other popular options include the F5 LTM (commercial) and HAProxy (open source), among many more.

Basic load balancing can be achieved through DNS. Openfire's Network Configuration Guide describes how DNS SRV records are used to resolve the servers that are part of your Openfire XMPP domain. A DNS SRV record has 'priority' and 'weight' attributes that can easily be used to configure load balancing support. This does require client support. Most TCP-based XMPP client libraries provide such support, although DNS SRV support for web-based clients is typically lacking.

A simple round-robin DNS configuration can help distribute XMPP connections across multiple Openfire servers in a cluster. Although popular as a lightweight, low-cost way to provide basic scalability, this approach is not considered adequate for true load balancing, nor does it provide high availability (HA) from a client perspective.

Upgrading the Hazelcast Plugin

The process of upgrading the Hazelcast plugin requires a few additional steps when compared with a traditional plugin due to the cross-server dependencies within a running cluster. Practically speaking, all the members of the cluster need to be running the same version of the plugin to prevent various errors and data synchronization issues.

Before upgrading, ensure that the configuration of your system is still compatible with that of the new plugin. When the new Openfire plugin is based on a newer version of the Hazelcast project than the previous plugin that you used, your conf/hazelcast-local-config.xml configuration file might no longer be compatible. This is notably the case when upgrading to the Openfire Hazelcast plugin version 3.0.0 from an earlier plugin (as this version updates Hazelcast from 3 to 5). In such cases, it is advisable to compare the content of conf/hazelcast-local-config.xml with that of a fresh copy. Refer to the FAQ at the bottom of this document for known changes.

Option 1: Offline

NOTE: This upgrade procedure is neat and tidy, but will incur a brief service outage.

Shut down Openfire on all servers in the cluster.
For the first server in the cluster, perform the following steps:
1. Remove the existing plugins/hazelcast.jar
2. Remove (recursively) the plugins/hazelcast directory
3. Copy the updated hazelcast.jar into the plugins directory
4. Restart Openfire to unpack and install the updated plugin
Repeat these steps for the remaining servers in the cluster.

Option 2: Online

NOTE: Using this approach you should be able to continue servicing XMPP connections during the upgrade.

Shut down Openfire on all servers except one.
Using the Plugins page from the online server, remove the existing Hazelcast plugin.
Upload the new Hazelcast plugin and confirm it is installed (refresh the page if necessary).
Use the "Offline" steps above to upgrade and restart the remaining servers.

Option 3: Split-Brain

NOTE: Use this approach if you only have access to the Openfire console. Note however that users may not be able to communicate with each other during the upgrade (if they are connected to different servers).

From the Clustering page in the Openfire admin console, disable clustering. This will disable clustering for all members of the cluster.
For each server, update the Hazelcast plugin using the Plugins page.
After upgrading the plugin on all servers, use the Clustering page to enable clustering. This will activate clustering for all members of the cluster.

Configuration

There are several configuration options built into the Hazelcast plugin as Openfire system properties:

hazelcast.startup.retry.count (1): Number of times to retry initialization if the cluster fails to start on the first attempt.
hazelcast.startup.retry.seconds (10): Number of seconds to wait between subsequent attempts to start the cluster.
hazelcast.max.execution.seconds (30): Maximum time to wait when running a synchronous task across members of the cluster.
hazelcast.config.xml.filename (hazelcast-cache-config.xml): Name of the Hazelcast configuration file. By overriding this value you can easily install a custom cache configuration file in the Hazelcast plugin /classes/ directory, in the directory named via the hazelcast.config.xml.directory property (described below), or in the classpath of your own custom plugin.
hazelcast.config.xml.directory ({OPENFIRE_HOME}/conf): Directory that will be added to the plugin's classpath. This allows a custom Hazelcast configuration file to be located outside the Openfire home directory.
hazelcast.config.jmx.enabled (false): Enables JMX support for the Hazelcast cluster if JMX has been enabled via the Openfire admin console. Refer to the Hazelcast JMX docs for additional information.

Note: The default hazelcast-cache-config.xml file included with the plugin pulls in a second file, conf/hazelcast-local-config.xml, which is preserved between plugin updates. It is recommended that local changes are kept in this file.

The Hazelcast plugin uses the XML configuration builder to initialize the cluster from the XML file conf/hazelcast-local-config.xml. By default the cluster members will attempt to discover each other via UDP multicast at the following location:

IP Address: 224.2.2.3
Port: 54327

These values can be overridden in the conf/hazelcast-local-config.xml file via the multicast-group and multicast-port elements. Many other initialization and discovery options exist, as documented in the Hazelcast configuration documentation. For example, to set up a two-node cluster using well-known DNS names (a port may optionally be appended to each name as host:port), try the following alternative:

...
<join>
    <multicast enabled="false"/>
    <tcp-ip enabled="true">
      <member>of-node-a.example.com</member>
      <member>of-node-b.example.com</member>
    </tcp-ip>
</join>
...

Please refer to the Hazelcast reference manual for more information.
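
As a minimal sketch of the multicast override mentioned above, the fragment below sets the group and port explicitly. It assumes the standard Hazelcast XML schema, in which the join element sits inside a network element, and it assumes you are editing the existing network section of conf/hazelcast-local-config.xml. The values shown are simply the defaults listed earlier; replace them with the address and port reserved for your cluster.

```xml
<!-- Sketch only: assumed to live in the existing network section of conf/hazelcast-local-config.xml -->
<network>
    <join>
        <multicast enabled="true">
            <!-- Defaults from this readme; substitute your own group and port -->
            <multicast-group>224.2.2.3</multicast-group>
            <multicast-port>54327</multicast-port>
        </multicast>
    </join>
</network>
```

Every member of the cluster needs the same group and port, otherwise the members will not discover each other.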

A Word About Garbage Collection

Hazelcast is quite sensitive to delays caused by long-running GC cycles, which are typical of servers using the Serial garbage collector. In most cases it is preferable to select a collector that reduces pause times and the latency they introduce. The Concurrent Mark Sweep (CMS) collector, the G1 collector, and the Z Garbage Collector (ZGC) are examples of collectors that minimize blocking within the JVM.

When using CMS, you may be able to counter the effects of heap fragmentation by using JMX to invoke System.gc() when the cluster is relatively idle (e.g. overnight). This has the effect of temporarily interrupting the concurrent GC algorithm in favor of the default GC to collect and compact the heap.

In addition, the runtime characteristics of your Openfire cluster will vary greatly depending on the number and type of clients that are connected, and which XMPP services you are using in your deployment. However, note that because many of the objects allocated on the heap are of the short-lived variety, increasing the proportion of young generation (eden) space may also have a positive impact on performance. As an example, the following OPENFIRE_OPTS have been shown to be suitable in a three-node cluster of servers (four CPUs each), supporting approximately 50k active users:

OPENFIRE_OPTS="-Xmx4G -Xms4G -XX:NewRatio=1 -XX:SurvivorRatio=4 
               -XX:+UseConcMarkSweepGC -XX:+CMSIncrementalMode -XX:+UseParNewGC
               -XX:+CMSParallelRemarkEnabled -XX:CMSFullGCsBeforeCompaction=1 
               -XX:CMSInitiatingOccupancyFraction=80 -XX:+UseCMSInitiatingOccupancyOnly 
               -XX:+PrintGCDetails -XX:+PrintPromotionFailure"

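This GC configuration will also emit helpful GC diagnostic information to the console to aid further tuning and troubleshooting as appropriate for your deployment. Note that these flags target older Java runtimes: the CMS collector was removed in JDK 14, so on newer JVMs a different collector, such as G1 or ZGC, has to be selected. Please refer to the documentation of your Java runtime environment to learn about the available collectors and their configuration.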

Configuring Cache expiry times and sizes

Core Openfire caches

When clustering is enabled, the only way to change the size of standard Openfire caches is by editing plugins/hazelcast/classes/hazelcast-cache-config.xml on every node in the cluster, shutting down every node in the cluster, and then restarting the nodes. Dynamic configuration is not currently possible, and using different configurations on different nodes is liable to lead to odd behaviour.

This differs from non-clustered caches, where it is sufficient to edit the cache.[cache name].maxLifetime and cache.[cache name].size System Properties.
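
For illustration, a per-cache entry in hazelcast-cache-config.xml might look roughly like the sketch below. It assumes the Hazelcast 5 map schema (the time-to-live-seconds and eviction elements). The cache name and all values are hypothetical placeholders rather than the plugin's shipped defaults, so compare against the entries already present in your copy of the file before changing anything.

```xml
<!-- Hypothetical entry: check the existing map entries in your file for the real names and values -->
<map name="My Cache Name">
    <!-- Entries expire this many seconds after they were last written (21600 s = 6 hours) -->
    <time-to-live-seconds>21600</time-to-live-seconds>
    <!-- Evict least-recently-used entries once a member holds roughly 10000 of them -->
    <eviction eviction-policy="LRU" max-size-policy="PER_NODE" size="10000"/>
</map>
```

Whatever values you choose, apply the same file to every node before restarting the cluster.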

Plugin defined caches

A plugin can create its own Cache without the need to edit any configuration files. For example:

    final String cacheName = "my-test-cache";
    CacheFactory.setMaxSizeProperty(cacheName, maxCacheSize);
    CacheFactory.setMaxLifetimeProperty(cacheName, maxLifetime);

    final Cache<JID, Instant> cache = CacheFactory.createCache(cacheName);

Notes:

CacheFactory.setMaxSizeProperty/CacheFactory.setMaxLifetimeProperty will set Openfire System Properties that are used to configure the Cache when it is created.
If no Openfire System Properties are set, the default expiry values are used (unlimited size and six hours, respectively).
The first node in the cluster to call CacheFactory.createCache will use the configured expiry values; subsequent calls to CacheFactory.createCache will simply retrieve the existing Cache with the previously configured expiry values. It is not possible to change the expiry values after the Cache has been created.

Q&A for upgrading hazelcast.jar from 2.6.1 to 3.0.0 (which upgrades Hazelcast from 3 to 5)

When upgrading to hazelcast.jar 3.0.0 from an earlier version, a major upgrade of the Hazelcast library is introduced (from 3.x to 5.x). The configuration in conf/hazelcast-local-config.xml will require modification for clustering functionality to be restored.

It is advisable to back up conf/hazelcast-local-config.xml, remove it, let the plugin generate it anew, and then manually reapply your earlier configuration changes.

Known configuration changes with this particular upgrade are listed below.

Group definition

Hazelcast 5 no longer recognizes the 'group' configuration element. If your configuration in conf/hazelcast-local-config.xml contains a snippet like the following, you can remove these lines from the file.

<group>
    <name>openfire</name>
    <password>mysecret</password>
</group>

With the offending lines in your configuration, the Hazelcast plugin will fail to start up, logging errors like these:

org.jivesoftware.openfire.plugin.util.cache.ClusteredCacheFactory - Unable to start clustering - continuing in local mode
com.hazelcast.config.InvalidConfigurationException: cvc-complex-type.2.4.a: Invalid content was found starting with element '{"http://www.hazelcast.com/schema/config":group}'.

CP Subsystem

The CP Subsystem is a component of a Hazelcast cluster that builds a strongly consistent layer for a set of distributed data structures. It withstands network partitions as well as server and client failures.

If the CP Subsystem is not configured, you will see log entries like this:

2024.06.28 13:44:23 WARN [PluginMonitorTask-2]: com.hazelcast.cp.CPSubsystem - [10.42.0.165]:5701 [openfire-cluster-by-hazelcast] [5.3.7] CP Subsystem is not enabled. CP data structures will operate in UNSAFE mode! Please note that UNSAFE mode will not provide strong consistency guarantees.

You can add the configuration below to your conf/hazelcast-local-config.xml:

<cp-subsystem>
    <cp-member-count>3</cp-member-count>
    <group-size>3</group-size>
</cp-subsystem>

Q&A for upgrading hazelcast.jar from 3.1.0 to 5.5.0 Release 1 (which upgrades Hazelcast from 5.3.7 to 5.5.0)

When upgrading to hazelcast.jar 5.5.0 Release 1 from an earlier version, a minor upgrade of the Hazelcast library is introduced (from 5.3.7 to 5.5.0). The configuration in conf/hazelcast-local-config.xml will require modification for clustering functionality to be restored.

It is advisable to back up conf/hazelcast-local-config.xml, remove it, let the plugin generate it anew, and then manually reapply your earlier configuration changes.

Support for the CP Subsystem has been removed from the community edition of Hazelcast. As a result, the corresponding configuration (the cp-subsystem element) must be removed from your conf/hazelcast-local-config.xml file. If the CP Subsystem configuration is not removed, the cluster will fail to be initialized, and the following error will be added to the logs:

2024.11.05 18:59:24.051 ERROR [PluginMonitorTask-2]: com.hazelcast.instance.impl.Node - Node creation failed java.lang.IllegalStateException: CP subsystem is a licensed feature. Please ensure you have an Enterprise license that enables CP.