This thread is archived
18 Replies Last post: Jul 19, 2006 2:44 PM by Ryan Graham  
Guy Martin Bronze 48 posts since
Jun 22, 2005

Jun 30, 2006 12:32 PM

Wildfire 2.6.2 startup times

Hello all,

 

This may wander partly into a DB/OS tuning discussion, so forgive me, but it does have a fair amount to do with Wildfire.

 

I have a large installation (~9,500 registered users, with upwards of 3,100 online at any one time). I've tuned the Java VM options to handle this load, on a Linux box running RHEL 3 with 8 GB of RAM and plenty of disk. We just upgraded from Wildfire 2.5.1 to 2.6.2, and are running against PostgreSQL 7.4.8 and authenticating against our corporate LDAP server.

 

The problem we have is that on server startup, we have a TON of clients with 'auto-reconnect' turned on. Because of some earlier memory issues, I have a DB connection pool limit of 550 connections, so when we bring the server back online we see thousands of connection requests, and obviously some are blocked until a DB connection frees up.

 

Our machine sits with ~550 postmaster processes chugging away, and it takes a very long time for anyone to log in. Once you do log in, it usually takes upwards of 15-20 minutes for your roster to show up in the client, obviously because of database slowness.

 

Yesterday we increased the max shmem segment size in the Linux kernel from 32 MB to 128 MB, and that appeared to help briefly. From what I can see from monitoring the system, this isn't a Java or Jive problem per se, as the Java process isn't taking up much CPU. Do I have any options other than increasing the max number of DB connections in the pool?

 

Is PostgreSQL the right DB to handle this kind of Jive load? Does anyone out there have a similarly large installation? Any pointers would be appreciated at this point. My users are about ready to shoot me. Thanks.

 

-Guy

LG KeyContributor 4,983 posts since
Dec 13, 2005
Jun 30, 2006 12:53 PM in response to: Guy Martin
Re: Wildfire 2.6.2 startup times

Hi,

 

To me it sounds really evil to open 500 database connections. I don't want to know how many CPU-seconds PostgreSQL needs just to open them, and each of them may use some memory, leaving PostgreSQL less memory for its cache.

 

Wifi 3.0 offers Connection Managers, which may help to take some load off Wildfire itself, but your database may still suffer, so I can't recommend moving to 3.0 to fix it.

 

There is some code available to make Wildfire handle the connection pool a little bit faster (about 10x), but as far as I know Gato did not include it in Wifi 3.0, so you will get no benefit from 3.0 regarding the DB connections. I assume you don't want to test this code in your environment (it compiles fine with Wifi 2.6.2 anyhow).

 

LG

LG KeyContributor 4,983 posts since
Dec 13, 2005
Jun 30, 2006 1:43 PM in response to: Guy Martin
Re: Wildfire 2.6.2 startup times

Hi,

 

I wonder how many connections you are using during normal operation. You are running all products on one server, so monitoring memory usage will not be as easy as if PostgreSQL were running on another server, but top and Fn (sort by memory) may be your friend for doing this manually; "ps -someoptions" within cron may be better for monitoring everything automatically. If you have much more "free" cached memory during normal operation than during startup, then I'd decrease the max DB connection value. You may already have read http://revsys.com/writings/postgresql-performance.html or similar articles: "max_connections = ... Use this feature to ensure that you do not launch so many backends that you begin swapping to disk and kill the performance of all the children. Depending on your application it may be better to deny the connection entirely rather than degrade the performance of all of the other children."
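
A minimal sketch of the automated "ps within cron" idea, in Python. It assumes the text output of "ps -eo rss,comm" (resident set size in KiB plus command name) is handed to the function; the name rss_by_command and the sample figures are made up for illustration. Summing RSS per command overcounts PostgreSQL's shared memory, because every backend maps the same shared buffers, so the totals are better read as a rough trend than as exact usage.

```python
from collections import defaultdict


def rss_by_command(ps_output):
    """Sum resident memory (KiB) per command from `ps -eo rss,comm` output."""
    totals = defaultdict(int)
    for line in ps_output.splitlines()[1:]:  # skip the "RSS COMMAND" header
        parts = line.split(None, 1)
        if len(parts) != 2 or not parts[0].isdigit():
            continue  # ignore blank or malformed lines
        totals[parts[1].strip()] += int(parts[0])
    return dict(totals)


# Hypothetical sample; on a real box the text would come from running ps.
SAMPLE = """\
  RSS COMMAND
 4120 postmaster
 3980 postmaster
812340 java
"""

for command, kib in sorted(rss_by_command(SAMPLE).items()):
    print(f"{command}: {kib} KiB")
```

Run from cron during normal operation and again during a restart, the two sets of totals could be compared to see how much memory the extra backends take.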

 

LG

Matt Tucker Jiver 3,151 posts since
Jun 28, 2001
Jul 2, 2006 1:28 PM in response to: Guy Martin
Re: Wildfire 2.6.2 startup times

Guy,

 

Ok, here's the idea that's taking shape in my head. We already know the last login time of users (I think). It's probably a valid assumption that the users with the most recent login time will be the ones most likely to log in first when the server starts up (based on auto-reconnect, etc). Therefore, we could have a new property "cache.readAheadUsers" or something. When set to a value of, say, 500, the server would read the data of those 500 users into cache before it starts accepting requests.
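
The selection step of this proposal could be sketched roughly as follows, in Python. This is only an illustration of "take the N most recently logged-in users"; the function name, the in-memory dictionary of login times, and the sample users are assumptions, not Wildfire code.

```python
import heapq
from datetime import datetime


def pick_read_ahead_users(last_login, limit):
    """Return up to `limit` usernames, most recent login first."""
    return heapq.nlargest(limit, last_login, key=last_login.get)


# Hypothetical data: username -> time of last login.
last_login = {
    "alice": datetime(2006, 6, 30, 9, 0),
    "bob": datetime(2006, 6, 29, 17, 30),
    "carol": datetime(2006, 6, 30, 11, 45),
}

# With a read-ahead size of 2, the two most recent logins are chosen.
print(pick_read_ahead_users(last_login, 2))  # ['carol', 'alice']
```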

 

It should be possible to make the database queries to load the read-ahead users quite efficient. For example, a single query to load all roster data using an IN clause. So, instead of thousands of queries, the server would only need to make a couple when it's first starting up.
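
A small sketch of the single-query idea, using Python's built-in sqlite3 module so that it runs anywhere. The table name "roster", its two columns, and the sample rows are hypothetical stand-ins and not Wildfire's actual schema; the point is only the shape of the IN query with bound parameters.

```python
import sqlite3


def load_rosters(conn, usernames):
    """Load roster entries for many users with one IN query."""
    rosters = {name: [] for name in usernames}
    if not usernames:
        return rosters
    placeholders = ",".join("?" for _ in usernames)
    sql = (
        "SELECT username, jid FROM roster "
        f"WHERE username IN ({placeholders}) ORDER BY username, jid"
    )
    for username, jid in conn.execute(sql, list(usernames)):
        rosters[username].append(jid)
    return rosters


# Hypothetical in-memory table for demonstration.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE roster (username TEXT, jid TEXT)")
conn.executemany(
    "INSERT INTO roster VALUES (?, ?)",
    [
        ("alice", "carol@example.com"),
        ("alice", "bob@example.com"),
        ("bob", "alice@example.com"),
        ("dave", "alice@example.com"),
    ],
)

print(load_rosters(conn, ["alice", "bob"]))
```

Databases limit how many parameters one statement may carry, so a long read-ahead list would probably have to be split into several chunks, each with its own IN query.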

 

Does this seem like the right approach? The nice thing is that individual implementations will be able to tune the readAhead value. Or, maybe we can even auto-tune the read-ahead size based on how many users connect to your server? It would probably only take about a day to implement this logic.

 

Regards,

Matt

LG KeyContributor 4,983 posts since
Dec 13, 2005
Jul 2, 2006 4:03 PM in response to: Matt Tucker
Re: Wildfire 2.6.2 startup times

Hi Matt,

 

As far as I know, Wildfire currently tracks only "lastActivity" when a user logs out normally. While "lastLogin" should work just fine, there are a lot of other caches where one can't use "lastLogin" to determine the last state of the cache. I prefer logic that is usable for every cache.

 

LG

Matt Tucker Jiver 3,151 posts since
Jun 28, 2001
Jul 2, 2006 4:46 PM in response to: LG
Re: Wildfire 2.6.2 startup times

Ahh, ok. So, lastLogin would be something we need to add at the same time. Seems like a useful bit of information to store anyway.

 

I'm not sure what you mean with reference to the cache. What caches are you thinking of?

 

Regards,

Matt

LG KeyContributor 4,983 posts since
Dec 13, 2005
Jul 3, 2006 1:58 PM in response to: Matt Tucker
Re: Wildfire 2.6.2 startup times

Hi Matt,

 

Let me try to explain it with an example:

The jiveVCard cache has 1 MB; it contains username and value.

So one could dump the whole cache (1 MB) or "references" (~50k) - in this case the username.

 

It may be enough to get and dump the "references" every 30 minutes (make the interval configurable), so the overhead is not too big.
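
The "dump only the references" idea might look roughly like the following Python sketch. It is an illustration under assumptions: the cache is modeled as a plain dictionary, the keys are written as JSON, and the names dump_cache_keys and warm_cache are invented here. Scheduling the dump every 30 minutes is left out.

```python
import json
import os
import tempfile


def dump_cache_keys(cache, path):
    """Write only the cache keys (the "references") to disk, atomically."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "w") as handle:
        json.dump(sorted(cache), handle)
    os.replace(tmp_path, path)  # readers never see a half-written file


def warm_cache(path, loader):
    """After a restart, reload the value for every key saved earlier."""
    if not os.path.exists(path):
        return {}
    with open(path) as handle:
        keys = json.load(handle)
    return {key: loader(key) for key in keys}


# Hypothetical demonstration: the "database" is just a dictionary.
database = {"alice": "vcard-a", "bob": "vcard-b", "carol": "vcard-c"}
cache = {"alice": "vcard-a", "bob": "vcard-b"}

with tempfile.TemporaryDirectory() as workdir:
    key_file = os.path.join(workdir, "vcard-cache-keys.json")
    dump_cache_keys(cache, key_file)
    restored = warm_cache(key_file, database.get)

print(restored == cache)  # True
```

Because the values are fetched again through the loader, stale data never comes back from the file; only the knowledge of which entries were worth caching is kept.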

 

LG

Matt Tucker Jiver 3,151 posts since
Jun 28, 2001
Jul 7, 2006 2:56 AM in response to: Guy Martin
Re: Wildfire 2.6.2 startup times

Guy,

 

I filed JM-764 -- please feel free to add comments. I've started doing some initial profiling and it looks like a user login requires about 12 database queries. That's actually down from about 15: I already optimized away a few of the queries, for example as described in JM-762.
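
To put those per-login numbers in perspective, here is a back-of-the-envelope calculation in Python. It assumes, purely for illustration, that all of the roughly 3,100 concurrently online users mentioned at the start of the thread reconnect after a restart; the real number of reconnecting clients is not stated.

```python
RECONNECTING_USERS = 3100   # assumption: peak online users all auto-reconnect
QUERIES_PER_LOGIN = 12      # current figure from profiling
QUERIES_BEFORE = 15         # figure before the recent optimizations

total_queries = RECONNECTING_USERS * QUERIES_PER_LOGIN
saved_queries = RECONNECTING_USERS * (QUERIES_BEFORE - QUERIES_PER_LOGIN)

print(total_queries)  # 37200
print(saved_queries)  # 9300
```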

 

I'd like to get some more insight into which database queries are slow (if any). There are basically three scenarios I can think of:

 

1) The sheer number of database queries when thousands of users are logging in takes a long time to process, even if none of them is very expensive.

2) We're not blocking on the database at all, but on some other part of the Wildfire code. For example, do you use LDAP?

3) There are a few database calls that are very expensive.

 

I'm adding a database profiling tool to Wildfire that will help us answer these questions. In the meantime, it would be great if you could gather some more detailed information from your database. What's the ordering of the most common queries (top 20)? What's the average length of time those queries run, and the total time?
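
One way to answer those questions from a log of timed statements is sketched below in Python. The input format, a list of (query text, seconds) pairs, is an assumption; how the pairs are collected from the database is not covered, and the sample queries are invented.

```python
from collections import defaultdict


def top_queries(samples, limit=20):
    """Rank queries by call count; report count, total and average seconds."""
    stats = defaultdict(lambda: [0, 0.0])  # sql -> [count, total seconds]
    for sql, seconds in samples:
        stats[sql][0] += 1
        stats[sql][1] += seconds
    # Most frequent first; ties are broken alphabetically for stable output.
    ranked = sorted(stats.items(), key=lambda item: (-item[1][0], item[0]))
    return [
        (sql, count, total, total / count)
        for sql, (count, total) in ranked[:limit]
    ]


# Hypothetical samples.
samples = [
    ("SELECT roster", 0.5),
    ("SELECT vcard", 0.25),
    ("SELECT roster", 1.5),
]

for sql, count, total, average in top_queries(samples):
    print(f"{count:>4} calls  total {total:.2f}s  avg {average:.2f}s  {sql}")
```

Sorting by total time instead of call count would separate scenario 3 (a few very expensive calls) from scenario 1 (sheer volume).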

 

Thanks,

Matt

LG KeyContributor 4,983 posts since
Dec 13, 2005
Jul 18, 2006 2:30 PM in response to: Guy Martin
Re: Wildfire 2.6.2 startup times

Hi,

 

I agree that tuning the cache is not easy as the available counters are not so helpful.

I miss a counter that shows how often the cache was purged because it was full, and how often new objects had to be purged as well.

I did talk to Gato and created JM-693 as a note so that one could take a look at the cache classes and improve the performance of the cache itself. Especially with a lot of objects (where's the #objects counter?) the cache may slow down.
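
The kind of counters asked for here can be illustrated with a small least-recently-used cache in Python. This is a generic sketch, not Wildfire's cache classes: it bounds the cache by number of entries rather than by megabytes, and the class and attribute names are invented.

```python
from collections import OrderedDict


class CountingLRUCache:
    """LRU cache that counts hits, misses and evictions caused by fullness."""

    def __init__(self, max_entries):
        self.max_entries = max_entries
        self._data = OrderedDict()
        self.hits = 0
        self.misses = 0
        self.evictions = 0

    def get(self, key):
        if key in self._data:
            self._data.move_to_end(key)  # mark as most recently used
            self.hits += 1
            return self._data[key]
        self.misses += 1
        return None

    def put(self, key, value):
        if key in self._data:
            self._data.move_to_end(key)
        self._data[key] = value
        if len(self._data) > self.max_entries:
            self._data.popitem(last=False)  # drop the least recently used
            self.evictions += 1

    def __len__(self):
        return len(self._data)  # the "#objects" counter

    def effectiveness(self):
        lookups = self.hits + self.misses
        return self.hits / lookups if lookups else 0.0


cache = CountingLRUCache(max_entries=2)
cache.put("alice", "roster-a")
cache.put("bob", "roster-b")
cache.get("alice")             # hit
cache.put("carol", "roster-c") # cache is full, so "bob" is evicted
cache.get("bob")               # miss
print(len(cache), cache.evictions, cache.effectiveness())  # 2 1 0.5
```

A steadily climbing eviction count together with a low hit ratio would suggest the cache is too small for its working set.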

 

Do you have any hints or a small cache-tuning howto?

 

LG

Ryan Graham KeyContributor 1,729 posts since
Jan 17, 2003
Jul 19, 2006 2:44 PM in response to: Guy Martin
Re: Wildfire 2.6.2 startup times

Hi Guy,

 

Thanks for the kind words.

 

I actually found it very interesting tuning a Wildfire installation the size of yours. While working with you, the two cache-related items that really jumped out at me were:

 

1. Bumping the Roster cache from 0.5 MB to 5 MB increased its effectiveness from barely 20% to nearly 80% (that percentage is probably higher now with the size set to 10 MB).

 

2. Being able to reduce the number of database connections by 40% (550 -> 330) simply by allocating a total of ~25 MB to the User, Roster and vCard caches.

 

I think for a lot of situations little (if anything) has to be done when installing Wildfire beyond maybe increasing the memory you give it; I certainly wouldn't start fiddling with the caches unless the effectiveness rate was really low after I had had Wildfire up and running for a while. But being able to view the various cache data in the Admin Console was extremely helpful in tuning Wildfire in Guy's situation.

 

If anyone else has done some tweaking to their Wildfire installation feel free to share it here or in another thread.

 

Cheers,

Ryan

Matt Tucker Jiver 3,151 posts since
Jun 28, 2001
Jun 30, 2006 1:25 PM in response to: Guy Martin
Re: Wildfire 2.6.2 startup times

Guy,

 

It sounds like we need to implement a few optimizations to handle this case. I'm guessing that the right caching or pre-loading logic could help a ton. Do you have any insight into which database queries are taking the longest, or which ones are being executed the most often? That would provide some clues as to the best place to optimize.

 

For example, we could load blocks of roster data into memory at a time instead of one at a time if that turned out to be a database hotspot.

 

Regards,

Matt

LG KeyContributor 4,983 posts since
Dec 13, 2005
Jun 30, 2006 4:09 PM in response to: Matt Tucker
Re: Wildfire 2.6.2 startup times

Hi Matt,

 

"We could load blocks of roster data into memory at a time" sounds a lot like a read-ahead feature, with the risk of reading too much and the wrong information and thus decreasing performance.

Anyhow, filling the Wifi database cache very fast would be great. Currently the cached objects (or references to them) are stored only in memory, so Wildfire cannot use this information after a restart. It would be nice if it stored them either in a file or in the database, so that it could restore the cache very quickly after startup.

 

LG