Tuning QueueMetrics memory settings

This article was contributed by Emile Coetzee of Clarotech Consulting (South Africa), who spent significant time working with Loway to tune the JVM memory and garbage collection policies. It summarizes his findings and explains how to tune the JVM in your own environment.

''I’ve spent a good few months working with the Loway team trying to track down a performance problem in QueueMetrics and it looks like we have finally made a breakthrough. I’m currently testing a "beta" version which is looking to be very promising. I thought I would post some of the history and some of the useful information I’ve gathered over time. Even though I believe the improvements Loway have made mostly contribute to the overall solution, your Java performance settings play a key role as well.''

Prerequisites

In order to understand this article fully, you should have a basic knowledge of JVM monitoring and tuning; it is advisable to read Advanced JVM monitoring first.

This article applies to recent versions of QueueMetrics running on Java 6 or 7 JVMs. QueueMetrics 12.09 includes significant performance improvements, so it is highly recommended that you upgrade to this version in addition to following the steps detailed below.

Usage scenario

For simplicity I will refer to QueueMetrics (QM) as the application. It is served by Tomcat, which in turn runs on Java. Most of the troubleshooting and setting changes need to happen between Tomcat and Java, but it is the act of running QM that causes Tomcat and Java to become unstable.

Typical symptoms I was experiencing were either all or a combination of the following:

  • QueueMetrics GUI becomes terribly slow or inaccessible

  • High CPU usage caused by Java

  • Out of memory errors in catalina.out

  • High run time values recorded in catalina.out

  • XML-RPC queries time out

For a number of clients, simply setting up a cron job to restart Tomcat once a day was generally enough to prevent slowdowns from occurring (they might still happen once or twice a month). This unfortunately did not work for the larger sites with 400+ agents, where I would often have to restart Tomcat multiple times during office hours.
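
For reference, such a daily-restart workaround can be as simple as a root crontab entry like the one below. This is only a sketch: the init script path is the one used later in this article for the QueueMetrics Tomcat, and the restart time and log file are arbitrary choices you should adapt to your installation.

# Restart Tomcat every day at 04:00 and log the output (edit with "crontab -e" as root)
0 4 * * * /etc/init.d/qm-tomcat6 restart >> /var/log/qm-tomcat-restart.log 2>&1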

Monitoring basics: Java Visual VM

So where does one start? The first thing you want to do is get your Java Visual VM monitoring working. This is detailed earlier in this manual: Advanced JVM monitoring.

The 3 things you want to look at on the Monitor page are:

  • CPU

  • (Memory) Heap

  • (Memory) PermGen

Memory Settings - Heap

After discussion with Loway, the rule of thumb is 5-6 MB of heap per agent accessing the GUI. On top of that you need to allow overhead for Java itself as well as for your reporting. At one client site I had about 400 agents, so 400 x 6 = 2400 MB. I wasn’t sure how much to allocate for reports, so I played it safe and rounded up to 4096 MB, as they do pull large reports. You then use this value to set your Xms and Xmx values.

Loway suggested that I set the Xms and Xmx values to the same amount; thus I used: -Xms4096M -Xmx4096M. You also want to make sure you add -server, as this selects the server JIT compiler. Read more here: http://stackoverflow.com/questions/198577/real-differences-between-java-server-and-java-client

Note: Be sure that your memory settings are within the limits of your physical RAM (bearing in mind that your OS and other applications like MySQL also need resources). I have 12GB of RAM in my 400-agent server, of which 8GB is in use (mostly by Tomcat and MySQL).
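
If you want to double-check the available headroom before committing to a heap size, something like the following is enough on a Linux server. This is only a quick sketch; the figures in the comments are the ones used in this article.

# Rule of thumb from above: agents x 6 MB, plus headroom for Java and reporting
#   400 agents x 6 MB = 2400 MB  ->  rounded up to -Xms4096M -Xmx4096M
# Check physical and used memory (in MB) before applying the new heap size:
free -m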

Memory Settings - PermGen

The next thing to look at is PermGen. Often the OutOfMemory events in fact come not from the Heap but from PermGen. You might see this in the catalina.out log:

Exception in thread "RMI TCP Connection(idle)" java.lang.OutOfMemoryError: PermGen space.

I hadn’t realised that, just like the Heap, you can also set the PermGen size. By default this seems to be about 80 MB. I experimented with 256 MB and eventually settled on 512 MB. So add these settings to your Tomcat configuration:

-XX:PermSize=512M -XX:MaxPermSize=512M

This change made a significant difference to the stability of QM. You can read more about PermGen here: https://blogs.oracle.com/jonthecollector/entry/presenting_the_permanent_generation
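
If you want to watch PermGen usage while QueueMetrics is running, the jstat tool shipped with the JDK can sample it periodically; on Java 6/7 the "P" column of -gcutil is the percentage of PermGen currently in use. The PID below is the Tomcat PID used elsewhere in this article; replace it with your own.

# Sample GC utilisation every 5000 ms; watch the "P" (PermGen) column
jstat -gcutil 21472 5000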

Garbage Collection

Next up is Garbage Collection. When you start reading about garbage collection there is a lot of information, and much of it differs between Java versions, so make sure what you read matches your Java version. The default collector in Java 6 is selected based on your hardware and OS, but you can force a specific collector by adjusting your Tomcat settings. For single-CPU setups use the serial collector: -XX:+UseSerialGC. For multi-CPU servers use the parallel (aka throughput) collector: -XX:+UseParallelGC. Before I discovered my PermGen size problem I also tried the concurrent collector: -XX:+UseConcMarkSweepGC; this seems to perform better where PermGen size is limited. Once I increased my PermGen size I went back to UseParallelGC, as Loway recommended it. My server has 2 x quad-core CPUs with HyperThreading, so it makes sense to use it.
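
If you are unsure which collector your JVM picks by default on a given machine, you can ask it to print the flags chosen by its ergonomics. This is just a quick check, not a tuning step, and the exact output depends on your JVM version.

# Print the flags (including the collector) the JVM selects by default on this host
java -server -XX:+PrintCommandLineFlags -version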

While we are talking about GC, let’s also look at some additional logging you can turn on for it. You can add the following to your Tomcat settings:

-verbose:gc -XX:+PrintGCTimeStamps -XX:+PrintGCDetails

This adds additional logging to your catalina.out file. Often when QM was in a hung state I would only see GC log events in catalina.out; this generally coincided with the Heap being maxed out. Later, when I paid more attention to PermGen and CPU, I would see the same effects when those were maxed out.
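
With this logging enabled, a quick way to see whether full collections are piling up is to grep catalina.out for them. A minimal check, run from your Tomcat logs directory, might look like this:

# Count full collections, then show the most recent ones
grep -c "Full GC" catalina.out
grep "Full GC" catalina.out | tail -5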

You can also add settings to alert you when Java runs out of memory. Add the following to your Tomcat settings:

-XX:OnError=/bin/javaerrormailgen.sh
-XX:OnOutOfMemoryError=/bin/javaerrormailmem.sh

The scripts can contain anything you like (you could for instance trigger a restart of Tomcat). In my case I just used them to send me email.

This is ''javaerrormailgen.sh'':

#!/bin/sh
#
# Called by the JVM via -XX:OnError: mail a notification that a fatal JVM error occurred

echo `date` | mail -s "SITENAME Java Error: General" me@domain.com

And this is ''javaerrormailmem.sh'':

#!/bin/sh
#
# Called by the JVM via -XX:OnOutOfMemoryError: mail a notification that the JVM ran out of memory

echo `date` | mail -s "SITENAME Java Error: OutOfMemory" me@domain.com
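
As mentioned above, these hooks can also trigger a restart instead of just sending mail. A hypothetical variant of ''javaerrormailmem.sh'' that does both might look like the following; the init script path is the one used later in this article, so adjust it to your installation.

#!/bin/sh
#
# Hypothetical example: notify by mail, then restart Tomcat after an OutOfMemoryError

echo `date` | mail -s "SITENAME Java Error: OutOfMemory - restarting Tomcat" me@domain.com
/etc/init.d/qm-tomcat6 restart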

Troubleshooting: taking thread and memory dumps

Once you have these things in place you can start monitoring the Java VM and the Tomcat logs, and capture details for feedback to Loway. Capturing jstack and jmap output is detailed in the same document as the JVM setup, but I will list some changes to these commands which I found worked better.

jstack -F -l 21472

Parameters are:

  • -F Forces the thread dump. I often found that in a hung state I was unable to get a thread dump without this.

  • -l Prints a long listing with more info

  • 21472 Is the Java (Tomcat) PID

jmap -F -dump:live,format=b,file=heap.bin 21472

Parameters are:

  • -F Forces the heap dump.

  • -dump Dumps into a binary format file called heap.bin. Make sure you have disk space available as this file can get very large. It does compress reasonably well using bz2 if you need to upload it somewhere for Loway.

  • 21472 Is the Java (Tomcat) PID

I have found that both these commands will pause Tomcat while the information is extracted, so running them on a working system will cause it to stop while they execute. Obviously, if the system is already hung, it doesn’t matter.
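
Since the heap dump file can be very large, it is worth compressing it before moving it anywhere; as noted above it compresses well with bz2. A minimal example:

# Check the dump size, then compress it in place (produces heap.bin.bz2)
ls -lh heap.bin
bzip2 -9 heap.bin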

Once I had a larger PermGen set, I did see an improvement: QM would no longer simply hang, but it would still slow down. This was evident in the JVM monitor, where you could see that as PermGen usage climbed, so did the CPU. In the past, a maxed-out PermGen would eventually cause QM to become completely unresponsive; with more headroom in PermGen it can actually recover.

QueueMetrics release 12.09 and greater requires less PermGen space for string handling, but may still require a sizeable quantity that exceeds the JVM defaults.

Final Settings

For a quick copy and paste, here are my final settings for a 400+ agent server with 2 x quad-core CPUs and 12GB RAM running Tomcat, MySQL and Apache. These settings must be set in the JAVA_OPTS property in '/etc/init.d/qm-tomcat6'.

Bare essentials:

-Xms4096M -Xmx4096M -server -XX:+UseParallelGC -XX:PermSize=512M -XX:MaxPermSize=512M

With extra logging, JVM and Java alerts:

-Xms4096M -Xmx4096M -server
-Dcom.sun.management.jmxremote.port=9003
-Dcom.sun.management.jmxremote.authenticate=false
-Dcom.sun.management.jmxremote.ssl=false
-verbose:gc -XX:+PrintGCTimeStamps -XX:+PrintGCDetails
-XX:+UseParallelGC -XX:PermSize=512M -XX:MaxPermSize=512M
-XX:OnError=/bin/javaerrormailgen.sh
-XX:OnOutOfMemoryError=/bin/javaerrormailmem.sh

Though the finer details of tuning your own JVM depend on the total system memory, whether you have a multi-core machine, and whether you run a 32-bit or 64-bit server, the process described in this article will give you data you can work with and will be a reasonable starting point for large, real-life QueueMetrics installations.

Quick JVM cheatsheet

Memory size

It is better to set the initial and maximum memory settings to the same amount, so that memory can be efficiently allocated from the start.

  • '-Xms1000M -Xmx1000M' set the total heap to 1000 megabytes - this does not include PermGen

  • '-XX:PermSize=512M -XX:MaxPermSize=512M' set the total PermGen size to 512M - this does not include the heap

Garbage collection models

You must choose only one collector, based on your hardware and throughput requirements; the last two options are companions to the Concurrent Mark Sweep collector (a combined example follows this list):

  • '-XX:+UseSerialGC' - the serial collector; the default on client-class machines, but it may cause long application pauses

  • '-XX:+UseParallelGC' - collects memory in parallel, ideal for large heaps and multi-CPU systems; it uses a parallel version of the young-generation collector

  • '-XX:+UseConcMarkSweepGC' - the Concurrent Low Pause Collector reduces pause times at the price of some additional heap and CPU usage

  • '-XX:+UseParNewGC' - runs a parallel collector on the new generation, avoiding the promotion of dead objects to the old generation; typically used together with the Concurrent Mark Sweep collector

  • '-XX:+CMSParallelRemarkEnabled' - lowers remarking pauses when running with Concurrent Mark Sweep
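
For instance, a hypothetical low-pause alternative to the UseParallelGC setup used in this article would combine the CMS-related flags above with the same heap and PermGen sizes. This is only a sketch of how the options fit together, not a recommendation:

-Xms4096M -Xmx4096M -server
-XX:+UseConcMarkSweepGC -XX:+UseParNewGC -XX:+CMSParallelRemarkEnabled
-XX:PermSize=512M -XX:MaxPermSize=512M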

64-bit servers

  • '-XX:+UseCompressedOops' - uses less heap on 64-bit systems when the heap is smaller than about 32 GB. May speed things up significantly.

Debugging

The following options may be helpful in understanding what is going on:

  • '-verbose:gc' - logs garbage collections

  • '-XX:+PrintGCTimeStamps' - prints the timestamp of each garbage collection

  • '-XX:+PrintGCDetails' - prints the details of each garbage collection

  • '-XX:OnError=…​' - runs a script on errors

  • '-XX:OnOutOfMemoryError=…​' - runs a script on out-of-memory errors

Misc

  • '-server' mode should always be turned on for QueueMetrics systems.