Getting useful information out of JMX data

The data coming from the Liberty web server through the JMX interface provides some statistics, but as supplied they are not very useful, and they may become inaccurate over time.

I’ll cover

  1. Getting a useful mean value
  2. Getting a more accurate long term mean
  3. Data gets more inaccurate over time
  4. Standard deviation (this may only be of interest to a few people)

For example, the mean time reported by JMX for mqconsole transactions was 9.9 milliseconds; this covers all requests since the mqweb server was started.   Over the previous minute, the average time measured over each 10 second period was 7 or 8 milliseconds, well below the reported 9.9.

This is because the mean time includes any initial start-up time.   The maximum transaction time, at the start of the run, was over 2 seconds.   This biases the mean.

You can process the data to extract some useful information, and I show below how to get useful response time values out of it.

You get the following data (and example values) from mqweb through the JMX interface.

ResponseTimeDetails/count (Long) = 20
ResponseTimeDetails/description (String) = Average Response Time for servlet
ResponseTimeDetails/maximumValue (Long) = 3060146565 – in nanoseconds (see the unit below)
ResponseTimeDetails/mean (Double) = 4.336789965E8 – in nanoseconds
ResponseTimeDetails/minimumValue (Long) = 2474556 – in nanoseconds
ResponseTimeDetails/standardDeviation (Double) = 9.089057964078983E8 – in nanoseconds
ResponseTimeDetails/total (Double) = 8.67357993E9 – used in calculations
ResponseTimeDetails/unit (String) = ns – the unit, ns = nanoseconds
ResponseTimeDetails/variance (Double) = 8.319076762335653 – used in calculations

Getting a useful mean value

To produce these numbers, the count of responses and the sum of the transaction response times are accumulated within the Liberty server.  The mean is simply sum/count, which gives the overall mean time since start-up.  If you collect the data periodically you can manipulate it to produce some more useful statistics.

Let the count and sum at time T1 be Count1 and Sum1, and at time T2 be Count2 and Sum2.
You can now calculate (Sum2 – Sum1)/(Count2 – Count1) to get the average for that period.  For example, the reported mean was 0.016 ms, but the calculated value for the interval was 0.008 ms.  You can also calculate (Count2 – Count1)/(T2 – T1) to give a rate of requests per second.   These are much more useful than the raw data.  I suggest collecting the data every minute.
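A minimal sketch of these two calculations in Python (the function and variable names are my own; count and total are the ResponseTimeDetails/count and ResponseTimeDetails/total values, and the sample times are in seconds):

def interval_mean_and_rate(t1, count1, total1, t2, count2, total2):
    # Mean response time (nanoseconds) and request rate (per second)
    # for the requests between two samples of the JMX data.
    requests = count2 - count1
    if requests == 0:
        return None, 0.0            # no traffic in the interval
    mean_ns = (total2 - total1) / requests
    rate = requests / (t2 - t1)
    return mean_ns, rate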

Getting a more accurate long term mean

The first REST request and the first console request take a long time because the Java code has to be loaded, classes initialised, and so on.  In one test the duration of the first request was 50 times the duration of the other requests.  A better “mean” value is obtained by ignoring the duration of the first request.

The improved mean is (JMX mean * JMX count – JMX maximum value)/(JMX count – 1), or approximately JMX mean – (JMX maximum/JMX count).
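As a sketch (this assumes the maximum value really was the slow first request):

def improved_mean(jmx_mean, jmx_count, jmx_maximum):
    # Long term mean with the first (start-up) request removed,
    # assuming the maximum value was that first request.
    if jmx_count < 2:
        return jmx_mean
    return (jmx_mean * jmx_count - jmx_maximum) / (jmx_count - 1)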

Data gets more inaccurate over time

The total time is stored as a floating point double.  As you add small numbers to big numbers, the small numbers may be ignored.  Let me try to explain.

Consider a floating point format with a precision of 3 significant digits, so you could have 1.23 E3 = 1230.  Add 1 to this and you get 1231, which rounds back to 1.23 E3 with a precision of 3 – the number we started with.

The total time is in nanoseconds, so 1 second is stored as 1.0 E9.  With 100 of these requests a second, for 100 hours (360,000 seconds), the accumulated total is 100 * 360,000 = 36,000,000 seconds, or 3.6 E7 seconds, which is 3.6E16 nanoseconds.   A double has a precision of about 16 significant decimal digits, so at 3.6 E16 we have already lost the odd nanosecond.    I do not think this is a big enough problem to worry about – unless you are running for years without restarting the server.
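You can see the effect in a couple of lines of Python:

total_ns = 3.6e16                    # 100 requests/second of 1 second each, for 100 hours
print(total_ns + 1.0 == total_ns)    # prints True - the extra nanosecond is lost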

The variance uses the time**2 value.  With a time of 599482097 nanoseconds, time**2 is 3.593787846E17, and you are already losing precision.  I expect the variance will not be used very often, so I think this can be ignored.

If the times were in microseconds instead of nanoseconds, this would not be a problem.

Getting a useful standard deviation (this may only be of interest to a few people)

The standard deviation gives a measure of the spread of the data: a small standard deviation means the data is in a narrow band, a larger standard deviation means it is spread over a wider band.  For roughly normal data, about 95% of the values are within plus or minus two standard deviations of the mean (and about 99.7% within three), so anything outside this range would be considered an outlier, or an unusual value.

I set up some data, a mixture of 10 values of 9, 10, and 11; the standard deviation was 0.73.    I changed one value to 20, and the standard deviation changed to 3.3, indicating a wide spread of values.

With a mixture of 100 values of 9, 10, and 11, the standard deviation was 0.71.   I changed one value to 20, and the standard deviation changed to 1.2, a much smaller change, because most of the data was still around 10, with just one value outside the range.
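You can reproduce this sort of experiment in a few lines of Python (a sketch; the exact figures depend on which mixture of values you use, so they will not match mine exactly):

from statistics import pstdev

data = [9, 10, 11] * 3 + [10]   # ten values of 9, 10 and 11
print(pstdev(data))             # about 0.77 for this particular mixture

data[-1] = 20                   # introduce one outlier
print(pstdev(data))             # about 3.1 - the outlier dominates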

With a lot of data, the standard deviation converges on a value, and “unusual” numbers make little difference to it.  I think the standard deviation over an extended period is not that useful, especially if you get periodic variations such as busy periods during the day and quiet periods overnight.

You calculate the standard deviation as the square root of the variance.   The variance is (Sum of (values**2))/number of measurements – (mean**2).

With data

ResponseTimeDetails/count (Long) = 203
ResponseTimeDetails/mean (Double) = 6420785.187192118 nanoseconds
ResponseTimeDetails/variance (Double) = 1.7113341125320868E15 – used in calculations

Variance = 1.7113341125320868E15 = (Sum of (values**2))/203 – (6420785.187192118 ** 2)

So (Sum of (values**2)) = 203 * (1.7113341125320868E15 + 6420785.187192118 ** 2), which is approximately 3.56E17
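Or, doing the rearrangement in Python (assuming the variance is the sum of squares divided by the count, minus the mean squared):

count = 203
mean = 6420785.187192118            # nanoseconds
variance = 1.7113341125320868e15    # nanoseconds squared

# variance = sum_of_squares/count - mean**2, so rearranging:
sum_of_squares = count * (variance + mean ** 2)
print(sum_of_squares)               # approximately 3.56E17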

You can now do the sums as with the mean, above:

Let ssquares1 be the sum of (values**2) at time T1, and ssquares2 be the sum of (values**2) at time T2.

You can now calculate ssquares2 – ssquares1, and use that to estimate the variance, and so the standard deviation, of the data in the range T1 to T2.  I’ll leave the details to the end user.

For the advanced user: you can use the mean for the interval, rather than the long term mean.  Good luck.
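Here is a minimal sketch of the whole interval calculation using the interval mean (the function names are my own, and it assumes the variance is calculated as sum of squares/count – mean**2):

import math

def sum_squares(count, mean, variance):
    # Recover the sum of (values**2) from one JMX sample,
    # assuming variance = sum_squares/count - mean**2.
    return count * (variance + mean ** 2)

def interval_mean_and_stddev(count1, total1, ssquares1, count2, total2, ssquares2):
    # Mean and standard deviation of the requests between two samples.
    n = count2 - count1
    if n <= 0:
        return None, None
    mean = (total2 - total1) / n
    variance = (ssquares2 - ssquares1) / n - mean ** 2
    return mean, math.sqrt(max(variance, 0.0))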

 
