Should I monitor MQ, and if so, what for?

I’ve been talking to someone about using the MQ SMF data, and when it would be useful. There is a lot of data. What are the important things to watch for, and what do I do when there is a problem?

Why monitor?

From a performance perspective there are several reasons why you should monitor.

Today’s problems

Typical problems include:

  1. “Transaction slow down”: people using the mobile app are timing out.
  2. Our new marketing campaign is a great success. We have double the amount of traffic, and the backend system cannot keep up.
  3. The $$ cost of the transactions has gone up.  What’s the problem?

With problems like transaction slow down, the hard part is often to determine which is the slow component.  This is hard when there may be 20 components in the transaction flow, from front end routers, through CICS, MQ, DB2, IMS, and WAS, and the occasional off-platform request.

You can often spot problems because work is building up (there are transactions or messages queued up), or response times are longer.  This can be hard to interpret because “the time to put a message and get the reply from MQ is 10 seconds” may at first glance look like an MQ problem, but MQ is just the messenger, and the problem is beyond MQ.  I heard someone say that the default position was to blame MQ, unless the MQ team could prove it wasn’t them.

Yesterday’s problem

Yesterday, or last week, you had a problem and the task force is looking into it.  They now want to know how MQ, DB2, CICS, IMS etc. were behaving when the problem occurred.  For this you need historical data.  You may have summary data recorded on a minute by minute basis, or you may have summary data recorded over an hour.  If the data is averaged over an hour you may not see any problems.  A spike in workload may be followed by no work, and so on average everything is OK.

It is useful to have “maximum value in this time range”.  So if your maximum disk I/O time was 10 seconds in this interval at 16:01:23:45 and the problem occurred around this time, it is a good clue to the problem.
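To make the averaging point concrete, here is a small sketch in Python.  The per-minute message counts are made up for illustration (they are not from real SMF data), and `summarise` is a hypothetical helper, not part of any MQ tooling.  It shows an hour where a spike followed by no work averages out to exactly the normal rate, while the maximum still gives the game away.

```python
from statistics import mean


def summarise(per_minute):
    """Return the average, the maximum, and the (zero-based) minute of the maximum."""
    peak = max(per_minute)
    return {
        "average": mean(per_minute),
        "maximum": peak,
        "minute_of_maximum": per_minute.index(peak),
    }


# Made-up hour: 30 normal minutes, a 5 minute spike, then 25 minutes with no work.
messages_per_minute = [1000] * 30 + [6000] * 5 + [0] * 25

hour = summarise(messages_per_minute)
print(hour)
```

The hourly average is the same as a perfectly normal hour, so only the maximum (and when it occurred) tells you where to look.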

Tomorrow’s problem

You should be able to get trending information.  If your disk can sustain an I/O rate of 100MB a second, and you can see that every week, at peak time, you are logging an extra 5MB/second, this tells you that you need to do something to fix it: either get faster disks, or split the work to a different queue manager.
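As a rough sketch of that trending calculation, the Python below estimates how many weeks are left before the peak logging rate reaches what the disk can sustain.  The weekly peak figures are invented for the example, the 100MB/second limit and 5MB/second weekly growth come from the text above, and the straight-line growth is an assumption; real workloads rarely grow that neatly.

```python
def weeks_until_limit(weekly_peaks_mb_s, limit_mb_s):
    """Estimate weeks left before the peak rate reaches the limit.

    Assumes straight-line growth between the first and last observed week.
    Returns None if there is too little data or the peak is not growing.
    """
    if len(weekly_peaks_mb_s) < 2:
        return None
    weeks_observed = len(weekly_peaks_mb_s) - 1
    growth_per_week = (weekly_peaks_mb_s[-1] - weekly_peaks_mb_s[0]) / weeks_observed
    if growth_per_week <= 0:
        return None  # not growing, so there is nothing to predict
    headroom = limit_mb_s - weekly_peaks_mb_s[-1]
    return headroom / growth_per_week


# Made-up peak logging rates (MB/second) for the last five weeks.
weekly_peaks = [40.0, 45.0, 50.0, 55.0, 60.0]

weeks_left = weeks_until_limit(weekly_peaks, limit_mb_s=100.0)
print(f"About {weeks_left:.0f} weeks until the disk limit is reached")
```

With these numbers you have about 8 weeks to get faster disks or split the work.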

Monitoring is not capacity planning.

Monitoring tells you how the system is performing in its current configuration.  Monitoring may show a problem, but it is up to the capacity and tuning team to fix it.  For example, “how big a buffer pool do we need?” is a question for the capacity team.  You could give the buffer pools gigabytes of buffers, or keep the buffer pools smaller and let MQ “do the paging to and from disk”.

How do you know when a ‘problem’ occurs?

I remember going to visit a customer because they had a critical problem with the performance on MQ.  There were so many things wrong it was hard to know where to start.  The customer said that the things I pointed out were always bad – so they were not the current problem.  Eventually we found the (application) problem.  The customer was very grateful for my visit – but still did not fix the performance problems.

One thing to learn from this story is that you need to compare a bad day with a good day, and see what is different.  This may mean comparing it with the data from the same time last week, rather than from an hour ago.  I would expect that last week’s profile should be a good comparison to this week.   One hour ago there may not have been any significant load.
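Comparing a bad day with a good day could look something like this Python sketch.  The metric names and values are hypothetical (they are not real SMF field names), and the 50% tolerance is an arbitrary choice.  The idea is to report only what has changed since the same time last week, so things that are “always bad” drop out.

```python
def changed_metrics(today, baseline, tolerance=0.5):
    """Return metrics whose relative change from the baseline exceeds the tolerance."""
    changed = {}
    for name, base in baseline.items():
        if name not in today:
            continue
        if base == 0:
            # No sensible ratio; flag any move away from zero.
            if today[name] != 0:
                changed[name] = float("inf")
            continue
        change = (today[name] - base) / base
        if abs(change) > tolerance:
            changed[name] = change
    return changed


# Made-up figures for the same time slot, last week and today.
last_week = {"log_write_ms": 2.0, "messages_per_sec": 5000.0, "buffer_pool_pct_full": 90.0}
today = {"log_write_ms": 9.0, "messages_per_sec": 5200.0, "buffer_pool_pct_full": 92.0}

suspects = changed_metrics(today, last_week)
print(suspects)
```

Here the buffer pool is nearly full on both days, so it is not reported; the log write time has more than quadrupled, so that is where to start.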

With MQ, there is a saying “A good buffer pool is an empty buffer pool”.  Does a buffer pool which has filled up, and is causing lots of disk I/O, mean there is a problem?  Not always.  It could be MQ acting as a queueing system, and if you wait half an hour for the remote system to restart, all of the messages will flow and the buffer pool will become empty.  If you know this can happen, it is good to be told it is happening, but the action may be “watch it”.  If this is the first time it has happened, you may want to do a proper investigation, and find out which queues are involved, which channels are not working, and which remote systems are down.

What information do I need?

It depends on what you want.  If you are sitting in the operations room watching the MQ monitor while sipping a cup of your favourite brew, then you want something like a modern car.  If there is a problem, a red light on the dashboard lights up, meaning “You need to take the car to the garage”.   The garage can then put the diagnostic tools onto the engine and get the reason.

You want to know if there is a problem or not.  You do not need to know you have a problem to 3 decimal places – yes, maybe, or no is fine.
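A dashboard light can be as simple as the Python sketch below.  The thresholds are hypothetical and would need to come from what is normal on your own system.

```python
def light(value, amber_at, red_at):
    """Turn a measurement into no / maybe / yes: green, amber or red."""
    if value >= red_at:
        return "red"
    if value >= amber_at:
        return "amber"
    return "green"


# Made-up thresholds for average log write time in milliseconds.
for log_write_ms in (2.0, 7.5, 40.0):
    print(log_write_ms, light(log_write_ms, amber_at=5.0, red_at=20.0))
```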

If you are investigating a problem from last week, you are in the role of the garage mechanic, and need the detailed diagnostics.

When do you need the data?

If you are getting the data from SMF records you may get a record every 1 minute, or every half an hour.  This may not be frequent enough while there is a problem.  For example if you have a problem with logging, you want to see the data on a second by second basis, not once every 30 minutes.

Take the following scenario.  It is 10:59, 29 minutes into the 30 minute period between SMF (or online monitor) records.

So far in this interval, there have been 100,000 I/Os.  The total time spent doing I/Os is 100 seconds.  By calculation the average time for an I/O is 1 millisecond.  This is a good response time.

You suddenly hit a problem, and the I/O response time goes up to 100 ms.  10 more I/Os are done.

The total number of I/Os is now 100,010, and the time spent doing I/Os is now 101 seconds.  By calculation the average I/O time is now 1.009899 milliseconds.  This does not show there is a problem, as this is within typical variation.

If you can get the data from a few seconds ago and the data from now, you can calculate the differences:

  1. Number of I/Os: 100,010 – 100,000 = 10
  2. Time spent doing I/O: 101 – 100 = 1 second
  3. Average I/O time: 100 ms.  Wow, this really shows the problem, compared to calculating the value from the 30 minute set of statistics, which showed the time increasing from 1 ms to 1.01 ms.

This shows you need granular data, perhaps every minute, but this means you get a lot of data to manage.
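The arithmetic above can be sketched in Python as follows.  `IoSnapshot` and the two functions are hypothetical names for this example; they assume you can read the cumulative I/O count and cumulative I/O time at two points in the interval.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class IoSnapshot:
    """Cumulative counters since the start of the statistics interval."""
    io_count: int
    io_seconds: float


def cumulative_average_ms(snapshot):
    """Average I/O time over the whole interval so far, in milliseconds."""
    return 1000 * snapshot.io_seconds / snapshot.io_count


def interval_average_ms(earlier, later):
    """Average I/O time for only the I/Os done between two snapshots."""
    ios = later.io_count - earlier.io_count
    if ios == 0:
        return None  # no I/O in between, so no average
    return 1000 * (later.io_seconds - earlier.io_seconds) / ios


before = IoSnapshot(io_count=100_000, io_seconds=100.0)
after = IoSnapshot(io_count=100_010, io_seconds=101.0)

print(f"cumulative: {cumulative_average_ms(after):.6f} ms")
print(f"last few seconds: {interval_average_ms(before, after):.1f} ms")
```

The cumulative figure barely moves, while the difference between the two snapshots shows the 100 ms I/Os straight away.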