When monitoring MQ on z/OS, there are a couple of key buffer pool metrics you need to keep an eye on.
A quick overview of MQ buffer pools
An inefficient queue manager could have all messages stored on disk, in page sets. An efficient queue manager would cache hot pages in memory so they can be accessed without doing any I/O.
The MQ Buffer Manager component does this caching, by using buffers (so no surprise there).
A simplified view of the operation is as follows:
- Messages are broken down into 4KB pages.
- When getting the contents of a page set page, if the page is not already in a buffer, the buffer manager reads it from disk into a buffer.
- If a page is requested and the contents are not required (for example it is about to be overwritten as part of a put) it does not need to read it from disk.
- If the page is updated, for example a new message, or a field in the message is updated during get processing, the page is not usually written to disk immediately. The write to disk is deferred (using a process called Deferred Write Process – another non surprise in the naming convention). This has the advantage that there can be many updates to a page before it is written to disk.
- Any buffer page which has not been changed, and is not currently being used, is a free page.
If the system crashes, non persistent messages are thrown away, and persistent messages can be rebuilt from the log.
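The page-handling steps above can be sketched as a small model (illustrative only; the real buffer manager is far more sophisticated, and the class and method names here are invented):

```python
# Simplified, invented model of the buffer manager's page cache.
# The real z/OS buffer manager does much more than this.

class BufferPool:
    def __init__(self, size):
        self.size = size
        self.buffers = {}       # page number -> page contents
        self.dirty = set()      # updated pages awaiting deferred write

    def get_page(self, page_no, contents_needed=True):
        """Return a buffer for a page. Read from the page set only if
        the caller needs the existing contents and the page is not cached."""
        if page_no in self.buffers:
            return self.buffers[page_no]             # hit: no I/O
        if contents_needed:
            data = self.read_from_page_set(page_no)  # miss: read I/O
        else:
            data = bytearray(4096)                   # about to be overwritten
        self.buffers[page_no] = data
        return data

    def update_page(self, page_no, data):
        """Updates are deferred: mark the page dirty; the Deferred Write
        Process writes it to the page set later."""
        self.buffers[page_no] = data
        self.dirty.add(page_no)

    def read_from_page_set(self, page_no):
        return bytearray(4096)                       # stand-in for disk I/O
```

This shows why many updates to the same page can be absorbed before a single write to disk: the page just stays in the dirty set until the deferred write happens.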
To reduce the restart time, pages containing persistent data are written out to the page set periodically. This is driven by the log data set filling up which causes a checkpoint. Updates which have been around for more than 2 checkpoints are written to disk. During restart the page set knows how far back in the log restart needs to go.
In both cases – checkpoint, and the buffer pool filling up (when there are fewer than 15% free pages, i.e. more than 85% in use) – once a page has been successfully written to the page set, the buffer is marked as free.
Pages for non-persistent messages can be written out to disk.
If the buffer pool is critically short of free buffers (fewer than 5% free), then pages are written to the page set immediately, rather than using the deferred write process. This allows some application work to be done while the buffer manager is trying to make more free pages available.
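The free-page thresholds can be summarised in a small decision function (a sketch; the 15% and 5% figures come from the text above, the function and labels are illustrative):

```python
def write_mode(free_pct):
    """How the buffer manager handles dirty pages at a given % of free
    buffers (thresholds as described in the text: 15% and 5% free)."""
    if free_pct < 5:
        return "immediate"       # critically short: synchronous writes
    if free_pct < 15:
        return "deferred"        # working hard: deferred write process active
    return "checkpoint-only"     # healthy: writes mainly driven by checkpoints
```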
What is typical behaviour?
The buffer pool is working well.
When a message is got, the buffer is already in the buffer pool so there is no I/O to read from the page set.
The buffer pool is less than 85% full (has more than 15% free pages); there is periodic I/O to the page set, because pages with persistent data are written to the page set at checkpoint.
The buffer pool fills up and has more than 85% in-use pages.
This can occur if the number and size of the messages being processed is bigger than the size of the buffer pool. This could be a lot of messages in a unit of work, big messages using lots of pages, or lots of transactions putting and getting messages. It can also occur when there are lots of puts, but no gets.
If the buffer pool has between 85% and 95% in-use pages (between 15% and 5% free pages), the buffer manager is working hard to keep free pages available.
There will be intermittent I/O at checkpoints, and steady I/O as the buffer manager writes pages to the page set.
If messages are being read from disk, there will be read activity from the pageset, but the buffer pool page can be reused as soon as the data has been copied from the buffer pool page.
The buffer pool has less than 5% free pages.
The buffer manager is in overdrive. It is working very hard to keep free pages in the buffer pool. There will be pages written out to the page set as it tries to increase the number of free pages. Gets may require read I/O from the page set. All of this I/O can cause contention; all of the page set I/Os slow down, and so MQ API requests using this buffer pool slow down.
What should be monitored
Most cars these days have a “low fuel light” and a “take me to a garage” light. For monitoring we can provide something similar.
Monitor the % buffer pool full.
- If it is below 85% things are OK
- If it is between 85% and 95%, this needs to be monitored; it may be “business as usual”.
- If it is >= 95%, this needs to be alerted. It can be caused by applications or channels not processing messages.
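These thresholds map naturally onto a traffic-light check (a minimal sketch; the function name and labels are illustrative):

```python
def buffer_pool_status(pct_full):
    """Map % of in-use pages to a traffic-light status,
    using the 85% and 95% thresholds from the text."""
    if pct_full >= 95:
        return "red"    # alert: applications/channels may not be processing
    if pct_full >= 85:
        return "amber"  # monitor: may be business as usual
    return "green"      # OK
```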
Monitoring the number of pages written does not give much information.
It could be caused by checkpoint activity, or because the buffer pool is filling up.
Monitoring the number of pages read from the page set can provide some insight.
If you compare today with this time last week you can check the profile is similar.
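One way to do that comparison is to line up per-interval read I/O counts against the same intervals from last week (a sketch; the 25% tolerance is an arbitrary illustrative choice):

```python
def changed_intervals(today, last_week, tolerance=0.25):
    """Compare today's read I/O counts per interval with the same
    intervals last week; return the indexes of intervals that differ
    by more than `tolerance` (25% here, an illustrative choice)."""
    flagged = []
    for i, (now, then) in enumerate(zip(today, last_week)):
        baseline = max(then, 1)          # avoid division by zero
        if abs(now - then) / baseline > tolerance:
            flagged.append(i)
    return flagged
```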
If the buffer pool is below 85% used,
- Messages being got are typically in the buffer pool so there is no read I/O.
- If there is read I/O this could be for messages which were not in the buffer pool – for example reading the queue after a queue manager restart.
If the buffer pool is more than 85% and less than 95% in-use, this can be caused by a large message workload coming in, and MQ being “elastic” and queuing the messages. Even short-lived messages may be read from disk. The number of read I/Os gives an indication of throughput. Compare this with the previous week to see if the profile is similar.
If the buffer pool is more than 95% in-use this will have an impact on performance, as every page processed is likely to have I/O to the page set, and elongated I/O response time due to the contention.
What to do
You may want “operators notes” either on paper or online which describe the expected status of the buffer pools on individual queue managers.
- PROD QMPR
- BP 1 green less than 85% busy
- BP 2 green less than 85% busy
- BP 3 green except for Friday night, when it goes amber with a read I/O rate of 6000 I/Os per minute.
- TEST QMTE
- BP 1 green less than 85% busy
- BP 2 green less than 85% busy
- BP 3 usually amber – used for bulk data transfer
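Such notes can also be kept machine-readable, so monitoring can flag deviations automatically (a sketch; the structure, queue manager names, and statuses here just mirror the example notes above and are illustrative):

```python
# Hypothetical machine-readable "operators notes": expected status per
# buffer pool, per queue manager.
EXPECTED = {
    "QMPR": {1: "green", 2: "green", 3: "green"},  # BP 3 amber on Friday nights
    "QMTE": {1: "green", 2: "green", 3: "amber"},  # BP 3 used for bulk transfer
}

def unexpected(qmgr, bp, observed):
    """Return True if the observed status differs from the notes."""
    return EXPECTED.get(qmgr, {}).get(bp) != observed
```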
What do the buffer statistics mean?
There are statistics on buffer pool usage:
- Buffer pool number.
- Size of the buffer pool at the time the data was produced.
- Low – the lowest number of free pages in the interval. 100 * (Size – Low)/Size gives you the % full.
- Now – the current number of free pages in the interval.
- Getp – the number of requests ‘get page with contents’. If the page is not in the buffer pool then it is read from the page set.
- Getn – a new page is needed. The contents are not relevant as it is about to be overwritten. If the page is not in the buffer pool, just allocate a buffer, and do not read from the page set.
- STW – set write intent. This means the page was got for update. I have not seen a use for this value. For example:
- A put wants to put part of a message on the page
- A get is being done and it wants to set the “message has been got” flag.
- The message has been got, and so pointers to a page need to be updated.
- RIO – the number of read requests to the page set. If this is greater than zero, either:
- The request is for messages which have not been used since the queue manager started
- The buffer pool had reached 85%, pages had been moved out to the page set, and the buffer has been reused.
- WIO – the number of write I/Os that were done. This write could be due to a checkpoint, or because the buffer pool filled up.
- TPW – total pages written, a measure of how busy the buffer pool was. This write could be due to a checkpoint, or because the buffer pool filled up.
- IMW – immediate write. I have not used this value, sometimes I observe it is high, but it is not useful. This can be caused by
- the buffer pool being over 95% full, so all application write I/O is synchronous,
- or a page was being updated during the last checkpoint, and it needs to be written to the page set when the update has finished. This should be a small number. Frequent checkpoints (eg every minute) can increase this value.
- DWT – the number of times the Deferred Write processor was started. This number has little value.
- The DWP could have started and been so busy that it never ended, so this counter is 1.
- The DWP could have started, written a few pages and stopped – and done this repeatedly, in this case the value could be high.
- DMC – the number of times the buffer pool crossed the 95% limit. If this is > 0 it tells you the buffer pool crossed the limit:
- This could have crossed just once, and stayed over 95%
- This could have gone above 95%, then below 95% etc.
- SOS – the buffer pool was totally short on storage – there were no free pages. This is a critical indicator. You may need to do capacity planning, and make the buffer pool bigger, or see if there was a problem where messages were not being got.
- STL – the number of times a “free buffer” was reused. A buffer was associated with a page of a page set; the buffer has been reused and is now for a different page. If STL is zero, it means all pages that were used were in the buffer pool.
- STLA – A measure of contention when pages are being reused. This is typically zero.
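From these statistics, the key % full figure at the worst point in the interval can be derived from Size and Low (a sketch using the formula given with Low above; the function and argument names are illustrative):

```python
def pct_full(size, low_free):
    """% full at the worst point in the interval:
    100 * (Size - Low) / Size, where Low is the lowest
    number of free pages seen in the interval."""
    return 100.0 * (size - low_free) / size

# e.g. a 50,000-buffer pool whose free count dropped to 5,000
# was 90% full at its worst point - into the amber zone.
```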
Now you know as much as I do about buffer pools, you’ll see that the % full (or % free) is the key measure. If the buffer pool is more than 85% in-use, then the I/O rate is a measure of how hard the buffer manager is running.