For the monitoring of MQ on z/OS, there are a couple of key metrics you need to keep an eye on for the logging component, as you sit watching the monitoring screen.
I’ll explain how MQ logging works, and then give some examples of what I think would be key metrics.
Quick overview of MQ logging
- MQ logging has a big(sequential) buffer for logging data, it wraps.
- Application does an MQPUT of a persistent message.
- The queue manager updates lots of values (eg queue depth, last put time) as well as move data into the queue manager address space. This data is written to log buffers. A 4KB page can hold data from many applications.
- An application does an MQCOMMIT. MQ takes the log buffers up and including the current buffer and writes it to the current active log data set. Meanwhile other applications can write to other log buffers.
- The I/O finishes and the log buffers just written can be reused.
- MQ can write up to 128 pages in an I/O. If there are more than 128 buffers to write there will be more than 1 I/O.
- If application 1 commits, the IO starts, and then application 2 commits. The I/O for the commit in application 2 has to wait for the first set of disk writes to finish, before the next write can occur.
- Eventually the active log data set fills up. MQ copies this active log to an archive data set. This archive can be on disk or tape. This archive data set may never be used again in normal operation. It may be needed for recovery of transactions or after a failure. The Active log which has just been copied can now be reused.
What is interesting?
Displaying how much data is logged per second.
Last week XXXXXXXXXXXXXXXXXXXXXXXXXXXXX
0 100MB/Sec 200 MB/Sec
This shows that the logging rate today is lower than last week. This could be caused by
- Today is just quieter than last week
- There is a problem and there are fewer requests coming into MQ. This could be caused by problems in another component, or a problem in MQ. When using persistent messages the longest part of a transaction is the commit and waiting for the log disk I/O. If this I/O is slower it can affect the overall system throughput.
- You can get the MQ log IO response times from the MQ log data.
Displaying MQ log I/O response time
You can break down where time is spent in doing I/O into the following area
- Scheduling the I/O – getting the request into the I/O processor on the CPU
- Sending the request down to the Disk controller(eg 3990)
- Transferring data
- The I/O completes, and send an interrupt to z/OS, z/OS has to catch this interrupt and wake up the requester.
Plotting the I/O time does not give an entirely accurate picture, as the time to transfer the data depends on the amount of data to transfer. On a well run system there should be enough capacity so the other times are constant. (I was involved in a critical customer situation where the MQ logging performance “died” every Sunday morning. They did backups, which overloaded the I/O system).
In the MQ log statistics you can calculate the average I/O time. There are two sets of data for each log
- The number of requests, and sum of the times of the requests to write 1 page. This should be pretty constant, as the data is for when only one 4KB was transferred
- The number of requests, and sum of the times of the requests to more than 1 page. The average I/O time will depend on the amount of data transferred.
- When the system is lightly loaded, there will be many requests to write just one page.
- When big messages are being processed (over 4 KB) you will see multiple pages per I/O.
- If an application processes many messages before it commits you will get many pages per I/O. This is typical of a channel with a high achieved batch size.
- When the system is busy you may find that most of the I/O write more than one page, because many requests to write a small amount of data fills up more than one page.
I think displaying the average I/O times would be useful. I haven’t done tried this in a customer environment (as I dont have customer environment to use). So if the data looks like
Today XXXXXXXXXXXXXXXXXXXXXXXX Last week XXXXXXXXXXXXXXXXXXXXXXXXXXXXX
One hour ago XXXXXXXXXXXXXXXXXXX
time in ms 0 1 2 3
it gives you a picture of the I/O response time.
- The dark green is for I/O with just one page, the size of the bar should be constant.
- The light green is for I/O with more than one page, the size of the bar will change slightly with load. If it changes significantly then this indicates a problem somewhere.
Of course you could just display the total I/O response time = (total duration of I/Os) / (total number of I/Os), but you lose the extra information about the writing of 1 page.
Reading from logs
If an application using persistent messages decides to roll back:
- MQ reads the log buffers for the transaction’s data and undoes any changes.
- It may be the data is old and not in the log buffers, so the data is read from the active log data sets.
- It may be that the request is really old (for example half an hour or more), MQ reads from the archive logs (perhaps on tape).
Looking at the application doing a roll back, and having to read from the log.
- Reading from buffers is OK. A large number indicates application problem or a DB/2 deadlock type problem. You should investigate why there is so much rollback activity
- Reading from Active logs … . this should be infrequent. It usually indicates an application coding issue where the transaction took too long before commit. Perhaps due to a database deadlock, or bad application design (where there is a major delay before the commit)
- Reading from Archive logs… really bad news….. This should never happen.
Displaying reads from LOGS
Today XXXXXXXXXXXXXXXXXXXXXXXX Last week X One hour ago XXXXX rate 0 10 20 40
Where green is “read from buffer”, orange is “read from active log”, red is “read from Archive log. Today looks a bad day”.