Real-time monitor for z/OS TCPIP interface statistics

I’ve put up on GitHub a program which provides real-time statistics for z/OS TCPIP interfaces.  For example:

Interface       ,HH:MM:SS,IBytes,OBytes,IB/sec,OB/sec,,IUP   ,OUP   ,,...
TAP0 ,09:11:12, 0, 0, , ,, 0, 0,,...
TAP0 ,09:11:22, 0, 0, 0, 0,, 0, 0,,
TAP0 ,09:11:32, 0, 0, 0, 0,, 0, 0,,
TAP0 ,09:11:42, 348, 348, 34, 34,, 3, 3,,
TAP0 ,09:11:52, 0, 0, 0, 0,, 0, 0,,
TAP0 ,09:12:02, 0, 0, 0, 0,, 0, 0,,
TAP0 ,09:12:12, 0, 0, 0, 0,, 0, 0,,
TAP0 ,09:12:22, 0, 0, 0, 0,, 0, 0,,
TAP0 ,09:12:32, 0, 0, 0, 0,, 0, 0,,
TAP0 ,09:12:42, 0, 0, 0, 0,, 0, 0,,

The values are deltas from the previous time interval:

  • IBytes is the number of input bytes during that interval
  • OBytes is the number of output bytes
  • IB/sec is the rate of input bytes per second
  • OB/sec is the rate of output bytes per second
  • IUP is the count of Input Unicast Packets
  • OUP is the count of Output Unicast Packets.

You get one output file per interface.

The output file can be downloaded and used as input to a spreadsheet to display the data graphically.
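
If you want to plot the data without a spreadsheet, a small script can read the per-interface file directly.  This is a minimal sketch, assuming the file has been downloaded as text in the CSV layout shown above; the file name and the column positions are taken from the example output, so treat them as illustrative.

import csv
import matplotlib.pyplot as plt

times, ib_sec, ob_sec = [], [], []
with open("tap0.csv") as f:
    reader = csv.reader(f)
    next(reader)                          # skip the header row
    for row in reader:
        row = [c.strip() for c in row]
        if len(row) < 6 or row[4] == "":  # the first interval has no rate yet
            continue
        times.append(row[1])              # HH:MM:SS
        ib_sec.append(int(row[4]))        # IB/sec column
        ob_sec.append(int(row[5]))        # OB/sec column

plt.plot(times, ib_sec, label="IB/sec")
plt.plot(times, ob_sec, label="OB/sec")
plt.xlabel("time")
plt.ylabel("bytes per second")
plt.legend()
plt.show()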

I’m working on some code to display the global TCPIP stats, but I do not think they are very interesting.  If they are interesting to you, please let me know and I’ll add my code.

What should I monitor for MQ on z/OS – logging statistics

When monitoring MQ on z/OS, there are a few key metrics for the logging component that you should keep an eye on as you sit watching the monitoring screen.

I’ll explain how MQ logging works, and then give some examples of what I think would be key metrics.

Quick overview of MQ logging

  1. MQ logging has a big (sequential) buffer for logging data; it wraps.
  2. An application does an MQPUT of a persistent message.
  3. The queue manager updates lots of values (e.g. queue depth, last put time) as well as moving data into the queue manager address space.  This data is written to log buffers.  A 4KB page can hold data from many applications.
  4. An application does an MQCMIT.  MQ takes the log buffers up to and including the current buffer and writes them to the current active log data set.  Meanwhile other applications can write to other log buffers.
  5. The I/O finishes and the log buffers just written can be reused.
  6. MQ can write up to 128 pages in one I/O.  If there are more than 128 buffers to write there will be more than one I/O (see the sketch after this list).
  7. If application 1 commits and the I/O starts, and then application 2 commits, the I/O for application 2’s commit has to wait for the first set of disk writes to finish before the next write can occur.
  8. Eventually the active log data set fills up.  MQ copies this active log to an archive data set.  This archive can be on disk or tape.  The archive data set may never be used again in normal operation; it may be needed for recovery of transactions or after a failure.  The active log which has just been copied can now be reused.

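To make steps 5 to 7 more concrete, here is a tiny sketch (illustration only, not MQ code) of the "up to 128 pages per I/O" rule described above:

# Illustration only (not MQ code): the "up to 128 pages per I/O" rule
# from steps 6 and 7 above.
import math

PAGES_PER_IO = 128   # MQ writes at most 128 x 4KB log pages in one I/O

def ios_for_commit(pages_to_write: int) -> int:
    # Pages up to and including the current buffer are written at commit;
    # more than 128 pages means more than one I/O.
    return max(1, math.ceil(pages_to_write / PAGES_PER_IO))

print(ios_for_commit(3))     # a small commit needs just 1 I/O
print(ios_for_commit(300))   # 300 pages waiting -> 3 I/Os, done one after another

# Step 7: if application 1's commit has started an I/O and application 2
# then commits, application 2's write has to queue behind the I/O in flight.
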
What is interesting?

Displaying how much data is logged per second.

Today      XXXXXXXXXXXXXXXXXXXX
Last week  XXXXXXXXXXXXXXXXXXXXXXXXXXXXX
Yesterday  XXXXXXXXXX
           0              100 MB/sec     200 MB/sec

This shows that the logging rate today is lower than last week.  This could be caused by:

  1. Today is just quieter than last week.
  2. There is a problem and fewer requests are coming into MQ.  This could be caused by problems in another component, or by a problem in MQ itself.
  3. The log I/O is slower.  When using persistent messages the longest part of a transaction is the commit, waiting for the log disk I/O, so if this I/O is slower it can affect the overall system throughput.  You can get the MQ log I/O response times from the MQ log data (see the next section).
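
The logging rate itself is simple arithmetic: bytes logged in the interval divided by the interval length.  A minimal sketch, assuming the interval totals have already been extracted from the log manager statistics (the names and values below are illustrative, not real SMF field names):

# Sketch: turn "bytes logged per interval" totals into MB/second rates.
# The values and names are illustrative, not real SMF field names.
INTERVAL_SECONDS = 60     # assumed SMF statistics interval

bytes_logged_per_interval = [5_400_000_000, 6_100_000_000, 2_200_000_000]

for bytes_logged in bytes_logged_per_interval:
    mb_per_sec = bytes_logged / INTERVAL_SECONDS / 1_000_000
    print(f"{mb_per_sec:8.1f} MB/sec")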

Displaying MQ log I/O response time

You can break down where the time is spent in doing an I/O into the following areas:

  1. Scheduling the I/O – getting the request into the I/O processor on the CPU.
  2. Sending the request down to the disk controller (e.g. 3990).
  3. Transferring the data.
  4. The I/O completes and sends an interrupt to z/OS; z/OS has to catch this interrupt and wake up the requester.

Plotting the I/O time does not give an entirely accurate picture, as the time to transfer the data depends on the amount of data to transfer.  On a well-run system there should be enough capacity that the other times are constant.  (I was involved in a critical customer situation where the MQ logging performance “died” every Sunday morning.  They ran backups, which overloaded the I/O system.)

In the MQ log statistics you can calculate the average I/O time.  There are two sets of data for each log:

  1. The number of requests, and the sum of the times of the requests, to write 1 page.  This should be pretty constant, as the data is for when only one 4KB page was transferred.
  2. The number of requests, and the sum of the times of the requests, to write more than 1 page.  The average I/O time will depend on the amount of data transferred.
  • When the system is lightly loaded, there will be many requests to write just one page.
  • When big messages are being processed (over 4 KB) you will see multiple pages per I/O.
  • If an application processes many messages before it commits you will get many pages per I/O.  This is typical of a channel with a high achieved batch size.
  • When the system is busy you may find that most of the I/Os write more than one page, because many requests to write small amounts of data fill up more than one page.

I think displaying the average I/O times would be useful.  I haven’t tried this in a customer environment (as I don’t have a customer environment to use).  So if the data looks like

Today         XXXXXXXXXXXXXXXXXXXXXXXX
Last week     XXXXXXXXXXXXXXXXXXXXXXXXXXXXX
One hour ago  XXXXXXXXXXXXXXXXXXX
time in ms    0         1         2         3

it gives you a picture of the I/O response time.

  • The dark green is for I/O with just one page, the size of the bar should be constant.
  • The light green is for I/O with more than one page, the size of the bar will change slightly with load.  If it changes significantly then this indicates a problem somewhere.

Of course you could just display the total I/O response time = (total duration of I/Os) / (total number of I/Os), but you lose the extra information about the one-page writes.
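
A sketch of the calculation, assuming the four counters have already been pulled out of the log statistics (the counter names and values are illustrative, not the real SMF field names):

# Sketch: average log I/O times from the two sets of counters described above.
# The counter names and values are illustrative, not real SMF field names.
def average_ms(total_time_ms: float, count: int) -> float:
    return total_time_ms / count if count else 0.0

# Writes of exactly 1 page: request count and the sum of their durations.
one_page_count,   one_page_time_ms   = 12_000, 6_000.0
# Writes of more than 1 page: request count and the sum of their durations.
multi_page_count, multi_page_time_ms = 3_000,  4_500.0

print("1 page  :", average_ms(one_page_time_ms,  one_page_count),   "ms")  # 0.5 ms
print(">1 page :", average_ms(multi_page_time_ms, multi_page_count), "ms")  # 1.5 ms

# The single combined number hides the split between the two cases.
print("overall :", round(average_ms(one_page_time_ms + multi_page_time_ms,
                                    one_page_count + multi_page_count), 2), "ms")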

Reading from logs

If an application using persistent messages decides to roll back:

  • MQ reads the log buffers for the transaction’s data and undoes any changes.
  • It may be the data is old and not in the log buffers, so the data is read from the active log data sets.
  • It may be that the request is really old (for example, half an hour or more), in which case MQ reads from the archive logs (perhaps on tape).

Looking at an application doing a roll back and having to read from the log:

  • Reading from buffers is OK.  A large number indicates an application problem or a DB2 deadlock type problem.  You should investigate why there is so much rollback activity; see the sketch after this list.
  • Reading from active logs should be infrequent.  It usually indicates an application coding issue where the transaction took too long before committing, perhaps due to a database deadlock, or bad application design (where there is a major delay before the commit).
  • Reading from archive logs is really bad news.  This should never happen.
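
A small sketch of how the three counts could be turned into a traffic-light style check; the counter names and the threshold are illustrative, not MQ statistics field names:

# Sketch: classify log-read activity during rollback.
# Counter names and the threshold are illustrative, not MQ field names.
def check_log_reads(from_buffers: int, from_active: int, from_archive: int) -> str:
    if from_archive > 0:
        return "RED: reads from archive logs - this should never happen"
    if from_active > 0:
        return "AMBER: reads from active logs - look for long running transactions"
    if from_buffers > 100:          # arbitrary illustrative threshold
        return "AMBER: a lot of rollback activity - investigate the applications"
    return "GREEN: reads satisfied from the log buffers"

print(check_log_reads(from_buffers=5, from_active=0, from_archive=0))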

Displaying reads from logs

Today         XXXXXXXXXXXXXXXXXXXXXXXX
Last week     X
One hour ago  XXXXX
rate          0        10    20     40

Where green is “read from buffer”, orange is “read from active log”, and red is “read from archive log”.  Today looks like a bad day.

Using the monitoring data provided via publish/subscribe in MQ midrange

In V9, MQ provided monitoring data, available via a publish/subscribe programming model.  This solved the problem of the MQ statistics and accounting information being written to a queue, where only one consumer could use the data.

You can get information on MQ CPU usage and the amount of log data written, as well as MQ API statistics.

A sample (amqsruaa) is provided to subscribe to and print the data, but it is limited and not suitable for an enterprise environment.  See /opt/mqm/samp/amqsruaa.c for the source program, and bin/amqsrua and bin/amqsruac for the executables (bindings mode and client mode respectively).
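
For reference, this is roughly how the sample is run (the queue manager name QM1 is made up; CPU and SystemSummary are one of the class/type combinations the sample offers):

amqsrua -m QM1
   (with no class specified, the sample lists the available classes)
amqsrua -m QM1 -c CPU -t SystemSummary
   (subscribes to the queue manager CPU summary and prints each publication)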

I tried to use this new method in my mini enterprise, and found it very hard to use, and I think some of the data is of questionable value.

Overall, I found

  1. The documentation is missing or incomplete.
  2. The architecture is poor; it is hard to use in a typical customer environment.
  3. The implementation is poor; it does not follow PCF standards, and it uses the same id for different data types.
  4. Some of the data provided is not explained, and some of the data is not that useful.

I’ve written several pages on the monitoring data in MQ midrange.  I was going to blog it all, but I did not think there would be a big audience for it.