How long will it take my queue manager to fail over and restart on midrange?

Following on from my blog post and making sure your file systems are part of a consistency group – so the data is consistent after a restart, the next question is

“how long will it take to fail over?”.

There are two areas you need to look at

  1. The time to detect and outage.  This can be broken down into the time the active queue manager releases the lock, and the time taken for the release of the lock to be reflected to the standby system.  You need to test and measure this time.  For example you may need to adjust your network configuration.
  2. The time taken to restart the queue manager.  There is an excellent blog post on this from Ant at IBM.
    1. The blog post talks about the rate at which clients can connect to MQ.  Yes MQ can support 10,000 client connections.  But if it takes a significant time to reconnect them all, you may want multiple queue managers, and have parallelism
    2. Avoid deep queues.  In my time at IBM I saw many customers with thousands of messages on a queue with an age over 1 year old!  You need to clean up the queue.  Your applications team should have a process that runs perhaps once a day which cleans up all old messages.  For example there was a getting application instance which timed out and terminated, then the reply arrived.
    3. During normal running most of the IO is write – and this often goes into cache, so very fast.   During recovery the IO is reading from disk which might be from rotating disks rather than solid state.

One lesson from this is you need to test the recover.   You need to test a realistic scenario – take the worst case of number of connections, with peak throughput, and then pull the plug on the active system.

Another lesson is you need to do this regularly – for example monthly as configurations can change.

Sorting out the MQ application trace knotted spaghetti.

You can turn on report = MQRO_ACTIVITY to get activity traces sent to a queue.   This shows the hops and activity of your message.
You can create your own trace route messages to be sent to a remote queue, and get back the hops to get to the queue, or you can use the dspmqrte command to do this for you.

Which ever way you do it, the result is a collection of messages in your specified reply queue.  The problem is how do you untangle the messages.  It is not easy with for a single message.  If you are getting these activity messages every 10 seconds from multiple transactions, you quickly  get knotted spaghetti!   To entangle the spaghetti even more, you could have a central site processing these data from many queue managers, so you get data from multiple messages, and multiple queue managers.

You can get a message for part of your application or transaction.  For example,

  • a message with information about the first 10 MQ verbs your program uses.
  • a message for the sender channel with the MQGET and the send for the local queue manager, and the remote queue manager will send a message with the channel’s receive and MQPUT.

The easy bit – messages for activity on your queue manager.

The event message has a header section.  This has information including

QueueManager: ‘QMA’
Host Name: ‘colinpaice’
SeqNumber: 0
ApplicationName: ‘amqsact’
ApplicationPid: 28683
UserId: ‘colinpaice’

From  QueueManager: ‘QMA‘ and Host Name: ‘colinpaice‘, you know which machine and queue manager you are on.

From ApplicationPid: 28683 SeqNumber: 0, you can see the records for this applications Process ID, and the sequence number.   This happens to be for a program ApplicationName: ‘amqsact’ and UserId: ‘colinpaice’.  I dont know when the sequence number wraps.  If the application ends, and the same process is reused,  I would expect the sequence number to be reset to 0.

You may have many threads running in a process , such as for a web server.  For each MQ operation  there is information for example

MQI Operation: 0
Operation Id: MQXF_PUT
ApplicationTid: 81
OperationDate: ‘2019-05-25′
OperationTime: ’14:28:18’
High Res Time: 1558790898843979
QMgr Operation Duration: 114

We can see that this is for Task ID 81.

So to tie up all of the activity for a program, you have to select the records with the same ApplicationPid, and check the SeqNumber to make sure you are not missing records.  Then you can locate the record with the same TID.
You also need to remember that a thread behaviour can be complex ( like adding meat balls to the spaghetti).  Because of thread pooling, an application may finish with the thread, and the thread can be reused.  If a thread is not being used, it can be deleted, so you will get MQBACK and MQDISC occurring after a period of time.

It is similar for channels

For a sending channel you get the following fields.

QueueManager: ‘QMA’
Host Name: ‘colinpaice’
SeqNumber: 1723
ApplicationName: ‘runmqchl’
Application Type: MQAT_QMGR
ApplicationPid: 5157
ConnName: ‘127.0.0.1(1416)’
Channel Type: MQCHT_CLUSSDR

MQI Operation: 0
Operation Id: MQXF_GET
ApplicationTid: 1

For a  receiving channel you get

QueueManager: ‘QMA’
Host Name: ‘colinpaice’

SeqNumber: 1746
ApplicationName: ‘amqrmppa’
ApplicationPid: 4509
UserId: ‘colinpaice’

Channel Name: ‘CL.QMA’
ConnName: ‘127.0.0.1’
Channel Type: MQCHT_CLUSRCVR

MQI Operation: 0
Operation Id: MQXF_OPEN
ApplicationTid: 5
….
MQI Operation: 1
Operation Id: MQXF_PUT
ApplicationTid: 5
….

As there can be many receiver channels with the same name for example an Receiver MCA channel, you should be able to use the CONNAME IP address to identify the channel being used.

They may have the same or different ApplicationPid.

It might be easier just search all of the channels for the messages with matching msgid and correlid!

 

 

 

 

 

 

Using the monitoring data provided via publish in MQ midrange.

In V9, MQ provided monitoring data, available in a publish/subscribe programming model. This solved the problem of the MQ Statistics and Accounting information being written to a queue, and only one consumer could use the data.

You can get information on the MQ CPU usage, log data written, as well as MQ API statistics.

A sample is provided (amqsruaa) to subscribe to and print the data, but this is limited and not suitable for an enterprise environment. See /opt/mqm/samp/ amqsruaa.c for the source program and bin/amqsrua bin/amqsruac for the executables, bindings mode and client mode.

I tried to use this new method in my mini enterprise, and found it very hard to use, and I think some of the data is of questionable value.

Overall, I found

  1. The documentation missing or incomplete
  2. The architecture is poor, it is hard to use in a typical customer environment
  3. The implementation is poor, it does not follow PCF standards and has the same id for different data types.
  4. Some of the data provided is not explained, and some data is not that useful.

I’ve written several pages on the Monitoring data in MQ midrange, I was going to blog it all – but I did not think there would be a big audience for it.