Whoops my QM emergency recovery procedures did not recover QM in an emergency!

I was working with someone, and we managed to kill a test queue manager on midrange.  I suggested we test out the “emergency procedures” as if this was production and see if we could get “production” back in 30 minutes.

We learned so much from this exercise, so we are now working on a new emergency recovery procedure.

What killed the queue manager

The whole experience started when we thought had better clean up some of the MQ recovery logs.   With circular logging the when the last log fills up it overwrites the first one.  This is fine for many people but it means you may not be able to recover a queue if it is damaged.
We had linear logging, where the logs are not overwritten, MQ just keeps creating new logs.   You can recover queues if they are damaged, because you can go back through the logs.
As our disk was filling up someone deleted some old logs – which were over a week old – and were “obviously” not needed.

MQ was shut down, and restarted – and failed to start.

Lesson 1:  With linear logging you are meant to use the rcdmqimg command which copies queue contents to the log.   You get a message telling you which logs are needed for object recovery, and which logs are needed for restart.   This information is also in the  AMQERRxx.LOG.  You cannot just delete old logs as they may still be needed.

Issue the command at least daily.

Lesson 2:  HA disks do not give you HA.   The disks were mirrored to the backup site – also to the DR site.  The delete file command was reliably done on all disk copies.  We could not start MQ on any site because of the missing file.  We should have had a second queue manager.

These HA disk solutions and active/standby give you faster access to your data, in  a multi site environment, they do not give you High Availability

Initial panic – what options do we have

Lesson 3: Your instructions on how to recover from critical situations need to be readily available.  They should be tested regularly.    We could not find any. You need a process to follow which works, and you have timings for.  So you do not have a half hour discussion “should we restore from backup?”, “how long will it take?”, “will it work?”, “how do we restore from backup”.   The optimum solution may be to shoot  the queue manager and recreate it.  This may be the optimum route to getting MQ “production” back.  You should not have to make critical decisions under pressure, the decision path should have been documented when you have the luxury of time.

Lesson 4: you need to capture the output of every command you are doing.  Support teams will ask “please send me the error logs”.  You do not want to have to copy an paste all of your terminal data.  Linux has a “script” command which does this.  They could not email me the log showing the problems, so we had to have a conference call, and “share screens” to see what was going on, which made it hard for me to look at the problem “up a bit, down a bit – too far”.  All of which extended the recovery period.

Lesson 5: “Let’s restore from the backups”  These backups were taken  12 hours before and were not current, and we did not know how to restore them.
(Little thought, should backups be taken when a QM is down, or do you get integrity issues because files and logs were backed up at different times? – I know z/OS can handle this – Feedback from Matt at IBM.  Yes the queue manager should be shut down for backups – so you need two or three queue managers in your environment.

Make sure you backup your SSL keystore.

Let’s recreate the queue manager

Lesson  6: Do you have any special actions to delete multi instance queue managers.

Do you need linux people to do anything with the shared disks?

Lesson 7: Save key queue manager configuration files.  When you delete a queue manager instance – it deletes any qm.in and MQAT.ini files – you need them as they may have additional customising, for example SSL information.
Of course you are backing these files up -and of course you (personally)  have tested that you can  recover them from the backup.

Copy qm.in and MQAT.ini to a safe location before you delete the queue manager.

Lesson 8:  Ensure people have enough authority to be able to do all the tasks – or have an emergency “break glass userid”.  Many sites only allow operations people to access production with change capability.

Lesson 9:  You need to know how the  create queue manager command and parameters used to create the queue manager.
Some queue manager options can be changed after the queue manager has been started.  Others cannot – for example linear logging|circular logging.  Size of log files etc.

You need to have saved the original command used with all of the options.   Do not forget that when you did it the first time it was MQ V7.5 – you are now migrated to MQ V9, so it should work OK!

Lesson 10: Copy the qm.ini files etc and overwrite the newly created ones.

Start the queue manager.

Lesson 11:  Customize the queue manager.  You need to have a file of all of your objects queues and channels etc.  You may have a file which you use to create all new queue managers, but this may not  be up to date.  It is better to run dmpmqcfg every day to dump the definitions to get the “current” state of the objects which you can reload.
The -o 1line option is useful as then you can use grep to select objects with all the parameters.

Lesson 12: In your emergency recreate document  note how long each stage takes.  One step, closing down the queue manager took several minutes.  We were discussing if was looping or not – and should we cancel it.  Eventually it shut down.  It would have been better to know this stage takes 5 minutes.

Lesson 13: Document the expected output from each stage – and highlight any stage which gives warnings or errors.  We ran runmqsc with a file of definitions, and it reported 7 errors.  We wasted time checking these out.  Afterward we were told “We always get those”.

Lesson 14: Do you need to do work for your multi instance queue managers?

Getting the queue manager back into “production”. 

Lesson 15: Resetting receiver channel sequence numbers.   Sender and receiver channels will have the wrong sequence number.  You can reset the sender channels yourself.  Receiver channels are a bit harder, as the “other end” has to reset the sequence number. You can  either

  • Contact the people responsible for the other end (you do have their contact details dont you?) and
  • ask them to reset the channel,

or you wait till their queue manager sends you a message – and then you get notified of the sequence number mismatch, and can use reset channel to reset your number to the expected number.   The channel will retry and this time it will  work.  This means you need to sit by your computer, waiting for these events.  Maybe no messages will be sent over the weekend, and so you can logon first thing Monday morning to catch the events.

Lesson 16: Your SSL keystore is still available isnt it?

Lesson 17: Is every one who has the on-call phone familiar with this procedure, and has practiced it?

Lesson 18: People need to be familiar with the tools on the platform.  You may normally use notepad to edit files on your personal work station.  On the production box you only have “vi”.

Overall – this is one process that needs to work – and to get your queue manager up in the optimum time.  You need to practice it, and get it right.

Summary

You need to practice emergency recovery situations

I used to do Scuba diving.  You learn, and have to practice “ditch and retrieve” where you take your kit off under water and have to put it on again.   Once I needed to do this in the sea.  It was dark, I got caught in a fishing net, so I had to take my kit off, untangle it (by touch), and put it on again.  If I had not practiced this I would not be here today.

 

Checking the daisy chain around my MQ network

Young children collect flowers and chain them to make a circle and so make a daisy chain.

People also talk about daisy chaining electrical extension leads together to make a very long lead out of lots of small leads.

In MQ we can also have daisy chains.  One use is to check all of the links are working, and there are no delays on the channels.

If an application puts a message onto the Clustered Request Queue(BQ) on QMA, it goes around and the reply can be got from the Reply queue, then we have checked all the links are working; we have daisy chained the requests.

DaisyChainOnce I Once I got it working the definitions were simple.

 

On QMB

DEFINE QR(BQ) RQMNAME(QMC) RNAME(CQ) CLUSTER(MYCLUSTER)

On QMC

DEFINE QREMOTE(CQ) CLUSTER(MYCLUSTER) RQMNAME(QMA) RNAME(REPLY)

On QMA

DEFINE QL(REPLY)

Once we have set the definitions up I could can use the MQ utility dspmqrte to show us the path.  For example

  • dspmqrte puts a message to the BQ .  This is a clustered queued on QMB,  It reports the queue BQ is being used, and stores the message on SYSTEM.CLUSTER.XMIT.QUEUE.
  • On QMA the channel TO.B gets the message from the SCTQ and sends it
  • On QMB the channel TO.B says put this to BQ, which is defined as CQ, and stores it on the SCTQ.
  • On QMB the channel TO.C gets the message from the SCTQ and sends it to QMC.
  • On QMC the channel TO.C says put this to CQ, which is defined as REPLY,  and stores if on SCTQ
  • On QMC the channel TO.A gets the message from the SCTQ and sends it to QMA.
  • On QMA the channel TO.A puts it to the Reply queue.

I used dspmqrte -m QMA -q BQ …, and it worked like magic.   I requested summary information(-v summary) and I got the following output, which shows the intermediate queues used.

AMQ8653I: DSPMQRTE command started with options ‘-m QMA -qBQ -rqCP0001 -rqm QMA -v summary -d yes -w5’.
AMQ8659I: DSPMQRTE command successfully put a message on queue ‘SYSTEM.CLUSTER.TRANSMIT.QUEUE’, queue manager ‘QMA’.
AMQ8674I: DSPMQRTE command is now waiting for information to display.
AMQ8666I: Queue ‘SYSTEM.CLUSTER.TRANSMIT.QUEUE’ on queue manager ‘QMA’.
AMQ8666I: Queue ‘SYSTEM.CLUSTER.TRANSMIT.QUEUE’ on queue manager ‘QMB’.
AMQ8666I: Queue ‘SYSTEM.CLUSTER.TRANSMIT.QUEUE’ on queue manager ‘QMC’.
AMQ8666I: Queue ‘REPLY’ on queue manager ‘QMA’.
AMQ8652I: DSPMQRTE command has finished.

Note, specifying RQMNAME is not required, and clustering will pick a queue manager which hosts the queue.  This means that you may be testing a different path to what you expected.  By using it you specify the route.

When I stopped QMC and retried the dspmqrte command , I got

AMQ8653I: DSPMQRTE command started with options ‘-m QMA -qBQ -rqCP0001 -rqm QMA -v summary -d yes -w5’.
AMQ8659I: DSPMQRTE command successfully put a message on queue ‘SYSTEM.CLUSTER.TRANSMIT.QUEUE’, queue manager ‘QMA’.
AMQ8674I: DSPMQRTE command is now waiting for information to display.
AMQ8666I: Queue ‘SYSTEM.CLUSTER.TRANSMIT.QUEUE’ on queue manager ‘QMA’.
AMQ8666I: Queue ‘SYSTEM.CLUSTER.TRANSMIT.QUEUE’ on queue manager ‘QMB’.
AMQ8652I: DSPMQRTE command has finished.

It does not report that there were any problems;  it just did not report two hops.

To see if there are problem, I think the best thing to do is pipe the output into a file

dspmqrte… > today

and compare this with a good day.

diff today goodday -d  gave me the differences – so I could see there was a problem because I was missing

> AMQ8666I: Queue ‘SYSTEM.CLUSTER.TRANSMIT.QUEUE’ on queue manager ‘QMC’.
> AMQ8666I: Queue ‘REPLY’ on queue manager ‘QMA’.


I had tried to define a clustered queue alias queues instead of a remote queue.  I got responses like

Feedback: UnknownAliasBaseQ, MQRC_UNKNOWN_ALIAS_BASE_Q, RC2082.

Featured

When is activity trace enabled?

I found the documentation for activity trace was not clear as to the activity trace settings.

In mqat.ini you can provide information as to what applications (including channels) you want traced.

For example

applicationTrace:
ApplClass=USER
ApplName=progput
Trace=OFF

This file and trace value are checked when the application connects.  If you have TRACE=ON when the application connects, and you change it to TRACE=OFF, it will continue tracing.

If you have TRACE=OFF specified, and the application connects, changing it to TRACE=ON will not produce any records.

With

  • TRACE=ON, the application will be traced
  • TRACE=OFF the application will not be traced
  • TRACE= or omitted then the tracing depends on alter qmgr ACTVTRC(ON|OFF).   For a long running transaction using alter qmgr to turn it on, and then off, you will get trace records for the application from in the gap.

If you have

applicationTrace:
ApplClass=USER 
ApplName=prog* 
Trace=OFF

applicationTrace:
ApplClass=USER 
ApplName=progp*
Trace=ON

then program progput will have trace turned on because the definition is more specific.

You could have

applicationTrace:
ApplClass=USER 
ApplName=progzzz
Trace=OFF

applicationTrace:
ApplClass=USER 
ApplName=prog*
Trace=

to  be able to turn trace on for all programs beginning with prog, but not to trace progzzz.

 

Thanks to Morag of MQGEM  who got in contact with me, and said  long running tasks are notified of a change to the mqat.ini file, if the file has changed, and a queue manager attributed has been changed – even if it is changed to the same variable.

This and lots of other great info about activity trace (a whole presentation’s worth of information) is available here.

Sorting out the MQ application trace knotted spaghetti.

You can turn on report = MQRO_ACTIVITY to get activity traces sent to a queue.   This shows the hops and activity of your message.
You can create your own trace route messages to be sent to a remote queue, and get back the hops to get to the queue, or you can use the dspmqrte command to do this for you.

Which ever way you do it, the result is a collection of messages in your specified reply queue.  The problem is how do you untangle the messages.  It is not easy with for a single message.  If you are getting these activity messages every 10 seconds from multiple transactions, you quickly  get knotted spaghetti!   To entangle the spaghetti even more, you could have a central site processing these data from many queue managers, so you get data from multiple messages, and multiple queue managers.

You can get a message for part of your application or transaction.  For example,

  • a message with information about the first 10 MQ verbs your program uses.
  • a message for the sender channel with the MQGET and the send for the local queue manager, and the remote queue manager will send a message with the channel’s receive and MQPUT.

The easy bit – messages for activity on your queue manager.

The event message has a header section.  This has information including

QueueManager: ‘QMA’
Host Name: ‘colinpaice’
SeqNumber: 0
ApplicationName: ‘amqsact’
ApplicationPid: 28683
UserId: ‘colinpaice’

From  QueueManager: ‘QMA‘ and Host Name: ‘colinpaice‘, you know which machine and queue manager you are on.

From ApplicationPid: 28683 SeqNumber: 0, you can see the records for this applications Process ID, and the sequence number.   This happens to be for a program ApplicationName: ‘amqsact’ and UserId: ‘colinpaice’.  I dont know when the sequence number wraps.  If the application ends, and the same process is reused,  I would expect the sequence number to be reset to 0.

You may have many threads running in a process , such as for a web server.  For each MQ operation  there is information for example

MQI Operation: 0
Operation Id: MQXF_PUT
ApplicationTid: 81
OperationDate: ‘2019-05-25′
OperationTime: ’14:28:18’
High Res Time: 1558790898843979
QMgr Operation Duration: 114

We can see that this is for Task ID 81.

So to tie up all of the activity for a program, you have to select the records with the same ApplicationPid, and check the SeqNumber to make sure you are not missing records.  Then you can locate the record with the same TID.
You also need to remember that a thread behaviour can be complex ( like adding meat balls to the spaghetti).  Because of thread pooling, an application may finish with the thread, and the thread can be reused.  If a thread is not being used, it can be deleted, so you will get MQBACK and MQDISC occurring after a period of time.

It is similar for channels

For a sending channel you get the following fields.

QueueManager: ‘QMA’
Host Name: ‘colinpaice’
SeqNumber: 1723
ApplicationName: ‘runmqchl’
Application Type: MQAT_QMGR
ApplicationPid: 5157
ConnName: ‘127.0.0.1(1416)’
Channel Type: MQCHT_CLUSSDR

MQI Operation: 0
Operation Id: MQXF_GET
ApplicationTid: 1

For a  receiving channel you get

QueueManager: ‘QMA’
Host Name: ‘colinpaice’

SeqNumber: 1746
ApplicationName: ‘amqrmppa’
ApplicationPid: 4509
UserId: ‘colinpaice’

Channel Name: ‘CL.QMA’
ConnName: ‘127.0.0.1’
Channel Type: MQCHT_CLUSRCVR

MQI Operation: 0
Operation Id: MQXF_OPEN
ApplicationTid: 5
….
MQI Operation: 1
Operation Id: MQXF_PUT
ApplicationTid: 5
….

As there can be many receiver channels with the same name for example an Receiver MCA channel, you should be able to use the CONNAME IP address to identify the channel being used.

They may have the same or different ApplicationPid.

It might be easier just search all of the channels for the messages with matching msgid and correlid!

 

 

 

 

 

 

Beware unzipping a windows install file

I hit an interesting little problem trying to download and use the developer version of MQ V9.1 for windows.

I downloaded the zip file on my Linux box, and extracted it onto my memory stick.
I rebooted into Windows 10, and tried to install MQ from the memory stick.  I got various errors including error code 0x80004005, and “corrupt file” when trying to install Explorer.cab.  If you unzip it on Windows, and install it from the memory stick, it works!

Strange eh!

What data is there to help you manage your systems?

There is a lot of information provided by MQ to help you manage your systems, some of it is not well documented.

I’ll list the sources I know, and when they might be needed, but before that I’ll approach it from the “what do I want to do”.

What do I want to do?

I want to…

  • know when significant events happen in my system, such as channel start and stop, and security events. AMQERRnn.LOG
  • know when “events” happen, such as a queue filling up security exceptions.  Event Queues
  • be able to specify thresholds, such as when the current depth is > 10 messages, and the age of the oldest message is older than 5 seconds then do something. Display commands
  • be able draw graphs of basic metrics, such as number of messages put per hour/per day so I can do capacity planning, and look for potential capacity problems. Statistics
  • identify which queues are being used, display queue activity, number of puts, and size of puts etc  Statisticsdisplay object status
  • identify which queues (objects) are not being used, so they can be deleted.  Absence of records in Statistics.  Issue DIS QSTATUS every day and see if a message has been put to the queue.  Creating events for when a message is put to the queue.  Note an object may only be used once a year – so you need to monitor it all year.
  • identify which applications are putting to and getting from queues. Accounting
  • see what MQI verbs are being used, so we can educate developers on the corporate naming standards, and API usage.  Activity trace
  • display the topology.  Display commandsTrace routeActivity trace
  • trace where messages are going, so we can draw charts of the flow of message requests and their responses, and display the topology of what is actually being used.  Trace route
  • measure round trip times of messages – so I know if there are delays in the end to end picture. Trace route
  • Understand the impact of a problem “here” by seeing what flow through “here”.  What’s my topology

What sources are there?

AMQERRnn.LOG.

These contain information about events in the queue manager, such as channel start and channel stop. These files are in /var/mqm/qmgrs/QMA/errors/… and can be read using an editor or browser.  People often feed these into tools like SPLUNK, and then you can filter and do queries to monitor for messages that have not been seen before.

Event queues.

MQ messages are put on queues like SYSTEM.ADMIN.QMGR.EVENT.

There is a sample amqsevt which can be used to print the message in text format or json format – or you can write your own program.

Creating events.

You can configure MQ to produce events when conditions occur. For important queues you can set a high threshold, and MQ produces an event when this limit is exceeded. You can use this

  • to see if messages are accumulating in a queue
  • to see if a queue is being used – set the queue high threshold to be 1, and you will get an event if a message is put to a queue

Statistics

If you turn on statistics you information on the number of puts, gets for the system, and the number of puts and gets etc to a queue. This information is put to a queue SYSTEM.ADMIN.STATISTICS.QUEUE

The information is summarized by queue.

You can use

  1. the sample amqsevt to process these messages, you can have output in json format for input into other tools.
  2. Systems management products like Tivoli can take these messages and store the output in a database to allow SQL queries
  3. Write your own program

One problem with the data going to a queue, is that a program processing the queue may get and delete the message on the queue, so other applications cannot use it. Some programs have a browse option.

Later versions of MQ use a publish subscribe model, so you subscribe to a topic, and get the data you want sent to your queue.

Accounting

If you turn on accounting you information about what an application is doing. The the number of puts, gets for the system, and the number of puts and gets etc to a queue. This information is put to a queue SYSTEM.ADMIN.ACCOUNTING.QUEUE. The information is similar to the information provided by statistics, but it provided information about which application used the objects.

You can use

  1. the sample amqsevt to process these messages.
  2. Systems management products like Tivoli can take these messages and store the output in a database to allow SQL queries
  3. Write your own program

One problem with the data going to a queue, is that a program processing the queue may get and delete the message on the queue, so other applications cannot use it. Some programs have a browse option.

Later versions of MQ use a publish subscribe model, so you subscribe to a topic, and get the data you want sent to your queue.

You can use display commands.

You can use commands, or use the MQINQ API to display information about object. You can issue commands using runmqsc or from an application by putting command requests in PCF format to a queue, and getting the data back in PCF format. Your program has to decode the PCF data.
You can display multiple fields and have logic  to take action if values are out of the usual range.   For example
periodically display the curdepth, and the age of the oldest message on the queue and then do processing based on these value. Tivoli uses this technique to creates situations if specified conditions are met. You can easily write your own programs to do this, for example using python scripts and pymqi.

What’s my topology?

You can use the DIS CHANNEL … CONNAME command to show where a channel connects to and use this to draw up a picture of your configuration.

You can use the DIS QCLUSTER and DIS CLUSQMGR to show information about your clusters, and where cluster queues are, and use this information to draw up a picture of your configuration

You can use the traceroute to dynamically see the routes between nodes, and understand the proportion of messages going to different destinations – at that moment in time.

Displaying object status

You can use display commands to show information such as the last time a message was put to a queue, or got from a queue, or sent over a channel.

Application trace

The application trace shows you the MQ API calls, the parameters, and return codes. This data goes to the SYSTEM.ADMIN.TRACE.ACTIVITY.QUEUE queue.

You can use this to check the API options being used, for example

  • Messages persistence is correct for the application pattern (inquiry is non persistence)
  • The correct message expiry is specified (non persistent has time value)
  • The correct options are specified
  • Applications are using MQ GET with wait rather than polling a queue
  • The correct syncpoint options are being used.
  • Which queue is really used. You open one queue name but this could be an alias. You get the queue it maps to.

There is an overhead to collecting this, so you do not want to run this for extended periods of time.

Running it for just a minute or two may give you enough information. You can turn this on for an individual program.

You can use amqsevt to process the queue.

Trace route.

You can send a message “to a queue” and get back the processes involved in getting to the queue.. For example use the dspmqrte to “put” a message to a cluster queue, and you will see the sending channel get the message and send it, then the receiver channel at the remote end receive the message and “putting” it to the queue. One of the data fields is the operation time, so you can see where the delays were in the processing (for example it took seconds to be sent over a channel).  See here

By default the message is not put to the queue, but there is an option to put it to the queue for the application to process, but there is no documentation to tell you how you process this message. The dspmqrte command effectively shows you the hops between queues. It is up to you to build up the true end to end path, and manage the responses yourself.

The provided programs dspmqrte are simplistic and show you the path to the queue, and the channels used on the queue.

The data is not pure PCF, and the sample amqsevt does not format it. I have modified it to handle this.

Should we share everything or share nothing?

We have the spectrum of giving every application having its own queue manager spread across Linux images, or having a big server with a few queue managers servicing all the applications.

The are good points and bad points within the range of share nothing, share all.

What do you need to consider when considering sharing of resources within the environment.

You may want to provide isolation

  • For critical applications, so other applications cannot impact it (keep trouble out)
  • To protect applications from a “misbehaving” application and to minimise the impact, (keep trouble in).
  • For regulatory reasons
  • For capacity reasons
    • Disk IO response time and throughput.
    • Amount of RAM needed
    • Amount of virtual storage
    • Number of of TCP ports
    • Number of MQ connections
    • Number of file connections in the Operating System
  • For security. It is often easier to deny people access to an image, than to put all the controls in place within the image.
  • You have more granularity at shutting down an image – fewer applications are impacted.
  • Restart time may be shorter.

If your requirements don’t fit into the above, you should consider sharing resources.

The advantages of sharing

  • Fewer environments to manage.
    • Provisioning can be done with automation, but the large number of small images can be hard to manage.
    • Monitoring is easier – you do not have so many systems to look at. (How big a screen do you need to have every system showing up on it ?)
    • You have to do changes and ugprades less frequently
  • Removing images tends to leave information behind – for example information about a deleted clustered queue manager stays in a full repository for many days.
  • The operating system may be able to manage work better than a VM hypervisor “helping”.
  • Fewer events and situations to manage.
  • By having more work the costs can be reduced. For example a channel with 10 messages in a single batch uses much less resources than 10 batches of one message.

What else?

You may need to provide multiple queue managers for availability and resilience, but rather than provide two queue manager for applicationA, and two queue managers for applicationB, you may be able to have two queue managers shared, with each queue manager supporting applicationA and applicationB.

You can provide isolation within a queue manager by

  • Having specific application queue names, for example the queues names start with the application prefix. You can then define a security profile (authrec) based on that prefix.
  • You can use split cluster transmit queue (SCTQ) so clusters do not share the SYSTEM.CLUSTER.TRANSMIT.QUEUE – but have their own queue and their own channels.

You may think by providing multiple instances you are providing isolation. There can be interaction with others at all levels, queue manager, operating system, hypervisors, disk subsystem network controllers – you just do not see them.

I had to work on a problem where MQ in a virtualized distributed environment saw very long disk I/O response times. We could see this from an an MQ trace, and from the Linux performance data. The customer’s virtualization people said on average the response time was OK, so no issue. The people in charge of the Storage Area Network, said they could not see any problems. The customer solved this performance problem by making all messages non persistent – which solved the performance problem – but may have introduced other data problems! As my father used to say, the more moving parts, the more parts that can go wrong.