Whoops, my QM emergency recovery procedures did not recover the QM in an emergency!

I was working with someone, and we managed to kill a test queue manager on midrange.  I suggested we test out the “emergency procedures” as if this was production and see if we could get “production” back in 30 minutes.

We learned so much from this exercise, so we are now working on a new emergency recovery procedure.

What killed the queue manager

The whole experience started when we thought we had better clean up some of the MQ recovery logs.  With circular logging, when the last log fills up it overwrites the first one.  This is fine for many people, but it means you may not be able to recover a queue if it is damaged.
We had linear logging, where the logs are not overwritten; MQ just keeps creating new logs.  You can recover queues if they are damaged, because you can go back through the logs.
As our disk was filling up, someone deleted some old logs – which were over a week old and were “obviously” not needed.

MQ was shut down, and restarted – and failed to start.

Lesson 1:  With linear logging you are meant to use the rcdmqimg command, which records the queue contents (a media image) to the log.   You get a message telling you which logs are needed for object recovery, and which logs are needed for restart.   This information is also in the AMQERRxx.LOG.  You cannot just delete old logs, as they may still be needed.

Issue the command at least daily.
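
As a rough sketch (QM1 is just an example queue manager name), a daily housekeeping job could look like the following; the command reports which log is the oldest one needed for restart and which is the oldest needed for media recovery, and only logs older than both are candidates for archiving or deletion:

    # Record media images of every object type for queue manager QM1.
    # Check the messages it produces (and the AMQERRxx.LOG) for the oldest
    # log files still required before archiving or deleting anything.
    rcdmqimg -m QM1 -t all "*"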

Lesson 2:  HA disks do not give you HA.   The disks were mirrored to the backup site, and also to the DR site.  The file deletion was faithfully replicated to every disk copy, so we could not start MQ on any site because of the missing file.  We should have had a second queue manager.

These HA disk solutions and active/standby configurations give you faster access to your data in a multi-site environment; they do not give you High Availability.

Initial panic – what options do we have?

Lesson 3: Your instructions on how to recover from critical situations need to be readily available, and they should be tested regularly.  We could not find any.  You need a process to follow which works, and which you have timings for, so you do not have a half hour discussion: “should we restore from backup?”, “how long will it take?”, “will it work?”, “how do we restore from backup?”.   The optimum route to getting MQ “production” back may be to shoot the queue manager and recreate it.  You should not have to make critical decisions under pressure; the decision path should have been documented when you had the luxury of time.

Lesson 4: You need to capture the output of every command you issue.  Support teams will ask “please send me the error logs”, and you do not want to have to copy and paste all of your terminal data.  Linux has a “script” command which does this.  They could not email me the log showing the problems, so we had to have a conference call and “share screens” to see what was going on, which made it hard for me to look at the problem (“up a bit, down a bit – too far”).  All of this extended the recovery period.
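
For example, on Linux you can wrap the whole recovery session in script, so there is a single transcript file to send to the support team (the file name below is only an illustration):

    # Start capturing everything typed and displayed in this terminal.
    script /tmp/qm_recovery_$(date +%Y%m%d_%H%M).log

    # ... run the recovery commands (endmqm, crtmqm, runmqsc, ...) ...

    # End the capture; the transcript is left in the file named above.
    exit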

Lesson 5: “Let’s restore from the backups.”  These backups had been taken 12 hours before, so they were not current, and we did not know how to restore them.
(A little thought: should backups be taken while a queue manager is down, or do you get integrity issues because the files and logs were backed up at different times?  I know z/OS can handle this.  Feedback from Matt at IBM: yes, the queue manager should be shut down for backups – so you need two or three queue managers in your environment.)

Make sure you back up your SSL keystore.
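
A minimal sketch, assuming the default keystore location and an example queue manager called QM1; /backup is a placeholder for somewhere that is not on the replicated queue manager disks:

    # The stash (.sth) file is needed as well as the key database itself.
    cp /var/mqm/qmgrs/QM1/ssl/key.kdb /backup/QM1/ssl/
    cp /var/mqm/qmgrs/QM1/ssl/key.sth /backup/QM1/ssl/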

Let’s recreate the queue manager

Lesson 6: Do you have any special actions to delete multi-instance queue managers?

Do you need the Linux people to do anything with the shared disks?
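
I cannot give a definitive procedure here, but as a sketch for a multi-instance queue manager (QM1 again as the example name), the shape of it is:

    # On the server which currently owns the queue manager data:
    endmqm -w QM1     # end the queue manager (the standby instance ends too)
    dltmqm QM1        # delete the queue manager and its data on the shared disk

    # On each other server where a standby instance was defined:
    rmvmqinf QM1      # remove that server's reference to the deleted queue manager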

Lesson 7: Save the key queue manager configuration files.  When you delete a queue manager instance it deletes the qm.ini and MQAT.ini files – you need them, as they may contain additional customisation, for example SSL information.
Of course you are backing these files up – and of course you (personally) have tested that you can recover them from the backup.

Copy qm.ini and MQAT.ini to a safe location before you delete the queue manager.
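
A minimal sketch, assuming the default /var/mqm data directory and QM1 as the queue manager name (check the exact file names in your own data directory):

    # Save the configuration files that deleting the queue manager will remove.
    mkdir -p /tmp/QM1.save
    cp /var/mqm/qmgrs/QM1/qm.ini   /tmp/QM1.save/
    cp /var/mqm/qmgrs/QM1/mqat.ini /tmp/QM1.save/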

Lesson 8:  Ensure people have enough authority to be able to do all the tasks – or have an emergency “break glass userid”.  Many sites only allow operations people to access production with change capability.

Lesson 9:  You need to know the create queue manager command and the parameters that were used to create the queue manager.
Some queue manager options can be changed after the queue manager has been created.  Others cannot – for example linear versus circular logging, or the size of the log files.

You need to have saved the original command that was used, with all of its options.   Do not forget that when you created it the first time it was on MQ V7.5 – you have since migrated to MQ V9, so check that the same command and options still apply.
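
For example (all of the values below are illustrative, not recommendations), this is the sort of command worth keeping in the recovery document:

    # Recreate QM1 with linear logging, the same log file size and log counts
    # as the original queue manager, and the usual dead letter queue.
    crtmqm -ll -lf 16384 -lp 20 -ls 10 -u SYSTEM.DEAD.LETTER.QUEUE QM1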

Lesson 10: Copy back the saved qm.ini files etc. and overwrite the newly created ones.
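
Continuing the earlier sketch (same example names and paths):

    # Put the saved configuration files back over the freshly created ones.
    cp /tmp/QM1.save/qm.ini   /var/mqm/qmgrs/QM1/
    cp /tmp/QM1.save/mqat.ini /var/mqm/qmgrs/QM1/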

Start the queue manager.

Lesson 11:  Customize the queue manager.  You need a file of all of your objects – queues, channels, and so on.  You may have a file which you use to create all new queue managers, but this may not be up to date.  It is better to run dmpmqcfg every day to dump the definitions, so you have the “current” state of the objects, which you can then reload.
The -o 1line option is useful because you can then use grep to select objects with all of their parameters on one line.
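
A sketch of a daily dump (the paths and file names are only examples):

    # Dump every object definition of QM1 in one-line MQSC format, one file
    # per day, so the latest definitions can be replayed after a rebuild.
    dmpmqcfg -m QM1 -a -o 1line > /backup/QM1/defs_$(date +%Y%m%d).mqsc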

Lesson 12: In your emergency recreate document, note how long each stage takes.  One step, closing down the queue manager, took several minutes.  We were discussing whether it was looping or not, and whether we should cancel it.  Eventually it shut down.  It would have been better to know in advance that this stage takes 5 minutes.

Lesson 13: Document the expected output from each stage – and highlight any stage which gives warnings or errors.  We ran runmqsc with a file of definitions, and it reported 7 errors.  We wasted time checking these out.  Afterward we were told “We always get those”.
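
For example, when replaying the saved definitions, keep the report so the “normal” errors can be listed in the recovery document (the definitions file here is the hypothetical one from the earlier dmpmqcfg sketch):

    # Replay the dumped definitions into the rebuilt queue manager and keep
    # the output; record which error messages are expected and can be ignored.
    runmqsc QM1 < /backup/QM1/defs_20250101.mqsc > /tmp/QM1_reload.log 2>&1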

Lesson 14: Do you need to do any extra work for your multi-instance queue managers?

Getting the queue manager back into “production”. 

Lesson 15: Resetting receiver channel sequence numbers.   Sender and receiver channels will have the wrong sequence number.  You can reset the sender channels yourself.  Receiver channels are a bit harder, as the “other end” has to reset the sequence number.  You can either

  • contact the people responsible for the other end (you do have their contact details, don’t you?) and ask them to reset the channel, or
  • wait until their queue manager sends you a message; you are then notified of the sequence number mismatch and can use RESET CHANNEL to set your number to the expected value (see the sketch after this list).  The channel will retry, and this time it will work.  This means you need to sit by your computer waiting for these events.  Maybe no messages will be sent over the weekend, so you can log on first thing Monday morning to catch the events.
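
A sketch of the commands involved (the channel names and sequence number are illustrative):

    # A sender channel can be reset locally; the reset reaches the other end
    # the next time the channel starts.
    echo "RESET CHANNEL(TO.REMOTE.QM) SEQNUM(1)" | runmqsc QM1

    # For a receiver channel, once the mismatch has been reported, set this
    # end to the sequence number the sending side expects, then let the
    # channel retry.
    echo "RESET CHANNEL(FROM.REMOTE.QM) SEQNUM(12345)" | runmqsc QM1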

Lesson 16:  Your SSL keystore is still available, isn’t it?

Lesson 17:  Is everyone who has the on-call phone familiar with this procedure, and have they practiced it?

Lesson 18: People need to be familiar with the tools on the platform.  You may normally use Notepad to edit files on your personal workstation; on the production box you only have “vi”.

Overall, this is one process that needs to work, and to get your queue manager back up in the optimum time you need to practice it and get it right.

Summary

You need to practice emergency recovery situations.

I used to do Scuba diving.  You learn, and have to practice “ditch and retrieve” where you take your kit off under water and have to put it on again.   Once I needed to do this in the sea.  It was dark, I got caught in a fishing net, so I had to take my kit off, untangle it (by touch), and put it on again.  If I had not practiced this I would not be here today.