Following on from my blog post about making sure your file systems are part of a consistency group, so that the data is consistent after a restart, the next question is “how long will it take to fail over?”
There are two areas you need to look at:
- The time to detect an outage. This can be broken down into the time taken for the active queue manager to release the lock, and the time taken for the release of the lock to be reflected on the standby system. You need to test and measure this time; you may, for example, need to adjust your network configuration to reduce it.
- The time taken to restart the queue manager. There is an excellent blog post on this from Ant at IBM.
  - The blog post talks about the rate at which clients can connect to MQ. Yes, MQ can support 10,000 client connections, but if it takes a significant time to reconnect them all, you may want multiple queue managers so that clients can reconnect in parallel.
  - Avoid deep queues. In my time at IBM I saw many customers with thousands of messages on a queue that were over a year old! You need to clean up the queue. Your applications team should have a process, run perhaps once a day, which cleans up all old messages. These can build up when, for example, a getting application instance times out and terminates, and then the reply arrives.
  - During normal running most of the I/O is writes, which often go into cache and so are very fast. During recovery the I/O is reads from disk, which might be rotating disks rather than solid state.
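For the daily clean-up of old messages mentioned above, the core decision is "how old is this message?". The sketch below is only that decision, not a complete clean-up tool. It assumes you already have each message's MQMD PutDate (YYYYMMDD) and PutTime (HHMMSSTH, in GMT) as strings. The function names and the one-day threshold are made up for illustration, and browsing and removing the messages is left to whatever MQ client library you use.

```python
from datetime import datetime, timedelta, timezone


def put_timestamp(put_date: str, put_time: str) -> datetime:
    """Combine MQMD PutDate (YYYYMMDD) and PutTime (HHMMSSTH) into a UTC datetime."""
    base = datetime.strptime(put_date + put_time[:6], "%Y%m%d%H%M%S")
    hundredths = int(put_time[6:8])  # "TH" is tenths and hundredths of a second
    return base.replace(tzinfo=timezone.utc) + timedelta(milliseconds=10 * hundredths)


def is_stale(put_date: str, put_time: str, max_age: timedelta, now=None) -> bool:
    """Return True if the message was put longer ago than max_age."""
    if now is None:
        now = datetime.now(timezone.utc)
    return now - put_timestamp(put_date, put_time) > max_age


# Hypothetical policy: anything older than one day is a candidate for clean-up.
MAX_AGE = timedelta(days=1)
```

A clean-up job could browse the queue, call `is_stale` on each message, and remove (or move to an archive queue) the ones that return True.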
One lesson from this is that you need to test the recovery. You need to test a realistic scenario: take the worst case for the number of connections, at peak throughput, and then pull the plug on the active system.
Another lesson is that you need to do this regularly, for example monthly, as configurations can change.
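One way to put a number on a "pull the plug" test is to poll the service from a client machine and time the gap. The sketch below is a rough harness and not an MQ tool. It treats a successful TCP connect to the listener port as "up", which is only a proxy, because the queue manager may need longer before applications can do useful work. The host names and port in the usage comment are placeholders.

```python
import socket
import time


def listener_is_up(host, port, timeout=1.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def measure_outage(probe, poll_interval=0.5, max_wait=600.0,
                   clock=time.monotonic, sleep=time.sleep):
    """Wait until probe() fails, then return the seconds until it succeeds again."""
    deadline = clock() + max_wait
    # Phase 1: the service is up; wait for the outage to start.
    while probe():
        if clock() > deadline:
            raise TimeoutError("no outage observed")
        sleep(poll_interval)
    down_at = clock()
    # Phase 2: the service is down; wait for it to come back.
    while not probe():
        if clock() > deadline:
            raise TimeoutError("service did not come back")
        sleep(poll_interval)
    return clock() - down_at


# Example usage (placeholder hosts; start this, then pull the plug):
# probe = lambda: (listener_is_up("mq-active.example", 1414)
#                  or listener_is_up("mq-standby.example", 1414))
# print(f"outage lasted about {measure_outage(probe):.1f} s")
```

The result is only accurate to within the polling interval plus the connect timeout, so keep both small compared with the failover time you expect. Recording the figure from each monthly test makes it easier to spot when a configuration change has made things worse.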
Hi Colin, you’re dead right to highlight the question of restart times; this is a critical factor in how available your MQ system is. For that reason MQ has been focusing on this recently, and a number of improvements have been made (since 9.1.1). As you say, everyone should test their own systems, but if you’re interested in what we’ve seen in the lab for those “worst case” scenarios, start here: https://developer.ibm.com/messaging/2019/06/12/improved-switch-fail-over-times-in-mq-v9-1-2/.
From that work we saw that another aspect to consider when looking for speedy recovery is your channels. It’s one thing to get your queue manager back online in a couple of seconds, but you have to make sure clients and other queue managers notice as quickly as possible to get messages flowing again. For this reason you need to check the heartbeat and retry intervals on your channels.
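To see why the retry intervals matter, here is a small worked calculation. It uses a simplified model, which is an assumption and not a description of MQ internals: after a failure, a channel retries at fixed short intervals for a set number of attempts, then at fixed long intervals. The parameters correspond to the SHORTRTY, SHORTTMR and LONGTMR channel attributes, and the function name is made up.

```python
def first_successful_retry(restart_after, short_count, short_interval, long_interval):
    """Seconds after the failure at which the first retry at or after restart_after occurs.

    Simplified model: retries happen at short_interval, 2 * short_interval, ...
    for short_count attempts, then every long_interval seconds after that.
    """
    if short_interval <= 0 or long_interval <= 0:
        raise ValueError("intervals must be positive")
    # Short retries.
    for attempt in range(1, short_count + 1):
        t = attempt * short_interval
        if t >= restart_after:
            return t
    # Short retries exhausted; continue with long retries.
    t = short_count * short_interval
    while t < restart_after:
        t += long_interval
    return t


# Queue manager back after 5 s, but the channel only retries every 60 s:
print(first_successful_retry(5, short_count=10, short_interval=60, long_interval=1200))
# Same restart time with a 5 s short retry interval:
print(first_successful_retry(5, short_count=10, short_interval=5, long_interval=1200))
```

Under this model a queue manager that restarts in 5 seconds is still unreachable over that channel for a full minute when the short retry interval is 60 seconds, so the retry interval, not the restart, dominates the outage.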