Following on from my blog post about making sure your file systems are part of a consistency group – so the data is consistent after a restart – the next question is
“how long will it take to fail over?”.
There are two areas you need to look at:
- The time to detect an outage. This can be broken down into the time for the active queue manager to release the lock, and the time for that release to be seen by the standby system. You need to test and measure this time; for example, you may need to adjust your network configuration.
- The time taken to restart the queue manager. There is an excellent blog post on this from Ant at IBM.
- The blog post talks about the rate at which clients can connect to MQ. Yes, MQ can support 10,000 client connections, but if it takes a significant time to reconnect them all, you may want multiple queue managers so that clients can reconnect in parallel.
- Avoid deep queues. In my time at IBM I saw many customers with thousands of messages on a queue that were more than a year old! You need to clean up such queues. Your applications team should have a process, run perhaps once a day, which cleans up all old messages. For example, a getting application instance timed out and terminated, and then the reply arrived – with no application left to consume it.
- During normal running most of the I/O is writes – and these often go into cache, so they are very fast. During recovery the I/O is reads from disk, which may come from rotating disks rather than solid state, so it can be much slower.
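The detection time in the first bullet is something you measure, not look up. A minimal sketch of such a measurement in Python, where `is_active` is a hypothetical probe you supply – in a real test it might shell out to `dspmq` and parse the instance status:

```python
import time

def measure_failover(is_active, poll_interval=0.5, timeout=120.0):
    """Return the seconds elapsed before the probe reports that the
    active instance has gone away. `is_active` is a stand-in for a real
    check such as parsing the output of the dspmq command."""
    start = time.monotonic()
    deadline = start + timeout
    while time.monotonic() < deadline:
        if not is_active():
            # Active instance no longer holds the lock, as seen here.
            return time.monotonic() - start
        time.sleep(poll_interval)
    raise TimeoutError("active instance never released the lock")
```

Run this on the standby side while you pull the plug on the active system, and you get the detection component of your failover time in isolation from the restart component.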
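On the reconnect point, a back-of-the-envelope calculation shows why parallelism matters. The reconnect rate here is an assumed figure – you need to measure it for your own channel and TLS configuration:

```python
def reconnect_time_seconds(clients, reconnects_per_second, queue_managers=1):
    """Rough reconnect-storm estimate: clients are spread evenly across
    queue managers, which reconnect in parallel. The rate per queue
    manager is an assumption to be replaced by a measured value."""
    per_qmgr = clients / queue_managers
    return per_qmgr / reconnects_per_second

# 10,000 clients at an assumed 100 reconnects/second:
# one queue manager takes 100 seconds; four in parallel take 25 seconds.
```

Even if the assumed rate is off by a factor of two, the shape of the answer is the same: splitting the clients across queue managers divides the reconnect storm.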
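The daily clean-up job for deep queues boils down to partitioning messages by age. A sketch of that logic, using plain tuples as a stand-in for what a real browsing application would read from each message descriptor's put timestamp:

```python
from datetime import datetime, timedelta, timezone

def split_expired(messages, max_age, now=None):
    """Partition a browsed queue into (keep, purge) lists by age.
    `messages` is a list of (put_time, payload) tuples - a hypothetical
    stand-in for messages browsed off a real queue."""
    now = now or datetime.now(timezone.utc)
    keep, purge = [], []
    for put_time, payload in messages:
        target = purge if now - put_time > max_age else keep
        target.append((put_time, payload))
    return keep, purge
```

The real job would then destructively get (or move to a side queue) everything in the purge list, so orphaned replies do not sit on the queue for a year.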
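The write-cache versus read-from-disk asymmetry also turns into simple arithmetic. The throughput figure below is an assumption to illustrate the point, not a measured value for any particular disk:

```python
def recovery_read_seconds(log_bytes, read_mb_per_second):
    """Rough estimate of the time to read the recovery log back in.
    Normal-running writes usually land in cache, so this read cost only
    shows up at restart. The sustained read rate is an assumption -
    rotating disks might manage ~100-150 MB/s sequential, much less if
    the reads are scattered."""
    return log_bytes / (read_mb_per_second * 1024 * 1024)

# An assumed 100 MB of log to replay at an assumed 100 MB/s
# sequential read is about a second; the same log read randomly
# from rotating disk could take many times longer.
```

The numbers themselves matter less than the habit of estimating them for your own log sizes and disks before the real outage does it for you.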
One lesson from this is that you need to test the recovery. Test a realistic scenario: take the worst case for the number of connections, at peak throughput, and then pull the plug on the active system.
Another lesson is that you need to do this regularly – for example monthly – as configurations can change.