- An architecture and infrastructure which can provide the resilience
- Applications which are well designed, well coded and robust
- Operations that can detect problems and automatically take actions to remedy the problems.
- 1 second
- 1 minute
- 1 hour
Overview of availability options.
- Queue sharing groups on z/OS give the highest level of availability, with the highest upfront cost (preventing an outage might be worth that cost, and more and more businesses are using QSGs now)
- The data replication features in the appliance and replicated data queue managers (RDQM) are the best ways to achieve high availability of queue managers on distributed platforms. See RDQM for HA, and RDQM for Disaster Recovery.
- Multi-instance queue managers (an active and a standby queue manager) and clusters can be useful too.
- Not cause an outage, and use MQ (and other software in the stack) as efficiently as possible. Many “outages” are caused by badly written applications.
- Deal well with a problem if one occurs.
- Make it easy to diagnose any problems that occur
- Do not make a change to two critical systems at once; leave a day between changes.
- Make sure every change has a back-out procedure which has been tested.
- Monitor the systems, so you can quickly tell if there is abnormal behavior.
- You cannot have messages “paused” for minutes while a server is restarted.
- You can tolerate a “pause” of a few seconds if one queue manager in the QSG goes down, and the channel restarts to a different queue manager in the QSG.
- Your applications are not smart.
- There is a need for serialized message processing.
- The cost of an outage would cover the cost of running z/OS.
- Messages can be spread across any of the servers to provide scalability and availability.
- If you have a requirement for short response time, you need smart applications which can retry sending the message and handle duplicate requests and responses.
- If you can tolerate waiting for in-flight messages whilst a queue manager is restarted, the applications do not need to be so smart.
These mid-range systems can take a minute or so to restart after an outage.
RDQM queue managers are generally better than multi-instance queue managers. See the performance report here.
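As a rough illustration of what a "smart" application means here, the sketch below retries a request under the same request ID, and the server side remembers IDs it has already processed so a retried request does not repeat the work. This is a minimal sketch, not MQ code: `FlakyBackend`, `send_with_retry` and the lost-reply simulation are all hypothetical names invented for this example, and a real application would put and get messages through an MQ client instead.

```python
import time
import uuid


class FlakyBackend:
    """Hypothetical stand-in for a server application; not an MQ API."""

    def __init__(self, replies_to_lose):
        self.replies_to_lose = replies_to_lose  # simulated lost replies
        self.processed = {}  # request_id -> reply, used to spot duplicates
        self.work_done = 0   # how many times the real work was performed

    def handle(self, request_id, payload):
        # Only do the work the first time this request ID is seen.
        if request_id not in self.processed:
            self.work_done += 1
            self.processed[request_id] = payload.upper()
        # Simulate the reply going missing, for example during a restart.
        if self.replies_to_lose > 0:
            self.replies_to_lose -= 1
            raise TimeoutError("no reply received")
        return self.processed[request_id]


def send_with_retry(backend, payload, max_attempts=3, delay_seconds=0.0):
    """Send a request, retrying with the SAME request ID on timeout."""
    request_id = str(uuid.uuid4())
    for attempt in range(1, max_attempts + 1):
        try:
            return backend.handle(request_id, payload), attempt
        except TimeoutError:
            time.sleep(delay_seconds)  # wait before the next attempt
    raise RuntimeError(f"no reply after {max_attempts} attempts")


if __name__ == "__main__":
    backend = FlakyBackend(replies_to_lose=2)
    reply, attempts = send_with_retry(backend, "debit 10")
    print(reply, attempts, backend.work_done)
```

Reusing the request ID across retries is what lets the receiving side recognise a duplicate; generating a fresh ID on each attempt would cause the work to be done more than once.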
You need to consider each business application and evaluate the risk.
- My applications are not smart. They are running on mid-range systems with 2 servers. If I had an unplanned outage which lasted for 5 minutes then, with my typical message volumes, I could have 6000 requests stuck until the queue manager was restarted. My management would not be happy with this.
- If I had an outage on these two servers… ahh that would be a problem. I need more servers.
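The arithmetic behind that kind of risk estimate can be sketched in a few lines. The figures of 6000 stuck requests and a 5 minute outage come from the example above; the function names are hypothetical, and the model assumes a constant arrival rate, which real workloads rarely have.

```python
def implied_rate(stuck_requests, outage_seconds):
    """Average arrival rate (requests per second) implied by a backlog."""
    return stuck_requests / outage_seconds


def backlog(requests_per_second, outage_seconds):
    """Requests that pile up during an outage at a constant arrival rate."""
    return requests_per_second * outage_seconds


if __name__ == "__main__":
    five_minutes = 5 * 60
    rate = implied_rate(6000, five_minutes)
    print(rate)                        # 20.0 requests per second
    print(backlog(rate, five_minutes)) # 6000.0 requests stuck
    print(backlog(rate, 60))           # 1200.0 for a one minute restart
```

Running the same calculation with your own message rate and the restart time you expect gives a first estimate of how many requests each business application could have stuck.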
Many thanks to Gwydion of IBM for his comments and suggestions.