What do I need to make my business applications resilient?

In the same way that a three-legged stool needs three strong legs, a resilient business transaction needs three things:
  1. An architecture and infrastructure which can provide the resilience
  2. Applications which are well designed, well coded and robust
  3. Operations that can detect problems and automatically take actions to remedy the problems.
If any one of these is weak, the whole business transaction is not resilient.
From the infrastructure perspective, the choice between MQ shared queues, MQ on midrange servers, or the MQ appliance comes down to the requirements of the business and the management of risk.
For each business application you need to understand the impact on your business if the application was not available for
  • 1 second
  • 1 minute
  • 1 hour
The cost could be to your reputation, through the rules and regulations of your industry, and financial.  For example, an outage may cost you 1 million dollars a minute in fines and compensation.  Your reputation could suffer if many people report problems on Twitter when your service is not available.
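To make the three durations concrete, here is a minimal sketch that applies the example figure of 1 million dollars a minute to each of them. It assumes the cost grows linearly with the length of the outage, which is a simplification; the function name `outage_cost` is made up for this illustration.

```python
# Example figure from the text above: 1 million dollars per minute.
COST_PER_MINUTE = 1_000_000


def outage_cost(seconds, cost_per_minute=COST_PER_MINUTE):
    """Direct cost of an outage lasting `seconds`, assuming a linear cost."""
    return seconds / 60 * cost_per_minute


for label, seconds in [("1 second", 1), ("1 minute", 60), ("1 hour", 3600)]:
    print(f"{label:>8}: ${outage_cost(seconds):,.0f}")
```

Real costs are rarely linear (fines may only start after a threshold, and reputational damage is hard to price), so treat the numbers as a starting point for the risk discussion.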

Overview of availability options.

  1. Queue sharing groups on z/OS give the highest level of availability, with the highest upfront cost (preventing an outage might be worth that cost, and more and more businesses are using QSGs now)
  2. The data replication features in the appliance and replicated data queue managers (RDQM) are the best ways to achieve high availability of queue managers on distributed platforms. See RDQM for HA, and RDQM for Disaster Recovery.
  3. Multi-instance queue managers (where you have an active and a standby queue manager) and clusters can be useful too.
The applications need to be written to be reliable and resilient, so as to:
  1. Not cause an outage, and use MQ (and other software in the stack) as efficiently as possible.  Many “outages” are caused by badly written applications.
  2. Deal well with a problem if one occurs.
  3. Make it easy to diagnose any problems that occur
You need to automate your operations so errors are quickly picked up and actioned.
What availability do your business applications need?
You need to be able to handle planned outages.  These may occur once a week.  You stop work going down one route, so it flows via a different route.  Once “all the pipes are empty” you can shut down.  This should be transparent to the applications.
You need to be able to handle unplanned outages where messages may be in flight in the queue manager and network.  These may occur once a year.  If there is a problem, messages in flight could be stuck on a queue manager until the queue manager is restarted.  Once a problem is detected, new messages should be able to flow via an alternative route.  In this case a few seconds' or minutes' worth of messages could be unavailable.
You can use clustering to automatically route traffic over available channels while a problem in one queue manager is being resolved.
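The idea of routing around a failed queue manager can be sketched in a few lines. This is a toy model only, not MQ's cluster workload algorithm: the `Router` class and the queue manager names `QMA` and `QMB` are invented for the illustration, and it simply round-robins over whichever routes are currently marked available.

```python
from itertools import cycle


class Router:
    """Toy round-robin router that skips routes marked unavailable."""

    def __init__(self, routes):
        self.available = {name: True for name in routes}
        self._order = cycle(routes)

    def mark(self, name, up):
        """Record that a route has gone down (up=False) or come back (up=True)."""
        self.available[name] = up

    def next_route(self):
        # Look at each route at most once per call.
        for _ in range(len(self.available)):
            name = next(self._order)
            if self.available[name]:
                return name
        raise RuntimeError("no route available")


router = Router(["QMA", "QMB"])
print([router.next_route() for _ in range(4)])  # alternates between QMA and QMB
router.mark("QMA", False)                       # QMA has a problem
print([router.next_route() for _ in range(3)])  # all traffic now goes to QMB
```

Note that this only covers new messages. Messages already sitting on the failed queue manager stay there until it is restarted, as described above.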
Do you have a requirement for serialized transactions where the order of execution must be maintained?  For example, when trading stocks and shares, the price of the second request depends on the trade made by the first request.  If so, this means you can only have one back end server, no parallelism, and one route to the back end.  This does not provide a robust solution.
How smart are your applications?
If your application gets no reply within 1 second, the application could try resending the request, and it may take a different route through the network, and succeed.  For inquiry transactions, a duplicate request should have little impact.  For update requests, the applications need logic to handle a possible duplicate request, so that the back end detects that the request has already been processed and sends back a negative response.
The business application may need a program to clear up possible unprocessed or duplicate responses and take compensating action.
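The retry and duplicate-detection logic can be sketched as follows. This is a self-contained simulation, not MQ code: `Backend`, `send_with_retry` and the two route functions are hypothetical names, a route returning `None` stands in for "no reply within 1 second", and the request ID plays the role of whatever unique identifier your application puts in each request.

```python
import uuid


class Backend:
    """Toy update server that remembers which request IDs it has applied."""

    def __init__(self):
        self.balance = 0
        self.processed = set()

    def handle_update(self, request_id, amount):
        if request_id in self.processed:
            # Duplicate: do not apply the update again, send a negative response.
            return {"id": request_id, "ok": False, "reason": "duplicate"}
        self.processed.add(request_id)
        self.balance += amount
        return {"id": request_id, "ok": True}


def send_with_retry(routes, request_id, amount, max_attempts=3):
    """Resend the same request over the routes in turn until one replies."""
    for attempt in range(max_attempts):
        route = routes[attempt % len(routes)]
        reply = route(request_id, amount)  # None simulates a timeout
        if reply is not None:
            return reply
    raise TimeoutError(f"no reply after {max_attempts} attempts")


backend = Backend()


def lossy_route(request_id, amount):
    backend.handle_update(request_id, amount)  # request arrives, reply is lost
    return None


def good_route(request_id, amount):
    return backend.handle_update(request_id, amount)


reply = send_with_retry([lossy_route, good_route], str(uuid.uuid4()), 100)
print(reply["ok"], reply["reason"], backend.balance)  # False duplicate 100
```

The update is applied exactly once even though the request was sent twice. The requester has to treat the "duplicate" response as "already done" rather than as a failure, which is part of what makes the application smart.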
Having smart applications which are resilient means the infrastructure does not need to be so smart.
Operational maturity
For the best reliability and availability you need a mature operations environment.
The infrastructure is usually reliable.  “Outages” usually occur because of human intervention, a change, or a bad application.  For example, an application can continually retry a failing connection, and fill up the MQ error logs.
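One common way to stop an application hammering a failing connection is to wait longer between each retry and to give up after a fixed number of attempts. A minimal sketch, assuming a caller-supplied `connect` function that raises `ConnectionError` on failure; the function name, parameters and default values here are illustrative, not taken from any MQ client library.

```python
import random
import time


def connect_with_backoff(connect, max_attempts=5, base_delay=1.0,
                         max_delay=60.0, sleep=time.sleep):
    """Call `connect` until it succeeds, doubling the wait after each failure."""
    for attempt in range(max_attempts):
        try:
            return connect()
        except ConnectionError:
            if attempt == max_attempts - 1:
                raise  # give up and let the caller (or operations) handle it
            delay = min(max_delay, base_delay * 2 ** attempt)
            # A little random jitter stops many clients retrying in lockstep.
            sleep(delay + random.uniform(0, delay / 10))
```

With the defaults, a client makes at most 5 attempts over roughly 15 seconds instead of retrying in a tight loop, so far fewer error messages are written for the same outage.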
Examples of operational maturity include
  1. Do not change two critical systems at once; leave a day between changes.
  2. Make sure every change has a back-out procedure which has been tested.
  3. Monitor the systems, so you can quickly tell if there is abnormal behavior.
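As an illustration of the monitoring point, here is a small sketch that flags a sample as abnormal when it is far from the recent average. It assumes you already collect some numeric metric (queue depth is used as an example) at regular intervals; the function name `is_abnormal` and the threshold of three standard deviations are arbitrary choices for the sketch.

```python
from statistics import mean, stdev


def is_abnormal(history, current, threshold=3.0):
    """True if `current` is more than `threshold` standard deviations from the mean."""
    if len(history) < 2:
        return False  # not enough data to judge
    mu, sigma = mean(history), stdev(history)
    if sigma == 0:
        return current != mu
    return abs(current - mu) > threshold * sigma


recent_depths = [10, 12, 11, 9, 10, 11]   # hypothetical queue depth samples
print(is_abnormal(recent_depths, 11))     # False
print(is_abnormal(recent_depths, 500))    # True
```

A check like this is only useful if something acts on it, for example by raising an alert or triggering an automated action.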
It can take several minutes to detect a problem, shut down, and restart a queue manager (perhaps in a different place).
If you have 100 Linux servers to support, it takes a lot of work to make changes on all of these servers (from making a configuration change to applying fixes).  It may be less work on z/OS.
You need to make sure that the infrastructure has sufficient capacity, and that a queue manager is not short of CPU and does not suffer long disk response times.
Below are several configurations and considerations:
Shared queue across multiple machines, across sites
A message in a Queue Sharing Group can be processed by any queue manager in a QSG,  providing high availability.
Good for business transactions where
  1. You cannot have messages “paused” for minutes while a server is restarted.
  2. You can tolerate a “pause” of a few seconds if one QM in the QSG goes down, and the channel restarts to a different queue manager in the QSG.
  3. Your applications are not smart.
  4. There is a need for serialized message processing.
  5. The cost of an outage would cover the cost of running z/OS.
Multiple mid-range machines across sites (RDQM), or use of the MQ appliance
For business transactions where
  1. Messages can be spread across any of the servers to provide scalability and availability.
  2. If you have a requirement for short response time, you need smart applications which can retry sending the message and handle duplicate requests and responses.
  3. If you can tolerate waiting for in-flight messages whilst a queue manager is restarted, the applications do not need to be so smart.

These mid-range systems can take a minute or so to restart after an outage.

RDQM queue managers are generally better than multi-instance queue managers. See the performance report here.

Single server
This is a single point of failure, and not suitable for production work.
Your enterprise may have combinations of the above patterns.

You need to consider each business application and evaluate the risk.

For example

  • My applications are not smart. They are running on mid-range with 2 servers.  If I had an unplanned outage which lasted for 5 minutes then, with my typical message volumes, I could have 6000 requests stuck until the queue manager was restarted.  My management would not be happy with this.
  • If I had an outage on these two servers…  ahh that would be a problem.  I need more servers.
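The arithmetic behind the first bullet is easy to reproduce for your own volumes. In the sketch below the message rate is derived from the example (6000 requests in 5 minutes), and it is assumed that every request arriving during the outage is held until the restart; `stuck_requests` is a name made up for this illustration.

```python
def stuck_requests(requests_per_second, outage_seconds):
    """Requests held up if every request arriving during the outage is stuck."""
    return requests_per_second * outage_seconds


# Rate implied by the example: 6000 requests in 5 minutes = 20 per second.
rate = 6000 / (5 * 60)

for seconds in (5, 60, 300):
    print(f"{seconds:>3}s outage: {stuck_requests(rate, seconds):,.0f} requests stuck")
```

Running the same calculation with your own peak rate and your measured restart time gives a number you can take to the business when evaluating the risk.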

Many thanks to Gwydion of IBM for his comments and suggestions.
