There is an article in The Register about an outage at the Tokyo Stock Exchange. One of the problems was that they did not have a process for restarting the environment. The impact of restarting a system is often overlooked, and in the panic to "get it started as quickly as possible", things can go wrong. The fire brigade increases the pressure in a fire hose slowly, so the fire crew is not knocked down by the sudden flow.
TCP/IP is a good example because it has a "slow start" mechanism. A new connection begins by sending only a small amount of data before waiting for an acknowledgement. Once the connection is working well, TCP can send more data before waiting for the acknowledgement, which boosts the throughput. If the back-end is slow to process the data, TCP slows down the traffic, and then increases the throughput again if the connection can handle it. If the connection stops and restarts, the rate starts slowly and builds up, rather than resuming at the rate used just before the outage.
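The ramp-up described above can be illustrated with a toy model. This is only a sketch: it counts whole segments per round trip, ignores loss and receiver flow control, and the names `congestion_window_trace`, `ssthresh` and `restart_at` are chosen here for illustration rather than taken from any TCP implementation.

```python
def congestion_window_trace(rounds, ssthresh=16, initial_cwnd=1, restart_at=None):
    """Return the send window (in segments) used in each round trip.

    Simplified model: the window doubles every round trip until it reaches
    ssthresh (slow start), then grows by one segment per round trip.
    If restart_at is given, the connection restarts at that round and the
    window drops back to initial_cwnd instead of keeping its old size.
    """
    cwnd = initial_cwnd
    trace = []
    for round_number in range(rounds):
        if round_number == restart_at:
            cwnd = initial_cwnd               # restart: begin slowly again
        trace.append(cwnd)
        if cwnd < ssthresh:
            cwnd = min(cwnd * 2, ssthresh)    # slow start: exponential growth
        else:
            cwnd += 1                         # steady state: linear growth
    return trace


print(congestion_window_trace(8))                # [1, 2, 4, 8, 16, 17, 18, 19]
print(congestion_window_trace(8, restart_at=5))  # [1, 2, 4, 8, 16, 1, 2, 4]
```

The second trace shows the point made above: after the restart the window goes back to 1 and builds up again, rather than carrying on from 16.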
You cannot expect WAS/CICS/DB2/MQ/IMS to restart at maximum speed; each one has to work up to it. Transactions may have to warm up. There can be many reasons:
- Data may need to be read from page sets into buffers, for example reading hot DB2 data into memory.
- Java code needs to warm up to become more efficient (JITed).
- The systems need to establish a working set, for example making a buffer pool larger.
- Establishing connections may have some serialisation delays.
Restarting faster than a system can cope with can cause a domino effect. A transaction server is restarted and the fire hose of data is turned on. The transaction server is still warming up, and cannot cope with the volume of requests. Work for this system is then routed to another transaction server, which could handle the workload if the volume increased gradually. If it gets this additional work all at once, this instance slows down, and the work is routed to yet another transaction server, and so on.
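One way to avoid turning the fire hose on at full pressure is to cap the rate of work sent to a restarted server and raise the cap over time. The sketch below assumes a simple linear ramp; the function name `allowed_rate`, the five-minute ramp and the 10% starting fraction are illustrative values, not recommendations from any product.

```python
def allowed_rate(seconds_since_restart, full_rate, ramp_seconds=300.0, start_fraction=0.1):
    """Return the request rate to admit to a server that has just restarted.

    The rate starts at start_fraction of full_rate and rises linearly
    until it reaches full_rate after ramp_seconds.
    """
    if ramp_seconds <= 0 or seconds_since_restart >= ramp_seconds:
        return float(full_rate)               # warm-up is over: no cap
    progress = max(seconds_since_restart, 0.0) / ramp_seconds
    fraction = start_fraction + (1.0 - start_fraction) * progress
    return full_rate * fraction


# Cap for a server that normally handles 1000 requests a second.
for elapsed in (0, 60, 150, 300):
    print(elapsed, round(allowed_rate(elapsed, 1000)))
```

A router or workload balancer could use a value like this as the limit for the restarted instance, and keep the remainder on the servers that are already warm.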
MQ can be seen as the bad guy here. When you restart MQ, it can go to fire hose mode immediately. You should start the output channels first to start draining messages, then gradually start the input channels. If you start the input channels before the output channels, you may get queues and page sets filling up, before the output channels can process the messages.
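The ordering above can be written down as a start-up plan. This sketch only builds a list of MQSC `START CHANNEL` commands with a delay for each one; it does not talk to a queue manager. The channel names, the `channel_start_plan` helper and the 60-second gap are made up for illustration, and it assumes the inbound channels were left stopped and can be re-enabled with `START CHANNEL`.

```python
def channel_start_plan(outbound, inbound, gap_seconds=60):
    """Return (delay_seconds, mqsc_command) pairs in the order to issue them.

    All outbound channels start immediately so queues begin to drain.
    Inbound channels follow one at a time, gap_seconds apart.
    """
    plan = [(0, f"START CHANNEL({name})") for name in outbound]
    for position, name in enumerate(inbound, start=1):
        plan.append((position * gap_seconds, f"START CHANNEL({name})"))
    return plan


# Hypothetical channel names.
for delay, command in channel_start_plan(["QM1.TO.QM2"], ["APP.IN.1", "APP.IN.2"]):
    print(f"+{delay:>4}s  {command}")
```

In practice the gap would be driven by watching queue depths rather than a fixed timer, but the order is the important part: drain first, then let work in.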
If you have a policy that all client connections must disconnect and reconnect after a random time between 15 minutes and 45 minutes, this should help spread the load, and gradually you should get a balanced environment.
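A client could pick its next reconnect time like this. It is a minimal sketch: `next_reconnect_delay` is a hypothetical helper, and how the client actually disconnects and reconnects is left out.

```python
import random


def next_reconnect_delay(min_minutes=15, max_minutes=45, rng=random):
    """Return how many seconds to stay connected before reconnecting.

    Each call draws a fresh value, uniformly spread over the interval,
    so clients that connected together drift apart over time.
    """
    return rng.uniform(min_minutes * 60, max_minutes * 60)


delay = next_reconnect_delay()
print(f"reconnect in {delay / 60:.1f} minutes")
```

Because every client draws its own delay each time, a burst of clients that all reconnected at the same moment after a restart will be spread across a 30-minute window by their next reconnect.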