Comparing two totally different concepts that share a similarity can give you insight into both.
At first glance there is little in common between “man overboard” at sea and enterprise computers, but the comparison offers insight into testing and preparation.
When is the best time to learn man overboard drill?
While you are learning, you want a nice calm day, so that any mistakes you make do no damage. You need to practice until you can recover the dummy most times.
When is the best time to practice man overboard drill?
You are more likely to go overboard when the seas are rough than when the weather is calm, so you need to practice in rough conditions. When the weather is rough it is hard to see the person in the water: a person's head is 1 ft high, but the waves can be 6 ft from peak to trough. It is also harder to position the boat. So the best time to practice man overboard drill is when the weather is rough. Do not try it in a gale, as you are likely to damage the boat or have someone really fall overboard.
When my father was in the navy, he told me about “exercises” where the ships would be under attack from planes (this was before missiles) while a submarine or two tried to attack them. In one exercise, the sea was a bit rough, and the captain of the ship sat back and let his senior officer (the First Lieutenant) run the ship. Things were going well until the captain arranged for a “man overboard” to happen.
The FL now had to decide whether to stop the ship and pick up the man overboard (becoming a sitting duck, and so being destroyed); to leave the man in the water to die; or to take people away from defending the ship and launch a boat or helicopter to rescue him. A complex situation had suddenly become more complex, but they had trained for this, had tested procedures, and people knew what to do.
How does this apply to enterprise systems?
You need to decide on your “man overboard” scenarios, for example: this server is shut down, or that network has problems. You need to practice resolving the problems, capturing information to help you identify the real cause of the problem, and the steps needed to recover. Once you have an automated or fully documented procedure, you test it out in production; this is the “testing man overboard when the sea is rough”. This is where you find the holes in your processes, and find that production is configured differently to test, etc.
Then there is the “man overboard when your ship is being attacked, and the decision to save the man or save the ship” scenario. You often get multiple problems at once, and you need to decide on the priority of the actions. “Messages building up on a queue” could be caused by a network problem, in which case it is more important to fix the network than to “fix MQ”. You should go through scenarios to help decide what to do. It is better to create action plans in advance, and document them, than to try to come up with a plan during an emergency. You want to avoid “if we do this then that will happen. ahhh… not a good idea”.
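One way to rehearse the “fix the network, not MQ” decision is to write down what depends on what, and then ask which of the current alerts has no alerting dependency underneath it. The sketch below is only an illustration: the component names and the dependency map are made up for this example and are not taken from any real monitoring tool.

```python
# Hypothetical dependency map: each component lists what it depends on.
# The names are illustrative only; substitute your own components.
DEPENDS_ON = {
    "queue_depth": ["queue_manager", "network"],
    "queue_manager": ["server"],
    "network": [],
    "server": [],
}


def upstream(component, depends_on, seen=None):
    """Return every component that `component` depends on, directly or indirectly."""
    seen = set() if seen is None else seen
    for dep in depends_on.get(component, []):
        if dep not in seen:
            seen.add(dep)
            upstream(dep, depends_on, seen)
    return seen


def root_causes(alerting, depends_on=DEPENDS_ON):
    """Return the alerting components that have no alerting dependency.

    These are the ones to look at first; the other alerts may only be symptoms.
    """
    alerting = set(alerting)
    return sorted(
        c for c in alerting if not (upstream(c, depends_on) & alerting)
    )


if __name__ == "__main__":
    # Messages are building up AND the network is reporting problems.
    print(root_causes(["queue_depth", "network"]))
```

With both “queue_depth” and “network” alerting, only “network” is returned, which matches the priority described above. If the queue depth alert is the only one, it is returned on its own and you start with the queue.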
Think things through
I did a sailing course in the Mediterranean, where the sea was warm, and people were swimming in the sea. We had spent the morning doing Man Overboard, where we had to retrieve a buoy with a pole sticking out of the top. This was easy to retrieve: you lean over the side and just pick it up. Well done, tick the box, you passed. We anchored up, and were having lunch with a nice cold glass or two of wine, when I asked, “So how do you get someone back into the boat?” (they do not teach you this). I then “accidentally” fell overboard.
- They threw me a rope – but I said my hands were too cold – I could not grip it.
- They then made a loop in the end and threw it to me – the loop was too small to go over my head and life jacket.
- They then made a big loop, which I got over my head, and they dragged me to the yacht, but they were unable to lift me out because I was too heavy and my clothes were full of water.
- The tutor then suggested using some of the lifting equipment from the boat, so they tied the rope over the end of the boom, and used a winch to winch me up – which worked, but they scraped me up the side of the boat, so I had a bleeding arm.
I said afterwards (as they wiped my blood off the deck): look at the problems you had when we were at anchor. Think what it would be like in a 6 ft sea!
Think how you will recover after your outage.
For example, there may have been persistent messages on the queue manager when it went down. The application retried and was successful because the traffic went to an alternative queue manager. You now have possible duplicate requests, or orphaned replies (because the getting application reconnected to a different queue manager).
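One common way to make duplicate requests harmless is to have the sender put a unique request id on every request, and have the processing side remember which ids it has already handled. The sketch below shows the idea in plain Python. It is not an MQ API: the class name, the `request_id` field and the in-memory dictionary are assumptions for this example, and a real system would need a store that survives a restart.

```python
class DeduplicatingHandler:
    """Wrap a handler so that a repeated request id is processed only once.

    Assumption: the sender puts the same request id on the original request
    and on any retry. The in-memory dict stands in for a durable store.
    """

    def __init__(self, handler):
        self._handler = handler
        self._results = {}  # request_id -> result of the first processing

    def handle(self, request_id, payload):
        if request_id in self._results:
            # Duplicate: return the earlier result without doing the work again.
            return self._results[request_id]
        result = self._handler(payload)
        self._results[request_id] = result
        return result


if __name__ == "__main__":
    processed = []

    def debit_account(payload):
        processed.append(payload)
        return "debited " + payload

    handler = DeduplicatingHandler(debit_account)
    handler.handle("req-1", "10 GBP")  # original request
    handler.handle("req-1", "10 GBP")  # retry that went to another queue manager
    print(len(processed))  # the debit ran once
```

This only covers the duplicate request side. Orphaned replies need their own plan, for example deciding in advance who is allowed to clear them off the reply queue.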
This server went down, and all of the traffic went to that server. Now this server has come back – how do you get traffic to balance over the queue managers?
You have a huge backlog of messages. What should you do: just purge them, or let them be processed? (This is where you realise that using message expiry on inquiry messages would be a good technique to use.)
You need to think things through. These exercises are tedious and take a lot of time, but you have no time in a crisis!
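To make the purge-or-process decision concrete, here is a small model of what expiry buys you. It is a sketch only: the `Message` class and its fields are invented for this example and are not the MQ message descriptor, and in practice the messaging system normally discards expired messages for you. Inquiries carry an expiry time; updates do not, so they are always kept.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Message:
    """Hypothetical message model, for illustration only."""
    kind: str                                # e.g. "inquiry" or "update"
    put_time: float                          # seconds, when the message was put
    expiry_seconds: Optional[float] = None   # None means the message never expires


def triage_backlog(messages, now):
    """Split a backlog into (messages still worth processing, expired messages)."""
    keep, drop = [], []
    for m in messages:
        expired = (
            m.expiry_seconds is not None
            and now - m.put_time > m.expiry_seconds
        )
        (drop if expired else keep).append(m)
    return keep, drop


if __name__ == "__main__":
    backlog = [
        Message("inquiry", put_time=0.0, expiry_seconds=30.0),
        Message("update", put_time=0.0),  # no expiry: must be processed
        Message("inquiry", put_time=590.0, expiry_seconds=30.0),
    ]
    # The outage lasted ten minutes.
    keep, drop = triage_backlog(backlog, now=600.0)
    print([m.kind for m in keep], [m.kind for m in drop])
```

After a ten minute outage the old inquiry is dropped, because nobody is waiting for the answer any more, while the update and the recent inquiry are kept.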