I was talking to someone about how major accidents occur, and found it very interesting. I told a friend about it, and he said that he had found the same when he was writing critical software for an aircraft control system.
How major accidents occur.
The first guy told me about an incident which started off with the smallest problem.
- On a ship, the light at the top of the stairs stopped working, and it was not reported (even though many people saw the light was not working)
- Someone carrying some oil spilled some at the top of the steps but, because the light was not working, failed to spot this.
- The ship was coming into harbour
- Someone else came along to go down the ladder, slipped on the oil, and bumped all the way down the ladder (10 steps) on his coccyx (the bone at the bottom of his spine) and cracked it – ouch!
- There were not many crew on the boat, and everyone rushed to help, including a first aider who should have been helping the ship to dock.
- The ship crunched into the harbour wall. No major damage done to the ship – but an embarrassing front page photo in the local newspaper.
One small incident led to a chain of incidents of increasing severity.
Critical software for an airplane control system.
My software developer friend said he was working on software which controlled an airplane. The tests were about 99% successful, but there was one test which consistently failed (in the simulator). The symptoms were that the software would sometimes freeze for about 1 second and then recover. A lot can happen in 1 second.
They tracked it down to one line of code. In a nested set of ‘if’ statements, an ‘else’ statement was attached to the wrong ‘if’ statement due to bad indentation! This meant a field was not initialized and had garbage in it. This in turn caused a loop to be iterated over 2 million times, instead of twice.
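To make the bug above concrete, here is a minimal sketch of how it could look. I am assuming a C-like language, where an ‘else’ always pairs with the nearest unmatched ‘if’ and the indentation only misleads the reader. This is not my friend's actual code: the names passes_buggy, passes_fixed, count_iterations, sensor_ok and in_range are all made up for illustration, and GARBAGE is a fixed stand-in for whatever happened to be in the uninitialized field (reading a truly uninitialized variable is undefined behaviour, so the sketch uses a known bad value to stay predictable).

```c
#include <assert.h>

/* Hypothetical stand-in for the junk an uninitialized field might hold. */
#define GARBAGE 2000000L

/* Buggy version: the indentation suggests the else belongs to
   `if (sensor_ok)`, but the compiler pairs it with `if (in_range)`.
   When sensor_ok is 0, nothing is assigned and the junk value survives. */
long passes_buggy(int sensor_ok, int in_range)
{
    long passes = GARBAGE;    /* pretend this was never initialized */

    if (sensor_ok)
        if (in_range)
            passes = 1;
    else                      /* looks attached to the outer if, but is not */
        passes = 2;

    return passes;
}

/* Fixed version: braces make the pairing explicit, and the variable
   gets a defined default as a second line of defence. */
long passes_fixed(int sensor_ok, int in_range)
{
    long passes = 0;

    if (sensor_ok) {
        if (in_range) {
            passes = 1;
        }
    } else {
        passes = 2;
    }

    return passes;
}

/* Stand-in for the loop in the story: runs `passes` times and
   reports how many iterations were actually executed. */
long count_iterations(long passes)
{
    long done = 0;

    assert(passes >= 0);      /* a negative bound would just skip the loop */
    for (long i = 0; i < passes; i++) {
        done++;
    }
    return done;
}
```

With the sensor reported as not OK, count_iterations(passes_fixed(0, 0)) runs the loop twice, while count_iterations(passes_buggy(0, 0)) runs it two million times. Compilers such as GCC and Clang can warn about an ambiguous ‘else’ like this one when warnings are switched on, which is the kind of “little problem” a lint-style tool reports.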
Once they had found the problem, they then used static analysis tools (like lint) and found they had lots of “little problems”. Who would have thought the number of spaces in a line would cause a problem? Fortunately there were no real “disasters” from these little problems, and they fixed all of the little problems.
I thought it interesting how the same sort of process problems occur in totally different fields, and how important it is to fix these small niggles.
This reminds me of the time when adding a comment to a program caused it to fail to compile. Adding one more line made the file bigger, so it no longer fitted in memory and an intermediate file was used instead. There was a bug in the code that handled the intermediate file. This was a case where “I just added a comment” actually did cause problems.
I attended a talk by one of the US astronauts who flew on the space shuttle. He said they made a point of regularly visiting the software development teams so the teams got to know the people whose lives depended on their code. If there was a problem in the software these nice people (who brought coffee and doughnuts) might die. This tended to focus the minds of the developers.