When I left university, one of my friends went into the wine industry. We met up a few years later and he said that his nose was more useful than his PhD in Chemistry. Although they had moved towards gas chromatography (which gave you a profile of all of the chemicals in the wine), this was good at telling you if there were bad chemicals in the brew, but not if it would be a good vintage. For that they needed the human nose.
My father would tune his motorbike by listening to it. He said the bike would tell you when you had tuned it just right, and got it “in the sweet spot”. These days you plug the computer in and the computer tells you what to do. A friend of mine had an expensive part replaced, because the computer said so. A week later he took the car back to the garage because the computer “knew” there was a problem with the same expensive part, and said it should be replaced. This time the more experienced mechanic cleaned a sensor and solved the problem. Computers do not always know best.
When I first started in the performance role, the RMF performance reports were bewildering. These reports were lots of numbers in a small font (so you needed your glasses). Worse than that, they had several reports on the same page, and to a novice there was a blur of numbers. Someone then helped me with comments like “you can ignore all the data, except for this number 3 inches in and 4 inches down. That should be less than 95%. On this other page, check this column is zeros”, and so on. As you gain more experience in performance, you get to know the “smell” of the data. It just needs a quick sniff test to check things are OK. If not, then it takes more time to dig into the data.
There are many tools for processing the SMF data and printing out reports full of numbers, but they add little value. “The disconnect time is 140 microseconds”: is this good, or is it bad? Is it better than a disconnect time of 100? If the tools were smart enough to say “The disconnect time is 140 microseconds. This value should typically be zero” then this would give you useful information instead of just data.
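To make the difference between data and information concrete, here is a minimal sketch of a report line that carries its own guidance. Everything in it is an assumption for illustration: the `Expectation` class, the `EXPECTATIONS` table, the `describe` function and the threshold are made up, and they do not come from RMF or any real SMF tool.

```python
from dataclasses import dataclass


@dataclass
class Expectation:
    """What a 'good' value looks like for one metric (illustrative only)."""
    unit: str
    typical: str        # human-readable description of the normal value
    warn_above: float   # hypothetical threshold, not an official guideline


# Hypothetical rule table: the metric name and threshold are assumptions.
EXPECTATIONS = {
    "disconnect_time": Expectation("microseconds", "zero", 0.0),
}


def describe(metric: str, value: float) -> str:
    """Return the raw number plus a sentence saying whether it is a concern."""
    label = metric.replace("_", " ")
    rule = EXPECTATIONS.get(metric)
    if rule is None:
        # No rule for this metric, so all we can do is report the raw data.
        return f"The {label} is {value:g}. No guidance available."
    text = f"The {label} is {value:g} {rule.unit}."
    if value > rule.warn_above:
        return f"{text} This value should typically be {rule.typical}: investigate."
    return f"{text} This is within the expected range."


print(describe("disconnect_time", 140))
```

The hard part is not the code, it is filling in the rule table, which is where the experience comes in.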
If you think about how they could control the Starship Enterprise from one operations desk, they clearly did not have all of the raw data displayed. The system must have been smart enough to report “The impulse engines are running hot: colour red, suggest you reduce power”, because that is what Scotty the engineer kept saying.
If tools produced smart reports of the problems rather than just displaying data, it would reduce the skill needed to interpret reports, and the need for the performance analyst’s glasses. Producing these smart reports is difficult and needs experience to know what is useful, and what is just confusing.
Sometimes it feels like the statistics produced have not been thought through. One example I experienced recently: there is a counter of the number of reads plus writes that went to disk rather than cache. For reads, there should be no reads from disk. For writes, it may be good to write directly to disk, and not flood the cache. Instead of one number for reads, and one number for writes, there is one number for both. So if the counter reports 10 disk accesses, which could be 10 disk reads, 10 disk writes, or a mix, is this good or bad? I don’t know. This is not a head-banging problem, as you usually have only reads or only writes, but not both. I just had to use my nose: 10 million would be a problem, just 10 is not a problem, and I’ll still need my glasses.
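As a rough sketch of why separate counters would help, compare what a tool can say with two numbers against what it can say with one. The function names, the messages and the one million cut-off are all assumptions for illustration. The only guidance above is that 10 is fine and 10 million is not.

```python
def assess_split(disk_reads: int, disk_writes: int) -> list[str]:
    """With separate counters, each number can have its own rule."""
    findings = []
    if disk_reads > 0:
        findings.append(f"{disk_reads} reads went to disk instead of cache: investigate.")
    if disk_writes > 0:
        findings.append(f"{disk_writes} writes went directly to disk: may be intentional.")
    return findings or ["No disk accesses."]


def assess_combined(disk_accesses: int, worry_above: int = 1_000_000) -> str:
    """With one combined counter, all that is left is a sniff test on size.

    worry_above is a hypothetical cut-off chosen for this sketch.
    """
    if disk_accesses > worry_above:
        return f"{disk_accesses} disk accesses: probably a problem."
    return f"{disk_accesses} disk accesses: probably not a problem."


print(assess_split(10, 0))
print(assess_combined(10))
```

The combined version can only ever say “probably”, which is about as much as my nose could tell me.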