Average is not good enough

I was talking to someone in the MQ distributed change team about averages and how misleading they can be.
In Winchester, England, the average number of arms that people have is 1.999. Wow, this is amazing! Is this caused by years of inbreeding, so that their left arm is 1 cm shorter than their right arm? No, it is because Winchester has some people with only one arm. If you add up the number of arms and divide by the number of people, you get 1.999!
The average number of children in a family is 2.5 – but you do not see many half children being pushed around in a pram.
These show averages can be misleading.
Moving on to what the change team guy was saying: from an MQ log perspective, he could see log response times of between 20 and 30 milliseconds, with an average of 25 milliseconds. From the Linux iostat command, the average was 10 ms. From the SAN perspective, it was an average of 2 ms. Who was right?
They all were!
From a Linux perspective, the MQ requests were about 10% of the total requests. The other 90% had a response time of 5 ms. On average (total time doing IO / number of IOs), this was 10 ms.
From the SAN perspective, there were many systems connected to the SAN, and they got a 1 ms response time. So the average (sum of elapsed / count) came out as 2 ms!
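To make the effect concrete, here is a minimal sketch in Python. The request counts and times are made-up round numbers loosely based on the figures above (they are not measured data, and they do not reproduce the quoted 10 ms exactly). It shows how a whole-disk average can look healthy while one small group of requests is slow.

```python
from statistics import mean, median

# Hypothetical mix of IO response times in milliseconds (illustrative only):
# 10% of requests are MQ log writes at 25 ms, 90% are other IO at 5 ms.
mq_times = [25.0] * 100
other_times = [5.0] * 900
all_times = mq_times + other_times

overall_mean = mean(all_times)      # what a whole-disk average would report
mq_mean = mean(mq_times)            # what MQ itself experiences
overall_median = median(all_times)  # the "typical" request on this disk

print(f"overall mean   : {overall_mean} ms")
print(f"MQ-only mean   : {mq_mean} ms")
print(f"overall median : {overall_median} ms")
```

Each number is correct for the population it describes; they just describe different populations.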
What is a better measure? This is tricky, because the average is easy to calculate. The median (sort the times and pick the middle one) is difficult to calculate, because you need to remember the times of all IOs. The maximum does not help. With MQ on z/OS, we capture the longest IO time, but I did not find this useful.
MQ uses calculations like long_average_value = (1023 * previous long_average_value + current) / 1024. It also uses short_average_value = (63 * previous short_average_value + current) / 64.
These are both easy to calculate and give a long term view and a short term view, so you can see a trend. I don't know how accurate or usable these are.
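As a rough illustration of how the two running averages behave, here is a small Python sketch of the formulas quoted above. The function name, the starting value and the sample data are my own assumptions for the example; this is not MQ's code.

```python
def update_average(previous, current, weight):
    """One step of the running average: ((weight - 1) * previous + current) / weight."""
    return ((weight - 1) * previous + current) / weight

# Hypothetical workload: 2000 IOs at 5 ms, then 200 IOs at 25 ms.
samples = [5.0] * 2000 + [25.0] * 200

long_avg = 5.0    # assumed starting value
short_avg = 5.0   # assumed starting value
for t in samples:
    long_avg = update_average(long_avg, t, 1024)   # long term view
    short_avg = update_average(short_avg, t, 64)   # short term view

print(f"long term average : {long_avg:.2f} ms")
print(f"short term average: {short_avg:.2f} ms")
```

After the jump to 25 ms, the short term value moves towards the new level much faster than the long term value, and the gap between the two is what shows the trend. Only the previous value has to be remembered, which is why this is so cheap to compute.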
Perhaps the best is to have buckets: count the number in the ranges 0 to 1 ms, 1 to 2, 2 to 5, 5 to 10, 10 to 20, and over 20. However, I don't think I’ll be able to persuade people to change their code.
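The bucket idea is cheap to implement. A minimal Python sketch follows, using the bucket boundaries listed above. The names are hypothetical, and treating each upper bound as inclusive (so exactly 1 ms counts in "0 to 1") is my own choice.

```python
from bisect import bisect_left

# Upper bounds of each bucket in milliseconds; anything larger is "over 20".
BUCKET_UPPER_BOUNDS_MS = [1, 2, 5, 10, 20]
BUCKET_LABELS = ["0 to 1", "1 to 2", "2 to 5", "5 to 10", "10 to 20", "over 20"]

def bucket_counts(times_ms):
    """Count how many response times fall into each bucket."""
    counts = [0] * (len(BUCKET_UPPER_BOUNDS_MS) + 1)
    for t in times_ms:
        # bisect_left finds the first bucket whose upper bound is >= t.
        counts[bisect_left(BUCKET_UPPER_BOUNDS_MS, t)] += 1
    return counts

# Made-up response times, for illustration only.
example_times = [0.5, 1, 1.5, 4, 7, 15, 20, 25, 30]
example_counts = bucket_counts(example_times)
for label, count in zip(BUCKET_LABELS, example_counts):
    print(f"{label:>8} ms: {count}")
```

Unlike the median, this needs only one counter per bucket, not the time of every IO, and it still shows whether there is a slow tail hiding behind a good-looking average.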
My own experience of being confused by averages is when I had a big LPAR with 64 engines and I was testing the impact of putting just one persistent message (this was MQ 2.1). The MVS data said I was short of CPU, but the box was only 1% busy. How can I be short of CPU with just one putter and 100 getters and 64 engines?
When I put the message, it woke up all 100 of the getters. 63 of these were able to run. The other 37 had to wait for an engine to become free. This showed we were short of CPU. The getting application put the reply and issued a commit. During the commit, no applications were busy, so the CPU usage dropped to 0! The average (total CPU used / time) showed we were 1% busy, yet short of CPU!
It is hard to say what would be a better metric – as there was a spike of work for 5 microseconds and no activity for 1 millisecond.
So what does an average tell you? It can give an indication, but it may not show what you want it to show, so be careful.

Looking back over my career

Looking back over my time at IBM, some things have changed, and some things have not changed much.

One of my first jobs was CICS build. These were the days when Systems Test took a year – and we had one build a month. My job was to compile all of the modules, print the listings, put the clean compiles in the rack, and take the failing listings to the developer. DASD was expensive, and I remember going to see the new double density 3340s with their 70 MB capacity! It was a big leap to be able to store listings on DASD when DASD was “cheap” – and spinning platters 2 foot across!

We ran DOS (the original DOS, later renamed VSE) and VS1 under VM/360. If you wanted a new instance of DOS, you created the paging packs etc, changed the VM exec, and started it.

We would put fixes onto the SYSRES, and give the developers a choice of the ‘old’ or ‘new’ SYSRES. It is strange that this quick deploy is now one of the “latest developments” in “cloud”.

The pendulum swings to and fro.

  • We had green screens. If one broke, you went to a different one and logged on – easy.
  • We then had PCs, and a host emulator. If your PC had problems, you took it to someone to fix – and had to wait till it was fixed. Not easy.
  • We then moved all of the technology down to the PC, for example Eclipse. At one point I had more than 5 copies of Eclipse on my laptop, all at different versions and fix levels. This was hard, as you were not allowed to log on to someone else's machine. Upgrading these systems was fraught with danger – in case it did not restart – or the changes you just made were incompatible.
  • We now have a web browser interface and all of the driving power is in the back end systems – just like a green screen!