Is “how many fruit do I get per kilogram” a better metric than MIPS?

I was reading an article about the cost of a transaction and how many MIPS (Millions of Instructions Per Second) it used. This annoyed me, as it is wrong on so many levels.

  1. MIPS depends on how long the transaction ran for. Millions of Instructions Used (MINU?) would be a better metric.
  2. Instructions are different sizes, and the CPU used by an instruction varies depending on many things.
  3. If I were to talk about “how many fruit I can get per kilogram”, you would rightly say that apples weigh more than grapes. You get different sorts of apples of different sizes, so you might get 12 Cox’s apples per kilogram, but 10 Granny Smiths. I’m sure you are thinking that number of fruit per kilogram is a stupid measurement. I agree – but this is the same discussion as MIPS.
  1. Obvious statement: the duration of an instruction depends on the instruction. Instructions that access memory are often slower than instructions that just access registers.
  2. Not so obvious statement: the same instruction (for example Load register) can have different durations.
  3. Not at all obvious: the same physical instruction – for example in a loop – can have different durations.
  4. Not at all obvious: a register-to-register instruction can be slower than a memory-to-register instruction.
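The first complaint – that MIPS is a rate, so the cost of a transaction also depends on how long it ran – can be put into numbers. A toy calculation (the MIPS figure and durations are invented):

```python
def instructions_used(mips, seconds):
    """Total instructions executed: MIPS is a rate, so it must be
    multiplied by how long the transaction ran to get a cost."""
    return mips * 1_000_000 * seconds

# Two transactions both "using 500 MIPS" can cost very different amounts:
short = instructions_used(500, 0.002)   # ran for 2 ms at 500 MIPS
long_ = instructions_used(500, 0.200)   # ran for 200 ms at 500 MIPS

print(f"{short:,.0f} instructions")     # 1,000,000 instructions
print(f"{long_:,.0f} instructions")     # 100,000,000 instructions
```

Same MIPS figure, a hundred-fold difference in instructions used – which is why the rate on its own tells you nothing about cost.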

Let’s take the covers off and get our hands dirty.

  1. The z machines have the processors in “books”; in reality these are big cards.
  2. There can be multiple chips per book.
  3. There is memory in each book
  4. The length of the wire between two books might be 1 metre.
  5. Inside each chip are the processors.
  6. Each processor has a cache for instructions and a cache for data. This memory is within a few millimetres of the processor.
  7. There is also cache on the chip, independent of the processors; it is within 1 cm of them.

Just a few more observations

  1. The Load instruction is not really a single operation – it is actually a microprogram using the underlying chip technology.
  2. Some instructions were written in millicode – and then a release or so later were rewritten in microcode. The MVCL instruction is a good example of this. It can move megabytes of data in one instruction.

Consider the instruction L R4,20(R3). This says load register 4 from the storage 20 bytes past where register 3 points. I’ll look at how long this instruction takes to execute.

Put this load instruction in a little loop and go round the loop twice.

JUMPHERE SR   R1,R1      clear register 1
LOOP     CHI  R1,2       compare register 1 with 2
         BNL  PAST       if we have been round twice, go to PAST
         L    R4,20(R3)  load register 4 from storage
         AFI  R1,1       add 1 to register 1
         B    LOOP       go round again
PAST     DS   0H

What are the differences in the duration of the L R4,20(R3) instruction, first time and second time around the loop?

  1. If the application issued a goto/jump to JUMPHERE, this block of code may not be in the processor’s instruction cache. It may be in the chip’s cache. If so, there will be a delay while the block of instructions is loaded into the processor’s cache.
  2. When the Load instruction is executed, both times it is in the processor’s cache, so there is no time difference there.
  3. The first time, the instruction has to be decoded into “Load”, “register 4”, “20 bytes from what register 3 points to”. The second time, the processor can reuse this information – so it will be quicker.
  4. “Go and access the storage from what register 3 points to”. You have to convert the virtual address in register 3 to a real page. There are lots of calculations to translate a virtual address to its “real address”. For example, each address space has an address 2000, and the processor needs to use the lookup tables for that address space to map 2000 to the real page. Once it has calculated the “real” page, it keeps the mapping in the Translation Lookaside Buffer (TLB) (LPAR#.ASID#.2000 -> 39945000). The second time, this information is already in the TLB.
  5. The “real” storage may be in the same chip as the one you are executing on, or it may be on a different chip. It could even be on a chip in a different book, with about 1 metre of cabling between where the real storage is and the instruction requesting it. The speed of light determines how long it will take to access the data – over 0.000001 of a metre or over 1 metre. The second time the load instruction is executed, the data is already in the processor’s cache.

In this contrived example I would expect the second load to be much faster than the first load – perhaps 1000 times faster.  Apples and grapes…

The z hardware can report where time is being spent, which allows IBM to tune its programs to reduce delays.

There was one z/OS product which had a pointer to the next free buffer in its trace table. The code would get the buffer and update the pointer to the next free buffer. This worked fine when there was only one thread running. On a big box with 16 engines, every processor was trying to grab and update this pointer. Sometimes a processor had to grab it from a different book, sometimes from another processor in the same chip. From the hardware instrumentation it was obvious that a lot of time was spent waiting for this “latch”. Giving each processor its own trace table totally removed this contention, and the trace buffers were close to the processors – a win-win situation.
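That fix can be sketched with Python threads standing in for engines and a lock standing in for the “latch” – the engine count and event count are invented:

```python
import threading

N_ENGINES, EVENTS_EACH = 4, 1000

# Original design: one shared trace table, one lock every engine fights over
shared_table = []
shared_lock = threading.Lock()

def trace_shared(engine):
    for i in range(EVENTS_EACH):
        with shared_lock:                 # every engine serialises here
            shared_table.append((engine, i))

# Fixed design: one private table per engine - no lock needed at all
private_tables = [[] for _ in range(N_ENGINES)]

def trace_private(engine):
    for i in range(EVENTS_EACH):
        private_tables[engine].append((engine, i))   # no contention

for target in (trace_shared, trace_private):
    threads = [threading.Thread(target=target, args=(e,))
               for e in range(N_ENGINES)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()

# Both designs record the same events; the private one just never waits.
print(len(shared_table))                    # 4000
print(sum(len(t) for t in private_tables))  # 4000
```

The design choice is the same as the product's fix: trading one global structure for per-engine structures removes the serialisation point entirely.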

I said that a register-to-register instruction could be slower than a memory-to-register one – how?

The processors can process multiple instructions in parallel. While one instruction is waiting for some storage to arrive from 1 metre away, the processor can be decoding the next 10 instructions. There is what I like to think of as a “commit” stage in all processing, and the commit processing has to be done in strict order.

So if we had

         L    R1,20(R3)
         SR   R4,R4
         SR   R6,R6

The two Subtract Registers may be very quick and almost complete before the Load instruction has finished, but they cannot “commit” until the Load has committed. Picture trying to debug this if the results had become visible out of order!
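The commit rule can be modelled in a few lines: instructions may *finish* out of order, but each one *retires* no earlier than its predecessor. A toy model with invented cycle counts:

```python
# Each instruction: (name, cycles to finish once issued).
# The slow Load stands in for L R1,20(R3) fetching data from another book.
program = [
    ("L  R1,20(R3)", 100),  # slow: waiting on distant storage
    ("SR R4,R4",       1),  # fast register-to-register subtract
    ("SR R6,R6",       1),
]

retire_times = []
prev_retire = 0
for name, cycles in program:
    finish = cycles                      # all issue at cycle 0 in this toy model
    retire = max(finish, prev_retire)    # in-order retirement ("commit")
    retire_times.append((name, finish, retire))
    prev_retire = retire

for name, finish, retire in retire_times:
    print(f"{name:<13} finishes at cycle {finish:>3}, retires at cycle {retire:>3}")
```

The subtracts finish at cycle 1 but cannot retire until cycle 100, when the Load ahead of them commits – which is exactly the debugging-friendly in-order behaviour the paragraph above describes.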

Now consider

         L    R1,20(R3)
         LTR  R1,R1      test if register 1 is zero
         LH   R6,100(R7)

The Load and Test Register (LTR) instruction uses register 1 – which is being used by the Load instruction. The LTR has to wait until the previous Load has “committed” before it can access the register. In effect the LTR is having to wait until the Load’s data is retrieved from 1 metre away – so it is very slow.

The data for the Load Halfword (LH R6,100(R7)) may be in the data cache – and so be very fast. We therefore have the situation that the register-to-register instruction LTR takes a “long” time, while the memory-to-register Load Halfword is blazingly fast.

Things ain’t what they used to be.

I remember when IBM changed from bipolar processors to CMOS. We went from a machine with one engine delivering 50 MIPS to a machine with 5 processors each delivering 11 MIPS. The bean counters said “be happy – you have more MIPS”. The transactions were now 5 times slower, but you could now have 5 transactions running in parallel, so overall throughput was very similar. I wonder if these bean counters are the ones who think that if the average family has 2.5 children, there must be a lot of half-brothers and half-sisters.

For those people who have to compare one box with another box, or who want to know how much CPU a transaction needs: use CPU seconds. There are published tables giving benchmarks of the same workload on different machines, so you can make comparisons between the machines.
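Once you have CPU seconds and a benchmark ratio between two machines, the conversion is simple arithmetic. A sketch with made-up machine names and capacity figures (real numbers would come from the published benchmark tables):

```python
# Hypothetical "relative capacity per engine" figures from a benchmark
# table - boxB's engine does 1.8x the work of boxA's per CPU second.
relative_capacity = {"boxA": 1.0, "boxB": 1.8}

def cpu_seconds_on(target, source, seconds):
    """Convert CPU seconds measured on `source` to the equivalent on `target`."""
    return seconds * relative_capacity[source] / relative_capacity[target]

# A transaction costing 0.9 CPU seconds on boxA needs only 0.5 on boxB:
print(cpu_seconds_on("boxB", "boxA", 0.9))   # 0.5
```

Unlike MIPS, this says nothing about instruction counts – it compares the machines on the same workload, which is the comparison that actually matters.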
