Performance tuning at the instruction level is weird

This post came out of a presentation on performance which I gave to some Computer Science students.

When I first joined the IBM MQ performance team, I was told it was obvious that register/register instructions were fast, and storage/register instructions were slow. Over time I found this is not always true; I’ll explain why below.

For application performance there are some rules of thumb:

  • Use the most modern compilers, as they will have the best optimisation and use newer, faster instructions
  • Most single-threaded applications will gain from using the best algorithms and storage structures, but may gain very little from trying to tune which instructions are used.
  • Multi-threaded programs may benefit from being designed so that threads do not interact at the instruction level.

You may do a lot of work to tune the code – and find it makes no difference. You change one field, and it can make a big difference. You think you understand it – and find you do not.

Background needed to understand the rest of the post.

Some of what I say is true. Some of what I say may be false, but it will still help you understand, even though it is false. For example, the table in front of me is solid and made out of wood. That is not strictly accurate: atoms are mainly empty space. Saying the table is solid is a good picture, but strictly inaccurate.

I spent my life working with the IBM 390 series and the description below is based on it – but I’ve changed some of it to keep it simple.

Physical architecture

The mainframe has an instruction “Move Character Long” (MVCL). If you get a microscope and look at the processor chips, you will not find any circuitry which implements this instruction. This instruction is implemented in the microcode.

Your program running on a processor is a bit like Java bytecode: the microcode reads an instruction from storage and executes it.

For an instruction “move data from a virtual address into a register”, the execution can be broken down into steps:

  1. Read memory and copy the instruction and any parameters
  2. Parse the data into operation-code, registers, and virtual storage address
  3. Jump to the appropriate code for the operation code
  4. Convert the virtual storage address into a real page address (in RAM). This is complex code. Every thread has its own address space, so for an address, say 4 000 000, you need that thread’s look-up tables to get the “real address” for that thread. Your machine may be running virtualised, so the “real address” needs a further calculation for the next level of indirection (a toy sketch of this lookup follows the list).
  5. “Lock” the register and “lock” the real address of the data
  6. Go and get the data from storage
  7. Move the data into the register
  8. Unlock the register, unlock the real address of the data
  9. End.
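
To make step 4 concrete, here is a toy sketch in C of the lookup. The table layout, page size and demo mappings are all invented for illustration; real translation walks multi-level tables in hardware and microcode, and when virtualised the hypervisor adds the second level shown here.

```c
#include <stdint.h>
#include <stdio.h>

#define PAGE_SIZE 4096            /* assume 4 KB pages                  */
#define NUM_PAGES 2048            /* enough to cover address 4 000 000  */

/* One table per thread/address space: virtual page -> real page. */
typedef struct {
    uint64_t real_page[NUM_PAGES];
} page_table;

/* Split the address into page number and offset, then map the page. */
uint64_t translate(const page_table *pt, uint64_t virt)
{
    uint64_t page   = virt / PAGE_SIZE;
    uint64_t offset = virt % PAGE_SIZE;
    return pt->real_page[page] * PAGE_SIZE + offset;
}

int main(void)
{
    static page_table guest;      /* the thread's own tables            */
    static page_table host;       /* hypervisor's tables, if virtualised */

    for (uint64_t i = 0; i < NUM_PAGES; i++) {
        guest.real_page[i] = i + 100;   /* arbitrary demo mappings      */
        host.real_page[i]  = i;
    }

    /* When virtualised, the guest's "real" address is itself virtual
     * and needs a second translation through the hypervisor's tables. */
    uint64_t guest_real = translate(&guest, 4000000);
    uint64_t host_real  = translate(&host, guest_real);

    printf("virtual 4000000 -> guest real %llu -> host real %llu\n",
           (unsigned long long)guest_real, (unsigned long long)host_real);
    return 0;
}
```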

Where is the data?

There is a large amount (terabytes) of RAM in the physical box. The processor “chips” are in books (think pluggable boards). The “chips” are about the size of my palm, one per book. There is cache in the books. Within the “chips” are multiple CPUs, storage management processors and more cache.

The speed of data access depends on the speed of light.

  • To get from the RAM storage to the CPU, the distance could be 100 cm, or about 3 nanoseconds
  • To get from the book’s cache storage to the CPU could be 10 cm, or about 0.3 nanoseconds
  • The time for the CPU to access the memory on the chip is about 0.03 nanoseconds

The time for “Go and get the data from storage” (above) depends on where the data is. The first access may take 3 nanoseconds when the data is read from RAM; if a following instruction uses the same data, it is already in the CPU cache, and so takes 0.03 nanoseconds (100 times faster).
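
Light travels roughly 30 cm per nanosecond, so distance alone puts a floor under these latencies, before any circuitry does any work. A trivial C sketch of the arithmetic (the 1 cm on-chip distance is my own rough figure, back-derived from the 0.03 ns number above):

```c
#include <stdio.h>

int main(void)
{
    const double cm_per_ns = 30.0;   /* speed of light, roughly */
    const double distance_cm[] = { 100.0, 10.0, 1.0 };
    const char  *where[]       = { "RAM", "book cache", "on-chip" };

    for (int i = 0; i < 3; i++)
        printf("%-10s %6.1f cm -> at least %.2f ns\n",
               where[i], distance_cm[i], distance_cm[i] / cm_per_ns);
    return 0;
}
```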

How is the program executed?

In the flow of an instruction (above) each stage is executed in a production line known as a pipeline. There are usually multiple instructions being processed at the same time.

While one instruction is in the stage “Convert the virtual storage address into a real page address”, another instruction is being parsed and so on.

If we had instructions

  1. Load register 5 from 4 000 000
  2. Load register 4 from register 5
  3. Clear register 6
  4. Clear register 7

Instruction 2 (Load register 4 from register 5) needs the value in register 5 and cannot execute until the first instruction has finished. Instruction 2 has to wait until instruction 1 has finished. This shows that a register to register instruction may not be the fastest; it may have to wait for a previous instruction to finish.

A clever compiler can reorder the code

  1. Load register 5 from 4 000 000
  2. Clear register 6
  3. Clear register 7
  4. Load register 4 from register 5

and so this code may execute faster, because the clear register instructions can execute without changing the logic of the program. By the time the clear register instructions have finished, register 5 may be available.

If you look at code generated from a compiler, a register may be initialised many instructions away from where it is next used.

The hardware may also be able to reorder these instructions, as long as the results appear in the correct order!
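
You can see the same effect in ordinary code. In this C sketch (my example, not from the original talk), the first loop is one long dependency chain, while the second gives the pipeline four independent chains it can overlap; the same trick as moving the clear-register instructions between the dependent loads:

```c
#include <stdio.h>
#include <stddef.h>

/* Each addition depends on the previous one, so the pipeline stalls,
 * just as "Load register 4 from register 5" had to wait above.       */
long sum_serial(const long *a, size_t n)
{
    long s = 0;
    for (size_t i = 0; i < n; i++)
        s += a[i];
    return s;
}

/* Four independent accumulators give the pipeline (and an
 * out-of-order CPU) work it can overlap.                             */
long sum_unrolled(const long *a, size_t n)
{
    long s0 = 0, s1 = 0, s2 = 0, s3 = 0;
    size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        s0 += a[i];
        s1 += a[i + 1];
        s2 += a[i + 2];
        s3 += a[i + 3];
    }
    for (; i < n; i++)           /* leftover elements */
        s0 += a[i];
    return s0 + s1 + s2 + s3;
}

int main(void)
{
    long a[1000];
    for (size_t i = 0; i < 1000; i++)
        a[i] = (long)i;
    printf("%ld %ld\n", sum_serial(a, 1000), sum_unrolled(a, 1000));
    return 0;
}
```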

Smarter use of the storage and cache

Data is read from and written to storage in “cache lines”; these may be blocks of 256, 512 or 1024 bytes.

If you have a program with a big control block, you may get benefit from putting the hot fields together. If your hot fields are spread across two cache lines, you may have processing like:

  • Load register 5 from cache line 1. This takes 3 ns.
  • Load register 6 from cache line 2. This takes 3 ns.

If the fields are adjacent you might get

  • Load register 5 from cache line 1. This takes 3 ns.
  • Load register 6 from cache line 1. This takes 0.03 ns because the cache line is already in the CPU’s cache.
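
In C terms, the difference is just field order in the structure. This sketch uses invented field names and assumes 64-byte cache lines (mainframe lines are larger, for example 256 bytes, but the idea is the same); the offsetof output shows the hot fields landing over 200 bytes apart in one layout and 8 bytes apart in the other:

```c
#include <stdalign.h>
#include <stddef.h>
#include <stdio.h>

/* Hot fields scattered through a big control block end up in
 * different cache lines.                                        */
struct control_block_cold {
    char description[200];     /* rarely touched                 */
    long messages_sent;        /* hot: updated on every call     */
    char diagnostics[200];     /* rarely touched                 */
    long messages_received;    /* hot, but ~200 bytes further on */
};

/* Grouping the hot fields at the front of one aligned line means
 * the second load finds its cache line already in the CPU.      */
struct control_block_hot {
    alignas(64) long messages_sent;
    long messages_received;
    char description[200];
    char diagnostics[200];
};

int main(void)
{
    printf("cold: sent at %zu, received at %zu\n",
           offsetof(struct control_block_cold, messages_sent),
           offsetof(struct control_block_cold, messages_received));
    printf("hot:  sent at %zu, received at %zu\n",
           offsetof(struct control_block_hot, messages_sent),
           offsetof(struct control_block_hot, messages_received));
    return 0;
}
```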

Use faster instructions

Newer generations of CPUs can have newer and faster instructions. Consider loading a register with a constant value, say 5. In the old days, the value had to be read from storage. Newer instructions may have the value (5) as part of the instruction, so no storage access is required, and the instruction should be faster.
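
You can see which form your compiler uses by asking it for the generated assembly. A trivial example (the file and function names are mine):

```c
/* five.c - compile with "cc -S five.c" and read five.s */
int five(void)
{
    /* A modern compiler typically emits an "immediate" form here,
     * e.g. "movl $5, %eax" on x86, with the 5 embedded in the
     * instruction itself, rather than a load from a literal pool
     * in storage (an extra memory access). */
    return 5;
}
```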

The second time round a loop may be faster than the first time around a loop.

The stages of the production line may cache data, for example the result of converting a virtual address to a real page address. The stage looks in its cache; if the value is not found, it does the expensive calculation, and if it is found, it uses the value directly.

If your program uses the same address (same page) the second time round the loop, the real address of the page may already be cached, and the data may already be in the CPU cache. The first time may have had to go to RAM; the second time the data is in the CPU cache.
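
You can often see this warm-up effect directly. A rough POSIX sketch (my own; the exact numbers will vary by machine): time each trip round the same loop, and the first trip is usually far slower because it pays for the page mapping and RAM reads that later trips find cached:

```c
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define N (1 << 18)   /* 1 MB of ints */

int main(void)
{
    int *a = calloc(N, sizeof *a);       /* pages are mapped lazily...  */
    volatile long sink = 0;

    for (int trip = 0; trip < 5; trip++) {
        struct timespec t0, t1;
        long sum = 0;

        clock_gettime(CLOCK_MONOTONIC, &t0);
        for (size_t i = 0; i < N; i++)   /* ...so trip 0 also pays      */
            sum += a[i];                 /* the page faults             */
        clock_gettime(CLOCK_MONOTONIC, &t1);

        sink += sum;                     /* keep the loop alive         */
        printf("trip %d: %ld ns\n", trip,
               (t1.tv_sec - t0.tv_sec) * 1000000000L
                   + (t1.tv_nsec - t0.tv_nsec));
    }
    free(a);
    return 0;
}
```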

This can all change

Consider the scenario where the first time round the loop was 100 times slower than later loop iterations. It can all suddenly change: your program is un-dispatched to let someone else’s program run, and when your program is re-dispatched, the cached values may no longer be available, so your program has another slow iteration while the real address of your virtual page is recalculated and the data is read in from RAM.

Multi-programming interactions.

If you have multiple instances of your program running and accessing shared fields, you can get interference at the instruction level.

Consider a program executing

  • Add value 1 to address 4 000 000

Part of the execution of this is to take a lock on the cache line containing address 4 000 000. If another CPU executes the same instruction, the second CPU has to wait until the first CPU has finished with it. If both CPUs are on the same chip the delay may be small (0.03 ns). If the CPUs are on different chips (in different books) it will take about 0.3 nanoseconds to notify the second CPU.

If lots of CPUs are trying to access this field there will be a long access time.

You should design your program so each instance has its own cache line, so the CPUs do not compete for storage. I know of someone who did this and got 30% throughput improvement!
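
A minimal sketch of the “own cache line” design, using POSIX threads and assuming 64-byte lines (mainframe lines are larger, so pad accordingly): each thread updates its own padded counter, so no two CPUs ever fight over a line. The slow variant warned about above is the same code with the padding removed, so all the counters share one line:

```c
/* padded.c - compile with "cc -pthread padded.c" */
#include <pthread.h>
#include <stdalign.h>
#include <stdio.h>

#define NTHREADS   4
#define ITERATIONS 10000000

/* alignas pads each counter out to its own 64-byte cache line. */
struct padded_counter {
    alignas(64) long value;
};

static struct padded_counter counters[NTHREADS];

static void *worker(void *arg)
{
    struct padded_counter *c = arg;
    for (long i = 0; i < ITERATIONS; i++)
        c->value++;            /* no other CPU wants this line */
    return NULL;
}

int main(void)
{
    pthread_t t[NTHREADS];
    long total = 0;

    for (int i = 0; i < NTHREADS; i++)
        pthread_create(&t[i], NULL, worker, &counters[i]);
    for (int i = 0; i < NTHREADS; i++)
        pthread_join(t[i], NULL);

    for (int i = 0; i < NTHREADS; i++)
        total += counters[i].value;
    printf("total = %ld\n", total);
    return 0;
}
```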

Configuring the hardware for virtual machines.

You should also consider how you configure your hardware. If you give each virtual machine CPUs on the same chip, then any interference should be small. If a virtual machine has CPUs in different books (so it takes 0.3 nanoseconds to talk to the other CPU) the interference will be larger because the requests take longer. I’ve seen performance results vary by a couple of percent because the CPUs allocated to a virtual machine were different on the second run.

Going deeper into the murk

If you have virtual machines sharing the same CPUs, this may affect your results, because your virtual machine may be un-dispatched and another virtual machine dispatched on the processor(s). The cached values for your virtual machine may have been overwritten.
