Why oh why is my application waiting?

I’ve been working on a presentation on performance, and came up with an analogy which made one aspect really obvious… but I’ll come to that.

This blog post is a short discussion about software performance, and what affects it.

Obvious statement #1

The statement used to be “an application is either using CPU, or waiting”. I prefer to add a third state, “or using CPU and waiting”, which is not obvious unless you know what it means.

Obvious statement #2

All applications wait at the same speed. If you are waiting for a request from a remote server, it does not matter how fast your client machine is.

Where can an application wait?

I’ll go from the longest to the shortest wait times.

Waiting for the end user

If you have displayed a menu for an end user to complete, you might wait minutes (or hours) for them to fill in the information and send it.

Waiting for a remote request

This can be a request to a remote server to do something: it could be to buy something, a simple web lookup, or a name server lookup. These should all take under a second.

Waiting for disk I/O

If your application is doing database work, for example with DB2, there can be many disk I/Os. Any updates are logged to disk for recovery purposes. If your disk response time is typically 1 ms, you may have to wait several milliseconds.

When your application issues a commit and wants to log data, there is likely to be an I/O already in progress, so you have to wait for that I/O to complete before any more data can be written. Typically a database can write 16 4 KB pages at a time. If the database logging is very active, you may have to wait until any queued data in the log buffers has been written before your application’s data can be written.

An I/O consists of a set-up followed by the data transmission. The set-up time is usually fairly constant, but more data takes more time to transfer: writing 16 4 KB pages will usually take longer than writing one 4 KB page.
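As a sketch of the forced log write in C (this is an analogy, not how DB2 actually does it; the file name and record are invented for illustration): a commit hands the log data to the operating system and then waits until it is physically on disk.

#include <string.h>
#include <unistd.h>
#include <fcntl.h>

int main(void)
{
    int log_fd = open("recovery.log", O_WRONLY | O_CREAT | O_APPEND, 0644);
    if (log_fd < 0)
        return 1;
    const char *log_data = "update record 42\n";
    write(log_fd, log_data, strlen(log_data)); /* hand the data to the OS */
    fsync(log_fd); /* a commit waits here until the data is physically on
                      disk - this is where the milliseconds go            */
    close(log_fd);
    return 0;
}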

An application writing to a file may buffer up several records before writing them to the external medium in one go. Your application wrote 10 records, but there was only one I/O.
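A minimal C sketch of this buffering (again, the file name and record are invented for illustration): the C runtime collects the fwrite() calls in memory, so ten logical writes may result in only one physical I/O.

#include <stdio.h>

int main(void)
{
    FILE *f = fopen("records.dat", "wb");
    if (f == NULL)
        return 1;
    char record[100] = "some record data";
    for (int i = 0; i < 10; i++)              /* ten logical writes...  */
        fwrite(record, sizeof(record), 1, f); /* ...buffered in memory  */
    fclose(f); /* the buffer is flushed here - perhaps the only real I/O */
    return 0;
}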

These I/Os should be measured in milliseconds (or microseconds).

Database and record locks

If your application wants to update some information in a database record it could do:

  • Get record for update (this prevents other threads from updating it)
  • Display a menu for the end user to complete
  • When the form has been completed, update the record and commit.

This is an example of “Waiting for the end user”. Another application wanting to update the same record may get an “unavailable” response, or wait until the first application has finished.

You can work around this using logic like the following (there is a C sketch after the list):

  • Each record has a “last updated” timestamp.
  • Read the record, note the “last updated” timestamp, and display the menu.
  • When the form has been completed:
    • Read the record for update from the database, and check the “last updated” timestamp.
    • If the timestamp matches the saved value, update the information and commit the changes.
    • If the timestamp does not match, the record has been updated by someone else: release it, go back to the top, and try again.
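Here is a minimal in-memory sketch of that logic in C. A version counter stands in for the “last updated” timestamp, and a pthread mutex stands in for the database’s record lock; the names and structure are my own invention, for illustration only.

#include <pthread.h>

struct record {
    pthread_mutex_t lock; /* stands in for the database record lock     */
    long version;         /* stands in for the last updated timestamp   */
    int value;
};

/* Returns 0 on success, -1 if the record was changed while the end
   user was filling in the form (the caller should re-read the record
   and try again from the top).                                        */
int update_if_unchanged(struct record *r, long saved_version, int new_value)
{
    pthread_mutex_lock(&r->lock);       /* read the record for update */
    if (r->version != saved_version) {  /* someone updated it first   */
        pthread_mutex_unlock(&r->lock); /* release it                 */
        return -1;                      /* let the caller retry       */
    }
    r->value = new_value;               /* update the information     */
    r->version++;                       /* new last updated timestamp */
    pthread_mutex_unlock(&r->lock);     /* commit                     */
    return 0;
}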

Coupling Facility access

This is measured in tens of microseconds. The busier the CF is, the longer requests take.

Latches

Latches are used for serialisation of small sections of code, for example updating storage chains.

Suppose you have two queues of work elements: one for queued work, one for in-progress work. In a single-threaded application you can simply move a work element between the queues. With multiple threads you need some form of locking.

In its simplest form the code is:

pthread_mutex_lock(&mymutex);   /* serialise access to both queues   */
work = queue_pop(&waitqueue);   /* take the next queued work element */
queue_push(&active, work);      /* add it to the in-progress queue   */
pthread_mutex_unlock(&mymutex); /* release the latch                 */

You should design your code so few threads have to wait.

Waiting for CPU

This can be due to

  • The LPAR is constrained for CPU; other work gets priority, and your application is not dispatched.
  • The CEC (physical box) is constrained for CPU and your LPAR is not being dispatched.

If your LPAR has been configured to use only one CPU, and there is spare capacity in the CEC, your LPAR will not be able to use it.

Waiting for paging etc

In these days of plentiful real storage in the CEC, waiting for paging etc. is not much of an issue. If the virtual page you want is not available, the operating system has to allocate a page and map it to real storage.

Waiting for data – using CPU and waiting.

Some 101 education on the Z computer architecture

  • The processors for the z architecture are in books. Think of a book as being a physical card which you can plug/unplug from a rack.
  • You can have multiple books.
  • Each book has one or more chips
  • Each chip has one or more CPUs.
  • There is cache (RAM) for each CPU
  • There is cache for each chip
  • There is cache for each book
  • At a hardware level, when you are updating a real page, it is locked to your CPU.
  • If another CPU wants to use the same real page, it has to send a message to the holding CPU requesting exclusive use
  • The physical distance between two CPUs on the same chip is measured in millimetres.
  • The distance between two CPUs in the same book is measured in centimetres.
  • The distance between two CPUs in different books could be a metre.
  • The time to send information depends on the distance it has to travel. Sharing data between two CPUs on the same chip will be faster than sharing data between CPUs in different books.

Some instructions like compare and swap are used for serialising access to one field.

  • Load register 4 with the value from the data field. This could be slow if the real page has to be fetched from another CPU. It could be fast if the storage is in the CPU, chip, or book cache.
  • Load register 5 with new value
  • Compare and swap does the following:
    • Get the exclusive lock on the data field
    • If the value of the data field matches the value in register 4 (the compare)
    • then replace it with the value in register 5 (the swap)
    • else say mismatch
    • Unlock

These instructions (especially the first load) can take a long time, particularly if the data field is “owned” by another CPU and the hardware has to go and get the storage from a CPU in a different book, a metre away.
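That sequence maps naturally onto C11 atomics. A minimal sketch (the counter and the function name are invented for illustration):

#include <stdatomic.h>

atomic_long counter; /* the shared data field */

void add_one(void)
{
    long old = atomic_load(&counter); /* the first load - slow if the
                                         field is owned by another CPU */
    /* Retry until no other CPU updates the field between our load and
       our swap. On a mismatch, the compare and swap refreshes old with
       the current value of the field.                                  */
    while (!atomic_compare_exchange_weak(&counter, &old, old + 1))
        ;
}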

A common use of Compare and Swap is a shared trace table: each thread claims the next free element and sets the new “next free” pointer. With many CPUs actively using Compare and Swap on the same field, these instructions can become a major bottleneck.

A better design is to give each application thread its own trace buffer: no serialisation instruction is needed, and there is no contention.
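A sketch of the two designs, assuming a simple fixed-size, wrap-around trace table (the sizes and names are my own):

#include <stdatomic.h>

#define TRACE_SIZE 1024

/* Shared design: every thread competes for one next free index.
   The index field bounces between the CPUs' caches.             */
atomic_long shared_next;
long shared_trace[TRACE_SIZE];

void trace_shared(long data)
{
    long slot = atomic_fetch_add(&shared_next, 1) % TRACE_SIZE;
    shared_trace[slot] = data;
}

/* Per-thread design: each thread has its own buffer and index, so
   no serialising instruction is needed and there is no contention.
   __thread is the gcc/clang thread-local keyword.                  */
struct trace_buffer {
    long next;
    long entries[TRACE_SIZE];
};
static __thread struct trace_buffer my_trace;

void trace_private(long data)
{
    my_trace.entries[my_trace.next++ % TRACE_SIZE] = data;
}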

Storage contention

We finally get to the bit with the analogy to explain storage contention.

You have an array of counters, with one slot for each potential thread. If you have 16 threads, your array has 16 slots.

Each thread updates its counter regularly.

Imagine you are sitting in a classroom listening to me lecture about performance and storage contention.

I have a sheet of paper with 16 boxes drawn on it.
I pick a person in the front row, and ask them to make a tick on the page in their box every 5 seconds.

Tick, tick, tick … easy

Now I introduce a second person and it gets harder. The first person makes a tick; I then walk the piece of paper across the classroom to the second person, who makes a tick. I walk back to the first, who makes another tick, and so on.

This will be very slow.

It gets worse. My colleague is giving the same lecture upstairs. I now do my two people, then go up a floor so someone in the other classroom can make a mark. I then go back down to my classroom, and my people (who have been waiting for me) can then make their ticks.

How to solve the contention?

The obvious answer is to give each person their own page, so there is no contention. In hardware terms it might be a 4 KB page, or it may be a 256-byte cache line.
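In C, giving each person their own page looks like padding each counter out to its own cache line. A sketch, assuming the 256-byte cache line mentioned above and gcc-style alignment:

#define CACHE_LINE 256 /* the cache line size assumed above */

/* Contended: 16 long counters packed together - several counters
   share a cache line, so every update drags a line between CPUs. */
long packed_counters[16];

/* Uncontended: each counter padded and aligned to its own cache
   line - each person gets their own sheet of paper.              */
struct padded_counter {
    long count;
    char pad[CACHE_LINE - sizeof(long)];
} __attribute__((aligned(CACHE_LINE)));

struct padded_counter counters[16];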

I love this analogy; it has many levels of truth.
