How to become a performance expert in 3 easy lessons – and many hard lessons.
I had emails from two people with different experiences of doing performance on z/OS. One person has recently started and is not sure what is involved. The other has been doing a lot of work with customers, explaining that his product is not the cause of their performance problems.
I thought it might be interesting, for people who might be tempted to work in performance, to see the route to becoming an expert.
What does “performance” mean?
Performance work covers many different areas, and once you are competent in one product area it is not too difficult to cover additional areas.
“Performance” covers:
Making sure it scales close to linearly
If you double the throughput, the cost per transaction should stay about the same. As the throughput increases, the response time should not increase significantly, and you should be able to have many threads running concurrently.
If the workload has disk I/O then you need to have multiple threads, so while one task is waiting for I/O another task can be using the CPU.
You need a box with multiple CPUs to detect contention. If you have only one or two engines you may not detect concurrency issues.
Work to remove contention until you can drive the CPUs at 100% busy (and then ask for a bigger box). If you cannot drive the box at 100%, find out why, resolve it, and repeat.
Reduce CPU
Once you have eliminated as much contention as possible, you need to investigate where the CPU is being used and try to eliminate any hot spots. This might mean:
- Change algorithms – use a hash table instead of a linked list.
- Avoid unnecessary work. Do you really need to store intermediate values in a database?
- Can you tune the services being used? For example, tune the database, or add an index to a table.
- Rearrange the code, for example have the “hot code” located in the same few pages. Avoid lots of error handling code in the mainline code – branch out of the mainline to handle it.
- Remove debug code, or guard it with a test such as if (debug enabled) then { debug code } – see the sketch after this list.
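As a sketch of that last point: the test of the flag costs almost nothing, while the formatting and I/O inside the debug code can be expensive. The names here are illustrative, not from any particular product:

```c
#include <stdio.h>

static int debug_enabled = 0;   /* set from a configuration option or command */

void process_request(int request_id)
{
    /* The flag test costs a few instructions; the printf and its
     * formatting are only paid for when debugging is switched on.
     */
    if (debug_enabled)
        printf("DEBUG: processing request %d\n", request_id);

    /* ... mainline code ... */
}
```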
Work with customer problems
Understand which areas the users have problems with, and identify the “problem areas” where it takes a long time to diagnose the problem.
Enhance the design
From your testing, and your experience with customer problems, propose improvements to help diagnose problems. For example:
- Capture the number, the average time, and the maximum time of database requests. Report these in the statistics or in response to a display command.
- Record the number of times a resource, such as a lock, was not available: record the total count of requests, the number of blocked requests, and the time spent waiting. This code may never be executed, but if it is, you get useful information about the size of the problem (see the sketch below).
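A minimal sketch of that sort of instrumentation in C (the structure and names are mine, for illustration – not from any product). The update path is a few instructions, so it can stay in the product permanently:

```c
#include <stdio.h>

/* Statistics for one resource, e.g. a lock or a database connection. */
struct resource_stats {
    long   requests;        /* total number of requests             */
    long   blocked;         /* requests that had to wait            */
    double total_wait_ms;   /* time spent waiting, for the average  */
    double max_wait_ms;     /* worst case seen, for reporting       */
};

/* Call after each request, with the wait time that was observed. */
void record_request(struct resource_stats *s, double wait_ms)
{
    s->requests++;
    if (wait_ms > 0.0) {                 /* the request was blocked */
        s->blocked++;
        s->total_wait_ms += wait_ms;
        if (wait_ms > s->max_wait_ms)
            s->max_wait_ms = wait_ms;
    }
}

/* Report the statistics, e.g. in response to a display command. */
void display_stats(const struct resource_stats *s)
{
    printf("requests %ld, blocked %ld, avg wait %.3f ms, max wait %.3f ms\n",
           s->requests, s->blocked,
           s->blocked ? s->total_wait_ms / s->blocked : 0.0,
           s->max_wait_ms);
}
```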
Provide useful information to the end user
These are often known as “performance reports”. It is easy to produce reports that people cannot use – I have done it many times. Reports with nice graphs are often not easy to use, because they do not match the reader’s scenario.
You need to consider the questions the end users will have.
- I want to run an ill-defined workload (I do not know all the details). How big a box do I need (how many CPUs) to support 1000 requests a second?
- What should I look at to tell me if things are running well or not?
- What are common symptoms, and what actions can I take to solve performance problems?
- What things do we need to consider to make it run well? For example table layout, how many requests per commit, how often you need to sign on.
Performance roles
The roles below are typical of the sorts of activities a performance person will do.
Run tests
The first task a person usually does when becoming a performance person is to run tests and collect the data. This may involve writing scripts and tools so it can all be automated. For example, on z/OS you might use NetView to run scripts, capture responses, and take actions when there are problems. This could all be done using REXX scripts in TSO, and possibly using a REST interface.
Good automation will collect all of the key metrics into one place, for example a spreadsheet, so the analyst can simply press a button or two to display the data.
There may be a management report produced daily or weekly to show that performance overall has improved – or has not got worse.
Look at a component
You need to look at individual components within the whole environment: for example, this week look at the z/OSMF SDSF interface, next week the logon process.
You need to drive a high-volume workload using this component, and focus on the component itself. For example, with REST requests, 90% of the cost may be in logging on and establishing a session, which makes it hard to see the other 10%. Sign on once, and have an application that just issues requests to the component.
When I was testing MQ under CICS, the duration of an MQPUT took 50 microseconds, and the cost of starting the CICS transactions was 1000 microseconds. I changed the transaction to process 1000 messages, so the transaction now took about 50 milliseconds, and most of the work was in the MQPUT area, and not in the CICS transaction overhead.
Capture the response time of the transactions and plot it over time. You should get a flat line. If the response increases over time, you might have a storage leak, and so it takes longer to get storage.
You may find it does not scale. Turning trace on can give an indication of where the problem is. You often get function entry and exit trace with time stamps, so you can post-process the output to calculate the duration spent within each function. But trace itself often does not scale, so you cannot always believe the output.
You may want to instrument a private copy of the code: obtain the time on entry to and exit from each function, and across major calls to external services. Calculate the duration of the calls, and add logic such as “if duration > 10 milliseconds then throw an exception”, or accumulate the data in a global control block. When I did this, I found the trace code was adding significant delays, and the root cause of the problem was an insignificant-looking line of code, which got an exclusive latch for an update!
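A sketch of that sort of private instrumentation (the 10 ms threshold and the function names are illustrative):

```c
#include <stdio.h>
#include <time.h>

/* Monotonic timestamp in microseconds. */
static long long now_usec(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (long long)ts.tv_sec * 1000000 + ts.tv_nsec / 1000;
}

void do_external_request(void)
{
    long long start = now_usec();

    /* ... the call being measured ... */

    long long duration = now_usec() - start;
    if (duration > 10000)                /* > 10 ms: flag it at once */
        fprintf(stderr, "slow request: %lld usec\n", duration);
    /* otherwise accumulate duration into a statistics control block */
}
```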
I added code to measure the average duration of file I/O and output this in the statistics. This made solving some problems very easy – “you have an I/O problem. See here, it is taking 10 ms to write a page of data!”
Unless you are testing the startup times, you should allow the system under test to “warm up”, so the hardware cache is in a steady state, database tables are in memory etc.
I found it useful to warm it up, then take 5 sets of measurements, each of 1-5 minutes. When you display the data, the results should all be similar; if not, you need to find out why. You should also rerun these tests once a week, and whenever you change a component, such as putting fixes onto your system or changing the hardware. Some examples of things that can change your results:
- Overnight the operations team run backups, causing a different disk response time.
- The order the LPARs were IPLed has changed. Last week your system had 6 CPUs in one book (so all very close to each other); this week it has 3 CPUs in one book and 3 CPUs in a different book – 1 metre away.
- The network between your driving system and the test system has changed, or has a different load.
Usually the performance machines have their own dedicated hardware: processors, disks, connections to the disks, and network.
Develop skills in other products
My background is MQ performance on z/OS. I had to learn about the performance characteristics of z/OS, DB2, TCP/IP, IMS, and understand the tools these products provide. Once you understand one trace, other traces are basically similar. The hard part is capturing the trace.
MQ passes messages from system to system, and there were several problems where “the network was slow”. This meant we had to understand what was happening under the covers. Some good problems with easy fixes included:
- There was a TCP performance “improvement” where one end would delay sending a packet for a few milliseconds, as it is more efficient to send one big packet than several smaller ones. This meant that every MQ message sent over the network suffered a couple of milliseconds’ delay. The fix was easy – disable this feature.
- TCP/IP by default uses small buffers (256 bytes). You can configure a session to have very large buffers and tell it to automatically tune the best buffer size (up to MB-sized buffers).
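The delayed-send feature in the first item is the classic Nagle algorithm; on most TCP/IP stacks an application can turn it off per socket with TCP_NODELAY, and request bigger buffers with SO_SNDBUF/SO_RCVBUF. A sketch, with error handling reduced to a return code (the automatic buffer tuning itself is a stack configuration option, not a socket call):

```c
#include <netinet/in.h>
#include <netinet/tcp.h>
#include <sys/socket.h>

/* Disable the delayed-send optimisation and request large buffers
 * on an existing TCP socket. Returns 0 on success, -1 on error.
 */
int tune_socket(int sock)
{
    int nodelay = 1;               /* send each packet immediately     */
    int bufsize = 1024 * 1024;     /* ask for 1 MB send/receive space  */

    if (setsockopt(sock, IPPROTO_TCP, TCP_NODELAY,
                   &nodelay, sizeof(nodelay)) < 0)
        return -1;
    if (setsockopt(sock, SOL_SOCKET, SO_SNDBUF,
                   &bufsize, sizeof(bufsize)) < 0)
        return -1;
    return setsockopt(sock, SOL_SOCKET, SO_RCVBUF,
                      &bufsize, sizeof(bufsize));
}
```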
Work with customers on their performance problems
This work involves performance problems in systems that include none of your specially written instrumentation. You may need to turn the product trace on for a few seconds, turn it off, and then process the output. Many customers do not run with trace on, because of the overhead and major impact on throughput.
You can acquire the skills to talk to customers on the phone about their problems. It is very good to feed back what you heard: “Let me check what you just said … when you do … you get … ”
Over time you will build up a list of questions to ask.
Once the problem has been resolved, consider what would have made it easier to find the root cause. Can you get development to put in some statistics, so next time this happens, you can tell the customer to check a value.
In the early days of MQ we used to get many problems because the in-memory buffer was too small. Development put out a fix so that, every 10 minutes or so, it would report whether it had detected a buffer-full condition since the last message. After this fix was rolled out, we had no more of these problems.
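The pattern behind that fix is worth copying: set a cheap flag on the hot path when the condition is detected, and have housekeeping report and reset it at most once per interval. A sketch (the names and the 10-minute interval are illustrative):

```c
#include <stdio.h>
#include <time.h>

static int    buffer_full_seen = 0;
static time_t last_report      = 0;

/* Hot path: when the buffer fills, just set a flag - almost free. */
void note_buffer_full(void)
{
    buffer_full_seen = 1;
}

/* Called periodically, e.g. from a housekeeping timer. */
void report_buffer_full(void)
{
    time_t now = time(NULL);
    if (buffer_full_seen && now - last_report >= 600) {  /* ~10 minutes */
        printf("buffer full condition detected since last message\n");
        buffer_full_seen = 0;
        last_report = now;
    }
}
```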
There is no limit to how far you can go
Once you have skills in one component, you can apply them to other products or components. For example, I spent some time looking at MQ on Linux so I could understand (and blog about) the performance data produced. (The performance data was “here are some numbers, we are not going to tell you what they mean”.)
I’ve also been looking at Java performance, which led me to look at the zFS file system and the statistics it provides (it provides some – but they are not very useful).
You can also go deep. I knew about z/Architecture instructions, and how some are fast and some are slow. I attended a taskforce with lots of hardware people, where I met the team leader for the “load instructions”, and found that a “load instruction” is not really a single instruction – it is more like a subroutine with logic, for example:
- Find which CPU currently “owns” this data in its cache, and go and get it
- Lock the page
- Go and get this value from another page
- Add the two values
- Unlock both the pages
The subroutine had to communicate with other CPUs in the LPAR, worry about its own CPU cache etc. Deep Stuff!
Once you know this sort of stuff, it helps you program: for example, it is better not to share a field if you do not have to. When a multi-threading program traces into a buffer, do not have one buffer which all the threads share; give each thread its own buffer. This way the hardware will not be fighting over the buffer, and the data for each application can be kept on the same CPU as the program. This is obvious once you know!
Collect statistics at the thread level, and not at the global level, and merge them at display time. You know the reason why – see the sketch below.
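A sketch of thread-level statistics in C11 (the fixed-size registry is a simplification): each thread updates only its own cache-line-sized slot, so the CPUs never fight over the data, and the totals are only added up when someone asks.

```c
#include <stdio.h>

#define MAX_THREADS 64

/* Padded to 64 bytes so adjacent slots never share a cache line. */
struct thread_stats {
    long puts;
    long gets;
    char pad[64 - 2 * sizeof(long)];
};

static struct thread_stats all_stats[MAX_THREADS];

/* Each thread points at its own slot (C11 thread-local storage). */
static _Thread_local struct thread_stats *my_stats;

void register_thread(int thread_index)
{
    my_stats = &all_stats[thread_index];
}

/* Hot path: a purely local update, no sharing, no locking. */
void count_put(void)
{
    my_stats->puts++;
}

/* Merge at display time, not on the hot path. */
void display_totals(void)
{
    long puts = 0, gets = 0;
    for (int i = 0; i < MAX_THREADS; i++) {
        puts += all_stats[i].puts;
        gets += all_stats[i].gets;
    }
    printf("total puts %ld, total gets %ld\n", puts, gets);
}
```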
The hardware can start to execute instructions out of order – as long as they “commit” in the right order.
The z hardware has instrumentation which samples the executing system and can tell you why instructions were delayed, for example:
- Data had to be obtained from the L2 cache on the chip
- The address the instruction used had to be translated, and the result added to the Translation Lookaside Buffer
This is a bit deep for many people, especially if they are at the level of using “printf” in their programs to display debug information.
“Me, with the brain the size of a planet ….”
This is a quote from Marvin, the Paranoid Android in The Hitchhiker’s Guide to the Galaxy. With performance work you can go deep, or you can go wide, but you would need a bigger brain than I have to go deep and wide – but it is a fascinating area.