Question: How do you tell if your car has a problem? Answer: You look at the dashboard and see if there is a red light showing. You may not know how to fix it, but you know that you need to get help to fix it.
The aim of this series of blog posts is to show you what to look for in z/OS performance, and how to tell if you have a problem.
I will cover
- CPU at the LPAR level.
- Synchronous I/O
- Workload Manager; or is my work achieving its goals?
For some of these you need data from z/OS. This post describes how to get the SMF data, and format it using RMF.
There are two basic things you need to check
- Has my LPAR got all the CPU it wanted, or has the hypervisor restricted the CPU?
- How busy are my CPUs?
Has my LPAR got all the CPU it wanted
An LPAR can be configured to have dedicated engines, or to share a pool of engines. With dedicated engines, an engine is always there when it is needed. If the LPAR is using shared engines, an engine may not always be available when needed.
An example to explain the concept
You have a class from 10 am to 11 am. You go in and sit down. The teacher starts the class. After 20 minutes the teacher's phone rings, and the teacher goes out of the classroom. You play with your phone until the teacher comes back 40 minutes later. (The teacher went to teach in a different classroom.)
How long were you in class for and how much work did you do?
- You were in class for 1 hour.
- You did 20 minutes work.
This concept is the same as any LPAR with shared engines.
- The 1 hour class is a time slice as seen by z/OS.
- The “processor” (teacher) was used in the time slice for only 20 minutes
- For 40 minutes the “processor” was doing work elsewhere.
How do you get the report to show these figures?
You need the RMF CPU report. It has "C P U  A C T I V I T Y" at the top of the page.
Look at the section
---CPU---    ---------------- TIME % ----------------
NUM  TYPE    ONLINE    LPAR BUSY    MVS BUSY    PARKED
 0   CP      100.00    46.68        46.32       0.00
 1   CP      100.00    38.98        38.78       0.00
 2   CP      100.00    34.91        34.62       0.00
TOTAL/AVERAGE          40.19        39.90
 3   IIP     100.00    94.43        94.70       0.00
 4   IIP     100.00    93.50        93.74       0.00
TOTAL/AVERAGE          93.96        94.22
LPAR BUSY is how much teacher time you got.
MVS BUSY is how much time you were in the classroom for.
- If MVS BUSY = LPAR BUSY, perfect: what you needed, you got.
- If MVS BUSY > LPAR BUSY, MVS had to wait for an engine. The system may need more CPU, but a small difference (5%) is OK.
- If MVS BUSY >> LPAR BUSY, for much of the time there was no engine when MVS needed one. This will have a major impact on your work. If your end user work is not meeting its targets, you need more CPUs, or you need to give your LPAR a higher dispatching priority.
These values should be similar. In the report above, MVS BUSY 39.90 is close to LPAR BUSY 40.19 for the CPs, and for the zIIPs MVS BUSY 94.22 is close to LPAR BUSY 93.96.
When these figures are significantly different, stop, and fix the problem. It can make all your other performance data look bad, for example disk response times, and timings in application trace entries.
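The comparison above can be sketched in a few lines of Python. This is only an illustration: the function name and the 20 point "much greater" threshold are my assumptions, and I have read the "5%" as 5 percentage points. Only the TOTAL/AVERAGE figures come from the report.

```python
def classify_cpu_supply(mvs_busy, lpar_busy, small_gap=5.0, large_gap=20.0):
    """Compare MVS BUSY with LPAR BUSY (both in percent).

    small_gap is the "small difference is OK" figure from the text,
    read as percentage points. large_gap is a hypothetical threshold
    for "much greater"; pick a value that suits your system.
    """
    gap = mvs_busy - lpar_busy
    if gap <= small_gap:
        return "ok"
    if gap <= large_gap:
        return "may need more CPU"
    return "short of CPU"


# TOTAL/AVERAGE rows from the report: (MVS BUSY, LPAR BUSY)
report = {"CP": (39.90, 40.19), "IIP": (94.22, 93.96)}

for cpu_type, (mvs_busy, lpar_busy) in report.items():
    print(cpu_type, classify_cpu_supply(mvs_busy, lpar_busy))
```

Both engine types in the example report come out as "ok", which matches reading the figures by eye.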
How busy are my CPUs?
The TOTAL/AVERAGE will be close to 100% on a busy system. 95% busy is OK, but make a note that the system may be short of CPU.
These are average values. The individual values could be spiky. For example, 100% busy for 4 minutes followed by 80% busy for 1 minute gives an average of 96% busy over 5 minutes. Consider using an online monitor to see if you have big peaks and troughs.
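To make the arithmetic concrete, here is a small Python sketch of a time-weighted average. The function name and the data layout are mine, for illustration only. It shows how an interval average can hide a period at 100% busy.

```python
def weighted_average(samples):
    """samples is a list of (busy_percent, minutes) pairs."""
    total_minutes = sum(minutes for _, minutes in samples)
    busy_minutes = sum(busy * minutes for busy, minutes in samples)
    return busy_minutes / total_minutes


# 100% busy for 4 minutes, then 80% busy for 1 minute
interval = [(100.0, 4), (80.0, 1)]

average = weighted_average(interval)
peak = max(busy for busy, _ in interval)
print(f"average {average:.0f}% busy, peak {peak:.0f}% busy")
```

The average is 96%, but for four of the five minutes the CPUs had no spare capacity at all.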
More advanced topic for information.
The following section gives you information on how much work was waiting. It is hard to say what is good or bad, because the figures can look bad even though all the performance goals are being met.
How much work was waiting?
-----------------------DISTRIBUTION OF IN-READY WORK UNIT QUEUE--------------
NUMBER OF            0   10   20   30   40   50   60   70   80   90   100
WORK UNITS    (%)    |....|....|....|....|....|....|....|....|....|....|
<= N          26.3   >>>>>>>>>>>>>>
 = N +   1    12.9   >>>>>>>
 = N +   2    10.1   >>>>>>
 = N +   3    10.1   >>>>>>
<= N +   5    12.5   >>>>>>>
<= N +  10    11.0   >>>>>>
<= N +  15     6.0   >>>>
<= N +  20     5.2   >>>
<= N +  30     1.6   >
<= N +  40     0.6   >
<= N +  60     1.1   >
<= N +  80     1.1   >
<= N + 100     0.8   >
<= N + 120     0.1   >
<= N + 150     0.0
 > N + 150     0.0
N is the number of CPUs. I have 5 on my system.
The data is sampled. If the system is sampled 10 times a second, then every 0.1 of a second RMF counts the number of tasks in the "ready to dispatch" queue, and increments the value in the appropriate box. For example, if there were 5 tasks executing and one task waiting, it increments the N + 1 element.
- 26.3 % of the time, there were no tasks waiting for CPU.
- 12.9 % of the time, there was 1 task waiting for CPU. See the "= N + 1" line in the data above.
- 10.1 % of the time, there were 2 tasks waiting for CPU
- 5.2 % of the time there were between 16 and 20 tasks waiting for CPU
- 0.1 % of the time there were between 101 and 120 tasks waiting for CPU
Remember this could be waiting for CP, or IIP.
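The sampling described above can be sketched in Python. This is my own illustration of the bucketing idea, not how RMF is implemented. The bucket boundaries are taken from the report; the function names and the sample values are made up.

```python
from bisect import bisect_left
from collections import Counter

# Upper bound of each report row, as "work units above N"
BOUNDS = [0, 1, 2, 3, 5, 10, 15, 20, 30, 40, 60, 80, 100, 120, 150]
LABELS = (
    ["<= N", "= N + 1", "= N + 2", "= N + 3"]
    + [f"<= N + {bound}" for bound in BOUNDS[4:]]
    + ["> N + 150"]
)


def bucket(work_units, n_cpus):
    """Return the report row for one sample of the in-ready queue."""
    excess = max(work_units - n_cpus, 0)
    return LABELS[bisect_left(BOUNDS, excess)]


def distribution(samples, n_cpus):
    """Percentage of samples that fell in each row (empty rows omitted)."""
    counts = Counter(bucket(sample, n_cpus) for sample in samples)
    return {
        label: 100.0 * counts[label] / len(samples)
        for label in LABELS
        if counts[label]
    }


# Made-up samples: number of in-ready work units seen at each sample
samples = [3, 5, 6, 6, 7, 12, 5, 4, 25, 6]
for label, percent in distribution(samples, n_cpus=5).items():
    print(f"{label:<12} {percent:5.1f}")
```

With 5 CPUs, a sample of 6 work units means one task was waiting, so it is counted in the "= N + 1" row.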
If there are hundreds of tasks waiting for CPU you should make a note. It may not be a problem.
If there are under 50 tasks waiting for CPU, this should be OK.
On a busy system there will always be work waiting to run. Compare the pictures from a busy time and a not so busy time.
Is this important?
I once did some measurements with MQ on a machine with 16 processors. On average the engines were about 5% busy. A performance person from IBM said that my workload showed a shortage of CPU! 5 % busy on 16 processors – was I really short of CPU?
My application received some data, and posted 30 threads to come and process the data. The first 15 threads could be dispatched because there were 15 unused CPUs. The other 15 threads had to wait.
This showed up in a report like the one above: the line for N + 15 showed tasks waiting 20% of the time.
Out of the 30 tasks that were dispatched, one processed the work, and the other 29 went back to sleep.
We changed the program to post no more threads than the number of CPUs (16) in the LPAR, and had a significant saving in CPU.
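The idea behind the change can be sketched as follows. This is an illustration in Python, not the code we used; the function name is hypothetical, and the CPU count is passed in rather than obtained from z/OS.

```python
import os
from typing import Optional


def threads_to_post(pending_items: int, cpu_count: Optional[int] = None) -> int:
    """How many waiting threads to wake up for the pending work.

    Never wake more threads than there is work for, and never more
    than there are CPUs to run them on.
    """
    if cpu_count is None:
        # Fall back to what Python reports for this machine
        cpu_count = os.cpu_count() or 1
    return max(0, min(pending_items, cpu_count))


print(threads_to_post(30, cpu_count=16))  # capped at the number of CPUs
print(threads_to_post(1, cpu_count=16))   # one item needs only one thread
```

Waking a thread that finds no work to do still costs CPU, so the cap saves CPU as well as shortening the queue of work waiting for an engine.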