Question: How do you tell if your car has a problem? Answer: You look at the dashboard and see if there is a red light showing. You may not know how to fix it, but you know that you need to get help to fix it.
The aim of this series of blog posts is to show you what to look for in z/OS performance, so you can tell if you have a problem.
I will cover:
- CPU at the LPAR level.
- Synchronous I/O
- Workload Manager; or is my work achieving its goals?
I've written a blog post on how to understand reports from WLM.
40 years ago the pilot of a commercial jet had many knobs and dials to control the performance (and speed) of the aeroplane. These days computers do most of the small tuning; the pilot sets the overall goals, and the computer does the rest.
It is the same with managing workload performance on z/OS. The systems programmer used to individually adjust the performance of jobs and transactions running on z/OS. These days the systems programmer sets the overall goals and the computer does the rest.
The work on z/OS is managed by the WorkLoad Manager (WLM). The systems programmer defines goals like
- These CICS transactions should run with an average response time of 1 second.
- The trivial TSO commands should run in under 10 milliseconds.
- Batch: I don't care… it can run in the background.
I remember one customer saying that on the day he switched on WLM (so that WLM managed the workload), he noticed that the batch workload finished early. It seemed to make everything go faster!
This is because when the system had been manually “tuned”, the CICS transactions were finishing in under half a second (much faster than the requirement of 1 second). WLM worked to the goals, and the CICS transactions executed with a response time of 1 second. This meant there was spare CPU, and more batch workload could be done during the day.
How to monitor work?
For short lived requests, like CICS transactions or TSO commands, the response time is the obvious metric. Typically the response time is under a second, and at the end of a minute you should have many data points to tell you if you are achieving your response time goals.
Long running jobs or started tasks may run for weeks, so the time to run the job is meaningless. It could run slower when the system is busy, or run faster when the system is lightly loaded. This kind of work needs a second-by-second metric to measure progress.
WLM uses the concept of how much the work can be delayed. It uses a metric called execution velocity, which in concept compares the “CPU used” with the “time waiting for a resource”. In simple terms
100 * CPU used /(CPU used + wait time for a resource).
WLM can periodically check this ratio and adjust the priority of the work to achieve the goals.
If the execution velocity is 100 then the work is not being delayed waiting for CPU or any other resource.
If the execution velocity is 1 then, if the work used 1 second (or millisecond) of CPU, it is OK for the job to be delayed for I/O or waiting for CPU for about 100 seconds (or milliseconds); strictly it is 99, from the formula. The ratio is important, not the absolute values.
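The simple form of the formula can be sketched in a few lines of Python. This is only an illustration of the arithmetic; the function name `execution_velocity` and the decision to return 0 when there is no data are my own assumptions, not anything WLM defines.

```python
def execution_velocity(cpu_used: float, wait_time: float) -> float:
    """Simplified execution velocity: 100 * using / (using + waiting)."""
    total = cpu_used + wait_time
    if total == 0:
        return 0.0  # no data at all; returning 0 is a choice made for this sketch
    return 100.0 * cpu_used / total


# Never delayed: velocity is 100.
print(execution_velocity(1, 0))
# 1 unit of CPU and 99 units of waiting: velocity is 1.
print(execution_velocity(1, 99))
# Same ratio with bigger numbers gives the same velocity.
print(execution_velocity(1000, 99000))
```

The last two calls give the same answer, which is the point of the metric: only the ratio matters, not whether the units are seconds or milliseconds.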
How does WLM work out which work to dispatch?
Every 10 seconds WLM looks at the data to decide which service classes need more or less CPU.
WLM looks at all the work, and checks whether it is meeting the goals of its service class (the service class is where the goals are defined).
- If all the work is within its goals, pick any waiting work to dispatch.
- If any work is not within its goal, adjust the dispatching priorities. Start with work with Importance 1; when this is within its goals, look at work with Importance 2, and so on.
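The "most important work first" idea can be sketched as below. This is a toy model under my own assumptions, not WLM's real algorithm: the `ServiceClass` type, its fields, and the service class names are all hypothetical, and the only goal type modelled is a velocity goal.

```python
from dataclasses import dataclass


@dataclass
class ServiceClass:
    name: str               # hypothetical service class name
    importance: int         # 1 is the most important
    goal_velocity: float    # the velocity the class should achieve
    actual_velocity: float  # the velocity measured in the last interval

    def meeting_goal(self) -> bool:
        return self.actual_velocity >= self.goal_velocity


def classes_needing_help(classes):
    """Return the service classes missing their goal, most important first."""
    missing = [c for c in classes if not c.meeting_goal()]
    return sorted(missing, key=lambda c: c.importance)


classes = [
    ServiceClass("BATCHLOW", importance=3, goal_velocity=10, actual_velocity=5),
    ServiceClass("CICSHIGH", importance=1, goal_velocity=60, actual_velocity=40),
    ServiceClass("STCMED", importance=3, goal_velocity=30, actual_velocity=35),
    ServiceClass("TSO", importance=2, goal_velocity=40, actual_velocity=30),
]

for c in classes_needing_help(classes):
    print(c.name, "importance", c.importance)
```

STCMED is within its goal so it is left alone; the other three are looked at in importance order.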
What can you configure?
You can configure the system with goals like
- CICS transactions should take 1 second elapsed time to execute.
- Quick TSO commands using less than 0.5 seconds of CPU have high velocity.
- Slow TSO commands using more than 0.5 seconds, but less than 5 seconds of CPU have Importance 3.
- Expensive TSO commands using more than 5 seconds of CPU have low priority.
- Colin’s TSO userid always gets high priority regardless of the commands.
- Batch jobs with this accounting information can run with high velocity.
- Long batch jobs, or those batch jobs using more than 1 second of CPU, have low priority.
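The TSO goals in the list above are based on how much CPU a command has used. A minimal sketch of that banding, assuming the 0.5 and 5 second thresholds from the list, could look like this. The function name `tso_band` and the band names are made up for illustration, and the list does not say what happens exactly at 0.5 or 5 seconds, so the boundary handling here is my assumption.

```python
def tso_band(cpu_seconds: float) -> str:
    """Place a TSO command in a band by CPU used (thresholds from the list above)."""
    if cpu_seconds < 0.5:
        return "quick"      # high velocity goal
    if cpu_seconds <= 5:
        return "slow"       # Importance 3
    return "expensive"      # low priority


for cpu in (0.1, 2, 10):
    print(cpu, tso_band(cpu))
```

In a real WLM definition a userid rule (like the one for Colin's userid) would be checked before any of this, which the sketch does not model.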
Work can be tracked across the system. If WLM detects that CICS transactions are slowing down, then when CICS issues a DB2 request in a different LPAR in the sysplex, WLM makes sure the request in DB2 has a high enough priority to keep the response time goals. WLM can also prioritise I/O so that the I/O for a transaction takes precedence over the I/O for a batch job.
The systems programmer creates a few broad categories of work, and specifies the goals of the service class. These service classes control the priority of work.
Use the WLM redbook for guidance on defining WLM service classes.
Service classes define the goals.
Reporting classes group similar jobs or transactions, so that WLM information can be reported for them together. So although a group of work has the same service class, you can report on it in different ways, for example by transaction, or by userid.
You can define CICS, IMS, or Liberty as a server, and transactions/work within the server get WLM classified. So for job CICSA, the transactions PAY1,PAY2,PAY3 have high priority; for z/OSMF, userid COLIN has high priority.
What class is my work in?
You can display which service class a job is in using the D A,jobname operator command. For example, it gave
WKL=STARTED SCL=STCLOM
The Service CLass is STCLOM.
You can use SDSF DA, and use the column SrvClass. (You need to start RMF, then go into SDSF to display the SrvClass and other WLM related parameters.)
You can change the service class of a job by using the RESET operator command (abbreviated to E) with the SRVCLASS= parameter.
(Note which way round the letters are: jobname IZUSVR1, service class SRVCLASS=…)
or, if you are authorised, overtype the field in SDSF, or from z/OSMF WLM plugin.
To change it permanently you’ll need to change the WLM definitions.
More details of how it works
I found the WLM redbook useful.
I described above that the execution velocity was 100 * CPU used /(CPU used + wait time for a resource).
The concept is correct, but the implementation is different. If my job had used 1000 seconds of CPU since it started, a velocity calculated over its whole life would not help in seeing its behaviour over the last few minutes, as the value would be insensitive to recent changes.
Every 250 milliseconds (4 times a second) WLM looks at every job/transaction in the system. It then updates counters in internal control blocks for each Service Class and Report Class, according to what the work is doing. If the work is
- executing: add 1 to the active (or using) CPU count
- transferring data to a device (connect time): add 1 to the active (or using) I/O count
- waiting in z/OS to start an I/O: add 1 to the delayed for I/O count
- being paged in: add 1 to the delayed for “page in” count
- waiting for the end user to enter data: do nothing
- waiting for TCPIP data: do nothing.
Execution velocity = 100 * Total active samples / (Total active samples + Total delayed samples).
If during a 25 second period (100 samples) the transaction was
- using CPU, in 20 samples,
- transferring data to disk, in 10 samples,
- waiting to start an I/O, in 25 samples,
- waiting for the end user to type some data, in 45 samples.
From this we can see…
- The count of active samples is 20 + 10
- The count of delayed is 25 samples.
- 45 samples are not used.
Execution velocity = 100 * (20 + 10) / ((20 + 10) + 25) = 55 (rounded from 54.5).
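The sampling calculation above can be reproduced with a small Python sketch. The state names (`cpu`, `io_connect`, `io_queue`, `page_in`, `user_wait`) are labels I have invented for this example; they are not WLM field names, and the real sampler tracks more states than these.

```python
from collections import Counter

# Hypothetical labels for what a sample can find the work doing.
USING = {"cpu", "io_connect"}     # counted as active (using) samples
DELAYED = {"io_queue", "page_in"}  # counted as delayed samples
# Any other state (waiting for the end user, waiting for TCPIP) is ignored.


def velocity_from_samples(samples):
    """Return 100 * using / (using + delayed), or None if nothing was counted."""
    counts = Counter(samples)
    using = sum(counts[state] for state in USING)
    delayed = sum(counts[state] for state in DELAYED)
    if using + delayed == 0:
        return None
    return 100.0 * using / (using + delayed)


# 25 seconds at 4 samples a second is 100 samples.
samples = (["cpu"] * 20 + ["io_connect"] * 10
           + ["io_queue"] * 25 + ["user_wait"] * 45)

velocity = velocity_from_samples(samples)
print(len(samples), round(velocity))
```

The 45 samples spent waiting for the end user change nothing: removing them from the list gives the same velocity.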
An execution velocity of 100 means that whenever the job was sampled, it was always either dispatched and running, or transferring data to or from a device.
The execution velocity also gives a rough idea of elapsed time. If we expect the job to use 50 seconds of CPU, and it has a velocity of 10, then we expect the job to run in about 500 seconds: 50 seconds using CPU and 450 seconds delayed. If it used 50 seconds of CPU, spent 20 seconds transferring data (connect time), and was delayed for 450 seconds, the execution velocity would be 100 * (50 + 20) /((50 + 20) + 450) = 13%, which is close enough to a velocity of 10.
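The arithmetic in this paragraph can be checked with a short sketch. Both function names are my own, and the elapsed time estimate assumes the job spends all of its time either using or delayed, which is a simplification.

```python
def expected_elapsed(using_seconds: float, velocity: float) -> float:
    """Rough elapsed time if the work achieves exactly the given velocity."""
    return using_seconds * 100.0 / velocity


def velocity(using_seconds: float, delayed_seconds: float) -> float:
    """Execution velocity from time spent using and time spent delayed."""
    return 100.0 * using_seconds / (using_seconds + delayed_seconds)


# 50 seconds of CPU at a velocity of 10 gives about 500 seconds elapsed.
print(expected_elapsed(50, 10))
# 50 seconds of CPU plus 20 seconds of connect time, with 450 seconds delayed.
print(round(velocity(50 + 20, 450)))
```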
Real goals from my system
For TSO on my z/OS system the goals are
- For the first 800 service units (a system-independent measure of CPU usage)
  - 80% of requests to complete within 00:00:00.30
  - Work has importance 2
- After this, any work has an execution velocity of 40.
For started tasks with Medium Priority the goals are
- Execution velocity of 30
- Importance 3
For started tasks with Low Priority the goals are
- Discretionary: there are no goals, just do your best.
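A percentile response time goal, like the TSO goal of 80% of requests completing within 00:00:00.30, can be checked with a sketch like the one below. The function name and its parameters are hypothetical. I have assumed that "within" includes a response time of exactly 0.30 seconds, and that an interval with no requests counts as meeting the goal.

```python
def meets_percentile_goal(response_times, target_seconds=0.30, percent=80.0):
    """True if at least `percent` of the requests finished within the target."""
    if not response_times:
        return True  # nothing ran, so nothing missed the goal (an assumption)
    within = sum(1 for t in response_times if t <= target_seconds)
    return 100.0 * within / len(response_times) >= percent


# 8 of 10 requests were fast enough: the goal is met.
print(meets_percentile_goal([0.1] * 8 + [1.0] * 2))
# Only 7 of 10 were fast enough: the goal is missed.
print(meets_percentile_goal([0.1] * 7 + [1.0] * 3))
```

A percentile goal like this is not upset by a couple of very slow requests, unlike an average response time goal.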