I have a set of blog posts relating to getting started with z/OS performance. This blog post follows on the overview of WLM, and describes the contents of the reports, and how you can tell if work is being delayed, and why it is being delayed.
Real goals from my system
For TSO on my z/OS there are goals
- For the first 800 service units (a systems independent measure of CPU usage)
- 80% requests to complete within 00:00:00.30
- Work has importance 2
- After this, any work has an execution velocity of 40.
For started tasks with Medium Priority the goals are
- Execution velocity of 30
- Importance 3
For started tasks with Low Priority the goals are
- Discretionary – there no goals – just do your best
How do I tell what is going on and if the goals have been met?
RMF can display data in near real time (every minute or so).
RMF captures data and produces SMF records which can be processed by RMF and other products.
You can report on
- How well the service class did against its goals
- How well transactions or work did, from a reporting class.
You could have all CICS transactions in a service class, so they get the same CPU profile etc, but have different reporting classes. You can monitor CE* transaction, and PAY* transactions differently.
You could have a reporting class for work coming in from other systems, depending on the userid.
I set up a reporting class for z/OSMF. In the RMF batch report SYSRPTS(WLMGL(RCPER(ZOSMF)).
One part of the report was contained
z/OS V2R4 SYSPLEX ADCDPL DATE 06/14/2021 INTERVAL 05.00.003 RPT VERSION V2R4 RMF TIME 09.25.00 POLICY=ETPBASE REPORT CLASS=ZOSMF PERIOD=1 -TRANSACTIONS-- TRANS-TIME HHH.MM.SS.FFFFFF TRANS-APPL%-----CP-IIPCP/AAPCP-IIP/AAP ---ENCLAVES--- AVG 1.00 ACTUAL 0 TOTAL 66.25 64.20 173.99 AVG ENC 0.00 MPL 1.00 EXECUTION 0 MOBILE 0.00 0.00 0.00 REM ENC 0.00 ENDED 0 QUEUED 0 CATEGORYA 0.00 0.00 0.00 MS ENC 0.00 END/S 0.00 R/S AFFIN 0 CATEGORYB 0.00 0.00 0.00 ----SERVICE---- SERVICE TIME ---APPL %--- --PROMOTED-- --DASD I/O--- ----STORAGE---- -PAGE-IN RATES- IOC 2366K CPU 720.505 CP 66.25 BLK 0.000 SSCHRT 0.2 AVG 81420.24 SINGLE 0.0 CPU 617333 SRB 0.223 IIPCP 64.20 ENQ 0.000 RESP 0.0 TOTAL 81421.05 BLOCK 0.0 MSO 154219 RCT 0.000 IIP 173.99 CRM 0.000 CONN 0.0 SHARED 0.00 SHARED 0.0 SRB 191 IIT 0.013 AAPCP 0.00 LCK 0.889 DISC 0.0 HSP 0.0 TOT 3138K HST 0.000 AAP N/A SUP 0.000 Q+PEND 0.0 GOAL: EXECUTION VELOCITY 70.0% VELOCITY MIGRATION: I/O MGMT 28.3% INIT MGMT 28.3% RESPONSE TIME EX PERF AVG --EXEC USING%-- -------------- EXEC DELAYS % ----------- SYSTEM VEL% INDX ADRSP CPU AAP IIP I/O TOT IIP CPU S0W1 --N/A-- 28.3 2.5 1.0 8.0 N/A 20 0.0 72 53 19
This tells the duration of the requests.
POLICY=ETPBASE REPORT CLASS=ZOSMF PERIOD=1
This tells you this is a report class (rather than a service class) the name is zOSMF, and is for period 1 . When you have service classes which have more than one criteria , such as high priority for the first 0.5 seconds of CPU – then low priority, these will have multiple periods.
This says on average there was one instance running. You can have multiple transactions or jobs in a class. Add up the total duration of all jobs/transactions and divide by the interval to get the average(AVG).
MPL (multi programming level) is an advanced topic and describes how many instances were concurrently active.
No jobs/transactions ended in this interval, with a ending rate of 0 in 5 minutes.
This shows the percentage of CPU used over the interval
- 66.25 percent on GP engines
- 64.20 percent IIPCP is 64.20 % of GP engine was doing work that could have run on a ZIIP – if there had been spare ZIIP capacity. 66.25 – 64.20 = 2.05 of work on a GP that was not ZIIP eligible.
- 173.99 percent of ZIIP work running on a ZIIP engine – so nearly 2 ZIIP engines were being used
- 0 AAPCP – there was no ZAAP eligible work offloaded onto a GP
- 0 AAP there was no work running on an ZAAP
The total ZIIP used was 173.99 in ZIIP engines, +64.20 of a GP = 238 or almost 2.5 ZIIP engines worth.
It is good to run on ZIIPs where possible, because ZIIPs are cheaper ($$) than GPs, and GPs may be configured to be slower than a ZIIP.
GOAL: EXECUTION VELOCITY 70.0%
The performance goal for this work was defined as Execution Velocity of 70 %.
EX PERF AVG --EXEC USING%-- - EXEC DELAYS % - SYSTEM VEL% INDX ADRSP CPU AAP IIP I/O TOT IIP CPU S0W1 28.3 2.5 1.0 8.0 N/A 20 0.0 72 53 19
- The achieved execution velocity was 28.3% against a target of 70%
- The performance index was 2.5. The performance goal is goal/actual. A value of 1 or smaller is good. The value here shows the goal was not met. You need to consider
- Changing the goal for this work so the target goal is what you can achieve on a normal day
- Changing the importance of the work for when the system is constrained.
- If you change the goal for one set of work – it may impact other work, so you need to look at the system as a whole and decide which is your important work.
- Add more CPUs or ZIIPs – these may not help if the delays are not CPU… see below
- Average number of address spaces in this class 1.
- EXEC USING%. The figures above were for true CPU used. WLM samples activities 4 times a second. Of the samples where jobs were running or waiting for waiting for a resource.
- 8% of an CPU engine was used – this includes ZIIP work running on GP.
- 20% of a ZIIP engine
- The ratio 8:20 is similar to CPU on GP and ZIIP actually used in this period of 66.25: 173.99.
- EXEC DELAYS
- The total delay was 72% = ( 100 – (8+20) “using samples” above)
- for 53% of all the samples it was was waiting for a ZIIP engine
- for 19% of all the the samples it was waiting for a GP engine.
- You can have other delays listed here, for example paging, or your program is capped to limit how much CPU it is allowed.
Once z/OSMF had started, and settled down, there were still delays for IIP (28%). To me this looks like a lumpy workload, that perhaps there is a timer which pops and runs multiple threads. There are more threads than IIPs – so some have to wait.
Reports for transactional work
I defined a transaction so I could measure the response times (and CPU used) for a service in z/OSMF. A TSO address space is started, and z/OSMF sends a client/server request to the TSO address space. The response time is sub-second so a good candidate to demonstrate WLM for a transaction.
I configured z/OSMF to have
<httpClassification transactionClass=”ZCI3″ resource=”/zosmf/webispf/*/“/>
The collection name is passed to WLM to determine the service class and report class of the work. The default is the server name.
All ISPF (with a URL of /zosmf/webispf/*) requests were classified as ZCI3.
I then used WLM to configure
- a service class ZCI3 with Average response time of 00:00:00.010
- a classification rule for type CB, a rule for CN=MOPZCET, and sub-rule TC = ZCI4. This gave the service class and report class.
The data in the report had
21 transactions in 5 minutes is 0.07 a second.
MPL (MultiProgramming Limit is the target which represents the number of address spaces that must be in the swapped-in state for the service class period to meet its goals. I’ve never used it!
TRANS-TIME HHH.MM.SS.FFFFFF ACTUAL 140526 EXECUTION 139950 QUEUED 575
The average time was 0.140 seconds.
GOAL: RESPONSE TIME 000.00.00.010 AVG
That was the specification in WLM (note the specified value of 0.010 is very different to the 0.140 achieved)
RESPONSE TIME EX PERF AVG --EXEC USING%-- - EXEC DELAYS % - SYSTEM HHH.MM.SS.FFFFFF VEL% INDX ADRSP CPU AAP IIP I/O TOT IIP S0W1 000.00.00.140526 66.7 14.1 0.0 0.0 N/A 18 0.0 9.1 9.1
This shows the average response time was 0.140 seconds, used 18% on a ZIIP, and waited 9% of the time for a ZIIP
To the right of the data in the report was
--- DELAY % --- UNK IDL CRY CNT 64 0.0 0.0 0.0
Which says there was 64% of the delay was unknown. This could be
- waiting for end user input
- waiting for TCP/IP data
- the program sent off a request and is waiting for a response.
For example the ISPF transaction in z/OSMF had sent a request to an address space running TSO. This address space processed the request and sent the response back. I am guessing that the 64% delay was waiting for TSO to process the request and send back the response.
You also get a response time profile based on the service class
----------RESPONSE TIME DISTRIBUTION---------- -----TIME------ # TRANS 0 10 20 30 40 50 60 70 80 90 100 HH.MM.SS.FFFFFF IN BUCKET |....|....|....|....|....|....|....|....|....|....| <= 00.00.00.014000 0 > <= 00.00.00.015000 0 > <= 00.00.00.020000 2 >>>>>> <= 00.00.00.040000 5 >>>>>>>>>>>>> > 00.00.00.040000 14 >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
This shows that out of the 21 requests, 7 were below 0.040 seconds, and 14 were over 0.040 seconds.
From the service class, it was specified as GOAL: RESPONSE TIME 000.00.00.010 AVG so this goal is very badly specified. It would be better set to average of 0.140 seconds.
I changed the service class to a goal of 0.140 seconds and activated it. After I had run some tests the output was
RESPONSE TIME EX PERF AVG --EXEC USING%-- - EXEC DELAYS % SYSTEM HHH.MM.SS.FFFFFF VEL% INDX ADRSP CPU AAP IIP I/O TOT S0W1 000.00.00.097733 100 0.7 0.0 0.0 N/A 50 0.0 0.0
Which showed no delays
and a response time profile
---RESPONSE TIME DISTRIBUTION--- -----TIME------ --# TRANS 0 10 20 30 40 50 60 HH.MM.SS.FFFFFF IN BUCKET |....|....|....|....|....|....|. <= 00.00.00.070000 0 > <= 00.00.00.084000 5 >>>>>>>>>>>>>>> <= 00.00.00.098000 9 >>>>>>>>>>>>>>>>>>>>>>>>>> <= 00.00.00.112000 1 >>>> <= 00.00.00.126000 0 > <= 00.00.00.140000 1 >>>> <= 00.00.00.154000 1 >>>> <= 00.00.00.168000 0 > <= 00.00.00.182000 0 > <= 00.00.00.196000 0 > <= 00.00.00.210000 1 >>>> <= 00.00.00.280000 0 >
An average of 0.10 seconds, with some taking up to 0.210 seconds.
Real time information
You can get the information in near real time from RMF (or other monitors)
For example for processor delays
Service CPU DLY USG EAppl ----------- Holding Job(s) --------- Jobname CX Class Type % % % % Name % Name % Name IZUSVR1 SO STCHIM CP 2 35 56.53 91 IZUSVR1 4 JES2MON 2 TCPIP IIP 94 95 183.1 89 IZUSVR1
This shows that job IZUSVR1
- Was delayed for 2% of the time on a GP
- Used 35% of the GP engines
- Was delayed 94% of the time on a ZIIP
- and used 95% of the available ZIIP resource
- The jobs using CPU were IZUSVR1 (using 91%) JES2MON and TCPIP
- The jobs using ZIIP were IZUSVR1
What to do now?
You need to identify the goals of your work, and set sensible goals. This may take several iterations. You also need to understand the priorities of the work, and userid.
Once you have configured your system to report on response times of your business critical work, you can adjust the service classes so your work achieves it goals.
Define reporting classes so you can monitor different groups of work and that they are meeting their goals.