How do I extract the WLM definitions to print, and to compare?

WLM on z/OS has a good ISPF interface, and an old-technology way of printing the configuration to an ISPF listing.

I wanted to compare the WLM configuration from one system with another system. It took a while to get this working.

I’ve put some code on GitHub to process the XML file. It is basic but it works. If people find this useful, I could expand and improve it.

Output the WLM report as XML

Go to the main WLM ISPF panel.

   File  Utilities  Notes  Options  Help
 ---------------------------------------
 Functionality LEVEL011         Definiti
 Command ===> __________________________
                                        
 Definition data set  . . : none
                                        
 Definition name  . . . . . ETPwlm    (R
 Description  . . . . . . . ETP WLM Poli
                                        
 Select one of the following options.   
 __  1.  Policies                      
     2.  Workloads

Select the File pull-down.
Select option 4, Save as.
It may display a screen of errors and warnings. Press PF3.
It displays a pop-up “Save to…”. Specify a data set name and select Save format 1 (for XML).

The file is a sequential FB 80 file.

Download this to your workstation (or run the Python on z/OS).

Create JSON format data from the file

I used the Python code

import json

import xmltodict

file = "wlm.xml"
# Read the WLM definition saved from the ISPF panels in XML format.
with open(file, "r") as myfile:
    data = myfile.read()
# Join the 80 byte records back into one string before parsing.
data = data.replace("\n", "")
book_dict = xmltodict.parse(data)
json_data = json.dumps(book_dict, indent=1, sort_keys=True)
# print(json_data)
with open("data.json", "w") as json_file:
    json_file.write(json_data)

This reads the file “wlm.xml” and creates the file “data.json”.

The file contained

{
 "ServiceDefinition": {
  "@xmlns": "http://www.ibm.com/xmlns/prod/zwlm/2000/09/ServiceDefinition.xsd",
  "ApplicationEnvironments": {
   "ApplicationEnvironment": [
    {
     "Description": "WebSphere Application Server",
     "Limit": "NoLimit",
     "Name": "BBOC001",
     "ProcedureName": "BBO5ASR",
     "StartParameter": "JOBNAME=&IWMSSNM.S,ENV=ADCD.ADCD.&IWMSSNM",
     "SubsystemType": "CB"
    },
...
"ClassificationGroups": {
   "ClassificationGroup": [
    {
     "CreationDate": "1999/11/15 19:19:26",
     "CreationUser": "TODD",
     "Description": "Production Batch High - Med",
     "ModificationDate": "1999/11/16 11:17:21",
     "ModificationUser": "TODD",
     "Name": "BATHIM",
     "QualifierNames": {
      "QualifierName": [
       {
        "Description": "All SMP/E jobs",
        "Name": "SMP*"
       },

...
"Workloads": {
   "Workload": [
    {
     "CreationDate": "1999/11/15 16:55:08",
     "CreationUser": "TODD",
     "Description": "All batch workloads",
     "ModificationDate": "1999/11/15 16:55:08",
     "ModificationUser": "TODD",
     "Name": "BATCH",
     "ServiceClasses": {
      "ServiceClass": [
       {
        "CPUCritical": "No",
        "CreationDate": "1999/11/15 17:17:54",
        "CreationUser": "TODD",
        "Description": "High Batch - med",
        "Goal": {
         "Velocity": {
          "Importance": "2",
          "Level": "60"
         }
        },
...

You can now use your favourite tools for extracting data and formatting.
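Since the aim was to compare the WLM configuration from two systems, one simple approach is to diff the sorted JSON from each system. The sketch below uses only the standard library. The two dictionaries are made-up stand-ins for the output of xmltodict.parse on each system's file, and diff_definitions is a name I have invented for illustration.

```python
import difflib
import json


def diff_definitions(left, right, left_name="system A", right_name="system B"):
    """Return unified-diff lines between two parsed WLM definitions."""
    # sort_keys=True gives both dumps the same key order, so only real
    # differences show up in the diff.
    left_lines = json.dumps(left, indent=1, sort_keys=True).splitlines()
    right_lines = json.dumps(right, indent=1, sort_keys=True).splitlines()
    return list(difflib.unified_diff(left_lines, right_lines,
                                     fromfile=left_name, tofile=right_name,
                                     lineterm=""))


# Made-up sample data, shaped like the JSON shown above.
system_a = {"ServiceDefinition": {"Workloads": {"Workload": [
    {"Name": "BATCH", "Description": "All batch workloads"}]}}}
system_b = {"ServiceDefinition": {"Workloads": {"Workload": [
    {"Name": "BATCH", "Description": "Batch work"}]}}}

for line in diff_definitions(system_a, system_b):
    print(line)
```

Lines starting with - are only in the first system, and lines starting with + are only in the second. Fields such as ModificationDate will show up as differences, so you may want to remove them before comparing.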

Using Pandas

The Python code below produced simple reports.

import pandas as pd
import xmltodict

file = "wlm.xml"
with open(file, "r") as myfile:
    data = myfile.read()
data = data.replace("\n", "")
book_dict = xmltodict.parse(data)

# Each entry under Workloads/Workload becomes one row.
rg = book_dict["ServiceDefinition"]["Workloads"]["Workload"]

dd = pd.DataFrame.from_records(rg)
# Uncomment to stop pandas truncating wide output.
# pd.set_option('display.max_rows', 500)
# pd.set_option('display.max_columns', 500)
# pd.set_option('display.width', 1000)
print(dd)

This gave me

       Name  ...                                     ServiceClasses
0     BATCH  ...  {'ServiceClass': [{'Name': 'BATHIM', 'Descript...
1    DB2RES  ...  {'ServiceClass': {'Name': 'STPCDDF', 'Descript...
2  IZUGWORK  ...  {'ServiceClass': [{'Name': 'IZUGHTTP', 'Descri...
3   SERVERS  ...  {'ServiceClass': [{'Name': 'SRVHIM', 'Descript...
4   STARTED  ...  {'ServiceClass': [{'Name': 'NOTRUN', 'Descript...
5  TSOOTHER  ...  {'ServiceClass': [{'Name': 'OTHER01', 'Descrip...

This shows that with a little Python code you can produce useful reports. People with experience of Pandas can format it better.
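In the output above, ServiceClass is a list for BATCH but a single dictionary for DB2RES, because xmltodict only produces a list when an element is repeated. The sketch below shows one way to cope with both shapes and produce one row per service class. The helper names are mine, the sample data is made up in the same shape, and BATLOM is an invented service class name.

```python
def as_list(value):
    """xmltodict gives a dict for one child and a list for several."""
    if value is None:
        return []
    return value if isinstance(value, list) else [value]


def service_class_rows(workloads):
    """Yield (workload name, service class name) pairs."""
    for workload in as_list(workloads):
        classes = (workload.get("ServiceClasses") or {}).get("ServiceClass")
        for service_class in as_list(classes):
            yield workload["Name"], service_class["Name"]


# Made-up sample in the shape xmltodict produced above.
workloads = [
    {"Name": "BATCH", "ServiceClasses": {"ServiceClass": [
        {"Name": "BATHIM"}, {"Name": "BATLOM"}]}},
    {"Name": "DB2RES", "ServiceClasses": {"ServiceClass": {"Name": "STPCDDF"}}},
]

for workload_name, class_name in service_class_rows(workloads):
    print(f"{workload_name:10} {class_name}")
```

The list of pairs could then be passed to pd.DataFrame to get a flat table instead of the nested dictionaries.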

See the code on GitHub to produce basic reports.

My WLM definitions were not behaving as I expected.

I had configured WLM so the MQ started tasks (CSQ*) were defined as a low priority STC.

  Subsystem-Type  Xref  Notes  Options  Help                              
--------------------------------------------------------------------------
                 Modify Rules for the Subsystem Type    Row 22 to 25 of 25
Command ===> ___________________________________________ Scroll ===> CSR
                                                                          
Subsystem Type . : STC         Fold qualifier names?   Y  (Y or N)        
Description  . . . All Started Tasks                                      
                                                                          
Action codes:   A=After     C=Copy        M=Move     I=Insert rule        
                B=Before    D=Delete row  R=Repeat   IS=Insert Sub-rule   
                                                             More ===>    
         -------Qualifier--------                 -------Class--------    
Action   Type      Name     Start                  Service     Report     
                                         DEFAULTS: STCLOM      ________   
 ____  1 TN        CSQ9WEB  ___                    STCLOM      MQ         
 ____  1 TN        CSQ9CHIN ___                    STCLOM      MQ         
 ____  1 TN        CSQ9ANG  ___                    STCLOM      MQ

But I could see from SDSF that the SrvClass for both CSQ9CHIN and CSQ9WEB was STCHIM. It took me a couple of hours of digging to find out why.

Higher up the list, the WLM definitions had

         -------Qualifier--------                 -------Class--------    
Action   Type      Name     Start                  Service     Report     
                                         DEFAULTS: STCLOM      ________   
 ____  1 TN        %MASTER% ___                    SYSTEM      MASTER     
 ____  1 SPM       SYSTEM   ___                    SYSTEM      ________   
 ____  1 SPM       SYSSTC   ___                    SYSSTC      ________   
 ____  1 TNG       STCHI    ___                    SYSSTC      ________   
 ____  1 TNG       STCMD    ___                    STCMDM      ________   
 ____  1 TNG       MONITORS ___                    ________    MONITORS   
 ____  1 TNG       SERVERS  ___                    STCMDM      ________   
 ____  1 TNG       ONLPRD   ___                    STCHIM      ________

There is a rule for ONLPRD (online production), which is a Transaction Name Group (TNG), a group of transaction names.

Option 5, Classification Groups, on the main WLM panel displays

                         Classification Group Menu                        
 Select one of the following options.                                       
 __  1.  Accounting Information Groups   14. Plan Name Groups              
     2.  Client Accounting Info Groups   15. Procedure Name Groups         
     3.  Client IP Address Groups        16. Process Name Groups           
     4.  Client Transaction Name Groups  17. Scheduling Environment Groups 
     5.  Client Userid Groups            18. Subsystem Collection Groups   
     6.  Client Workstation Name Groups  19. Subsystem Instance Groups     
     7.  Collection Name Groups          20. Subsystem Parameter Groups    
     8.  Connection Type Groups          21. Sysplex Name Groups           
     9.  Correlation Information Groups  22. System Name Groups            
     10. LU Name Groups                  23. Transaction Class Groups      
     11. Net ID Groups                   24. Transaction Name Groups       
     12. Package Name Groups             25. Userid Groups                 
     13. Perform Groups                  26. Container Qualifier Groups

Most of these had no definition, but option 24. Transaction Name Groups gave me

                           Group Selection List                Row 1 to 5 of 5
Command ===> ____________________________________________________________     
                                                                              
Qualifier type . . . . . . . : Transaction Name                               
                                                                              
Action Codes: 1=Create, 2=Copy, 3=Modify, 4=Browse, 5=Print, 6=Delete,        
              /=Menu Bar                                                      
                                                    -- Last Change ---        
Action  Name      Description                       User      Date            
  __    MONITORS  Online System Activity monitors   TODD      1999/11/16      
  __    ONLPRD    Online Production Subsystems      IBMUSER   2023/01/10      
  __    SERVERS   Server Address Spaces             TODD      1999/11/16      
  __    STCHI     High STC's                        TODD      1999/11/16      
  __    STCMD     Medium STC's                      TODD      1999/11/16

and these names match what is in the classification rules section above.

Option 3 to modify ONLPRD, gave

                              Modify a Group                   Row 1 to 8 of 8
Command ===> ____________________________________________________________     
                                                                              
Enter or change the following information:                                    
                                                                              
Qualifier type . . . . . . . : Transaction Name                               
Group name . . . . . . . . . : ONLPRD                                         
Description  . . . . . . . . . Online Production Subsystems                   
Fold qualifier names?  . . . . Y  (Y or N)                                    
                                                                              
Qualifier Name  Start  Description                                            
%%%%DBM1        ___    DB2 Subsystems                                         
%%%%MSTR        ___    DB2 Subsystems                                         
%%%%DIST        ___    DB2 Subsystems                                         
%%%%SPAS        ___    DB2 Subsystems                                         
CICS*           ___    CICS Online Systems                                    
IMS*            ___    IMS Online Systems                                     
CSQ*            ___    MQ Series

and we can see that MQ started tasks with names starting with CSQ are in this group.

As this rule is higher in the classification rules list, it takes precedence over any rules I had defined lower down.

Because there was a definition (within the Started classification)

____  1 TNG       ONLPRD   ___                    STCHIM      ________

Started tasks in the group ONLPRD are classified as STCHIM, and this explains why the classification of the MQ address spaces was “wrong”.

I had two options

Change the groups and put MQ in its own group with STCLOM.
Move my CSQ9* specific rules above the group.
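The effect of the rule order can be shown with a toy model. This is only a sketch of the behaviour described above, not how WLM is implemented. It assumes that % matches any one character and * matches any remaining characters, and it ignores sub-rules, start positions, SPM rules and report classes. The function names are mine.

```python
import re

# The ONLPRD Transaction Name Group, as shown in the panel above.
GROUPS = {"ONLPRD": ["%%%%DBM1", "%%%%MSTR", "%%%%DIST", "%%%%SPAS",
                     "CICS*", "IMS*", "CSQ*"]}

# (qualifier type, qualifier name, service class), in the order defined.
RULES = [("TNG", "ONLPRD", "STCHIM"),
         ("TN", "CSQ9WEB", "STCLOM"),
         ("TN", "CSQ9CHIN", "STCLOM"),
         ("TN", "CSQ9ANG", "STCLOM")]


def matches(pattern, name):
    """Assumed wildcards: % is any one character, * is any remaining characters."""
    regex = "".join("." if c == "%" else ".*" if c == "*" else re.escape(c)
                    for c in pattern)
    return re.fullmatch(regex, name) is not None


def classify(job_name, rules, default="STCLOM"):
    """Return the service class of the first rule that matches."""
    for rule_type, qualifier, service_class in rules:
        patterns = GROUPS[qualifier] if rule_type == "TNG" else [qualifier]
        if any(matches(p, job_name) for p in patterns):
            return service_class
    return default


print(classify("CSQ9CHIN", RULES))        # group rule is found first
reordered = RULES[1:] + RULES[:1]         # specific rules moved above the group
print(classify("CSQ9CHIN", reordered))
```

With the original order CSQ9CHIN gets STCHIM from the group. With the specific rules first it gets STCLOM, while other members of the group, such as CICS regions, still get STCHIM.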

One Minute MVS performance – Work Load Manager – looking at WLM reports.

I have a set of blog posts relating to getting started with z/OS performance. This blog post follows on from the overview of WLM, and describes the contents of the reports, how you can tell if work is being delayed, and why it is being delayed.

Real goals from my system

For TSO on my z/OS there are goals

For the first 800 service units (a system-independent measure of CPU usage)
1. 80% of requests to complete within 00:00:00.30
2. Work has importance 2
After this, any work has an execution velocity of 40.

For started tasks with Medium Priority the goals are

Execution velocity of 30
Importance 3

For started tasks with Low Priority the goals are

Discretionary: there are no goals, just do your best

How do I tell what is going on and if the goals have been met?

RMF can display data in near real time (every minute or so).

RMF captures data and produces SMF records which can be processed by RMF and other products.

You can report on

How well the service class did against its goals
How well transactions or work did, from a reporting class.

You could have all CICS transactions in a service class, so they get the same CPU profile etc, but have different reporting classes. You can then monitor CE* transactions and PAY* transactions separately.

You could have a reporting class for work coming in from other systems, depending on the userid.

I set up a reporting class for z/OSMF, and reported on it in the RMF batch report using SYSRPTS(WLMGL(RCPER(ZOSMF))).

One part of the report contained


         z/OS V2R4               SYSPLEX ADCDPL             DATE 06/14/2021           INTERVAL 05.00.003   
                                 RPT VERSION V2R4 RMF       TIME 09.25.00
POLICY=ETPBASE                        REPORT CLASS=ZOSMF                                   PERIOD=1 
 -TRANSACTIONS--  TRANS-TIME HHH.MM.SS.FFFFFF  TRANS-APPL%-----CP-IIPCP/AAPCP-IIP/AAP  ---ENCLAVES--- 
 AVG        1.00  ACTUAL                    0  TOTAL        66.25       64.20  173.99  AVG ENC   0.00 
 MPL        1.00  EXECUTION                 0  MOBILE        0.00        0.00    0.00  REM ENC   0.00 
 ENDED         0  QUEUED                    0  CATEGORYA     0.00        0.00    0.00  MS ENC    0.00 
 END/S      0.00  R/S AFFIN                 0  CATEGORYB     0.00        0.00    0.00 
                                                                                                                
 ----SERVICE----   SERVICE TIME  ---APPL %---  --PROMOTED--  --DASD I/O---  ----STORAGE----  -PAGE-IN RATES- 
 IOC        2366K  CPU  720.505  CP     66.25  BLK    0.000  SSCHRT    0.2  AVG    81420.24  SINGLE      0.0 
 CPU      617333   SRB    0.223  IIPCP  64.20  ENQ    0.000  RESP      0.0  TOTAL  81421.05  BLOCK       0.0 
 MSO      154219   RCT    0.000  IIP   173.99  CRM    0.000  CONN      0.0  SHARED     0.00  SHARED      0.0 
 SRB         191   IIT    0.013  AAPCP   0.00  LCK    0.889  DISC      0.0                   HSP         0.0 
 TOT        3138K  HST    0.000  AAP      N/A  SUP    0.000  Q+PEND    0.0 
 GOAL: EXECUTION VELOCITY 70.0%     VELOCITY MIGRATION:   I/O MGMT  28.3%     INIT MGMT 28.3% 
                                                                                                                
          RESPONSE TIME    EX   PERF  AVG   --EXEC USING%--  -------------- EXEC DELAYS % -----------  
 SYSTEM                    VEL% INDX ADRSP  CPU AAP IIP I/O  TOT IIP CPU                                
 S0W1        --N/A--       28.3  2.5   1.0  8.0 N/A  20 0.0   72  53  19

Key fields:

INTERVAL 05.00.003

This is the length of the reporting interval, here 5 minutes.

POLICY=ETPBASE REPORT CLASS=ZOSMF PERIOD=1

This tells you this is a report class (rather than a service class), the name is ZOSMF, and the data is for period 1. When a service class has more than one criterion, such as high priority for the first 0.5 seconds of CPU and then low priority, it will have multiple periods.

-TRANSACTIONS--
AVG 1.00
MPL 1.00
ENDED 0
END/S 0.00

This says that on average there was one instance running. You can have multiple transactions or jobs in a class. Add up the total duration of all jobs/transactions and divide by the interval to get the average (AVG).

MPL (multiprogramming level) is an advanced topic and describes how many instances were concurrently active.

No jobs/transactions ended in this 5 minute interval, so the ending rate (END/S) is 0 per second.

---APPL %---
CP 66.25
IIPCP 64.20
IIP 173.99
AAPCP 0.00
AAP N/A

This shows the percentage of a CPU engine used over the interval

CP 66.25: 66.25 percent of a GP engine was used.
IIPCP 64.20: 64.20 percent of a GP engine was doing work that could have run on a ZIIP, if there had been spare ZIIP capacity. 66.25 - 64.20 = 2.05 percent of a GP engine was used by work that was not ZIIP eligible.
IIP 173.99: 173.99 percent of a ZIIP engine was used by ZIIP work running on ZIIP engines, so nearly 2 ZIIP engines were being used.
AAPCP 0.00: there was no ZAAP eligible work running on a GP.
AAP N/A: there was no work running on a ZAAP.

The total ZIIP eligible work was 173.99 on ZIIP engines + 64.20 on a GP = 238.19, or almost 2.5 ZIIP engines' worth.
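The arithmetic on the APPL % figures can be written out as below. This is just the sums from the text, with variable names I have chosen, using the values from the report above.

```python
cp = 66.25      # APPL % CP: all work that ran on GP engines
iipcp = 64.20   # APPL % IIPCP: ZIIP eligible work that ran on a GP
iip = 173.99    # APPL % IIP: work that ran on ZIIP engines

gp_only = cp - iipcp            # GP work that was not ZIIP eligible
ziip_eligible = iip + iipcp     # all ZIIP eligible work, wherever it ran

print(f"Not ZIIP eligible: {gp_only:.2f}% of a GP engine")
print(f"ZIIP eligible:     {ziip_eligible:.2f}% of an engine")
```

A large IIPCP value compared with CP, as here, suggests that most of the GP usage would move to ZIIPs if more ZIIP capacity were available.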

It is good to run on ZIIPs where possible, because ZIIPs are cheaper ($$) than GPs, and GPs may be configured to be slower than a ZIIP.

GOAL: EXECUTION VELOCITY 70.0%

The performance goal for this work was defined as Execution Velocity of 70 %.

 
         EX   PERF  AVG   --EXEC USING%--  - EXEC DELAYS % -
 SYSTEM  VEL% INDX ADRSP  CPU AAP IIP I/O  TOT IIP CPU      
 S0W1    28.3  2.5   1.0  8.0 N/A  20 0.0   72  53  19

The achieved execution velocity was 28.3% against a target of 70%.
The performance index was 2.5. For an execution velocity goal the performance index is goal/actual, here 70/28.3. A value of 1 or smaller is good. The value here shows the goal was not met. You need to consider
- Changing the goal for this work so the target goal is what you can achieve on a normal day.
- Changing the importance of the work for when the system is constrained.
- If you change the goal for one set of work it may impact other work, so you need to look at the system as a whole and decide which is your important work.
- Adding more CPUs or ZIIPs. These may not help if the delays are not for CPU; see below.
The average number of address spaces in this class was 1.
EXEC USING%. The APPL % figures above were for true CPU used. WLM samples activity 4 times a second. These figures are percentages of the samples where the work was running or waiting for a resource.
- In 8% of the samples it was using a GP engine. This includes ZIIP work running on a GP.
- In 20% of the samples it was using a ZIIP engine.
- The ratio 8:20 is similar to the ratio of CPU actually used on GP and ZIIP in this period, 66.25:173.99.
EXEC DELAYS %
- The total delay was 72% = 100 - (8 + 20), the “using” samples above.
- For 53% of the samples it was waiting for a ZIIP engine.
- For 19% of the samples it was waiting for a GP engine.
- You can have other delays listed here, for example paging, or your program being capped to limit how much CPU it is allowed.
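The velocity and performance index figures can be cross-checked from the using and delay percentages. A minimal sketch, with function names of my own choosing, using the values from the report line above:

```python
def execution_velocity(using_pct, delay_pct):
    """Velocity from the using and delay sample percentages."""
    return 100 * using_pct / (using_pct + delay_pct)


def velocity_performance_index(goal_velocity, achieved_velocity):
    """Performance index for an execution velocity goal: goal / actual."""
    return goal_velocity / achieved_velocity


using = 8.0 + 20    # EXEC USING%: CPU + IIP
delay = 72          # EXEC DELAYS %: TOT

print(f"velocity          {execution_velocity(using, delay):.1f}")
print(f"performance index {velocity_performance_index(70.0, 28.3):.1f}")
```

The velocity comes out at 28.0 rather than the 28.3 in the report, because the percentages in the report have been rounded.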

Once z/OSMF had started and settled down, there were still delays for IIP (28%). To me this looks like a lumpy workload; perhaps there is a timer which pops and runs multiple threads. There are more threads than ZIIPs, so some have to wait.

Reports for transactional work

I defined a transaction so I could measure the response times (and CPU used) for a service in z/OSMF. A TSO address space is started, and z/OSMF sends a client/server request to the TSO address space. The response time is sub-second so a good candidate to demonstrate WLM for a transaction.

I configured z/OSMF to have

<zosWorkloadManager collectionName="MOPZCET"/>
<wlmClassification>
<httpClassification transactionClass="ZCI3" resource="/zosmf/webispf/*/"/>
</wlmClassification>

The collection name is passed to WLM to determine the service class and report class of the work. The default is the server name.

All ISPF requests (with a URL of /zosmf/webispf/*) were classified as ZCI3.

I then used WLM to configure

a service class ZCI3 with an average response time goal of 00:00:00.010
a classification rule for subsystem type CB, with a rule for CN=MOPZCET and a sub-rule for TC=ZCI3. This gave the service class and report class.

The data in the report had

-TRANSACTIONS--
AVG 0.01
MPL 0.01
ENDED 21
END/S 0.07

21 transactions in 5 minutes is 0.07 a second.

MPL (MultiProgramming Limit) is the target which represents the number of address spaces that must be in the swapped-in state for the service class period to meet its goals. I’ve never used it!

TRANS-TIME HHH.MM.SS.FFFFFF
ACTUAL               140526
EXECUTION            139950
QUEUED                  575

The average time was 0.140 seconds.
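The ending rate and the average time can be checked with a few lines of Python. This is only a sketch. The variable names are mine, and it assumes the TRANS-TIME values follow the HHH.MM.SS.FFFFFF layout in the heading, so that 140526 is a number of microseconds.

```python
interval_seconds = 5 * 60      # 5 minute reporting interval
ended = 21                     # ENDED

end_rate = ended / interval_seconds
print(f"END/S  {end_rate:.2f}")

# With no separators shown, the TRANS-TIME value is taken as microseconds.
actual_us, execution_us, queued_us = 140526, 139950, 575
print(f"ACTUAL {actual_us / 1_000_000:.6f} seconds")
print(f"EXECUTION + QUEUED = {execution_us + queued_us} microseconds")
```

EXECUTION plus QUEUED comes to within a microsecond of ACTUAL, so almost all of the time was spent executing rather than queued.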

GOAL: RESPONSE TIME 000.00.00.010 AVG

That was the specification in WLM (note the specified value of 0.010 is very different to the 0.140 achieved).


          RESPONSE TIME    EX   PERF  AVG   --EXEC USING%--  - EXEC DELAYS % -
 SYSTEM   HHH.MM.SS.FFFFFF VEL% INDX ADRSP  CPU AAP IIP I/O  TOT IIP 
 S0W1     000.00.00.140526 66.7 14.1   0.0  0.0 N/A  18 0.0  9.1 9.1

This shows the average response time was 0.140 seconds, and that the work used 18% on a ZIIP and waited 9% of the time for a ZIIP.

To the right of the data in the report was

--- DELAY % --- 
UNK IDL CRY CNT                 
 64 0.0 0.0 0.0

Which says that 64% of the delay was unknown. This could be

waiting for end user input
waiting for TCP/IP data
the program sent off a request and is waiting for a response.

For example the ISPF transaction in z/OSMF had sent a request to an address space running TSO. This address space processed the request and sent the response back. I am guessing that the 64% delay was waiting for TSO to process the request and send back the response.

You also get a response time profile based on the service class

                              ----------RESPONSE TIME DISTRIBUTION---------- 
   -----TIME------  # TRANS   0    10   20   30   40   50   60   70   80   90   100 
   HH.MM.SS.FFFFFF  IN BUCKET |....|....|....|....|....|....|....|....|....|....| 
<= 00.00.00.014000          0  > 
<= 00.00.00.015000          0  > 
<= 00.00.00.020000          2  >>>>>> 
<= 00.00.00.040000          5  >>>>>>>>>>>>> 
>  00.00.00.040000         14  >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>

This shows that out of the 21 requests, 7 were below 0.040 seconds, and 14 were over 0.040 seconds.

The service class was specified with GOAL: RESPONSE TIME 000.00.00.010 AVG, so this goal is very badly specified. It would be better set to an average of 0.140 seconds.

I changed the service class to a goal of 0.140 seconds and activated it. After I had run some tests the output was

          RESPONSE TIME    EX   PERF  AVG   --EXEC USING%--  - EXEC DELAYS %
 SYSTEM   HHH.MM.SS.FFFFFF VEL% INDX ADRSP  CPU AAP IIP I/O  TOT            
 S0W1     000.00.00.097733  100  0.7   0.0  0.0 N/A  50 0.0  0.0

Which showed no delays

and a response time profile

                                ---RESPONSE TIME DISTRIBUTION--- 
    -----TIME------  --# TRANS  0    10   20   30   40   50   60
    HH.MM.SS.FFFFFF  IN BUCKET  |....|....|....|....|....|....|.
 <= 00.00.00.070000          0  > 
 <= 00.00.00.084000          5  >>>>>>>>>>>>>>> 
 <= 00.00.00.098000          9  >>>>>>>>>>>>>>>>>>>>>>>>>> 
 <= 00.00.00.112000          1  >>>> 
 <= 00.00.00.126000          0  > 
 <= 00.00.00.140000          1  >>>> 
 <= 00.00.00.154000          1  >>>> 
 <= 00.00.00.168000          0  > 
 <= 00.00.00.182000          0  > 
 <= 00.00.00.196000          0  > 
 <= 00.00.00.210000          1  >>>> 
 <= 00.00.00.280000          0  >

An average of 0.10 seconds, with some taking up to 0.210 seconds.
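For a response time goal the performance index is the other way up from a velocity goal: it is actual/goal rather than goal/actual. A quick check against the PERF INDX column in the two reports above (the function name is just for illustration):

```python
def response_time_performance_index(actual_seconds, goal_seconds):
    """Performance index for an average response time goal: actual / goal."""
    return actual_seconds / goal_seconds


# Before: goal 0.010 seconds, achieved 0.140526 seconds.
before = response_time_performance_index(0.140526, 0.010)
# After: goal 0.140 seconds, achieved 0.097733 seconds.
after = response_time_performance_index(0.097733, 0.140)

print(f"before {before:.1f}")   # the report showed 14.1
print(f"after  {after:.1f}")    # the report showed 0.7
```

A value of 1 or smaller still means the goal was met, so the change of goal took the work from badly missing its goal to meeting it, without the work itself changing.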

Real time information

You can get the information in near real time from RMF (or other monitors).

For example for processor delays

            Service  CPU  DLY USG EAppl  ----------- Holding Job(s) ---------
Jobname  CX Class    Type  %   %    %     %  Name      %  Name      %  Name 
IZUSVR1  SO STCHIM   CP     2  35 56.53   91 IZUSVR1    4 JES2MON    2 TCPIP 
                     IIP   94  95 183.1   89 IZUSVR1

This shows that job IZUSVR1

Was delayed for 2% of the time on a GP
Used 35% of the GP engines
Was delayed 94% of the time on a ZIIP
Used 95% of the available ZIIP resource
The jobs holding the GP engines were IZUSVR1 (91%), JES2MON (4%) and TCPIP (2%)
The job holding the ZIIP was IZUSVR1 (89%)

What to do now?

You need to identify the goals of your work, and set sensible goals. This may take several iterations. You also need to understand the priorities of the work, and userid.

Once you have configured your system to report on response times of your business critical work, you can adjust the service classes so your work achieves its goals.

Define reporting classes so you can monitor different groups of work and check that they are meeting their goals.

One Minute MVS performance – Work Load Manager – background

Question: How do you tell if your car has a problem? Answer: You look at the dashboard and see if there is a red light showing. You may not know how to fix it, but you know that you need to get help to fix it.

The aim of this series of blog posts is to show you what to look for in z/OS performance, and how to tell if you have a problem.

I will cover

I’ve written a blog post on how to understand reports from WLM.

Managing workload

40 years ago the pilot of a commercial jet had many knobs and dials to control the performance (and speed) of the aeroplane. These days computers do most of the small tuning; the pilot sets the overall goals, and the computer does the rest.

It is the same with managing workload performance on z/OS. The systems programmer used to individually adjust the performance of jobs and transactions running on z/OS. These days the systems programmer sets the overall goals and the computer does the rest.

The work on z/OS is managed by the WorkLoad Manager (WLM). The systems programmer defines goals like

These CICS transactions should run with an average response time of 1 second.
The trivial TSO commands should run in under 10 milliseconds.
Batch: I don’t care, it can run in the background.

I remember one customer saying that one day when he switched on WLM (so that WLM managed the workload), he noticed that the batch workload finished early. It made everything go faster!

This is because when the system had been manually “tuned”, the CICS transactions were finishing in under half a second (much faster than the requirement of 1 second). WLM worked to the goals, and the CICS transactions executed with a response time of 1 second. This meant there was spare CPU, and more batch workload could be done during the day.

How to monitor work?

For short lived requests, like CICS transactions or TSO commands, the response time is the obvious metric. Typically the response time is under a second, and at the end of a minute you should have many data points to tell you if you are achieving your response time goals.

Long running jobs or started tasks may run for weeks, so the time to run the job is meaningless. It could run slower when the system is busy, or run faster when the system is lightly loaded. It needs a second by second metric to measure progress.

WLM uses the concept of how much the work can be delayed. It uses a metric called execution velocity, which in concept compares the CPU used with the time spent waiting for a resource. In simple terms

100 * CPU used /(CPU used + wait time for a resource).

WLM can periodically check this ratio and adjust the priority of the work to achieve the goals.

If the execution velocity is 100 then it is not delayed for CPU.

If the execution velocity is 1, then if the work used 1 second (or millisecond) of CPU, it is OK for the job to be delayed for I/O or waiting for CPU for about 100 seconds (or milliseconds). The ratio is important, not the absolute values.
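The simple formula, and the amount of delay a velocity goal allows, can be sketched as below. This is the simplified concept from the text, not WLM's real calculation, and the function names are mine.

```python
def execution_velocity(cpu_used, wait_time):
    """Simplified execution velocity: 100 * used / (used + waiting)."""
    return 100 * cpu_used / (cpu_used + wait_time)


def allowed_wait(cpu_used, velocity):
    """Wait time that still meets the velocity goal, for a given CPU used."""
    return cpu_used * (100 - velocity) / velocity


print(execution_velocity(1, 0))     # never delayed
print(execution_velocity(1, 99))    # delayed 99 units for each unit of CPU
print(allowed_wait(1, 1))           # velocity goal of 1
print(allowed_wait(50, 10))         # 50 seconds of CPU, velocity goal of 10
```

Strictly, a velocity of 1 allows 99 units of delay for each unit of CPU used, which is the “about 100” in the text.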

How to work out which work to dispatch?

Every 10 seconds WLM looks at the data to decide which service classes need more or less CPU.

WLM looks at all the work, and whether it is meeting the goals for its service class (the definition of the goals).

If all the work is within its goals, pick any waiting work to dispatch.
If any work is not within its goal, adjust the dispatching priorities. Start with work with Importance 1; when this is within its goals, look at work with Importance 2, and so on.
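As a toy model of the two rules above (and only that; the real WLM algorithms are far more involved), the choice of which service class to help first could look like this. The class names and figures are made up, and a performance index above 1 is taken to mean the goal is being missed.

```python
def next_class_to_help(service_classes):
    """Pick the most important service class that is missing its goal.

    Each entry is (name, importance, performance index). Importance 1 is
    the most important.
    """
    missing = [sc for sc in service_classes if sc[2] > 1.0]
    if not missing:
        return None     # everything is within its goals
    return min(missing, key=lambda sc: sc[1])[0]


# Made-up figures for illustration only.
classes = [("BATCH", 4, 3.0), ("ONLINE", 1, 0.8), ("TSO", 2, 1.4)]
print(next_class_to_help(classes))
print(next_class_to_help([("ONLINE", 1, 0.8)]))
```

Here ONLINE is the most important work but is meeting its goal, so TSO (Importance 2) is helped before BATCH (Importance 4), even though BATCH is missing its goal by more.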

What can you configure

You can configure the system with goals like

CICS transactions should take 1 second elapsed time to execute.
Quick TSO commands using less than 0.5 seconds of CPU have high velocity.
Slow TSO commands using more than 0.5, but less than 5, seconds of CPU have Importance 3.
Expensive TSO commands using more than 5 seconds of CPU have low priority
Colin’s TSO userid always gets high priority regardless of the commands.
Batch jobs with this accounting information, can run with high velocity
Long batch jobs, or those batch jobs using more than 1 second of CPU, have low priority.

Work can get tracked across the system, and if WLM detects that CICS transactions are slowing down, then when CICS issues a DB2 request in a different LPAR in the sysplex, it makes sure the request in DB2 has a high enough priority to keep the response time goals. WLM can also prioritise I/O so that the I/O for one transaction takes precedence over the I/O for a batch job.

The systems programmer creates a few broad categories of work, and specifies the goals of the service class. These service classes control the priority of work.

Use the WLM redbook for guidance on defining WLM service classes.

Service classes define the goals.

You have reporting classes for groups of similar jobs or transactions, to report WLM information on them. So although a group of work has the same service class, you can report on it in different ways, for example by transaction, or by userid.

You can define CICS, IMS, or Liberty as a server, and transactions/work within the server get WLM classified. So for job CICSA, the transactions PAY1,PAY2,PAY3 have high priority; for z/OSMF, userid COLIN has high priority.

What class is my work in?

You can display which service class a job is in using the D A,jobname operator command. For example it gave

WKL=STARTED SCL=STCLOM .

The service class is STCLOM.
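If you capture the command output, for example from the system log, the workload and service class can be picked out with a regular expression. The sketch below only assumes the WKL= and SCL= fragment shown above; it does not assume anything else about the layout of the D A output, and the function name is mine.

```python
import re


def service_class_from_display(output):
    """Return (workload, service class) from D A,jobname output, or None."""
    match = re.search(r"\bWKL=(\S+)\s+SCL=(\S+)", output)
    return match.groups() if match else None


print(service_class_from_display("WKL=STARTED SCL=STCLOM ."))
```

It returns None when the fields are not found, so the caller can tell a missing job from a job with an unexpected class.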

You can use SDSF DA, and use the column SrvClass. (You need to start RMF, then go into SDSF to display the Srvclass and other WLM related parameters).

You can change the service class of a job by using the operator command

RESET IZUSVR1,SRVCLASS=STCMDM

(Note which way round the letters are: the job name is IZUSVR1, with SVR, but the keyword is SRVCLASS, with SRV.)

or, if you are authorised, overtype the field in SDSF, or use the z/OSMF WLM plugin.

To change it permanently you’ll need to change the WLM definitions.

More details of how it works

I found the WLM redbook useful.

I described above that the execution velocity was 100 * CPU used /(CPU used + wait time for a resource).

The concept is correct, but the implementation is different. If my job had used 1000 seconds of CPU since it started, a velocity calculated over the whole life of the job would not help in seeing its behaviour over the last few minutes, as it would be insensitive to recent changes.

Every 250 milliseconds (4 times a second) WLM looks at every job/transaction in the system. It then updates internal control blocks for each Service Class and Report Class, and increments a count in a table depending on what the work is doing.

executing – add 1 to the active (or using) CPU count
transferring data to a device (connect time) – add 1 to the active (or using) I/O count
waiting in z/OS to start an I/O – add 1 to the delayed for I/O count
being paged in – add 1 to the delayed for “page in” count
etc
waiting for the end user to enter data – do nothing.
waiting for TCPIP data – do nothing.

Execution velocity = 100 * Total active samples / (Total active samples + Total delayed samples).

If during a 25 second period the transaction was

using CPU, in 20 samples,
transferring data to disk, in 10 samples
waiting to start an I/O, in 25 samples
waiting for the end user to type some data, in 45 samples.

From this we can see…

The count of active samples is 20 + 10
The count of delayed is 25 samples.
45 samples are not used.

Execution velocity = 100 * (20 + 10) / ((20 + 10) + 25) = 55.

An execution velocity of 100 means that whenever the job was sampled, it was always either dispatched and running, or transferring data to I/O.

An execution velocity also gives an idea of elapsed time. If we expect the job to use 50 seconds of CPU, and it has a velocity of 10, then we expect the job to run in about 500 seconds. If it used 50 seconds of CPU, and was transferring data (connect time) for 20 seconds, the execution velocity would be 100 * (50 + 20) / ((50 + 20) + 450) = 13%, which is close enough to a velocity of 10%.
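The sampling approach can be sketched in a few lines. The state names are made up for illustration, and only a few of the using and delay states are modelled; the sample counts are the ones from the 25 second example above.

```python
from collections import Counter

USING = {"cpu", "io_connect"}
DELAYED = {"io_queue", "page_in"}
# Anything else, such as waiting for the end user, is not counted.


def execution_velocity(samples):
    """100 * using samples / (using + delayed samples)."""
    counts = Counter(samples)
    using = sum(counts[state] for state in USING)
    delayed = sum(counts[state] for state in DELAYED)
    return 100 * using / (using + delayed)


# 25 seconds at 4 samples a second = 100 samples.
samples = (["cpu"] * 20 + ["io_connect"] * 10
           + ["io_queue"] * 25 + ["end_user"] * 45)
print(f"{execution_velocity(samples):.1f}")
```

The 45 samples spent waiting for the end user have no effect on the result, which is 54.5 before rounding. The sketch does not handle the case where there are no using or delayed samples at all.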

Real goals from my system

For TSO on my z/OS there are goals

For the first 800 service units (a system-independent measure of CPU usage)
1. 80% of requests to complete within 00:00:00.30
2. Work has importance 2
After this, any work has an execution velocity of 40.
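
To show what a percentile response time goal means, here is a small Python sketch. The response times are invented for illustration, and meets_percentile_goal is a hypothetical helper, not a WLM interface.

```python
def meets_percentile_goal(response_times, goal_seconds=0.30, goal_percent=80):
    """True if at least goal_percent of the requests finished within goal_seconds."""
    if not response_times:
        return False  # nothing ended, so there is nothing to measure
    within = sum(1 for t in response_times if t <= goal_seconds)
    return 100 * within / len(response_times) >= goal_percent

# Made-up response times in seconds: 8 of the 10 are within 0.30 seconds.
times = [0.05, 0.10, 0.12, 0.20, 0.25, 0.28, 0.29, 0.30, 0.90, 2.00]
print(meets_percentile_goal(times))  # True
```

The two slow requests do not stop the goal being met, which is the point of using a percentage rather than an average.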

For started tasks with Medium Priority the goals are

Execution velocity of 30
Importance 3

For started tasks with Low Priority the goals are

Discretionary – there are no goals – just do your best

The bear traps when using enclaves

I hit several problems when trying to use the enclave support.

In summary

The functions to set up and use an enclave are available from C, but the functions to query and display usage are not available from C (and so not available from Java).
Some functions caused an infinite loop because they overwrote the save area.
Not all classify functions are available in C. For example ClientIPAddr
I had problems in 64 bit mode.
Various documentation problems
It is not documented that you need to pass the connection token using __server_classify(_SERVER_CLASSIFY_CONNTKN, (char *) connToken). Without it you get errno2=0x0330083B: Home address space does not own the connect token from the input parameter list.
You can query the CPU used by your enclave using the IWMEQTME macro (in supervisor state!). I had to specify CURRENT_DISP=YES to cause the dispatcher to be called to update the CPU figures. By default the CPU usage figures are updated at the end of a dispatch cycle. On my lightly used system, my transactions were running without being redispatched, and so the CPU "used" was reported as 0.

In more detail…

Minimum functionality for C programs.

You cannot obtain the CPU used by the enclaves from a C program, as the functions are not defined. I had to write my own assembler code to call the assembler macros to obtain the information. Some of these macros require supervisor state.

Many macros clobber the save area

Many macros use a program call to execute a function. Others, such as IWMEQTME, use a BASR instruction to branch to the service, which then does a standard save of the registers into the save area addressed by register 13. This means that your code needs to provide its own standard save area. Without this, my caller's save area was used and the registers saved in it were overwritten, so the branch back at the end of my code just branched to the instruction after the macro.

Instead of a function like

EDEL     RMODE  ANY 
EDEL     AMODE  31 
EDEL     CSECT 
          USING *,12 
          STM  14,12,12(13) 
          LR   12,15 
          L    6,0(1)  the work area  
          L    2,4(1)  ADDRESS OF THE passed data              
          IWM4EDEL ETOKEN=0(2),MF=(E,0(6),COMPLETE),                   XX 
                CPUTIME=8(2),ZAAPTIME=16(2),ZIIPTIME=24(2),            XX 
                RSNCODE=32(2),RETCODE=36(2) 
          LM   14,12,12(13) 
          SR   15,15 
          BR   14

I needed to add in code to create a save area, for example with a different macro

QCPU     RMODE  ANY 
QCPU     AMODE  31 
QCPU     CSECT 
** CAUTION THE IWMEQTME CORRUPTS SAVE AREA SO PROGRAM NEEDS ITS OWN
** SAVE AREA 
      USING *,12 
      STM  14,12,12(13) 
      LR   2,1 
      LR   12,15 
      LA    0,WORKLEN 
      STORAGE OBTAIN,LENGTH=(0) 
      ST     1,8(,13) FORWARD CHAIN IN PREV SAVE AREA 
      ST     13,4(,1) BACKWARD CHAIN IN NEXT SAVE AREA 
      LR     13,1     SET UP NEW SAVE AREA/REENTRANT WORKAREA 
      L    2,0(2)  ADDRESS OF THE CPUTIME 
      IWMEQTME CPUTIME=8(2),ZAAPTIME=16(2),ZIIPTIME=24(2),          X 
            CURRENT_DISP=YES,                                       X 
            RSNCODE=4(2),RETCODE=0(2),MF=(E,32(2),COMPLETE) 
      LR   3,15 
* free the register save area
      LR     1,13               ADDRESS TO BE RELEASED 
      L     13,4(,13)          ADDRESS OF PRIOR SAVE AREA 
      LA    0,WORKLEN           LENGTH OF STORAGE TO RELEASE 
      STORAGE RELEASE,           RELEASE REENTRANT WORK AREA        X 
            ADDR=(1),            ..ADDRESS IN R1                    X 
            LENGTH=(0)           ..LENGTH IN R0 
      L    14,12(13) 
      LR  15,3 
      LM   0,12,20(13) 
      SR   15,15 
      BR   14

Problems using a 64 bit program

I initially had my C program in 64 bit mode. This caused problems when I wrote some stub code to use the assembler interface. The assembler macros are supported in AMODE 31, but my program and its storage areas were 64 bit.

Various documentation problems

It is not documented that you need to pass the connection token using __server_classify(_SERVER_CLASSIFY_CONNTKN, (char *) connToken). Without it you get errno2=0x0330083B: Home address space does not own the connect token from the input parameter list.
The documentation for _SERVER_CLASSIFY_SUBSYSTEM_PARM says: Set the transaction subsystem parameter. When specified, value contains a NULL-terminated character string of up to 255 characters containing the subsystem parameter being used for the __server_pwu() call. This applies to __server_classify() as well as __server_pwu(). The same applies to _SERVER_CLASSIFY_TRANSACTION_CLASS, _SERVER_CLASSIFY_TRANSACTION_NAME, and _SERVER_CLASSIFY_USERID.
Getting the report class and service class back from __server_classify
1. It is _SERVER_CLASSIFY_SRVCLSNM not _SERVER_CLASSIFY_SERVCLSNM.
2. You use _SERVER_CLASSIFY_RPTCLSNM@, _SERVER_CLASSIFY_SERVCLS@, _SERVER_CLASSIFY_SERVCLSNM@ without the @ at the end. I think this is meant to imply these are pointers.
3. They did not work for me. I could not see when the fields are available. The classify work is only done during the CreateWorkUnit() request. I requested the fields both before and after this function, and only got back a string of hex 0s.

Using enclaves in a Java program – capturing elapsed and CPU time used by a Java transaction.

I've blogged about using enclaves from a C program. There is an interface from Java which uses this C interface.

It is relatively easy to use enclave services from a Java program, as there are Java classes for most of the functions, available from the JZOS toolkit. For example the WorkloadManager class is defined here.

Below is a program I used to get the Work Load Manager(WLM) services working.

import java.util.concurrent.TimeUnit;
import com.ibm.jzos.wlm.ServerClassification;
import com.ibm.jzos.wlm.WorkUnit;
import com.ibm.jzos.wlm.WorkloadManager;
public class main
{
  // run it with /usr/lpp/java/J8.0_64/bin/java main
  public static void main(String[] args) throws Exception  
  {
    WorkloadManager wlmToken = new WorkloadManager("JES", "SM3");
    ServerClassification serverC = wlmToken.createServerClassification();   
    serverC.setTransactionName("TCI3");
    for ( int j = 0;j<1000;j++)
    {
      WorkUnit wU = new WorkUnit(serverC, "MAINCP");
      wU.join();
      float f;
      for (int i = 0;i<1000000;i++) 
      {       
          f = i * i * 2; // arithmetic to use some CPU
         TimeUnit.MICROSECONDS.sleep(20*1000); // 20 milliseconds
      }
      wU.leave();
      wU.delete(); // end the workload
   }
  wlmToken.disconnect();
  }
}

The WLM statements are explained below.

WorkloadManager wlmToken = new WorkloadManager("JES", "SM3");

This connects to the Work Load Manager and returns a connection token. This needs to be done once per JVM. You can use any relevant subsystem type, I used JES, and a SubsystemInstance (SI) of SM3. As a test, I created a new subsystem category in WLM called DOG, and used that. I defined ServerInstance SI with a value of SM3 within DOG and it worked.

z/OS uses subsystems such as JES for jobs submitted into JES2, and STC for started tasks.

ServerClassification serverC = wlmToken.createServerClassification();

If your application is going to classify the transaction to determine the WLM service class and reporting class you need this. You create it, then add the classification criteria to it, see the following section.

Internally this passes the connection token wlmToken to the createServerClassification function.

serverC.setTransactionName("TCI3");

This passes information to WLM to determine the best service class and reporting class. Within Subsystem CAT, Subsystem Instance SM1, I had a sub rule TransactionName (TN) with a value TCI3. I defined the service class and a reporting class.

WorkUnit wU = new WorkUnit(serverC, "MAINCP");

This creates the Independent (business transaction) enclave. I have not seen the value MAINCP reported in any reports. This invokes the C run time function CreateWorkUnit(). The CreateWorkUnit function requires a STCK value of when the work unit started. The Java code does this for you and passes the STCK through.

wU.join();

This connects the current task to the enclave, and any CPU it uses will be recorded against the enclave.

wU.leave();

Disconnect the current task from the enclave. After this call any CPU used by the thread will be recorded against the address space.

wU.delete();

The Independent enclave(Business transaction) has finished. WLM records the elapsed time and resources used for the business transaction.

wlmToken.disconnect();

The program disconnects from WLM.

Reporting class output.

I used RMF to print the SMF 72 records for this program. The Reporting class for this program had

-TRANSACTIONS--  TRANS-TIME HHH.MM.SS.FFFFFF 
AVG        0.29  ACTUAL                36320 
MPL        0.29  EXECUTION             35291 
ENDED       998  QUEUED                 1028 
END/S      8.31  R/S AFFIN                 0 
#SWAPS        0  INELIGIBLE                0 
EXCTD         0  CONVERSION                0 
                 STD DEV               18368 
                                             
----SERVICE----   SERVICE TIME  ---APPL %--- 
IOC           0   CPU   12.543  CP      0.01 
CPU       10747   SRB    0.000  IIPCP   0.01 
MSO           0   RCT    0.000  IIP    10.44 
SRB           0   IIT    0.000  AAPCP   0.00 
TOT       10747   HST    0.000  AAP      N/A

From this we can see that for the interval

998 transactions ended. (Another report interval had 2 transactions ending)
the response time was an average of 36.3 milliseconds
a total of 12.543 seconds of CPU was used.
it spent 10.44 % of the time on a ZIIP.
0.01 % of the time it was executing ZIIP eligible work on a CP as there was no available ZIIP.
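
The figures in the report can be cross-checked with some simple arithmetic. The sketch below uses only values from the report above; treating ENDED divided by END/S as the interval length is my assumption, as is comparing CPU seconds over the interval with the APPL % columns.

```python
# Values copied from the RMF report above.
ended = 998                  # ENDED
end_per_second = 8.31        # END/S
cpu_seconds = 12.543         # SERVICE TIME CPU
actual_seconds = 0.036320    # ACTUAL, in HHH.MM.SS.FFFFFF format

# Assumption: the interval length is the number ended divided by the end rate.
interval_seconds = ended / end_per_second
cpu_ms_per_transaction = 1000 * cpu_seconds / ended
cpu_percent_of_interval = 100 * cpu_seconds / interval_seconds

print(round(interval_seconds))            # 120
print(round(actual_seconds * 1000, 1))    # 36.3
print(round(cpu_ms_per_transaction, 1))   # 12.6
print(round(cpu_percent_of_interval, 2))  # 10.44
```

The interval comes out at about two minutes, each transaction used about 12.6 milliseconds of CPU, and the CPU used as a percentage of the interval is close to the CP plus IIP figures (0.01 + 10.44) in the APPL % column.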

Additional functions.

The functions below

ContinueWorkUnit – for dependent enclave
JoinWorkUnit – as before
LeaveWorkUnit – as before
DeleteWorkUnit – as before

can be used to record CPU against the dependent (Address space) enclave. There is no WLM classify for a dependent enclave.

Java threads and WLM

A common application pattern is to use connection pooling. For example the connect/disconnect to a database or MQ is expensive. If you have a pool of threads which connect and stay connected, an application can request a thread and get one which has already been connected to the resource manager.

It should be a simple matter of changing the interface from

connectionPool.getConnection()

to

connectionPool.getConnection(WorkUnit wU)
{ connection = connectionPool.getConnection()
 connection.join(wU)
}

and add a connection.leave(wU) to the releaseConnection.