One Minute MVS performance – Work Load Manager – background

Question: How do you tell if your car has a problem? Answer: You look at the dashboard and see if there is a red light showing. You may not know how to fix it – but you know that you need to get help to fix it.

The aim of this series of blog posts is to show you what to look for in z/OS performance, and how to tell if you have a problem.

I will cover

I’ve written a blog post on how to understand reports from WLM.

Managing workload

40 years ago the pilot of a commercial jet had many knobs and dials to control the performance (and speed) of the aeroplane. These days computers do most of the small tuning; the pilot sets the overall goals, and the computer does the rest.

It is the same with managing workload performance on z/OS. The systems programmer used to individually adjust the performance of jobs and transactions running on z/OS. These days the systems programmer sets the overall goals and the computer does the rest.

The work on z/OS is managed by the WorkLoad Manager (WLM). The systems programmer defines goals like:

  • These CICS transactions should run with an average response time of 1 second.
  • The trivial TSO commands should run in under 10 milliseconds.
  • Batch – I don’t care… it can run in the background.

I remember one customer saying that the day he switched on WLM (so that WLM managed the workload), he noticed that the batch workload finished early – it made everything go faster!

This is because when the system had been manually “tuned”, the CICS transactions were finishing in under half a second (much faster than the requirement of 1 second). WLM worked to the goals, and the CICS transactions executed with a response time of 1 second. This meant there was spare CPU, and more batch workload could be done during the day.

How to monitor work?

For short-lived requests, like CICS transactions or TSO commands, the response time is the obvious metric. Typically the response time is under a second, and at the end of a minute you should have many data points to tell you whether you are achieving your response time goals.

Long running jobs or started tasks may run for weeks, so the time to run the job is meaningless. It could run slower when the system is busy, or faster when the system is lightly loaded. WLM needs a second-by-second metric to measure progress.

WLM uses the concept of how much the work can be delayed. It uses a metric called execution velocity, which in concept is the ratio of the time the work was using resources to the total of using time plus time delayed waiting for a resource. In simple terms:

100 * CPU used / (CPU used + wait time for a resource)

WLM can periodically check this ratio and adjust the priority of the work to achieve the goals.

If the execution velocity is 100 then the work is never delayed, for CPU or for anything else.

If the execution velocity is 1, then if the work used 1 second (or millisecond) of CPU, it is acceptable for the job to be delayed – waiting for I/O or for CPU – for 100 seconds (or milliseconds). The ratio is important – not the absolute values.
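
As a rough sketch in Python (my illustration of the arithmetic, not WLM code):

# A minimal sketch of the execution velocity calculation.
# cpu_used and wait_time must be in the same units (seconds, milliseconds...);
# only the ratio matters, not the absolute values.
def execution_velocity(cpu_used: float, wait_time: float) -> float:
    return 100.0 * cpu_used / (cpu_used + wait_time)

print(execution_velocity(1, 0))    # 100.0 - the work was never delayed
print(execution_velocity(1, 100))  # ~1.0  - delayed for 100x the CPU it used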

How does WLM work out which work to dispatch?

Every 10 seconds WLM looks at the data to decide which service classes need more or less CPU.

WLM looks at all the work, and checks whether each piece is meeting the goals defined for its service class.

  1. If all the work is within its goals, pick any waiting work to dispatch.
  2. If any work is not within its goal, adjust the dispatching priorities, as in the sketch below. Start with work with Importance 1; when this is within its goals, look at work with Importance 2, and so on.
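
As a conceptual sketch in Python (the names are invented, and the real WLM algorithm is far more sophisticated):

from dataclasses import dataclass

# Conceptual sketch only - not WLM's real logic.
@dataclass
class ServiceClass:
    name: str
    importance: int       # 1 = most important
    meeting_goal: bool
    priority: int

def adjust(service_classes):
    missing = [sc for sc in service_classes if not sc.meeting_goal]
    if not missing:
        return                # all goals met: dispatch any waiting work
    # Help the most important work that is missing its goal first.
    neediest = min(missing, key=lambda sc: sc.importance)
    neediest.priority += 1    # give it more access to CPU next interval

classes = [ServiceClass("CICSHIGH", 1, True, 10),
           ServiceClass("TSOMED", 2, False, 8)]
adjust(classes)               # TSOMED gets a priority boost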

What can you configure?

You can configure the system with goals like the following (a conceptual sketch in Python follows the list):

  1. CICS transactions should take 1 second elapsed time to execute.
  2. Quick TSO commands, using less than 0.5 seconds of CPU, have high velocity.
  3. Slow TSO commands, using more than 0.5 but less than 5 seconds of CPU, have Importance 3.
  4. Expensive TSO commands, using more than 5 seconds of CPU, have low priority.
  5. Colin’s TSO userid always gets high priority regardless of the commands.
  6. Batch jobs with this accounting information can run with high velocity.
  7. Long batch jobs, or those batch jobs using more than 1 second of CPU, have low priority.
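
Conceptually the classification is a set of rules matched in order. Here is a hypothetical sketch in Python, loosely following the list above (the service class names and thresholds are invented; in reality WLM classifies work when it arrives, and uses service class periods to change the goal as work consumes CPU):

# Hypothetical classification rules, loosely following the list above.
def classify(work_type: str, userid: str = "", cpu_seconds: float = 0.0) -> str:
    if work_type == "TSO" and userid == "COLIN":
        return "TSOHIGH"          # Colin's userid always gets high priority
    if work_type == "TSO":
        if cpu_seconds < 0.5:
            return "TSOQUICK"     # quick commands: high velocity
        if cpu_seconds < 5:
            return "TSOMED"       # slow commands: Importance 3
        return "TSOSLOW"          # expensive commands: low priority
    if work_type == "BATCH":
        return "BATLOW" if cpu_seconds > 1 else "BATMED"
    return "OTHER"

print(classify("TSO", userid="COLIN"))   # TSOHIGH
print(classify("TSO", cpu_seconds=2))    # TSOMED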

Work can be tracked across the system, and if WLM detects that CICS transactions are slowing down, then when CICS issues a DB2 request in a different LPAR in the sysplex, WLM makes sure the request in DB2 has a high enough priority to meet the response time goals. WLM can also prioritise I/O, so that the I/O for one transaction takes precedence over the I/O for a batch job.

The systems programmer creates a few broad categories of work, and specifies the goals of the service class. These service classes control the priority of work.

Use the WLM redbook for guidance on defining WLM service classes.

Service classes define the goals.

You can define reporting classes for groups of similar jobs or transactions, to report WLM information on them. So although a group of work has the same service class, you can report on it in different ways, for example by transaction, or by userid.

You can define CICS, IMS, or Liberty as a server, and the transactions/work within the server get WLM classified. So for job CICSA, the transactions PAY1, PAY2, PAY3 have high priority; for z/OSMF, userid COLIN has high priority.

What class is my work in?

You can display which service class a job is in using the D A,jobname operator command; for example, on my system it gave

WKL=STARTED SCL=STCLOM .

The service class is STCLOM.

You can use SDSF DA, and look at the column SrvClass. (You need to start RMF, then go into SDSF, to display the SrvClass and other WLM-related parameters.)

You can change the service class of a job by using the operator command

RESET IZUSVR1,SRVCLASS=STCMDM

(Note the order of the operands: the jobname IZUSVR1 comes first, then the service class, SRVCLASS=….)

or, if you are authorised, by overtyping the field in SDSF, or from the z/OSMF WLM plugin.

To change it permanently you’ll need to change the WLM definitions.

More details of how it works

I found the WLM redbook useful.

I described above that the execution velocity was 100 * CPU used / (CPU used + wait time for a resource).

The concept is correct – but the implementation is different. If my job had used 1000 seconds of CPU since it started, that total does not help me see its behaviour over the last few minutes, as the execution velocity would be insensitive to recent changes.

Every 250 milliseconds (4 times a second) WLM looks at every job/transaction in the system. It then updates internal control blocks for each Service Class and Report Class, incrementing a count depending on what the work was doing when it was sampled. If the work was:

  • executing – add 1 to the active (or using) CPU count
  • transferring data to a device (connect time) – add 1 to the active (or using) I/O count
  • waiting in z/OS to start an I/O – add 1 to the delayed-for-I/O count
  • being paged in – add 1 to the delayed-for-“page in” count
  • etc.
  • waiting for the end user to enter data – do nothing.
  • waiting for TCP/IP data – do nothing.

Execution velocity = 100 * Total active samples / (Total active samples + Total delayed samples)

If during a 25 second period (100 samples) the transaction was

  • using CPU, in 20 samples,
  • transferring data to disk, in 10 samples,
  • waiting to start an I/O, in 25 samples,
  • waiting for the end user to type some data, in 45 samples.

From this we can see…

  • The count of active samples is 20 + 10 = 30.
  • The count of delayed samples is 25.
  • The 45 samples waiting for the end user are not counted.

Execution velocity = 100 * ((20 + 10) / ((20 + 10) + 25)) = 55.

An execution velocity of 100 means that whenever the job was sampled, it was always either dispatched and running, or transferring data to a device.

You can use the execution velocity to estimate elapsed time. If we expect the job to use 50 seconds of CPU, and it has a velocity of 10, then we expect the job to run in about 500 seconds. If it used 50 seconds of CPU, and was transferring data (connect time) for 20 seconds, the execution velocity would be 100 * (50 + 20) / ((50 + 20) + 450) = 13%, which is close enough to the 10% velocity.
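
A sketch of the sample-based calculation in Python, using the numbers from the 25-second example above:

# States seen in 100 samples (25 seconds at 4 samples per second).
samples = {
    "using CPU":            20,   # active
    "transferring data":    10,   # active (connect time)
    "waiting to start I/O": 25,   # delayed
    "waiting for end user": 45,   # not counted at all
}

active  = samples["using CPU"] + samples["transferring data"]
delayed = samples["waiting to start I/O"]

velocity = 100 * active / (active + delayed)
print(round(velocity))   # 55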

Real goals from my system

For TSO on my z/OS the goals are:

  1. For the first 800 service units (a system-independent measure of CPU usage):
    1. 80% of requests to complete within 00:00:00.30.
    2. The work has Importance 2.
  2. After this, any work has an execution velocity goal of 40.

For started tasks with Medium Priority the goals are

  1. Execution velocity of 30
  2. Importance 3

For started tasks with Low Priority the goals are

  1. Discretionary – there are no goals – just do your best.

One Minute MVS performance – DASD

Question: How do you tell if your car has a problem? Answer: You look at the dashboard and see if there is a red light showing. You may not know how to fix it – but you know that you need to get help to fix it.

The aim of this series of blog posts is to show you what to look for in z/OS performance, and how to tell if you have a problem.

I will cover

For some of these you need data from z/OS. This post describes how to get the SMF data, and format it using RMF.

DASD has changed in 40 years

40 years ago “disk storage” was on huge rotating disks, and you had to carefully manage where you put your data sets – both which disk, and whereabouts on the disk. For example, people would put a hot data set in the “centre” of the disk to minimise the time to move the heads.

For the last 20 years people have used the term “storage”, because most I/O activity goes to cache in the disk controller, and the disk controller writes the data out to PC-sized disks – which in turn may be solid state, with no moving parts.

A pictorial view of disks

  • You have the processor running z/OS.
  • Plugged into the side of the processor is the I/O adapter.
  • Plugged into this I/O adapter are a lot of channels (think optical fibre cables).
  • These cables can be plugged into a switch – think of a plug board or telephone exchange. This allows the channels from two processors to be plugged into the switch, with one cable down to the storage controller. You could avoid a switch and have cables directly from the processor to the storage controller, but each processor would need its own set of cables.
  • The storage controller manages all of the I/O.
    • It has a lot of cache, so most I/O may go to the cache. During a read, the storage controller will read from the disks if the data is not in the cache.
    • It has many PC-type disks. These disks could be solid state, or rotating disks.
    • If you have mirrored disks, the storage controller talks to a remote storage controller.
  • Within each channel are many logical sub-channels. Each disk has at least one sub-channel allocated to it, and can have more than one. There can be a pool of sub-channels which are used as needed to allow parallel I/O to a disk.

The I/O journey

  • Your application wants to read the first record of a file.
  • Once the file has been opened, the application can issue the read.
  • z/OS knows where the data set is on disk (e.g. volid A4USR1, cylinder 200, track 4).
  • z/OS builds up a list of commands (such as locate disk, locate cylinder 200, locate track 4, read data, read data, read data) and issues the Start Subchannel request, passing the list of I/O commands.
  • This is queued to the I/O adapter.
  • The original application is suspended until the I/O is complete.
  • The I/O adapter looks for a free sub-channel for the disk, or gets one from the sub-channel pool.
  • The I/O adapter takes the list of commands, and executes them one at a time.
  • When the I/O adapter has finished the list of commands, it sends an interrupt to the mainframe saying “this subchannel has finished”.
  • z/OS wakes up, looks at the interrupt, and resumes the application.

Today there are three areas where you can get delays; you need to be an expert if you want to look at more detail.

  1. Waiting in the I/O adapter before being able to get a sub-channel. This is known as IOSQ – I/O Subsystem Queueing.
  2. Establishing the connection from the processor to the storage controller.
  3. Transferring the data – the connect time.

This is complicated by being able to use disks 50 km away, which adds to the delay time.

RMF Reports

Look in the RMF MFR000… report for the section D I R E C T A C C E S S D E V I C E A C T I V I T Y. (I search for IOSQ.)


                  DEVICE   AVG  AVG   AVG  AVG  AVG   AVG  AVG  AVG    %      %    
 VOLUME PAV  LCU  ACTIVITY RESP IOSQ  CMR  DB   INT   PEND DISC CONN   DEV    DEV  
 SERIAL   1       RATE     TIME TIME  DLY  DLY  DLY   TIME TIME TIME   CONN   UTIL 
 A4RES1   1       102.896  .044 .003  .001 .000       .004 .000 .036   0.38   0.38 
 A4RES2   1        27.293  .036 .000  .001 .000       .003 .000 .032   0.09   0.09 
 USER00   1        25.331  .031 .003  .001 .000       .004 .000 .024   0.06   0.06 
 A4SYS1   1       365.102  .026 .005  .001 .000       .004 .000 .017   0.62  24.52 

Key fields

  1. VOLUME SERIAL, such as A4RES1, is the volid of the disk.
  2. PAV – I’ll mention this below.
  3. DEVICE ACTIVITY RATE – how many requests (Start Subchannel) from z/OS per second.
  4. AVG RESP TIME – the average response time, in milliseconds.
  5. AVG IOSQ TIME – how long the request had to wait in z/OS and the I/O adapter before being sent down to the storage controller.

The times are in milliseconds.

There are often thousands of volumes in a z/OS environment; some are heavily used, some are not used at all. See below for how to find the hot volumes.

I typically look at the volumes with the highest I/O. If the hot volumes have good response times, the not-so-hot ones should be OK.

If you think of the sub-channel connection between the mainframe and the volid in the storage controller, there can only be one I/O request at a time per sub-channel. You can have multiple connections down to a volume; these are known as PAVs, or Parallel Access Volumes. The PAV column is the average number of sub-channels in use.

The first field to look at is the IOSQ. This is the time between z/OS starting the request and the I/O being sent to the storage controller. This should be small – tens of microseconds (0.0xx in the report above). If this value is larger than that, you need to speak to your storage manager or z/OS systems programmer.

The second field to look at is the % DEV UTIL: how busy the connection to the storage controller was. A value of 100% means that it was running flat out. If the utilisation is around 70-80% it may be OK – just something to note. More PAVs can increase throughput for a busy disk.

The next figure to look at is the RESP TIME. This is the response time the application sees. For local disks, response times of under 1 millisecond are OK. If you have remote disks and synchronous I/O, the response time will be longer.
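
The response time is roughly the sum of the component times in the report: RESP ≈ IOSQ + PEND + DISC + CONN. A quick check in Python, using volume A4RES1 from the report above:

# Component times in milliseconds for volume A4RES1, from the report above.
iosq, pend, disc, conn = 0.003, 0.004, 0.000, 0.036

resp = iosq + pend + disc + conn   # response time is roughly the sum
print(f"{resp:.3f} ms")            # 0.043 ms vs .044 reported - rounding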

Finding the hot volumes

I take the RMF report and extract the DASD records. (A scripted alternative is sketched after the list below.)

  • For SDSF, where the output is in the spool:
    • I use Status to list all of the jobs (Output or Hold work just as well),
    • put ? in front of the job to show all of the spool data sets,
    • and use the SE command to Spool Edit the report.
  • For a data set, I use the View prefix command in ISPF 3.4.
  • Put DD in the line prefix area on line 1.
  • Find ‘D I R E C T’.
  • Put DD in the line prefix area, and press enter, to delete the lines above it.
  • Find ‘D I R E C T’ last.
  • Put d9999 in the line prefix area following the data (my report has ‘P A G I N G’), and press enter.
  • You should now have only DASD records.
  • Put ‘cols’ in the line command area, and note the columns of the Device Activity Rate (50 to 58).
  • On the command line, type SORT 50 58 D to sort descending on Device Activity Rate.
  • This shows you the top-usage volumes. Check the response times. Under 1 millisecond is good for locally attached disks; it can be down to 0.1 ms.
  • If the response time is 1 ms or larger…
    • Check columns 60-65 (AVG IOSQ TIME); this should be 0. If it is non-zero it means there was queueing in z/OS before the request got to the disk. If there is only one I/O request to the volume at a time, the IOSQ will be zero; with multiple concurrent I/O requests you can get IOSQ queueing time.
    • Any IOSQ could be reduced by moving data sets to other volumes, or adding more paths (sub-channels) between the mainframe and the disks. Each disk requires at least one sub-channel. You can allocate more in a pool, which are used when needed, but this is a z/OS systems programmer/storage manager job.
    • As a performance person you can control which disks you use, and can spread the load.
    • AVG CMR DLY (ComMand Response) is the time for a request to get from the processor down to the storage controller, and for the controller to respond with “I’ve got the request”. This should be small. This value allows you to see if delays are due to getting to the storage controller, or within the controller.
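
If you would rather script this than use ISPF, a rough Python sketch is below. The column positions (Device Activity Rate in columns 50-58, AVG IOSQ TIME in 60-65, volser in 2-7) are from my report and may need adjusting for yours.

# Rough sketch: list the hottest volumes in a saved RMF DASD report.
def hot_volumes(report_file: str, top: int = 10) -> None:
    rows = []
    with open(report_file) as f:
        for line in f:
            try:
                rate = float(line[49:58])   # Device Activity Rate (cols 50-58)
                iosq = float(line[59:65])   # AVG IOSQ TIME (cols 60-65)
            except ValueError:
                continue                    # not a device detail line
            rows.append((rate, iosq, line[1:7]))  # volser in cols 2-7
    rows.sort(reverse=True)                 # highest activity rate first
    for rate, iosq, volser in rows[:top]:
        flag = "  <-- non-zero IOSQ" if iosq > 0 else ""
        print(f"{volser} rate={rate:9.3f} iosq={iosq:.3f}{flag}")

hot_volumes("rmf.dasd.report")              # hypothetical file name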

If you do this for all disks you get an overall view of the data. Now you can select the DASD volumes you are using and check those.

If you find you have a long response time, it is hard to find the root cause, as there are many links in the end-to-end chain. See here for more information.

One Minute MVS performance – CPU at the LPAR level

Question: How do you tell if your car has a problem? Answer: You look at the dashboard and see if there is a red light showing. You may not know how to fix it – but you know that you need to get help to fix it.

The aim of this series of blog posts is to show you what to look for in z/OS performance, and how to tell if you have a problem.

I will cover

For some of these you need data from z/OS. This post describes how to get the SMF data, and format it using RMF.

CPU

There are two basic things you need to check

  1. Has my LPAR got all the CPU it wanted – has the hypervisor restricted the CPU?
  2. How busy are my CPUs?

Has my LPAR got all the CPU it wanted

An LPAR can be configured to have dedicated engines, or share a pool of engines. Dedicated engines means that the engine is always there when it is needed. If the LPAR is using a shared engine, it may not always be available when needed.

An example to explain the concept

You have a class from 10am to 11am. You go in, and sit down. The teacher starts the class. The teacher’s phone rings, and the teacher goes out of the classroom. You play with your phone until the teacher comes back after 40 minutes. (The teacher went to teach in a different classroom.)
How long were you in class for, and how much work did you do?

  • You were in class for 1 hour.
  • You did 20 minutes work.

This concept is the same as any LPAR with shared engines.

  • The 1 hour class is a time slice as seen by z/OS.
  • The “processor” (teacher) was used in the time slice for only 20 minutes.
  • For 40 minutes the “processor” was doing work elsewhere.

In this analogy, “MVS” was busy for the whole hour, but the LPAR only got 20 minutes (33%) of processor time.

How do you get the report to show these figures?

You need the RMF CPU report. It has “C P U  A C T I V I T Y” at the top of the page.

Look at the section

---CPU---    ---------------- TIME % ----------------   
NUM  TYPE    ONLINE    LPAR BUSY    MVS BUSY   PARKED   
 0    CP     100.00    46.68        46.32        0.00   
 1    CP     100.00    38.98        38.78        0.00   
 2    CP     100.00    34.91        34.62        0.00   
TOTAL/AVERAGE          40.19       39.90               
 3    IIP    100.00    94.43        94.70        0.00   
 4    IIP    100.00    93.50        93.74        0.00   
TOTAL/AVERAGE          93.96       94.22               

LPAR BUSY is how much teacher time you got

MVS Busy is how much time you were in the classroom for.

  • If MVS BUSY TIME = LPAR BUSY TIME: perfect, what you needed you got.
  • If MVS BUSY TIME > LPAR BUSY TIME: MVS had to wait for an engine, and the system may need more CPU. A small difference (5%) is OK.
  • If MVS BUSY TIME >> LPAR BUSY TIME: for much of the time there was no engine available when MVS needed one. This will have a major impact on your work. If your end-user work is not meeting targets, you need more CPUs, or to give your LPAR a higher dispatching priority.

In the report above these values are similar: for the CPs, MVS BUSY 39.90 is close to LPAR BUSY 40.19; and for the IIPs, 94.22 is close to 93.96.

When these figures are significantly different, stop and fix the problem, because it can make all other performance data look bad – for example disk response times, and timings in application trace entries.
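
A trivial sketch of the check in Python, using the rule of thumb from the list above (a difference of more than about 5% is worth investigating):

# Compare the TOTAL/AVERAGE figures from the RMF CPU activity report.
def check_cpu(lpar_busy: float, mvs_busy: float, tolerance: float = 5.0) -> None:
    if mvs_busy > lpar_busy + tolerance:
        print("MVS BUSY >> LPAR BUSY: work waited for an engine - investigate")
    else:
        print("OK: the LPAR got the CPU it needed")

check_cpu(40.19, 39.90)   # the CPs in the report above - OK
check_cpu(93.96, 94.22)   # the IIPs - OK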

How busy are my CPUs?

On a busy system the TOTAL/AVERAGE will be close to 100%. 95% busy is OK, but make a note that the system may be short of CPU.

These are average values, and the individual values could be spiky: for example 100% busy for 4 minutes then 80% busy for 1 minute averages out at 96% busy over 5 minutes. Consider using an online monitor to see if you have big peaks and troughs.

A more advanced topic, for information.

The following section gives you information on how much work was waiting. It is hard to say what is good or bad: the data could look bad while all the performance goals are still being met.

How much work was waiting?

-----------------------DISTRIBUTION OF IN-READY WORK UNIT QUEUE--------------
 NUMBER OF              0    10   20   30   40   50   60   70   80   90   100
 WORK UNITS     (%)     |....|....|....|....|....|....|....|....|....|....|  
                                                                             
<=  N          26.3     >>>>>>>>>>>>>>                                       
 =  N +   1    12.9     >>>>>>>                                              
 =  N +   2    10.1     >>>>>>                                               
 =  N +   3    10.1     >>>>>>                                               
<=  N +   5    12.5     >>>>>>>                                              
<=  N +  10    11.0     >>>>>>                                               
<=  N +  15     6.0     >>>>                                                 
<=  N +  20     5.2     >>>                                                  
<=  N +  30     1.6     >                                                    
<=  N +  40     0.6     >                                                    
<=  N +  60     1.1     >                                                    
<=  N +  80     1.1     >                                                    
<=  N + 100     0.8     >                                                    
<=  N + 120     0.1     >                                                    
<=  N + 150     0.0                                                          
>   N + 150     0.0                                                          

N is the number of CPUs. I have 5 on my system.

The data is sampled. If the system is sampled 10 times a second, then every 0.1 seconds RMF counts the number of tasks in the “ready to dispatch” queue, and increments the value in the appropriate bucket. For example, if there were 5 tasks executing and one task waiting, it increments the N + 1 element.

  • 26.3% of the time, there were no tasks waiting for CPU.
  • 12.9% of the time, there was 1 task waiting for CPU – see the N + 1 row in the data above.
  • 10.1% of the time, there were 2 tasks waiting for CPU.
  • 5.2% of the time, there were between 16 and 20 tasks waiting for CPU.
  • 0.1% of the time, there were between 101 and 120 tasks waiting for CPU.

Remember this could be work waiting for a CP, or for an IIP.

If there are hundreds of tasks waiting for CPU you should make a note of it. It may not be a problem.

If there are under 50 tasks waiting for CPU, this should be OK.

On a busy system there will always be work waiting to run. Compare the pictures from a busy time and a not-so-busy time.
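
One way to summarise the distribution is to ask how often anything at all was waiting: add up every bucket above <= N. A sketch in Python using the figures above:

# In-ready distribution from the report above: bucket -> % of samples.
distribution = {
    "<= N": 26.3, "N+1": 12.9, "N+2": 10.1, "N+3": 10.1,
    "<= N+5": 12.5, "<= N+10": 11.0, "<= N+15": 6.0, "<= N+20": 5.2,
    "<= N+30": 1.6, "<= N+40": 0.6, "<= N+60": 1.1, "<= N+80": 1.1,
    "<= N+100": 0.8, "<= N+120": 0.1,
}

waiting = sum(pct for bucket, pct in distribution.items() if bucket != "<= N")
print(f"Some work was waiting for CPU {waiting:.1f}% of the time")
# 73.1% - the buckets do not quite sum to 100 because of rounding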

Is this important?

I once did some measurements with MQ on a machine with 16 processors; on average the engines were about 5% busy. A performance person from IBM said that my workload showed a shortage of CPU! 5% busy on 16 processors – was I really short of CPU?

My application received some data, and posted 30 threads to come and process the data. The first 15 threads could be dispatched because there were 15 unused CPUs. 15 threads had to wait.

This showed up in the above report: on the N + 15 line, tasks were waiting 20% of the time.

Out of the 30 threads that were posted, one processed the work, and the other 29 went back to sleep.

We changed the program to post no more threads than the number of CPUs (16) in the LPAR, and had a significant saving in CPU.

One Minute MVS performance – getting batch RMF reports

There is an introduction to getting RMF reports documented here.

You can display information about your SMF environment using

D SMF,O

This tells you if you are using SMF data sets, or log streams (in the coupling facility), for the RMF data.

Copy the data from the SMF data sets

// SET SMFPDS=SYS1.S0W1.MAN3 
// SET SMFSDS=SYS1.S0W1.MAN4 
//* 
//SMFDUMP  EXEC PGM=IFASMFDP 
//DUMPINA  DD   DSN=&SMFPDS,DISP=SHR,AMP=('BUFSP=65536') 
//DUMPINB  DD   DSN=&SMFSDS,DISP=SHR,AMP=('BUFSP=65536') 
//*DUMPOUT DD   DISP=(NEW,PASS),DSN=&&RMF,SPACE=(CYL,(1,1)) 
//DUMPOUT  DD   DISP=SHR,DSN=IBMUSER.RMF 
//SYSPRINT DD   SYSOUT=* 
//SYSIN  DD * 
  INDD(DUMPINA,OPTIONS(DUMP)) 
  INDD(DUMPINB,OPTIONS(DUMP)) 
  OUTDD(DUMPOUT,TYPE(70:79)) 
  DATE(2020316,2022284) 
  START(0930) 
  END(1700) 
/* 

This job copies the records from the “MAN” data sets, and writes them to DUMPOUT.

The RMF records with types 70 to 79 are copied, within the specified dates and start and end times.

Copy the data from a log stream.

SMF can write data to log streams; for example, MQ records go to the MQ stream, and the RMF records go to the RMF stream.

//SMFDUMP EXEC PGM=IFASMFDL
//DUMPOUT DD DSN=&&TEMP,SPACE=(CYL,(10,10),RLSE),DISP=(NEW,PASS)
//SYSPRINT DD SYSOUT=*
//SYSIN DD *
DATE(2018004,2018012)
START(0900)
END(1900)
LSNAME(IFASMF.RMF,OPTIONS(ALL))
OUTDD(DUMPOUT,TYPE(70:79,116))
/*

This step writes the data to a temporary data set.

Sort the data

If you are processing the data from more than one LPAR you will need to sort the data. See here.

Format the RMF data

The RMF control statements are described here

//POST EXEC PGM=ERBRMFPP 
//* Use the following if using a temporary data set in the same job:
//*MFPINPUT DD DISP=SHR,DSN=*.SMFDUMP.DUMPOUT 
//MFPINPUT DD DISP=SHR,DSN=IBMUSER.RMF 
//PPXSRPTS DD   SYSOUT=*,DCB=(LRECL=200,BLKSIZE=4000) 
//SYSPRINT DD   SYSOUT=* 
//SYSOUT   DD   SYSOUT=* 
//SYSIN DD * 
  SUMMARY(INT,TOT) 
  NODELTA 
  SYSOUT(H) 
  REPORTS(ALL,NOENQ,DEVICE(NOUNITR,NOCOMM)) 
  SYSRPTS(WLMGL(RCPER)) 
  SYSRPTS(WLMGL(SCPER)) 
/* 
//* An alternative: SYSRPTS(WLMGL(SCPER,RCLASS,RCPER,SCLASS)) 

This takes the records from MFPINPUT, which could be a permanent data set, or a temporary data set passed from a previous job step.

You can have the output go to the spool (by default) or to preallocated data sets. See here.

One Minute MVS performance reports of interest are

REPORTS(CPU,DEV(DASD))

For WLM, the reports are

SYSRPTS(WLMGL(RCPER,SCPER))