One Minute MVS – ICSF. It might be better if they finished it.

This is another post in the series of “One Minute MVS” which aims to give the basics you need to be able to get started with a topic.

The IBM documentation says: ICSF provides support for

  • The ANSI Data Encryption Algorithm (DES) and Advanced Encryption Standard (AES) encryption and decryption
  • DES key management and transport
  • AES key management and transport
  • Financial services including PINs, payment card industry transactions and ATMs
  • Public key operations including key generation, digital signatures and wrapping symmetric keys for transport
  • MAC and hash generation
  • Acceleration of handshake and frame encryption for SSL
  • PKCS #11 API

which has too many buzz words for me.

The manual Getting Started with z/OS Data Set Encryption is a very useful book.

My interpretation of what ICSF is:

ICSF…

  • Can create and store public and private keys and certificates (as used in SSL and TLS). RACF can store its certificates in ICSF.
  • Can store symmetric keys used to encrypt data – such as data sets. Note: if you are using TLS, the actual encryption of data over the network is done with a symmetric key.
  • Can manage the hardware, tamper-proof keystores provided with z hardware. (You have to have access to the physical z machine to enter the master cryptographic keys.)
  • Can check credit card PINs and perform other financial checks.
  • Can be configured with a group of data sets, and you can switch to a different set.

ICSF facilities

ICSF has

  • A callable services API to allow you to call ICSF services from a program. The header files /usr/include/csfbext.h and SYS1.SIEAHDR.H define the function parameters. (See the sketch after this list.)
  • Some ISPF panels to help you manage the ICSF entities.
  • Some batch command interfaces.
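For a flavour of the callable services API, here is a minimal sketch in C. It assumes the CSNBOWH (One-Way Hash Generate) service; I believe the parameter order matches the prototypes in csfbext.h, but verify against the header before relying on it.

/* Sketch: hashing some text with the ICSF one-way hash service.  */
/* Assumption: CSNBOWH prototype as declared in csfbext.h.        */
#include <stdio.h>
#include <string.h>
#include <csfbext.h>                /* ICSF callable service prototypes */

int main(void)
{
    long rc = 0, reason = 0;
    long exit_len = 0;
    unsigned char exit_data[4];
    long rule_count = 2;
    unsigned char rule_array[16] = "SHA-256 ONLY    ";  /* two 8-byte keywords */
    unsigned char text[] = "Hello ICSF";
    long text_len = (long)strlen((char *)text);
    long cv_len = 128;              /* chaining vector work area */
    unsigned char cv[128];
    long hash_len = 32;             /* SHA-256 produces 32 bytes */
    unsigned char hash[32];

    memset(cv, 0, sizeof(cv));
    CSNBOWH(&rc, &reason, &exit_len, exit_data,
            &rule_count, rule_array,
            &text_len, text,
            &cv_len, cv,
            &hash_len, hash);
    printf("CSNBOWH return code %ld reason %ld\n", rc, reason);
    return 0;
}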

ICSF is not very usable

I found that ICSF was not very usable. For example

  • The ISPF panels (see here) are not intuitive. You can update the ICSF datasets in batch. You then have to refresh the in-memory copy.
    • For example, to refresh the CKDS data set and list it, using the ISPF panels for ICSF
      • 2 KDS MANAGEMENT -> 1 CKDS MANAGEMENT -> 1 CKDS OPERATIONS -> 2 REFRESH – Activate an updated CKDS.
      • Then PF3, 3 times.
    • To list the contents
      • 5 UTILITY -> 5 CKDS KEYS -> 1 List and manage all records
    • I would have had one page for the KDS, with a refresh option and a list option on the same page.
  • I expected to be able to use ICSF using commands. It looks like I have to write programs to use the API! Some REXX programs are available for a subset of the function. See data set CSF.SCSFCLI0.
  • There is a lack of consistency. Of two utilities doing similar things, one has PARM=’REFRESH,name’ and the other has PARM=’name,REFRESH’. The utility for the PKDS is CSFPUTIL; the utility for the CKDS is CSFEUTIL.

It feels like ICSF has not yet been finished; it is more of a “here are some APIs – good luck, you are on your own” than a guide for the new user.

I’m writing some C programs to do some basic definitions, and pass parameters. I should not need to do this.

ICSF concepts

There are asymmetric keys, such as a private key and its public key. If you encrypt some text with the public key, you need the private key to decrypt it. If you encrypt with the private key, you need the public key to decrypt it.

There are symmetric keys where the same key is used at each end. For example encrypt: change A to s, B to !; decrypt change s to A, ! to B.

Asymmetric keys are usually used in negotiating or sending a symmetric key to the remote end.

Symmetric keys are usually used to encrypt the payload. It is good practice to change the symmetric key periodically to make it harder for someone to break the cipher.
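As a toy illustration of “the same key at each end” (this is not real cryptography, just a demonstration of the symmetry), XOR makes the point well because applying the same key twice restores the original text:

/* A toy illustration (not real cryptography): with a symmetric    */
/* cipher the same key encrypts and decrypts. XOR makes this       */
/* obvious because applying the key twice restores the text.       */
#include <stdio.h>

static void xor_crypt(char *buf, size_t len, const char *key, size_t keylen)
{
    for (size_t i = 0; i < len; i++)
        buf[i] ^= key[i % keylen];   /* same operation for encrypt and decrypt */
}

int main(void)
{
    char text[] = "HELLO";
    const char key[] = "K3Y";
    xor_crypt(text, 5, key, 3);      /* encrypt */
    xor_crypt(text, 5, key, 3);      /* decrypt with the SAME key */
    printf("%s\n", text);            /* prints HELLO again */
    return 0;
}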

ICSF has

  • PKDS (Public Key Data Set) for storing Private and Public Asymmetric keys.
  • CKDS (Cryptographic Key Data Set) which is used to store Symmetric keys
  • TKDS (Token Key Data Set), which is used for PKCS #11 tokens. When you are using keys stored this way, you have a token to reference the data.

The data in these key data sets may be encrypted, for example by the hardware cryptographic facility. You can configure ICSF so your keys are encrypted and, in normal operation, never available in clear text: they are encrypted under keys in the tamper-proof hardware, and are decrypted and used only within that hardware.

Keys can be in one of several states:

  • Active. The key can be used to process data. It is within its start and end dates (if present)
  • Archived. The key cannot be used to process data. For example, you have backed up an encrypted data set to tape. If you delete the key, the data can never be processed; if the key is archived, it just cannot be used until you change its state. If you need to access the backed-up data set, you can change the status to Active for the duration of the work.
  • Inactive – A key which has not been archived, and is outside of the start and end dates (if present).
  • Pre-active – I cannot find what this is. It is mentioned in the ISPF panels.

Configuration

You configure which ICSF data sets are currently in use, and other parameters, via SYS1.PARMLIB(CSFPRMxx).

When you start the CSF procedure you specify the xx suffix. For example, in the SYS1.PROCLIB concatenation, member CSF:

//CSF PROC PRM=00
//CSF EXEC PGM=CSFINIT,PARM=&PRM,…..

and start it with:

S CSF,PRM=CP

You can use the SETICSF operator command to change some parameters for the duration of the CSF task.

You can use D ICSF, for example d ICSF,kds to give

CKDS CSF.SCSFCKDS FORMAT=KDSR SYSPLEX=N MKVPs=DES AES
PKDS CSF.SCSFPKDS FORMAT=VARIABLE SYSPLEX=N MKVPs=RSA ECC
No TKDS was provided.

To use a different CSFPRMxx, you have to stop and restart CSF, specifying the new suffix.

You need to plan your ICSF usage

You can set up data set encryption so when you create a data set, the data is automatically encrypted. You define a key and give it a label. If you delete the key, the data cannot be read. To be able to process the dataset you need RACF access to the dataset, and RACF access to the key.

The label is associated with the data set (so if you send the data set to a remote system, it will still have the same label).

Most organisations say you must change your password periodically, typically every month. For similar reasons, organisations say you should define a new key and use it, typically every month. This is called key rotation, where you roll over your key to a new value.

I could have a key with label ColinDSKey, and encrypt my data sets with it. I encrypt data set COLIN.AUG2021.LOG with this key. Next month if I create a new key and reuse the same label, I can encrypt the data set COLIN.SEPT2021.LOG and use the new key. However I will be unable to read COLIN.AUG2021.LOG, because ColinDSKey now has a new value.

I’ve seen presentations which say “just re-encrypt all your data sets with the new key”. This sounds like a lot of work and a major disruption.

Another approach is to change the key label each month, for example ColinAug2021DSKey. When I generate a new key, it has the label ColinSept2021DSKey. I configure data sets to use the new label. Data sets with the old label can still be used, as long as the key exists. You can tell ICSF to archive the label (and key), so it cannot be used. If it is needed, you make the key active, use the data set, and re-archive the key. (A sketch of generating such labels follows.)
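Here is a minimal sketch of generating a month-stamped label in C. The COLIN.%b%Y.DSKEY format is just my illustration of the ColinSept2021DSKey style of convention, not anything ICSF requires:

/* Sketch: build a key label containing the month and year, so each */
/* month's key rotation gets its own label. Format is illustrative. */
#include <stdio.h>
#include <time.h>
#include <ctype.h>

int main(void)
{
    char label[64];                  /* ICSF key labels can be up to 64 characters */
    time_t now = time(NULL);
    struct tm *t = localtime(&now);

    strftime(label, sizeof(label), "COLIN.%b%Y.DSKEY", t);  /* e.g. COLIN.Sep2021.DSKEY */
    for (char *p = label; *p; ++p)
        *p = (char)toupper((unsigned char)*p);              /* labels are upper case    */
    printf("%s\n", label);                                  /* COLIN.SEP2021.DSKEY      */
    return 0;
}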

To set the label for a data set, you can

  • Specify it in JCL. You may have to change your JCL every month, or use a symbol such as the month and year. For example:

// SET KEY='OUTAESKEY'
//S4 EXEC PGM=IKJEFT01,REGION=0M
//SYSTSPRT DD DSN=IBMUSER.ENC,DISP=(MOD,CATLG),SPACE=(CYL,(1,1)),
// DSKEYLBL=&KEY,DSNTYPE=EXTREQ,DCB=(LRECL=132,BLKSIZE=3200)
//SYSTSIN DD *
LU *
/*

  • Use SMS, and have rules to generate the label depending on your profile and the name of the data set.

  • Define a RACF profile. For example ADDSD 'PROTECT.*' UACC(NONE) DFP(DATAKEY(AES2)) says for all data sets with an HLQ of PROTECT, use key AES2.

When a data set is allocated you get a message

IGD17150I DATA SET PROTECT.ENC IS ELIGIBLE FOR ACCESS METHOD ENCRYPTION. KEY LABEL IS (AES5)

You also need to think about encrypting your databases, which is another jump in complexity. As an old text book said “We’ll leave this as an exercise for the reader”.

To make the planning just a little more complex: if you send an encrypted data set to another z/OS system, it will have the same label as when it was originally created, so you need the keys sent (securely) to the remote system for the data set to be processed, and you need to coordinate naming conventions.

As the section header says, you need to plan your ICSF usage. ICSF and data set encryption need a lot of planning.

Rexx program to invoke ICSF panels

I used the following to invoke the ISPF panels for ICSF. I call the exec CSF.

To invoke it, from an ISPF screen type TSO CSF.

/* Rexx */                                                
/* IBMs ICSF */                                           
 address ispexec                                          
"LIBDEF ISPPLIB DATASET ID('CSF.SCSFPNL0') STACK"         
"LIBDEF ISPMLIB DATASET ID('CSF.SCSFMSG0') STACK"         
"LIBDEF ISPSLIB DATASET ID('CSF.SCSFSKL0') STACK"         
"LIBDEF ISPTLIB DATASET ID('CSF.SCSFTLIB') STACK"         
address tso "ALTLIB ACTIVATE APPLICATION(CLIST)           
            DATASET('CSF.SCSFCLI0')"                      
                                                          
"SELECT PANEL(CSF@PRIM) NEWAPPL(CSF) PASSLIB"             
                                                          
address tso "ALTLIB DEACTIVATE APPLICATION(CLIST)"        
"LIBDEF ISPSLIB"                                          
"LIBDEF ISPPLIB"                                          
"LIBDEF ISPMLIB"                                          
"LIBDEF ISPTLIB"                                          

One Minute MVS – tuning stack and heap pools

These days many applications use a stack and a heap to manage the storage used by an application. For C and COBOL programs on z/OS these use the C run time facilities. As Java uses the C run time facilities, it also uses the stack and heap.

If the stack and heap are not configured appropriately it can lead to an increase in CPU. You used to have to carefully manage the stack and heap pool sizes so you did not run out of virtual storage; with the introduction of 64 bit storage, that part of the tuning is no longer critical.

The 5 second summary of what to check: the number of segments freed for the stack and heap should be zero. If the value is large, then a lot of CPU is being used to manage the storage.


Kindergarten background to the stack.

When a C (main) program starts, it needs storage for the variables used in the program. For example

int ii;
for (ii=0;ii<3;ii++)
{}

char * p = malloc(1024);

The variables ii and p are variables within the function, and will be on the function’s stack. p is a pointer.

The block of storage from the malloc(1024) will be obtained from the heap, and its address stored in p.

When the main program calls a function the function needs storage for the variables it uses. This can be done in several ways

  1. Each function uses a z/OS GETMAIN request on entry, to allocate storage, and a z/OS FREEMAIN request on exit. These storage requests are expensive.
  2. The main program has a block of storage which functions can use. For example, the main program uses bytes 0 to 1500 of this block, and the first function needs 500 bytes, so uses bytes 1501 to 2000. If this function calls another function, the lower level function uses storage from 2001 onwards. This is what usually happens; it is very efficient, and is known as a “stack”. (A sketch follows this list.)
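A minimal sketch of that second approach, using a “bump pointer” into one pre-allocated block; the sizes mirror the example above, and everything else is illustrative:

/* Sketch: carving function storage out of one block, stack style. */
/* Entry takes the next slice; exit gives it back (LIFO order).    */
#include <stdio.h>
#include <string.h>

static char   stack_block[4096];   /* the block obtained once, up front */
static size_t stack_top = 0;       /* next free byte in the block       */

static void *stack_alloc(size_t n)
{
    void *p = &stack_block[stack_top];
    stack_top += n;                /* no GETMAIN: just move the pointer */
    return p;
}

static void stack_free(size_t n)
{
    stack_top -= n;                /* last allocated, first freed */
}

static void inner(void)
{
    char *mine = stack_alloc(500); /* "bytes 1501 to 2000" in the example */
    memset(mine, 0, 500);
    stack_free(500);
}

int main(void)
{
    char *mine = stack_alloc(1501); /* main's bytes 0 to 1500 */
    memset(mine, 0, 1501);
    inner();
    stack_free(1501);
    return 0;
}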

Intermediate level for stack

It starts to get interesting when the initial block of storage allocated in the main program is not big enough.

There are several approaches to take when this occurs

  1. Each function does a storage GETMAIN on entry, and FREEMAIN on exit. This is expensive.
  2. Allocate another big block of storage, so successive functions now use this block, just like in the kindergarten case. When functions return to the one that caused a new block to be allocated,
    1. this new block is freed. This is not as expensive as the previous case.
    2. this block is retained, and stored for future requests. This is the cheapest case. However a large block has been allocated, and may never be used again.

How big a block should it allocate?

When using a stack, the size of the block to allocate is the larger of the user specified size, and the size required for the function. If the specified secondary size was 16KB, and a function needs 20KB of storage, then it will allocate at least 20KB.

How do I get the statistics?

For your C programs you can specify options on the #pragma runopts statement or, the easier way, through JCL. You specify C run time options through a //CEEOPTS DD statement. For example

//CEEOPTS DD *
STACK(2K,12K,ANYWHERE,FREE,2K,2K)
RPTSTG(ON)

Where

  • STACK(…) is the size of the stack
  • RPTSTG(ON) says collect and display statistics.

There is a small overhead in collecting the data.

The output is like:

STACK statistics:                                                
  Initial size:                                2048     
  Increment size:                             12288     
  Maximum used by all concurrent threads:  16218808     
  Largest used by any thread:              16218808     
  Number of segments allocated:                2004     
  Number of segments freed:                    2002     

Interpreting the stack statistics

From the above data

  • This shows the initial stack size was 2KB and an increment of 12KB.
  • The stack was extended 2004 times.
  • Because the statement had STACK(2K,12K,ANYWHERE,FREE,2K,2K), when the secondary extension became free it was FREEMAINed back to z/OS.

When KEEP was used instead of FREE, the storage was not returned back to z/OS.

The statistics looked like

STACK statistics:                                                
  Initial size:                               2048     
  Increment size:                            12288     
  Maximum used by all concurrent threads: 16218808     
  Largest used by any thread:             16218808     
  Number of segments allocated:               1003     
  Number of segments freed:                      0     

What to check for and what to set

For most systems, the key setting is KEEP, so that freed blocks are not released. You can see this (a) from the definition, and (b) because the “Number of segments freed” is 0.

If a request to allocate a new segment fails, then the C run time can try releasing segments that are not in use. If this happens the “segments freed” count will be incremented.

Check that the “segments freed” is zero, and if not, investigate why not.

When a program is running for a long time, a small number of “segments allocated” is not a problem.

Making the initial size larger, closer to the “Largest used by any thread”, may improve storage utilisation. With smaller segments there is likely to be unused space which was too small for a function’s request, causing the next segment to be used. So a better definition would be

STACK(16M,12K,ANYWHERE,KEEP,2K,2K)

Which gave

STACK statistics:                                                          
  Initial size:                                     16777216               
  Increment size:                                      12288               
  Maximum used by all concurrent threads:           16193752               
  Largest used by any thread:                       16193752               
  Number of segments allocated:                            1               
  Number of segments freed:                                0               

Which shows that just one segment was allocated.


Kindergarten background to the heap

When there is a malloc() request in C, or a new … in Java, the storage may exist outside of the function. The storage is obtained from the heap.

The heap has blocks of storage which can be reused. The blocks may all be of the same size, or of different sizes. It takes CPU time to scan the free blocks looking for the best one to reuse; with more blocks this can use increasing amounts of CPU.

Heap pools avoid the cost of searching for the “right” block, by using pools of fixed size blocks. For example:

  1. there is a heap pool with 1KB fixed size blocks
  2. there is another heap pool with 16KB blocks
  3. there is another heap pool with 256 KB blocks.

If there is a malloc request for 600 bytes, a block will be taken from the 1KB heap pool.

If there is a malloc request for 32KB, a block would be used from the 256KB pool.

If there is a malloc request for 512KB, it will issue a GETMAIN request.
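A sketch of that selection logic: take a cell from the smallest pool whose cell size is big enough, otherwise fall back to an ordinary storage request (GETMAIN on z/OS; the pool sizes match the example above):

/* Sketch of how a malloc request maps to a heap pool: use the   */
/* smallest pool whose cell size is big enough, else fall back   */
/* to an ordinary storage request (GETMAIN on z/OS).             */
#include <stdio.h>

static const size_t cell_sizes[] = { 1024, 16384, 262144 };  /* 1KB, 16KB, 256KB pools */
#define NPOOLS (sizeof cell_sizes / sizeof cell_sizes[0])

static int pick_pool(size_t request)
{
    for (size_t i = 0; i < NPOOLS; i++)
        if (request <= cell_sizes[i])
            return (int)i;
    return -1;                       /* too big for any pool: GETMAIN path */
}

int main(void)
{
    printf("600 bytes -> pool %d\n", pick_pool(600));       /* 0: the 1KB pool   */
    printf("32KB      -> pool %d\n", pick_pool(32*1024));   /* 2: the 256KB pool */
    printf("512KB     -> pool %d\n", pick_pool(512*1024));  /* -1: GETMAIN       */
    return 0;
}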

Intermediate level for heap

If there is a request for a block of heap storage, and there is no free storage, a large segment of storage can be obtained and divided up into blocks for the pool. If the pool has 1KB blocks, and a request for another block fails, it may issue a GETMAIN request for 100 * 1KB and then add 100 blocks of 1KB to the pool. As storage is freed, the blocks are added to the free list in the heap pool.

There is the same logic as for the stack, about returning storage.

  1. If KEEP is specified, then any storage that is released stays in the heap pool. This is the cheapest solution.
  2. If FREE is specified, then when all the blocks in an additional segment have been freed, the segment is freed back to z/OS. This is more expensive than KEEP, as you may get frequent GETMAIN and FREEMAIN requests.

How many heap pools do I need and of what size blocks?

There is usually a range of block sizes used in a heap. The C run time supports up to 12 cell sizes. Using a Liberty Web server, there was a range of storage requests, from under 8 bytes to 64KB.

With most requests there will frequently be wasted space. If you want a block which is 16 bytes long, but the pool with the smallest cell size is 1KB, most of the storage is wasted.
The C run time gives you suggestions on the configuration of the heap pools, the initial size of the pool and the size of the blocks in the pool.

Defining a heap pool

How to define a heap pool is described here.

You specify the overall size of storage in the heap using the HEAP statement. For example, for a 16MB total heap size:

HEAP(16M,32768,ANYWHERE,FREE,8192,4096)

You then specify the pool sizes


HEAPPOOLS(ON,32,1,64,2,128,4,256,1,1024,7,4096,1,0)

Each pair of figures gives the size of the blocks (cells) in the pool and the percentage of the heap allocated to that pool:

  • 32,1 says maximum size of blocks in the pool is 32 bytes, allocate 1% of the heap size to this pool
  • 64,2 says maximum size of blocks in the pool is 64 bytes, allocate 2% of the heap size to this pool
  • 128,4 says maximum size of blocks in the pool is 128 bytes, allocate 4% of the heap size to this pool
  • 256,1 says maximum size of blocks in the pool is 256 bytes, allocate 1% of the heap size to this pool
  • 1024,7 says maximum size of blocks in the pool is 1024 bytes, allocate 7% of the heap size to this pool
  • 4096,1 says maximum size of blocks in the pool is 4096 bytes, allocate 1% of the heap size to this pool
  • 0 says end of definition.

Note, the percentages do not have to add up to 100%.

For example, with the CEEOPTS

HEAP(16M,32768,ANYWHERE,FREE,8192,4096)
HEAPPOOLS(ON,32,50,64,1,128,1,256,1,1024,7,4096,1,0)

After running my application, the data in //SYSOUT is


HEAPPOOLS Summary:                                                         
  Specified Element   Extent   Cells Per  Extents    Maximum      Cells In 
  Cell Size Size      Percent  Extent     Allocated  Cells Used   Use      
  ------------------------------------------------------------------------ 
       32        40    50      209715           0           0           0 
       64        72      1        2330           1        1002           2 
      128       136      1        1233           0           0           0 
      256       264      1         635           0           0           0 
     1024      1032      7        1137           1           2           0 
     4096      4104      1          40           1           1           1 
  ------------------------------------------------------------------------ 

For the cell size of 32, 50% of the pool was allocated to it.

Each block has a header, and the total size of a 32 byte block is 40 bytes. The number of 40-byte units in 50% of 16 MB is 8MB/40 = 209715, so these figures match up.

(Note with 64 bit heap pools, you just specify the absolute number you want – not a percentage of anything).

Within the program there was a loop doing malloc(50). This uses the cell pool with size 64 bytes; 1002 blocks (cells) were used.

The output also has

Suggested Percentages for current Cell Sizes:
HEAPP(ON,32,1,64,1,128,1,256,1,1024,1,4096,1,0)


Suggested Cell Sizes:
HEAPP(ON,56,,280,,848,,2080,,4096,,0)

I found this confusing and not well documented. It is another of those topics that, once you understand it, makes sense.

Suggested Percentages for current Cell Sizes

The first “suggested… ” values are the suggestions for the size of the pools if you do not change the size of the cells.

I had specified 50% for the 32 byte cell pool. As this cell pool was not used (0 allocated cells), it suggests making this 1%, so the suggestion begins HEAPP(ON,32,1,…

You could cut and paste this into your //CEEOPTS statement.

Suggested Cell Sizes

The C run time has a profile of all the sizes of blocks used, and has suggested some better cell sizes. For example, as I had no requests for storage less than 32 bytes, making the smallest cell bigger makes sense. For optimum storage usage, it suggests using cell sizes of 56, 280, 848, 2080 and 4096 bytes.

Note it does not give suggested numbers of blocks. I think this is poor design; because it knows the profile, it could attempt to suggest the numbers as well.

If you want to try this definition, you need to add some values such as

HEAPP(ON,56,1,280,1,848,1,2080,1,4096,1,0)

Then rerun your program, and see what percentage figures it recommends, update the figures, and test again. Not the easiest way of working.

What to check for and what to set

There can be two sets of heap pools: one for 64 bit storage (HEAPPOOLS64), the other for 31 bit storage (HEAPPOOLS).

The configuration should normally be “KEEP”, so any storage obtained is kept and not freed. This saves the cost of expensive GETMAINs and FREEMAINs.

If the address space is constrained for storage, the C run time can go round each heap pool and free up segments which are not in use.

The value “Number of segments freed” for each heap should be 0. If not, find out why (has the pool been specified incorrectly, or was there a storage shortage).

You can specify how big each pool is

  • for HEAPPOOLS you specify the HEAP size, and the percentage to be allocated to each pool – so two numbers to change
  • for HEAPPOOLS64 you specify the size of each pool directly.

The sizes you specify are not that sensitive, as the pools will grow to meet the demand. Allocating one large block is cheaper than allocating 50 smaller blocks – but for a server, this difference can be ignored.

With a 4MB heap specified

HEAP(4M,32768,ANYWHERE,FREE,8192,4096)
HEAPP(ON,56,1,280,1,848,1,2080,1,4096,1,0)

the heap report was

 HEAPPOOLS Summary: 
   Specified Element   Extent   Cells Per  Extents    Maximum      Cells In 
   Cell Size Size      Percent  Extent     Allocated  Cells Used   Use 
   ------------------------------------------------------------------------ 
        56        64      1         655           2        1002           2 
       280       288      1         145           1           1           0 
       848       856      1          48           1           1           0 
      2080      2088      1          20           1           1           1 
      4096      4104      1          10           0           0           0 
   ------------------------------------------------------------------------ 
   Suggested Percentages for current Cell Sizes: 
     HEAPP(ON,56,2,280,1,848,1,2080,1,4096,1,0) 

With a small(16KB) heap specified

HEAP(16K,32768,ANYWHERE,FREE,8192,4096)
HEAPP(ON,56,1,280,1,848,1,2080,1,4096,1,0)

The output was

HEAPPOOLS Summary:                                                            
  Specified Element   Extent   Cells Per  Extents    Maximum      Cells In    
  Cell Size Size      Percent  Extent     Allocated  Cells Used   Use         
  ------------------------------------------------------------------------    
       56        64      1           4         251        1002           2    
      280       288      1           4           1           1           0    
      848       856      1           4           1           1           0    
     2080      2088      1           4           1           1           1    
     4096      4104      1           4           0           0           0    
  ------------------------------------------------------------------------    
  Suggested Percentages for current Cell Sizes:                               
    HEAPP(ON,56,90,280,2,848,6,2080,13,4096,1,0)                             

and we can see it had to allocate 251 extents to satisfy all the requests.

Once the system has “warmed up” there should not be a major difference in performance. I would allocate the heap to be big enough to start with, and avoid extensions.

With the C run time there are heaps as well as heap pools. My C run time report gave

64bit User HEAP statistics:
31bit User HEAP statistics:
24bit User HEAP statistics:
64bit Library HEAP statistics:
31bit Library HEAP statistics:
24bit Library HEAP statistics:
64bit I/O HEAP statistics:
31bit I/O HEAP statistics:
24bit I/O HEAP statistics:

You should check all of these and make the initial size the same as the suggested size. This way the storage will be allocated at startup, and you avoid problems of a request to expand the heap failing due to lack of storage during a busy period.

Advanced level for heap

While the above discussion is suitable for many workloads, especially single threaded ones, it can get more complex when there are multiple threads using the heap pools.

If you have a “hot” or highly active pool you can get contention when obtaining and releasing blocks from the heap pool. You can define multiple pools for an element size. For example

HEAPP(ON,(56,4),1,280,1,848,1,2080,1,4096,1,0)

The (56,4) says make 4 pools with block size of 56 bytes.

The output has

HEAPPOOLS Summary:                                                          
  Specified Element   Extent   Cells Per  Extents    Maximum      Cells In  
  Cell Size Size      Percent  Extent     Allocated  Cells Used   Use       
  ------------------------------------------------------------------------  
       56        64      1           4         251        1002           2  
       56        64      1           4           0           0           0  
       56        64      1           4           0           0           0  
       56        64      1           4           0           0           0  
      280       288      1           4           1           1           0  
      848       856      1           4           1           1           0  
     2080      2088      1           4           1           1           1  
     4096      4104      1           4           0           0           0  
  ------------------------------------------------------------------------  

We can see there are now 4 pools with a cell size of 56 bytes. The documentation says “Multiple pools are allocated with the same cell size and a portion of the threads are assigned to allocate cells out of each of the pools.”

If you have 16 threads you might expect 4 threads to be allocated to each pool.

How do you know if you have a “hot” pool?

You cannot tell from the summary, as you just get the maximum cells used.

The report also contains the count of requests for different storage ranges.

Pool  2     size:   160 Get Requests:           777707 
  Successful Get Heap requests:    81-   88                 77934 
  Successful Get Heap requests:    89-   96                 59912 
  Successful Get Heap requests:    97-  104                 47233 
  Successful Get Heap requests:   105-  112                 60263 
  Successful Get Heap requests:   113-  120                 80064 
  Successful Get Heap requests:   121-  128                302815 
  Successful Get Heap requests:   129-  136                 59762 
  Successful Get Heap requests:   137-  144                 43744 
  Successful Get Heap requests:   145-  152                 17307 
  Successful Get Heap requests:   153-  160                 28673
Pool  3     size:   288 Get Requests:            65642  

I used ISPF edit to process the report.

By extracting the records containing “size:” you get the count of requests per pool. (A trivial filter program is sketched below.)
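If you prefer a program to ISPF edit, this sketch does the same extraction, reading the report from stdin:

/* Sketch: keep only the report lines containing "size:", giving  */
/* the requests-per-pool summary shown below.                     */
#include <stdio.h>
#include <string.h>

int main(void)
{
    char line[256];
    while (fgets(line, sizeof line, stdin))
        if (strstr(line, "size:"))
            fputs(line, stdout);
    return 0;
}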

Pool  1     size:    80 Get Requests:           462187 
Pool  2     size:   160 Get Requests:           777707 
Pool  3     size:   288 Get Requests:            65642 
Pool  4     size:   792 Get Requests:            18293 
Pool  5     size:  1520 Get Requests:            23861 
Pool  6     size:  2728 Get Requests:            11677 
Pool  7     size:  4400 Get Requests:            48943 
Pool  8     size:  8360 Get Requests:            18646 
Pool  9     size: 14376 Get Requests:             1916 
Pool 10     size: 24120 Get Requests:             1961 
Pool 11     size: 37880 Get Requests:             4833 
Pool 12     size: 65536 Get Requests:              716 
Requests greater than the largest cell size:               1652 

It might be worth splitting Pool 2 and seeing if it makes a difference in CPU usage at peak time. If it has a benefit, try Pool 1.

You can also sort the “Successful Get Heap requests” counts, and see which range has the most requests. I don’t know what you would use this information for, unless you were investigating why so much storage was being used.

PhD level for heap

For high use application on boxes with many CPUs you can get contention for storage at the hardware cache level.

Before a CPU can use storage, it has to get the 256 byte cache line into the processor cache. If two CPUs are fighting for storage in the same 256 bytes, throughput goes down.

By specifying

HEAPP(ALIGN….

it ensures that each block is isolated in its own cache line. This can lead to an increase in virtual storage, but you should get improved throughput at the high end. It may make very little difference when there is little load, or on an LPAR with few engines.

One minute MVS – synchronous I/O.

This blog is one of a series of blog posts about performance topics on z/OS. This post is about synchronous I/O technology which reduces I/O time from milliseconds down to microseconds.

There is a 2018 redbook with some good content. This 2020 presentation is a good introduction; it gets a bit technical and looks at it from the Storage Controller viewpoint.

History

Some protocols are very chatty. For example when you download a television program to your television, you see it as one download. Under the covers the software sees this television program as many parts, and there will be a conversation like “Here is part 1, the first 10 MBs”… “ok got that – please send the next”. Within this conversation is TCP/IP chatter, “here is packet 4013” … “ok got that” etc.

40 years ago the I/O to disks was a bit like the following conversation between the mainframe and the disk controller. Imagine this as a series of phone calls. The “Hello” is where they initiate a new call, and “Bye” is where they hang up.

  • Mainframe: Hello storage controller, I want to read from disk with volume label A1USR1
  • Storage controller: Sorry, it is in use, please try later. Bye.
  • Mainframe: Hello? The telephone line is busy, I’ll retry.
  • Mainframe: Hello, is disk A1USR1 available now?
  • Storage controller: yes. I’ve reserved it for you
  • Mainframe: Please move the disk read heads so they are under cylinder 61
  • Storage controller: OK will do. Bye.
  • Storage controller: Hello. OK the heads are under cylinder 61
  • Mainframe: Read from head 4 (track 4) third record
  • Storage controller: OK will do. Bye.
  • Storage controller: Hello. OK, done that – here is the data
  • Mainframe: Thank you, I’ve finished with the volume A1USR1.
  • Storage controller: You’re welcome – that took 25 milliseconds. Bye.

Today’s conversation is much faster, as mostly data is in the storage controller’s cache. There are still a few phone calls.

Today’s synchronous I/O

With the new synchronous I/O technology (zHyperLink) the conversation is more like the following. (Using the new premium support phone number)

  • Mainframe: Hello Storage controller, I want to read from disk with volume label A1USR1 Cylinder 61 track 4.
  • Storage controller: Just a second – we have it in cache, here is the data. Bye.

Or perhaps the conversation goes like.

  • Mainframe: Hello Storage controller, I want to read from disk with volume label A9SYS1 Cylinder 995 track 2.
  • Storage controller: Just a second – I’m sorry we do not have it in cache. Please use the old way of doing it, using the old customer support number. Bye.

Technical background

With the old way of doing I/O there were multiple asynchronous requests coming back from the Storage Controller. It needed CPU to start the I/O (Start subchannel), then suspend the program until the I/O completed, and then resume the program. The program could have been re-dispatched on a different CPU, so its data was not in the processor cache.

Sometimes the amount of CPU used was more than the duration of I/O request! With the coupling facility(CF), came technology to issue an I/O request synchronously. With this the “Issue CF request” was one mainframe instruction which started the I/O, waited for the response and then resumed. There was no z/OS interrupt, and there was no dispatcher involved. Generally this used less CPU than the asynchronous request, and was faster.

The duration of the synchronous request depends on the work being done in the CF. The more work it does, the longer the instruction takes. For example, the CF is doing some work on some data for a different processor, and locks a resource; this will delay other users of the data, who get locked out. Also, as the physical distance between the CF and the mainframe increases, the overall duration increases (due to the speed of light).

At some point it is more efficient to use the asynchronous request (send the request, then suspend until it has completed) rather than send the request and wait.

The CF code has logic to determine if the request should be synchronous or asynchronous. Even though the recent requests have all been asynchronous, it will periodically try a synchronous request to see if it is worth switching to synchronous mode.
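A sketch of that kind of heuristic, purely my illustration rather than IBM's actual algorithm; the threshold and retry interval are made-up numbers:

/* Sketch (illustrative, not IBM's algorithm): go asynchronous when  */
/* observed synchronous times exceed a threshold, but probe with a   */
/* synchronous request now and then in case conditions improved.     */
#include <stdio.h>
#include <stdbool.h>

#define SYNC_THRESHOLD_US 30.0   /* made-up cut-over point          */
#define RETRY_INTERVAL    1000   /* probe sync every N requests     */

static double recent_sync_us = 0.0;   /* smoothed recent sync duration */
static long   request_count  = 0;

static bool use_synchronous(void)
{
    request_count++;
    if (recent_sync_us <= SYNC_THRESHOLD_US)
        return true;                              /* sync is still cheap */
    return (request_count % RETRY_INTERVAL) == 0; /* periodic probe      */
}

static void record_sync_time(double us)
{
    /* weighted average, like the smoothing discussed in the TCP/IP post */
    recent_sync_us = (9.0 * recent_sync_us + us) / 10.0;
}

int main(void)
{
    record_sync_time(10.0);                     /* fast sync times...    */
    printf("sync? %d\n", use_synchronous());    /* 1: stay synchronous   */
    record_sync_time(500.0);                    /* the CF got busier...  */
    printf("sync? %d\n", use_synchronous());    /* 0: go asynchronous    */
    return 0;
}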

Additional hardware

For the mainframe to support this synchronous I/O it needs additional hardware.

  • It continues to need the FIbre CONnection (FICON) links for those requests that do not support synchronous I/O.
  • It needs new zHyperLink cables connected from the mainframe to the Storage Controller. You need a direct connection, not via a switch.
  • The storage controller and mainframe need to be closer than 150 meters.
  • You can have duplexed storage controllers, which both have to be within 150 meters.

One Minute MVS performance – TCP/IP

Question: In your car how do you tell if your car has a problem? Answer: You look at the dashboard and see if there is a red light showing. You may not know how to fix it – but you know that you need to get help to fix it.

The aim of this series of blog posts is to show you what to look for in z/OS performance and if you have a problem.


What is a TCP/IP performance problem?

People complain about a TCP/IP performance problem when “it” seems slow. This could be caused by a variety of problems

  • Data between two ends is being discarded. This can occur on an unreliable, or overloaded component, whose default action is to throw away data, knowing it will be resent.
  • The time taken to get from one end to the other and back (“a ping”) is slow. This can be caused by slow or overloaded components.
  • There is a lot of data to send, for example a movie, or a web page with lots of javascript or graphics.
  • Or all of the above.

There is a quote “Never underestimate the bandwidth of a lorry full of tapes”. It might take 10 hours, but a truck 6 ft wide by 20 ft long could hold 300,000 1TB tapes and deliver 8 TBytes/second (with a round trip time of 20 hours) – which is more than the internet can provide!

You need to know

  • Are packets being thrown away? You see this from the number of packets which were resent.
  • What is the round trip time? (You could use ping – but you may not be able to)
  • Is data being sent efficiently – in big blocks?

TCP/IP concepts

With TCP/IP there is a connection between a sender and a receiver. The sender sends numbered packets of data to the receiver. The receiver sends an acknowledgement that a packet has been received.

The following is a representation of the flow

  • The sender sends packet 1
  • The sender sends packet 2
  • The sender sends packet 3
  • The receiver receives packet 1 and sends an acknowledgement for packet 1
  • The sender sends packet 4
  • The receiver receives packet 2 and sends an acknowledgement for packet 2
  • The sender waits until the acknowledgement of packet 1 has been received
  • The sender sends packet 5 and waits until the acknowledgement of packet 2 has been received
  • etc

This way it is self-limiting: the sender cannot send more than the receiver can handle.

If a packet goes missing, eventually the sender gets a time out, and resends it.

There are two parts to “performance”.

  1. FTP like: How much data can be sent per second. This is of interest to FTP and MQ, where there is mainly a one way transmission of lots of data. The round trip time is not so critical if you can have a lot of data in transit.
  2. Transactional: Send some data and wait for the remote end to respond, for example a web browser. The amount of data may be measured in KB, but the round trip time is important.

The term “window” is often used in TCP/IP.

The term “send window” on the sender side represents the total number of packets yet to be acknowledged by the receiver. With a bigger window, there is more data in the pipe line, and the throughput goes up. With a window of 1, one packet is sent and the sender waits for the acknowledgement before sending the next. With this, if there is a high latency, the overall throughput will be low.
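A rough rule of thumb behind this: sustained throughput on one connection is bounded by roughly (send window) / (round trip time). A worked example with illustrative numbers:

/* Sketch: why window size and RTT together bound throughput. */
#include <stdio.h>

int main(void)
{
    double window_bytes = 64256.0;  /* e.g. the SndWnd value in a netstat report */
    double rtt_seconds  = 0.006;    /* e.g. a 6 ms smooth trip time              */

    printf("max throughput ~ %.1f MB/s\n",
           window_bytes / rtt_seconds / (1024.0 * 1024.0));
    /* 64256 / 0.006 is about 10.2 MB/s; a bigger window or a */
    /* smaller round trip time raises the bound.              */
    return 0;
}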

More details

One of the factors that affects performance is the receive buffer size. If this was set to 4KB, it means that an application can read up to 4 KB of data at a time. This receive buffer size is sent to the sender, and basically says “send chunks up to this size – as that is all the receiver can take” – this sets the send-buffer-size.

The term Dynamic Right Sizing(DRS) allows the TCP receive buffer size to expand if the network conditions are favourable.

The term Outbound Right Sizing(ORS) allows the TCP send buffer size to expand if the network conditions are favourable.

Another term used is congestion window. If too much data is sent, or the network is unreliable, packets will get lost or thrown away. The congestion window is a measure of how much data can be in-flight. If packets get lost, the congestion window is made smaller. If packets are not lost, then it will try to increase the congestion window. This is a very rough indication of the quality of the network.

FTP like performance

There are several factors which can improve the throughput down a connection

  • Make packets bigger. In the early days of TCP/IP a typical packet was 256 bytes. These days a typical default packet size can be 64KB or more.
    • One of the smarts in the protocol is called dynamic right sizing, where TCP will send increasingly larger packets until the receiver says “big enough”. The packet size can change with load.
  • How much data to send before waiting for the acknowledgement. For a reliable connection, where data is never lost, it is efficient to send a lot of data before waiting. This is called a large send window.
  • If the connection is unreliable, it may be more efficient to have only a small send window, before waiting for the acknowledgement.

Transactional work

  • Having big buffers may not improve throughput. For example with a web page, the data may all fit into 2KB; in this case having a buffer size of 16KB or 64KB may make no difference to throughput or performance.
  • Typically if one packet contains all the data, then it will be acknowledged as soon as it arrives.
  • Some web pages with a lot of javascript or images, may require big buffers, and many packets.

How to see what is going on

You can use the well known “ping” command to send data to the remote end, and get the response. This gives a measure of the network time.

I found most of the data needed for looking at performance is available from the netstat command. I found it useful to capture the output of the command in a file or data set.

What connections are connected to this server?

I use the netstat command in TSO, because my fingers are more used to it, and the command options are more memorable than those of the omvs command (for example with omvs netstat, do I need the -a or -A option?).

netstat conn (port 1414
netstat conn report hlq colin ( port 1414
netstat conn report dsn 'colin.output' ( port 1414

These all gave the same output. The report hlq colin creates a data set colin.netstat.conn. The data set name is from the hlq, ‘netstat’, and the subcommand. You can specify a data set name using the ‘dsn’ option.

For omvs you can use

netstat -c -p TCPIP -P 1414 > filename

That lists all of the connections for port 1414.

The command gave me

MVS TCP/IP NETSTAT CS V2R4       TCPIP Name: TCPIP           09:18:34    
User Id  Conn     Local Socket           Foreign Socket         State    
-------  ----     ------------           --------------         -----    
CSQ9CHIN 00000023 10.1.1.2..1414         10.1.0.2..60538        Establsh 
CSQ9CHIN 00000022 0.0.0.0..1414          0.0.0.0..0             Listen   

There is one connection established from 10.1.0.2 port 60538 to the server with the port listening on 1414.

The commands below give a lot of information about the connection

netstat all report hlq colin (ipport 10.1.0.2+60538
netstat -A -p TCPIP -B 10.1.0.2+60538 > all.port1

Output from the netstat command

The fields are described at the bottom of this page.

Both commands gave me the same output.

There is a lot of data. I’ve broken it into sections with comments after the interesting fields.

  MVS TCP/IP NETSTAT CS V2R4       TCPIP Name: TCPIP           09:23:29 
  Client Name: CSQ9CHIN                 Client Id: 00000023 
  Local Socket: 10.1.1.2..1414          Foreign Socket: 10.1.0.2..60538 
  BytesIn:            0000002988        BytesOut:           0000002912 
  SegmentsIn:         0000000019        SegmentsOut:        0000000011   
  
  • 09:23:29 is the time when the request was made. If you repeat the command you can get the interval between commands, and so calculate rates.
  • You get the client (job) name CSQ9CHIN.
  • The listener socket for the job (local socket) 10.1.1.2 with port 1414.
  • The foreign socket – the remote end of the connection. IP address 10.1.0.2 port 60538.
  • You can get the data rate if you repeat the command: calculate the deltas of BytesIn and BytesOut, and divide by the time between measurements. (See the sketch after this list.)
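As a sketch of that rate calculation (the second sample and the interval are made-up numbers; the first BytesIn matches the report above):

/* Sketch: deriving a rate from two netstat samples. */
#include <stdio.h>

int main(void)
{
    double bytes_in_1 = 2988.0;       /* BytesIn at the first sample        */
    double bytes_in_2 = 1052988.0;    /* BytesIn at the second (made up)    */
    double seconds_between = 300.0;   /* samples 5 minutes apart            */

    printf("inbound rate: %.1f KB/s\n",
           (bytes_in_2 - bytes_in_1) / seconds_between / 1024.0);
    return 0;
}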
  StartDate:          06/16/2021        StartTime:          10:00:21 
  Last Touched:       10:20:37          State:              Establsh 
  RcvNxt:             2019327903        SndNxt:             0864946572 
  ClientRcvNxt:       2019327903        ClientSndNxt:       0864946572 
  InitRcvSeqNum:      2019324914        InitSndSeqNum:      0864943659 
  CongestionWindow:  0000018720        SlowStartThreshold: 0000065535 
  

Look at the congestion window. Big is good. Small may indicate small amounts of data being sent or it may indicate network problems, either slow connections or packets are being dropped.

IncomingWindowNum:  2019458463        OutgoingWindowNum:  0865008524 
SndWl1: 2019327903 SndWl2: 0864946572
SndWnd: 0000061952 MaxSndWnd: 0000064256

Check the send window. A small (1KB) send window can indicate poor configuration at the remote client, or only small amounts of data are being sent.

SndUna:             0864946572        rtt_seq:            0864946064 
MaximumSegmentSize: 0000001440 DSField: 00
Round-trip information:
Smooth trip time: 6.000
SmoothTripVariance: 12.000

Monitor the smooth round trip time (in milliseconds); this is from the local end to the remote end, and back. The variance gives a measure of the spread of response times. These are not strictly averages.

If you had a million requests taking 1 millisecond, and then had a long request taking 1000 milliseconds, the “average” response time would change by a very small amount (to about 1.001 milliseconds). The smoothed (or weighted) average may be something like (99 * previous average + current value) / 100. In this case the “average” goes up to 10.99 milliseconds, which is noticeably different.
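A sketch of that arithmetic, using the numbers from the paragraph above:

/* Sketch: a plain average barely notices one 1000 ms outlier among */
/* a million 1 ms requests, but a weighted (smoothed) average does. */
#include <stdio.h>

int main(void)
{
    double n = 1000000.0;
    double plain    = (n * 1.0 + 1000.0) / (n + 1.0);  /* ~1.001 ms */
    double smoothed = (99.0 * 1.0 + 1000.0) / 100.0;   /* 10.99 ms  */

    printf("plain average %.3f ms, smoothed %.2f ms\n", plain, smoothed);
    return 0;
}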

ReXmt:              0000000000        ReXmtCount:         0000000000 

The retransmit counts should be zero – or not changing. If this number increases it means the network has lost packets.

DupACKs:            0000000000        RcvWnd:             0000130560  

The receive window is usually set to 2 * receive buffer.

SockOpt:            88                TcpTimer:           00   

Check SockOpt, bit 0x08. If set, this indicates “delayed acknowledgement disabled”. See the Nagle algorithm. This value being set is good.

If this is not set, then the sender can delay sending data for up to about 200 ms, and so combine data from different applications into the same packet for the same destination. This reduces network traffic as there are fewer packets, but it delays the data being sent.
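Checking the bit is simple; with the SockOpt value of 88 (hex) from the report above, 0x88 & 0x08 is non-zero, so delayed acknowledgement is disabled here:

/* Sketch: testing the "delayed acknowledgement disabled" bit in */
/* the SockOpt byte from the netstat report.                     */
#include <stdio.h>

int main(void)
{
    unsigned int sockopt = 0x88;     /* SockOpt value from the output above */
    if (sockopt & 0x08)
        printf("delayed ACK disabled - good\n");
    else
        printf("sends may be delayed up to ~200 ms for coalescing\n");
    return 0;
}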

TcpSig:             04                TcpSel:             40 
TcpDet: E4 TcpPol: 00
TcpPrf: 81 TcpPrf2: 20
TcpPrf3: 00

For FTP type applications check the TCP performance flag TcpPrf. This says if Dynamic Right Sizing (using bigger buffers) is enabled. The flag bits are 0x80 – enabled, 0x40 – active, 0x20 – active but disabled. 0x80|0x40 is good.

The TCP performance flag 2, TcpPrf2, is for Outbound Right Sizing (ORS). A non-zero value is good.

DelayAck:           Yes 
QOSPolicy: No
TTLSPolicy: No
RoutingPolicy: No
ReceiveBufferSize: 0000065536 SendBufferSize: 0000065536

These buffer sizes should be 64KB or larger; if so, the system can dynamically increase them.

They can be configured at the TCP/IP level, or by the application. If they are 64KB or higher then TCP Dynamic Right Sizing can be used (adjust the buffers to match the load).

ReceiveDataQueued:  0000000000 
SendDataQueued: 0000000000

These should always be zero.

  • Received data queued means the application is slow to retrieve the data
  • Send data queued – the application has issued a send – but TCP/IP cannot process it.
SendStalled:        No 
Ancillary Input Queue: N/A

Send stalled should always be no.

What do you need to check?

  • SendStalled, ReceiveDataQueued,SendDataQueued should all be 0. They usually are 0. They would be non zero if there was a problem right now. If the problem gets better, these values would be 0.
  • Check ReXmt = The total number of times a packet has been retransmitted for this connection. This count is historical for the life of the connection.
    • If this is zero then there have been no retransmits, and so no packets lost.
    • If this is non zero, then it could be a historical problem. Wait and reissue the netstat command. If the ReXmt value has changed, this indicates packets are being lost.
  • Check the round trip time (and variance). Is the value what you expected? If there is traffic flowing on the connection, display the value multiple times, and see if there is significant variation.
  • Check ReceiveBufferSize and SendBufferSize. Values of 64KB or larger are good. Small is not good.
  • Check congestion window.

It is good to have some data for a normal day as well as for a problem day. For example, if packets are often lost on a normal day, then packet loss may not be today’s problem. If the SendBufferSize is only 8KB today and was 64KB last week, this would be a good place to start looking. So capture and save NETSTAT reports for typical sessions.

What about connections into z/OS

Windows has a netstat command.

On Linux, netstat has been superseded by ss, for example

ss --info dst 10.1.1.2
ss --info dst 10.1.1.2:1414
ss --info src 10.1.0.2

This is ss dash dash info.

These give similar information for connections going to 10.1.1.2, to the address and port 10.1.1.2:1414, or from the source address 10.1.0.2.

Example netstat output from a slow inbound FTP connection

Client Name: IBMUSER                  Client Id: 000006FE 
Local Socket: 10.1.1.2..1109          Foreign Socket: 10.1.0.2..35508 
  BytesIn:            0220191104        BytesOut:           0000000000
  SegmentsIn:         0000152946        SegmentsOut:        0000083051
  StartDate:          06/28/2021        StartTime:          13:47:56 
  Last Touched:       14:24:28          State:              Establsh 
  RcvNxt:             3569682809        SndNxt:             2105824963
  ClientRcvNxt:       3569577977        ClientSndNxt:       2105824963
  InitRcvSeqNum:      3349491704        InitSndSeqNum:      2105824962
  CongestionWindow:   0000005760        SlowStartThreshold: 0000065535
  IncomingWindowNum:  3569946679        OutgoingWindowNum:  2105889219
  SndWl1:             3569681369        SndWl2:             2105824963
  SndWnd:             0000064256        MaxSndWnd:          0000064256
  SndUna:             2105824963        rtt_seq:            2105824962
  MaximumSegmentSize: 0000001440        DSField:            00 
  Round-trip information: 
    Smooth trip time: 3.000             SmoothTripVariance: 2.000 
  ReXmt:              0000000000        ReXmtCount:         0000000000
  DupACKs:            0000000000        RcvWnd:             0000263870 
  SockOpt:            A0                TcpTimer:           00 
  TcpSig:             04                TcpSel:             40 
  TcpDet:             E0                TcpPol:             00 
  TcpPrf:             E0                TcpPrf2:            28 
  TcpPrf3:            00 
  DelayAck:           Yes 
  QOSPolicy:          No 
  TTLSPolicy:         No 
  RoutingPolicy:      No 
  ReceiveBufferSize:  0000184351        SendBufferSize:     0000184320 
  ReceiveDataQueued:  0000104832 
    OldQDate:         06/28/2021        OldQTime:           14:24:27 
  SendDataQueued:     0000000000 
  SendStalled:        No 
  Ancillary Input Queue: N/A 
  Application Data:   EZAFTP0S D IBMUSER   C      FSSH 

Comments

  • Congestion window low
  • Smooth trip time: 3.00 good
  • ReXmt: 0 good
  • Receive buffer 184351 – good
  • Receive buffer queued 104832 – BAD

One Minute MVS performance – Workload Manager – looking at WLM reports.

I have a set of blog posts relating to getting started with z/OS performance. This blog post follows on from the overview of WLM, and describes the contents of the reports: how you can tell if work is being delayed, and why it is being delayed.

Real goals from my system

For TSO on my z/OS there are goals

  1. For the first 800 service units (a system-independent measure of CPU usage)
    1. 80% of requests to complete within 00:00:00.30
    2. Work has importance 2
  2. After this, any work has an execution velocity of 40.

For started tasks with Medium Priority the goals are

  1. Execution velocity of 30
  2. Importance 3

For started tasks with Low Priority the goals are

  1. Discretionary – there are no goals – just do your best

How do I tell what is going on and if the goals have been met?

RMF can display data in near real time (every minute or so).

RMF captures data and produces SMF records which can be processed by RMF and other products.

You can report on

  1. How well the service class did against its goals
  2. How well transactions or work did, from a reporting class.

You could have all CICS transactions in a service class, so they get the same CPU profile etc., but have different reporting classes. You can monitor CE* transactions and PAY* transactions differently.

You could have a reporting class for work coming in from other systems, depending on the userid.

I set up a reporting class for z/OSMF, and requested it in the RMF batch report with SYSRPTS(WLMGL(RCPER(ZOSMF))).

One part of the report contained


         z/OS V2R4               SYSPLEX ADCDPL             DATE 06/14/2021           INTERVAL 05.00.003   
                                 RPT VERSION V2R4 RMF       TIME 09.25.00
POLICY=ETPBASE                        REPORT CLASS=ZOSMF                                   PERIOD=1 
 -TRANSACTIONS--  TRANS-TIME HHH.MM.SS.FFFFFF  TRANS-APPL%-----CP-IIPCP/AAPCP-IIP/AAP  ---ENCLAVES--- 
 AVG        1.00  ACTUAL                    0  TOTAL        66.25       64.20  173.99  AVG ENC   0.00 
 MPL        1.00  EXECUTION                 0  MOBILE        0.00        0.00    0.00  REM ENC   0.00 
 ENDED         0  QUEUED                    0  CATEGORYA     0.00        0.00    0.00  MS ENC    0.00 
 END/S      0.00  R/S AFFIN                 0  CATEGORYB     0.00        0.00    0.00 
                                                                                                                
 ----SERVICE----   SERVICE TIME  ---APPL %---  --PROMOTED--  --DASD I/O---  ----STORAGE----  -PAGE-IN RATES- 
 IOC        2366K  CPU  720.505  CP     66.25  BLK    0.000  SSCHRT    0.2  AVG    81420.24  SINGLE      0.0 
 CPU      617333   SRB    0.223  IIPCP  64.20  ENQ    0.000  RESP      0.0  TOTAL  81421.05  BLOCK       0.0 
 MSO      154219   RCT    0.000  IIP   173.99  CRM    0.000  CONN      0.0  SHARED     0.00  SHARED      0.0 
 SRB         191   IIT    0.013  AAPCP   0.00  LCK    0.889  DISC      0.0                   HSP         0.0 
 TOT        3138K  HST    0.000  AAP      N/A  SUP    0.000  Q+PEND    0.0 
 GOAL: EXECUTION VELOCITY 70.0%     VELOCITY MIGRATION:   I/O MGMT  28.3%     INIT MGMT 28.3% 
                                                                                                                
          RESPONSE TIME    EX   PERF  AVG   --EXEC USING%--  -------------- EXEC DELAYS % -----------  
 SYSTEM                    VEL% INDX ADRSP  CPU AAP IIP I/O  TOT IIP CPU                                
 S0W1        --N/A--       28.3  2.5   1.0  8.0 N/A  20 0.0   72  53  19                               

Key fields:

INTERVAL 05.00.003

This tells you the duration of the RMF interval over which the data was collected (5 minutes).

POLICY=ETPBASE REPORT CLASS=ZOSMF PERIOD=1

This tells you this is a report class (rather than a service class), the name is ZOSMF, and it is for period 1. When you have service classes which have more than one period – such as high priority for the first 0.5 seconds of CPU, then low priority – these will have multiple periods.

-TRANSACTIONS–
AVG 1.00
MPL 1.00
ENDED 0
END/S 0.00

This says on average there was one instance running. You can have multiple transactions or jobs in a class. Add up the total duration of all jobs/transactions and divide by the interval to get the average(AVG).

MPL (multi programming level) is an advanced topic and describes how many instances were concurrently active.

No jobs/transactions ended in this interval, giving an ending rate of 0 per second over the 5 minutes.

—APPL %—
CP 66.25
IIPCP 64.20
IIP 173.99
AAPCP 0.00
AAP N/A

This shows the percentage of CPU used over the interval

  • 66.25 percent on GP engines
  • IIPCP 64.20 percent: 64.20% of a GP engine was running work that could have run on a ZIIP – if there had been spare ZIIP capacity. 66.25 – 64.20 = 2.05% of the work on a GP was not ZIIP eligible.
  • 173.99 percent of ZIIP work running on a ZIIP engine – so nearly 2 ZIIP engines were being used
  • 0 AAPCP – there was no ZAAP eligible work offloaded onto a GP
  • 0 AAP – there was no work running on a ZAAP

The total ZIIP-eligible work was 173.99% on ZIIP engines + 64.20% on a GP = 238%, or almost 2.5 ZIIP engines’ worth.

It is good to run on ZIIPs where possible, because ZIIPs are cheaper ($$) than GPs, and GPs may be configured to be slower than a ZIIP.

GOAL: EXECUTION VELOCITY 70.0%

The performance goal for this work was defined as Execution Velocity of 70 %.

 
         EX   PERF  AVG   --EXEC USING%--  - EXEC DELAYS % -
 SYSTEM  VEL% INDX ADRSP  CPU AAP IIP I/O  TOT IIP CPU      
 S0W1    28.3  2.5   1.0  8.0 N/A  20 0.0   72  53  19       
  • The achieved execution velocity was 28.3% against a target of 70%
  • The performance index was 2.5. For a velocity goal, the performance index is goal/achieved – here 70/28.3 ≈ 2.5 (see the sketch after this list). A value of 1 or smaller is good. The value here shows the goal was not met. You need to consider
    • Changing the goal for this work so the target goal is what you can achieve on a normal day
    • Changing the importance of the work for when the system is constrained.
    • If you change the goal for one set of work – it may impact other work, so you need to look at the system as a whole and decide which is your important work.
    • Add more CPUs or ZIIPs – these may not help if the delays are not CPU delays… see below.
  • Average number of address spaces in this class: 1.
  • EXEC USING%. The APPL% figures above were true CPU used. WLM also samples activity 4 times a second, recording whether work was running or waiting for a resource. Of those samples:
    • In 8% of the samples a GP engine was being used – this includes ZIIP-eligible work running on a GP.
    • In 20% of the samples a ZIIP engine was being used.
    • The ratio 8:20 is similar to the ratio of GP to ZIIP CPU actually used in this period, 66.25:173.99.
  • EXEC DELAYS
    • The total delay was 72% (= 100 – (8 + 20) "using" samples above).
    • For 53% of all the samples the work was waiting for a ZIIP engine.
    • For 19% of all the samples it was waiting for a GP engine.
    • You can have other delays listed here, for example paging, or capping to limit how much CPU the work is allowed.
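
Here is a minimal sketch of the performance index arithmetic as I understand it – my interpretation, not IBM's code. For a velocity goal the index is goal/achieved; for an average response time goal (used later in this post) it is achieved/goal:

#include <stdio.h>

/* Performance index: a value over 1 means the goal is being missed. */
static double pi_velocity(double goal, double achieved) {
    return goal / achieved;          /* velocity goal           */
}
static double pi_response(double achieved, double goal) {
    return achieved / goal;          /* avg response time goal  */
}

int main(void) {
    /* 70% velocity goal, 28.3% achieved -> 2.5 as in the report above */
    printf("velocity goal PI      = %.1f\n", pi_velocity(70.0, 28.3));
    /* 0.010s goal, 0.140526s achieved -> 14.1 as in the report below */
    printf("response time goal PI = %.1f\n", pi_response(0.140526, 0.010));
    return 0;
}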

Once z/OSMF had started and settled down, there were still delays for IIP (28%). To me this looks like a lumpy workload: perhaps a timer pops and runs multiple threads, and as there are more threads than ZIIPs, some have to wait.

Reports for transactional work

I defined a transaction so I could measure the response times (and CPU used) for a service in z/OSMF. A TSO address space is started, and z/OSMF sends a client/server request to the TSO address space. The response time is sub-second so a good candidate to demonstrate WLM for a transaction.

I configured z/OSMF to have

<zosWorkloadManager collectionName="MOPZCET"/>
<wlmClassification>
<httpClassification transactionClass="ZCI3" resource="/zosmf/webispf/*"/>
</wlmClassification>

The collection name is passed to WLM to determine the service class and report class of the work. The default is the server name.

All ISPF (with a URL of /zosmf/webispf/*) requests were classified as ZCI3.

I then used WLM to configure

  • a service class ZCI3 with Average response time of 00:00:00.010
  • a classification rule for subsystem type CB: a rule for CN=MOPZCET, and a sub-rule TC=ZCI3. This gave the service class and report class.

The data in the report had

-TRANSACTIONS--
AVG 0.01
MPL 0.01
ENDED 21
END/S 0.07

21 transactions in 5 minutes is 0.07 a second.

MPL (MultiProgramming Level) is the target number of address spaces that must be in the swapped-in state for the service class period to meet its goals. I've never used it!

TRANS-TIME HHH.MM.SS.FFFFFF
ACTUAL               140526
EXECUTION            139950
QUEUED                  575

The times are in microseconds (HHH.MM.SS.FFFFFF): the average response time (ACTUAL) was 140526 microseconds, about 0.140 seconds, of which 575 microseconds were spent queued.

GOAL: RESPONSE TIME 000.00.00.010 AVG

That was the specification in WLM (note the specified value of 0.010 is very different to the 0.140 achieved)


          RESPONSE TIME    EX   PERF  AVG   --EXEC USING%--  - EXEC DELAYS % -
 SYSTEM   HHH.MM.SS.FFFFFF VEL% INDX ADRSP  CPU AAP IIP I/O  TOT IIP 
 S0W1     000.00.00.140526 66.7 14.1   0.0  0.0 N/A  18 0.0  9.1 9.1  

This shows the average response time was 0.140 seconds; the work used 18% of a ZIIP engine, and was delayed waiting for a ZIIP for 9% of the samples.

To the right of the data in the report was

--- DELAY % --- 
UNK IDL CRY CNT                 
 64 0.0 0.0 0.0 

This says that 64% of the samples were an unknown delay. This could be:

  • waiting for end user input
  • waiting for TCP/IP data
  • the program sent off a request and is waiting for a response.

For example the ISPF transaction in z/OSMF had sent a request to an address space running TSO. This address space processed the request and sent the response back. I am guessing that the 64% delay was waiting for TSO to process the request and send back the response.

You also get a response time profile based on the service class

                              ----------RESPONSE TIME DISTRIBUTION---------- 
   -----TIME------  # TRANS   0    10   20   30   40   50   60   70   80   90   100 
   HH.MM.SS.FFFFFF  IN BUCKET |....|....|....|....|....|....|....|....|....|....| 
<= 00.00.00.014000          0  > 
<= 00.00.00.015000          0  > 
<= 00.00.00.020000          2  >>>>>> 
<= 00.00.00.040000          5  >>>>>>>>>>>>> 
>  00.00.00.040000         14  >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> 

This shows that out of the 21 requests, 7 were below 0.040 seconds, and 14 were over 0.040 seconds.
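
If you wanted to build the same kind of profile from your own response time data, the bucketing logic is just this. A sketch: the bucket limits are read off the report above, and the response times are made up for illustration.

#include <stdio.h>

int main(void) {
    /* Bucket limits in seconds; the fifth bucket is "> 0.040" */
    double limit[]  = { 0.014, 0.015, 0.020, 0.040 };
    int    count[5] = { 0 };
    /* Hypothetical response times - the individual times are not in the report */
    double resp[]   = { 0.018, 0.019, 0.035, 0.036, 0.050, 0.150 };
    int n = (int)(sizeof resp / sizeof resp[0]);

    for (int i = 0; i < n; i++) {
        int b = 4;                          /* default: over the last limit */
        for (int j = 0; j < 4; j++)
            if (resp[i] <= limit[j]) { b = j; break; }
        count[b]++;                         /* first bucket the time fits in */
    }
    for (int b = 0; b < 5; b++)
        printf("bucket %d: %d\n", b, count[b]);
    return 0;
}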

The service class goal was GOAL: RESPONSE TIME 000.00.00.010 AVG, so this goal was very badly specified. It would be better set to the achieved average of about 0.140 seconds.

I changed the service class to a goal of 0.140 seconds and activated it. After I had run some tests the output was

          RESPONSE TIME    EX   PERF  AVG   --EXEC USING%--  - EXEC DELAYS %
 SYSTEM   HHH.MM.SS.FFFFFF VEL% INDX ADRSP  CPU AAP IIP I/O  TOT            
 S0W1     000.00.00.097733  100  0.7   0.0  0.0 N/A  50 0.0  0.0            

This showed no delays,

and a response time profile

                                ---RESPONSE TIME DISTRIBUTION--- 
    -----TIME------  --# TRANS  0    10   20   30   40   50   60
    HH.MM.SS.FFFFFF  IN BUCKET  |....|....|....|....|....|....|.
 <= 00.00.00.070000          0  > 
 <= 00.00.00.084000          5  >>>>>>>>>>>>>>> 
 <= 00.00.00.098000          9  >>>>>>>>>>>>>>>>>>>>>>>>>> 
 <= 00.00.00.112000          1  >>>> 
 <= 00.00.00.126000          0  > 
 <= 00.00.00.140000          1  >>>> 
 <= 00.00.00.154000          1  >>>> 
 <= 00.00.00.168000          0  > 
 <= 00.00.00.182000          0  > 
 <= 00.00.00.196000          0  > 
 <= 00.00.00.210000          1  >>>> 
 <= 00.00.00.280000          0  > 

An average of 0.10 seconds, with some taking up to 0.210 seconds.

Real time information

You can get the information in near real time from RMF (or other monitors)

For example for processor delays

            Service  CPU  DLY USG EAppl  ----------- Holding Job(s) ---------
Jobname  CX Class    Type  %   %    %     %  Name      %  Name      %  Name 
IZUSVR1  SO STCHIM   CP     2  35 56.53   91 IZUSVR1    4 JES2MON    2 TCPIP 
                     IIP   94  95 183.1   89 IZUSVR1                         

This shows that job IZUSVR1

  • Was delayed for 2% of the time on a GP
  • Used 35% of the GP engines
  • Was delayed 94% of the time on a ZIIP
  • and used 95% of the available ZIIP resource
  • The jobs using the GP engines were IZUSVR1 (91%), JES2MON (4%) and TCPIP (2%)
  • The job using the ZIIPs was IZUSVR1 (89%)

What to do now?

You need to identify the goals of your work, and set sensible goals. This may take several iterations. You also need to understand the relative priorities of the work, down to the level of individual jobs and userids.

Once you have configured your system to report on response times of your business critical work, you can adjust the service classes so your work achieves its goals.

Define reporting classes so you can monitor different groups of work and check that they are meeting their goals.

One Minute MVS performance – Work Load Manager – background

Question: In your car how do you tell if your car has a problem? Answer: You look at the dashboard and see if there is a red light showing. You may not know how to fix it – but you know that you need to get help to fix it.

The aim of this series of blog posts is to show you what to look for in z/OS performance and if you have a problem.

I will cover

I've written a blog post on how to understand reports from WLM.

Managing workload

40 years ago the pilot of a commercial jet had many knobs and dials to control the performance (and speed) of the aeroplane. These days computers do most of the small tuning; the pilot sets the overall goals, and the computer does the rest.

It is the same with managing workload performance on z/OS. The systems programmer used to individually adjust the performance of jobs and transactions running on z/OS. These days the systems programmer sets the overall goals and the computer does the rest.

The work on z/OS is managed by the WorkLoad Manager (WLM). The systems programmer defines goals like:

  • these CICS transactions should run with an average response time of 1 second.
  • The trivial TSO commands should run in under 10 milliseconds.
  • Batch – I don't care… it can run in the background.

I remember one customer saying that the day he switched WLM on (so that WLM managed the workload), he noticed that the batch workload finished early – it made everything go faster!

This is because when the system had been manually "tuned", the CICS transactions were finishing in under half a second (much faster than the requirement of 1 second). WLM worked to the goals, so the CICS transactions executed with a response time of 1 second. This meant there was spare CPU, and more batch workload could be done during the day.

How to monitor work?

For short lived requests, like CICS transactions or TSO commands, the response time is the obvious metric. Typically the response time is under a second, and at the end of a minute you should have many data points to tell you if you are achieving your response time goals.

Long running jobs or started tasks may run for weeks, so the time to run the job is meaningless. It could run slower when the system is busy, or run faster when the system is lightly loaded. It needs a second by second metric to measure progress.

WLM uses the concept of how much the work can be delayed. It uses a metric called Execution velocity which in concept is “the ratio of CPU used” to “time waiting for a resource”. In simple terms

100 * CPU used /(CPU used + wait time for a resource).

WLM can periodically check this ratio and adjust the priority of the work to achieve the goals.

If the execution velocity is 100 then it is not delayed for CPU.

If the execution velocity is 1, then if the work used 1 second (or millisecond) of CPU, it is acceptable for the job to be delayed for I/O or waiting for CPU for 100 seconds (or milliseconds). The ratio is important – not the absolute values.

How does WLM work out which work to dispatch?

Every 10 seconds WLM looks at the data to decide which service classes need more or less CPU.

WLM looks at all the work and checks whether it is meeting the goals defined for its service class.

  1. If all the work is within its goals, pick any waiting work to dispatch.
  2. If any work is not within its goal, adjust the dispatching priorities. Start with work with Importance 1; when that is within its goals, look at work with Importance 2, and so on. A sketch of this decision is below.
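
A minimal sketch of that decision, with made-up service classes and performance indexes – the real WLM algorithm is far more sophisticated:

#include <stdio.h>

/* A performance index (PI) over 1 means the class is missing its goal. */
struct svc { const char *name; int importance; double pi; };

int main(void) {
    struct svc classes[] = {
        { "ONLINE", 1, 0.8 },   /* meeting its goal */
        { "TSOHI",  2, 2.5 },   /* missing its goal */
        { "BATCH",  3, 4.0 },   /* missing its goal */
    };
    int n = (int)(sizeof classes / sizeof classes[0]);

    for (int imp = 1; imp <= 5; imp++)          /* Importance 1 first */
        for (int i = 0; i < n; i++)
            if (classes[i].importance == imp && classes[i].pi > 1.0) {
                /* help the most important class missing its goal,
                 * then re-assess at the next interval               */
                printf("boost priority of %s\n", classes[i].name);
                return 0;
            }
    printf("all goals met: dispatch any waiting work\n");
    return 0;
}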

What can you configure?

You can configure the system with goals like

  1. CICS transactions should take 1 second elapsed time to execute.
  2. Quick TSO commands, using less than 0.5 seconds of CPU, have high velocity.
  3. Slow TSO commands, using more than 0.5 but less than 5 seconds of CPU, have Importance 3.
  4. Expensive TSO commands, using more than 5 seconds of CPU, have low priority.
  5. Colin's TSO userid always gets high priority regardless of the commands.
  6. Batch jobs with this accounting information can run with high velocity.
  7. Long batch jobs, or batch jobs using more than 1 second of CPU, have low priority.

Work can be tracked across the sysplex. If WLM detects that CICS transactions are slowing down, then when a CICS transaction issues a DB2 request on a different LPAR in the sysplex, WLM makes sure the request in DB2 has a high enough priority to keep the response time goals. WLM can also prioritise I/O, so that the I/O for one transaction takes precedence over the I/O for a batch job.

The systems programmer creates a few broad categories of work, and specifies the goals of the service class. These service classes control the priority of work.

Use the WLM redbook for guidance on defining WLM service classes.

Service classes define the goals.

You can have reporting classes for groups of similar jobs or transactions, to report WLM information on them. So although a group of work has the same service class, you can report on it in different ways, for example by transaction, or by userid.

You can define CICS, IMS, or Liberty as a server, and transactions/work within the server get WLM classified. So for job CICSA, the transactions PAY1,PAY2,PAY3 have high priority; for z/OSMF, userid COLIN has high priority.

What class is my work in?

You can display which service class a job is in using the D A,jobname operator command, for example it gave

WKL=STARTED SCL=STCLOM .

The service class (SCL) is STCLOM.

You can use SDSF DA, and use the column SrvClass. (You need to start RMF, then go into SDSF to display the Srvclass and other WLM related parameters).

You can change the service class of a job by using the operator command

RESET IZUSVR1,SRVCLASS=STCMDM

(Note which way round the operands go: the jobname IZUSVR1 first, then the service class, SRVCLASS=…)

or, if you are authorised, overtype the field in SDSF, or from z/OSMF WLM plugin.

To change it permanently you’ll need to change the WLM definitions.

More details of how it works

I found the WLM redbook useful.

I described above that the execution velocity was 100 * CPU used /(CPU used + wait time for a resource).

The concept is correct – but the implementation is different. If my job had used 1000 seconds of CPU since it started, a lifetime ratio would not be helpful for seeing its behaviour over the last few minutes – the execution velocity would be insensitive.

Every 250 milliseconds (4 times a second) WLM samples every job/transaction in the system, and updates counters in internal control blocks for each service class and report class. If the work was:

  • executing – add 1 to the active (or using) CPU count
  • transferring data to a device (connect time) – add 1 to the active (or using) I/O count
  • waiting in z/OS to start an I/O – add 1 to the delayed for I/O count
  • being paged in – add 1 to the delayed for "page in" count
  • etc.
  • waiting for the end user to enter data – do nothing.
  • waiting for TCP/IP data – do nothing.

Execution velocity = 100 * Total active samples / (Total active samples + Total delayed samples).

If during a 25 second period the transaction was

  • using CPU, in 20 samples,
  • transferring data to disk, in 10 samples,
  • waiting to start an I/O, in 25 samples
  • waiting for the end user to type some data, in 45 samples.

From this we can see…

  • The count of active samples is 20 + 10
  • The count of delayed is 25 samples.
  • 45 samples are not used.

Execution velocity = 100 * (20 + 10) / ((20 + 10) + 25) ≈ 55.

An execution velocity of 100 means that when ever the job was sampled, it was always either dispatched and running; or transferring data to I/O.

As an example of interpreting execution velocity: if we expect a job to use 50 seconds of CPU, and it has a velocity of 10, then we expect the job to run in about 500 seconds. If it used 50 seconds of CPU, spent 20 seconds transferring data (connect time), and was delayed for 450 seconds, the execution velocity would be 100 * (50 + 20) / ((50 + 20) + 450) = 13%, which is close enough to the 10% velocity.
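
The sample-based calculation is easy to code. A minimal sketch using the 25 second example above (at 4 samples a second there are 100 samples in total):

#include <stdio.h>

int main(void) {
    int using_cpu = 20;   /* sampled while dispatched on an engine       */
    int using_io  = 10;   /* sampled while transferring data (connect)   */
    int delayed   = 25;   /* sampled while waiting to start an I/O       */
    /* The 45 idle samples (waiting for the end user) are not counted.   */

    int using = using_cpu + using_io;
    double velocity = 100.0 * using / (using + delayed);
    printf("execution velocity = %.0f%%\n", velocity);   /* about 55 */
    return 0;
}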

Real goals from my system

For TSO on my z/OS there are goals

  1. For the first 800 service units (a system-independent measure of CPU usage)
    1. 80% requests to complete within 00:00:00.30
    2. Work has importance 2
  2. After this, any work has an execution velocity of 40.

For started tasks with Medium Priority the goals are

  1. Execution velocity of 30
  2. Importance 3

For started tasks with Low Priority the goals are

  1. Discretionary – there are no goals – just do your best.

One Minute MVS performance – DASD

Question: In your car how do you tell if your car has a problem? Answer: You look at the dashboard and see if there is a red light showing. You may not know how to fix it – but you know that you need to get help to fix it.

The aim of this series of blog posts is to show you what to look for in z/OS performance and if you have a problem.

I will cover

For some of these you need data from z/OS. This post describes how to get the SMF data, and format it using RMF.

DASD has changed in 40 years

40 years ago "disk storage" was on huge rotating disks and you had to carefully manage where you put your datasets – both which disk, and whereabouts on the disk. For example, people would put a hot dataset in the "centre" of the disk to minimise the time to move the heads.

For the last 20 years people have used the term "storage", because most I/O activity goes to cache in the disk controller, and the disk controller writes the data out to PC-sized disks – which in turn may be solid state, with no moving parts.

A pictorial view of disks

  • You have the processor running z/OS
  • Plugged into the side of the processor is the I/O adapter
  • Plugged into this I/O adapter are a lot of channels (think optical fibre cables)
  • These cables can be plugged into a switch – think of a plug board or telephone exchange. This allows channels from two processors to be plugged into the switch, with one cable down to the storage controller. You could avoid a switch and have cables directly from the processor to the storage controller, but each processor would need its own set of cables.
  • The storage controller manages all of the IO
    • It has a lot of cache so most I/O may go to the cache. During a read, the storage controller will read from the disks if the data is not in the cache.
    • It has many PC type of disks. These disks could be solid state, or have rotating disks
    • If you have mirrored disks, the storage controller talks to a remote storage controller
  • Within each channel are many logical sub-channels. Each disk has at least one sub-channel allocated to it, and can have more. There can also be a pool of sub-channels which are used as needed to allow parallel I/O to a disk.

The I/O journey

  • Your application wants to read the first record of a file.
  • Once the file has been opened, the application can issue the read.
  • z/OS knows where the data set is on disk (eg VOLID A4USR1, Cylinder 200, track 4)
  • z/OS builds up a set of commands (such as locate disk, locate cylinder 200, locate track 4, read data, read data, read data) to get the data and issues the Start Sub channel request, passing the list of I/O commands.
  • This is queued to the I/O adapter.
  • The original application is suspended (until the I/O is complete)
  • The I/O adapter looks for a free sub-channel for the disk, or gets one from the sub-channel pool.
  • The I/O adapter takes the list of commands, and executes them one at a time.
  • When the I/O adapter has finished the list of commands, it sends an interrupt to the mainframe saying “this subchannel has finished”.
  • z/OS wakes up, looks at the interrupt, and resumes the application.

Today there are three areas where you can get delays; you need to be an expert if you want to look at more detail than this:

  1. Waiting in the I/O adapter before being able to get a sub-channel. This is known as IOSQ – I/O subsystem queueing.
  2. Establishing the connection from the processor to the storage controller.
  3. Transferring the data – the connect time.

This is complicated by being able to use disks 50 km away, which adds to the delay time.

RMF Reports

In the RMF MFR000… report, find the section D I R E C T A C C E S S D E V I C E A C T I V I T Y (I search for IOSQ).


                  DEVICE   AVG  AVG   AVG  AVG  AVG   AVG  AVG  AVG    %      %    
 VOLUME PAV  LCU  ACTIVITY RESP IOSQ  CMR  DB   INT   PEND DISC CONN   DEV    DEV  
 SERIAL   1       RATE     TIME TIME  DLY  DLY  DLY   TIME TIME TIME   CONN   UTIL 
 A4RES1   1       102.896  .044 .003  .001 .000       .004 .000 .036   0.38   0.38 
 A4RES2   1        27.293  .036 .000  .001 .000       .003 .000 .032   0.09   0.09 
 USER00   1        25.331  .031 .003  .001 .000       .004 .000 .024   0.06   0.06 
 A4SYS1   1       365.102  .026 .005  .001 .000       .004 .000 .017   0.62  24.52 

Key fields

  1. Volume Serial such as A4RES1 is the volid of the disk
  2. PAV – I’ll mention this below.
  3. Device Activity Rate – how many requests (start sub channel) from z/OS, per second
  4. Average response time in milliseconds
  5. Average IOSQ – how long did it have to wait in z/OS and the I/O adapter before the request was sent down to the storage controller

The times are in milliseconds.
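
My reading of the report (worth checking against your own data) is that the average response time is roughly the sum of the component times. For the A4RES1 row:

#include <stdio.h>

int main(void) {
    /* The A4RES1 row from the report above, times in milliseconds */
    double iosq = 0.003;  /* queued in z/OS and the I/O adapter        */
    double pend = 0.004;  /* getting to the storage controller         */
    double disc = 0.000;  /* disconnect, eg waiting on the back end    */
    double conn = 0.036;  /* transferring the data                     */

    /* prints 0.043, against the reported AVG RESP TIME of .044 (rounding) */
    printf("sum of components = %.3f ms\n", iosq + pend + disc + conn);
    return 0;
}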

There are often thousands of volumes in a z/OS environment; some are heavily used, some are not used at all. See below for how to find the hot volumes.

I typically look at the volumes with the highest I/O. If the hot volumes have good response time, the not so hot should be OK.

If you think of the sub-channel connection between the mainframe and the volume in the storage controller, there can only be one I/O request at a time per sub-channel. You can have multiple connections down to a volume. These are known as PAV, or Parallel Access Volumes. The PAV figure in the report is the average number of sub-channels in use.

The first field to look at is the IOSQ time. This is the time between z/OS starting the request and the I/O being started to the storage controller. It should be small – tens of microseconds (the 0.0xx millisecond values in the report above). If this value is larger, you need to speak to your storage manager or z/OS systems programmer.

The second field to look at is % DEV UTIL – how busy the connection to the storage controller was. A value of 100% means it was running flat out. A utilisation of around 70-80% may be OK – just something to note. More PAVs can increase throughput for a busy disk.

The next figure to look at is RESP TIME. This is the response time the application sees. For local disks, response times of under 1 millisecond are OK. If you have remote disks and synchronous I/O, the response time will be longer.

Finding the hot volumes

I take the RMF report and extract the DASD records.

  • For SDSF where the output is in the spool
    • I use Status to list all of the jobs (Output or Hold work just as well)
    • Put ? in front of the job to show all of the spool data sets
    • use the SE command to Spool Edit the report
  • For a dataset I use the View prefix command in ISPF 3.4
  • Put DD in line prefix area on line 1
  • Find ‘D I R E C T’
  • Put DD in line prefix area, press enter, to delete the lines above it
  • Find ‘D I R E C T’ last
  • put d9999 in the line prefix area following the data (My report has ‘P A G I N G’), and press enter.
  • You should now have only DASD records
  • Put ‘cols’ in the line command area, note the columns of the DAR (50 to 58)
  • In the command line type SORT 50 58 D to sort descending on Device Activity Rate.
  • This shows you the top usage volumes. Check the response times. Under 1 millisecond is good for locally attached disks. It can be down to 0.1 ms
  • If the response time is 1 ms or larger…
    • Check columns 60-65 (AVG IOSQ TIME); this should be 0. If it is non-zero it means there was queueing in z/OS before the request got to the disk. If there is only one I/O request at a time to the volume, the IOSQ will be zero; with multiple concurrent I/O requests you can get IOSQ queueing time.
    • Any IOSQ could be reduced by moving data sets to other volumes, or adding more paths (sub-channels) between the mainframe and the disks. Each disk requires at least one sub-channel. You can allocate more in a pool, which are used when needed, but this is a z/OS systems programmer/storage manager job.
    • As a performance person you can control which disks you use, and can spread the load.
    • Avg CMR (ComMand Response) is the time for the request to get from the processor down to the storage controller, and for the controller to respond with "I've got the request". This should be small. It lets you see whether delays are in getting to the storage controller, or within the controller.

If you do this for all disks you get an overall view of the data. Now you can select the DASD volumes you are using and check those.

If you find you have a long response time, then it is hard to find out the root cause. There are many links in the end to end chain. See here for more information.

One Minute MVS performance – CPU at the LPAR level

Question: In your car how do you tell if your car has a problem? Answer: You look at the dashboard and see if there is a red light showing. You may not know how to fix it – but you know that you need to get help to fix it.

The aim of this series of blog posts is to show you what to look for in z/OS performance and if you have a problem.

I will cover

For some of these you need data from z/OS. This post describes how to get the SMF data, and format it using RMF.

CPU

There are two basic things you need to check

  1. Has my LPAR got all the CPU it wanted – has the hypervisor restricted the CPU?
  2. How busy are my CPUs?

Has my LPAR got all the CPU it wanted

An LPAR can be configured to have dedicated engines, or share a pool of engines. Dedicated engines means that the engine is always there when it is needed. If the LPAR is using a shared engine, it may not always be available when needed.

An example to explain the concept

You have a class from 10am to 11am. You go in, and sit down. The teacher starts the class. The teacher's phone rings and the teacher goes out of the classroom. You play with your phone until the teacher comes back after 40 minutes. (The teacher went to teach in a different classroom.)
How long were you in class for and how much work did you do?

  • You were in class for 1 hour.
  • You did 20 minutes work.

This concept is the same as any LPAR with shared engines.

  • The 1 hour class is a time slice as seen by z/OS.
  • The “processor” (teacher) was used in the time slice for only 20 minutes
  • For 40 minutes the “processor” was doing work elsewhere.

How do you get the report to show these figures?

You need the RMF CPU report. It has “C P U A C T I V I T Y “ at the top of the page.

Look at the section

---CPU---    ---------------- TIME % ----------------   
NUM  TYPE    ONLINE    LPAR BUSY    MVS BUSY   PARKED   
 0    CP     100.00    46.68        46.32        0.00   
 1    CP     100.00    38.98        38.78        0.00   
 2    CP     100.00    34.91        34.62        0.00   
TOTAL/AVERAGE          40.19       39.90               
 3    IIP    100.00    94.43        94.70        0.00   
 4    IIP    100.00    93.50        93.74        0.00   
TOTAL/AVERAGE          93.96       94.22               

LPAR BUSY is how much teacher time you got

MVS Busy is how much time you were in the classroom for.

  • If MVS BUSY TIME = LPAR BUSY TIME, perfect, what you needed you got.
  • If MVS BUSY TIME > LPAR BUSY TIME, MVS had to wait for an engine; the system may need more CPU. A small difference (5%) is OK.
  • If MVS BUSY TIME >> LPAR BUSY TIME, then for much of the time there was no engine available when MVS needed one. This will have a major impact on your work. If your end-user work is not meeting targets, you need more CPUs, or to give your LPAR a higher dispatching priority.

In the report above the values are similar: MVS BUSY 39.90 is close to LPAR BUSY 40.19, and for the ZIIPs, 94.22 is close to 93.96.

When these figures are significantly different, stop and fix the problem. It can make all other performance data look bad – for example disk response times, and timings in application trace entries.

How busy are my CPUs?

The TOTAL/AVERAGE will be close to 100% on a busy system. 95% busy is OK, but make a note that the system may be short of CPU.

These are average values, and the individual values could be spiky: for example 100% busy for 4 minutes and 80% busy for 1 minute averages to 96% busy over 5 minutes. Consider using an online monitor to see if you have big peaks and troughs.

A more advanced topic, for information.

The following section gives you information on how much work was waiting. It is hard to say what is good or bad: the figures can look bad while all the performance goals are still being met.

How much work was waiting?

-----------------------DISTRIBUTION OF IN-READY WORK UNIT QUEUE--------------
 NUMBER OF              0    10   20   30   40   50   60   70   80   90   100
 WORK UNITS     (%)     |....|....|....|....|....|....|....|....|....|....|  
                                                                             
<=  N          26.3     >>>>>>>>>>>>>>                                       
 =  N +   1   12.9     >>>>>>>                                              
 =  N +   2    10.1     >>>>>>                                               
 =  N +   3    10.1     >>>>>>                                               
<=  N +   5    12.5     >>>>>>>                                              
<=  N +  10    11.0     >>>>>>                                               
<=  N +  15     6.0     >>>>                                                 
<=  N +  20     5.2     >>>                                                  
<=  N +  30     1.6     >                                                    
<=  N +  40     0.6     >                                                    
<=  N +  60     1.1     >                                                    
<=  N +  80     1.1     >                                                    
<=  N + 100     0.8     >                                                    
<=  N + 120     0.1     >                                                    
<=  N + 150     0.0                                                          
>   N + 150     0.0                                                          

N is the number of CPUs. I have 5 on my system.

The data is sampled. If the system is sampled 10 times a second, then every 0.1 seconds RMF counts the number of work units in the "ready to dispatch" queue and increments the count in the appropriate bucket. For example, if there were 5 tasks executing and one task waiting, it increments the N + 1 element.

  • 26.3 % of the time, there were no tasks waiting for CPU.
  • 12.9 % of the time, there was 1 task waiting for CPU – see the N + 1 line in the data above (12.9, >>>>>>>).
  • 10.1 % of the time, there were 2 tasks waiting for CPU
  • 5.2 % of the time there were between 16 and 20 tasks waiting for CPU
  • 0.1 % of the time there were between 101 and 120 task waiting for CPU

Remember this could be waiting for CP, or IIP.

If there are hundreds of tasks waiting for CPU you should make a note of it. It may not be a problem.

If there are under 50 tasks waiting for CPU, this should be OK.

On a busy system there will always be work waiting to run. Compare the pictures from a busy time and a not so busy time.

Is this important?

I once did some measurements with MQ on a machine with 16 processors, on average the engines were about 5% busy. A performance person from IBM said that my workload showed a shortage of CPU! 5 % busy on 16 processors – was I really short of CPU?

My application received some data, and posted 30 threads to come and process the data. The first 15 threads could be dispatched because there were 15 unused CPUs. 15 threads had to wait.

This showed up in the above report at line N + 15: tasks were waiting there 20% of the time.

Out of the 30 threads that were posted, one processed the work; the other 29 found nothing to do and went back to sleep.

We changed the program to post no more threads than the number of CPUs (16) in the LPAR, and had a significant saving in CPU.
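
A minimal sketch of the idea behind the fix (hypothetical numbers – the real code managed the threads, this just shows the cap):

#include <stdio.h>

/* Wake at most one worker per engine, rather than waking every worker
 * for every piece of work.                                            */
static int min(int a, int b) { return a < b ? a : b; }

int main(void) {
    int workers_waiting = 30;   /* threads waiting for work */
    int engines         = 16;   /* CPUs in the LPAR         */

    int to_post = min(workers_waiting, engines);
    printf("post %d of %d workers\n", to_post, workers_waiting);
    return 0;
}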

One minute MVS performance – getting batch RMF reports

There is an introduction to getting RMF reports documented here.

You can display information about your SMF environment, using

D SMF,O

This tells you whether you are using SMF data sets, or log streams (in the coupling facility), for the RMF data.

Copy the data from SMF dataset

// SET SMFPDS=SYS1.S0W1.MAN3 
// SET SMFSDS=SYS1.S0W1.MAN4 
//* 
//SMFDUMP  EXEC PGM=IFASMFDP 
//DUMPINA  DD   DSN=&SMFPDS,DISP=SHR,AMP=('BUFSP=65536') 
//DUMPINB  DD   DSN=&SMFSDS,DISP=SHR,AMP=('BUFSP=65536') 
//* MPOUT  DD   DISP=(NEW,PASS),DSN=&RMF,SPACE=(CYL,(1,1)) 
//DUMPOUT  DD   DISP=SHR,DSN=IBMUSER.RMF,SPACE=(CYL,(1,1)) 
//SYSPRINT DD   SYSOUT=* 
//SYSIN  DD * 
  INDD(DUMPINA,OPTIONS(DUMP)) 
  INDD(DUMPINB,OPTIONS(DUMP)) 
  OUTDD(DUMPOUT,TYPE(70:79)) 
  DATE(2020316,2022284) 
  START(0930) 
  END(1700) 
/* 

This job copies the RMF records from the SMF "MAN" data sets and writes them to the data set on the DUMPOUT DD.

The RMF records with types 70 to 79 are copied, within the specified dates and start and end times.

Copy the data from a log stream.

SMF can write data to log streams, for example MQ records go to the MQ stream, and the RMF records go to the RMF Stream.

//SMFDUMP EXEC PGM=IFASMFDL
//DUMPOUT DD DSN=&&TEMP,SPACE=(CYL,(10,10),RLSE),DISP=(NEW,PASS)
//SYSPRINT DD SYSOUT=*
//SYSIN DD *
DATE(2018004,2018012)
START(0900)
END(1900)
LSNAME(IFASMF.RMF,OPTIONS(ALL))
OUTDD(DUMPOUT,TYPE(70:79,116))
/*

This step writes the data to a temporary data set (&&TEMP), which can be passed to a later step in the same job.

Sort the data

If you are processing the data from more than one LPAR you will need to sort the data. See here.

Format the RMF data

The RMF control statements are described here

//POST EXEC PGM=ERBRMFPP 
//* use the following if using a temporary data set in same job.
//*MFPINPUT DD DISP=SHR,DSN=*.SMFDUMP.DUMPOUT 
//MFPINPUT DD DISP=SHR,DSN=IBMUSER.RMF 
//PPXSRPTS DD   SYSOUT=*,DCB=(LRECL=200,BLKSIZE=4000) 
//SYSPRINT DD   SYSOUT=* 
//SYSOUT   DD   SYSOUT=* 
//SYSIN DD * 
  SUMMARY(INT,TOT) 
  NODELTA 
  SYSOUT(H) 
  REPORTS(ALL,NOENQ,DEVICE(NOUNITR,NOCOMM)) 
SYSRPTS(WLMGL(RCPER)) 
SYSRPTS(WLMGL(SCPER)) 
/* 
/* SYSRPTS(WLMGL(RCPER)) 
/* SYSRPTS(WLMGL(SCPER,RCLASS,RCPER,SCLASS)) 

This takes the records from MFPINPUT which could be a permanent data set, or a temporary data set passed from a previous job step.

You can have the output go to the spool (by default) or to preallocated data sets. See here.

One Minute MVS performance reports of interest are

REPORTS(CPU,DEV(DASD))

For WLM the reports

SYSRPTS(WLMGL(RCPER,SCPER))