One Minute MVS performance – CPU at the LPAR level

Question: how do you tell if your car has a problem? Answer: You look at the dashboard and see if there is a red light showing. You may not know how to fix it – but you know that you need to get help to fix it.

The aim of this series of blog posts is to show you what to look for in z/OS performance, and how to tell if you have a problem.

I will cover a number of topics over the series.

For some of these you need data from z/OS. This post describes how to get the SMF data, and format it using RMF.

CPU

There are two basic things you need to check

  1. Has my LPAR got all the CPU it wanted – has the hypervisor restricted the CPU?
  2. How busy are my CPUs?

Has my LPAR got all the CPU it wanted

An LPAR can be configured to have dedicated engines, or share a pool of engines. Dedicated engines means that the engine is always there when it is needed. If the LPAR is using a shared engine, it may not always be available when needed.

An example to explain the concept

You have a class from 10am to 11am. You go in, and sit down. The teacher starts the class. The teacher’s phone rings, and the teacher goes out of the classroom. You play with your phone until the teacher comes back after 40 minutes. (The teacher went to teach in a different classroom.)
How long were you in class for and how much work did you do?

  • You were in class for 1 hour.
  • You did 20 minutes work.

This concept is the same as any LPAR with shared engines.

  • The 1 hour class is a time slice as seen by z/OS.
  • The “processor” (teacher) was used in the time slice for only 20 minutes
  • For 40 minutes the “processor” was doing work elsewhere.

How do you get the report to show these figures?

You need the RMF CPU report. It has “C P U  A C T I V I T Y” at the top of the page.

Look at the section

---CPU---    ---------------- TIME % ----------------   
NUM  TYPE    ONLINE    LPAR BUSY    MVS BUSY   PARKED   
 0    CP     100.00    46.68        46.32        0.00   
 1    CP     100.00    38.98        38.78        0.00   
 2    CP     100.00    34.91        34.62        0.00   
TOTAL/AVERAGE          40.19       39.90               
 3    IIP    100.00    94.43        94.70        0.00   
 4    IIP    100.00    93.50        93.74        0.00   
TOTAL/AVERAGE          93.96       94.22               

LPAR BUSY is how much teacher time you got.

MVS BUSY is how much time you were in the classroom.

  • If MVS BUSY TIME = LPAR BUSY TIME, perfect, what you needed you got.
  • If MVS BUSY TIME > LPAR BUSY TIME, MVS had to wait for an engine; the system may need more CPU. A small difference (5%) is OK.
  • If MVS BUSY TIME >> LPAR BUSY TIME, for much of the time there was no engine available when MVS needed one. This will have a major impact on your work. If your end-user work is not meeting targets, you need more CPUs, or need to give your LPAR a higher dispatching priority.

These values should be similar: MVS BUSY 39.90 is close to LPAR BUSY 40.19, and for the zIIPs, LPAR BUSY 93.96 is close to MVS BUSY 94.22.

When these figures are significantly different, stop and fix the problem: a shortage of engines can make all other performance data look bad, for example disk response times, and timings in application trace entries.

How busy are my CPUs?

The TOTAL/AVERAGE value will be close to 100% on a busy system. 95% busy is OK, but make a note that the system may be short of CPU.

These are average values. The individual values could be spiky. For example, 100% busy for 4 minutes and 80% busy for 1 minute is the same as an average of 96% busy over 5 minutes. Consider using an online monitor to see if you have big peaks and troughs.

A more advanced topic, for information.

The following section gives you information on how much work was waiting. It is hard to say what is good or bad: the data can look bad while all the performance goals are still being met.

How much work was waiting?

-----------------------DISTRIBUTION OF IN-READY WORK UNIT QUEUE--------------
 NUMBER OF              0    10   20   30   40   50   60   70   80   90   100
 WORK UNITS     (%)     |....|....|....|....|....|....|....|....|....|....|  
                                                                             
<=  N          26.3     >>>>>>>>>>>>>>                                       
 =  N +   1    12.9     >>>>>>>                                              
 =  N +   2    10.1     >>>>>>                                               
 =  N +   3    10.1     >>>>>>                                               
<=  N +   5    12.5     >>>>>>>                                              
<=  N +  10    11.0     >>>>>>                                               
<=  N +  15     6.0     >>>>                                                 
<=  N +  20     5.2     >>>                                                  
<=  N +  30     1.6     >                                                    
<=  N +  40     0.6     >                                                    
<=  N +  60     1.1     >                                                    
<=  N +  80     1.1     >                                                    
<=  N + 100     0.8     >                                                    
<=  N + 120     0.1     >                                                    
<=  N + 150     0.0                                                          
>   N + 150     0.0                                                          

N is the number of CPUs. I have 5 on my system.

The data is sampled. If the system is sampled 10 times a second then, every 0.1 of a second, RMF counts the number of tasks in the “ready to dispatch” queue, and increments the value in the appropriate bucket. For example, if there were 5 tasks executing and one task waiting, it increments the N + 1 element.
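
As an illustration, here is a minimal Java sketch of this style of sampling. The bucket boundaries come from the report above; the class and method names are mine, and the random driver is only there to make it runnable – this is the bucketing idea, not how RMF is actually written.

import java.util.concurrent.ThreadLocalRandom;

public class ReadyQueueSampler {

    // Bucket upper bounds relative to N, matching the report: <= N, = N+1, ..., <= N+150
    static final int[] BOUNDS = {0, 1, 2, 3, 5, 10, 15, 20, 30, 40, 60, 80, 100, 120, 150};
    static final long[] counts = new long[BOUNDS.length + 1];   // last bucket is "> N + 150"
    static final int N = 5;                                     // number of CPUs

    // Called on every sample (e.g. every 0.1 seconds):
    // classify the number of waiting tasks into a bucket.
    static void sample(int readyWorkUnits) {
        int waiting = readyWorkUnits - N;
        for (int i = 0; i < BOUNDS.length; i++) {
            if (waiting <= BOUNDS[i]) { counts[i]++; return; }
        }
        counts[BOUNDS.length]++;
    }

    public static void main(String[] args) {
        int samples = 3000;                                     // 5 minutes at 10 samples/second
        for (int t = 0; t < samples; t++)
            sample(ThreadLocalRandom.current().nextInt(0, 40)); // stand-in for the real queue depth
        for (int i = 0; i < counts.length; i++)                 // report each bucket as a percentage
            System.out.printf("bucket %2d: %5.1f%%%n", i, 100.0 * counts[i] / samples);
    }
}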

  • 26.3 % of the time, there were no tasks waiting for CPU.
  • 12.9 % of the time, there was 1 task waiting for CPU. See the N + 1 row in the data above (12.9, >>>>>>>).
  • 10.1 % of the time, there were 2 tasks waiting for CPU
  • 5.2 % of the time there were between 16 and 20 tasks waiting for CPU
  • 0.1 % of the time there were between 101 and 120 tasks waiting for CPU

Remember this could be work waiting for a CP, or for a zIIP.

If there are hundreds of tasks waiting for CPU you should make a note of it. It may not be a problem.

If there are under 50 tasks waiting for CPU, this should be OK.

On a busy system there will always be work waiting to run. Compare the pictures from a busy time and a not so busy time.

Is this important?

I once did some measurements with MQ on a machine with 16 processors; on average the engines were about 5% busy. A performance person from IBM said that my workload showed a shortage of CPU! 5% busy on 16 processors – was I really short of CPU?

My application received some data, and posted 30 threads to come and process the data. The first 15 threads could be dispatched because there were 15 unused CPUs. 15 threads had to wait.

This showed up in the above report: at the N + 15 line, tasks were waiting 20% of the time.

Out of the 30 tasks that were dispatched, one processed the work, the other 29 went back to sleep.

We changed the program to post no more threads than the number of CPUs (16) in the LPAR, and had a significant saving in CPU.
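
The fix is easy to sketch in Java (our code was not Java; the names here are illustrative): size the worker pool to the number of CPUs, and submit work to the pool, rather than waking a fixed 30 threads per burst of data.

import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

public class Dispatcher {

    // Cap the workers at the number of CPUs, so threads are not woken
    // just to queue for an engine and then go back to sleep.
    private static final int CPUS = Runtime.getRuntime().availableProcessors();
    private final ExecutorService workers = Executors.newFixedThreadPool(CPUS);

    // Called when data arrives: hand over one unit of work,
    // instead of posting 30 threads for one piece of data.
    public void dataArrived(Runnable processData) {
        workers.submit(processData);
    }
}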

One minute MVS performance – getting batch RMF reports

There is an introduction to getting RMF reports documented here.

You can display information about your SMF environment, using

D SMF,O

This tells you if you are using SMF data sets, or log streams (in the coupling facility), for the SMF data.

Copy the data from the SMF data sets

// SET SMFPDS=SYS1.S0W1.MAN3 
// SET SMFSDS=SYS1.S0W1.MAN4 
//* 
//SMFDUMP  EXEC PGM=IFASMFDP 
//DUMPINA  DD   DSN=&SMFPDS,DISP=SHR,AMP=('BUFSP=65536') 
//DUMPINB  DD   DSN=&SMFSDS,DISP=SHR,AMP=('BUFSP=65536') 
//* MPOUT  DD   DISP=(NEW,PASS),DSN=&RMF,SPACE=(CYL,(1,1)) 
//DUMPOUT  DD   DISP=SHR,DSN=IBMUSER.RMF,SPACE=(CYL,(1,1)) 
//SYSPRINT DD   SYSOUT=* 
//SYSIN  DD * 
  INDD(DUMPINA,OPTIONS(DUMP)) 
  INDD(DUMPINB,OPTIONS(DUMP)) 
  OUTDD(DUMPOUT,TYPE(70:79)) 
  DATE(2020316,2022284) 
  START(0930) 
  END(1700) 
/* 

This job copies the records from the “MAN” data sets, and writes them to DUMPOUT.

The RMF records with types 70 to 79 are copied, within the specified dates and start and end times.

Copy the data from a log stream.

SMF can write data to log streams, for example MQ records go to the MQ stream, and the RMF records go to the RMF Stream.

//SMFDUMP EXEC PGM=IFASMFDL
//DUMPOUT DD DSN=&TEMP,SPACE=(CYL,(10,10),RLSE),DISP=(NEW,PASS)
//SYSPRINT DD SYSOUT=*
//SYSIN DD *
DATE(2018004,2018012)
START(0900)
END(1900)
LSNAME(IFASMF.RMF,OPTIONS(ALL))
OUTDD(DUMPOUT,TYPE(70:79,116))
/*

This step writes the data to a temporary data set.

Sort the data

If you are processing the data from more than one LPAR you will need to sort the data. See here.

Format the RMF data

The RMF control statements are described here.

//POST EXEC PGM=ERBRMFPP 
//* use the following if using a temporary data set in same job.
//* MPFINPUT DD DISP=SHR,DSN=*.SMFDUMP.DUMPOUT 
//MFPINPUT DD DISP=SHR,DSN=IBMUSER.RMF 
//PPXSRPTS DD   SYSOUT=*,DCB=(LRECL=200,BLKSIZE=4000) 
//SYSPRINT DD   SYSOUT=* 
//SYSOUT   DD   SYSOUT=* 
//SYSIN DD * 
  SUMMARY(INT,TOT) 
  NODELTA 
  SYSOUT(H) 
  REPORTS(ALL,NOENQ,DEVICE(NOUNITR,NOCOMM)) 
SYSRPTS(WLMGL(RCPER)) 
SYSRPTS(WLMGL(SCPER)) 
/* 
/* SYSRPTS(WLMGL(RCPER)) 
/* SYSRPTS(WLMGL(SCPER,RCLASS,RCPER,SCLASS)) 

This takes the records from MFPINPUT, which could be a permanent data set, or a temporary data set passed from a previous job step.

You can have the output go to the spool (by default) or to preallocated data sets. See here.

One Minute MVS performance reports of interest are

REPORTS(CPU,DEV(DASD))

For WLM the reports

SYSRPTS(WLMGL(RCPER,SCPER))

Straight lines good – curved lines bad.

It was a great day. It was 0830, the sun was shining, and the view was amazing as I walked back the long way from the shops. There was a good hill en route: it started nice and steep, then gradually levelled off to flat. I sat on the convenient seat at the top of the hill to admire the view and get my breath back. The shape of the hill reminded me of a graph in a performance report I had seen a few days before; it started steeply at the bottom left of the graph, and levelled off at the top right, and the description was “so we can see it scales”. As I sat there enjoying the sun, I realised that it may scale – but it does not scale very well.

What does it mean to scale?

I remember, 40+ years ago, sitting in a meeting where the performance team were presenting status. They showed many graphs, one of which showed the transaction rate as they increased the workload. The graph started going up at an angle, nice and straight, and then levelled out when they achieved 100 transactions a second; they could do no more than this. They explained that this was because of a lock being held for 1/100 of a second while some work was done (holding the lock for 10 milliseconds per transaction serializes the work at a maximum of 100 transactions a second).

They worked with the developer who realised that they could use a counter instead of a lock, and use the hardware instruction to atomically increase and decrease the counter. As a result the graph was now a straight line. Round of applause etc, well done etc.

The performance guy then went on to say, so in order to find the next bottlenecks we will need a box that is even bigger and even faster!

I also remember this performance guy said his job was to make curved lines straight!

You can hide the truth with charts

People often use graphs to show information, but the graphs often tell the wrong story. For example, if you plot the total cost against transaction rate, you get what looks like a straight line.

Great – this looks linear!

However, if you plot the cost per transaction against the rate, you get a different picture.

In actual fact there is an increase in the cost per transaction as the rate increases. So good scaling – but not perfect. The cost-per-transaction line is linear – when it should have been flat!
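
A small worked example in Java, with made-up numbers, shows the effect: a fixed cost plus a slowly rising per-transaction cost still plots as a near-straight total-cost line, while the cost-per-transaction line reveals the creep.

public class CostPerTransaction {
    public static void main(String[] args) {
        // Hypothetical model: 100 units of fixed cost, plus a variable cost
        // per transaction that rises slightly with the rate.
        for (int rate = 100; rate <= 1000; rate += 100) {
            double total = 100 + rate * (20.0 + rate / 500.0); // looks linear when plotted
            double perTran = total / rate;                     // rises from ~21.2 to ~22.1
            System.out.printf("rate=%4d total=%8.0f cost/tran=%6.2f%n", rate, total, perTran);
        }
    }
}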

If you want to scare people – change the axis

It is the same data as the previous chart – but now the axis runs from 20 to 22.5 instead of 0 to 25, so any difference appears much larger.

So depending on what story you want to tell, and if you want to hide the truth, think about what data you present, and the axis you use.

Everyone wants a bigger box or two

On the subject of wanting a bigger box…

One of the senior testers presented to management about testing strategy. He said you can do testing in one of two ways:

  1. If we were careful, and picked the right set of results, we could show all tests working. We could quietly ignore the tests that failed.
  2. No matter what we did, we could not seriously break it. Even when we deliberately misconfigured it during testing it carried on running – slowly – but it carried on running.

The manager said “hmmm I’m always being asked for graphs to present status so the first would be better, but from a customer perspective the second would be better”.

The tester said “That’s the right answer, we now need two bigger boxes. One to run the workload, the other to drive it. We need to achieve higher throughput than our customers”.

One customer told me that their test box was larger than production, and they needed to prove that they could run at expected peak production throughput + 50% with no major degradations – they needed many very big boxes.

How to become a performance expert in 3 easy lessons

and many hard lessons.

I had emails from two people, with different experiences of doing performance on z/OS. One person has recently started, and is not sure what is involved. The other person has been doing lots of work with customers explaining that his product is not the cause of the performance problems.

I thought it might be interesting for people who might be tempted to work in performance, to see the route to becoming an expert.

What does “performance” mean?

Performance work covers many different areas, and once you are competent in one product area it is not too difficult to cover additional areas.

“Performance” covers

Making sure it scales close to linearly

If you double the throughput, the costs per transaction should be similar. As the throughput increases, the response time does not increase significantly. You can have many threads running concurrently.

If the workload has disk I/O then you need to have multiple threads, so while one task is waiting for I/O another task can be using the CPU.

You need a box with multiple CPUs to detect contention. If you have only one or two engines you may not detect concurrency issues.

Work to remove contention until you can drive the CPUs at 100% busy (and then you ask for a bigger box). If you cannot drive the box at 100% find out why, resolve it and repeat.

Reduce CPU

Once you have eliminated as much contention as possible, you need to investigate where the CPU is being used, and try to eliminate any hot spots. This might be

  • Change algorithms – use a hash table instead of a linked list.
  • Avoid unnecessary work. Do you really need to store intermediate values in a database?
  • Can you tune the services being used? For example, tune the database, or add an index to a table.
  • Rearrange the code, for example have the “hot code” located in the same few pages. Avoid lots of error handling code in the mainline code – branch out of the mainline to handle it.
  • Remove debug code, or put debug code within if (debug enabled) then { debug code } – see the sketch after this list.
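
A minimal Java sketch of the debug guard (the names are illustrative):

public class Trace {

    // Volatile, so a change made by an operator command is seen by all threads.
    static volatile boolean debugEnabled = false;

    static void debug(String format, Object... args) {
        // The formatting cost is only paid when debug is enabled.
        // Callers on very hot paths can test the flag themselves as well,
        // to avoid even building the argument list.
        if (debugEnabled) {
            System.err.printf(format, args);
        }
    }
}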

Work with customer problems

Understand which areas the users have problems with, and identify the “problem areas” where it takes a long time to find the cause.

Enhance the design

From your testing, and your experience with customer problems, propose improvements to help diagnose problems. For example (the first item is sketched in code after this list):

  • Capture the number, the average time, and the maximum time of database requests. Report this in statistics or in response to a display command.
  • Record the number of times a resource, such as a lock, was not available: the total count of requests, the number of blocked requests, and the time spent waiting. This code may never be executed, but if it is, you get useful information about the size of the problem.
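
A minimal Java sketch of the first item (the names are mine, not any product's actual code):

import java.util.concurrent.atomic.LongAccumulator;
import java.util.concurrent.atomic.LongAdder;

public class RequestStats {

    private final LongAdder count = new LongAdder();
    private final LongAdder totalMicros = new LongAdder();
    private final LongAccumulator maxMicros = new LongAccumulator(Math::max, 0);

    // Called around every database request.
    public void record(long durationMicros) {
        count.increment();
        totalMicros.add(durationMicros);
        maxMicros.accumulate(durationMicros);
    }

    // Called from the statistics interval, or a display command.
    public String display() {
        long n = count.sum();
        return String.format("requests=%d avg=%dus max=%dus",
            n, n == 0 ? 0 : totalMicros.sum() / n, maxMicros.get());
    }
}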

Provide useful information to the end user

These are often known as “performance reports”. It is easy to produce reports that people cannot use – I have done it many times. Reports with nice graphs are often not easy to use, as they do not match the reader's scenario.

You need to consider the questions the end users will have.

  • I want to run an ill-defined workload (I do not know all the details); how big a box do I need (how many CPUs) to support 1000 requests a second?
  • What should I look at to tell me if things are running well or not?
  • What are common symptoms, and what actions can I take to solve performance problems?
  • What things do we need to consider to make it run well? For example table layout, how many requests per commit, how often you need to sign on.

Performance roles

The roles below are typical of the sort of activities a performance person will do.

Run tests

The first task a person usually does when becoming a performance person is to run tests, and collect the data. This may involve writing scripts and tools so it can all be automated. For example, on z/OS you might use Netview to run scripts, capture responses, and take actions when there are problems. This could all be done using Rexx scripts in TSO, possibly using a REST interface.

Good automation will collect all of the key metrics into one place, for example a spreadsheet, so the analyst can simply press a button or two to display the data.

There may be a management report produced daily or weekly to show that performance overall has improved – or has not got worse.

Look at a component

You need to look at components within the whole environment, for example this week, look at the z/OSMF SDSF interface, next week the logon process.

You need to drive a high volume workload using this component, and you need to focus on the component. For example, with REST requests, 90% of the cost may be in the logon and establishing a session, which makes it hard to focus on the other 10%. Sign on once, and have an application that just issues requests to the component.

When I was testing MQ under CICS, an MQPUT took 50 microseconds, and the cost of starting the CICS transaction was 1000 microseconds. I changed the transaction to process 1000 messages, so the transaction now took about 50 milliseconds, and most of the work was in the MQPUT area, not in the CICS transaction overhead.

Capture the response time of the transactions and plot it over time. You should get a flat line. If the response increases over time, you might have a storage leak, and so it takes longer to get storage.

You may find it does not scale. Turning trace on can give an indication where the problem is. You often get function entry and exit trace, with time stamps, so you can post process the output to calculate the duration within the function. Trace often does not scale, so you cannot always believe the output.

You may want to instrument a private copy of the code. Obtain the time on entry and exit to the function, and across major calls to external requests. Calculate the duration of the calls, and add logic to say “if duration > 10 milliseconds then throw an exception”, or accumulate the data in a global control block. When I did this, I found the trace code was adding significant delays, and the root cause of the problem was an insignificant-looking line of code, which got an exclusive latch for an update!
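
A sketch of that sort of instrumentation in Java (the 10 millisecond threshold comes from the story above; everything else is illustrative):

import java.util.function.Supplier;

public class Timed {

    static final long THRESHOLD_NANOS = 10_000_000;   // 10 milliseconds

    // Wrap a call to an external service; fail loudly if it is slow,
    // so the stack trace points straight at the slow request.
    static <T> T timed(String what, Supplier<T> call) {
        long start = System.nanoTime();
        T result = call.get();
        long durationNanos = System.nanoTime() - start;
        if (durationNanos > THRESHOLD_NANOS) {
            throw new RuntimeException(what + " took " + durationNanos / 1_000_000 + " ms");
        }
        return result;
    }
}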

I added code to measure the average duration of file I/O, and output this in the statistics. This made solving some problems very easy – you have an I/O problem. See here, it is taking 10 ms to write a page of data!

Unless you are testing the startup times, you should allow the system under test to “warm up”, so the hardware cache is in a steady state, database tables are in memory etc.

I found it useful to warm it up, then take 5 sets of measurements, each of 1-5 minutes. When displaying the data, the results should all be similar. If not, you need to find out why. You should also run these tests once a week, and whenever you change a component, such as putting fixes onto your system, or changing the hardware. Some examples of things that can change your results:

  • Overnight the Operations Team run backups and cause a different disk response time
  • The order the LPARs were IPLed has changed. Last week your system had 6 CPUs in one book (so all very close to each other); this week your system has 3 CPUs in one book – and 3 CPUs in a different book – 1 metre away.
  • The network between your driving system and the test system has changed, or has a different load.

Usually the performance machines have their own dedicated hardware, processors, disks, connections to the disks, network.

Develop skills in other products

My background is MQ performance on z/OS. I had to learn about the performance characteristics of z/OS, DB2, TCP/IP, IMS, and understand the tools these products provide. Once you understand one trace, other traces are basically similar. The hard part is capturing the trace.

MQ passes messages from system to system. There were several problems where the “network was slow”. This meant we had to understand what was happening under the covers. Some good problems with easy fixes are below (a code sketch of both fixes follows the list).

  • There was a TCP performance “improvement” where one end would delay sending a packet for a few milliseconds, as it is more efficient to send one big packet rather than several smaller packets. This meant that every MQ message sent over the network had a couple of milliseconds' delay. The fix was easy – disable this feature.
  • TCP/IP by default uses small buffers (256 bytes). You can configure a session to have very large buffers, and tell it to automatically tune the best buffer size (up to MB-sized buffers).
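
Both fixes map onto standard socket options. A Java sketch (the buffer sizes are illustrative; on z/OS you would typically set the equivalents in the TCP/IP profile or the product configuration):

import java.net.Socket;
import java.net.SocketException;

public class TcpTuning {

    public static void tune(Socket socket) throws SocketException {
        // Disable the delayed-send "improvement" (the Nagle algorithm), so small
        // messages are sent immediately instead of waiting to fill a packet.
        socket.setTcpNoDelay(true);

        // Ask for large send/receive buffers; the OS may round or cap the values.
        socket.setSendBufferSize(1024 * 1024);
        socket.setReceiveBufferSize(1024 * 1024);
    }
}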

Work with customers on their performance problems

The work involves working on performance problems where none of your specially written instrumentation code is included. You may need to turn on the product trace for a few seconds, then turn it off, and then process the output. Many customers do not run with trace on, because of the overhead and major impact on throughput.

You can acquire the skills to talk to customers on the phone about their problems. It is very good to feed back what you heard: “Let me check what you just said … when you do … you get …”

Over time you will build up a list of questions to ask.

Once the problem has been resolved, consider what would have made it easier to find the root cause. Can you get development to put in some statistics, so next time this happens, you can tell the customer to check a value.

In the early days of MQ, we used to get many problems because the in-memory buffer was too small. Development put out a fix, so that every 10 minutes or so it would report if it had detected a buffer-full problem since the last message. After this fix was rolled out, we had no more of these problems.

There is no limit as to how far you can go

Once you have skills in one component you can apply these skills to other products or components. For example I spent some time looking at MQ on Linux so I could understand (and blog) on the performance data produced. (The performance data was “here are some numbers, we are not going to tell you what they mean”).

I’ve also been looking at Java performance, which led me to look at the zFS file system, and the statistics it provides (it provides some – but they are not very useful).

You can also go deep. I knew about z architecture instructions and how some are fast and some are slow. I attended a taskforce with lots of hardware people. I met the team leader for the “load instructions”, and found that the “load instruction” was not an instruction – it is more like a subroutine with logic, for example

  • Find which CPU currently “owns” this data in its cache, and go and get it
  • Lock the page
  • Go and get this value from another page
  • Add the two values
  • Unlock both the pages

The subroutine had to communicate with other CPUs in the LPAR, worry about its own CPU cache etc. Deep Stuff!

Once you know this sort of stuff, it helps you program; for example, it is better not to share a field if you do not have to. When a multi-threaded program uses a buffer to trace into, do not have one buffer which all the threads share, but give each thread its own buffer. This way the hardware will not be fighting over the buffer, and the data for each application can be kept on the same CPU as the program. This is obvious once you know!

Collect statistics at the thread level, and not at the global level. Merge them at display time. You know the reason why.
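A Java sketch of per-thread collection with merge-at-display (illustrative):

import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

public class PerThreadCounters {

    // Each thread updates only its own slot, so CPUs do not fight
    // over a shared counter's cache line on the hot path.
    private static final Map<Thread, long[]> all = new ConcurrentHashMap<>();
    private static final ThreadLocal<long[]> mine = ThreadLocal.withInitial(() -> {
        long[] slot = new long[1];
        all.put(Thread.currentThread(), slot);
        return slot;
    });

    public static void increment() { mine.get()[0]++; }

    // Merge only when someone asks; slightly stale values are fine for statistics.
    public static long total() {
        return all.values().stream().mapToLong(s -> s[0]).sum();
    }
}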

The hardware can start to execute instructions out of order – as long as they “commit” in the right order.

The z hardware has instrumentation which samples the executing system, and can tell you why instructions were delayed. For example

  • Data had to be obtained from the L2 cache on the chip
  • The instruction needed to be interpreted and added to the Translation Lookaside Buffer

This is a bit deep for many people, especially if they are at the level of using “printf” in their programs to display debug information.

“Me, with the brain the size of a planet ….”

This is a quote from Marvin the Paranoid Android in The Hitchhiker’s Guide to the Galaxy. With performance work you can go deep, or you can go wide, but you would need a bigger brain than I have to go deep and wide – but it is a fascinating area.

Some of the mysteries of Java shared classes

When Java executes a program it reads in the jar file, breaks it into the individual classes, and converts the byte codes into instructions; when executing, it may replace instructions with more efficient instructions (JITting). It can also convert the byte codes into instructions ahead of time, so-called Ahead Of Time (AOT) compilation.

With shared classes, the converted byte codes, any Jitted code, and any AOT code can be saved in a data space.

  • When the Java program runs a second time, it can reuse the data in the data space, avoiding the overhead of reading the jar file from the file system, and of converting the byte codes into instructions.
  • The data space can be hardened to a file, and restored to a data space, so can be used across system IPLs.

Using this reduced the start-up time of my program by over 20 seconds on my slow zPDT system. The default size of the cache is 16MB – one of my applications needed 100 MB, so most of the benefits of the shared classes could not be exploited if the defaults were used.

This blog post describes more information about this, and what tuning you can do.

Issuing commands to manage the shared classes cache

Commands to manage the shared classes cache are issued like

java -Xshareclasses:cacheDir=/tmp,name=client6,printStats

which can be done using JCL

// SET V='listAllCaches'
// SET V='printStats'
// SET C='/tmp'
// SET N='client6'
//S1 EXEC PGM=BPXBATCH,REGION=0M,
// PARM='SH java -Xshareclasses:cacheDir=&C,name=&N,verbose,&V'
//STDERR DD SYSOUT=*
//STDOUT DD SYSOUT=*

Enabling shared classes

You specify -Xshareclasses as a parameter to the program, for example on the command line or in a jvm.options file.

To use the shared classes capability you have to specify all of the parameters on one line, like

-Xshareclasses:verbose,name=client6,cacheDirPerm=0777,cacheDir=/tmp

Having it like

-Xshareclasses:name=client6,cacheDirPerm=0777,cacheDir=/tmp
-Xshareclasses:verbose

means the name, etc., all take their defaults. Only -Xshareclasses:verbose would be used.

Changing shared classes parameters

You can have more than one cache; you specify a name. You specify a directory where an image is stored when the cache is hardened to disk.

Some of the options, like name= and cacheDir=, are picked up when the JVM starts. Other parameters, like cacheDirPerm, are only used when the cache is (re-)created.

You can delete the cache in two ways.

Delete the cache from your Java program

When you are playing around, you can add reset to the end of the -Xshareclasses string to cause the cache to be deleted and recreated. This gives output like

JVMSHRC010I Shared cache “client6” is destroyed
JVMSHRC158I Created shared class cache “client6”
JVMSHRC166I Attached to cache “client6”, size=20971328 bytes

This was especially useful when tuning the storage allocations.

Delete the cache independently

java -Xshareclasses:cacheDir=/tmp,name=client6,destroy

How to allocate the size of the cache

You specify the storage allocations using -Xsc… (where sc stands for shared classes).

If you have -Xshareclasses:verbose… specified, then when the JVM shuts down you get

JVMSHRC168I Total shared class bytes read=11660. Total bytes stored=5815522
JVMSHRC818I Total unstored bytes due to the setting of shared cache soft max is 0.
Unstored AOT bytes due to the setting of -Xscmaxaot is 1139078.
Unstored JIT bytes due to the setting of -Xscmaxjitdata is 131832.

This shows the values of maxaot and maxjitdata were too small. They were

-Xscmx20m
-Xscmaxaot2k
-Xscmaxjitdata2k

When the values were big enough I got

JVMSHRC168I Total shared class bytes read=12960204. Total bytes stored=8885038
JVMSHRC818I Total unstored bytes due to the setting of shared cache soft max is 0.
Unstored AOT bytes due to the setting of -Xscmaxaot is 0.
Unstored JIT bytes due to the setting of -Xscmaxjitdata is 0.

How big a cache do I need?

If you use -Xshareclasses:verbose… it will display messages

for example

JVMSHRC166I Attached to cache “client6”, size=2096960 bytes
JVMSHRC269I The system does not support memory page protection

JVMSHRC096I Shared cache “client6” is full. Use -Xscmx to set cache size.
JVMSHRC168I Total shared class bytes read=77208. Total bytes stored=2038042

Message JVMSHRC096I Shared cache “client6” is full. Use -Xscmx to set cache size, tells you the cache is full – but gives no information about how big it needs to be.

You can use

java -Xshareclasses:cacheDir=/tmp,name=client6,printStats

to display statistics like

[-Xshareclasses persistent cache disabled]                                         
[-Xshareclasses verbose output enabled]                                            
JVMSHRC159I Opened shared class cache "client6"                                    
JVMSHRC166I Attached to cache "client6", size=2096960 bytes                        
JVMSHRC269I The system does not support memory page protection                     
JVMSHRC096I Shared cache "client6" is full. Use -Xscmx to set cache size.          
                                                                                   
Current statistics for cache "client6": 
cache size                           = 2096592                       
softmx bytes                         = 2096592                       
free bytes                           = 0                             
ROMClass bytes                       = 766804                        
AOT bytes                            = 6992                          
Reserved space for AOT bytes         = -1                            
Maximum space for AOT bytes          = 1048576                       
JIT data bytes                       = 212                           
Reserved space for JIT data bytes    = -1                            
Maximum space for JIT data bytes     = 1048576                       
Zip cache bytes                      = 1131864                       
Startup hint bytes                   = 0                             
Data bytes                           = 13904                         
Metadata bytes                       = 12976                         
Metadata % used                      = 0%                            
Class debug area size                = 163840                        
Class debug area used bytes          = 119194                        
Class debug area % used              = 72%

Cache is 100% full  
                                                                             

This shows the cache is 100% full, and how much space is used for AOT and JIT. I had the default -Xscmx value of almost 16MB. I made it 200MB and this was large enough.

I could not find a way of getting my program to issue printStats.

How do I harden the cache?

You can use the

java -Xshareclasses:cacheDir=/tmp,name=zosmf,verbose,snapshotCache

command to create the cache on disk. Afterwards the listAllCaches command gave

Cache name  level         cache-type      feature 
client6     Java8 64-bit  non-persistent  cr
client6     Java8 64-bit  snapshot        cr

This shows the non-persistent data space, and the snapshot file.

You can use restoreFromSnapshot to restore from the file to the data space, before you start your Java program. You would typically do this after an IPL.

How can I tell what is going on and if shared classes is being used?

The Java option “-verbose:dynload,class”

reports on the

  • dynamic loading of the files, and processing them,
  • what classes are being processed.

For example

<Loaded java/lang/reflect/AnnotatedElement from /Z24A/usr/lpp/java/J8.0_64/lib/rt.jar>
< Class size 3416; ROM size 2672; debug size 0>
< Read time 1196 usec; Load time 330 usec; Translate time 1541 usec>
class load: java/lang/reflect/AnnotatedElement from: /Z24A/usr/lpp/java/J8.0_64/lib/rt.jar
class load: java/lang/reflect/GenericDeclaration from: /Z24A/usr/lpp/java/J8.0_64/lib/rt.jar

dynload gave

<Loaded java/lang/reflect/AnnotatedElement from /Z24A/usr/lpp/java/J8.0_64/lib/rt.jar>
< Class size 3416; ROM size 2672; debug size 0>
< Read time 1196 usec; Load time 330 usec; Translate time 1541 usec>

This tells you a jar file was read from the file system, and how long it took to process it.

class gave

class load: java/lang/reflect/AnnotatedElement from: /Z24A/usr/lpp/java/J8.0_64/lib/rt.jar
class load: java/lang/reflect/GenericDeclaration from: /Z24A/usr/lpp/java/J8.0_64/lib/rt.jar

This shows two classes were extracted from the jar file.

In a perfect system you will get the class load entries, but not <Loaded….

Even when I had a very large cache size, I still got dynload entries. These tended to be loading class files rather than jar files.

For example there was a dynload entry for com/ibm/tcp/ipsec/CaApplicationProperties. This was the file /usr/lpp/zosmf/installableApps/izuCA.ear/izuCA.war/WEB-INF/classes/com/ibm/tcp/ipsec/CaApplicationProperties.class

If you can make these into a .jar file you may get better performance. (But you may not get better performance, as it may take more time to load a large jar file).

I also noticed that there was dynload for com/ibm/xml/crypto/IBMXMLCryptoProvider which is in /Z24A/usr/lpp/java/J8.0_64/lib/ext/ibmxmlcrypto.jar, so shared classes has some deeper mysteries!

What happens if the .jar file changes?

As part of the class load, Java checks the signature of the file on disk against the signature in the data space. If they are different, the data space will be updated.

Looking at the performance of Liberty products and what you can do with the information

The Liberty Web Server on z/OS can produce information on the performance of “transactions”. It may not be as you expect, and it may not be worth collecting it. I started looking into this to see why certain transactions were slow on z/OSMF.

This blog post covers

  • How does the web server work?
  • How can I see what is going on?
  • Access logs
  • SMF 120
  • WLM
  • Do I want to use this in production?

How does it work?

A typical Web Server transaction might use a database to debit your account, and to credit my account. I think of this as a fat transaction, because the application does a lot of work.

z/OSMF runs on top of Liberty, allows you to run SDSF and ISPF within a web browser, and has tools to help automate work. You can also use REST services, so you can send a self-contained HTTP request to z/OSMF, for example to request information about data sets, or about jobs running on the system. Both of these send a request to z/OSMF, which might send a request to a TSO userid, get the response back, and pass the response back to the requester. I think of this as a thin transaction, because the application running on the web server is mainly acting as a router or broker. What the end user sees as a “transaction” may be many micro-services – each of which is a REST request.

How can I see what is going on?

You can see what “transactions” or requests have been issued from

  • the access log – a flat file of requests
  • SMF 120 records
  • WLM reports.

Access logs

In the <httpEndpoint … accessLoggingRef="accessLogging"…> element, the accessLogging value is a reference to an <httpAccessLogging…> statement. This creates a flat file containing a record of all inbound HTTP client requests. If you have multiple httpEndpoint statements, you can have multiple accessLogging files. You can control the number and size of the files.
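
For example, a minimal server.xml sketch (I have taken the filepath, maxFiles and maxFileSize attribute names from my memory of the Liberty documentation – check them before use; the values are illustrative):

<httpAccessLogging id="accessLogging"
                   filepath="${server.output.dir}/logs/http_access.log"
                   maxFiles="4" maxFileSize="20"/>

<httpEndpoint id="defaultHttpEndpoint" host="*" httpsPort="10443"
              accessLoggingRef="accessLogging"/>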

This has information with fields like

  • 10.1.0.2 – the IP address the request came in from
  • COLIN – the userid
  • {16/May/2021:14:47:40 +0000} – the date, time and time zone of the event
  • PUT /zosmf/tsoApp/tso/ping/COLIN-69-aabuaaac – the request
  • HTTP/1.1 – the type of HTTP request
  • 200 – the HTTP return code
  • 78 – the number of bytes sent back.

You can have multiple httpEndpoint definitions, for example specifying different IP address and port combinations. These definitions can point to a shared or individual httpAccessLogging statement, and so can share (or not) the flat files. This allows you to specify that one port uses access logging, and another port does not.

The SMF 120 records.

The request logging writes a record to SMF for each request.

<server>
<featureManager>
<feature>zosRequestLogging-1.0</feature>
</featureManager>
</server>

I have not been able to find a formatter for these records from IBM, so I have written my own; it is available on GitHub.

It produces output like

Server version      :2                                                              
System              :S0W1                                                           
Syplex              :ADCDPL                                                         
Server job id       :STC02771                                                       
Server job name     :IZUSVR1                                                        
config dir          :/var/zosmf/configuration/servers/zosmfServer/                  
codelevel           :19.0.0.3  
                                                     
Start time          :2021/05/16 14:42:33.955288                                     
Stop  time          :2021/05/16 14:42:34.040698                                     
Duration in secs    : 0.085410                                                      
WLMTRan             :ZCI4                                                           
Enclave CPU time    : 0.042570                                                      
Enclave ZIIP time   : 0.042570                                                      
WLM Resp time ratio :10.000000                                                      
userid long         :COLIN                                                          
URI                 :/zosmf/IzuUICommon/externalfiles/sdsf/index.html               
CPU Used Total      : 0.040111                                                      
CPU Used on CP      : 0.000000                                                      
CPU Delta Z**P      : 0.040111 
                                                     
WLM Classify type   :URI                                                            
WLM Classify by     :/zosmf/IzuUICommon/externalfiles/sdsf/index.html
WLM Classify type   :Target Host                            
WLM Classify by     :10.1.1.2                               
WLM Classify type   :Target Port                            
WLM Classify by     :10443   
                               
Response bytes      :7003                                   

Origin port         :57706                                  
Origin              :10.1.0.2                                              

This has information about

  • The Liberty instance, where it is running and configuration information
  • Information about the transaction, start time, stop time, duration and CPU time used
  • WLM information about the request. It was classified as
    • URI:…index.html
    • Target Host:10.1.1.2
    • Target port:10443
  • 7003 bytes were sent to the requester
  • the request came from 10.1.0.2 port 57706

The SMF formatter program also summarises the records and this shows there are records for

  • /zosmf/IzuUICommon/externalfiles/sdsf/js/ui/theme/images/zosXcfsCommand_enabled_30x30.png
  • /zosmf/IzuUICommon/externalfiles/sdsf/js/ui/theme/sdsf.css
  • /zosmf/IzuUICommon/externalfiles/sdsf/sdsf.js
  • /zosmf/IzuUICommon/externalfiles/sdsf/IzugDojoCommon.js
  • /zosmf/IzuUICommon/persistence/user/com.ibm.zos.sdsf/JOBS/JOBS

This shows there is a record for each part of the web page: icons, JavaScript files, and style sheets.

Starting up SDSF within z/OSMF created 150 SMF records! Refreshing the data just created 1 SMF record. The overhead of creating all the SMF records for one “business transaction” may be too high for production use.

As far as I can tell this configuration is server wide. You cannot enable it for a specific IP address and port combination.

WLM reports

Much of the data produced in the records above can be passed to WLM. This can be used to give threads appropriate priorities, and can produce reports.

You enable WLM support using

<featureManager>
  <feature>zosWlm-1.0</feature>
</featureManager>
<zosWorkloadManager collectionName="MOPZCET"/>
<wlmClassification>
  <httpClassification transactionClass="ZCI6" resource="/zosmf/desktop/"/>
  <httpClassification transactionClass="ZCI1" resource="/**/sdsf/**/*"/>
  <httpClassification transactionClass="ZCI3" resource="/zosmf/webispf//"/>
  <httpClassification transactionClass="ZCI4" resource="/**/*"/>
  <httpClassification transactionClass="ZCI2" resource="IZUGHTTP"/>
  <httpClassification transactionClass="ZCI5" port="10443"/>
</wlmClassification>

Here the httpClassification maps a z/OSMF resource to an 8-character transaction class. The records are processed from the top until there is a match. For example, port=10443 would not be used, because the generic resource="/**/*" definition matches first.

These need to be mapped into the WLM configuration…

WLM configuration

You can configure WLM through the WLM configuration panels.

option 6. Classification rules.

option 3 to modify CB(Component broker)

          -------Qualifier--------                 -------Class--------  
 Action   Type      Name     Start                  Service     Report   
                                          DEFAULTS: STCMDM      TCI2
  ____  1 CN        MOPZCET  ___                    ________    THRU
  ____  2   TC        ZCI1     ___                  SCI1        RCI1
  ____  2   TC        ZCI2     ___                  SCI2        RCI2
  ____  2   TC        ZCI3     ___                  SCI3        RCI3
  ____  2   TC        ZCI4     ___                  SCI4        RCI4
  ____  2   TC        ZCI5     ___                  SCI5        RCI5
  ____  2   TC        ZCI6     ___                  SCI6        RCI6
  ____  2   TC        THRU     ___                  THRU        THRU

For the Type=CN, Name=MOPZCET line: this value ties up with the <zosWorkloadManager collectionName="MOPZCET"/> above. Type=CN is for CollectionName.
For the subtype 2 TC Name ZCI4 line: this is for TransactionClass, which ties up with an httpClassification transactionClass statement.

The service classes SCI* control the priority of the work; the report classes RCI* allow you to produce a report by this name.

If you make a change to the WLM configuration you can save it from the front page of the WLM configuration panels, Utilities-> 1. Install definition, and activate it using Utilities-> 3. Activate service policy.

If you change the statements in Liberty or z/OSMF I believe you have to restart the server.

How to capture the data

The data is written out to SMF records on a timer, or on the SMF end-of-interval broadcast. If you change the interval, SMF sends an end-of-interval broadcast and writes the records to SMF. For example, on my dedicated test system I use the operator command setsmf intval(10) to change the interval to 10 minutes. After the next test, I use setsmf intval(15), etc.

The data is kept in SMF buffers, and you may have to wait for a time before the data is written out to external storage. If SMF data is being produced on a regular basis, it will be flushed out.

How to report the data

I copy the SMF data to a temporary data set

//IBMURMF  JOB 1,MSGCLASS=H,RESTART=POST 
//* DUMP THE SMF DATASETS 
// SET SMFPDS=SYS1.S0W1.MAN1 
// SET SMFSDS=SYS1.S0W1.MAN2 
//* 
//SMFDUMP  EXEC PGM=IFASMFDP 
//DUMPINA  DD   DSN=&SMFPDS,DISP=SHR,AMP=('BUFSP=65536') 
//DUMPINB  DD   DSN=&SMFSDS,DISP=SHR,AMP=('BUFSP=65536') 
//* MPOUT  DD   DISP=(NEW,PASS),DSN=&RMF,SPACE=(CYL,(1,1)) 
//DUMPOUT  DD   DISP=SHR,DSN=IBMUSER.RMF,SPACE=(CYL,(1,1)) 
//SYSPRINT DD   SYSOUT=* 
//SYSIN  DD * 
  INDD(DUMPINA,OPTIONS(DUMP)) 
  INDD(DUMPINB,OPTIONS(DUMP)) 
  OUTDD(DUMPOUT,TYPE(70:79)) 
  DATE(2020316,2021284) 
  START(1539) 
  END(2359) 
/* 

and display the report classes

//POST EXEC PGM=ERBRMFPP 
//* PINPUT DD DISP=SHR,DSN=*.SMFDUMP.DUMPOUT 
//MFPINPUT DD DISP=SHR,DSN=IBMUSER.RMF 
//SYSIN DD * 
SYSRPTS(WLMGL(RCPER)) 
/* 

The output in //PPXSRPTS was

REPORT CLASS=RCI4                                                      
-TRANSACTIONS--  TRANS-TIME HHH.MM.SS.FFFFFF  
AVG        0.01  ACTUAL                69449  
MPL        0.01  EXECUTION             68780  
ENDED       160  QUEUED                  668  
END/S      0.13  R/S AFFIN                 0  
#SWAPS        0  INELIGIBLE                0  
EXCTD         0  CONVERSION                0  
                 STD DEV              270428  
                                              
----SERVICE----   SERVICE TIME  ---APPL %---  
IOC           0   CPU    2.977  CP      0.05  
CPU        2551   SRB    0.000  IIPCP   0.05  
MSO           0   RCT    0.000  IIP     0.19  
SRB           0   IIT    0.000  AAPCP   0.00  
TOT        2551   HST    0.000  AAP      N/A  
/SEC          2   IIP    2.338                
ABSRPTN     232   AAP      N/A                
TRX SERV    232                               
                                              

There were 160 “transactions” within the time period, or 0.13 per second. The average response time was 69449 microseconds, with a standard deviation of 270428. This is a very wide standard deviation, so there was a mixture of long and short response times.

The total CPU for this report class was 2.977 seconds of CPU, or 0.019 seconds per “transaction”.

Do I want to use this in production?

I think that the amount of data produced is manageable for a low-usage system. For a production environment where there is a lot of activity, the amount of data produced, and the cost of producing it, may be excessive. This could be an example of the cost of collecting the data being much larger than the cost of running the workload.

As z/OSMF acts as a broker, passing requests between end points you may wish just to use your existing reporting structures.

I used z/OSMF to display SDSF data, and set up an ISPF session within the web browser. This created two TSO tasks for my userid. If you include my traditional TSO session and the two from z/OSMF, that is three TSO sessions in total running on the LPAR.

Why do they ship Java products on z/OS with the handbrake on? And how to take the brake off.

I noticed that it takes seconds to start MQ on my little z/OS machine, but minutes (feels like days) to start anything with the Liberty web server. This includes MQWEB, z/OSMF, and z/OS Connect. I mentioned this to an IBM colleague, who asked if I was using Java shared classes. These get loaded into z/OS shared pages.

When I implemented it, my Liberty server came up in half the time!

I found this blog post which was very helpful, and showed me where to look for more information.  I subsequently found this document (from 2006!)

The kindergarten overview of how Java works.

  • You start with a program written in the Java language.
  • When you run this, Java converts it into byte codes
  • These byte codes get converted to native instructions – so a byte code “push onto the stack” may become eight 390 Assembler instructions.
  • This code can be optimised, for example code which is executed frequently can have the assembler instructions rewritten to go faster.  It might put code inline instead of out in a subroutine.
  • If you are using Java shared classes, this code can be written out and reused by other applications, or, if you restart the server, it can reuse what it created before. Reusing the shared classes means that programs benefit because the byte codes have already been converted into native code, and optimisations have been done on the hot code.

What happens on z/OS?

By default, z/OS writes the code to virtual memory and does not save anything to disk. If you restart your Java application within the same IPL, it can exploit the shared classes which have been converted to native code and optimised – great, good design. I found the second time I started the web server it took half the time. However, I IPL once a day, and start my web server once a day, so I do not benefit from having it start faster a second time – as I only start it once per session. By default, when you re-IPL, the shared classes code is discarded, and so next time you need the code it has to be converted to native instructions again, and it loses any optimisation which had been done.

What is the solution?

There are two easy steps:

  1. Tell Java to write the information from memory to disk – to take a snapshot.
  2. After IPL tell Java to load memory from the disk image – to restore a snapshot.

It is as simple as that.

Background.

It is all to do with the Java -Xshareclasses option.

With your application, you tell Java where to store information about the shared classes. It defaults to cacheDir=/tmp/ and name=javasharedresources.

In my jvm.options I overrode the defaults and specified

-Xshareclasses:nonFatal 
-Xshareclasses:groupAccess
-Xshareclasses:cacheDirPerm=0777
-Xshareclasses:cacheDir=/tmp,name=mqweb

If you give each application a name (such as mqweb)  you can isolate the cache to an application and not disrupt another JVM if you change the cache.  For example if you restore from a snapshot, only users of that “name” will be affected.

List what is in the cache

You can use the USS command,

java -Xshareclasses:cacheDir=/tmp/,listAllCaches

I used a batch job to do the same thing.

//IBMJAVA  JOB  1 
// SET V='listAllCaches' 
// SET C='/tmp/' 
//S1       EXEC PGM=BPXBATCH,REGION=0M, 
// PARM='SH java -Xshareclasses:cacheDir=&C,&V' 
//STDERR   DD   SYSOUT=* 
//STDOUT   DD   SYSOUT=*            

The output below shows the cache name is mqweb. Once you have created a snapshot, there is an entry for it too.

Listing all caches in cacheDir /tmp/                                                                          
                                                                                                              
Cache name       level         cache-type      feature         OS shmid       OS semid 
mqweb            Java8 64-bit  non-persistent  cr              8197           4101 

For MQWEB the default parameters are -Xshareclasses:cacheDir=/u/mqweb/servers/.classCache,name=liberty-%u, where /u/mqweb is the WLP user directory (where my parameters are defined), and %u is the userid the server is running under – so in my case the cache name is liberty-START1.

When I had /u/mqweb/servers/.classCache, the total command line was too long for BPXBATCH. (Putting it into STDPARM gave me IEC020I 001-4 on the instream STDPARM, because the resolved line was greater than 80 characters.) I resolved this by adding -Xshareclasses:cacheDir=/u/mqweb,name=cache to the jvm.options file.

To take a snapshot


//IBMJAVA  JOB  1 
// SET C='/tmp/' 
// SET N='mqweb' 
// SET V='restoreFromSnapshot' 
// SET V='listAllCaches' 
// SET V='snapshotCache' 
//S1       EXEC PGM=BPXBATCH,REGION=0M, 
// PARM='SH java -Xshareclasses:cacheDir=&C,name=&N,&V' 
//STDERR   DD   SYSOUT=* 
//STDOUT   DD   SYSOUT=* 
//

This job took a few seconds to run.

I believe you have to take the snapshot while your Java application is executing – but I do not know for certain.

Restore a snapshot

To restore a snapshot just use restoreFromSnapshot in the above JCL. This took a few seconds to run. 

How to use it.

If you put the restoreFromSnapshot JCL at the start of the web server job, it will preload the cache whenever you use your server.

If you take a snapshot every day before shutting down your server, you will get a copy with the latest optimisations.  If you do not take a new snapshot it continues to use the old one.

If you want to not use the shared cache you can get rid of it using the command destroySnapshot.

Is my cache big enough?

If you use the printStats request you get information like

Current statistics for cache "mqweb":                                                
...                                                                                     
cache size                           = 104857040                                     
softmx bytes                         = 104857040                                     
free bytes                           = 70294788 
...
Cache is 32% full                                     
                                                      
Cache is accessible to current user = true                                                 

The documentation says

When you specify -Xshareclasses without any parameters and without specifying either the -Xscmx or -XX:SharedCacheHardLimit options, a shared classes cache is created with a default size, as follows:

  • For 64-bit platforms, the default size is 300 MB, with a “soft” maximum limit for the initial size of the cache (-Xscmx) of 64MB, …

I had specified -Xscmx100m  which matches the value reported.

What is in the cache?

You can use the printAllStats command.  This displays information like

Classpath

1: 0x00000200259F279C CLASSPATH
/usr/lpp/java/J8.0_64/lib/s390x/compressedrefs/jclSC180/vm.jar
/usr/lpp/java/J8.0_64/lib/se-service.jar
/usr/lpp/java/J8.0_64/lib/math.jar

Methods for a class
  • 0x00000200259F24A4 ROMCLASS: java/util/HashMap at 0x000002001FF7AEB8.
  • ROMMETHOD: size Signature: ()I Address: 0x000002001FF7BA88
  • ROMMETHOD: put Signature: (Ljava/lang/Object;Ljava/lang/Object;)Ljava/lang/Object; Address: 0x000002001FF7BC50

This shows

  • there is a class HashMap. 
  • It has a method size() with no parameters, returning an int. It is at …. in memory
  • There is another method put(Object o1, Object o2)  returning an Object.  It is at … in memory
Other stuff

There are sections with JITHINTS and other performance related data.

 

Taking the brakes off ZFS on z/OS – move it to OMVS

From z/OS 2.2 there is a performance advantage in running the zFS file system as part of OMVS, rather than in its own address space. The IBM documentation says: “When running zFS in the OMVS address space, each file system vnode operation (such as creating a directory entry, removing a directory entry, or reading from a file) will have better overall performance. Each operation will take the same amount of time while inside zFS itself. The performance benefit occurs because z/OS UNIX can call zFS for each operation in a more efficient manner.” This is relevant when your application is doing a lot of file I/O – for example using a web server.

This move is not well documented – but it is really easy! It is mentioned here. Instructions are hidden in the installation instructions here.

Before I started

The IBM doc says You can determine if zFS is in its own address space by issuing D OMVS,PFS. If the output shows an ASNAME value, zFS is running as a colony address space.


OMVS     0010 ACTIVE             OMVS=(00,01,BP,IZ,RZ,BB)                
PFS CONFIGURATION INFORMATION                                            
 PFS TYPE   ENTRY      ASNAME    DESC      ST    START/EXIT TIME         
 ...   
  ZFS       IOEFSCM    ZFS       LOCAL     A     2021/02/17 17.35.06 

The steps I took…

  1. I added KERNELSTACKS(ABOVE) to USER.Z24A.PARMLIB(BPXPRM00).
  2. Being ultra-cautious, I re-IPLed.
  3. The documentation talks about putting an IOEZPRM DD in OMVS, then goes on to say “As the preferred alternative to the IOEZPRM DDNAME specification, delete the IOEZPRM DDNAME and use the IOEPRMxx parmlib member”. So I did not change the OMVS proc. When I re-IPLed it worked, and I got the message IOEZ00374I No IOEZPRM DD specified in OMVS proc. Parmlib search being used.
  4. I edited USER.Z24A.PARMLIB(BPXPRM00) and removed the ASNAME in  FILESYSTYPE TYPE(ZFS) ENTRYPOINT(IOEFSCM)ASNAME(ZFS) Well I actually made a copy of the original line and put it between /* and */, then deleted the text.
  5. I reipled.
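After the edit, the relevant lines in my BPXPRM00 member looked like this (a sketch of the change in step 4 – the commented-out line is the original, kept in case I needed to back it out):

/* FILESYSTYPE TYPE(ZFS) ENTRYPOINT(IOEFSCM) ASNAME(ZFS) */
FILESYSTYPE TYPE(ZFS) ENTRYPOINT(IOEFSCM)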

Afterwards

The D OMVS,PFS command now gives N/A instead of the address space name.

OMVS     0010 ACTIVE             OMVS=(00,01,BP,IZ,RZ,BB)                
PFS CONFIGURATION INFORMATION                                         
 PFS TYPE   ENTRY      ASNAME    DESC      ST    START/EXIT TIME         
...
  ZFS       IOEFSCM    N/A       LOCAL     A     2021/02/17 17.55.47  

Easy!

The hardest part was making sure I had an IPLable SARES1 in case I got it wrong!

Issuing commands…

I used to issue commands like f zfs,query,all. Now that the ZFS address space no longer exists, you need to use f omvs,pfs=zfs,query,all instead.

What should I monitor for MQ on z/OS – buffer pool

For monitoring MQ on z/OS, there are a couple of key buffer pool metrics you need to keep an eye on as you sit watching the monitoring screen.

A quick overview of MQ buffer pools

An inefficient queue manager could have all messages stored on disk, in page sets.  An efficient queue manager caches hot pages in memory so they can be accessed without doing any I/O.

The MQ Buffer Manager component does this caching, by using buffers  (so no surprise there).

A simplified view of the operation is as follows

  1. Messages are broken down into 4KB pages.
  2. When getting the contents of a page-set page, if the page is not already in a buffer, the buffer manager reads it from disk into a buffer (see the sketch after this list).
  3. If a page is requested but its contents are not required (for example, it is about to be overwritten as part of a put), the page does not need to be read from disk.
  4. If a page is updated, for example by a new message, or because a field in the message is updated during get processing, the page is not usually written to disk immediately.   The write to disk is deferred (using a process called the Deferred Write Process – another non-surprise in the naming convention).  This has the advantage that there can be many updates to a page before it is written to disk.
  5. Any buffer page which has not been changed, and is not currently being used, is a free page.
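As a conceptual sketch of items 2 and 3 – this is my own Java illustration of the idea, not MQ's actual implementation, and all the names are invented:

import java.util.HashMap;
import java.util.Map;

// Conceptual sketch only - not MQ's real code.
// It illustrates "get page with contents" versus "get new page".
public class BufferManagerSketch {
    private final Map<Long, byte[]> buffers = new HashMap<>();

    // The caller needs the current contents of the page:
    // if it is not in a buffer, it is read from the page set.
    byte[] getPageWithContents(long pageNumber) {
        return buffers.computeIfAbsent(pageNumber, this::readFromPageSet);
    }

    // The page is about to be overwritten, so the old contents are
    // not needed and no read I/O is done - just allocate a buffer.
    byte[] getNewPage(long pageNumber) {
        return buffers.computeIfAbsent(pageNumber, n -> new byte[4096]);
    }

    private byte[] readFromPageSet(long pageNumber) {
        return new byte[4096]; // in the real product this is a read I/O
    }

    public static void main(String[] args) {
        BufferManagerSketch bm = new BufferManagerSketch();
        bm.getNewPage(1);          // e.g. a put to an empty page: no read I/O
        bm.getPageWithContents(1); // now in the buffer pool: still no read I/O
    }
}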

If the system crashes, non persistent messages are thrown away, and persistent messages can be rebuilt from the log.

To reduce the restart time, pages containing persistent data are written out to the page set periodically.   This is driven by the log data set filling up, which causes a checkpoint.  Updates which have been around for more than two checkpoints are written to disk.  During restart, the page set indicates how far back in the log recovery needs to go.

In both cases – a checkpoint, or the buffer pool filling up (less than 15% free, that is, more than 85% in use) – once a page has been successfully written to the page set, the buffer is marked as free.

Pages for non-persistent messages can be written out to disk.

If the buffer pool is critically short of free buffers (less than 5% free), pages are written to the page set immediately rather than via the deferred write process.  This allows some application work to be done while the buffer manager is trying to make more free pages available.

What is typical behaviour?

The buffer pool is working well.

When a message is got, the buffer is already in the buffer pool so there is no I/O to read from the page set.

While the buffer pool is less than 85% full (has more than 15% free pages), there is only periodic I/O to the page set, because pages with persistent data are written to the page set at checkpoints.

The buffer pool fills up and has more than 85% in-use pages.

This can occur if the number and size of the messages being processed is bigger than the size of the buffer pool. This could be a lot of messages in a unit of work,  big messages using lots of pages, or lots of transactions putting and getting messages.  It can also occur when there are lots of puts, but no gets.

If the buffer pool has between 85% and 95% in-use pages (between 15% and 5% free pages), the buffer manager is working hard to keep free pages available.

There will be intermittent I/O at checkpoints, and steady I/O as the buffer manager writes pages to the page set.

If messages are being read from disk, there will be read activity from the pageset, but the buffer pool page can be reused as soon as the data has been copied from the buffer pool page.

The buffer pool has less than 5% free pages.

The buffer manager is in overdrive.  It is working very hard to keep free pages in the buffer pool.   There will be pages written out to the page set as it tries to increase the number of free pages.  Gets may require reads from the page set.  All of this I/O can cause contention: the page set I/Os slow down, and so MQ API requests using this buffer pool slow down.

What should be monitored

Most cars these days have a “low fuel light” and a “take me to a garage” light.  For monitoring we can provide similar.

Monitor the % buffer pool full.

  • If it is below 85%, things are OK.
  • If it is between 85% and 95%, it needs to be monitored; it may be "business as usual".
  • If it is >= 95%, this needs to be alerted.  It can be caused by applications or channels not processing messages.
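These thresholds map naturally onto a traffic-light check in a monitoring tool.  A minimal sketch – the method name and messages are my own illustration, not from any MQ API:

public class BufferPoolStatus {
    // Thresholds from the text: below 85% in-use is fine,
    // 85-95% needs watching, 95% and above needs an alert.
    static String status(double percentInUse) {
        if (percentInUse < 85.0) return "green - OK";
        if (percentInUse < 95.0) return "amber - monitor";
        return "red - alert";
    }

    public static void main(String[] args) {
        System.out.println(status(60.0)); // green - OK
        System.out.println(status(90.0)); // amber - monitor
        System.out.println(status(97.0)); // red - alert
    }
}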

Monitoring the number of pages written does not give much information.

It could be caused by a checkpoint activity, or because the buffer pool is filling up.

Monitoring the number of pages read from the page set can provide some insight.

If you compare today with this time last week you can check the profile is similar.

If the buffer pool is below 85% used,

  • Messages being got are typically in the buffer pool so there is no read I/O.
  • If there is read I/O this could be for messages which were not in the buffer pool – for example reading the queue after a queue manager restart.

If the buffer pool is more than 85% and less than 95% in-use, this can be caused by a large message workload coming in; MQ is being "elastic" and queuing the messages.  Even short-lived messages may be read from disk.  The number of read I/Os gives an indication of throughput.  Compare this with the previous week to see if the profile is similar.

If the buffer pool is more than 95% in-use, this will have an impact on performance: every page processed is likely to need I/O to the page set, and the I/O response times will be elongated due to the contention.

What to do

You may want “operators notes” either on paper or online which describe the expected status of the buffer pools on individual queue managers.

  1. PROD QMPR
    1. BP 1 green less than 85% busy
    2. BP 2 green less than 85% busy
    3. BP 3 green, except for Friday night when it goes amber, with a read I/O rate of 6000 I/Os per minute.
  2. TEST QMTE
    1. BP 1 green less than 85% busy
    2. BP 2 green less than 85% busy
    3. BP 3 usually amber – used for bulk data transfer

What do the buffer statistics mean?

There are statistics on buffer pool usage:

  1. Buffer pool number.
  2. Size of the buffer pool at the time the data was produced.
  3. Low – the lowest number of free pages in the interval.  100 * (Size – Low)/Size gives you the % full (see the sketch after this list).
  4. Now – the current number of free pages in the interval.
  5. Getp – the number of requests ‘get page with contents’.  If the page is not in the buffer pool then it is read from the page set.
  6. Getn.   A new page is needed.   The contents are not relevant as it is about to be overwritten.  If the page is not in the buffer pool, just allocate a buffer, and do not read from the page set.
  7. STW – set write intent.  This means the page was got for update; I have not found much use for this statistic.  A page is got for update when, for example:
    1. A put wants to put part of a message on the page
    2. A get is being done and it wants to set the “message has been got” flag.
    3. The message has been got, and so pointers to a page need to be updated.
  8. RIO – the number of read requests to the page set.  If this is greater than zero, then either:
    1. The request is for messages which have not been used since the queue manager started
    2. The buffer pool had reached 85%, pages had been moved out to the page set, and the buffer has been reused.
  9. WIO – the number of write I/Os that were done.  These writes could be due to a checkpoint, or because the buffer pool filled up.
  10. TPW – the total pages written, a measure of how busy the buffer pool was.   Again, the writes could be due to a checkpoint, or because the buffer pool filled up.
  11. IMW – immediate write.  I have not used this value; sometimes I observe it is high, but it has not proved useful.  It can be caused by
    1. the buffer pool being over 95% full, so all application write I/O is synchronous,
    2. or a page was being updated during the last checkpoint, and it needs to be written to the page set when the update has finished.  This should be a small number.  Frequent checkpoints (eg every minute) can increase this value.
  12. DWT – the number of times the Deferred Write processor was started.  This number has little value.
    1. The DWP could have started and been so busy that it never ended, so this counter is 1.
    2. The DWP could have started, written a few pages and stopped – and done this repeatedly, in this case the value could be high.
  13. DMC – the number of times the buffer pool crossed the 95% limit.  If this is > 0, it only tells you the limit was crossed:
    1. This could have crossed just once, and stayed over 95%
    2. This could have gone above 95%, then below 95% etc.
  14. SOS – the buffer pool was totally short on storage – there were no free pages.  This is a critical indicator.  You may need to do capacity planning, and make the buffer pool bigger, or see if there was a problem where messages were not being got.
  15. STL – the number of times a “free buffer” was reused.  A buffer was associated with a page of a page set.   The buffer has been reused and is now for a different page.  If STL is zero, it means all pages that were used were in the buffer pool
  16. STLA – A measure of contention when pages are being reused.  This is typically zero.
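As promised under item 3, here is a minimal sketch of turning the Size and Low statistics into a % full figure – the figures in main are invented for illustration:

public class PercentFull {
    // 100 * (Size - Low) / Size, from item 3 above
    static double percentFull(long sizePages, long lowestFreePages) {
        return 100.0 * (sizePages - lowestFreePages) / sizePages;
    }

    public static void main(String[] args) {
        // hypothetical interval: a 50,000 page pool whose free pages dropped to 4,000
        System.out.printf("%.1f%% full%n", percentFull(50_000, 4_000)); // 92.0% full
    }
}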

Summary

Now you know as much as I do about buffer pools, you'll see that the % full (or % free) is the key measure.  If the buffer pool is more than 85% in-use pages, then the I/O rate is a measure of how hard the buffer manager is working.

Do performance measurements need to be accurate and precise?

Like many performance questions the answer is “it depends”.

This question is one of those "simple" questions which most people get wrong. For example, you can have a very precise answer which is inaccurate.

Precision

People often give the wrong precision. For example

  • What is the difference between 1 and 1.0?
  • What is the difference between 999 and 1000, and 1000.0?

What is the difference between 1 and 1.0?

With decimal points you can tell the precision from the number of digits after the decimal point.

  • The value 1 is a number between 0.5 and 1.49, a range of almost 1.
  • The value 1.0 is a number between 0.95 and 1.049, a range of almost 0.1.

What is the difference between 999 and 1000, and 1000.0?

With numbers without a decimal point, the convention is that the precision is given by the number of significant (non-zero) digits.

  • 999 is in the range 998.5 to 999.49, a range of almost 1.
  • 1000 has one significant digit, so it can represent a number in the range 500 to 1490, a range of almost 1000!
  • 1000.0 represents a number in the range 999.5 to 1000.49, a range of almost 1.

Is precision important?

If you have a set of measurements with values 9.6, 9.8, 9.9, 10.0, 10.1, 10.2, and 10.4, the average is 10.0. Saying the average is 10.0 is misleading, as it implies the value is between 9.95 and 10.049. The values have a spread, and in this case saying the value is 10 is better than saying 10.0. You could say 10.0 ± 0.4 to give an even better answer.
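A quick sketch of that arithmetic:

public class Spread {
    public static void main(String[] args) {
        double[] values = {9.6, 9.8, 9.9, 10.0, 10.1, 10.2, 10.4};
        double sum = 0;
        for (double v : values) sum += v;
        double mean = sum / values.length; // 10.0
        double maxDeviation = 0;
        for (double v : values) {
            maxDeviation = Math.max(maxDeviation, Math.abs(v - mean));
        }
        // report the spread as well as the average
        System.out.printf("%.1f +/- %.1f%n", mean, maxDeviation); // 10.0 +/- 0.4
    }
}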

Is accuracy important?

Depending on what you want to deliver from your measurements, the CPU figures you collect may not be accurate enough – or they may be good enough not to worry about.

Why are the CPU used figures inaccurate?

There is a gap between what is reported and what is actually used. For a high proportion of measurements the reported CPU is accurate enough. For some measurements you have to factor in this missing usage.

Below is a simplistic view of work being dispatched on an operating system.

  • The dispatcher scans the list of work to find the next work to dispatch
  • Get “Before – clock time”
  • Get “Before – processor CPU time”
  • Invoke the work. It does calculations, starts an I/O request and waits for the response, by returning control to the dispatcher
  • Get “After – processor CPU time”
  • Get “After – clock time”
  • Calculate the difference in clock times, and accumulate this in the "work – active time"
  • Calculate the difference in the processor CPU times to determine how much CPU time was used for the work, and accumulate this in the "work – CPU time"
  • The dispatcher scans the list of work to find the next work to dispatch
  • etc
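As an illustration of this before/after accounting, here is a minimal sketch using Java's standard ThreadMXBean API – an application-level analogy, not how an operating system dispatcher is actually written:

import java.lang.management.ManagementFactory;
import java.lang.management.ThreadMXBean;

public class DispatchAccounting {
    public static void main(String[] args) {
        ThreadMXBean mx = ManagementFactory.getThreadMXBean();

        long beforeClock = System.nanoTime();          // "Before - clock time"
        long beforeCpu = mx.getCurrentThreadCpuTime(); // "Before - processor CPU time"

        doWork();                                      // invoke the work

        long afterCpu = mx.getCurrentThreadCpuTime();  // "After - processor CPU time"
        long afterClock = System.nanoTime();           // "After - clock time"

        long activeTime = afterClock - beforeClock;    // accumulate into "work - active time"
        long cpuTime = afterCpu - beforeCpu;           // accumulate into "work - CPU time"
        System.out.printf("active %d ns, cpu %d ns%n", activeTime, cpuTime);
    }

    static void doWork() {
        // stand-in for real work: calculations, I/O requests, ...
        double x = 0;
        for (int i = 1; i < 1_000_000; i++) x += Math.sqrt(i);
        if (x < 0) System.out.println(x); // keep the loop from being optimised away
    }
}

Neither measurement includes the cost of the timing calls themselves, which is exactly the gap described below.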

The elapsed time figures, and the CPU time used are both accurate, but they do not include the elapsed time or the CPU time the dispatcher took to initiate the work and update the statistics. We can give a better estimate of the true costs.

When running a measurement z/OS can report the CPU used by each engine across the duration of the measurement. We might have

  • Total CPU used across the duration 50 seconds
  • Total CPU used for job A 40 seconds
  • Total CPU used for job B 9.5 seconds
  • Difference in CPU 0.5 seconds

A common way of allocating this "missing CPU" is to apportion it to the jobs according to the amount of CPU each job used in the period. Job A would get 40 + (0.5 * 40/50) = 40.4 seconds, and job B would get 9.5 + (0.5 * 9.5/50) = 9.6 seconds, a total of 50.
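A sketch of that apportioning, using the figures above:

public class ApportionCpu {
    public static void main(String[] args) {
        double totalCpu = 50.0;               // CPU used across the duration
        double[] jobCpu = {40.0, 9.5};        // job A and job B
        double measured = jobCpu[0] + jobCpu[1];
        double missing = totalCpu - measured; // 0.5 seconds unaccounted for

        for (double cpu : jobCpu) {
            // each job gets a share of the missing CPU in proportion to its usage
            double adjusted = cpu + missing * cpu / totalCpu;
            System.out.printf("%.1f -> %.1f%n", cpu, adjusted); // 40.0 -> 40.4, 9.5 -> 9.6
        }
    }
}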

Is this completely accurate? No – but it is more accurate than before!

Is this important?

If you want to see the effect of defining an index on an SQL table, where the throughput improves by a factor of a hundred or a factor of two, it is good enough to say yes – it improved performance. This does not need to be precise or very accurate.

If you have enhanced your program to remove bottlenecks and can show the CPU per transaction has dropped from 1.2 to 1.1 units then this relative comparison is valid – even if the cost is not entirely accurate.

If you want to give numbers for other people to use, such as the cost of 1000 transactions a second is 1.2 seconds of CPU on this configuration, you need to ensure the figures are accurate.

Do performance measurements need to be accurate and precise?

For many measurements the numbers do not need to be very precise, and the accuracy may be good enough.