One Minute MVS performance – CPU at the LPAR level

Question: how do you tell if your car has a problem? Answer: You look at the dashboard and see if there is a red light showing. You may not know how to fix it – but you know that you need to get help to fix it.

The aim of this series of blog posts is to show you what to look for in z/OS performance, and how to tell if you have a problem.

I will cover a number of topics over the series.

For some of these you need data from z/OS. This post describes how to get the SMF data, and format it using RMF.

CPU

There are two basic things you need to check

  1. Has my LPAR got all the CPU it wanted – has the hypervisor restricted the CPU?
  2. How busy are my CPUs?

Has my LPAR got all the CPU it wanted

An LPAR can be configured to have dedicated engines, or share a pool of engines. Dedicated engines means that the engine is always there when it is needed. If the LPAR is using a shared engine, it may not always be available when needed.

An example to explain the concept

You have a class from 10am to 11am. You go in, and sit down. The teacher starts the class. The teacher’s phone rings, and the teacher goes out of the classroom. You play with your phone until the teacher comes back after 40 minutes. (The teacher went to teach in a different classroom.)
How long were you in class for and how much work did you do?

  • You were in class for 1 hour.
  • You did 20 minutes work.

This concept is the same as any LPAR with shared engines.

  • The 1 hour class is a time slice as seen by z/OS.
  • The “processor” (teacher) was used in the time slice for only 20 minutes
  • For 40 minutes the “processor” was doing work elsewhere.

How do you get the report to show these figures?

You need the RMF CPU report. It has “C P U  A C T I V I T Y” at the top of the page.

Look at the section

---CPU---    ---------------- TIME % ----------------   
NUM  TYPE    ONLINE    LPAR BUSY    MVS BUSY   PARKED   
 0    CP     100.00    46.68        46.32        0.00   
 1    CP     100.00    38.98        38.78        0.00   
 2    CP     100.00    34.91        34.62        0.00   
TOTAL/AVERAGE          40.19       39.90               
 3    IIP    100.00    94.43        94.70        0.00   
 4    IIP    100.00    93.50        93.74        0.00   
TOTAL/AVERAGE          93.96       94.22               

LPAR BUSY is how much teacher time you got.

MVS BUSY is how much time you were in the classroom.

  • If MVS BUSY TIME = LPAR BUSY TIME, perfect, what you needed you got.
  • If MVS BUSY TIME > LPAR BUSY TIME, MVS had to wait for an engine; the system may need more CPU. A small difference (5%) is OK.
  • If MVS BUSY TIME >> LPAR BUSY TIME, for much of the time there was no engine available when MVS needed one. This will have a major impact on your work. If your end-user work is not meeting targets, you need more CPUs, or need to give your LPAR a higher dispatching priority.

These values should be similar: MVS BUSY 39.90 is close to LPAR BUSY 40.19, and for the zIIPs, LPAR BUSY 93.96 is close to MVS BUSY 94.22.

When these figures are significantly different, stop and fix the problem: a shortage of engines can make all other performance data look bad, for example disk response times, and timings in application trace entries.

How busy are my CPUs?

The TOTAL/AVERAGE value will be close to 100% on a busy system. 95% busy is OK, but make a note that the system may be short of CPU.

These are average values. The individual values could be spiky. For example, 100% busy for 4 minutes and 80% busy for 1 minute is the same as an average of 96% busy over 5 minutes. Consider using an online monitor to see if you have big peaks and troughs.

A more advanced topic, for information.

The following section gives you information on how much work was waiting. It is hard to say what is good or bad: the data can look bad while all the performance goals are still being met.

How much work was waiting?

-----------------------DISTRIBUTION OF IN-READY WORK UNIT QUEUE--------------
 NUMBER OF              0    10   20   30   40   50   60   70   80   90   100
 WORK UNITS     (%)     |....|....|....|....|....|....|....|....|....|....|  
                                                                             
<=  N          26.3     >>>>>>>>>>>>>>                                       
 =  N +   1    12.9     >>>>>>>                                              
 =  N +   2    10.1     >>>>>>                                               
 =  N +   3    10.1     >>>>>>                                               
<=  N +   5    12.5     >>>>>>>                                              
<=  N +  10    11.0     >>>>>>                                               
<=  N +  15     6.0     >>>>                                                 
<=  N +  20     5.2     >>>                                                  
<=  N +  30     1.6     >                                                    
<=  N +  40     0.6     >                                                    
<=  N +  60     1.1     >                                                    
<=  N +  80     1.1     >                                                    
<=  N + 100     0.8     >                                                    
<=  N + 120     0.1     >                                                    
<=  N + 150     0.0                                                          
>   N + 150     0.0                                                          

N is the number of CPUs. I have 5 on my system.

The data is sampled. If the system is sampled 10 times a second then, every 0.1 of a second, RMF counts the number of tasks in the “ready to dispatch” queue, and increments the value in the appropriate bucket. For example, if there were 5 tasks executing and one task waiting, it increments the N + 1 element.
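
As an illustration, here is a minimal Java sketch of this style of sampling. The bucket boundaries come from the report above; the class and method names are mine, and the random driver is only there to make it runnable – this is the bucketing idea, not how RMF is actually written.

import java.util.concurrent.ThreadLocalRandom;

public class ReadyQueueSampler {

    // Bucket upper bounds relative to N, matching the report: <= N, = N+1, ..., <= N+150
    static final int[] BOUNDS = {0, 1, 2, 3, 5, 10, 15, 20, 30, 40, 60, 80, 100, 120, 150};
    static final long[] counts = new long[BOUNDS.length + 1];   // last bucket is "> N + 150"
    static final int N = 5;                                     // number of CPUs

    // Called on every sample (e.g. every 0.1 seconds):
    // classify the number of waiting tasks into a bucket.
    static void sample(int readyWorkUnits) {
        int waiting = readyWorkUnits - N;
        for (int i = 0; i < BOUNDS.length; i++) {
            if (waiting <= BOUNDS[i]) { counts[i]++; return; }
        }
        counts[BOUNDS.length]++;
    }

    public static void main(String[] args) {
        int samples = 3000;                                     // 5 minutes at 10 samples/second
        for (int t = 0; t < samples; t++)
            sample(ThreadLocalRandom.current().nextInt(0, 40)); // stand-in for the real queue depth
        for (int i = 0; i < counts.length; i++)                 // report each bucket as a percentage
            System.out.printf("bucket %2d: %5.1f%%%n", i, 100.0 * counts[i] / samples);
    }
}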

  • 26.3 % of the time, there were no tasks waiting for CPU.
  • 12.9 % of the time, there was 1 task waiting for CPU. See the N + 1 row in the data above (12.9, >>>>>>>).
  • 10.1 % of the time, there were 2 tasks waiting for CPU
  • 5.2 % of the time there were between 16 and 20 tasks waiting for CPU
  • 0.1 % of the time there were between 101 and 120 tasks waiting for CPU

Remember this could be work waiting for a CP, or for a zIIP.

If there are hundreds of tasks waiting for CPU you should make a note of it. It may not be a problem.

If there are under 50 tasks waiting for CPU, this should be OK.

On a busy system there will always be work waiting to run. Compare the pictures from a busy time and a not so busy time.

Is this important?

I once did some measurements with MQ on a machine with 16 processors; on average the engines were about 5% busy. A performance person from IBM said that my workload showed a shortage of CPU! 5% busy on 16 processors – was I really short of CPU?

My application received some data, and posted 30 threads to come and process the data. The first 15 threads could be dispatched because there were 15 unused CPUs. 15 threads had to wait.

This showed up in the above report: at the N + 15 line, tasks were waiting 20% of the time.

Out of the 30 tasks that were dispatched, one processed the work, the other 29 went back to sleep.

We changed the program to post no more threads than the number of CPUs (16) in the LPAR, and had a significant saving in CPU.
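
The fix is easy to sketch in Java (our code was not Java; the names here are illustrative): size the worker pool to the number of CPUs, and submit work to the pool, rather than waking a fixed 30 threads per burst of data.

import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

public class Dispatcher {

    // Cap the workers at the number of CPUs, so threads are not woken
    // just to queue for an engine and then go back to sleep.
    private static final int CPUS = Runtime.getRuntime().availableProcessors();
    private final ExecutorService workers = Executors.newFixedThreadPool(CPUS);

    // Called when data arrives: hand over one unit of work,
    // instead of posting 30 threads for one piece of data.
    public void dataArrived(Runnable processData) {
        workers.submit(processData);
    }
}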

One minute MVS performance – getting batch RMF reports

There is an introduction to getting RMF reports documented here.

You can display information about your SMF environment, using

D SMF,O

This tells you if you are using SMF data sets, or log streams (in the coupling facility), for the SMF data.

Copy the data from the SMF data sets

// SET SMFPDS=SYS1.S0W1.MAN3 
// SET SMFSDS=SYS1.S0W1.MAN4 
//* 
//SMFDUMP  EXEC PGM=IFASMFDP 
//DUMPINA  DD   DSN=&SMFPDS,DISP=SHR,AMP=('BUFSP=65536') 
//DUMPINB  DD   DSN=&SMFSDS,DISP=SHR,AMP=('BUFSP=65536') 
//* MPOUT  DD   DISP=(NEW,PASS),DSN=&RMF,SPACE=(CYL,(1,1)) 
//DUMPOUT  DD   DISP=SHR,DSN=IBMUSER.RMF,SPACE=(CYL,(1,1)) 
//SYSPRINT DD   SYSOUT=* 
//SYSIN  DD * 
  INDD(DUMPINA,OPTIONS(DUMP)) 
  INDD(DUMPINB,OPTIONS(DUMP)) 
  OUTDD(DUMPOUT,TYPE(70:79)) 
  DATE(2020316,2022284) 
  START(0930) 
  END(1700) 
/* 

This job copies the records from the “MAN” data sets, and writes them to DUMPOUT.

The RMF records with types 70 to 79 are copied, within the specified dates and start and end times.

Copy the data from a log stream.

SMF can write data to log streams, for example MQ records go to the MQ stream, and the RMF records go to the RMF Stream.

//SMFDUMP EXEC PGM=IFASMFDL
//DUMPOUT DD DSN=&TEMP,SPACE=(CYL,(10,10),RLSE),DISP=(NEW,PASS)
//SYSPRINT DD SYSOUT=*
//SYSIN DD *
DATE(2018004,2018012)
START(0900)
END(1900)
LSNAME(IFASMF.RMF,OPTIONS(ALL))
OUTDD(DUMPOUT,TYPE(70:79,116))
/*

This step writes the data to a temporary data set.

Sort the data

If you are processing the data from more than one LPAR you will need to sort the data. See here.

Format the RMF data

The RMF control statements are described here.

//POST EXEC PGM=ERBRMFPP 
//* use the following if using a temporary data set in same job.
//* MPFINPUT DD DISP=SHR,DSN=*.SMFDUMP.DUMPOUT 
//MFPINPUT DD DISP=SHR,DSN=IBMUSER.RMF 
//PPXSRPTS DD   SYSOUT=*,DCB=(LRECL=200,BLKSIZE=4000) 
//SYSPRINT DD   SYSOUT=* 
//SYSOUT   DD   SYSOUT=* 
//SYSIN DD * 
  SUMMARY(INT,TOT) 
  NODELTA 
  SYSOUT(H) 
  REPORTS(ALL,NOENQ,DEVICE(NOUNITR,NOCOMM)) 
SYSRPTS(WLMGL(RCPER)) 
SYSRPTS(WLMGL(SCPER)) 
/* 
/* SYSRPTS(WLMGL(RCPER)) 
/* SYSRPTS(WLMGL(SCPER,RCLASS,RCPER,SCLASS)) 

This takes the records from MFPINPUT, which could be a permanent data set, or a temporary data set passed from a previous job step.

You can have the output go to the spool (by default) or to preallocated data sets. See here.

One Minute MVS performance reports of interest are

REPORTS(CPU,DEV(DASD))

For WLM the reports

SYSRPTS(WLMGL(RCPER,SCPER))

Straight lines good – curved lines bad.

It was a great day. It was 0830, the sun was shining, and the view was amazing as I walked back the long way from the shops. There was a good hill en route: it started nice and steep, then gradually levelled off to flat. I sat on the convenient seat at the top of the hill to admire the view and get my breath back. The shape of the hill reminded me of a graph in a performance report I had seen a few days before; it started steeply at the bottom left of the graph, and levelled off at the top right, and the description was “so we can see it scales”. As I sat there enjoying the sun, I realised that it may scale – but it does not scale very well.

What does it mean to scale?

I remember, 40+ years ago, sitting in a meeting where the performance team were presenting status. They showed many graphs, one of which showed the transaction rate as they increased the workload. The graph started going up at an angle, nice and straight, and then levelled out when they achieved 100 transactions a second; they could do no more than this. They explained that this was because of a lock being held for 1/100 of a second while some work was done (holding the lock for 10 milliseconds per transaction serializes the work at a maximum of 100 transactions a second).

They worked with the developer who realised that they could use a counter instead of a lock, and use the hardware instruction to atomically increase and decrease the counter. As a result the graph was now a straight line. Round of applause etc, well done etc.

The performance guy then went on to say, so in order to find the next bottlenecks we will need a box that is even bigger and even faster!

I also remember this performance guy said his job was to make curved lines straight!

You can hide the truth with charts

People often use graphs to show information, but the graphs often tell the wrong story. For example, if you plot the total cost against transaction rate, you get what looks like a straight line.

Great – this looks linear!

However, if you plot the cost per transaction against the rate, you get a different picture.

In actual fact there is an increase in the cost per transaction as the rate increases. So good scaling – but not perfect. The cost-per-transaction line is linear – when it should have been flat!
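
A small worked example in Java, with made-up numbers, shows the effect: a fixed cost plus a slowly rising per-transaction cost still plots as a near-straight total-cost line, while the cost-per-transaction line reveals the creep.

public class CostPerTransaction {
    public static void main(String[] args) {
        // Hypothetical model: 100 units of fixed cost, plus a variable cost
        // per transaction that rises slightly with the rate.
        for (int rate = 100; rate <= 1000; rate += 100) {
            double total = 100 + rate * (20.0 + rate / 500.0); // looks linear when plotted
            double perTran = total / rate;                     // rises from ~21.2 to ~22.1
            System.out.printf("rate=%4d total=%8.0f cost/tran=%6.2f%n", rate, total, perTran);
        }
    }
}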

If you want to scare people – change the axis

It is the same data as the previous chart – but now the axis runs from 20 to 22.5 instead of 0 to 25, so any difference appears much larger.

So depending on what story you want to tell, and if you want to hide the truth, think about what data you present, and the axis you use.

Everyone wants a bigger box or two

On the subject of wanting a bigger box…

One of the senior testers presented to management about testing strategy. He said you can do testing in one of two ways:

  1. If we were careful, and picked the right set of results, we could show all tests working. We could quietly ignore the tests that failed.
  2. No matter what we did, we could not seriously break it. Even when we deliberately misconfigured it during testing it carried on running – slowly – but it carried on running.

The manager said “hmmm I’m always being asked for graphs to present status so the first would be better, but from a customer perspective the second would be better”.

The tester said “That’s the right answer, we now need two bigger boxes. One to run the workload, the other to drive it. We need to achieve higher throughput than our customers”.

One customer told me that their test box was larger than production, and they needed to prove that they could run at expected peak production throughput + 50% with no major degradations – they needed many very big boxes.

How to become a performance expert in 3 easy lessons

and many hard lessons.

I had emails from two people, with different experiences of doing performance on z/OS. One person has recently started, and is not sure what is involved. The other person has been doing lots of work with customers explaining that his product is not the cause of the performance problems.

I thought it might be interesting for people who might be tempted to work in performance, to see the route to becoming an expert.

What does “performance” mean?

Performance work covers many different areas, and once you are competent in one product area it is not too difficult to cover additional areas.

“Performance” covers

Making sure it scales close to linearly

If you double the throughput, the costs per transaction should be similar. As the throughput increases, the response time does not increase significantly. You can have many threads running concurrently.

If the workload has disk I/O then you need to have multiple threads, so while one task is waiting for I/O another task can be using the CPU.

You need a box with multiple CPUs to detect contention. If you have only one or two engines you may not detect concurrency issues.

Work to remove contention until you can drive the CPUs at 100% busy (and then you ask for a bigger box). If you cannot drive the box at 100% find out why, resolve it and repeat.

Reduce CPU

Once you have eliminated as much contention as possible, you need to investigate where the CPU is being used, and try to eliminate any hot spots. This might be

  • Change algorithms – use a hash table instead of a linked list.
  • Avoid unnecessary work. Do you really need to store intermediate values in a database?
  • Can you tune the services being used? For example, tune the database, or add an index to a table.
  • Rearrange the code, for example have the “hot code” located in the same few pages. Avoid lots of error handling code in the mainline code – branch out of the mainline to handle it.
  • Remove debug code, or put debug code within if (debug enabled) then { debug code } – see the sketch after this list.
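
A minimal Java sketch of the debug guard (the names are illustrative):

public class Trace {

    // Volatile, so a change made by an operator command is seen by all threads.
    static volatile boolean debugEnabled = false;

    static void debug(String format, Object... args) {
        // The formatting cost is only paid when debug is enabled.
        // Callers on very hot paths can test the flag themselves as well,
        // to avoid even building the argument list.
        if (debugEnabled) {
            System.err.printf(format, args);
        }
    }
}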

Work with customer problems

Understand which areas the users have problems with, and identify the “problem areas” where it takes a long time to find the cause.

Enhance the design

From your testing, and your experience with customer problems, propose improvements to help diagnose problems. For example (the first item is sketched in code after this list):

  • Capture the number, the average time, and the maximum time of database requests. Report this in statistics or in response to a display command.
  • Record the number of times a resource, such as a lock, was not available: the total count of requests, the number of blocked requests, and the time spent waiting. This code may never be executed, but if it is, you get useful information about the size of the problem.
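
A minimal Java sketch of the first item (the names are mine, not any product's actual code):

import java.util.concurrent.atomic.LongAccumulator;
import java.util.concurrent.atomic.LongAdder;

public class RequestStats {

    private final LongAdder count = new LongAdder();
    private final LongAdder totalMicros = new LongAdder();
    private final LongAccumulator maxMicros = new LongAccumulator(Math::max, 0);

    // Called around every database request.
    public void record(long durationMicros) {
        count.increment();
        totalMicros.add(durationMicros);
        maxMicros.accumulate(durationMicros);
    }

    // Called from the statistics interval, or a display command.
    public String display() {
        long n = count.sum();
        return String.format("requests=%d avg=%dus max=%dus",
            n, n == 0 ? 0 : totalMicros.sum() / n, maxMicros.get());
    }
}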

Provide useful information to the end user

These are often known as “performance reports”. It is easy to produce reports that people cannot use – I have done it many times. Reports with nice graphs are often not easy to use, as they do not match the reader's scenario.

You need to consider the questions the end users will have.

  • I want to run an ill-defined workload (I do not know all the details); how big a box do I need (how many CPUs) to support 1000 requests a second?
  • What should I look at to tell me if things are running well or not?
  • What are common symptoms, and what actions can I take to solve performance problems?
  • What things do we need to consider to make it run well? For example table layout, how many requests per commit, how often you need to sign on.

Performance roles

The roles below are typical of the sort of activities a performance person will do.

Run tests

The first task a person usually does when becoming a performance person is to run tests, and collect the data. This may involve writing scripts and tools so it can all be automated. For example, on z/OS you might use Netview to run scripts, capture responses, and take actions when there are problems. This could all be done using Rexx scripts in TSO, possibly using a REST interface.

Good automation will collect all of the key metrics into one place, for example a spreadsheet, so the analyst can simply press a button or two to display the data.

There may be a management report produced daily or weekly to show that performance overall has improved – or has not got worse.

Look at a component

You need to look at components within the whole environment, for example this week, look at the z/OSMF SDSF interface, next week the logon process.

You need to drive a high volume workload using this component, and you need to focus on the component. For example, with REST requests, 90% of the cost may be in the logon and establishing a session, which makes it hard to focus on the other 10%. Sign on once, and have an application that just issues requests to the component.

When I was testing MQ under CICS, an MQPUT took 50 microseconds, and the cost of starting the CICS transaction was 1000 microseconds. I changed the transaction to process 1000 messages, so the transaction now took about 50 milliseconds, and most of the work was in the MQPUT area, not in the CICS transaction overhead.

Capture the response time of the transactions and plot it over time. You should get a flat line. If the response increases over time, you might have a storage leak, and so it takes longer to get storage.

You may find it does not scale. Turning trace on can give an indication where the problem is. You often get function entry and exit trace, with time stamps, so you can post process the output to calculate the duration within the function. Trace often does not scale, so you cannot always believe the output.

You may want to instrument a private copy of the code. Obtain the time on entry and exit to the function, and across major calls to external requests. Calculate the duration of the calls, and add logic to say “if duration > 10 milliseconds then throw an exception”, or accumulate the data in a global control block. When I did this, I found the trace code was adding significant delays, and the root cause of the problem was an insignificant-looking line of code, which got an exclusive latch for an update!
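
A sketch of that sort of instrumentation in Java (the 10 millisecond threshold comes from the story above; everything else is illustrative):

import java.util.function.Supplier;

public class Timed {

    static final long THRESHOLD_NANOS = 10_000_000;   // 10 milliseconds

    // Wrap a call to an external service; fail loudly if it is slow,
    // so the stack trace points straight at the slow request.
    static <T> T timed(String what, Supplier<T> call) {
        long start = System.nanoTime();
        T result = call.get();
        long durationNanos = System.nanoTime() - start;
        if (durationNanos > THRESHOLD_NANOS) {
            throw new RuntimeException(what + " took " + durationNanos / 1_000_000 + " ms");
        }
        return result;
    }
}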

I added code to measure the average duration of file I/O, and output this in the statistics. This made solving some problems very easy – you have an I/O problem. See here, it is taking 10 ms to write a page of data!

Unless you are testing the startup times, you should allow the system under test to “warm up”, so the hardware cache is in a steady state, database tables are in memory etc.

I found it useful to warm it up, then take 5 sets of measurements, each of 1-5 minutes. When displaying the data, the results should all be similar. If not, you need to find out why. You should also run these tests once a week, and whenever you change a component, such as putting fixes onto your system, or changing the hardware. Some examples of things that can change your results:

  • Overnight the Operations Team run backups and cause a different disk response time
  • The order the LPARs were IPLed has changed. Last week your system had 6 CPUs in one book (so all very close to each other); this week your system has 3 CPUs in one book – and 3 CPUs in a different book – 1 metre away.
  • The network between your driving system and the test system has changed, or has a different load.

Usually the performance machines have their own dedicated hardware, processors, disks, connections to the disks, network.

Develop skills in other products

My background is MQ performance on z/OS. I had to learn about the performance characteristics of z/OS, DB2, TCP/IP, IMS, and understand the tools these products provide. Once you understand one trace, other traces are basically similar. The hard part is capturing the trace.

MQ passes messages from system to system. There were several problems where the “network was slow”. This meant we had to understand what was happening under the covers. Some good problems with easy fixes are below (a code sketch of both fixes follows the list).

  • There was a TCP performance “improvement” where one end would delay sending a packet for a few milliseconds, as it is more efficient to send one big packet rather than several smaller packets. This meant that every MQ message sent over the network had a couple of milliseconds' delay. The fix was easy – disable this feature.
  • TCP/IP by default uses small buffers (256 bytes). You can configure a session to have very large buffers, and tell it to automatically tune the best buffer size (up to MB-sized buffers).
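
Both fixes map onto standard socket options. A Java sketch (the buffer sizes are illustrative; on z/OS you would typically set the equivalents in the TCP/IP profile or the product configuration):

import java.net.Socket;
import java.net.SocketException;

public class TcpTuning {

    public static void tune(Socket socket) throws SocketException {
        // Disable the delayed-send "improvement" (the Nagle algorithm), so small
        // messages are sent immediately instead of waiting to fill a packet.
        socket.setTcpNoDelay(true);

        // Ask for large send/receive buffers; the OS may round or cap the values.
        socket.setSendBufferSize(1024 * 1024);
        socket.setReceiveBufferSize(1024 * 1024);
    }
}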

Work with customers on their performance problems

The work involves working on performance problems where none of your specially written instrumentation code is included. You may need to turn on the product trace for a few seconds, then turn it off, and then process the output. Many customers do not run with trace on, because of the overhead and major impact on throughput.

You can acquire the skills to talk to customers on the phone about their problems. It is very good to feed back what you heard: “Let me check what you just said … when you do … you get …”

Over time you will build up a list of questions to ask.

Once the problem has been resolved, consider what would have made it easier to find the root cause. Can you get development to put in some statistics, so next time this happens, you can tell the customer to check a value.

In the early days of MQ, we used to get many problems because the in-memory buffer was too small. Development put out a fix, so that every 10 minutes or so it would report if it had detected a buffer-full problem since the last message. After this fix was rolled out, we had no more of these problems.

There is no limit as to how far you can go

Once you have skills in one component you can apply these skills to other products or components. For example I spent some time looking at MQ on Linux so I could understand (and blog) on the performance data produced. (The performance data was “here are some numbers, we are not going to tell you what they mean”).

I’ve also been looking at Java performance, which led me to look at the zFS file system, and the statistics it provides (it provides some – but they are not very useful).

You can also go deep. I knew about z architecture instructions and how some are fast and some are slow. I attended a taskforce with lots of hardware people. I met the team leader for the “load instructions”, and found that the “load instruction” was not an instruction – it is more like a subroutine with logic, for example

  • Find which CPU currently “owns” this data in its cache, and go and get it
  • Lock the page
  • Go and get this value from another page
  • Add the two values
  • Unlock both the pages

The subroutine had to communicate with other CPUs in the LPAR, worry about its own CPU cache etc. Deep Stuff!

Once you know this sort of stuff, it helps you program; for example, it is better not to share a field if you do not have to. When a multi-threaded program uses a buffer to trace into, do not have one buffer which all the threads share, but give each thread its own buffer. This way the hardware will not be fighting over the buffer, and the data for each application can be kept on the same CPU as the program. This is obvious once you know!

Collect statistics at the thread level, and not at the global level. Merge them at display time. You know the reason why.
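A Java sketch of per-thread collection with merge-at-display (illustrative):

import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

public class PerThreadCounters {

    // Each thread updates only its own slot, so CPUs do not fight
    // over a shared counter's cache line on the hot path.
    private static final Map<Thread, long[]> all = new ConcurrentHashMap<>();
    private static final ThreadLocal<long[]> mine = ThreadLocal.withInitial(() -> {
        long[] slot = new long[1];
        all.put(Thread.currentThread(), slot);
        return slot;
    });

    public static void increment() { mine.get()[0]++; }

    // Merge only when someone asks; slightly stale values are fine for statistics.
    public static long total() {
        return all.values().stream().mapToLong(s -> s[0]).sum();
    }
}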

The hardware can start to execute instructions out of order – as long as they “commit” in the right order.

The z hardware has instrumentation which samples the executing system, and can tell you why instructions were delayed. For example

  • Data had to be obtained from the L2 cache on the chip
  • The instruction needed to be interpreted and added to the Translation Lookaside Buffer

This is a bit deep for many people, especially if they are at the level of using “printf” in their programs to display debug information.

“Me, with the brain the size of a planet ….”

This is a quote from Marvin the Paranoid Android in The Hitchhiker’s Guide to the Galaxy. With performance work you can go deep, or you can go wide, but you would need a bigger brain than I have to go deep and wide – but it is a fascinating area.

Some of the mysteries of Java shared classes

When Java executes a program it reads in the jar file, breaks it into the individual classes, and converts the byte codes into instructions; when executing, it may replace instructions with more efficient instructions (JITting). It can also convert the byte codes into instructions ahead of time, so-called Ahead Of Time (AOT) compilation.

With shared classes, the converted byte codes, any Jitted code, and any AOT code can be saved in a data space.

  • When the Java program runs a second time, it can reuse the data in the data space, avoiding the overhead of reading the jar file from the file system, and of converting the byte codes into instructions.
  • The data space can be hardened to a file, and restored to a data space, so can be used across system IPLs.

Using this reduced the start-up time of my program by over 20 seconds on my slow zPDT system. The default size of the cache is 16MB – one of my applications needed 100 MB, so most of the benefits of the shared classes could not be exploited if the defaults were used.

This blog post describes more information about this, and what tuning you can do.

Issuing commands to manage the shared classes cache

Commands to manage the shared classes cache are issued like

java -Xshareclasses:cacheDir=/tmp,name=client6,printStats

which can be done using JCL

// SET V='listAllCaches'
// SET V='printStats'
// SET C='/tmp'
// SET N='client6'
//S1 EXEC PGM=BPXBATCH,REGION=0M,
// PARM='SH java -Xshareclasses:cacheDir=&C,name=&N,verbose,&V'
//STDERR DD SYSOUT=*
//STDOUT DD SYSOUT=*

Enabling shared classes

You specify -Xshareclasses as a parameter to the program, for example on the command line or in a jvm.options file.

To use the shared classes capability you have to specify all of the parameters on one line, like

-Xshareclasses:verbose,name=client6,cacheDirPerm=0777,cacheDir=/tmp

Having it like

-Xshareclasses:name=client6,cacheDirPerm=0777,cacheDir=/tmp
-Xshareclasses:verbose

means the name, etc., all take their defaults. Only -Xshareclasses:verbose would be used.

Changing shared classes parameters

You can have more than one cache; you specify a name. You specify a directory where an image is stored when the cache is hardened to disk.

Some of the options, like name= and cacheDir=, are picked up when the JVM starts. Other parameters, like cacheDirPerm, are only used when the cache is (re-)created.

You can delete the cache in two ways.

Delete the cache from your Java program

When you are playing around, you can add reset to the end of the -Xshareclasses string to cause the cache to be deleted and recreated. This gives output like

JVMSHRC010I Shared cache “client6” is destroyed
JVMSHRC158I Created shared class cache “client6”
JVMSHRC166I Attached to cache “client6”, size=20971328 bytes

This was especially useful when tuning the storage allocations.

Delete the cache independently

java -Xshareclasses:cacheDir=/tmp,name=client6,destroy

How to allocate the size of the cache

You specify the storage allocations using -Xsc… (where sc stands for shared classes).

If you have -Xshareclasses:verbose… specified, then when the JVM shuts down you get

JVMSHRC168I Total shared class bytes read=11660. Total bytes stored=5815522
JVMSHRC818I Total unstored bytes due to the setting of shared cache soft max is 0.
Unstored AOT bytes due to the setting of -Xscmaxaot is 1139078.
Unstored JIT bytes due to the setting of -Xscmaxjitdata is 131832.

This shows the values of maxaot and maxjitdata were too small. They were

-Xscmx20m
-Xscmaxaot2k
-Xscmaxjitdata2k

When the values were big enough I got

JVMSHRC168I Total shared class bytes read=12960204. Total bytes stored=8885038
JVMSHRC818I Total unstored bytes due to the setting of shared cache soft max is 0.
Unstored AOT bytes due to the setting of -Xscmaxaot is 0.
Unstored JIT bytes due to the setting of -Xscmaxjitdata is 0.

How big a cache do I need?

If you use -Xshareclasses:verbose… it will display messages

for example

JVMSHRC166I Attached to cache “client6”, size=2096960 bytes
JVMSHRC269I The system does not support memory page protection

JVMSHRC096I Shared cache “client6” is full. Use -Xscmx to set cache size.
JVMSHRC168I Total shared class bytes read=77208. Total bytes stored=2038042

Message JVMSHRC096I Shared cache “client6” is full. Use -Xscmx to set cache size, tells you the cache is full – but gives no information about how big it needs to be.

You can use

java -Xshareclasses:cacheDir=/tmp,name=client6,printStats

to display statistics like

[-Xshareclasses persistent cache disabled]                                         
[-Xshareclasses verbose output enabled]                                            
JVMSHRC159I Opened shared class cache "client6"                                    
JVMSHRC166I Attached to cache "client6", size=2096960 bytes                        
JVMSHRC269I The system does not support memory page protection                     
JVMSHRC096I Shared cache "client6" is full. Use -Xscmx to set cache size.          
                                                                                   
Current statistics for cache "client6": 
cache size                           = 2096592                       
softmx bytes                         = 2096592                       
free bytes                           = 0                             
ROMClass bytes                       = 766804                        
AOT bytes                            = 6992                          
Reserved space for AOT bytes         = -1                            
Maximum space for AOT bytes          = 1048576                       
JIT data bytes                       = 212                           
Reserved space for JIT data bytes    = -1                            
Maximum space for JIT data bytes     = 1048576                       
Zip cache bytes                      = 1131864                       
Startup hint bytes                   = 0                             
Data bytes                           = 13904                         
Metadata bytes                       = 12976                         
Metadata % used                      = 0%                            
Class debug area size                = 163840                        
Class debug area used bytes          = 119194                        
Class debug area % used              = 72%

Cache is 100% full  
                                                                             

This shows the cache is 100% full, and how much space is used for AOT and JIT. I had the default -Xscmx value of almost 16MB. I made it 200MB and this was large enough.

I could not find a way of getting my program to issue printStats.

How do I harden the cache?

You can use the

java -Xshareclasses:cacheDir=/tmp,name=zosmf,verbose,snapshotCache

command to create the cache on disk. Afterwards the listAllCaches command gave

Cache name  level         cache-type      feature 
client6     Java8 64-bit  non-persistent  cr
client6     Java8 64-bit  snapshot        cr

This shows the non-persistent data space, and the snapshot file.

You can use restoreFromSnapshot to restore from the file to the data space, before you start your Java program. You would typically do this after an IPL.

How can I tell what is going on and if shared classes is being used?

The Java option “-verbose:dynload,class”

reports on the

  • dynamic loading of the files, and processing them,
  • what classes are being processed.

For example

<Loaded java/lang/reflect/AnnotatedElement from /Z24A/usr/lpp/java/J8.0_64/lib/rt.jar>
< Class size 3416; ROM size 2672; debug size 0>
< Read time 1196 usec; Load time 330 usec; Translate time 1541 usec>
class load: java/lang/reflect/AnnotatedElement from: /Z24A/usr/lpp/java/J8.0_64/lib/rt.jar
class load: java/lang/reflect/GenericDeclaration from: /Z24A/usr/lpp/java/J8.0_64/lib/rt.jar

dynload gave

<Loaded java/lang/reflect/AnnotatedElement from /Z24A/usr/lpp/java/J8.0_64/lib/rt.jar>
< Class size 3416; ROM size 2672; debug size 0>
< Read time 1196 usec; Load time 330 usec; Translate time 1541 usec>

This tells you a jar file was read from the file system, and how long it took to process it.

class gave

class load: java/lang/reflect/AnnotatedElement from: /Z24A/usr/lpp/java/J8.0_64/lib/rt.jar
class load: java/lang/reflect/GenericDeclaration from: /Z24A/usr/lpp/java/J8.0_64/lib/rt.jar

This shows two classes were extracted from the jar file.

In a perfect system you will get the class load entries, but not <Loaded….

Even when I had a very large cache size, I still got dynload entries. These tended to be loading class files rather than jar files.

For example there was a dynload entry for com/ibm/tcp/ipsec/CaApplicationProperties. This was the file /usr/lpp/zosmf/installableApps/izuCA.ear/izuCA.war/WEB-INF/classes/com/ibm/tcp/ipsec/CaApplicationProperties.class

If you can make these into a .jar file you may get better performance. (But you may not get better performance, as it may take more time to load a large jar file).

I also noticed that there was dynload for com/ibm/xml/crypto/IBMXMLCryptoProvider which is in /Z24A/usr/lpp/java/J8.0_64/lib/ext/ibmxmlcrypto.jar, so shared classes has some deeper mysteries!

What happens if the .jar file changes?

As part of the class load, Java checks the signature of the file on disk against the signature in the data space. If they are different, the data space will be updated.

Looking at the performance of Liberty products and what you can do with the information

The Liberty Web Server on z/OS can produce information on the performance of “transactions”. It may not be as you expect, and it may not be worth collecting it. I started looking into this to see why certain transactions were slow on z/OSMF.

This blog post covers

  • How does the web server work?
  • How can I see what is going on?
  • Access logs
  • SMF 120
  • WLM
  • Do I want to use this in production?

How does it work?

A typical Web Server transaction might use a database to debit your account, and to credit my account. I think of this as a fat transaction, because the application does a lot of work.

z/OSMF runs on top of Liberty, allows you to run SDSF and ISPF within a web browser, and has tools to help automate work. You can also use REST services, so you can send a self-contained HTTP request to z/OSMF, for example to request information about data sets, or about jobs running on the system. Both of these send a request to z/OSMF, which might send a request to a TSO userid, get the response back, and pass the response back to the requester. I think of this as a thin transaction, because the application running on the web server is mainly acting as a router or broker. What the end user sees as a “transaction” may be many micro-services – each of which is a REST request.

How can I see what is going on?

You can see what “transactions” or requests have been issued from

  • the access log – a flat file of requests
  • SMF 120 records
  • WLM reports.

Access logs

In the <httpEndpoint … accessLoggingRef="accessLogging"…> element, the accessLogging value is a reference to an <httpAccessLogging…> statement. This creates a flat file containing a record of all inbound HTTP client requests. If you have multiple httpEndpoint statements, you can have multiple accessLogging files. You can control the number and size of the files.
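
For example, a minimal server.xml sketch (I have taken the filepath, maxFiles and maxFileSize attribute names from my memory of the Liberty documentation – check them before use; the values are illustrative):

<httpAccessLogging id="accessLogging"
                   filepath="${server.output.dir}/logs/http_access.log"
                   maxFiles="4" maxFileSize="20"/>

<httpEndpoint id="defaultHttpEndpoint" host="*" httpsPort="10443"
              accessLoggingRef="accessLogging"/>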

This has information with fields like

  • 10.1.0.2 – the IP address the request came in from
  • COLIN – the userid
  • {16/May/2021:14:47:40 +0000} – the date, time and time zone of the event
  • PUT /zosmf/tsoApp/tso/ping/COLIN-69-aabuaaac – the request
  • HTTP/1.1 – the type of HTTP request
  • 200 – the HTTP return code
  • 78 – the number of bytes sent back.

You can have multiple httpEndpoint definitions, for example specifying different IP address and port combinations. These definitions can point to a shared or individual httpAccessLogging statement, and so can share (or not) the flat files. This allows you to specify that one port uses access logging, and another port does not.

The SMF 120 records.

The request logging writes a record to SMF for each request.

<server>
<featureManager>
<feature>zosRequestLogging-1.0</feature>
</featureManager>
</server>

I have not been able to find a formatter for these records from IBM, so I have written my own; it is available on GitHub.

It produces output like

Server version      :2                                                              
System              :S0W1                                                           
Syplex              :ADCDPL                                                         
Server job id       :STC02771                                                       
Server job name     :IZUSVR1                                                        
config dir          :/var/zosmf/configuration/servers/zosmfServer/                  
codelevel           :19.0.0.3  
                                                     
Start time          :2021/05/16 14:42:33.955288                                     
Stop  time          :2021/05/16 14:42:34.040698                                     
Duration in secs    : 0.085410                                                      
WLMTRan             :ZCI4                                                           
Enclave CPU time    : 0.042570                                                      
Enclave ZIIP time   : 0.042570                                                      
WLM Resp time ratio :10.000000                                                      
userid long         :COLIN                                                          
URI                 :/zosmf/IzuUICommon/externalfiles/sdsf/index.html               
CPU Used Total      : 0.040111                                                      
CPU Used on CP      : 0.000000                                                      
CPU Delta Z**P      : 0.040111 
                                                     
WLM Classify type   :URI                                                            
WLM Classify by     :/zosmf/IzuUICommon/externalfiles/sdsf/index.html
WLM Classify type   :Target Host                            
WLM Classify by     :10.1.1.2                               
WLM Classify type   :Target Port                            
WLM Classify by     :10443   
                               
Response bytes      :7003                                   

Origin port         :57706                                  
Origin              :10.1.0.2                                              

This has information about

  • The Liberty instance, where it is running and configuration information
  • Information about the transaction, start time, stop time, duration and CPU time used
  • WLM information about the request. It was classified as
    • URI:…index.html
    • Target Host:10.1.1.2
    • Target port:10443
  • 7003 bytes were sent to the requester
  • the request came from 10.1.0.2 port 57706

The SMF formatter program also summarises the records and this shows there are records for

  • /zosmf/IzuUICommon/externalfiles/sdsf/js/ui/theme/images/zosXcfsCommand_enabled_30x30.png
  • /zosmf/IzuUICommon/externalfiles/sdsf/js/ui/theme/sdsf.css
  • /zosmf/IzuUICommon/externalfiles/sdsf/sdsf.js
  • /zosmf/IzuUICommon/externalfiles/sdsf/IzugDojoCommon.js
  • /zosmf/IzuUICommon/persistence/user/com.ibm.zos.sdsf/JOBS/JOBS

This shows there is a record for each part of the web page: icons, JavaScript files, and style sheets.

Starting up SDSF within z/OSMF created 150 SMF records! Refreshing the data just created 1 SMF record. The overhead of creating all the SMF records for one “business transaction” may be too high for production use.

As far as I can tell this configuration is server wide. You cannot enable it for a specific IP address and port combination.

WLM reports

Much of the data produced in the records above can be passed to WLM. This can be used to give threads appropriate priorities, and can produce reports.

You enable WLM support using

<featureManager>
  <feature>zosWlm-1.0</feature>
</featureManager>
<zosWorkloadManager collectionName="MOPZCET"/>
<wlmClassification>
  <httpClassification transactionClass="ZCI6" resource="/zosmf/desktop/"/>
  <httpClassification transactionClass="ZCI1" resource="/**/sdsf/**/*"/>
  <httpClassification transactionClass="ZCI3" resource="/zosmf/webispf//"/>
  <httpClassification transactionClass="ZCI4" resource="/**/*"/>
  <httpClassification transactionClass="ZCI2" resource="IZUGHTTP"/>
  <httpClassification transactionClass="ZCI5" port="10443"/>
</wlmClassification>

Here the httpClassification maps a z/OSMF resource to an 8-character transaction class. The records are processed from the top until there is a match. For example, port=10443 would not be used, because the generic resource="/**/*" definition matches first.

These need to be mapped into the WLM configuration…

WLM configuration

You can configure WLM through the WLM configuration panels.

option 6. Classification rules.

option 3 to modify CB(Component broker)

          -------Qualifier--------                 -------Class--------  
 Action   Type      Name     Start                  Service     Report   
                                          DEFAULTS: STCMDM      TCI2
  ____  1 CN        MOPZCET  ___                    ________    THRU
  ____  2   TC        ZCI1     ___                  SCI1        RCI1
  ____  2   TC        ZCI2     ___                  SCI2        RCI2
  ____  2   TC        ZCI3     ___                  SCI3        RCI3
  ____  2   TC        ZCI4     ___                  SCI4        RCI4
  ____  2   TC        ZCI5     ___                  SCI5        RCI5
  ____  2   TC        ZCI6     ___                  SCI6        RCI6
  ____  2   TC        THRU     ___                  THRU        THRU

For the Type=CN, Name=MOPZCET line: this value ties up with the <zosWorkloadManager collectionName="MOPZCET"/> above. Type=CN is for CollectionName.
For the subtype 2 TC Name ZCI4 line: this is for TransactionClass, which ties up with an httpClassification transactionClass statement.

The service classes SCI* control the priority of the work; the report classes RCI* allow you to produce a report by this name.

If you make a change to the WLM configuration you can save it from the front page of the WLM configuration panels, Utilities-> 1. Install definition, and activate it using Utilities-> 3. Activate service policy.

If you change the statements in Liberty or z/OSMF I believe you have to restart the server.

How to capture the data

The data is written out to SMF records on a timer, or on the SMF end-of-interval broadcast. If you change the interval, SMF sends an end-of-interval broadcast and writes the records to SMF. For example, on my dedicated test system I use the operator command setsmf intval(10) to change the interval to 10 minutes. After the next test, I use setsmf intval(15), etc.

The data is kept in SMF buffers, and you may have to wait for a time before the data is written out to external storage. If SMF data is being produced on a regular basis, it will be flushed out.

How to report the data

I copy the SMF data to a temporary data set

//IBMURMF  JOB 1,MSGCLASS=H,RESTART=POST 
//* DUMP THE SMF DATASETS 
// SET SMFPDS=SYS1.S0W1.MAN1 
// SET SMFSDS=SYS1.S0W1.MAN2 
//* 
//SMFDUMP  EXEC PGM=IFASMFDP 
//DUMPINA  DD   DSN=&SMFPDS,DISP=SHR,AMP=('BUFSP=65536') 
//DUMPINB  DD   DSN=&SMFSDS,DISP=SHR,AMP=('BUFSP=65536') 
//* MPOUT  DD   DISP=(NEW,PASS),DSN=&RMF,SPACE=(CYL,(1,1)) 
//DUMPOUT  DD   DISP=SHR,DSN=IBMUSER.RMF,SPACE=(CYL,(1,1)) 
//SYSPRINT DD   SYSOUT=* 
//SYSIN  DD * 
  INDD(DUMPINA,OPTIONS(DUMP)) 
  INDD(DUMPINB,OPTIONS(DUMP)) 
  OUTDD(DUMPOUT,TYPE(70:79)) 
  DATE(2020316,2021284) 
  START(1539) 
  END(2359) 
/* 

and display the report classes

//POST EXEC PGM=ERBRMFPP 
//* PINPUT DD DISP=SHR,DSN=*.SMFDUMP.DUMPOUT 
//MFPINPUT DD DISP=SHR,DSN=IBMUSER.RMF 
//SYSIN DD * 
SYSRPTS(WLMGL(RCPER)) 
/* 

The output in //PPXSRPTS was

REPORT CLASS=RCI4                                                      
-TRANSACTIONS--  TRANS-TIME HHH.MM.SS.FFFFFF  
AVG        0.01  ACTUAL                69449  
MPL        0.01  EXECUTION             68780  
ENDED       160  QUEUED                  668  
END/S      0.13  R/S AFFIN                 0  
#SWAPS        0  INELIGIBLE                0  
EXCTD         0  CONVERSION                0  
                 STD DEV              270428  
                                              
----SERVICE----   SERVICE TIME  ---APPL %---  
IOC           0   CPU    2.977  CP      0.05  
CPU        2551   SRB    0.000  IIPCP   0.05  
MSO           0   RCT    0.000  IIP     0.19  
SRB           0   IIT    0.000  AAPCP   0.00  
TOT        2551   HST    0.000  AAP      N/A  
/SEC          2   IIP    2.338                
ABSRPTN     232   AAP      N/A                
TRX SERV    232                               
                                              

There were 160 “transactions” within the time period, or 0.13 per second. The average response time was 69449 microseconds, with a standard deviation of 270428. This is a very wide standard deviation, so there was a mixture of long and short response times.

The total CPU for this report class was 2.977 seconds of CPU, or 0.019 seconds per “transaction”.

Do I want to use this in production?

I think that the amount of data produced is manageable for a low-usage system. For a production environment where there is a lot of activity, the amount of data produced, and the cost of producing it, may be excessive. This could be an example of the cost of collecting the data being much larger than the cost of running the workload.

As z/OSMF acts as a broker, passing requests between end points you may wish just to use your existing reporting structures.

I used z/OSMF to display SDSF data, and set up an ISPF session within the web browser. This created two TSO tasks for my userid. If you include my traditional TSO session and the two from z/OSMF, that is three TSO sessions in total running on the LPAR.

Why do they ship Java products on z/OS with the handbrake on? And how to take the brake off.

I noticed that it takes seconds to start MQ on my little z/OS machine, but minutes (feels like days) to start anything with the Liberty web server. This includes MQWEB, z/OSMF, and z/OS Connect. I mentioned this to an IBM colleague, who asked if I was using Java shared classes. These get loaded into z/OS shared pages.

When I implemented it, my Liberty server came up in half the time!

I found this blog post which was very helpful, and showed me where to look for more information.  I subsequently found this document (from 2006!)

The kindergarten overview of how Java works.

  • You start with a program written in the Java language.
  • When you run this, Java converts it into byte codes
  • These byte codes get converted to native instructions – so a byte code “push onto the stack” may become eight 390 Assembler instructions.
  • This code can be optimised, for example code which is executed frequently can have the assembler instructions rewritten to go faster.  It might put code inline instead of out in a subroutine.
  • If you are using Java shared classes, this code can be written out and reused by other applications, or, if you restart the server, it can reuse what it created before. Reusing the shared classes means that programs benefit because the byte codes have already been converted into native code, and optimisations have been done on the hot code.

What happens on z/OS?

By default, z/OS writes the code to virtual memory and does not save anything to disk. If you restart your Java application within the same IPL, it can exploit the shared classes which have been converted to native code and optimised – great, good design. I found the second time I started the web server it took half the time. However, I IPL once a day, and start my web server once a day, so I do not benefit from having it start faster a second time – as I only start it once per session. By default, when you re-IPL, the shared classes code is discarded, and so next time you need the code it has to be converted to native instructions again, and it loses any optimisation which had been done.

What is the solution?

There are two easy steps:

  1. Tell Java to write the information from memory to disk – to take a snapshot.
  2. After IPL tell Java to load memory from the disk image – to restore a snapshot.

It is as simple as that.

Background.

It is all to do with the Java -Xshareclasses option.

With your application, you tell Java where to store information about the shared classes. It defaults to cacheDir=/tmp/ and name=javasharedresources.

In my jvm.options I overrode the defaults and specified

-Xshareclasses:nonFatal 
-Xshareclasses:groupAccess
-Xshareclasses:cacheDirPerm=0777
-Xshareclasses:cacheDir=/tmp,name=mqweb

If you give each application a name (such as mqweb)  you can isolate the cache to an application and not disrupt another JVM if you change the cache.  For example if you restore from a snapshot, only users of that “name” will be affected.

List what is in the cache

You can use the USS command,

java -Xshareclasses:cacheDir=/tmp/,listAllCaches

I used a batch job to do the same thing.

//IBMJAVA  JOB  1 
// SET V='listAllCaches' 
// SET C='/tmp/' 
//S1       EXEC PGM=BPXBATCH,REGION=0M, 
// PARM='SH java -Xshareclasses:cacheDir=&C,&V' 
//STDERR   DD   SYSOUT=* 
//STDOUT   DD   SYSOUT=*            

The output below shows the cache name is mqweb. Once you have created a snapshot, there is an entry for it too.

Listing all caches in cacheDir /tmp/                                                                          
                                                                                                              
Cache name       level         cache-type      feature         OS shmid       OS semid 
mqweb            Java8 64-bit  non-persistent  cr              8197           4101 

For MQWEB the default parameters are -Xshareclasses:cacheDir=/u/mqweb/servers/.classCache,name=liberty-%u, where /u/mqweb is the WLP user directory (where my parameters are defined), and %u is the userid the server is running under – so in my case the cache name is liberty-START1.

When I had /u/mqweb/servers/.classCache, the total command line was too long for BPXBATCH. (Putting it into STDPARM gave me IEC020I 001-4 on the instream STDPARM, because the resolved line was greater than 80 characters.) I resolved this by adding -Xshareclasses:cacheDir=/u/mqweb,name=cache to the jvm.options file.

To take a snapshot


//IBMJAVA  JOB  1 
// SET C='/tmp/' 
// SET N='mqweb' 
// SET V='restoreFromSnapshot' 
// SET V='listAllCaches' 
// SET V='snapshotCache' 
//S1       EXEC PGM=BPXBATCH,REGION=0M, 
// PARM='SH java -Xshareclasses:cacheDir=&C,name=&N,&V' 
//STDERR   DD   SYSOUT=* 
//STDOUT   DD   SYSOUT=* 
//

This job took a few seconds to run.

I believe you have to take the snapshot while your Java application is executing – but I do not know for certain.

Restore a snapshot

To restore a snapshot just use restoreFromSnapshot in the above JCL. This took a few seconds to run. 

How to use it.

If you put the restoreFromSnapshot JCL at the start of the web server job, it will preload the cache whenever you use your server.

If you take a snapshot every day before shutting down your server, you will get a copy with the latest optimisations.  If you do not take a new snapshot it continues to use the old one.

If you want to not use the shared cache you can get rid of it using the command destroySnapshot.

Is my cache big enough?

If you use the printStats request you get information like

Current statistics for cache "mqweb":                                                
...                                                                                     
cache size                           = 104857040                                     
softmx bytes                         = 104857040                                     
free bytes                           = 70294788 
...
Cache is 32% full                                     
                                                      
Cache is accessible to current user = true                                                 

The documentation says

When you specify -Xshareclasses without any parameters and without specifying either the -Xscmx or -XX:SharedCacheHardLimit options, a shared classes cache is created with a default size, as follows:

  • For 64-bit platforms, the default size is 300 MB, with a “soft” maximum limit for the initial size of the cache (-Xscmx) of 64MB, …

I had specified -Xscmx100m  which matches the value reported.

What is in the cache?

You can use the printAllStats command.  This displays information like

Classpath

1: 0x00000200259F279C CLASSPATH
/usr/lpp/java/J8.0_64/lib/s390x/compressedrefs/jclSC180/vm.jar
/usr/lpp/java/J8.0_64/lib/se-service.jar
/usr/lpp/java/J8.0_64/lib/math.jar

Methods for a class
  • 0x00000200259F24A4 ROMCLASS: java/util/HashMap at 0x000002001FF7AEB8.
  • ROMMETHOD: size Signature: ()I Address: 0x000002001FF7BA88
  • ROMMETHOD: put Signature: (Ljava/lang/Object;Ljava/lang/Object;)Ljava/lang/Object; Address: 0x000002001FF7BC50

This shows

  • there is a class HashMap. 
  • It has a method size() with no parameters, returning an int. It is at …. in memory
  • There is another method put(Object o1, Object o2)  returning an Object.  It is at … in memory
Other stuff

There are sections with JITHINTS and other performance related data.

 

Taking the brakes off ZFS on z/OS – move it to OMVS

From z/OS 2.2 there is a performance advantage in running the zFS file system as part of OMVS, rather than in its own address space. The IBM documentation says: “When running zFS in the OMVS address space, each file system vnode operation (such as creating a directory entry, removing a directory entry, or reading from a file) will have better overall performance. Each operation will take the same amount of time while inside zFS itself. The performance benefit occurs because z/OS UNIX can call zFS for each operation in a more efficient manner.” This is relevant when your application is doing a lot of file I/O – for example using a web server.

This move is not well documented – but it is really easy! It is mentioned here. Instructions are hidden in the installation instructions here.

Before I started

The IBM doc says You can determine if zFS is in its own address space by issuing D OMVS,PFS. If the output shows an ASNAME value, zFS is running as a colony address space.


OMVS     0010 ACTIVE             OMVS=(00,01,BP,IZ,RZ,BB)                
PFS CONFIGURATION INFORMATION                                            
 PFS TYPE   ENTRY      ASNAME    DESC      ST    START/EXIT TIME         
 ...   
  ZFS       IOEFSCM    ZFS       LOCAL     A     2021/02/17 17.35.06 

The steps I took…

  1. I added KERNELSTACKS(ABOVE) to USER.Z24A.PARMLIB(BPXPRM00).
  2. Being ultra-cautious, I re-IPLed.
  3. The documentation talks about putting an IOEZPRM DD in OMVS, then goes on to say “As the preferred alternative to the IOEZPRM DDNAME specification, delete the IOEZPRM DDNAME and use the IOEPRMxx parmlib member”. So I did not change the OMVS proc. When I re-IPLed it worked, and I got the message IOEZ00374I No IOEZPRM DD specified in OMVS proc. Parmlib search being used.
  4. I edited USER.Z24A.PARMLIB(BPXPRM00) and removed the ASNAME in  FILESYSTYPE TYPE(ZFS) ENTRYPOINT(IOEFSCM)ASNAME(ZFS) Well I actually made a copy of the original line and put it between /* and */, then deleted the text.
  5. I reipled.
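After the edit, the relevant lines in my BPXPRM00 member looked like this (a sketch of the change in step 4 – the commented-out line is the original, kept in case I needed to back it out):

/* FILESYSTYPE TYPE(ZFS) ENTRYPOINT(IOEFSCM) ASNAME(ZFS) */
FILESYSTYPE TYPE(ZFS) ENTRYPOINT(IOEFSCM)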

Afterwards

The D OMVS,PFS command now gives N/A instead of the address space name.

OMVS     0010 ACTIVE             OMVS=(00,01,BP,IZ,RZ,BB)                
PFS CONFIGURATION INFORMATION                                         
 PFS TYPE   ENTRY      ASNAME    DESC      ST    START/EXIT TIME         
...
  ZFS       IOEFSCM    N/A       LOCAL     A     2021/02/17 17.55.47  

Easy!

The hardest part was making sure I had an IPLable SARES1 in case I got it wrong!

Issuing commands…

I used to issue commands like f zfs,query,all. Now that the ZFS address space no longer exists, you need to use f omvs,pfs=zfs,query,all instead.

What should I monitor for MQ on z/OS – buffer pool

For monitoring MQ on z/OS, there are a couple of key buffer pool metrics you need to keep an eye on as you sit watching the monitoring screen.

A quick overview of MQ buffer pools

An inefficient queue manager could have all messages stored on disk, in page sets.  An efficient queue manager caches hot pages in memory so they can be accessed without doing any I/O.

The MQ Buffer Manager component does this caching, by using buffers  (so no surprise there).

A simplified view of the operation is as follows

  1. Messages are broken down into 4KB pages.
  2. When getting the contents of a page-set page, if the page is not already in a buffer, the buffer manager reads it from disk into a buffer (see the sketch after this list).
  3. If a page is requested but its contents are not required (for example, it is about to be overwritten as part of a put), the page does not need to be read from disk.
  4. If a page is updated, for example by a new message, or because a field in the message is updated during get processing, the page is not usually written to disk immediately.   The write to disk is deferred (using a process called the Deferred Write Process – another non-surprise in the naming convention).  This has the advantage that there can be many updates to a page before it is written to disk.
  5. Any buffer page which has not been changed, and is not currently being used, is a free page.
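As a conceptual sketch of items 2 and 3 – this is my own Java illustration of the idea, not MQ's actual implementation, and all the names are invented:

import java.util.HashMap;
import java.util.Map;

// Conceptual sketch only - not MQ's real code.
// It illustrates "get page with contents" versus "get new page".
public class BufferManagerSketch {
    private final Map<Long, byte[]> buffers = new HashMap<>();

    // The caller needs the current contents of the page:
    // if it is not in a buffer, it is read from the page set.
    byte[] getPageWithContents(long pageNumber) {
        return buffers.computeIfAbsent(pageNumber, this::readFromPageSet);
    }

    // The page is about to be overwritten, so the old contents are
    // not needed and no read I/O is done - just allocate a buffer.
    byte[] getNewPage(long pageNumber) {
        return buffers.computeIfAbsent(pageNumber, n -> new byte[4096]);
    }

    private byte[] readFromPageSet(long pageNumber) {
        return new byte[4096]; // in the real product this is a read I/O
    }

    public static void main(String[] args) {
        BufferManagerSketch bm = new BufferManagerSketch();
        bm.getNewPage(1);          // e.g. a put to an empty page: no read I/O
        bm.getPageWithContents(1); // now in the buffer pool: still no read I/O
    }
}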

If the system crashes, non persistent messages are thrown away, and persistent messages can be rebuilt from the log.

To reduce the restart time, pages containing persistent data are written out to the page set periodically.   This is driven by the log data set filling up, which causes a checkpoint.  Updates which have been around for more than two checkpoints are written to disk.  During restart, the page set indicates how far back in the log recovery needs to go.

In both cases – a checkpoint, or the buffer pool filling up (less than 15% free, that is, more than 85% in use) – once a page has been successfully written to the page set, the buffer is marked as free.

Pages for non-persistent messages can be written out to disk.

If the buffer pool is critically short of free buffers (less than 5% free), pages are written to the page set immediately rather than via the deferred write process.  This allows some application work to be done while the buffer manager is trying to make more free pages available.

What is typical behaviour?

The buffer pool is working well.

When a message is got, the buffer is already in the buffer pool so there is no I/O to read from the page set.

While the buffer pool is less than 85% full (has more than 15% free pages), there is only periodic I/O to the page set, because pages with persistent data are written to the page set at checkpoints.

The buffer pool fills up and has more than 85% in-use pages.

This can occur if the number and size of the messages being processed is bigger than the size of the buffer pool. This could be a lot of messages in a unit of work,  big messages using lots of pages, or lots of transactions putting and getting messages.  It can also occur when there are lots of puts, but no gets.

If the buffer pool has between 85% and 95% in-use pages (between 15% and 5% free pages), the buffer manager is working hard to keep free pages available.

There will be intermittent I/O at checkpoints, and steady I/O as the buffer manager writes pages to the page set.

If messages are being read from disk, there will be read activity from the pageset, but the buffer pool page can be reused as soon as the data has been copied from the buffer pool page.

The buffer pool has less than 5% free pages.

The buffer manager is in overdrive.  It is working very hard to keep free pages in the buffer pool.   There will be pages written out to the page set as it tries to increase the number of free pages.  Gets may require reads from the page set.  All of this I/O can cause contention: the page set I/Os slow down, and so MQ API requests using this buffer pool slow down.

What should be monitored

Most cars these days have a “low fuel light” and a “take me to a garage” light.  For monitoring we can provide similar.

Monitor the % buffer pool full.

  • If it is below 85%, things are OK.
  • If it is between 85% and 95%, it needs to be monitored; it may be "business as usual".
  • If it is >= 95%, this needs to be alerted.  It can be caused by applications or channels not processing messages.
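These thresholds map naturally onto a traffic-light check in a monitoring tool.  A minimal sketch – the method name and messages are my own illustration, not from any MQ API:

public class BufferPoolStatus {
    // Thresholds from the text: below 85% in-use is fine,
    // 85-95% needs watching, 95% and above needs an alert.
    static String status(double percentInUse) {
        if (percentInUse < 85.0) return "green - OK";
        if (percentInUse < 95.0) return "amber - monitor";
        return "red - alert";
    }

    public static void main(String[] args) {
        System.out.println(status(60.0)); // green - OK
        System.out.println(status(90.0)); // amber - monitor
        System.out.println(status(97.0)); // red - alert
    }
}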

Monitoring the number of pages written does not give much information.

It could be caused by a checkpoint activity, or because the buffer pool is filling up.

Monitoring the number of pages read from the page set can provide some insight.

If you compare today with this time last week you can check the profile is similar.

If the buffer pool is below 85% used,

  • Messages being got are typically in the buffer pool so there is no read I/O.
  • If there is read I/O this could be for messages which were not in the buffer pool – for example reading the queue after a queue manager restart.

If the buffer pool is more than 85% and less than 95% in-use, this can be caused by a large message workload coming in; MQ is being "elastic" and queuing the messages.  Even short-lived messages may be read from disk.  The number of read I/Os gives an indication of throughput.  Compare this with the previous week to see if the profile is similar.

If the buffer pool is more than 95% in-use, this will have an impact on performance: every page processed is likely to need I/O to the page set, and the I/O response times will be elongated due to the contention.

What to do

You may want “operators notes” either on paper or online which describe the expected status of the buffer pools on individual queue managers.

  1. PROD QMPR
    1. BP 1 green less than 85% busy
    2. BP 2 green less than 85% busy
    3. BP 3 green, except for Friday night when it goes amber, with a read I/O rate of 6000 I/Os per minute.
  2. TEST QMTE
    1. BP 1 green less than 85% busy
    2. BP 2 green less than 85% busy
    3. BP 3 usually amber – used for bulk data transfer

What do the buffer statistics mean?

There are statistics on buffer pool usage:

  1. Buffer pool number.
  2. Size of the buffer pool at the time the data was produced.
  3. Low – the lowest number of free pages in the interval.  100 * (Size – Low)/Size gives you the % full (see the sketch after this list).
  4. Now – the current number of free pages in the interval.
  5. Getp – the number of requests ‘get page with contents’.  If the page is not in the buffer pool then it is read from the page set.
  6. Getn.   A new page is needed.   The contents are not relevant as it is about to be overwritten.  If the page is not in the buffer pool, just allocate a buffer, and do not read from the page set.
  7. STW – set write intent.  This means the page was got for update; I have not found much use for this statistic.  A page is got for update when, for example:
    1. A put wants to put part of a message on the page
    2. A get is being done and it wants to set the “message has been got” flag.
    3. The message has been got, and so pointers to a page need to be updated.
  8. RIO – the number of read requests to the page set.  If this is greater than zero, then either:
    1. The request is for messages which have not been used since the queue manager started
    2. The buffer pool had reached 85%, pages had been moved out to the page set, and the buffer has been reused.
  9. WIO – the number of write I/Os that were done.  These writes could be due to a checkpoint, or because the buffer pool filled up.
  10. TPW – the total pages written, a measure of how busy the buffer pool was.   Again, the writes could be due to a checkpoint, or because the buffer pool filled up.
  11. IMW – immediate write.  I have not used this value; sometimes I observe it is high, but it has not proved useful.  It can be caused by
    1. the buffer pool being over 95% full, so all application write I/O is synchronous,
    2. or a page was being updated during the last checkpoint, and it needs to be written to the page set when the update has finished.  This should be a small number.  Frequent checkpoints (eg every minute) can increase this value.
  12. DWT – the number of times the Deferred Write processor was started.  This number has little value.
    1. The DWP could have started and been so busy that it never ended, so this counter is 1.
    2. The DWP could have started, written a few pages and stopped – and done this repeatedly, in this case the value could be high.
  13. DMC – the number of times the buffer pool crossed the 95% limit.  If this is > 0, it only tells you the limit was crossed:
    1. This could have crossed just once, and stayed over 95%
    2. This could have gone above 95%, then below 95% etc.
  14. SOS – the buffer pool was totally short on storage – there were no free pages.  This is a critical indicator.  You may need to do capacity planning, and make the buffer pool bigger, or see if there was a problem where messages were not being got.
  15. STL – the number of times a “free buffer” was reused.  A buffer was associated with a page of a page set.   The buffer has been reused and is now for a different page.  If STL is zero, it means all pages that were used were in the buffer pool
  16. STLA – A measure of contention when pages are being reused.  This is typically zero.
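As promised under item 3, here is a minimal sketch of turning the Size and Low statistics into a % full figure – the figures in main are invented for illustration:

public class PercentFull {
    // 100 * (Size - Low) / Size, from item 3 above
    static double percentFull(long sizePages, long lowestFreePages) {
        return 100.0 * (sizePages - lowestFreePages) / sizePages;
    }

    public static void main(String[] args) {
        // hypothetical interval: a 50,000 page pool whose free pages dropped to 4,000
        System.out.printf("%.1f%% full%n", percentFull(50_000, 4_000)); // 92.0% full
    }
}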

Summary

Now you know as much as I do about buffer pools, you'll see that the % full (or % free) is the key measure.  If the buffer pool is more than 85% in-use pages, then the I/O rate is a measure of how hard the buffer manager is working.

Do performance measurements need to be accurate and precise?

Like many performance questions the answer is “it depends”.

This question is one of those "simple" questions which most people get wrong. For example, you can have a very precise answer which is inaccurate.

Precision

People often give the wrong precision. For example

  • What is the difference between 1 and 1.0?
  • What is the difference between 999 and 1000, and 1000.0?

What is the difference between 1 and 1.0?

With decimal points you can tell the precision from the number of digits after the decimal point.

  • The value 1 is a number between 0.5 and 1.49, a range of almost 1.
  • The value 1.0 is a number between 0.95 and 1.049, a range of almost 0.1.

What is the difference between 999 and 1000, and 1000.0?

With numbers without a decimal point, the convention is that the precision is given by the number of significant (non-zero) digits.

  • 999 is in the range 998.5 to 999.49, a range of almost 1.
  • 1000 has one significant digit, so it can represent a number in the range 500 to 1490, a range of almost 1000!
  • 1000.0 represents a number in the range 999.5 to 1000.49, a range of almost 1.

Is precision important?

If you have a set of measurements with values 9.6, 9.8, 9.9, 10.0, 10.1, 10.2, and 10.4, the average is 10.0. Saying the average is 10.0 is misleading, as it implies the value is between 9.95 and 10.049. The values have a spread, and in this case saying the value is 10 is better than saying 10.0. You could say 10.0 ± 0.4 to give an even better answer.
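A quick sketch of that arithmetic:

public class Spread {
    public static void main(String[] args) {
        double[] values = {9.6, 9.8, 9.9, 10.0, 10.1, 10.2, 10.4};
        double sum = 0;
        for (double v : values) sum += v;
        double mean = sum / values.length; // 10.0
        double maxDeviation = 0;
        for (double v : values) {
            maxDeviation = Math.max(maxDeviation, Math.abs(v - mean));
        }
        // report the spread as well as the average
        System.out.printf("%.1f +/- %.1f%n", mean, maxDeviation); // 10.0 +/- 0.4
    }
}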

Is accuracy important?

Depending on what you want to deliver from your measurements, the CPU figures you collect may not be accurate enough – or they may be good enough not to worry about.

Why are the CPU used figures inaccurate?

There is a gap between what is reported and what is actually used. For a high proportion of measurements the reported CPU is accurate enough. For some measurements you have to factor in this missing usage.

Below is a simplistic view of work being dispatched on an operating system.

  • The dispatcher scans the list of work to find the next work to dispatch
  • Get “Before – clock time”
  • Get “Before – processor CPU time”
  • Invoke the work. It does calculations, starts an I/O request and waits for the response, by returning control to the dispatcher
  • Get “After – processor CPU time”
  • Get “After – clock time”
  • Calculate the difference in clock times, and accumulate this in the "work – active time"
  • Calculate the difference in the processor CPU times to determine how much CPU time was used for the work, and accumulate this in the "work – CPU time"
  • The dispatcher scans the list of work to find the next work to dispatch
  • etc
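As an illustration of this before/after accounting, here is a minimal sketch using Java's standard ThreadMXBean API – an application-level analogy, not how an operating system dispatcher is actually written:

import java.lang.management.ManagementFactory;
import java.lang.management.ThreadMXBean;

public class DispatchAccounting {
    public static void main(String[] args) {
        ThreadMXBean mx = ManagementFactory.getThreadMXBean();

        long beforeClock = System.nanoTime();          // "Before - clock time"
        long beforeCpu = mx.getCurrentThreadCpuTime(); // "Before - processor CPU time"

        doWork();                                      // invoke the work

        long afterCpu = mx.getCurrentThreadCpuTime();  // "After - processor CPU time"
        long afterClock = System.nanoTime();           // "After - clock time"

        long activeTime = afterClock - beforeClock;    // accumulate into "work - active time"
        long cpuTime = afterCpu - beforeCpu;           // accumulate into "work - CPU time"
        System.out.printf("active %d ns, cpu %d ns%n", activeTime, cpuTime);
    }

    static void doWork() {
        // stand-in for real work: calculations, I/O requests, ...
        double x = 0;
        for (int i = 1; i < 1_000_000; i++) x += Math.sqrt(i);
        if (x < 0) System.out.println(x); // keep the loop from being optimised away
    }
}

Neither measurement includes the cost of the timing calls themselves, which is exactly the gap described below.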

The elapsed time figures, and the CPU time used are both accurate, but they do not include the elapsed time or the CPU time the dispatcher took to initiate the work and update the statistics. We can give a better estimate of the true costs.

When running a measurement z/OS can report the CPU used by each engine across the duration of the measurement. We might have

  • Total CPU used across the duration 50 seconds
  • Total CPU used for job A 40 seconds
  • Total CPU used for job B 9.5 seconds
  • Difference in CPU 0.5 seconds

A common way of allocating this "missing CPU" is to apportion it to the jobs according to the amount of CPU each job used in the period. Job A would get 40 + (0.5 * 40/50) = 40.4 seconds, and job B would get 9.5 + (0.5 * 9.5/50) = 9.6 seconds, a total of 50.
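A sketch of that apportioning, using the figures above:

public class ApportionCpu {
    public static void main(String[] args) {
        double totalCpu = 50.0;               // CPU used across the duration
        double[] jobCpu = {40.0, 9.5};        // job A and job B
        double measured = jobCpu[0] + jobCpu[1];
        double missing = totalCpu - measured; // 0.5 seconds unaccounted for

        for (double cpu : jobCpu) {
            // each job gets a share of the missing CPU in proportion to its usage
            double adjusted = cpu + missing * cpu / totalCpu;
            System.out.printf("%.1f -> %.1f%n", cpu, adjusted); // 40.0 -> 40.4, 9.5 -> 9.6
        }
    }
}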

Is this completely accurate? No – but it is more accurate than before!

Is this important?

If you want to see the effect of defining an index on an SQL table, where the throughput improves by a factor of a hundred or a factor of two, it is good enough to say yes – it improved performance. This does not need to be precise or very accurate.

If you have enhanced your program to remove bottlenecks and can show the CPU per transaction has dropped from 1.2 to 1.1 units then this relative comparison is valid – even if the cost is not entirely accurate.

If you want to give numbers for other people to use, such as the cost of 1000 transactions a second is 1.2 seconds of CPU on this configuration, you need to ensure the figures are accurate.

Do performance measurements need to be accurate and precise?

For many measurements the numbers do not need to be very precise, and the accuracy may be good enough.