One Minute MVS performance – DASD

Question: In your car how do you tell if your car has a problem? Answer: You look at the dashboard and see if there is a red light showing. You may not know how to fix it – but you know that you need to get help to fix it.

The aim of this series of blog posts is to show you what to look for in z/OS performance and if you have a problem.

I will cover

For some of these you need data from z/OS. This post describes how to get the SMF data, and format it using RMF.

DASD has changed in 40 years

40 years ago “disk storage” was on huge rotating disks and you had to carefully manage where you put your datasets -both which disk, and whereabouts on the disk. For example people would put the hot dataset in the “centre” of the disk to minimise the time to move the heads.

For the last 20 years people use the term “storage” because most I/O activity goes to cache in the disk controller, and the disk controller writes the data out to PC sized disks – which in turn may be solid state, and have no moving parts.

A pictorial view of disks

You have the processor running z/OS
Plugged into the side of the processor is the I/O adapter
Plugged into this I/O adapter are a lot of channels (think optical fibre cables)
Theses cables can be plugged into a switch – think of a plug board or telephone exchange. This allows channels from 2 processors plugged into the switch, and have one cable down to the storage controller . You could avoid a switch and have cables directly from the processor to the storage controller. Each processor would need its own set of cables.
The storage controller manages all of the IO
- It has a lot of cache so most I/O may go to the cache. During a read, the storage controller will read from the disks if the data is not in the cache.
- It has many PC type of disks. These disks could be solid state, or have rotating disks
- If you have mirrored disks, the storage controller talks to a remote storage controller
Within each channel are many logical sub channels. Each disk has at least one sub-channel allocated to it. A disk can have multiple sub-channels allocated to it. There can be a pool of sub-channels which are used as needed to allowed parallel I/O to a disk.

The I/O journey

Your application wants to read the first record of a file.
Once the file has been opened, the application can issue the read.
z/OS knows where the data set is on disk (eg VOLID A4USR1, Cylinder 200, track 4)
z/OS builds up a set of commands (such as locate disk, locate cylinder 200, locate track 4, read data, read data, read data) to get the data and issues the Start Sub channel request, passing the list of I/O commands.
This is queued to the I/O adapter.
The original application is suspended (until the I/O is complete)
The I/O adapter looks for a free sub-channel for the disk, or gets one from the sub-channel pool.
The I/O adapter takes the list of commands, and executes them one at a time.
When the I/O adapter has finished the list of commands, it sends an interrupt to the mainframe saying “this subchannel has finished”.
z/OS wakes up, looks at the interrupt, and resumes the application.

Today you have to consider 3 areas where you can get delays, you need to be an expert if you want to look at more detail.

Waiting in the I/O adapter before being able to get a sub channel. This is known as IOSQ – IO subsystem Queueing.
Establishing the connection from processor to the storage controller
Transferring the data the connect time.

This is complicated by being able to use disks 50 km away, which adds to the delay time.

RMF Reports

In the RMF MFR000… report with section D I R E C T A C C E S S D E V I C E A C T I V I T Y. (I search for IOSQ).


                  DEVICE   AVG  AVG   AVG  AVG  AVG   AVG  AVG  AVG    %      %    
 VOLUME PAV  LCU  ACTIVITY RESP IOSQ  CMR  DB   INT   PEND DISC CONN   DEV    DEV  
 SERIAL   1       RATE     TIME TIME  DLY  DLY  DLY   TIME TIME TIME   CONN   UTIL 
 A4RES1   1       102.896  .044 .003  .001 .000       .004 .000 .036   0.38   0.38 
 A4RES2   1        27.293  .036 .000  .001 .000       .003 .000 .032   0.09   0.09 
 USER00   1        25.331  .031 .003  .001 .000       .004 .000 .024   0.06   0.06 
 A4SYS1   1       365.102  .026 .005  .001 .000       .004 .000 .017   0.62  24.52

Key fields

Volume Serial such as A4RES1 is the volid of the disk
PAV – I’ll mention this below.
Device Activity Rate – how many requests (start sub channel) from z/OS, per second
Average response time in milliseconds
Average IOSQ – how long did it have to wait in z/OS and the I/O adapter before the request was sent down to the storage controller

The times are in milliseconds.

There are often thousands of volumes in a z/OS environment some are heavily used, some are not used. See below on how to find the hot volumes.

I typically look at the volumes with the highest I/O. If the hot volumes have good response time, the not so hot should be OK.

If you think of the sub-channel connection between the mainframe and the volid in the storage controller, there can only be one I/O requests at a time per sub-subchannel. You can have multiple connections down to a volume. These are known as PAV, or Parallel Access Volumes. The PAV is the average number of sub-channels in use.

The first field you look at is the IOSQ. This is the time between z/OS starting the request, and before the I/O could be started to the storage controller. This should be small 10s of microseconds ( 0.0xx in the report above). If this value is larger than this, you need to speak to your Storage Manager or z/OS Systems Programmer.

The second field you look at is the % DEV UTIL. How busy was the connection to the storage controller. A value of 100% means that it was running flat out. If the utilisation is around 70-80% it may be a OK – just something to note. More PAVs can increase throughput for a busy disk.

The next figure you look at is the RESP TIME. This is the response time the application sees. For local disk, response times of under 1 millisecond are OK. If you have remote disks, and synchronous I/O then the response time will be longer.

Finding the hot volumes

I take the RMF report and extract the DASD records.

For SDSF where the output is in the spool
- I use Status to list all of the jobs, (Output or Hold work just as well)
- Put ? in front of the job to show all of the spool data sets
- use the SE command to Spool Edit the report
For a dataset I use the View prefix command in ISPF 3.4

Put DD in line prefix area on line 1
Find ‘D I R E C T’
Put DD in line prefix area, press enter, to delete the lines above it
Find ‘D I R E C T’ last
put d9999 in the line prefix area following the data (My report has ‘P A G I N G’), and press enter.
You should now have only DASD records
Put ‘cols’ in the line command area, note the columns of the DAR (50 to 58)
In the command line type SORT 50 58 D on Device Activity Rate.
This shows you the top usage volumes. Check the response times. Under 1 millisecond is good for locally attached disks. It can be down to 0.1 ms
If the response time is 1 ms or larger…
- Check columns 60-65 (AVG IOSQ TIME) this should be 0. If this is non zero it means there was queueing in z/OS before it got to the disks. If there was only one I/O request to the volume, then there would IOSQ would be zero. If there are multiple I/O requests then you can get IOSQ queuing time.
- Any IOSQ could be reduced by moving data sets to other volumes, or adding more paths(sub-channels) between the mainframe and the disks. Each disk requires at least one subchannel. You can allocate more in a pool – which are used when needed, but this is a z/OS system programmer/Storage manager job.
- As a performance person you can control which disks you use, and can spread the load.
- Avg CMR (ComMand Response) is the time to get from the processor down to the Storage Controller, and the controller to respond with “I’ve got the request” This should be small. This value allows you to see if delays are due to getting to the Storage controller, or within the controller.

If you do this for all disks you get an overall view of the data. Now you can select the DASD volumes you are using and check those.

If you find you have a long response time, then it is hard to find out the root cause. There are many links in the end to end chain. See here for more information.

One Minute MVS performance – CPU at the LPAR level

The aim of this series of blog posts is to show you what to look for in z/OS performance and if you have a problem.

I will cover

For some of these you need data from z/OS. This post describes how to get the SMF data, and format it using RMF.

CPU

There are two basic things you need to check

Has my LPAR got all the CPU it wanted – has the hyper-visor restricted the CPU?
How busy are my CPUs?

Has my LPAR got all the CPU it wanted

An LPAR can be configured to have dedicated engines, or share a pool of engines. Dedicated engines means that the engine is always there when it is needed. If the LPAR is using a shared engine, it may not always be available when needed.

An example to explain the concept

You have a class from 10am to 11 am. You go in, and sit down. The teacher starts the class. the teacher’s phone rings and goes out of the classroom. You play with your phone until the teacher comes back after 40 minutes. (The teacher went to teach in a different class room.)
How long were you in class for and how much work did you do?

You were in class for 1 hour.
You did 20 minutes work.

This concept is the same as any LPAR with shared engines.

The 1 hour class is a time slice as seen by z/OS.
The “processor” (teacher) was used in the time slice for only 20 minutes
For 40 minutes the “processor” was doing work elsewhere.

How do you get the report to show these figures.?

You need the RMF CPU report. It has “C P U A C T I V I T Y “ at the top of the page.

Look at the section

---CPU---    ---------------- TIME % ----------------   
NUM  TYPE    ONLINE    LPAR BUSY    MVS BUSY   PARKED   
 0    CP     100.00    46.68        46.32        0.00   
 1    CP     100.00    38.98        38.78        0.00   
 2    CP     100.00    34.91        34.62        0.00   
TOTAL/AVERAGE          40.19       39.90               
 3    IIP    100.00    94.43        94.70        0.00   
 4    IIP    100.00    93.50        93.74        0.00   
TOTAL/AVERAGE          93.96       94.22

LPAR BUSY is how much teacher time you got

MVS Busy is how much time you were in the classroom for.

If MVS BUSY TIME = LPAR BUSY TIME, perfect, what you needed you got.
If MVS BUSY TIME > LPAR BUSY TIME, MVS had to wait for an engines, the system may need more CPU, a small difference(5%) is OK.
If MVS BUSY TIME >> LPAR BUSY TIME, For much of the time, there was no engine when MVS needed This will have a major impact on your work. If your end user work is not meeting targets, you need more CPUs, or give your LPAR a higher dispatching priority.

These values should be similar: MVS BUSY TIME 39.60 is close to LPAR BUSY 40.19, and for the ZIIP, 93.96 is close to 94.22.

When these figures are significantly different, stop, and fix the problem. This can make all other performance data look bad. For example, disk response time, and timing in application trace entries.

How busy are my CPUs?

The TOTAL/Average will be close to 100 % on a busy system. 95% busy is OK, Make a note that the system may be short of CPU.

These are average values. The individual values could be spiky. For example at 100% busy for 4 minutes, 80% busy for 1 minute, or an average of 96% busy over 5 minutes. Consider using an online monitoring to see if you have big peaks and trough.

One minute MVS performance – getting batch RMF reports

There is an introduction to getting RMF reports docucmented here.

You can display information about your SMF environment, using

D SMF,O

This tells you if you are using SMF datasets, or log streams ( in the coupling facility) for the RMF data.

Copy the data from SMF dataset

// SET SMFPDS=SYS1.S0W1.MAN3 
// SET SMFSDS=SYS1.S0W1.MAN4 
//* 
//SMFDUMP  EXEC PGM=IFASMFDP 
//DUMPINA  DD   DSN=&SMFPDS,DISP=SHR,AMP=('BUFSP=65536') 
//DUMPINB  DD   DSN=&SMFSDS,DISP=SHR,AMP=('BUFSP=65536') 
//* MPOUT  DD   DISP=(NEW,PASS),DSN=&RMF,SPACE=(CYL,(1,1)) 
//DUMPOUT  DD   DISP=SHR,DSN=IBMUSER.RMF SPACE=(CYL,(1,1)) 
//SYSPRINT DD   SYSOUT=* 
//SYSIN  DD * 
  INDD(DUMPINA,OPTIONS(DUMP)) 
  INDD(DUMPINB,OPTIONS(DUMP)) 
  OUTDD(DUMPOUT,TYPE(70:79)) 
  DATE(2020316,2022284) 
  START(0930) 
  END(1700) 
/*

This job copies the records from the “MAN” data sets, and writes them to the DUMPOUT.

The RMF records with types 70 to 79 are copied, within the specified dates and start and end times.

Copy the data from a log stream.

SMF can write data to log streams, for example MQ records go to the MQ stream, and the RMF records go to the RMF Stream.

//SMFDUMP EXEC PGM=IFASMFDL
//DUMPOUT DD DSN=&TEMP,SPACE=(CYL,(10,10),RLSE),DISP=(NEW,PASS)
//SYSPRINT DD SYSOUT=*
//SYSIN DD *
DATE(2018004,2018012)
START(0900)
END(1900)
LSNAME(IFASMF.RMF,OPTIONS(ALL))
OUTDD(DUMPOUT,TYPE(70:79,116))
/*

This step writes the data to a temporary data set.

Sort the data

If you are processing the data from more than one LPAR you will need to sort the data. See here.

Format the RMF data

The RMF control statements are described here

//POST EXEC PGM=ERBRMFPP 
//* use the following if using a temporary data set in same job.
//* MPFINPUT DD DISP=SHR,DSN=*.SMFDUMP.DUMPOUT 
//MFPINPUT DD DISP=SHR,DSN=IBMUSER.RMF 
//PPXSRPTS DD   SYSOUT=*,DCB=(LRECL=200,BLKSIZE=4000) 
//SYSPRINT DD   SYSOUT=* 
//SYSOUT   DD   SYSOUT=* 
//SYSIN DD * 
  SUMMARY(INT,TOT) 
  NODELTA 
  SYSOUT(H) 
  REPORTS(ALL,NOENQ,DEVICE(NOUNITR,NOCOMM)) 
SYSRPTS(WLMGL(RCPER)) 
SYSRPTS(WLMGL(SCPER)) 
/* 
/* SYSRPTS(WLMGL(RCPER)) 
/* SYSRPTS(WLMGL(SCPER,RCLASS,RCPER,SCLASS))

This takes the records from MFPINPUT which could be a permanent data set, or a temporary data set passed from a previous job step.

You can have the output go to the spool (by default) or to preallocated data sets. See here

One Minute MVS performance reports of interest are

REPORTS(CPU,DEV(DASD))

For WLM the reports

SYSRPTS(WLMGL(RCPER,SCPER))

Straight lines good – curved lines bad.

It was a great day, it was 0830, the sun was shining, the view was amazing, as I was walking back the long way from the shops. There was a good hill en route, it starts nice and steep, then gradually leveled off to flat. I sat on the convenient seat at the top of the hill to admire the view and got my breath back. The shape of the hill reminded me of a graph in a performance report I had seen a few days before; It started steeply at the bottom left of the graph, and levelled off at the top right of the graph and the description was “so we can see it scales”. As I sat there enjoying the sun, I realised that it may scale – but it does not scale very well.

What does it mean to scale?

I remember 40+ years ago sitting in a meeting where the performance team were presenting status. They showed many graphs, one of which showed the transaction rate as they increased the work load. The graph started going up at an angle, nice and straight, and then levelled out until they achieved 100 transactions a second, and they could do no more than this. They explained that this was because of a lock being held for 1/10 of a second while some work was done.

They worked with the developer who realised that they could use a counter instead of a lock, and use the hardware instruction to atomically increase and decrease the counter. As a result the graph was now a straight line. Round of applause etc, well done etc.

The performance guy then went on to say, so in order to find the next bottlenecks we will need a box that is even bigger and even faster!

I also remember this performance guy said his job was to make curved lines straight!

You can hide the truth with charts

People often use graphs to show information, but often tell the wrong story. For example if you plot the total cost vs transaction rate you get a straight line

Great – this looks linear!

However If you plot the cost per transaction vs rate you get

In actual fact there is an increase in the cost per transaction as the rate increases. So good scaling – but not perfect. It is linear – when it should have been flat!

If you want to scare people – change the axis

It is the same data as the previous chart – but now the axis is 20 to 22.5 instead of 0 to 25, so any difference appears much larger.

So depending on what story you want to tell, and if you want to hide the truth, think about what data you present, and the axis you use.

Every one wants a bigger box or two

On the subject of wanting a bigger box…

One of the senior tester’s presented to management about testing strategy. He said you can do testing in one of two ways:

If we were careful, and picked the right set of results we could show all test working. We could quietly ignore the tests that failed.
No matter what we did, we could not seriously break it. Even when we deliberately misconfigured it during testing it carried on running – slowly – but it carried on running.

The manager said “hmmm I’m always being asked for graphs to present status so the first would be better, but from a customer perspective the second would be better”.

The tester said “That’s the right answer, we now need two bigger boxes. One to run the workload, the other to drive it. We need to achieve higher throughput than our customers”.

One customer told me that their test box was larger than production, and they needed to prove that they could run at expected peak production throughput + 50% with no major degradations – they needed many very big boxes.

How to become a performance expert in 3 easy lessons

and many hard lessons.

I had emails from two people, with different experiences of doing performance on z/OS. One person has recently started, and is not sure what is involved. The other person has been doing lots of work with customers explaining that his product is not the cause of the performance problems.

I thought it might be interesting for people who might be tempted to work in performance, to see the route to becoming an expert.

What does “performance” mean?

Performance work covers many different areas, and once you are competent in one product area it is not too difficult to cover additional areas.

“Performance” covers

Making sure it scales close to linearly

If you double the throughput, the costs per transaction should be similar. As the throughput increases, the response time does not increase significantly. You can have many threads running concurrently.

If the workload has disk I/O then you need to have multiple threads, so while one task is waiting for I/O another task can be using the CPU.

You need a box with multiple CPUs to detect contention. If you have only one or two engines you may not detect concurrency issues.

Work to remove contention until you can drive the CPUs at 100% busy (and then you ask for a bigger box). If you cannot drive the box at 100% find out why, resolve it and repeat.

Reduce CPU

Once you have eliminated as much contention as possible, you need to investigate where the CPU is being used, and try to eliminate any hot spots. This might be

Change algorithms – use a hash table instead of a linked list.
Avoid unnecessary work. Do you really need to store intermediate values in a database?
Can you tune the services being used. For example tune the database, add an index to a table.
Rearrange the code, for example have the “hot code” located in the same few pages. Avoid lots of error handling code in the mainline code – branch out of the mainline to handle it.
Remove debug code, or put debug code within if (debug enabled) then { debug code}.

Work with customers problems

Understand what areas the users have problems with, identify “problem areas” which take time to identify the problem.

Enhance the design

From your testing, and the experience with customer problem propose improvements to help diagnose problems for example

Capture the number, the average time, and the maximum time of database requests. Report this as a statistics or in response to a display command.
Record the number of times a resource, such as a lock, was not available, record total count of requests, number of blocked requests, time spent waiting. This code may never be executed, but if it is, you get useful information about the size of the problem.

Provide useful information to the end user

These are often known as “performance reports”. It is easy to produce reports that people cannot use – I have done it many times. Producing reports with nice graphs are often not easy to use, as they do not match your scenario.

You need to consider the questions the end users will have.

I want to run an ill defined workload (I do not know all the details), how big a box do I need (how many CPUs), to support 1000 requests a second.
What should I look at to tell me if things are running well or not.
What are common symptoms, and what actions can I take to solve performance problems.
What things do we need to consider to make it run well? For example table layout, how many requests per commit, how often you need to sign on.

Performance roles

The roles below are typical of the sort of activities a performance person will do

Run tests

The first tasks a person usually does when becoming a performance person is to run tests, and collect the data. This may involve writing scripts and tools so it can all be automated. For example on z/OS you might use Netview to run scripts, capture responses, and take actions when there are problems. This could all be done using Rexx scripts in TSO, and possibly using a REST interface.

Good automation will collect all of the key metrics into one place, for example a spread sheet, so the analyst can simply press a button or two to be able to display the data.

There may be a management report produced daily or weekly to show that performance overall has improved – or has not got worse.

Look at a component

You need to look at components within the whole environment, for example this week, look at the z/OSMF SDSF interface, next week the logon process.

You need to drive a high volume workload using this component. You need to focus on the component, for example with a REST requests 90% of the cost may be in the logon and establish a session. This makes it hard to focus on the other 10%. Sign on once, and have an application that just issues requests to the component.

When I was testing MQ under CICS, the duration of an MQPUT took 50 microseconds, and the cost of starting the CICS transactions was 1000 microseconds. I changed the transaction to process 1000 messages, so the transaction now took about 50 milliseconds, and most of the work was in the MQPUT area, and not in the CICS transaction overhead.

Capture the response time of the transactions and plot it over time. You should get a flat line. If the response increases over time, you might have a storage leak, and so it takes longer to get storage.

You may find it does not scale. Turning trace on can give an indication where the problem is. You often get function entry and exit trace, with time stamps, so you can post process the output to calculate the duration within the function. Trace often does not scale, so you cannot always believe the output.

You may want to instrument a private copy of the code. Obtain the time on entry and exit to the function, and across major calls to external requests. Calculate the duration of the calls, add logic to say “If duration > 10 millisecond then throw exception”, or accumulate the data in a global control block. When I did this, I found the trace code was adding significant delays, and the root cause of the problem was an insignificant line of code, which got an exclusive latch for an update!

I added code to measure the average duration of file I/O, and output this in the statistics. This made solving some problems very easy – you have an I/O problem. See here, it is taking 10 ms to write a page of data!

Unless you are testing the startup times, you should allow the system under test to “warm up”, so the hardware cache is in a steady state, database tables are in memory etc.

I found it useful to warm it up, then take 5 sets of measurements each of 1-5 minutes. When displaying the data, the results should all be similar. If not, you need to find out why. You should also run these tests once a week, and whenever you change a component, such as putting fixes onto your system, or change the hardware. Some example of things that can change your results

Overnight the Operations Team run backups and cause a different disk response time
The order the LPARs were ipled has changed. Last week your system had 6 CPUs in one book (so all very close to each other) this week your system has 3 CPUs in one book – and 3 CPUs in a different book – 1 metre away.
The network between your driving system and the test system has changed, or has a different load.

Usually the performance machines have their own dedicated hardware, processors, disks, connections to the disks, network.

Develop skills in other products

My background is MQ performance on z/OS. I had to learn about the performance characteristics of z/OS, DB2, TCP/IP, IMS, and understand the tools these products provide. Once you understand one trace, other traces are basically similar. The hard part is capturing the trace.

MQ passes messages from system to system. There were several problems where the “network was slow”. This meant we had to understand what was happening under the covers. Some good problems with easy fixes included

There was a TCP performance “improvement” where one end would delay sending a packet for a few milliseconds, as it is more efficient to send one big packet rather than several smaller packets. This meant that every MQ message sent over the network had a couple of millisecond delay. This fix was easy – disable this feature.
TCP/IP by default uses small buffers (256 bytes). You can configure a session to have very large buffers and tell it to automatically tune the best buffer size ( up to MB sized buffers).

Work with customers on their performance problems

The work involves working on performance problems where you do not have any of your specially written code included in it. You may need to turn on the product trace for a few seconds, then turn it off, and then process the output. Many customers do not run with trace on because of the overhead and major impact on throughput.

You can acquire the skills to talk to customers on the phone about their problems. It is very good to feedback what you heard. “Let me check what you just said … when you do … you get … “

Over time you will build up a list of questions to ask.

Once the problem has been resolved, consider what would have made it easier to find the root cause. Can you get development to put in some statistics, so next time this happens, you can tell the customer to check a value.

In the early days on MQ, we used to get many problems, because the in-memory buffer was too small. Development put out a fix, so that every 10 minutes or so it would report if it had detected a buffer full problem since the last message. After this fix was rolled out, we had no more of these problems.

There is no limit as to how far you can go

Once you have skills in one component you can apply these skills to other products or components. For example I spent some time looking at MQ on Linux so I could understand (and blog) on the performance data produced. (The performance data was “here are some numbers, we are not going to tell you what they mean”).

I’ve also been looking at Java performance, which lead me to look at the zFS file system, and the statistics it provides (it provides some – but they are not very useful).

You can also go deep. I knew about z architecture instructions and how some are fast and some are slow. I attended a taskforce with lots of hardware people. I met the team leader for the “load instructions”, and found that the “load instruction” was not an instruction – it is more like a subroutine with logic, for example

Find which CPU which currently “owns” this data in the CPU cache, and go and get it
Lock the page
Go and get this value from another page
Add the two values
Unlock both the pages

The subroutine had to communicate with other CPUs in the LPAR, worry about its own CPU cache etc. Deep Stuff!

Once you know this sort of stuff, it helps you program, for example it is better not to share a field if you do not have to. When a multi threading program uses a buffer to trace into, do not have one buffer which they all share, but give each thread its own buffer. This way the hardware will not be fighting over the buffer, and the data for each application can be kept on the same CPU as the program. This is obvious once you know!

Collect statistics at the thread level, and not at the global level. Merge them at display time. You know the reason why.

The hardware can start to execute instructions out of order – as long as they “commit” in the right order.

The z hardware has instrumentation which samples the executing system, and can tell you why instructions were delayed. For example

Data had to be obtained from the L2 cache on the chip
The instruction needed to be interpreted and added to the Translation Lookaside Buffer

This is a bit deep for many people, especially if they are at the level of using “printf” in their programs to display debug information.

“Me, with the brain the size of a planet ….”

This is a quote from Marvin the paranoid Android in the Hitchhiker’s guide to the galaxy. With performance work you can go deep, or you can go wide, but you would need a bigger brain than I had to go deep and wide – but it is a fascinating area.

Some of the mysteries of Java shared classes

When Java executes a program it read in the jar file, breaks it into the individual classes, converts the byte codes into instructions, and when executing it may replace instructions with more efficient instructions (Jitting). It can also convert the byte codes into instructions ahead of time, so called Ahead Of Time (AOT) compilation.

With shared classes, the converted byte codes, any Jitted code, and any AOT code can be saved in a data space.

When the java program runs a second time, it can reuse the data in the dataspace, avoid the overhead of the reading the jar file from the file system, and coverting the byte codes into instructions.
The data space can be hardened to a file, and restored to a data space, so can be used across system IPLs.

Using this, it reduced the start-up time of my program by over 20 seconds on my slow zPDT system. The default size of the cache is 16MB – one of my applications needed 100 MB, so most of the benefits of he shared classes could not be exploited if the defaults were used.

This blog post describes more information about this, and what tuning you can do.

You have to use the same level of Java to display information, as used by the program. For example

export PATH="/usr/lpp/java/new/J8.0_64/bin:$PATH"
/usr/lpp/java/new/J8.0_64/bin/java -Xshareclasses:cacheDir=/tmp,printStats,cache=rseapi

If the levels are different it may just report JVMSHRC005I No shared class caches available.

Issuing commands to manage the shared classes cache

What commands are there ?

The command

/usr/lpp/java/J17.0_64/bin/java -Xshareclasses:help

gives 100 lines of help including the syntax.

This help text includes

Related command line options: 
                                                                                                                         
  -Xscmx<x>       set size (or soft max size if option -XX:SharedCacheHardLimit= is present) 
                  of new shared class cache to <x> 
  -Xscdmx<x>      set size of shared class cache debug attribute area to <x> 
  -Xscminaot<x>   set minimum shared classes cache space reserved for AOT data to <x> 
  -Xscmaxaot<x>   set maximum shared classes cache space allowed for AOT data to <x> 
  -Xscminjitdata<x>   set minimum shared classes cache space reserved for JIT data to <x> 
  -Xscmaxjitdata<x>   set maximum shared classes cache space allowed for JIT data to <x> 
  -Xzero:nosharebootzip   do not share bootstrap jar entry caches in the shared class cache 
  -XX:SharedCacheHardLimit=<x>   set size of new shared class cache to <x> 
  -XX:ShareClassesEnableBCI   create shared class cache with support for byte-code instrumentation 
  -XX:ShareClassesDisableBCI   create shared class cache without support for byte-code instrumentation 
  -XX:+ShareAnonymousClasses   enable storing and finding anonymous classes in the shared class cache 
  -XX:-ShareAnonymousClasses   disable storing and finding anonymous classes in the shared class cache 
  -XX:+ShareUnsafeClasses   enable storing and finding non-anonymous unsafe classes in the shared class cache 
  -XX:-ShareUnsafeClasses   disable storing and finding non-anonymous unsafe classes in the shared class cache 
  -XX:+UseGCStartupHints   enable storing and finding GC startup hints in the shared class cache 
  -XX:-UseGCStartupHints   disable storing and finding GC startup hints in the shared class cache

Commands to manage the shared classes cache are issued like

java -Xshareclasses:cacheDir=/tmp,name=client6,printStats

which can be done using JCL

// SET V=’listAllCaches’
// SET V=’printStats’
// SET C=’/tmp’
// SET N=’client6′
//S1 EXEC PGM=BPXBATCH,REGION=0M,
// PARM=’SH java -Xshareclasses:cacheDir=&C,name=&N,verbose,&V’
//STDERR DD SYSOUT=*
//STDOUT DD SYSOUT=*

Enabling share classes

You specify -Xsharedclasses information as a parameter to the program, for example in the command line or in a jvm properties file.

To use the shared classes capability you have to specify all of the parameters on one line, like

-Xshareclasses:verbose,name=client6,cacheDirPerm=0777,cacheDir=/tmp

Having it like

-Xshareclasses:name=client6,,cacheDirPerm=0777,cacheDir=/tmp
-Xshareclass:verbose

means the name, etc all take their defaults. Only shareclass:verbose would be used.

Changing share classes parameters

You can have more than one cache; you specify a name. You specify a directory were an image is stored when the cache is hardened to disk.

Some of the options like name= and cacheDir= are picked up when the JVM starts, Other parameters like cacheDirPerm are only used when the cache is (re-)created.

You can delete the cache in two ways.

Delete the cache from your your Java program

When you are playing around, you can add reset to the end of the -Xshareclasses string to caused the cache to be deleted and recreated.This gives output like

JVMSHRC010I Shared cache “client6” is destroyed
JVMSHRC158I Created shared class cache “client6”
JVMSHRC166I Attached to cache “client6”, size=20971328 bytes

This was especially useful when tuning the storage allocations.

Delete the cache independently

java -Xshareclasses:cacheDir=/tmp,name=client6,destroy

How to allocate the size of the cache

You specify the storage allocations using -Xsc.. (where sc stands for shareclasses)

-Xscmx100m the size of the shared cache
-Xscmaxaot20m the maximum number of bytes in the class cache to reserve for Ahead of Time (AOT) data.
–Xscmaxjitdata1m the maximum number of bytes in the class cache that can be used for JIT data.

If you have -Xsharedcache:verbose… specified then when the JVM shuts down you get

JVMSHRC168I Total shared class bytes read=11660. Total bytes stored=5815522
JVMSHRC818I Total unstored bytes due to the setting of shared cache soft max is 0.
Unstored AOT bytes due to the setting of -Xscmaxaot is 1139078.
Unstored JIT bytes due to the setting of -Xscmaxjitdata is 131832.

This shows the values of maxaot and maxjitdata are too small they were

-Xscmx20m
-Xscmaxaot2k
-Xscmaxjitdata2k

When the values were big enough I got

JVMSHRC168I Total shared class bytes read=12960204. Total bytes stored=8885038
JVMSHRC818I Total unstored bytes due to the setting of shared cache soft max is 0.
Unstored AOT bytes due to the setting of -Xscmaxaot is 0.
Unstored JIT bytes due to the setting of -Xscmaxjitdata is 0.

How big a cache do I need?

If you use -Xshareclasses:verbose… it will display messages

for example

JVMSHRC166I Attached to cache “client6”, size=2096960 bytes
JVMSHRC269I The system does not support memory page protection
…
JVMSHRC096I Shared cache “client6” is full. Use -Xscmx to set cache size.
JVMSHRC168I Total shared class bytes read=77208. Total bytes stored=2038042

Message JVMSHRC096I Shared cache “client6” is full. Use -Xscmx to set cache size, tells you the cache is full – but no information about how big it needs to be.

You can use

java -Xshareclasses:cacheDir=/tmp,name=client6,printStats

to display statistics like

-Xshareclasses persistent cache disabled]                                         
[-Xshareclasses verbose output enabled]                                            
JVMSHRC159I Opened shared class cache "client6"                                    
JVMSHRC166I Attached to cache "client6", size=2096960 bytes                        
JVMSHRC269I The system does not support memory page protection                     
JVMSHRC096I Shared cache "client6" is full. Use -Xscmx to set cache size.          
                                                                                   
Current statistics for cache "client6": 
cache size                           = 2096592                       
softmx bytes                         = 2096592                       
free bytes                           = 0                             
ROMClass bytes                       = 766804                        
AOT bytes                            = 6992                          
Reserved space for AOT bytes         = -1                            
Maximum space for AOT bytes          = 1048576                       
JIT data bytes                       = 212                           
Reserved space for JIT data bytes    = -1                            
Maximum space for JIT data bytes     = 1048576                       
Zip cache bytes                      = 1131864                       
Startup hint bytes                   = 0                             
Data bytes                           = 13904                         
Metadata bytes                       = 12976                         
Metadata % used                      = 0%                            
Class debug area size                = 163840                        
Class debug area used bytes          = 119194                        
Class debug area % used              = 72%

Cache is 100% full

This show the cache is 100% full, and how much space is used for AOT and JIT. The default value of -Xscmx I had was almost 16MB. I made it 200MB and this was large enough.

I could not find a way of getting my program to issue printStats.

How do I harden the cache?

You can use use the

java -Xshareclasses:cacheDir=/tmp,name=zosmf,verbose,snapshotCache

command to create the cache on disk. Afterwards the listAllCaches command gave

Cache name level        cache-type     feature 
client6    Java8 64-bit non-persistent cr
client6    Java8 64-bit snapshot       cr

Showing the non persistent data space, and the snapshot file.

You can use the restoreFromSnapshot to restore from the file to the data cache; before you start your Java program. You would typically do this after an IPL.

How can I tell what is going on and if shared classes is being used?

The java options “-verbose:dynload,class

reports on the

dynamic loading of the files, and processing them,
what classes are being processed.

For example

<Loaded java/lang/reflect/AnnotatedElement from /Z24A/usr/lpp/java/J8.0_64/lib/rt.jar>
< Class size 3416; ROM size 2672; debug size 0>
< Read time 1196 usec; Load time 330 usec; Translate time 1541 usec>
class load: java/lang/reflect/AnnotatedElement from: /Z24A/usr/lpp/java/J8.0_64/lib/rt.jar
class load: java/lang/reflect/GenericDeclaration from: /Z24A/usr/lpp/java/J8.0_64/lib/rt.jar

dynload gave

this tells you a jar file was read from the file system, and how long it took to process it.

class gave

class load: java/lang/reflect/AnnotatedElement from: /Z24A/usr/lpp/java/J8.0_64/lib/rt.jar
class load: java/lang/reflect/GenericDeclaration from: /Z24A/usr/lpp/java/J8.0_64/lib/rt.jar

This shows two classe were extracted from the jar file.

In a perfect system you will get the class load entries, but not <Loaded….

Even when I had a very large cache size, I still got dynload entries. These tended to be loading class files rather than jar files.

For example there was a dynload entry for com/ibm/tcp/ipsec/CaApplicationProperties. This was file /usr/lpp/zosmf./installableApps/izuCA.ear/izuCA.war/WEB-INF/classes/com/ibm/tcp/ipsec/CaApplicationProperties.class

If you can make these into a .jar file you may get better performance. (But you may not get better performance, as it may take more time to load a large jar file).

I also noticed that there was dynload for com/ibm/xml/crypto/IBMXMLCryptoProvider which is in /Z24A/usr/lpp/java/J8.0_64/lib/ext/ibmxmlcrypto.jar, so shared classes has some deeper mysteries!

What happens if the .jar file changes?

As part of the class load, it checks the signature of the file on disk, matches the signature on the data space. If they are different the data space will be updated.

Looking at the performance of Liberty products and what you can do with the information

The Liberty Web Server on z/OS can produce information on the performance of “transactions”. It may not be as you expect, and it may not be worth collecting it. I started looking into this to see why certain transactions were slow on z/OSMF.

This blog post covers

How does the web server work?
How can I see what is going on?
Access logs
SMF 120
WLM
Do I want to use this in production?

How does it work?

A typical Web Server transaction might use a database to debit your account, and to credit my account. I think of this as a fat transaction because the application does a lot of work,

z/OSMF runs on top of Liberty, and allows you to run SDSF and ISPF within a web brower, and has tools to help automate work. You can also use REST services, so you can send a self contained HTTP request to z/OSMF, for example request information about data sets, or jobs running on the system. Both of these send a request to z/OSMF, which might send a request to a TSO userid, get the response back and pass the response back to the requester. I think of this as a thin transaction because the application running on the web server is mainly acting as a router or broker. What the end user sees as a “transaction” may be many micro services – each of which is a REST requests.

How can I see what is going on?

You can see what “transactions” or requests have been issued from

the access log – a flat file of requests
SMF 120 records
WLM reports.

Access logs

In the <httpEndpoint … accessLoggingRef=”accessLogging”…>, the accessLogging is a reference to an <httpAccessLogging… statement. This creates a flat files containing a record of all inbound HTTP client requests. If you have multiple httpEndpoint statements, you can have multiple accessLogging files. You can control the number and size of the files.

This has information with fields like

10.1.0.2 – the IP address the request came in from
COLIN – the userid
{16/May/2021:14:47:40 +0000} – the date, time and time zone of the event
PUT /zosmf/tsoApp/tso/ping/COLIN-69-aabuaaac – the request
HTTP/1.1 – the type of HTTP request
200 – the HTTP return code
78 – the number of bytes sent back.

You can have multiple httpEndpoint definitions, for example specifing different IP address and port combinations. These definitions can point to a shared or indivdual httpAccessLogging statement, and so can share (or not) the flat files. This allows you to specify that for one port you will use the accessLogging, and another port does not have access logging.

The SMF 120 records.

The request logging writes a record to SMF for each request.

<server>
<featureManager>
<feature>zosRequestLogging-1.0</feature>
</featureManager>
</server

I have not been able to find a formatter for these records from IBM, so I have written my own, it is available on github.

It produces output like

Server version      :2                                                              
System              :S0W1                                                           
Syplex              :ADCDPL                                                         
Server job id       :STC02771                                                       
Server job name     :IZUSVR1                                                        
config dir          :/var/zosmf/configuration/servers/zosmfServer/                  
codelevel           :19.0.0.3  
                                                     
Start time          :2021/05/16 14:42:33.955288                                     
Stop  time          :2021/05/16 14:42:34.040698                                     
Duration in secs    : 0.085410                                                      
WLMTRan             :ZCI4                                                           
Enclave CPU time    : 0.042570                                                      
Enclave ZIIP time   : 0.042570                                                      
WLM Resp time ratio :10.000000                                                      
userid long         :COLIN                                                          
URI                 :/zosmf/IzuUICommon/externalfiles/sdsf/index.html               
CPU Used Total      : 0.040111                                                      
CPU Used on CP      : 0.000000                                                      
CPU Delta Z**P      : 0.040111 
                                                     
WLM Classify type   :URI                                                            
WLM Classify by     :/zosmf/IzuUICommon/externalfiles/sdsf/index.html
WLM Classify type   :Target Host                            
WLM Classify by     :10.1.1.2                               
WLM Classify type   :Target Port                            
WLM Classify by     :10443   
                               
Response bytes      :7003                                   

Origin port         :57706                                  
Origin              :10.1.0.2

This has information about

The Liberty instance, where it is running and configuration information
Information about the transaction, start time, stop time, duration and CPU time used
WLM information about the request. It was classified as
- URI:…index.html
- Target Host:10.1.1.2
- Target port:10443
7003 bytes were sent to the requester
the request came from 10.1.0.2 port 57706

The SMF formatter program also summarises the records and this shows there are records for

/zosmf/IzuUICommon/externalfiles/sdsf/js/ui/theme/images/zosXcfsCommand_enabled_30x30.png
/zosmf/IzuUICommon/externalfiles/sdsf/js/ui/theme/sdsf.css
/zosmf/IzuUICommon/externalfiles/sdsf/sdsf.js
/zosmf/IzuUICommon/externalfiles/sdsf/IzugDojoCommon.js
/zosmf/IzuUICommon/persistence/user/com.ibm.zos.sdsf/JOBS/JOBS

This shows there is a record for each part of the web page, icons, java scripts and style sheets.

Starting up SDSF within z/OSMF created 150 SMF records! Refreshing the data just created 1 SMF record. The overhead of creating all the SMF records for one “business transaction” may be too high for production use.

As far as I can tell this configuration is server wide. You cannot enable it for a specific IP address and port combination.

WLM reports

Much of the data produced in the records above can be passed to WLM. This can be used to give threads appropriate priorities, and can produce reports.

You enable WLM support using

- <feature>zosWlm-1.0 </feature>
</featureManager>
<zosWorkloadManager collectionName=”MOPZCET”/>

- <httpClassification transactionClass=”ZCI6″ resource=”/zosmf/desktop/”/>
- <httpClassification transactionClass=”ZCI1″ resource=”/**/sdsf/**/*”/>
- <httpClassification transactionClass=”ZCI3″ resource=”/zosmf/webispf//”/>
- <httpClassification transactionClass=”ZCI4″ resource=”/**/*”/>
- <httpClassification transactionClass=”ZCI2″ resource=”IZUGHTTP” />
- <httpClassification transactionClass=”ZCI5″ port=”10443″/>
</wlmClassification>

Where the httpClassification maps a z/OSMF resource to an 8 character transaction class. THe records are process from the top until there is a match. For example port=10443 would not be used because of the generic resource=/**/* definition.

These need to be mapped into the WLM configuration…

WLM configuration

You can configure WLM through the WLM configuration panels.

option 6. Classification rules.

option 3 to modify CB(Component broker)

          -------Qualifier--------                 -------Class--------  
 Action   Type      Name     Start                  Service     Report   
                                          DEFAULTS: STCMDM      TCI2
  ____  1 CN        MOPZCET  ___                    ________    THRU
  ____  2   TC        ZCI1     ___                  SCI1        RCI1
  ____  2   TC        ZCI2     ___                  SCI2        RCI2
  ____  2   TC        ZCI3     ___                  SCI3        RCI3
  ____  2   TC        ZCI4     ___                  SCI4        RCI4
  ____  2   TC        ZCI5     ___                  SCI5        RCI5
  ____  2   TC        ZCI6     ___                  SCI6        RCI6
  ____  2   TC        THRU     ___                  THRU        THRU

For the Type=CN, Name=MOPZCET, this value ties up with the <zosWorkloadManager collectionName=”MOPZCET” above. Type=CN is for CollectionName.
For the subtype 2 TC Name ZCI4, This is for TransactionClass which ties up with a httpClassification transactionClass statement.

The service class SCI* controls the priority of the work, the Report class RCI* allow you to produce a report by this name.

If you make a change to the WLM configuration you can save it from the front page of the WLM configuration panels, Utilities-> 1. Install definition, and activate it using Utilities-> 3. Activate service policy.

If you change the statements in Liberty or z/OSMF I believe you have to restart the server.

How to capture the data

The data is written out to SMF records on a timer, or on the SMF end-of-interval broadcast. If you change the interval, SMF sends an end-of-interval broadcast and writes the records to SMF. For example on my dedicate test system I use the operator command setsmf intval(10) to change the interval to 10 minutes. After the next test, I use setsmf intval(15) etc..

The data is kept in SMF buffers, and you may have to wait for a time, before the data is written out to external storage. It SMF data is being produced on a regular basis, it will be flushed out.

How to report the data

I copy the SMF data to a temporary data

//IBMURMF  JOB 1,MSGCLASS=H RESTART=POST 
//* DUMP THE SMF DATASETS 
// SET SMFPDS=SYS1.S0W1.MAN1 
// SET SMFSDS=SYS1.S0W1.MAN2 
//* 
//SMFDUMP  EXEC PGM=IFASMFDP 
//DUMPINA  DD   DSN=&SMFPDS,DISP=SHR,AMP=('BUFSP=65536') 
//DUMPINB  DD   DSN=&SMFSDS,DISP=SHR,AMP=('BUFSP=65536') 
//* MPOUT  DD   DISP=(NEW,PASS),DSN=&RMF,SPACE=(CYL,(1,1)) 
//DUMPOUT  DD   DISP=SHR,DSN=IBMUSER.RMF SPACE=(CYL,(1,1)) 
//SYSPRINT DD   SYSOUT=* 
//SYSIN  DD * 
  INDD(DUMPINA,OPTIONS(DUMP)) 
  INDD(DUMPINB,OPTIONS(DUMP)) 
  OUTDD(DUMPOUT,TYPE(70:79)) 
  DATE(2020316,2021284) 
  START(1539) 
  END(2359) 
/*

and display the report classes

//POST EXEC PGM=ERBRMFPP 
//* PINPUT DD DISP=SHR,DSN=*.SMFDUMP.DUMPOUT 
//MFPINPUT DD DISP=SHR,DSN=IBMUSER.RMF 
//SYSIN DD * 
SYSRPTS(WLMGL(RCPER)) 
/*

The output in //PPXSRPTS was

REPORT CLASS=RCI4                                                      
-TRANSACTIONS--  TRANS-TIME HHH.MM.SS.FFFFFF  
AVG        0.01  ACTUAL                69449  
MPL        0.01  EXECUTION             68780  
ENDED       160  QUEUED                  668  
END/S      0.13  R/S AFFIN                 0  
#SWAPS        0  INELIGIBLE                0  
EXCTD         0  CONVERSION                0  
                 STD DEV              270428  
                                              
----SERVICE----   SERVICE TIME  ---APPL %---  
IOC           0   CPU    2.977  CP      0.05  
CPU        2551   SRB    0.000  IIPCP   0.05  
MSO           0   RCT    0.000  IIP     0.19  
SRB           0   IIT    0.000  AAPCP   0.00  
TOT        2551   HST    0.000  AAP      N/A  
/SEC          2   IIP    2.338                
ABSRPTN     232   AAP      N/A                
TRX SERV    232

There were 160 “transactions” within the time period or 0.13 per second, The average response time was 69449 microseconds, with a standard deviation of 270428. This is a very wide standard deviation, so there was a mixture of long, and short response times.

The total CPU for this report class was 2.977 seconds of CPU, or 0.019 seconds per “transaction”.

Do I want to use this in production?

I think that the amount of data produced is managable for a low usage system. For a production environment where there is a lot of activity then the amount of data produced, and the cost of producing the data may be excessive. This could be an example of the cost of collecting the data is much larger than the cost of running the workload.

As z/OSMF acts as a broker, passing requests between end points you may wish just to use your existing reporting structures.

I used z/OSMF to display SDSF data, and set up and ISPF session within the web browse. This created two TSO tasks for my userid. If you include my traditional TSO session and the two from z/OSMF this is three TSO sessions in total running on the LPAR.

Why do they ship java products on z/OS with the handbrake on? And how to take the brake off.

I noticed that it takes seconds to start MQ on my little z/OS machine, but minutes (feels like days) to start anything with Liberty Web server. This include the MQWEB, z/OSMF, and Z/OSConnect. I mentioned this to an IBM colleague who asked if I was using Java Shared classes. These get loaded into z/OS shared pages.

When I implemented it, my Liberty server came up in half the time!

I found this blog post which was very helpful, and showed me where to look for more information. I subsequently found this document (from 2006!)

The kinder garden overview of how Java works.

You start with a program written in the Java language.
When you run this, Java converts it into byte codes
These byte codes get converted to native instructions – so a byte code “push onto the stack” may become 8 390 assembler instructions.
This code can be optimised, for example code which is executed frequently can have the assembler instructions rewritten to go faster. It might put code inline instead of out in a subroutine.
If you are using Java shared classes, this code can be written out and reused by other applications, or if you restart the server, it can reused what it created before. Reusing the shared classes means that programs benefit because the byte codes have already been converted into native code, and optimisations have been done on the hot code.

What happens on z/OS?

By default, z/OS writes the code to virtual memory and does not save anything to disk. If you restart your Java application within the same IPL, it can exploit the shared classes which have been converted to native code, and optimised – great- good design. I found the second time I started the web server it took half the time. However I IPL once a day, and start my web server once a day. I do not benefit from having it start faster a second time – as I only started it once per session. By default when you re-ipl, the shared classes code is discarded, and so next time you need the code, it has to be to convert to native instructions again, and it loses any optimisation which had been done.

What is the solution?

It is two easy steps:!

Tell Java to write the information from memory to disk – to take a snaphot.
After IPL tell Java to load memory from the disk image – to restore a snapshot.

It is as simple as that.

Background.

It is all to do with the java -Xshareclasses.

With your application you tell Java where to store information about the shared classed. It defaults to Cache=/tmp/ name=javasharedresources.

In my jvm.options I overrode the defaults and specified

-Xshareclasses:nonFatal 
-Xshareclasses:groupAccess 
-Xshareclasses:cacheDirPerm=0777 
-Xshareclasses:cacheDir=/tmp,name=mqweb

If you give each application a name (such as mqweb) you can isolate the cache to an application and not disrupt another JVM if you change the cache. For example if you restore from a snapshot, only users of that “name” will be affected.

List what is in the cache

You can use the USS command,

java -Xshareclasses:cacheDir=/tmp/,listAllCaches

I used a batch job to do the same thing.

//IBMJAVA  JOB  1 
// SET V='listAllCaches' 
// SET C='/tmp/' 
//S1       EXEC PGM=BPXBATCH,REGION=0M, 
// PARM='SH java -Xshareclasses:cacheDir=&C,&V' 
//STDERR   DD   SYSOUT=* 
//STDOUT   DD   SYSOUT=*

The output below, shows the cache name is mqweb. Once you have created a snapshot it has an entry for it.

Listing all caches in cacheDir /tmp/                                                                          
                                                                                                              
Cache name       level         cache-type      feature         OS shmid       OS semid 
mqweb            Java8 64-bit  non-persistent  cr              8197           4101

For a different product I got

Incompatible shared caches                                     
rseapi                  Java8 32-bit  non-persistent  default

The Incompatible shared caches looks like it means you are using 64 bit Java – but there is a cache using 32 bit Java.

For MQWEB the default parameters are -Xshareclasses:cacheDir=/u/mqweb/servers/.classCache,name=liberty-%u” where /u/mqweb is the WLP parameter, where my parameter are defined, and %u is the userid the server is running under, so in my case liberty=START1.

When I had /u/mqweb/servers/.classCache, then the total command line was too long for BPXBATCH. (Putting it into STDPARM gave me IEC020I 001-4 on the instream STDPARM because the resolved line wa greater than 80 characters. I resolved this by adding -Xshareclasses:cacheDir=/u/mqweb,name=cache to the jvm.options file.

To take a snapshot


//IBMJAVA  JOB  1 
// SET C='/tmp/' 
// SET N='mqweb' 
// SET V='restoreFromSnapshot' 
// SET V='listAllCaches'
// SET V='snapshotCache'
//S1       EXEC PGM=BPXBATCH,REGION=0M, 
// PARM='SH java -Xshareclasses:cacheDir=&C,name=&N,&V' 
//STDERR   DD   SYSOUT=* 
//STDOUT   DD   SYSOUT=* 
//

This job took a few seconds to run.

I believe you have to take the snapshot while your Java application is executing – but I do not know for definite.

Restore a snapshot

To restore a snapshot just use restoreFromSnapshot in the above JCL. This took a few seconds to run.

How to use it.

If you put the restoreFromSnaphot JCL at the start of the web server, it will preload it whenever you use your server.

If you take a snapshot every day before shutting down your server, you will get a copy with the latest optimisations. If you do not take a new snapshot it continues to use the old one.

If you want to not use the shared cache you can get rid of it using the command destroySnapshot.

Is my cache big enough?

If you use the printStats request you get information like

Current statistics for cache "mqweb":                                                
...                                                                                     
cache size                           = 104857040                                     
softmx bytes                         = 104857040                                     
free bytes                           = 70294788 
...
Cache is 32% full                                     
                                                      
Cache is accessible to current user = true

The documentation says

When you specify -Xshareclasses without any parameters and without specifying either the -Xscmx or -XX:SharedCacheHardLimit options, a shared classes cache is created with a default size, as follows:

For 64-bit platforms, the default size is 300 MB, with a “soft” maximum limit for the initial size of the cache (-Xscmx) of 64MB, …

I had specified -Xscmx100m which matches the value reported.

What is in the cache?

You can use the printAllStats command. This displays information like

Classpath

1: 0x00000200259F279C CLASSPATH
/usr/lpp/java/J8.0_64/lib/s390x/compressedrefs/jclSC180/vm.jar
/usr/lpp/java/J8.0_64/lib/se-service.jar
/usr/lpp/java/J8.0_64/lib/math.jar

Methods for a class

0x00000200259F24A4 ROMCLASS: java/util/HashMap at 0x000002001FF7AEB8.
ROMMETHOD: size Signature: ()I Address: 0x000002001FF7BA88
ROMMETHOD: put Signature: (Ljava/lang/Object;Ljava/lang/Object;)Ljava/lang/Object; Address: 0x000002001FF7BC50

This shows

there is a class HashMap.
It has a method size() with no parameters returning an Int. It is at…. in memory
There is another method put(Object o1, Object o2) returning an Object. It is at … in memory

Other stuff

There are sections with JITHINTS and other performance related data.

Taking the brakes off ZFS on z/OS – move it to OMVS

From z/OS 2.2 there is a performance advantage in running the ZFS file system as part of OMVS, rather than its own address space. The IBM documentation says When running zFS in the OMVS address space, each file system vnode operation (such as creating a directory entry, removing a directory entry, or reading from a file) will have better overall performance. Each operation will take the same amount of time while inside zFS itself. The performance benefit occurs because z/OS UNIX can call zFS for each operation in a more efficient manner. This will be relevant when you application is doing a lot of file IO – for example using a web server.

This move is not documented – but it is really easy! It is mentioned here. Instructions are hidden in the installation instructions here.

Before I started

The IBM doc says You can determine if zFS is in its own address space by issuing D OMVS,PFS. If the output shows an ASNAME value, zFS is running as a colony address space.

OMVS     0010 ACTIVE             OMVS=(00,01,BP,IZ,RZ,BB)                
PFS CONFIGURATION INFORMATION                                            
 PFS TYPE   ENTRY      ASNAME    DESC      ST    START/EXIT TIME         
 ...   
  ZFS       IOEFSCM    ZFS       LOCAL     A     2021/02/17 17.35.06

The steps I took…

I added KERNELSTACKS(ABOVE) to USER.Z24A.PARMLIB(BPXPRM00).
Being ultra cautious I re-ipled.
The documentation talks about putting IOEZPRM DD in OMVS, then goes on to say As the preferred alternative to the IOEZPRM DDNAME specification, delete the IOEZPRM DDNAME and use the IOEPRMxx parmlib member. So I did not change the OMVS proc. When I reipled it worked and I got the message IOEZ00374I No IOEZPRM DD specified in OMVS proc. Parmlib search being used.
I edited USER.Z24A.PARMLIB(BPXPRM00) and removed the ASNAME in FILESYSTYPE TYPE(ZFS) ENTRYPOINT(IOEFSCM)~~ASNAME(ZFS)~~ Well I actually made a copy of the original line and put it between /* and */, then deleted the text.
I reipled.

Afterwards

The D OMVS,PFS command now gives N/A instead of the Address Space Name

OMVS     0010 ACTIVE             OMVS=(00,01,BP,IZ,RZ,BB)                
PFS CONFIGURATION INFORMATION                                         
 PFS TYPE   ENTRY      ASNAME    DESC      ST    START/EXIT TIME         
...
  ZFS       IOEFSCM    N/A       LOCAL     A     2021/02/17 17.55.47

Easy!

The hardest part was making sure I had an IPLable SARES1 in case I got it wrong!

Issuing commands…

I used to issue commands like f zfs,query,all. Now that the ZFS address space does not exist, you need to use f omvs,pfs=zfs,query,all.

What should I monitor for MQ on z/OS – buffer pool

For the monitoring of MQ on z/OS, there are a couple of key metrics you need to keep an eye on for the buffer pools, as you sit watching the monitoring screen.

A quick overview of MQ buffer pools

An inefficient queue manager could have all messages stored on disk, in page sets. An efficient queue manager would cache hot pages in memory to they can be accessed without doing any I/O.

The MQ Buffer Manager component does this caching, by using buffers (so no surprise there).

A simplified view of the operation is as follows

Messages are broken down into 4KB pages.
Getting the contents of a pageset page, if the page is not in the buffers, the buffer manager reads it from disk into a buffer.
If a page is requested and the contents are not required (for example it is about to be overwritten as part of a put) it does not need to read it from disk.
If the page is updated, for example a new message, or a field in the message is updated during get processing, the page is not usually written to disk immediately. The write to disk is deferred (using a process called Deferred Write Process – another non surprise in the naming convention). This has the advantage that there can be many updates to a page before it is written to disk.
Any buffer page which has not been changed, and is not currently being used, is a free page.

If the system crashes, non persistent messages are thrown away, and persistent messages can be rebuilt from the log.

To reduce the restart time, pages containing persistent data are written out to the page set periodically. This is driven by the log data set filling up which causes a checkpoint. Updates which have been around for more than 2 checkpoints are written to disk. During restart the page set knows how far back in the log restart needs to go.

In both cases, checkpoint and buffer pool filling up (when there are less than 15 % of free =85% in use) , once a page has been successfully written to the page set, the buffer is marked as free.

Pages for non-persistent messages can be written out to disk.

If the buffer pool is critically short of free buffers, and there are less than 5% free buffers, then pages are written to the page set immediately rather than use the deferred write process. This allows some application work to be done while the buffer manger is trying to make more free pages available.

What is typical behaviour?

The buffer pool is working well.

When a message is got, the buffer is already in the buffer pool so there is no I/O to read from the page set.

The buffer pool is less than 85% full (has more than 15% free pages), there is periodic I/O to the page set because pages with persistent data are written to the page set at checkpoint.

The buffer pool fills up and has more 85% in-use pages.

This can occur if the number and size of the messages being processed is bigger than the size of the buffer pool. This could be a lot of messages in a unit of work, big messages using lots of pages, or lots of transactions putting and getting messages. It can also occur when there are lots of puts, but no gets.

If the buffer pool has between 85% and 95% of in-use pages( between 15% and 5% free pages), the buffer manager is working hard to keep free pages available.

There will be I/O intermittently at checkpoints, and a steady I/O as the buffer manager writes the pages to the page set.

If messages are being read from disk, there will be read activity from the pageset, but the buffer pool page can be reused as soon as the data has been copied from the buffer pool page.

The buffer pool has less than 5% free pages.

The buffer manager is in overdrive. It is working very hard to keep free pages in the buffer pool. There will be pages written out to the page set as it tries to increase the number of free pages. Gets may require reads from page set I/O. All of this I/O can cause I/O contention and all the page set I/Os slow down, and so MQ API request using this buffer pool slow down.

What should be monitored

Most cars these days have a “low fuel light” and a “take me to a garage” light. For monitoring we can provide similar.

Monitor the % buffer pool full.

If it is below 85% things are OK
If it is between 85% and 95% then this needs to be monitored, it may be “business as usual”.
If it is >=95% this needs to be alerted. It can be caused by applications or channels not processing messages

Monitoring the number of pages written does not give much information.

It could be caused by a checkpoint activity, or because the buffer pool is filling up.

Monitoring the number of pages read from the page set can provide some insight.

If you compare today with this time last week you can check the profile is similar.

If the buffer pool is below 85% used,

Messages being got are typically in the buffer pool so there is no read I/O.
If there is read I/O this could be for messages which were not in the buffer pool – for example reading the queue after a queue manager restart.

If the buffer pool than 85% in-use and less than 95% in-use this can be caused by a large message work load coming in, and MQ is being “elastic” and queuing the messages. Even short lived messages may be read from disk. The number of read I/Os give an indication of throughput. Compare this with the previous week to see if the profile is similar.

If the buffer pool is more than 95% in-use this will have an impact on performance, as every page processed is likely to have I/O to the page set, and elongated I/O response time due to the contention.

What to do

You may want “operators notes” either on paper or online which describe the expected status of the buffer pools on individual queue managers.

PROD QMPR
1. BP 1 green less than 85% busy
2. BP 2 green less than 85% busy
3. BP3 green except for Friday night when it goes amber read I/O rate 6000 I/O per minute.
TEST QMTE
1. BP 1 green less than 85% busy
2. BP 2 green less than 85% busy
3. BP 3 usually amber – used for bulk data transfer

What to do the buffer statistics mean?

There are statistic on the buffer pool usage.

Buffer pool number.
Size of buffer pool – when the data was produced.
Low – the lowest number of free pages in the interval. 100* (Size – log)/Size gives you % full.
Now – the current number of free pages in the interval.
Getp – the number of requests ‘get page with contents’. If the page is not in the buffer pool then it is read from the page set.
Getn. A new page is needed. The contents are not relevant as it is about to be overwritten. If the page is not in the buffer pool, just allocate a buffer, and do not read from the page set.
STW – set write intent. This means the page was got for update. I have not seen a use for this. For example
1. A put wants to put part of a message on the page
2. A get is being done and it wants to set the “message has been got” flag.
3. The message has been got, and so pointers to a page need to be updated.
RIO -the number of read requests to the page set. If this is greater than zero
1. The request is for messages which have not been used since the queue manager started
2. The buffer pool had reached 85%, pages had been moved out to the page set, and the buffer has been reused.
WIO the number of write I/Os that were done. This write could be due to a checkpoint, or because the buffer pool filled up.
TPW total pages written, a measure of how busy the buffer pool was. This write could be due to a checkpoint, or because the buffer pool filled up.
IMW – immediate write. I have not used this value, sometimes I observe it is high, but it is not useful. This can be caused by
1. the buffer pool being over 95% full, so all application write I/O is synchronous,
2. or a page was being updated during the last checkpoint, and it needs to be written to the page set when the update has finished. This should be a small number. Frequent checkpoints (eg every minute) can increase this value.
DWT – the number of times the Deferred Write processor was started. This number has little value.
1. The DWP could have started and been so busy that it never ended, so this counter is 1.
2. The DWP could have started, written a few pages and stopped – and done this repeatedly, in this case the value could be high.
DMC. The number of times the buffer pool crossed the 95% limit. If this is > 0 it tells you the buffer pool crossed the limit
1. This could have crossed just once, and stayed over 95%
2. This could have gone above 95%, then below 95% etc.
SOS – the buffer pool was totally short on storage – there were no free pages. This is a critical indicator. You may need to do capacity planning, and make the buffer pool bigger, or see if there was a problem where messages were not being got.
STL – the number of times a “free buffer” was reused. A buffer was associated with a page of a page set. The buffer has been reused and is now for a different page. If STL is zero, it means all pages that were used were in the buffer pool
STLA – A measure of contention when pages are being reused. This is typically zero.

Summary

Now you know as much as I do about buffer pools you’ll see that the %full (or %free) is the key measure. If the buffer pool is > 85% used pages, then the I/O rate is a measure as to how hard the buffer manager is running.