Using the Java Health Centre to look at Liberty on z/OS

The Java Health Center is an Eclipse plugin which allows you to monitor Liberty (and so MQWEB, z/OSMF, z/OS Connect, etc.) on z/OS.

You can look at

  • the Java classes being used,
  • the environment variables and startup parameters,
  • charts of garbage collection,
  • method profiling.

Before you get too far into your investigation, be warned: this is basically insecure.

  • By default anyone can connect to it
  • You can specify a file with userid and password in it (in clear text)
  • You can use TLS – but the keystore has to be in a file; you cannot use the security manager and RACF keyrings.
  • You cannot use RACF or other security manager for authentication and authorization.

Install the Health Center on Eclipse

I used Eclipse Marketplace to install IBM Monitoring and Diagnostic Tools – Health Center 3.0.12

Configuring Health Center on Liberty

You can use

  • headless mode, where files are written to the local file system, transported to your Eclipse environment and reviewed offline.
  • Connection from Eclipse to give real time information

See Configuring the monitoring agent and Health Center configuration properties, and here

Headless mode

Add the following to jvm.options (or other startup options), changing the directory location to suit your system.

-Xhealthcenter
-Dcom.ibm.diagnostics.healthcenter.headless=on
-Dcom.ibm.java.diagnostics.healthcenter.data.collection.level=headless
-Dcom.ibm.java.diagnostics.healthcenter.headless.output.directory=/u/tmp/zowec/
-Dcom.ibm.diagnostics.healthcenter.readonly=on

Restart the server.

Download the files from the specified directory to your workstation, so the files are available to Eclipse.

In Eclipse

Window -> Open Perspective -> Health Center Status Summary

File -> Load Data…

Specify the file name.

Interactive mode

First time use (with little or no security)

-Xhealthcenter:transport=jrmp,port=8125 
-Dcom.ibm.diagnostics.healthcenter.logging.level=info
-Dcom.ibm.diagnostics.healthcenter.readonly=on
-Dcom.ibm.diagnostics.healthcenter.data.profiling=off

Note: Use -Dcom.ibm.diagnostics.healthcenter.data.profiling=off to start the server with profiling disabled.

Restart the server. Check the error log for a message like INFO: Health Center agent started on port 8125.

In Eclipse, once you have installed the Health Center:

Window -> Open Perspective -> Health Center Status Summary

Open a connection to the server

File->”New Connection” (do not select the “New” option)

JMX; Host name 10.1.1.2; Port 8125; Untick Scan next 100 ports

This should connect you.

Problems

If you get

  • Unable to contact agent: Non IBM version of Java has been detected. Check the server is fully active, and that you got the message INFO: Health Center agent started on port 8125. in STDERR.

Ping the IP address to check connectivity.

Quick overview

Once you have connected, or loaded the files

  • Environment gives you the properties and class paths.
  • Classes
    • Classes loaded shows you the classes loaded, and whether they came from the shared class cache or not.
    • Class Histogram shows you counts of object instances. For example I had 196106 Ljava/lang/String – so lots of String objects – and 146973 HashMap nodes.
  • Method profiling
    • Self – shows the hottest methods.
    • Tree – includes parents. The top element in the tree is java.lang.Thread.run. All work runs on a thread, so all work will be reported under this.
    • Click a line and select Invocation Paths to show who called the method.
  • Where you get a chart with Elapsed time, you can use your mouse to select a time period and zoom in. Click on “elapsed time” to unzoom.

Securing Health Center

This is documented in Securing Health Center – but is missing standard Z techniques of protecting resources.

  • By default anyone can connect to it
  • You can specify a file with userid and password in it (in clear text)
  • You can use TLS – but the keystore has to be in a file; you cannot use the security manager and RACF keyrings.
  • You cannot use RACF or other security manager for authentication and authorization.

Straight lines good – curved lines bad.

It was a great day: it was 0830, the sun was shining, and the view was amazing as I walked back the long way from the shops. There was a good hill en route; it started nice and steep, then gradually levelled off to flat. I sat on the convenient seat at the top of the hill to admire the view and get my breath back. The shape of the hill reminded me of a graph in a performance report I had seen a few days before; it started steeply at the bottom left of the graph, and levelled off at the top right, and the description was “so we can see it scales”. As I sat there enjoying the sun, I realised that it may scale – but it does not scale very well.

What does it mean to scale?

I remember 40+ years ago sitting in a meeting where the performance team were presenting status. They showed many graphs, one of which showed the transaction rate as they increased the work load. The graph started going up at an angle, nice and straight, and then levelled out until they achieved 100 transactions a second, and they could do no more than this. They explained that this was because of a lock being held for 1/10 of a second while some work was done.

They worked with the developer who realised that they could use a counter instead of a lock, and use the hardware instruction to atomically increase and decrease the counter. As a result the graph was now a straight line. Round of applause etc, well done etc.

The performance guy then went on to say, so in order to find the next bottlenecks we will need a box that is even bigger and even faster!

I also remember this performance guy said his job was to make curved lines straight!

You can hide the truth with charts

People often use graphs to show information, but the graphs often tell the wrong story. For example, if you plot the total cost vs transaction rate you get a straight line.

Great – this looks linear!

However, if you plot the cost per transaction vs rate, you get a different picture.

In actual fact there is an increase in the cost per transaction as the rate increases. So good scaling – but not perfect. The line is linear – when it should have been flat!

If you want to scare people – change the axis

It is the same data as the previous chart – but now the axis is 20 to 22.5 instead of 0 to 25, so any difference appears much larger.

So depending on what story you want to tell, and whether you want to hide the truth, think about what data you present, and the axes you use.

Everyone wants a bigger box or two

On the subject of wanting a bigger box…

One of the senior testers presented to management about testing strategy. He said you can do testing in one of two ways:

  1. If we were careful, and picked the right set of results, we could show all tests working. We could quietly ignore the tests that failed.
  2. No matter what we did, we could not seriously break it. Even when we deliberately misconfigured it during testing it carried on running – slowly – but it carried on running.

The manager said “hmmm I’m always being asked for graphs to present status so the first would be better, but from a customer perspective the second would be better”.

The tester said “That’s the right answer, we now need two bigger boxes. One to run the workload, the other to drive it. We need to achieve higher throughput than our customers”.

One customer told me that their test box was larger than production, and they needed to prove that they could run at expected peak production throughput + 50% with no major degradations – they needed many very big boxes.

How to become a performance expert in 3 easy lessons

and many hard lessons.

I had emails from two people with different experiences of doing performance work on z/OS. One person has recently started, and is not sure what is involved. The other has been doing lots of work with customers, explaining that his product is not the cause of the performance problems.

I thought it might be interesting for people who might be tempted to work in performance, to see the route to becoming an expert.

What does “performance” mean?

Performance work covers many different areas, and once you are competent in one product area it is not too difficult to cover additional areas.

“Performance” covers

Making sure it scales close to linearly

If you double the throughput, the cost per transaction should be similar. As the throughput increases, the response time should not increase significantly. You should be able to have many threads running concurrently.

If the workload has disk I/O then you need to have multiple threads, so while one task is waiting for I/O another task can be using the CPU.

You need a box with multiple CPUs to detect contention. If you have only one or two engines you may not detect concurrency issues.

Work to remove contention until you can drive the CPUs at 100% busy (and then you ask for a bigger box). If you cannot drive the box at 100% find out why, resolve it and repeat.

Reduce CPU

Once you have eliminated as much contention as possible, you need to investigate where the CPU is being used, and try to eliminate any hot spots. This might be

  • Change algorithms – use a hash table instead of a linked list.
  • Avoid unnecessary work. Do you really need to store intermediate values in a database?
  • Can you tune the services being used? For example, tune the database, or add an index to a table.
  • Rearrange the code, for example have the “hot code” located in the same few pages. Avoid lots of error handling code in the mainline code – branch out of the mainline to handle it.
  • Remove debug code, or guard it with if (debug enabled) then { debug code }, as sketched below.
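
As an illustration of that last point, a minimal Java sketch (the property name and class are my inventions, not from any product):

public class Worker {
    // Evaluate the setting once: testing a boolean is far cheaper than
    // building a debug string on every call through the mainline.
    private static final boolean DEBUG =
        Boolean.getBoolean("com.example.debug"); // hypothetical property name

    void process(String item) {
        if (DEBUG) {
            // Only pay for the string concatenation when debugging is enabled.
            System.out.println("processing item=" + item
                + " on thread " + Thread.currentThread().getName());
        }
        // ... mainline work ...
    }
}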

Work with customers’ problems

Understand which areas the users have problems with, and identify the “problem areas” where diagnosing a problem takes a long time.

Enhance the design

From your testing, and your experience with customer problems, propose improvements to help diagnose problems, for example:

  • Capture the number, the average time, and the maximum time of database requests. Report this as statistics, or in response to a display command (see the sketch after this list).
  • Record the number of times a resource, such as a lock, was not available: record the total count of requests, the number of blocked requests, and the time spent waiting. This code may never be executed, but if it is, you get useful information about the size of the problem.
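
A minimal sketch of the first point in Java, with names of my own invention: accumulate the count, total time, and maximum time, and format them for a statistics record or a display command.

import java.util.concurrent.atomic.AtomicLong;

// Tracks the number, average time, and maximum time of requests,
// for example database requests.
public class RequestStats {
    private final AtomicLong count = new AtomicLong();
    private final AtomicLong totalMicros = new AtomicLong();
    private final AtomicLong maxMicros = new AtomicLong();

    public void record(long durationMicros) {
        count.incrementAndGet();
        totalMicros.addAndGet(durationMicros);
        maxMicros.accumulateAndGet(durationMicros, Math::max);
    }

    // Report in response to a display command, or in a statistics record.
    public String display() {
        long n = count.get();
        long avg = (n == 0) ? 0 : totalMicros.get() / n;
        return "requests=" + n + " avgMicros=" + avg + " maxMicros=" + maxMicros.get();
    }
}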

Provide useful information to the end user

These are often known as “performance reports”. It is easy to produce reports that people cannot use – I have done it many times. Reports with nice graphs are often not easy to use, as they do not match your scenario.

You need to consider the questions the end users will have.

  • I want to run an ill-defined workload (I do not know all the details); how big a box do I need (how many CPUs) to support 1000 requests a second?
  • What should I look at to tell me if things are running well or not?
  • What are common symptoms, and what actions can I take to solve performance problems?
  • What things do we need to consider to make it run well? For example table layout, how many requests per commit, how often you need to sign on.

Performance roles

The roles below are typical of the sort of activities a performance person will do

Run tests

The first task a person usually does when becoming a performance person is to run tests and collect the data. This may involve writing scripts and tools so it can all be automated. For example on z/OS you might use Netview to run scripts, capture responses, and take actions when there are problems. This could all be done using Rexx scripts in TSO, and possibly using a REST interface.

Good automation will collect all of the key metrics into one place, for example a spreadsheet, so the analyst can simply press a button or two to display the data.

There may be a management report produced daily or weekly to show that performance overall has improved – or has not got worse.

Look at a component

You need to look at components within the whole environment, for example this week, look at the z/OSMF SDSF interface, next week the logon process.

You need to drive a high volume workload using this component, and focus on the component itself. For example, with REST requests, 90% of the cost may be in the logon and establishing a session, which makes it hard to see the other 10%. Sign on once, and have an application that just issues requests to the component, as sketched below.
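
As a hedged sketch of “sign on once” for the REST case, using the standard Java HTTP client; the host, port, and endpoints are placeholders, not from any product:

import java.net.CookieManager;
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class DriveComponent {
    public static void main(String[] args) throws Exception {
        // The cookie manager keeps the session cookie from the one logon.
        HttpClient client = HttpClient.newBuilder()
                .cookieHandler(new CookieManager())
                .build();

        // Sign on once (hypothetical endpoint); the cookie is reused below.
        client.send(HttpRequest.newBuilder(URI.create("https://myhost:9443/logon")).build(),
                HttpResponse.BodyHandlers.discarding());

        // The measured loop now exercises only the component under test.
        HttpRequest req = HttpRequest.newBuilder(
                URI.create("https://myhost:9443/component")).build(); // hypothetical
        for (int i = 0; i < 10_000; i++) {
            client.send(req, HttpResponse.BodyHandlers.discarding());
        }
    }
}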

When I was testing MQ under CICS, the duration of an MQPUT was 50 microseconds, and the cost of starting the CICS transaction was 1000 microseconds. I changed the transaction to process 1000 messages, so the transaction now took about 50 milliseconds, and most of the work was in the MQPUT area, not in the CICS transaction overhead.

Capture the response time of the transactions and plot it over time. You should get a flat line. If the response time increases over time, you might have a storage leak, so it takes longer to get storage.

You may find it does not scale. Turning trace on can give an indication of where the problem is. You often get function entry and exit trace with time stamps, so you can post-process the output to calculate the duration within each function. Trace itself often does not scale, so you cannot always believe the output.

You may want to instrument a private copy of the code. Obtain the time on entry and exit to each function, and across major calls to external requests. Calculate the duration of the calls, and add logic to say “if duration > 10 milliseconds then throw an exception”, or accumulate the data in a global control block. When I did this, I found the trace code was adding significant delays, and the root cause of the problem was an insignificant-looking line of code which got an exclusive latch for an update!
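
A sketch of that kind of instrumentation, with an illustrative 10 ms threshold (the names are mine, not from any product):

import java.util.function.Supplier;

public final class Timed {
    // Wrap a call to an external service, and flag any unusually slow
    // request; the exception's stack trace shows who made the slow call.
    public static <T> T call(String what, Supplier<T> body) {
        long start = System.nanoTime();
        T result = body.get();
        long durationMs = (System.nanoTime() - start) / 1_000_000;
        if (durationMs > 10) {
            throw new IllegalStateException(what + " took " + durationMs + " ms");
        }
        return result;
    }
}

Used as, for example, Timed.call("db insert", () -> database.insert(row)); alternatively, accumulate the durations in a shared statistics object instead of throwing.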

I added code to measure the average duration of file I/O, and output this in the statistics. This made solving some problems very easy: you have an I/O problem – see here, it is taking 10 ms to write a page of data!

Unless you are testing the startup times, you should allow the system under test to “warm up”, so the hardware cache is in a steady state, database tables are in memory etc.

I found it useful to warm the system up, then take 5 sets of measurements, each of 1-5 minutes. When displaying the data, the results should all be similar; if not, you need to find out why. You should also run these tests once a week, and whenever you change a component, such as putting fixes onto your system, or changing the hardware. Some examples of things that can change your results:

  • Overnight the Operations Team run backups and cause a different disk response time
  • The order the LPARs were IPLed has changed. Last week your system had 6 CPUs in one book (so all very close to each other); this week your system has 3 CPUs in one book – and 3 CPUs in a different book, 1 metre away.
  • The network between your driving system and the test system has changed, or has a different load.

Usually the performance machines have their own dedicated hardware, processors, disks, connections to the disks, network.

Develop skills in other products

My background is MQ performance on z/OS. I had to learn about the performance characteristics of z/OS, DB2, TCP/IP, IMS, and understand the tools these products provide. Once you understand one trace, other traces are basically similar. The hard part is capturing the trace.

MQ passes messages from system to system. There were several problems where the “network was slow”. This meant we had to understand what was happening under the covers. Some good problems with easy fixes included

  • There was a TCP performance “improvement” (this sounds like the Nagle algorithm) where one end would delay sending a packet for a few milliseconds, as it is more efficient to send one big packet rather than several smaller packets. This meant that every MQ message sent over the network had a couple of milliseconds delay. The fix was easy – disable this feature.
  • TCP/IP by default uses small buffers (256 bytes). You can configure a session to have very large buffers and tell it to automatically tune the best buffer size ( up to MB sized buffers).

Work with customers on their performance problems

The work involves working on performance problems where none of your specially written instrumentation code is present. You may need to turn on the product trace for a few seconds, turn it off, and then process the output. Many customers do not run with trace on because of the overhead and major impact on throughput.

You need to acquire the skills to talk to customers on the phone about their problems. It is very good to feed back what you heard: “Let me check what you just said … when you do … you get …”

Over time you will build up a list of questions to ask.

Once the problem has been resolved, consider what would have made it easier to find the root cause. Can you get development to put in some statistics, so next time this happens, you can tell the customer to check a value.

In the early days of MQ, we used to get many problems because the in-memory buffer was too small. Development put out a fix, so that every 10 minutes or so it would report whether it had detected a buffer-full condition since the last message. After this fix was rolled out, we had no more of these problems.

There is no limit as to how far you can go

Once you have skills in one component you can apply these skills to other products or components. For example I spent some time looking at MQ on Linux so I could understand (and blog) on the performance data produced. (The performance data was “here are some numbers, we are not going to tell you what they mean”).

I’ve also been looking at Java performance, which led me to look at the zFS file system, and the statistics it provides (it provides some – but they are not very useful).

You can also go deep. I knew about z architecture instructions and how some are fast and some are slow. I attended a taskforce with lots of hardware people. I met the team leader for the “load instructions”, and found that the “load instruction” is not really a single instruction – it is more like a subroutine with logic, for example:

  • Find which CPU currently “owns” this data in the CPU cache, and go and get it
  • Lock the page
  • Go and get this value from another page
  • Add the two values
  • Unlock both the pages

The subroutine had to communicate with other CPUs in the LPAR, worry about its own CPU cache etc. Deep Stuff!

Once you know this sort of stuff it helps you program: for example, it is better not to share a field if you do not have to. When a multi-threaded program uses a buffer to trace into, do not have one buffer which all the threads share; give each thread its own buffer. This way the hardware will not be fighting over the buffer, and the data for each application can be kept on the same CPU as the program. This is obvious once you know!

Collect statistics at the thread level, and not at the global level, and merge them at display time. You know the reason why.
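
In Java, java.util.concurrent.atomic.LongAdder implements exactly this pattern: it keeps a set of internal cells so threads do not contend on one hot field, and only merges them when you ask for the total. A minimal sketch:

import java.util.concurrent.atomic.LongAdder;

public class Stats {
    // Each thread updates its own cell on the hot path...
    static final LongAdder messagesProcessed = new LongAdder();

    static void onMessage() { messagesProcessed.increment(); }

    // ...and the cells are only summed at display time.
    static long display() { return messagesProcessed.sum(); }
}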

The hardware can start to execute instructions out of order – as long as they “commit” in the right order.

The z hardware has instrumentation which samples the executing system, and can tell you why instructions were delayed. For example

  • Data had to be obtained from the L2 cache on the chip
  • The instruction needed to be interpreted and added to the Translation Lookaside Buffer

This is a bit deep for many people, especially if they are at the level of using “printf” in their programs to display debug information.

“Me, with the brain the size of a planet ….”

This is a quote from Marvin the Paranoid Android in The Hitchhiker’s Guide to the Galaxy. With performance work you can go deep, or you can go wide, but you would need a bigger brain than I had to go both deep and wide – but it is a fascinating area.

Example of zFS statistics

This blog post gives an example of zFS statistics, and my interpretation of what they mean.

I IPLed my z/OS to give a clean system.

I used a batch job to read all of the files in a directory and throw away the output.

sh cat /usr/lpp/java/J8.0_64/lib/ext/* 1> /dev/null

The command

du -ka /usr/lpp/java/J8.0_64/lib/ext/

gave 16728 KB, and there were 30 files in the directory.

The interface layer

The command

query -knpfs

gave

------------- ---------- ---------- ---------- ----------
Operation          Count    XCF req   Avg Time      Bytes
------------- ---------- ---------- ---------- ----------
zfs_opens             37          0      0.053
zfs_closes            37          0      0.024
zfs_reads           4160          0      0.080    16.234M
zfs_getattrs          86          0      0.036
zfs_accesses         377          0      0.027

There were 4160 read requests of 4096 bytes each: 4160 × 4096 ≈ 16.2MB, which matches the 16.234M bytes reported.

There were 30 opens, one for each file.

There were also opens for ‘/’, ‘/usr’, ‘/usr/lpp’ etc., so 37 opens in total. At the end, each of these objects was closed.

The interface layer calls the buffer manager

The command

query -usercache

gave the User File (VM) Caching System Statistics report. It had:

External requests
Reads     4160 Fsyncs     0 Schedules 0
Writes       0 Setattrs   0 Unmaps    0
Asy Reads 4126 Getattrs 153 Flushes   0

This says there were 4160 read requests, which matches the zfs_reads count.

There were 4126 requests from the interface layer which had read-ahead set. This tells the buffer manager to get the pages and, if they are not already in the buffer, to start reading them from disk. Asy Reads does not give the number of reads from disk.

When I repeated the test I had: Reads 4160, Asy Reads 4120, with reads from disk 0 (as expected).

 File System Reads:
 Reads Faulted          34     (Fault Ratio    0.817%) 
 Writes Faulted          0     (Fault Ratio    0.000%) 
 Read Waits             34     (Wait Ratio     0.817%) 
 Total Reads           276 

This shows there were 276 reads from a file system, of which 34 requests had to wait for I/O.

I interpret this as saying there were 34 get-page requests which required disk I/O. The remaining 276 − 34 = 242 reads caused I/O for read-ahead, so the application did not have to wait. I think the first page of each file was not in the cache, so there was an I/O to read the first segment (16 pages) of records in. There were 30 files, so 34 is close enough. The first request also started a read-ahead to read the next segment in.

 Page Management (Segment Size = (64K) ) (Page Size = 8K) 
 -------------------------------------------------------- 
 Total Pages           121725     Free             118843 
 Segments                 395 
 Steal Invocations          0     Waits for Reclaim     0 

Before the test the free pages was 120933, so the delta is 2490 pages. Each page is 8KB, so the amount of storage used is 2490 * 8KB = 19.5 MB. The amount of data read from disk is 16.234MB so these numbers are comparable.

Steal Invocations is the number of 64KB segments released to make space in the cache. In another test, I used a very small cache (10MB), read 25636 KB of data in, and repeated the reads. Steal Invocations was 404, and 404 × 64 × 1024 = 25856 KB, which is close to the amount of data processed. Note: the documentation says the value is a number of 4KB pages, but it appears to be 64KB segments.

Data level

                   I/O Summary By Type 
                   ------------------- 
                                                                      
 Count       Waits       Cancels     Merges      Type 
 ----------  ----------  ----------  ----------  ---------- 
         75          61           0           0  File System Metadata 
          0           0           0           0  Log File 
        276          51           0           0  User File Data 

This shows there were 75 I/O requests for metadata about the files, and 276 I/O requests to read the files themselves. Reading the documentation, I think the Waits column indicates an I/O request was delayed before its I/O started, for example because there was already an I/O outstanding.

                  zFS I/O by Currently Attached Aggregate 
 DASD   PAV 
 VOLSER IOs Mode  Reads  K bytes  Writes  K bytes  Dataset Name 
 ------ --- ----  -----  -------  ------  -------  ------------ 
 ... 
 A4PRD3   1  R/O    302    16780       0        0  JVB800.ZFS 
 ... 
 ------ --- ----  -----  -------  ------  -------  ------------ 
                    337    17104      14       56  *TOTALS* 

This shows there was I/O to the data set containing the Java file system. There were 302 reads, and it read 16780 KB of data.

I’ve omitted the other file systems, which had 35 reads and 14 writes.

These counts do not quite tie up. There were 276 reads of User File Data, and 75 reads of File System Metadata, a total of 351; the zFS read count was 337.

zFS performance reports I would like to use on z/OS (but can’t)

What started off as an investigation into why Java seemed slow on z/OS (was it due to a zFS tuning problem?) changed into: what performance health checks can I do on zFS?

It may be that zFS is so good you do not need to check its status, but I could find no useful guidance on what to check, found that basic reports are not available, and that useful data is missing. I would rather check than assume things are working OK.

Getting the data

Data is available from SMF 92 records. Records are produced on a timer: either the SMF interval broadcast, or the zFS -smf_recording interval.

Data is available from the zFS commands, for example query -reset -usercache.

If you use the display commands, you get the data accumulated since the system was started, or since the last reset was issued.

You may want to have a process to issue the display and reset commands periodically to provide a profile throughout the day. Having data accumulated for a whole day does not allow you to see peaks and troughs.

Some data does not include the duration of the collection (or the reset time), so you cannot directly calculate rates. You might need to save the reset time in a file, and use this to calculate the interval, as sketched below.
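
A hedged sketch of that bookkeeping, in Java (the file name, and the way you obtain the counter, are placeholders): save the time of each reset, and on the next capture divide the counter delta by the interval.

import java.nio.file.Files;
import java.nio.file.Path;
import java.time.Duration;
import java.time.Instant;

public class RateFromReset {
    public static void main(String[] args) throws Exception {
        Path stamp = Path.of("/u/adcd/zfs.reset.time"); // hypothetical file
        Instant now = Instant.now();
        if (Files.exists(stamp)) {
            Instant last = Instant.parse(Files.readString(stamp).trim());
            double seconds = Duration.between(last, now).toMillis() / 1000.0;
            long reads = Long.parseLong(args[0]); // counter taken from the query output
            System.out.printf("interval=%.1fs rate=%.1f reads/sec%n", seconds, reads / seconds);
        }
        Files.writeString(stamp, now.toString()); // remember this reset time
    }
}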

query fsinfo includes the reset time; query metacache, usercache and dircache do not include the reset time.

There is an API, BPX1PCT(“ZFS “, ZFSCALL_STATS, …. This returns the data in a C structure, but z/OS does not seem to provide this as a header file! It provides sample C programs for printing each sort of data. I do not know if the data is cumulative, or since the last reset.

Simple scenario

Consider this simple scenario:

  • I have a web server (Liberty on z/OS), for example z/OSMF, z/OS Connect, or WAS, with people using it.
  • There are people developing a Java application
  • I have a production Java program which runs every hour, reads in data from a file, does some processing, and sends it over HTTP to a monitoring system. This could be reading SMF data and converting it to JSON.

What basic reports did I expect?

The questions below would apply to any work, for example a business transaction using CICS, DB2, MQ and IMS; zFS is just another component within a transaction.

  • When I start my Java application, it sometimes takes much longer to start than at other times – 20 seconds longer. What is causing this? Is it due to delays in reading files, or should I look elsewhere?
    • For each job, I would like to know the total time spent processing files, and identify the files, used by the job, where most time is spent.
  • We had a slowdown last week; can we demonstrate that zFS is not the problem?
  • Do I need to take any actions on zFS?
    • Today – because it is slow
    • Next week – because I can see an increase in disk I/O over the past few weeks.
  • Can I tell which files or file systems are using most of the cache, and what can I do about it?

For each job, I would like to know the total time spent processing files, and identify the files, used by the job, where most time is spent.

This information is not available.

From the SMF 92-11 records you can get some information

  • Job name
  • File name. Some files are given in full, as /u/adcd/j.sh; other files are given as write.c, with no path, just the name used. This is not very helpful, as it means I am unable to identify the specific file used.
  • Time file was opened
  • Time file was closed (so you can calculate the open duration)
  • The number of directory reads. For the file “.” this had 1 read.
  • The number of reads, blocks read, and bytes read
  • The number of writes, blocks written, and bytes written. For example an application did 10,000 writes with a buffer length of 4096; there were 10,003 blocks written and 40,960,000 bytes written.

This information does not tell you how long requests took. An fread() could require data to be read from the file system, or the data may be available in the cache.

You cannot get this information from the zFS commands. You can get other information; for example, I wrote to a file and issued the command fileinfo -path /u/adcd/temp.temp -both, which gave


path: /u/adcd/temp.temp 
owner                S0W1       file seq read           yes 
file seq write       yes        file unscheduled        0 
file pending         625        file segments           625 
file dirty segments  0          file meta issued        0 
file meta pending    0          
 

The data is described here.

  • unscheduled Number of 4K pages in user file cache that need to be written.
  • pending Number of 4K pages being written.
  • segments Number of 64K segments in user cache.
  • dirty segments Number of segments with pages that need to be written.

Given a filename you can query how many segments it has, but I could not find a way of listing the files in the cache. You would have to search the whole tree and query each file to find this, and that operation would significantly impact the metadata cache.

We had a slowdown last week. Can we demonstrate that zFS is not the problem?

You can get information on

  • the number of pages in the various pools
  • the number of reads from the file system, and the number of requests that were satisfied from the cache – the cache hit ratio. A good cache hit ratio is typically over 95%.
  • Steal Invocations tells you if the cache was too small, so pages had to be reused.
  • The I/O activity (number of reads and writes, and number of bytes) by file system.
  • The average I/O wait time by volume.
  • The number of free pages never goes back up, so you can use it to see the highest number of pages in use since zFS started. If it reached 95% full on Monday, it will stay at 95% until restart.

If you compare the problem period with a normal period you should be able to see if the data is significantly different.

You need to decide how granular you want the data, for example capture it every 10 minutes, or every minute.

Do I need to take any actions on ZFS?

Today – because it is slow

Display the key data for the cache and cache hits, and compare the amount of I/O today with a comparable day.

I do not think there are any statistics to tell you by how much to increase the size of the cache. Making the cache bigger may not always help performance: for example, if a program is writing a 1GB file while the cache is below 1GB, it will flood the cache with pages to be written, and read-only pages will have been overwritten.

Next week

You can monitor the number of reads and writes per file, and the number of file system I/Os, but you cannot directly see which files are causing the file system I/O.

If there is a lot of sustained I/O to a file system, you may want to move it to a less heavily used volume, or move subdirectories to a different file system, on a different volume.

There are several caches: the user cache, the metadata cache, the vnode cache, and the log cache. The sizes of these can all be reconfigured, but I cannot see how to tell how full they are, or whether they need to be increased in size.

Can I tell which files or file systems are using most of the cache, and what can I do about it?

The SMF record 92-59 contains the number of pages the file system has in the user cache, and in the meta cache.

The field SMF92FSUS has the number of pages this file system has allocated in the user cache.

The field SMF92FSMT has the number of pages this file system has allocated in the metadata cache.

For 40 file systems, the times the records were created were within 2ms of each other, so you should be able to group records with a similar time stamp: for example, save the data and show the percentage of buffers used per file system, as sketched below.
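
A sketch of that grouping, under stated assumptions: the record class below is a placeholder for whatever your SMF 92-59 formatter extracts, and the 100 ms bucket is generous given the 2ms spread.

import java.util.*;

public class CacheShare {
    // Placeholder for a formatted SMF 92-59 record.
    record Fs(String name, long timestampMillis, long userCachePages) {}

    static void report(List<Fs> records) {
        // Bucket records written at "the same time" into one interval.
        Map<Long, List<Fs>> byInterval = new TreeMap<>();
        for (Fs r : records)
            byInterval.computeIfAbsent(r.timestampMillis() / 100, k -> new ArrayList<>()).add(r);

        // For each interval, show each file system's share of the user cache.
        byInterval.values().forEach(group -> {
            long total = group.stream().mapToLong(Fs::userCachePages).sum();
            if (total == 0) return;
            group.forEach(r -> System.out.printf("%-44s %5.1f%%%n",
                    r.name(), 100.0 * r.userCachePages() / total));
        });
    }
}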

The command fsinfo -full -aggregate ZFS.USERS provides the same information. It gave me

Statistics Reset Time:     May 30 11:09:51 2021 
Status:RW,NS,GF,GD,SE,NE,NC 
Legend: RW=Read-write, GF=Grow failed, GD=AGGRGROW disabled                                  
        NS=Mounted NORWSHARE, SE=Space errors reported, NE=Not encrypted                     
        NC=Not compressed                                                                    
   *** local data from system S0W1 (owner: S0W1) ***                                         
Vnodes:              48              LFS Held Vnodes:         4       
Open Objects:        0               Tokens:                  0       
User Cache 4K Pages: 5011           Metadata Cache 8K Pages: 39      
Application Reads:   11239           Avg. Read Resp. Time:    0.046   
Application Writes:  22730           Avg. Writes Resp. Time:  0.081   
Read XCF Calls:      0               Avg. Rd XCF Resp. Time:  0.000   
Write XCF Calls:     0               Avg. Wr XCF Resp. Time:  0.000   
ENOSPC Errors:       1               Disk IO Errors:          0 

This also showed:

  • there was 1 no-space error
  • Status had
    • GF=Grow failed
    • GD=AGGRGROW disabled
  • There were 48 Vnodes (files) in the meta cache.

It looks like the Application Reads and Writes are true application requests. I had a program which wrote 10,000 4KB records, and the Application Writes increased by 10,002. The reads increased by 23; I think this was due to the running of the program.

The command also gave


VOLSER PAV    Reads      KBytes     Writes     KBytes     Waits    Average           
------ --- ---------- ---------- ---------- ---------- ---------- ---------          
A4USS2   1         55        532       1658      91216         83 0.990              
------ --- ---------- ---------- ---------- ---------- ---------- ---------          
TOTALS             55        532       1658      91216         83 0.990              

The number of writes (to the file system) increased by 630, and the KB written increased by 40,084KB, which is the approximate size of the file (40,000KB).

You can use the command fileinfo -path /u/adcd/aa -both and it will display information about the file system the file is on.

Although you can see how much data was written to the file system, I could not easily find which file it came from. The SMF 92-11 records can give an indication, but writing 10MB to a file and then deleting the file may mean no data is written to disk, so the SMF 92-11 records are not 100% reliable.

How to collect zFS statistics

This blog post is part of a series on the zFS file system on z/OS and how to identify performance problems.

How to collect the statistics data.

You can collect statistics data from zFS using

  • SMF type 92 records
  • Using operator commands. This should not be the normal way of collecting data, as the output is verbose and does not format well
    • You can display accumulated data
    • You can display and reset accumulated data
  • Using a batch/TSO command. You can create output datasets of the information
    • You can display accumulated data
    • You can display and reset accumulated data
  • You can display them in RMF.
  • You can write your own program to extract the records. zFS provides the source of its commands.

SMF

You need to enable SMF collection using the zfsadm command, from batch or Unix System Services. In the JCL below, only the last SET P statement takes effect, so move the statement you want to the bottom.

// SET P='config -smf_recording off' 
// SET P='config -smf_recording on,10' 
// SET P='config -smf_recording on' 
// SET P='configquery -all' 
//AGGRINFO EXEC PGM=IOEZADM,REGION=0M, 
// PARM=('&P') 
//SYSPRINT DD SYSOUT=H 
//STDOUT DD SYSOUT=H 
//STDERR DD SYSOUT=H 
//SYSUDUMP DD SYSOUT=H 
//CEEDUMP DD SYSOUT=H 

You can use

  • configquery -all to display the current configuration
  • config -smf_recording on,10 to produce records every 10 minutes
  • config -smf_recording on to enable SMF recording on the SMF interval broadcast
  • config -smf_recording off to stop the collection of SMF data

You need to check that SMF is configured to collect the SMF 92 records. The operator command d SMF,o shows what is being collected. If it reports NOTYPE(14:19,62:69,92,99), with 92 in the NOTYPE list, then SMF 92 records will not be collected.

You use a standard SMF dump job (IFASMFDP) to copy the SMF data for post processing. I could not find an IBM-provided formatter, so I wrote one.

I could not see how to configure zFS to not produce the SMF 92-11 records on individual file usage. I think you have to disable them in SMF.

Operator command

You can issue a command at the console for example

F OMVS,PFS=ZFS,QUERY,VM

or

f ZFS,QUERY,VM

There is a lot of output, and it does not always format well on the console.

Using OMVS command line

You can use the omvs command zfsadm, for example zfsadm query -iobyaggr to display the data.

You’ll need to issue a command like

zfsadm query -iobyaggr 1>output_file

to capture the output.

Using Batch

I use JCL like this (and move the relevant SET P statement to the bottom of the list as needed):

// SET P='config -smf_recording on,10' 
// SET P='/fileinfo /u/ibmuser      ' 
// SET P='config -smf_recording on' 
// SET P='configquery all' 
// SET P='config -smf_recording off' 
// SET P='query -iobyaggr' 
//AGGRINFO EXEC PGM=IOEZADM,REGION=0M, 
//  PARM=('&P') 
//SYSPRINT DD SYSOUT=H 
//STDOUT DD SYSOUT=H 
//STDERR DD SYSOUT=H 
//SYSUDUMP DD SYSOUT=H 
//CEEDUMP DD SYSOUT=H 

The query command has many options. I think you can only pass parameters via the PARM statement; you cannot pass a list of commands in //SYSIN.

Command interface

For the command interface, the values displayed are accumulated until reset, for example query -reset -iobyaggr.

RMF

I started RMF, then used F RMF,START III to collect additional information.

I used the TSO RMFWDM command (RMF Work Delay Monitor). This gave me the RMF Monitor III Primary Menu.

Selection S SYSPLEX Sysplex reports and Data Index

Selection I ZFSOVW zFS Overview

This gave

                                                                        
                                 ---------- Cache Activity ------------ 
System       -----Wait%------    ---User---    --Vnode---    -Metadata- 
              I/O  Lock Sleep     Rate Hit%     Rate Hit%     Rate Hit% 
                                                                        
S0W1         24.8   1.2   6.7    694.9 98.6    569.3 97.0    743.3 99.6 

This displays the user, vnode, and metadata cache activity: the rate of requests and the cache hit ratio. High (>95%) is good. The rate is the number of get-page requests a second.

If you tab to any of the numbers and press enter, it displays more information, for example

                     zFS Overview - User Cache Details                 
                                                                       
 The following details are available for system S0W1                   
 Press Enter to return to the Report panel.                            
                                                                       
 Size        :       951M         Storage fixed :  NO                  
 Total Pages :       122K                                              
 Free Pages  :      98245                                              
 Segments    :       4694                                              
                                                                       
 --------- Read ---------    --------- Write --------                  
  Rate  Hit%  Dly%  Async     Rate  Hit%  Dly%  Sched     Read%  Dly%  
                     Rate                        Rate                  
 261.3  96.4   0.2  97.44    433.6   100   0.0  7.010      37.6   0.0  
                                                                       
 ---------- Misc -----------                                           
 Page Reclaim Writes :     0                                           
 Fsyncs              :     7                                           

Selection 14 ZFSFS zFS File System (or zff)


                    RMF V2R4   zFS File System  - ADCDPL          Line 55 of 80
 Command ===>                                                  Scroll ===> CSR
                                                                               
 Samples: 100     Systems: 1    Date: 06/01/21  Time: 08.51.40  Range: 100   Sec
                                                                                
 ------ File System Name --------------------              I/O  Resp Read  XCF  
                  System    Owner     Mode    Size Usg%   Rate  Time  %    Rate                                                                                 
 ZFS.S0W1.USR.MAIL                                                              
                  *ALL      S0W1      RW     3600K  4.9  0.000 0.000  0.0 0.000 
 ZFS.S0W1.VAR                                                                   
                  *ALL      S0W1      RW       37M 63.2  265.1 0.033  100 0.000 
 ZFS.S0W1.VARWBEM                                                               
                  *ALL      S0W1      RO      105M 33.8  0.000 0.000  0.0 0.000 

If you put the cursor on any value (except the file name) and press enter, you get more information.

I could not find how to sort the data.

                           zFS File System Details                        
 File System Name : JVB800.ZFS                                            
 Mount                                                                    
 Point :                                                                  
 System : *ALL              Owner : S0W1              Mode : RO           
 -------------- Read -------------    ------------- Write -------------   
 --- Appl --- ---- XCF ----   Aggr    --- Appl --- ---- XCF ----   Aggr   
  Rate   Resp   Rate   Resp   Rate     Rate   Resp   Rate   Resp   Rate   
         Time          Time                   Time          Time          
 112.8  0.191  0.000  0.000  36618    0.000  0.000  0.000  0.000  0.000   
                                                                          
 Vnodes              :   111          USS held vnodes         :    68     
 Open objects        :    47          Tokens                  :     0     
 User cache 4k pages :  9549          Metadata cache 8k pages :   106     
                                                                          
 ENOSPC errors       :     0          Disk I/O error          :     0     
 XCF comm. failures  :     0          Cancelled operations    :     0     

Selection 15 ZFSKN zFS Kernel (zfk)

This gave me


                    RMF V2R4   zFS Kernel       - ADCDPL            Line 1 of 1
 Command ===>                                                  Scroll ===> CSR
 Samples: 100     Systems: 1    Date: 06/01/21  Time: 08.51.40  Range: 100   Sec
                                                                                
 System      - Request Rate -  --- XCF Rate ---  - Response Time -              
 Name         Local   Remote    Local   Remote    Local   Remote                
                                                                                
 S0W1          8599    0.000    0.000    0.000    0.054    0.000                

In all these reports you can use PF10 and PF11 to scroll through time.

Annoyances

With all the output you do not get the duration over which the statistics were collected, so you cannot display rates, for example MB/second to a file system.

If you enable SMF, the first record contains the data accumulated since zFS was started, or since SMF recording was disabled. If you try plotting the values against time, you will get a strange graph.

There is no SMF formatter provided so I’ve written my own.

You cannot pass all of the parameters to IOEZADM through PARM=, as the parameter field is too long, so you have to use PARMDD:

//AGGRINFO EXEC PGM=IOEZADM,REGION=0M,
// PARMDD=PARMS
//PARMS DD *
query -reset -iobyaggregate -iobydasd -knpfs -ctkc
-usercache -iocounts -metacache -dircache -logcache
/*

zFS on z/OS concepts, from a performance perspective

I was looking into a Java performance problem, and thought the problem might be connected to the performance of the Unix files in the zFS file system. I found it hard to find useful information on how to investigate such problems. zFS can produce a lot of information, but I found it hard to know which reports to look at, and what the key fields were.

This blog post gives the overall concept of a cached file system. It is based on my experience of other cached “file” systems; I have no special knowledge about zFS. I hope it explains the concepts, though it may not reflect reality.

Aside:

It reminds me of lectures at university, where they explained that matter was atoms with electrons whizzing around a small, solid nucleus. This was a good picture, but entirely inaccurate. We then learned that the nucleus was composed of protons and neutrons. This was also a good picture, and entirely inaccurate. We then learned that protons and neutrons are composed of quarks, a good picture, but inaccurate. We then got into string theory and got knotted. Whichever picture you used, it helped with the understanding, but was not accurate.

General background of cache systems.

A cached file system is common in IT. DB2 has buffer pools, MQ has buffer pools, and zFS has a cache. The concepts are very similar, and over the years the technology has improved and become efficient. For example, all of the above systems use data in 4KB pages, and the I/O to external media has been optimised.

I like to think of the technology in different layers.

  • The application layer, where the application does an fread(), MQGET or SQL query.
  • The interface layer, which knows which records to get, which MQ message to get, or which tables, rows and columns to get. This layer has a logical view of the data, and will request the next level down: “Please get me the data for this 4KB page on this data set at this position.”
  • The buffer manager layer. The aim of the buffer manager is to keep the optimum amount of data in the cache, and minimise I/O.
    • If the requested 4KB page is in the cache then return it. This counts as a cache hit.
    • If it is not in the cache then call the data layer, and say please read this page from disk at this location, into this buffer. This counts as a cache miss.
    • The buffer manager may have logic which can detect if a file is being read sequentially and perform read ahead. Logic like
      • Read page 19 of the data set, wait for the I/O to complete, return
      • Read page 20 of the data set, wait for the I/O to complete, return
      • Read page 21 of the data set, wait for the I/O to complete, return
      • Read page 22 … Hmm – this looks like a sequential read. Get pages 22 to 30 from the data set, wait for the I/O to complete, return page 22
      • Read page 23 – get it from the buffer and return, no I/O
      • Read page 24 – get it from the buffer and return, no I/O
    • When a page has been updated, it is usually not written directly to disk. It is more efficient to write multiple pages in one I/O, and the application does not have to wait for the I/O. This is often called “write behind” or “lazy write”. When the application has to be sure the write to disk has worked, for example with an fsync() request or a transactional commit, the requester has to wait until the I/O has completed. The write to disk is of a collection of pages, possibly from different applications; it is totally separate from the applications writing the records.
    • If the cache fills up, the buffer manager is responsible for making space. This might mean reusing the space of pages which have not been used for a long time, or, if there are a lot of updated pages, writing them out – or both.
    • If the same file is often used, then the pages should be in the buffer. If a file is used for the first time, it will need to be read in – some pages synchronously by the application, then pages read in by the read ahead processing.
  • The data layer. This does the I/O to the disk or other external media.

What statistics make sense?

The application can time the request and provide a true duration of the request.

At the interface level, one file read request may result in many calls to the buffer manager. The first few “get page” requests may be slow because they had to do I/O to the disk; after read-ahead became active, the reads from the buffer were very fast. “The average get page time” may therefore have little value.

It may be possible to record the number of synchronous disk writes an application did (the fsync() request), but if the write was a lazy write it will not be recorded against the file. If one I/O wrote 10 pages, four pages might be for this application and six pages for that application; recording the duration of the lazy write against each application has no value.

You can tell how many read and write requests there were to a data set (file system), and how long these requests took. You can also record how many bytes were read from or written to a data set.

Overall there may be many statistics that tell you what each level is doing, and how it is performing, but they may not be very helpful when looking from an application viewpoint.

Simple file access example

  • fopen file name
  • fread
  • fwrite
  • fclose

fopen – Under the covers

Conceptually, the fopen may have logic like

zfs_open. This looks up to see if the file has been used before: it looks for the path name in the metadata cache. The metadata cache has information about the file, for example the file owner, the permissions for the owner, the last time the file was read, a pointer to the file system it is on, and its location on the file system.

If the path name is not in the metadata cache, then go to the file system and get the information. To get the information for file /u/colin/doc/myhelp.txt it may have to get the list of files under /u/colin, find where the ‘doc’ directory is, then get information on the files under /u/colin/doc; this has a record for myhelp.txt, with information on where the file is on disk. Set “next page” to where the file is on disk. Each of these steps may need one or more pages to be read from disk.

fread – under the covers

The fread may have logic like

zfs_reads. Within this it may have logic like

  • Get the next page value. Does this page exist in the cache? If so, return the contents; else read it from the file system, store it in the cache, update the next page pointer, and then return the contents.
  • Loop until enough data has been read. As the pages are in 4KB units, reading a 10KB message will need 3 pages.

There are smarts: the code has read-ahead support. If the system detects there has been a sequence of get-next-page requests, instead of “loop until enough data has been read” it can do:

  • Loop until enough data has been read, and start reading the next N pages, ready for the next request from the application.

By the time the application issues the next fread request, the data it needs may already have been read from disk. To the application it looks like there was no file I/O.

There may also be calls to zfs_getattrs and zfs_lookups.
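
A minimal sketch of this read-ahead detection, assuming a toy page cache (my illustration of the concept, not the zFS code):

import java.util.HashMap;
import java.util.Map;

class PageCache {
    static final int READ_AHEAD = 8;           // pages to prefetch
    final Map<Long, byte[]> cache = new HashMap<>();
    long lastPage = -2;                        // page number of the previous request

    byte[] getPage(long page) {
        byte[] data = cache.get(page);
        if (data == null) {                    // cache miss: I/O is needed
            if (page == lastPage + 1) {
                // Sequential pattern detected: read this page and the next
                // few in one I/O, ready for the application's next fread().
                for (long p = page; p < page + READ_AHEAD; p++)
                    cache.put(p, readFromDisk(p));
            } else {
                cache.put(page, readFromDisk(page)); // random read: just this page
            }
            data = cache.get(page);
        }
        lastPage = page;
        return data;
    }

    byte[] readFromDisk(long page) { return new byte[4096]; } // stand-in for real I/O
}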

fwrite – under the covers

The fwrite does not write directly to the disk. It writes some log data, and writes the data to the cache. This is known as “dirty data” because it has been changed. There is an internal process that writes the data out to the file system: writing many pages to the file system in one I/O is more efficient than writing one page per I/O, many times.

Applications can use the fsync() request to force the writes to the disk.

Characteristics of the cached file system

It changes over time

The behaviour of the zFS cache will change over time.

At start up, as files are used, the files will be read from disk into the cache.

Once the system has “warmed up”, the frequently used files will be in the cache, and should not need to be read from disk.

You could IPL the system at 0600, and for the first hour it warms up, and the cache settles down to a steady state for the rest of the day. In the evening, you may start other applications for the overnight processing, and these new applications will have a warm up period, and the cache will reach another steady state.

Data in the cache

Data in the file cache can be

  • Read only, for example a Java program uses .jar files to execute. Some .jar files may be used by many applications and be accessed frequently.
  • Read only application data. For example a list of names.
  • Written application data – for example an output list of names, a trace file, or a log file. For some writes this may be an update, and so the previous contents need to be read in.

Read only jar files

The cache needs to be big enough to hold the files. Once the files have been read in, there may be no more reads from the file system. Any files that have not been used before will need I/O to the file system. If the cache is not big enough, then some of the data will be thrown away and reloaded next time it is needed.

Read only application data

This data may only be used once a day. Typically it will be read in as needed, and once it has been used, its cache storage can be stolen and reused for other applications.

Write application data

If the write updates existing pages, for example writing to the end of a file, or updating a record within the file, then the pages need to be in the cache. This may require disk I/O, or the pages may be in the cache from a previous operation.

If the data is written to an empty page, the page need not be in the cache before it is written to. Once the page has been updated, it will be written out asynchronously, as it is more efficient to write multiple pages in one I/O than to do multiple I/Os with just one page.

File system activity

A program product file system

This will typically be used read only (even though it may be mounted read/write), so you can expect pages read from the data set, but no write I/O.

User’s data

A user will typically read and write files, so there will be both read I/O and write I/O.

For subsystems like the Liberty web server (and so products like z/OS Connect, z/OSMF, and ZOWE), there will be both read and write activity, as configuration information is read, and data is written to logs.

What happens as the cache is used?

Writing to a file

When data is written to a file, the cache gets updated. Modified pages get queued to be written to disk when

  • A segment of 64KB of data for a file is filled up.
  • The application does a fsync() request to say write the file out to disk.
  • The file is closed
  • The zfsadm config -sync_interval n has expired.
  • The cache is very full of updated (dirty) pages – the so-called Page Reclaim Writes.

Reading a file

When a page has been used, it gets put on a queue, with the most recently used pages at the front. A hot page (with frequent use) will always be near the front of the queue. If all the pages have data, and the buffer pool needs a buffer page, then the oldest page on this queue is stolen (or reclaimed) for the new request.

Ideally the buffer pool needs to be big enough so there is always unused space.

If you have a cache of size 100 pages and read a 50 page file, it will occupy 50 pages in the cache. The first time the file is used data will have to be read from disk. The second time the file is used, all the pages are in memory and there is no disk I/O.

If the cache is only 40 pages, then the first 40 pages of data will fill the cache. When page 41 is read it will replace the buffer with page 1 in it (the oldest page). When page 42 is read, it will replace the buffer with page 2 in it.

If you now read the file a second time – page 1 is no longer in the cache, so it will need to be read from disk, and will replace a buffer. All 40 pages will be read from disk.

Will making the cache bigger help? If you make the cache 45 pages, it will have the same problem. If you make it 50 pages, the file will just fit – and may still have a problem. If the cache is bigger than 50 pages, the file should fit in – but other applications may be using other files, so you need to make the cache big enough for the 50 page file, plus any other files being used. There is nothing to tell you how big to make it. The solution seems to be to make the cache bigger until the I/O stops (or reduces) and you have 5-10% free pages. If you make the cache very large it might cause paging, so you have a balancing act. It is more important to have no paging, as paging makes it difficult for the buffer manager to manage. (For example, it wants to write out a dirty page; it may first need to page the data in, then write it out!)
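
The steal logic is essentially a least recently used queue. In Java, a LinkedHashMap in access order gives a compact sketch of a 40-page cache like the one described above (my illustration, not the zFS code):

import java.util.LinkedHashMap;
import java.util.Map;

// A 40-page cache: when a new page arrives and the cache is full,
// the least recently used page is stolen (reclaimed).
class LruPageCache extends LinkedHashMap<Long, byte[]> {
    static final int CAPACITY = 40;

    LruPageCache() {
        super(16, 0.75f, true); // true = order entries by access, not insertion
    }

    @Override
    protected boolean removeEldestEntry(Map.Entry<Long, byte[]> eldest) {
        return size() > CAPACITY; // steal the oldest page
    }
}

Reading a 50-page file through this cache twice gives a miss on every page, exactly as described above.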

Reusing pages

A page cannot be stolen (reused) while it still needs to be written to disk. Once the contents have been written to disk the page can be stolen.

In reality it looks as if 64KB segments are used, rather than individual pages.

There is a VM statistic called Steal Invocations. This is a count of the number of 64KB blocks which were reused.

Overall performance objective

The cache needs to be big enough to keep frequently used files in the cache. If the cache is not big enough, it has to do more work to make space: for example discarding pages, and reading files back in from disk.

The system provides statistics on

  • How big the cache is
  • How many free pages there are
  • How many segments have been stolen (should be zero)
  • How many read requests were met from data in the cache (Cache hit), and so by calculation the number of requests that were not in the cache (cache miss), and required disk I/O.

Typically you will not achieve a cache hit of 100% because the application data may not be hot.

A little whoops

I had a little whoops. I wrote to a file, and filled the file system. When I deleted the file, and tried again, it reported there was no space on the device. When I waited for a few seconds, and repeated the command, it worked! This shows there are background tasks running asynchronously which clean up after a file has been used.

Just to make it more complex

  • ZFS uses 8KB as its “page” which is 2 * 4K pages on disk.
  • Small files live in the meta data, and not in the user cache!
  • There is also a Directory Backing Cache also known as Metadata Backing Cache. This seems to be a cache for the meta data, which doesn’t have the same locking. It is described in a Share presentation, zFS Diagnosis I: Performance Monitoring and Tuning Guidelines from 2012. It looks from the more recent documentation as if this has been rolled into the meta cache.

Sysplex support

The sysplex support makes it just a little more complex.

The ZFS support behaves like a client-server model.

One LPAR has the file system mounted read/write, acting as the server. Other systems act as clients.

If SYSA has the file system mounted Read Write, and SYSB wants to access a file, it sends a request through XCF. The access is managed by use of Tokens, and a Token Cache.

If you display KNPFS (Kernel Nodes Physical File System?) you get operations such as zfs_open

  • On Owner. On my single LPAR sysplex, I get values here
  • On Client. These are all zeros for me.

SMF 92-51 provides statistics on the zfs verbs such as zfs_open

  • Count of calls made to file systems owned locally or R/O file systems
  • Count of calls that required a transmit to another sysplex member to complete for locally-owned file systems.
  • Count of calls made to file systems owned remotely from this member.
  • Count of calls that required a transmit to another sysplex member to complete for remotely-owned file systems.
  • Average number of microseconds per call for locally-owned file systems.
  • Average number of microseconds per call for remotely-owned file systems

Configuration

The ZFS configuration is driven from the SYS1.PARMLIB(BPXPRM00) member, with

FILESYSTYPE TYPE(ZFS) ENTRYPOINT(IOEFSCM)

This can have PRM=(aa, bb, …, zz) pointing to SYS1.PARMLIB(IOEPRMaa) and so on. It defaults to parmlib member IOEPRM00. See here for the contents.
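
For example, a sketch of the syntax (I have not tested this, and the suffixes 01 and 02 are just illustrative) to have zFS read IOEPRM01 and then IOEPRM02:

FILESYSTYPE TYPE(ZFS) ENTRYPOINT(IOEFSCM) PARM('PRM=(01,02)')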

Some of the mysteries of Java shared classes

When Java executes a program it reads in the jar file, breaks it into the individual classes, converts the byte codes into instructions, and, while executing, may replace instructions with more efficient ones (JITting). It can also convert the byte codes into instructions ahead of time, so-called Ahead Of Time (AOT) compilation.

With shared classes, the converted byte codes, any JITted code, and any AOT code can be saved in a data space.

  • When the Java program runs a second time, it can reuse the data in the data space, avoiding the overhead of reading the jar file from the file system and converting the byte codes into instructions.
  • The data space can be hardened to a file, and restored to a data space, so it can be used across system IPLs.

Using this reduced the start-up time of my program by over 20 seconds on my slow zPDT system. The default size of the cache is 16MB – one of my applications needed 100MB, so most of the benefits of the shared classes could not be exploited if the defaults were used.

This blog post describes more information about this, and what tuning you can do.

You have to use the same level of Java to display information as was used by the program. For example

export PATH="/usr/lpp/java/new/J8.0_64/bin:$PATH"
/usr/lpp/java/new/J8.0_64/bin/java -Xshareclasses:cacheDir=/tmp,printStats,cache=rseapi

If the levels are different it may just report JVMSHRC005I No shared class caches available.

Issuing commands to manage the shared classes cache

What commands are there?

The command

/usr/lpp/java/J17.0_64/bin/java -Xshareclasses:help 

gives 100 lines of help including the syntax.

This help text includes

Related command line options: 

-Xscmx<x> set size (or soft max size if option -XX:SharedCacheHardLimit= is present)
of new shared class cache to <x>
-Xscdmx<x> set size of shared class cache debug attribute area to <x>
-Xscminaot<x> set minimum shared classes cache space reserved for AOT data to <x>
-Xscmaxaot<x> set maximum shared classes cache space allowed for AOT data to <x>
-Xscminjitdata<x> set minimum shared classes cache space reserved for JIT data to <x>
-Xscmaxjitdata<x> set maximum shared classes cache space allowed for JIT data to <x>
-Xzero:nosharebootzip do not share bootstrap jar entry caches in the shared class cache
-XX:SharedCacheHardLimit=<x> set size of new shared class cache to <x>
-XX:ShareClassesEnableBCI create shared class cache with support for byte-code instrumentation
-XX:ShareClassesDisableBCI create shared class cache without support for byte-code instrumentation
-XX:+ShareAnonymousClasses enable storing and finding anonymous classes in the shared class cache
-XX:-ShareAnonymousClasses disable storing and finding anonymous classes in the shared class cache
-XX:+ShareUnsafeClasses enable storing and finding non-anonymous unsafe classes in the shared class cache
-XX:-ShareUnsafeClasses disable storing and finding non-anonymous unsafe classes in the shared class cache
-XX:+UseGCStartupHints enable storing and finding GC startup hints in the shared class cache
-XX:-UseGCStartupHints disable storing and finding GC startup hints in the shared class cache

Commands to manage the shared classes cache are issued like

java -Xshareclasses:cacheDir=/tmp,name=client6,printStats

which can be done using JCL

// SET V='listAllCaches'
// SET V='printStats'
// SET C='/tmp'
// SET N='client6'
//S1 EXEC PGM=BPXBATCH,REGION=0M,
// PARM='SH java -Xshareclasses:cacheDir=&C,name=&N,verbose,&V'
//STDERR DD SYSOUT=*
//STDOUT DD SYSOUT=*

Enabling share classes

You specify the -Xshareclasses information as a parameter to the program, for example on the command line or in a jvm.options file.

To use the shared classes capability you have to specify all of the parameters on one line, like

-Xshareclasses:verbose,name=client6,cacheDirPerm=0777,cacheDir=/tmp

Having it like

-Xshareclasses:name=client6,cacheDirPerm=0777,cacheDir=/tmp
-Xshareclasses:verbose

means the name etc. all take their defaults; only the -Xshareclasses:verbose string would be used.

Changing share classes parameters

You can have more than one cache; you give each one a name. You specify a directory where an image is stored when the cache is hardened to disk.

Some of the options, like name= and cacheDir=, are picked up when the JVM starts. Other parameters, like cacheDirPerm, are only used when the cache is (re-)created.

You can delete the cache in two ways.

Delete the cache from your Java program

When you are playing around, you can add reset to the end of the -Xshareclasses string to cause the cache to be deleted and recreated. This gives output like

JVMSHRC010I Shared cache “client6” is destroyed
JVMSHRC158I Created shared class cache “client6”
JVMSHRC166I Attached to cache “client6”, size=20971328 bytes

This was especially useful when tuning the storage allocations.
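
For example, with the options used earlier:

-Xshareclasses:verbose,name=client6,cacheDirPerm=0777,cacheDir=/tmp,reset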

Delete the cache independently

java -Xshareclasses:cacheDir=/tmp,name=client6,destroy

How to allocate the size of the cache

You specify the storage allocations using the -Xsc… options (where sc stands for shared classes).

If you have -Xshareclasses:verbose… specified, then when the JVM shuts down you get

JVMSHRC168I Total shared class bytes read=11660. Total bytes stored=5815522
JVMSHRC818I Total unstored bytes due to the setting of shared cache soft max is 0.
Unstored AOT bytes due to the setting of -Xscmaxaot is 1139078.
Unstored JIT bytes due to the setting of -Xscmaxjitdata is 131832.

This shows the values of maxaot and maxjitdata were too small. They were

-Xscmx20m
-Xscmaxaot2k
-Xscmaxjitdata2k

When the values were big enough I got

JVMSHRC168I Total shared class bytes read=12960204. Total bytes stored=8885038
JVMSHRC818I Total unstored bytes due to the setting of shared cache soft max is 0.
Unstored AOT bytes due to the setting of -Xscmaxaot is 0.
Unstored JIT bytes due to the setting of -Xscmaxjitdata is 0.

How big a cache do I need?

If you use -Xshareclasses:verbose… it will display messages, for example

JVMSHRC166I Attached to cache “client6”, size=2096960 bytes
JVMSHRC269I The system does not support memory page protection

JVMSHRC096I Shared cache “client6” is full. Use -Xscmx to set cache size.
JVMSHRC168I Total shared class bytes read=77208. Total bytes stored=2038042

Message JVMSHRC096I Shared cache “client6” is full. Use -Xscmx to set cache size tells you the cache is full – but not how big it needs to be.

You can use

java -Xshareclasses:cacheDir=/tmp,name=client6,printStats

to display statistics like

[-Xshareclasses persistent cache disabled]
[-Xshareclasses verbose output enabled]                                            
JVMSHRC159I Opened shared class cache "client6"                                    
JVMSHRC166I Attached to cache "client6", size=2096960 bytes                        
JVMSHRC269I The system does not support memory page protection                     
JVMSHRC096I Shared cache "client6" is full. Use -Xscmx to set cache size.          
                                                                                   
Current statistics for cache "client6": 
cache size                           = 2096592                       
softmx bytes                         = 2096592                       
free bytes                           = 0                             
ROMClass bytes                       = 766804                        
AOT bytes                            = 6992                          
Reserved space for AOT bytes         = -1                            
Maximum space for AOT bytes          = 1048576                       
JIT data bytes                       = 212                           
Reserved space for JIT data bytes    = -1                            
Maximum space for JIT data bytes     = 1048576                       
Zip cache bytes                      = 1131864                       
Startup hint bytes                   = 0                             
Data bytes                           = 13904                         
Metadata bytes                       = 12976                         
Metadata % used                      = 0%                            
Class debug area size                = 163840                        
Class debug area used bytes          = 119194                        
Class debug area % used              = 72%

Cache is 100% full  
                                                                             

This shows the cache is 100% full, and how much space is used for AOT and JIT data. The default value of -Xscmx I had was almost 16MB. I made it 200MB and this was large enough.
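
That is, in the jvm.options:

-Xscmx200m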

I could not find a way of getting my program to issue printStats.

How do I harden the cache?

You can use the

java -Xshareclasses:cacheDir=/tmp,name=zosmf,verbose,snapshotCache

command to snapshot the cache to a file on disk. Afterwards the listAllCaches command gave

Cache name level        cache-type     feature 
client6    Java8 64-bit non-persistent cr
client6    Java8 64-bit snapshot       cr

Showing the non-persistent data space, and the snapshot file.

You can use restoreFromSnapshot to restore from the file to the data cache, before you start your Java program. You would typically do this after an IPL.
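
This follows the same pattern as the snapshotCache command above, for example:

java -Xshareclasses:cacheDir=/tmp,name=zosmf,verbose,restoreFromSnapshot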

How can I tell what is going on and if shared classes is being used?

The Java option “-verbose:dynload,class”

reports on the

  • dynamic loading of the files, and processing them,
  • what classes are being processed.

For example

<Loaded java/lang/reflect/AnnotatedElement from /Z24A/usr/lpp/java/J8.0_64/lib/rt.jar>
< Class size 3416; ROM size 2672; debug size 0>
< Read time 1196 usec; Load time 330 usec; Translate time 1541 usec>
class load: java/lang/reflect/AnnotatedElement from: /Z24A/usr/lpp/java/J8.0_64/lib/rt.jar
class load: java/lang/reflect/GenericDeclaration from: /Z24A/usr/lpp/java/J8.0_64/lib/rt.jar

dynload gave

<Loaded java/lang/reflect/AnnotatedElement from /Z24A/usr/lpp/java/J8.0_64/lib/rt.jar>
< Class size 3416; ROM size 2672; debug size 0>
< Read time 1196 usec; Load time 330 usec; Translate time 1541 usec>

This tells you a jar file was read from the file system, and how long it took to process it.

class gave

class load: java/lang/reflect/AnnotatedElement from: /Z24A/usr/lpp/java/J8.0_64/lib/rt.jar
class load: java/lang/reflect/GenericDeclaration from: /Z24A/usr/lpp/java/J8.0_64/lib/rt.jar

This shows two classes were extracted from the jar file.

In a perfect system you will get the class load entries, but not the <Loaded… entries.

Even when I had a very large cache size, I still got dynload entries. These tended to be loading class files rather than jar files.

For example there was a dynload entry for com/ibm/tcp/ipsec/CaApplicationProperties. This was file /usr/lpp/zosmf./installableApps/izuCA.ear/izuCA.war/WEB-INF/classes/com/ibm/tcp/ipsec/CaApplicationProperties.class

If you can package these into a .jar file you may get better performance (but you may not, as it may take more time to load a large jar file). See the example below.
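
For example, a hypothetical repackaging with the jar tool (the jar name is made up, the directory is from the dynload entry above, and the new jar would also need to be added to the class path):

jar cf caClasses.jar -C /usr/lpp/zosmf/installableApps/izuCA.ear/izuCA.war/WEB-INF/classes .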

I also noticed that there was dynload for com/ibm/xml/crypto/IBMXMLCryptoProvider which is in /Z24A/usr/lpp/java/J8.0_64/lib/ext/ibmxmlcrypto.jar, so shared classes has some deeper mysteries!

What happens if the .jar file changes?

As part of the class load, it checks the signature of the file on disk and matches it against the signature in the data space. If they are different the data space will be updated.

“Why were the options I passed to Java ignored” – or “how to tell what options were passed to my Java?”

I was struggling to understand a problem with shared classes in Java and I found the options being used by my program were not as I expected.

I thought it would be a very simple task to display the options used at start-up. It may be, but I could not find how to do it. If anyone knows the simple answer please tell me.

I found one way – take a dump! This seems a little extreme, but it was all I could find. With Liberty you can take a javacore dump (F IZUSVR1,JAVACORE) and display it, or you can take a dump at start up.

In the jvm.options I specified

-Xdump:java:events=vmstart

This gave me in //STDERR

JVMDUMP039I Processing dump event “vmstart”, detail “” at 2021/05/20 13:19:06 – please wait.
JVMDUMP032I JVM requested Java dump using ‘/S0W1/var/…/javacore.20210520.131906.65569.0001.txt’
JVMDUMP010I Java dump written to /S0W1/var…/javacore.20210520.131906.65569.0001.txt
JVMDUMP013I Processed dump event “vmstart”, detail “”.

In this file was a list of all the options passed to the JVM

1CIUSERARGS UserArgs:
2CIUSERARG -Xoptionsfile=/usr/lpp/java/J8.0_64/lib/s390x/compressedrefs/options.default
2CIUSERARG -Xlockword:mode=default,noLockword=java/lang/String,noLockword=java/util/Ma
2CIUSERARG -Xjcl:jclse29
2CIUSERARG -Djava.home=/usr/lpp/java/J8.0_64
2CIUSERARG -Djava.ext.dirs=/usr/lpp/java/J8.0_64/lib/ext
2CIUSERARG -Xthr:tw=HEAVY
2CIUSERARG -Xshareclasses:name=liberty-%u,nonfatal,cacheDirPerm=1000,cacheDir=…
2CIUSERARG -XX:ShareClassesEnableBCI
2CIUSERARG -Xscmx60m
2CIUSERARG -Xscmaxaot4m
2CIUSERARG -Xdump:java:events=vmstart
2CIUSERARG -Xscminjitdata5m
2CIUSERARG -Xshareclasses:nonFatal
2CIUSERARG -Xshareclasses:groupAccess
2CIUSERARG -Xshareclasses:cacheDirPerm=0777
2CIUSERARG -Xshareclasses:cacheDir=/tmp,name=zosmf2
2CIUSERARG -Xshareclasses:verbose
2CIUSERARG -Xscmx100m

the storage limits

1CIUSERLIMITS User Limits (in bytes except for NOFILE and NPROC)
NULL ------------------------------------------------------------------------
NULL         type            soft limit  hard limit
2CIUSERLIMIT RLIMIT_AS       1831837696   unlimited
2CIUSERLIMIT RLIMIT_CORE        4194304     4194304
2CIUSERLIMIT RLIMIT_CPU       unlimited   unlimited
2CIUSERLIMIT RLIMIT_DATA      unlimited   unlimited
2CIUSERLIMIT RLIMIT_FSIZE     unlimited   unlimited
2CIUSERLIMIT RLIMIT_NOFILE        10000       10000
2CIUSERLIMIT RLIMIT_STACK     unlimited   unlimited
2CIUSERLIMIT RLIMIT_MEMLIMIT 4294967296  4294967296

and environment variables used.

1CIENVVARS Environment Variables
2CIENVVAR LOGNAME=IZUSVR
2CIENVVAR _CEE_ENVFILE_S=DD:STDENV
2CIENVVAR _EDC_ADD_ERRNO2=1
2CIENVVAR _EDC_PTHREAD_YIELD=-2
2CIENVVAR WLP_SKIP_MAXPERMSIZE=true
2CIENVVAR _BPX_SHAREAS=NO
2CIENVVAR JAVA_HOME=/usr/lpp/java/J8.0_64

All interesting stuff, including the -X… parameters. I could see that the parameters I had specified were not being picked up, because they were specified higher up! Another face-palm moment.

There was a lot more interesting stuff in the file, but this was not relevant to my little problems.

Once z/OSMF was active I took a dump using the f izusvr1,javacore command and looked at the information on the shared classes cache

1SCLTEXTCMST Cache Memory Status
1SCLTEXTCNTD Cache Name      Cache path
2SCLTEXTCMDT sharedcc_IZUSVR /tmp/javasharedresources/..._IZUSVR_G37
2SCLTEXTCPF Cache is 85% full
2SCLTEXTCSZ Cache size = 104857040
2SCLTEXTSMB Softmx bytes = 104857040
2SCLTEXTFRB Free bytes = 14936416

This is where I found the shared cache was not what I was expecting! I originally spotted that the cache was too small – and made it bigger.

Afterwards…

Remember to delete the javacore files.

I removed the -Xdump:java:events=vmstart statement, because I found it more convenient to use the f izusvr1,javacore command to take a dump when needed.

Looking at the performance of Liberty products and what you can do with the information

The Liberty Web Server on z/OS can produce information on the performance of “transactions”. It may not be as you expect, and it may not be worth collecting it. I started looking into this to see why certain transactions were slow on z/OSMF.

This blog post covers

  • How does the web server work?
  • How can I see what is going on?
  • Access logs
  • SMF 120
  • WLM
  • Do I want to use this in production?

How does it work?

A typical Web Server transaction might use a database to debit your account and credit my account. I think of this as a fat transaction because the application does a lot of work.

z/OSMF runs on top of Liberty; it allows you to run SDSF and ISPF within a web browser, and has tools to help automate work. You can also use REST services: you can send a self-contained HTTP request to z/OSMF, for example to request information about data sets, or about jobs running on the system. Both of these send a request to z/OSMF, which might send a request to a TSO userid, get the response back, and pass the response back to the requester. I think of this as a thin transaction because the application running on the web server is mainly acting as a router or broker. What the end user sees as a “transaction” may be many micro services – each of which is a REST request.
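
For example, one thin “transaction” can be driven with a single self-contained request. A sketch using curl (the host, port and userid are the ones used elsewhere in this post; /zosmf/restjobs/jobs is the z/OSMF REST jobs interface, and z/OSMF requires the X-CSRF-ZOSMF-HEADER header):

curl -k -u colin:password -H "X-CSRF-ZOSMF-HEADER: Y" \
  "https://10.1.1.2:10443/zosmf/restjobs/jobs?owner=COLIN"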

How can I see what is going on?

You can see what “transactions” or requests have been issued from

  • the access log – a flat file of requests
  • SMF 120 records
  • WLM reports.

Access logs

In the <httpEndpoint … accessLoggingRef="accessLogging"…> statement, accessLogging is a reference to an <httpAccessLogging…> statement. This creates a flat file containing a record of all inbound HTTP client requests. If you have multiple httpEndpoint statements, you can have multiple accessLogging files. You can control the number and size of the files.

This has information with fields like

  • 10.1.0.2 – the IP address the request came in from
  • COLIN – the userid
  • [16/May/2021:14:47:40 +0000] – the date, time and time zone of the event
  • PUT /zosmf/tsoApp/tso/ping/COLIN-69-aabuaaac – the request
  • HTTP/1.1 – the type of HTTP request
  • 200 – the HTTP return code
  • 78 – the number of bytes sent back.

You can have multiple httpEndpoint definitions, for example specifying different IP address and port combinations. These definitions can point to a shared or individual httpAccessLogging statement, and so can share (or not) the flat files. This allows you to specify that one port uses access logging, and another port does not.
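
A sketch of how these definitions might fit together (the id values, port numbers, file path, and file limits are illustrative, not from a working configuration):

<httpAccessLogging id="accessLogging"
                   filePath="${server.output.dir}/logs/http_access.log"
                   maxFiles="4" maxFileSize="20"/>
<httpEndpoint id="defaultHttpEndpoint" host="*"
              httpPort="10080" httpsPort="10443"
              accessLoggingRef="accessLogging"/>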

The SMF 120 records

The request logging writes a record to SMF for each request.

<server>
  <featureManager>
    <feature>zosRequestLogging-1.0</feature>
  </featureManager>
</server>

I have not been able to find a formatter for these records from IBM, so I have written my own; it is available on GitHub.

It produces output like

Server version      :2                                                              
System              :S0W1                                                           
Syplex              :ADCDPL                                                         
Server job id       :STC02771                                                       
Server job name     :IZUSVR1                                                        
config dir          :/var/zosmf/configuration/servers/zosmfServer/                  
codelevel           :19.0.0.3  
                                                     
Start time          :2021/05/16 14:42:33.955288                                     
Stop  time          :2021/05/16 14:42:34.040698                                     
Duration in secs    : 0.085410                                                      
WLMTRan             :ZCI4                                                           
Enclave CPU time    : 0.042570                                                      
Enclave ZIIP time   : 0.042570                                                      
WLM Resp time ratio :10.000000                                                      
userid long         :COLIN                                                          
URI                 :/zosmf/IzuUICommon/externalfiles/sdsf/index.html               
CPU Used Total      : 0.040111                                                      
CPU Used on CP      : 0.000000                                                      
CPU Delta Z**P      : 0.040111 
                                                     
WLM Classify type   :URI                                                            
WLM Classify by     :/zosmf/IzuUICommon/externalfiles/sdsf/index.html
WLM Classify type   :Target Host                            
WLM Classify by     :10.1.1.2                               
WLM Classify type   :Target Port                            
WLM Classify by     :10443   
                               
Response bytes      :7003                                   

Origin port         :57706                                  
Origin              :10.1.0.2                                              

This has information about

  • The Liberty instance, where it is running and configuration information
  • Information about the transaction, start time, stop time, duration and CPU time used
  • WLM information about the request. It was classified as
    • URI:…index.html
    • Target Host:10.1.1.2
    • Target port:10443
  • 7003 bytes were sent to the requester
  • the request came from 10.1.0.2 port 57706

The SMF formatter program also summarises the records and this shows there are records for

  • /zosmf/IzuUICommon/externalfiles/sdsf/js/ui/theme/images/zosXcfsCommand_enabled_30x30.png
  • /zosmf/IzuUICommon/externalfiles/sdsf/js/ui/theme/sdsf.css
  • /zosmf/IzuUICommon/externalfiles/sdsf/sdsf.js
  • /zosmf/IzuUICommon/externalfiles/sdsf/IzugDojoCommon.js
  • /zosmf/IzuUICommon/persistence/user/com.ibm.zos.sdsf/JOBS/JOBS

This shows there is a record for each part of the web page: icons, JavaScript files, and style sheets.

Starting up SDSF within z/OSMF created 150 SMF records! Refreshing the data just created 1 SMF record. The overhead of creating all the SMF records for one “business transaction” may be too high for production use.

As far as I can tell this configuration is server wide. You cannot enable it for a specific IP address and port combination.

WLM reports

Much of the data produced in the records above can be passed to WLM. This can be used to give threads appropriate priorities, and can produce reports.

You enable WLM support using

    • <feature>zosWlm-1.0 </feature>
  • </featureManager>
  • <zosWorkloadManager collectionName=”MOPZCET”/>
    • <httpClassification transactionClass=”ZCI6″ resource=”/zosmf/desktop/”/>
    • <httpClassification transactionClass=”ZCI1″ resource=”/**/sdsf/**/*”/>
    • <httpClassification transactionClass=”ZCI3″ resource=”/zosmf/webispf//”/>
    • <httpClassification transactionClass=”ZCI4″ resource=”/**/*”/>
    • <httpClassification transactionClass=”ZCI2″ resource=”IZUGHTTP” />
    • <httpClassification transactionClass=”ZCI5″ port=”10443″/>
  • </wlmClassification>

Here the httpClassification maps a z/OSMF resource to an 8-character transaction class. The records are processed from the top down until there is a match. For example, port=10443 would never be used, because the generic resource="/**/*" definition above it matches first.

These need to be mapped into the WLM configuration…

WLM configuration

You can configure WLM through the WLM configuration panels.

option 6. Classification rules.

option 3 to modify CB (Component Broker)

          -------Qualifier--------                 -------Class--------  
 Action   Type      Name     Start                  Service     Report   
                                          DEFAULTS: STCMDM      TCI2
  ____  1 CN        MOPZCET  ___                    ________    THRU
  ____  2   TC        ZCI1     ___                  SCI1        RCI1
  ____  2   TC        ZCI2     ___                  SCI2        RCI2
  ____  2   TC        ZCI3     ___                  SCI3        RCI3
  ____  2   TC        ZCI4     ___                  SCI4        RCI4
  ____  2   TC        ZCI5     ___                  SCI5        RCI5
  ____  2   TC        ZCI6     ___                  SCI6        RCI6
  ____  2   TC        THRU     ___                  THRU        THRU

For the Type=CN, Name=MOPZCET row, the name ties up with the <zosWorkloadManager collectionName="MOPZCET"/> statement above. Type=CN is for Collection Name.
For the subtype 2, Type=TC, Name=ZCI4 rows, TC is for Transaction Class, which ties up with an httpClassification transactionClass statement.

The service classes SCI* control the priority of the work; the report classes RCI* allow you to produce a report under that name.

If you make a change to the WLM configuration you can save it from the front page of the WLM configuration panels, Utilities-> 1. Install definition, and activate it using Utilities-> 3. Activate service policy.

If you change the statements in Liberty or z/OSMF I believe you have to restart the server.

How to capture the data

The data is written out to SMF records on a timer, or on the SMF end-of-interval broadcast. If you change the interval, SMF sends an end-of-interval broadcast and the records are written to SMF. For example, on my dedicated test system I use the operator command setsmf intval(10) to change the interval to 10 minutes. After the next test, I use setsmf intval(15), etc.

The data is kept in SMF buffers, and you may have to wait for a while before the data is written out to external storage. If SMF data is being produced on a regular basis, it will be flushed out.

How to report the data

I copy the SMF data to a temporary data set

//IBMURMF  JOB 1,MSGCLASS=H,RESTART=POST 
//* DUMP THE SMF DATASETS 
// SET SMFPDS=SYS1.S0W1.MAN1 
// SET SMFSDS=SYS1.S0W1.MAN2 
//* 
//SMFDUMP  EXEC PGM=IFASMFDP 
//DUMPINA  DD   DSN=&SMFPDS,DISP=SHR,AMP=('BUFSP=65536') 
//DUMPINB  DD   DSN=&SMFSDS,DISP=SHR,AMP=('BUFSP=65536') 
//* MPOUT  DD   DISP=(NEW,PASS),DSN=&RMF,SPACE=(CYL,(1,1)) 
//DUMPOUT  DD   DISP=SHR,DSN=IBMUSER.RMF,SPACE=(CYL,(1,1)) 
//SYSPRINT DD   SYSOUT=* 
//SYSIN  DD * 
  INDD(DUMPINA,OPTIONS(DUMP)) 
  INDD(DUMPINB,OPTIONS(DUMP)) 
  OUTDD(DUMPOUT,TYPE(70:79)) 
  DATE(2020316,2021284) 
  START(1539) 
  END(2359) 
/* 

and display the report classes

//POST EXEC PGM=ERBRMFPP 
//* PINPUT DD DISP=SHR,DSN=*.SMFDUMP.DUMPOUT 
//MFPINPUT DD DISP=SHR,DSN=IBMUSER.RMF 
//SYSIN DD * 
SYSRPTS(WLMGL(RCPER)) 
/* 

The output in //PPXSRPTS was

REPORT CLASS=RCI4                                                      
-TRANSACTIONS--  TRANS-TIME HHH.MM.SS.FFFFFF  
AVG        0.01  ACTUAL                69449  
MPL        0.01  EXECUTION             68780  
ENDED       160  QUEUED                  668  
END/S      0.13  R/S AFFIN                 0  
#SWAPS        0  INELIGIBLE                0  
EXCTD         0  CONVERSION                0  
                 STD DEV              270428  
                                              
----SERVICE----   SERVICE TIME  ---APPL %---  
IOC           0   CPU    2.977  CP      0.05  
CPU        2551   SRB    0.000  IIPCP   0.05  
MSO           0   RCT    0.000  IIP     0.19  
SRB           0   IIT    0.000  AAPCP   0.00  
TOT        2551   HST    0.000  AAP      N/A  
/SEC          2   IIP    2.338                
ABSRPTN     232   AAP      N/A                
TRX SERV    232                               
                                              

There were 160 “transactions” within the time period, or 0.13 per second (which implies a measurement interval of roughly 20 minutes). The average response time was 69449 microseconds, with a standard deviation of 270428 microseconds. This is a very wide standard deviation, so there was a mixture of long and short response times.

The total CPU for this report class was 2.977 seconds, or 2.977/160 ≈ 0.019 seconds of CPU per “transaction”.

Do I want to use this in production?

I think the amount of data produced is manageable for a low-usage system. For a production environment with a lot of activity, the amount of data produced, and the cost of producing it, may be excessive. This could be an example where the cost of collecting the data is much larger than the cost of running the workload.

As z/OSMF acts as a broker, passing requests between end points, you may wish to just use your existing reporting structures.

I used z/OSMF to display SDSF data, and to set up an ISPF session within the web browser. This created two TSO tasks for my userid. If you include my traditional TSO session and the two from z/OSMF, that is three TSO sessions in total running on the LPAR.