Avoiding I/O by caching your PDSEs (It might not be worth it)

When you use most PDS data sets, the data has to be read from disk each time. (The exception is data sets in the Linklist LookAside (LLA), which do get cached.) One of the benefits of using a PDSE is that the data set can be cached in a Hiperspace in z/OS memory. This blog post explains the setup needed to get your PDSEs cached in z/OS. There is a Redbook, Partitioned Data Set Extended Usage Guide, SG24-6106-01, which covers this topic.

A C program I am working on takes about 8 seconds to compile in batch, and spends less than half a second doing I/O, so caching your PDSEs may not give you much benefit. You should try it yourself, as your mileage may vary.


The caching of information for PDSEs is done in the SMSPDSE component of SMS.

You can have two address spaces for caching PDSE data sets:

  1. SMSPDSE caches the directory of PDSE data sets.  It also caches PDSEs that are in the LNKLST.  SMSPDSE is configured using the parmlib concatenation member IGDSMSxx.  If you want to change the configuration you have to re-IPL.
  2. SMSPDSE1. This is used to cache other eligible PDSEs.   SMSPDSE1 is also configured using the parmlib concatenation member IGDSMSxx. You can issue a command to restart this address space and pick up any parameter changes – this is why it is known as the restartable address space.

It is easy to create the SMSPDSE1 address space.  It is described here.

Making PDSE data sets eligible for caching.

It is more complex than just setting a switch on a data set.

The Storage Class controls whether a PDSE is eligible for caching.  Eligibility is controlled by the Direct MilliSecond Response time (that is, the response time in milliseconds of direct (non-sequential) requests).  If you use ISMF to display the Storage Classes, one of the fields is the Direct MSR.  The documentation says that if the MSR is < 9 the value is “must cache”, 10-998 is “may cache”, and 999 is “never cache”.  I only got caching if the MSR was <= 9.
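The documented MSR thresholds can be summarised in a small sketch (illustrative Python, not part of any z/OS tooling); note the boundary here follows my observed behaviour (<= 9 cached), rather than the documented "< 9":

```python
# Illustrative classifier for the Direct MSR caching rules described
# above. Not a real SMS interface - just the decision table as code.

def msr_cache_eligibility(msr: int) -> str:
    """Return the caching eligibility implied by a Direct MSR value."""
    if msr <= 9:
        return "must cache"
    elif msr <= 998:
        return "may cache"
    else:  # 999 (or larger) means never cache
        return "never cache"

print(msr_cache_eligibility(999))  # never cache
```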

If you change the Storage Class, remember to use the command setsms scds(SYS1.S0W1.DFSMS.SCDS) to refresh SMS.
Then change your data set to use the appropriate Storage Class with the valid Direct MSR.

By default the SMSPDSE1 address space caches a PDSE only until the data set is closed.  This means that PDSEs are not cached between jobs.  You can change this with an operator command, or by updating the relevant parameter in the parmlib IGDSMSxx member.
If you now use your PDSE it should be cached in Hiperspace.

You can use the command d sms,pdse1,hspstats to see what is cached.

This gave me

D SMS,PDSE1,HSPSTATS                                                   
IGW048I PDSE HSPSTATS Start of Report(SMSPDSE1) 531                    
HiperSpace Size: 256 MB                                                
LRUTime : 50 Seconds   LRUCycles: 200 Cycles                           
BMF Time interval 300 Seconds                                          
---------data set name-----------------------Cache--Always-DoNot       
CSQ911.SCSQAUTH                                N      N      N         
CSQ911.SCSQMSGE                                N      N      N         
CSQ911.SCSQPNLE                                N      N      N         
CSQ911.SCSQTBLE                                N      N      N         
CBC.SCCNCMP                                    N      N      N         
CEE.SCEERUN2                                   N      N      N
COLIN.JCL                                      Y      Y      N         
COLIN.SCEEH.SYS.H                              Y      Y      N         
COLIN.SCEEH.H                                  Y      Y      N         

The CSQ9* data sets are PDSEs in Link List.  The COLIN.* data sets are my PDSEs in storage class SCAPPL.   They have Always Cache specified.  If you restart the SMSPDSE1 address space, the cache will be cleared.

You can use the commands

  • d sms,pdse1,hspstats,DSN(COLIN.*)  to display a subset of data sets
  • d sms,pdse1,hspstats,STORCLAS(SCAPPL) to display the data sets in a storage class

SMF data on datasets

SMF 42.6 records showing the I/O to the PDSEs were produced for the SMSPDSE1 address space.
My jobs doing I/O to the PDSEs did not get a record for the PDSE in their own SMF 42.6 data.

SMF data on SMSPDSE* buffer usage

Below is the printout from the SMF 42 subtype 1 records.

  • BMF:==TOTAL==
    • Data pages read: 20304, read by BMF: 567, not read by BMF: 19737 (97%)
    • Directory pages read: 649, read by BMF: 642, not read by BMF: 7 (1%)
    • Data pages read: 183, read by BMF: 0, not read by BMF: 183 (100%)
    • Directory pages read: 64, read by BMF: 60, not read by BMF: 4 (6%)
    • Data pages read: 567, read by BMF: 567, not read by BMF: 0 (0%)
    • Directory pages read: 472, read by BMF: 472, not read by BMF: 0 (0%)
  • SC:**NONE**
    • Data pages read: 19554, read by BMF: 0, not read by BMF: 19554 (100%)
    • Directory pages read: 113, read by BMF: 110, not read by BMF: 3 (2%)

We can see that for Storage Class SCAPPL all pages requested were in the cache.

Will this speed up my thousands of C compiles?

Not necessarily.  See the problems I had.

  • The C header files are in a PDS – not a PDSE – so you would have to convert the PDSs to PDSEs.
  • The C compiler uses the SEARCH(//’CEE.SCEEH.+’) option, which says read from this library.   This may override your JCL if you decide to create new PDSEs for the C header files.
  • When I compiled in USS my defaults had SEARCH(/usr/include/).  This directory was in ZFS.Z24A.VERSION, a zFS file system.   The files on the zFS may be cached.

When I ran my compile, there were 31 SMF 42.6 records for CEE.SCEEH.H, giving a total of 111 I/Os, and there were 2 records for CEE.SCEEH.SYS.H with a total I/O count of 14.  If each I/O takes 1 millisecond this is 125 milliseconds doing disk I/O to the PDS, so I expect it is not worth converting compiles to use PDSEs and caching them.
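The back-of-envelope sum, assuming (as stated) about 1 millisecond per I/O:

```python
# Back-of-envelope estimate from the SMF 42.6 counts above, assuming
# roughly 1 millisecond per I/O.

ios_header_lib = 111      # total I/Os from the 31 records for the header PDS
ios_sys_header_lib = 14   # total I/Os from the 2 records for the SYS header PDS
ms_per_io = 1.0           # assumed average time per I/O

total_ms = (ios_header_lib + ios_sys_header_lib) * ms_per_io
print(total_ms)  # 125.0 milliseconds - small compared with an 8 second compile
```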



Why does my C compile fail if I remove data sets I do not use?

I was playing with caching of header file PDSEs when I compiled a C program. I could see from the SMF 42.6 records that the CEE.SCEEH.H PDS was being used.  It took nearly two hours before my job stopped using this PDS!
I created a PDSE called COLIN.SCEEH.H and copied CEE.SCEEH.H into it.  I updated my JCL to use the new libraries, reran my job, and the SMF records showed I was still using CEE.SCEEH.H.  Hmm, this was very strange.

I renamed CEE.SCEEH.H to COLIN.CEE.SCEEH.H.  Did it work?   No – I got compile errors, so I renamed it back again.  Removing the data set clearly does not work.

I then spotted in the compiler listing that I had the default SEARCH(//’CEE.SCEEH.+’).   I added SE(//’COLIN.SCEEH.+’) and thought “Fixed it!”  No… still not fixed; it still used CEE.SCEEH…

I had to use the C options NOSEARCH, SE(//’COLIN.SCEEH.+’).  The first option turns off the default SEARCH(//’CEE.SCEEH.+’), and the second one creates a new search path.   After a cup of tea and a biscuit I remembered I had hit this about 20 years ago!

Understanding SMF 42.6 data set statistics

DFSMS provides SMF records to report activity, for example

  1. Data set statistics.
  2. Storage class statistics.
  3. SMS buffer usage.

This blog post describes the data set statistics in SMF 42 records, subtype 6.    IBM does not provide a formatter for these statistics, but there are formatters available from other companies.

My formatter written in C is available on Github.


There are statistics available in the following areas

  1. Job name (but not step name or program name)
  2. Data set name, type (Linear, PDSE etc), volume (A4USR1) etc
  3. Number of I/Os, and number of read I/Os
  4. Number of normal cache requests, and number of successful cache requests
  5. Number of other cache requests, Sequential, 1 Track, cache bypass.  See Understanding Storage Controller caching and the caching statistics.
  6. Average response time – and where the time is spent. See Pending, disconnect and other gobbledegook.
  7. Number of application (Access Method) requests and what type (sequential, direct, or directory).  Your application may issue 10 fread() requests, while only one disk I/O was done.

It feels a bit like a “dog’s dinner” (a dish with some very good stuff, and some other stuff, all mixed up).

When are the records produced?

  1. When the file is closed.
  2. Immediately after the recording of the type 30 interval record.
  3. For PDSEs, from the SMSPDSE and SMSPDSE1 address spaces, after the BMFTIME interval.

Sometimes an SMF record has data for more than one data set; other times there is one data set per record, and so there are multiple SMF 42.6 records per job step.  I could not find the reason for which one is used.  It may be as follows:

  • If the data set is closed write one data set record per job.
  • Else if the SMF 30 time interval expired – write as many data set records as will fit in the SMF record.

What data is collected?

Data is collected on reads and writes.  The I/O for reading the VTOC to find the data set, or for opening and closing the file, is not recorded in the SMF 42.6; it is recorded in the SMF 42.5 records.  I had a program which read from a PDS member and wrote it to a sequential file.  For the PDS there were:

  1. 4 I/Os to the VTOC – including 2 to the VTOC index
  2. 1 I/O to the PDS directory.  The directory is searched sequentially, so multiple directory blocks may need to be read to locate the required entry.
  3. 6 I/Os to the PDS member – reported in the SMF 42.6 record
  4. 1 I/O to the PDS directory for the close
  5. 1 I/O to the VTOC on the close.
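Tallying those I/Os (a trivial sketch) shows how few of them actually appear in the SMF 42.6 record:

```python
# Tally of the I/Os listed above for the PDS read. Only the member
# reads appear in the SMF 42.6 record; the VTOC and directory I/Os
# are recorded in the SMF 42.5 statistics.

vtoc_open = 4        # includes 2 to the VTOC index
directory_open = 1
member_reads = 6     # the only I/Os reported in SMF 42.6
directory_close = 1
vtoc_close = 1

total_ios = vtoc_open + directory_open + member_reads + directory_close + vtoc_close
print(total_ios, member_reads)  # 13 I/Os in total, of which 6 are in SMF 42.6
```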

Data set information.

You get information on most data sets

  1. Physical sequential, e.g. a listings file
  2. PDSs such as SYS1.MACLIB
  3. Linear data sets, such as MQ or DB2 logs and page sets
  4. Striped data sets, such as MQ logs, have a record for each volume used.  I had a log with two stripes, and got two records, one for each volume.
  5. Temporary data sets may have no I/O to disk if the data can be kept in VIO (virtual memory).  They may still have Access Method reads and writes.

But there is only some data on PDSEs.    The read I/Os were from the SMSPDSE address space, which caches PDSE directory information.   When the binder wrote a load module to a PDSE load library there were SMF 42.6 records for the write, from the SMSPDSE address space. There were no records produced when I had a C program write to a member of a PDSE.

Storage Controller caching information.

For a disk read, if the data is in the cache of the Storage Controller, the data can be returned immediately;  this is a cache hit.  If not, then a disk will have to be read to get the data; this is a cache miss.

For a disk write, usually the data is written to the Non Volatile Cache, and the request returns. This is known as DASD Fast Write (DFW).  At a later time, the data is moved from the cache to the disk.  This is a cache hit.  If the request says do not use cache (Bypass DASD Fast Write) this counts as a cache miss.

  1. There is a count of total normal cache requests (read and write), and a count of cache hits (read and write).  A normal cache request usually processes a couple of tracks.
  2. There is a count of total normal cache write requests, and a count of normal cache write hits.
  3. You can calculate the number of normal cache reads from (total cache requests – write cache requests).
  4. There is a count of sequential cache requests.   This request tells the storage controller to bring many records into the cache, or this is going to write many records to the cache.   There is one combined counter for read and write requests.
  5. For direct access, reading just one track, there is record level access (what I call 1 Track cache).  There is one combined counter for read and write requests.
  6. There is bypass cache where the record is written directly to disk.  There is one combined counter for read and write requests.
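Because only combined and write-only counters are recorded, the read-side numbers have to be derived by subtraction. A sketch, with made-up variable names (not the actual SMF field names):

```python
# The record carries combined (read+write) counters and write-only
# counters, but no read-only fields; derive the read side by
# subtraction, as described above.

def normal_read_cache(total_requests, total_hits, write_requests, write_hits):
    """Derive normal cache read requests and read hits by subtraction."""
    return total_requests - write_requests, total_hits - write_hits

# Example: 100 normal cache requests, of which 40 were writes (all hits).
print(normal_read_cache(100, 95, 40, 40))  # (60, 55)
```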

I/O response times

The average response time of the request, and the break down into connect time, disconnect time etc is reported.  See Pending, disconnect and other gobbledegook.

There are two sets of response times, the first set are in units of 128 microseconds, from the old days when disk response times were measured in milliseconds, and a newer set, which are in microseconds.   The old ones should be ignored and the more granular set of values used.

Most, but not all, of the response time fields are available in the SMF 42.6 records.

For one data set, I had

  • overall response time of 96 microseconds,
  • time to get to the Storage controller (pending time) 22, 
  • time in the Storage Controller(connect time) 17,
  • time doing to disks(disconnect time) 0.  

The difference, “other time” (57 microseconds), is the time from when the I/O request was initiated until the Start SubCHannel (SSCH) was issued, plus the time from the I/O interrupt indicating I/O complete until z/OS issued the Test SubCHannel (TSCH) to clear the I/O request and get the status.
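Reconstructing the numbers above:

```python
# The measured hardware phases do not account for the whole response
# time; the remainder is the "other time" described above.

response_us = 96   # overall response time
pending_us = 22    # time to get to the Storage Controller
connect_us = 17    # time in the Storage Controller
disconnect_us = 0  # time at the disks (zero here - a cache hit)

other_us = response_us - (pending_us + connect_us + disconnect_us)
print(other_us)  # 57 microseconds of queuing before the SSCH / after the interrupt
```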

Access manager counts

Applications usually use facilities like C functions fread() and fwrite(), rather than do the I/O themselves.  This makes the application much simpler, as the Access Method hides most of the complexity of doing the actual I/O.

Consider the scenario where your application opens a file and reads 10 records.  Under the covers the Access Method reads the disk,  a track’s worth of data, and gives the application data when asked.  We have 10 reads to the access manager, and one read from disk.

There are statistics on the number of sequential access, the number of direct access (get this particular record) and the number of directory records accessed – though this always seems to be zero!

Access manager response times

There are response times for the Access Method requests; I don't think they are very useful.

 The documentation says

  • delay values are in units of 128 microseconds.
  • The I/O delay is calculated by the access method when the access method checks whether an I/O
    buffer is available to be reused. For example, when an access method has issued an I/O request to commit the buffer. The timer starts when the access method makes the check request; it does not start  when the I/O is requested. The timer ends when the I/O request completes. Thus, the delay value that is reported is the amount of time that the caller had to wait to reuse a buffer.

It is not clear to me what this means.  I think it means there is logic like

  1. start asynchronous read 1
  2. start asynchronous read 2
  3. do something
  4. start timer
  5. check read 1 and wait till it has finished
  6. stop timer, and calculate the duration
  7. start timer
  8. check read 2 and wait till it has finished
  9. stop timer and calculate the duration
  10. Add up all the durations and report the sum.
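My interpretation of that logic, as a runnable sketch (the FakeIO class and all names here are invented for illustration; the real access method obviously does not work this way internally):

```python
import time

class FakeIO:
    """Stand-in for an outstanding asynchronous I/O request."""
    def __init__(self, remaining_seconds):
        self.remaining = remaining_seconds  # time until the I/O 'completes'
    def wait_until_complete(self):
        time.sleep(self.remaining)

def total_io_delay(requests):
    """Sum the time spent waiting for each buffer to become reusable.

    The timer starts at the check, not when the I/O was issued, so any
    work done between issuing and checking is excluded - matching the
    documented behaviour as interpreted above.
    """
    total = 0.0
    for req in requests:
        start = time.monotonic()           # start timer at the check
        req.wait_until_complete()          # check, and wait till finished
        total += time.monotonic() - start  # stop timer, add the duration
    return total
```

So two reads that each complete 10 milliseconds after being checked report a total delay of about 20 milliseconds, regardless of how long before the check the I/Os were actually started.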

In the documentation for these fields, some fields say “total I/O delay”, others say “I/O delay”.  It looks like the field is the total I/O delay.

Reading a file, it had 527 I/Os, with an average I/O response time of 527 microseconds.  The AM statistics had number of sequential read blocks 527, sequential read time 314368 microseconds, or an average of 597 microseconds, which is consistent with the disk I/O response time.
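A quick check of that arithmetic:

```python
# Checking the quoted average against the raw AM counters above.

blocks = 527                      # number of sequential read blocks
sequential_read_time_us = 314368  # total sequential read time, microseconds

average_us = sequential_read_time_us / blocks
print(round(average_us))  # 597 - close to the disk I/O response time
```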

I only seemed to get AM delays when the average disk I/O response time was over 500 microseconds.

Do the numbers make sense?

The answer is sometimes.  I have cases where the numbers do not add up.

  1. For the data set CEE.SCEERUN, it had 21 I/Os, 21 reads from cache, but 14 disk reads.  I did an I/O trace, and there were 21 reads from normal cache.  I do not know where the missing 7 were reported.  There was no write activity.
  2. For an SMF data set, this had 586 I/Os, 586 using Sequential cache, and 586 read requests.  This makes sense.
  3. For CBC.SCCNCMP  this had 305 I/Os, 5 normal reads from cache, 300 from Sequential cache, and number of disk reads 305.  This makes sense.
  4. When using the binder, my PDSE had 21 I/Os: 2 read cache, 12 write cache, and 7 sequential cache, with number of disk reads 2.  For this to make sense, the 7 sequential cache requests would have to be for the writes.

How do you tell the number of read I/Os?

This is a bit tricky.  First, ask yourself: do you care?  Or are you really interested in the total I/Os?

  1. If the number of I/Os = (number of cache requests – number of write cache requests), this is the number of read I/Os.
  2. If the number of I/Os != number of cache requests, there should be counts for sequential, 1 Track, or bypass cache.  You cannot tell if these are for reads or writes.
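That rule of thumb as a sketch (hypothetical parameter names, not the real SMF 42.6 field names):

```python
# If every I/O was a normal cache read request, the read count is
# recoverable; otherwise the combined sequential/1 Track/bypass
# counters hide the read/write split.

def read_ios(total_ios, cache_requests, write_cache_requests):
    """Return the read I/O count if it can be determined, else None."""
    if total_ios == cache_requests - write_cache_requests:
        return total_ios
    return None  # cannot split the remaining counters into reads/writes
```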

Other interesting “features”

  1. The userid is blank.   The documentation says “User-defined identification field (taken from common exit parameter area, not from USER= parameter on job statement)”.  I do not think I have a common exit.
  2. There is no step name, just the job name.  I was able to manually connect the SMF 30 records to the SMF 42.6 records by using the time stamp interval of the data set.
  3. There are SMF type 14 records for input data sets; these have been available since the days when z/OS was OS/360.  Its partner, SMF 15, is for output data sets.  It records by DDName, not data set name, and so is not very useful.  It provides an overlap of information with the SMF 42.6.   You have to enable the collection of SMF 14 and 15.  As you get one SMF record per DDNAME entry, you get many SMF records.  Most customers do not collect these records. The DISPLAY SMF command on my system gives SYS(NOTYPE(14:19,62:69,99)), which says do not collect records with SMF types 14 through 19, 62 through 69, or 99.
  4. Block size can be 0 – which means use the maximum the device supports.
  5. There is a field “Average disconnect time for reads”.   If a read has disconnect time, then there was a cache miss – the data was not in the 3990 storage cache.  You can get read cache miss, if the data has not been used for a while, and the data has been removed from the cache.
  6. There is a field “Maximum data set I/O response time” and “Service time associated with maximum I/O response time” in units of 128 microseconds.  The difference in these values is the time interval between the I/O interrupt occurring, and the Test SubCHannel, which processed the interrupt. My values were 0.000128 and 0.000000 – not very helpful.  Another set of values was 0.001152 0.000640.   A good idea to have this – but it has the wrong granularity.

Understanding Storage Controller caching and the caching statistics.

This started off with an investigation to see if my data sets were using the new zHyperLink synchronous I/O.   I wrote a program to process the SMF 42 data set statistics, but did not understand the information.   The documentation talked about cache hits – but which cache: the SMSPDSE cache or the Storage Controller cache?

A short history of disk evolution

Computer DASD has been around for about 60 years.

  • With early disks the operating system had to send commands to move the heads over the disk, and to wait until the right sector of the disk passed under the head.
  • Read cache was added to the disks. For hot data, the requested record may be in the cache – avoiding having to read from the physical disks.
  • Adding non-volatile write cache and a battery meant there were two steps: 1) send the data to the write cache, 2) move the data from the write cache to the spinning disks.
  • The use of PC type disks – and solid state disks with no moving parts.

I found this document from 1996 very interesting. It describes the disk caching and other disk concepts.

The statistics in the data set portion of the SMF 42.6 (dataset), and the SMF 42.5 (storage class) come from the disk controller such as a 3990, so references to cache are for the cache in the controller.

Cache for reading from disk.

There are different data scenarios which affect the cache.

  1. A program is reading sequentially;  for example a dataset with many records.  It would be good to say “read a whole cylinder’s worth of data from the disk into the controller’s cache, so the next records are in the cache when the next read is issued”.  The data which has been read can be removed from the cache.
  2. A program is reading a record in a database to update it, or is getting  records from a file (or PDSE) where the data is in 4KB records scattered across the data set.  This is known as direct access. Just get one track’s worth of data – the minimum amount possible to cache.
  3. Normal reading of a small file where requesting a record will cause the next few tracks to be loaded into the cache.

When a channel program (consisting of Channel Command Words) is issued to read some data, a hint can be passed down. The hints, matching the scenarios above, together with the corresponding fields in the SMF 42 records, are:

  1. Sequential access, Number of sequential I/O operations
  2. Record (direct)  access, Number of record level cache I/O operations.
  3. Normal cache, (Number of cache candidates – Number of write cache candidates)

For requests that may be cache unfriendly, the hint can be “if it is not in the cache, go to the disk and get it – and do not use the cache”. This is called Inhibit Cache Load, or ICL.

Cache for writing to disk

If you go back 30 years, a write to disk actually had to write to the disk. With the availability of Non Volatile Storage (NVS) and batteries, a write could go to the NVS, and be written to disk asynchronously possibly seconds or minutes later. This was known as DASD Fast Write (DFW).

If the NVS cache was full, the I/O went through to the disk without using the cache, and was consequently much slower – at the speed of disks, rather than memory speeds.

By default DFW is enabled, but the application could give a hint called “Bypass DFW” so the data went to the disk directly, and not to the cache. One example could be to avoid an application flooding the NVS cache, which would impact other applications using the disks.

What is a cache hit and cache miss – and how do I tell if this is being used?

There is a lot of documentation which mentions cache hits and cache misses. Of course everyone knows what a cache hit is, but I could not find how you tell whether you had a cache hit or not. The DASD controller can report the number of cache hits and misses, but not at a data set level. I haven’t found a document which officially tells me how you determine a cache hit or miss, but I believe the following is true.

If you break down the journey of an IO request from the Start Sub CHannel (SSCH) through the controller, to the disk, and the response returning, there are two parts called Connect and Disconnect.

Looking at a read request, during the connect stage, there is a request to the storage controller for a piece of data. If this is in the cache, it can be returned immediately, along with ‘end of request’. This time in the controller, the “Connection Time” duration, is passed back in a performance block to the originator. The disconnect time (see below) is 0.

If the data is not in the cache, the status goes to “disconnect”, and the request is passed to the disk itself. When the disk has retrieved the data, it passes it to the controller, which connects to the original request and passes the data back to the application. The connection duration, and the disconnect duration are passed back in a performance block to the host.

How do you tell if the data was in the cache or not? Easy – if the disconnect time is zero – the data was in the cache and only the cache was used. If the disconnect time duration is greater than 0, the data was not in the cache – so this is a cache miss.

For a write, if the write to the cache was successful, the disconnect duration is 0. If the write had to go to the disk, the disconnect duration will be greater than 0. So you can use the same argument: if the disconnect time duration is zero, the cache was used; if there is disconnect time, the cache could not be used, so this is a cache miss.
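The heuristic in both cases boils down to one test:

```python
# Disconnect time of zero means the request was satisfied entirely
# from the Storage Controller cache; any disconnect time means the
# disks (or a remote mirror) were involved - a cache miss.

def is_cache_hit(disconnect_time_us):
    """Classify an I/O as a cache hit from its disconnect time."""
    return disconnect_time_us == 0

print(is_cache_hit(0), is_cache_hit(120))  # True False
```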

Why are there cache hits and write cache hits, but no read hits reported?

I think this is historical. Initially there was no cache. When volatile cache was available, there were statistics on cache usage. When non volatile cache for disk writes was implemented, they kept the “cache statistics” unchanged to save confusion, and added “write statistics”. This meant you had two fields – all cache requests, and write only cache requests. When you are looking at cache usage, you have to do the calculations yourself to obtain the read cache statistics.

Pending, disconnect and other gobbledegook.

During my time as an MQ performance person, one of the hardest areas to understand was data set performance, and how to make access to data sets faster.  One reason was all the terms that people used.  If you were an expert, the terms were “obvious”.   It is very much like going to a hospital and the doctor says you have a contusion.   Is this good news or bad news?  When the doctor explains that a contusion is a fancy name for a bruise, it is clear that this may not be good news (unless the news is: you haven’t broken your leg, you just have a contusion).

Gobbledegook definitions

  • Made unintelligible by excessive use of technical terms.
  • Language that seems difficult because you do not understand it.
  • Language that seems to mean nothing.

Yep – DASD performance ticks all of those boxes!

Disk and data set performance statistics

I’ve recently been looking at data set statistics trying to understand the SMF 42 statistics, and as I haven’t used these statistics for five years, I struggled to understand them.

Below, I’ve tried to explain all of the complex terms in simple language.   It may not be 100% accurate, but I hope you get the picture.

What is the basic hardware?

  1. You have the CPU where your program runs.
  2. There may be an I/O processor for offloading the I/O requests
  3. There is a cable – typically known as FICON, a high speed fibre connection – from the CPU or I/O processor to the disk subsystem.
  4. The FICON is connected to a Storage Controller.  The Storage Controller acts as an interface to the disks.  It can cache data from disks. It talks to other Storage Controllers if mirrored DASD is being used.
  5. Disks.   These used to be big spinning disks, 1 metre in diameter.   These days you have many solid state disks of the sort used in laptop computers.  High capacity – small footprint.
  6. You can have a FICON Director.  This acts as a big switch between the mainframe and the Storage Controllers.   You plug the FICON cables from the mainframe into one side, and the output goes from the FICON Director to the Storage Controller.

What is the path of an I/O?

There are many stages to get data to and from a disk.

  1. When the I/O is started, a Start Sub CHannel (SSCH) command is issued.
  2. Within the z/OS box is an I/O processor for offloading the I/O requests.  The SSCH wakes up the I/O processor.
  3. The I/O processor sends a request over FICON, possibly via a FICON Director to the Storage Controller.
  4. The Storage Controller may be able to process the request without going to a disk.
  5. The Storage Controller may pass the request to a disk.
    1. The disk processes the request, and when it has finished notifies the Storage Controller.
  6. The Storage Controller sends the data back to the requestor.
  7. When the storage controller has finished, it sends up a “storage control ended” back up the FICON cable.
  8. The I/O processor catches the request, issues a Test Sub CHannel to get the performance information from the I/O request.  
  9. The I/O processor notifies the CPU, which then wakes up your application.

What are the major categories of I/O delay?

At a high level the time spent in an I/O operation falls into the following categories:

  1. The time before the request leaves the CPU and enters the I/O subsystem.
  2. Getting from the I/O subsystem down to the storage controller
  3. The time the storage controller was active
  4. Time spent accessing the disks.
  5. The time between the I/O completing and the CPU processing the status.

Technical terms explaining delays, and possible reasons

Some of these terms are defined here.

  • IOSQ – the time before the request leaves the CPU and enters the I/O subsystem.
    • In z/OS, all paths to the device are busy.   You can define multiple paths to the disks using PAV.
    • All the I/O Processors are busy
    • The I/O could be delayed because a higher priority I/O took precedence.
  • Pending – getting from the I/O subsystem down to the storage controller, (the time required to get the storage hardware to initiate an I/O operation).
    • The FICON channel is overloaded and cannot be used, or the channel is busy.
    • This mainframe already has a reserve on the volume.
    • The FICON director (FICON router) is busy.
    • The Command Response measures the delay to get from the I/O subsystem to the Storage Controller and back – think of it as a TCP/IP ping.
    • Device busy delay might mean:
      • Another system is using the volume.
      • Another system reserved the device, but is not actually using it.
  • Connect time – The time the Storage Controller was active.  Processing the data – read or write.  There is connect time talking to the FICON, and also connect time talking to the disks.
    • Control unit queue time, this is the time queuing within the Storage Control unit – think of it as an enqueue on a track or cylinder.
    • Transferring data.   Note: data may be multiplexed down a connection.   More connections can slow down a request’s transfer rate.   Think of a road – when there is too much traffic, the traffic slows down.
    • FICON internal chat.
    • The amount of data to process.   The more data, the longer the transfer takes.
  • Disconnect – Time spent accessing the disks.  This could be accessing local disks, or accessing remote (mirrored) disks.  The Storage controller is not doing any work while the disk is busy.
    • The volume is reserved by another system.
    • Waiting for the arm to move, or the disk to rotate (for spinning disks).
    • The disk is processing the request – for example a cache miss means the disk has to be read.
    • Waiting for a signal from a remote peer to say that write data has been stored.
    • Some SMF records have a field “Read Disconnect time”.    This indicates the read wanted a record which was not in the controller cache.
  • Device Active Only time (DAO).   The channel has finished its work, but the disk was busy for a little longer (for example waiting for a remote disk to complete).  This is the additional time after the channel has finished.
  • Service time: The duration between the SSCH and the interrupt at the end of the I/O.
  • Interrupt Delay Time.   This is the delay between the I/O subsystem getting the interrupt, and z/OS issuing the TSCH to get the status.   If your z/OS image is running as an LPAR, this includes time to dispatch your LPAR, and then for your LPAR to issue the TSCH instruction.

Do I need to know this?

Most of the time you do not need to know about disk performance, as most disks are solid state and data is in cache – but if you have a performance problem, it is worth checking that the data sets are not the cause.

In the SMF 42 records you get the following durations (average) in microseconds

  • Response time
    • Pending
      • Initial Command Response (ping)
      • Device busy time
    • Connect time
      • Control Unit Queue
    • Disconnect time
    • Device-Active-Only
    • Disconnect time for reads
  • Response time per random read
  • Service time per random read

Note: Application resume delay is for zHyperLink Synchronous I/O.

Setting up striping for MQ logs and other experiences with SMS.

You get improved throughput using striped data sets. Products like MQ and DB2 have logs and page sets which can exploit VSAM striping.   This blog post tells you what you need to configure to be able to use it.

Why do striped logs have higher throughput?

When writing to a dataset, the duration of the request is composed of three parts

  1. Issuing the request and getting it to the IO subsystem on the mainframe.
  2. Getting from the IO subsystem on the mainframe down to the Storage Controller
  3. Transferring the data.

The time depends on the amount of data to transfer.  Striping uses more than one volume,  so less data is written to each device, and so the response time is shorter.

When writing pages to the MQ log which is not striped, all the pages are sent down the channel to one disk.

When using an MQ log with 4 stripes, you have 4 volumes.

  1. Page 1 goes to volume A
  2. Page 2 goes to volume B
  3. Page 3 goes to volume C
  4. Page 4 goes to volume D
  5. Page 5 goes to volume A
  6. Page 6 goes to volume B

The time to send 2 pages to volume A,  etc should be less than the time taken to send 6 pages to a non striped volume.
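The round-robin above can be sketched in a few lines (a simplification – real VSAM striping is done in terms of control intervals, not whole pages, and the volume names are just labels):

```java
public class StripeDemo {
    // Pages are distributed round-robin across the stripe volumes.
    public static String volumeFor(int page, String[] volumes) {
        return volumes[(page - 1) % volumes.length];
    }

    public static void main(String[] args) {
        String[] vols = {"A", "B", "C", "D"}; // a 4-stripe data set
        for (int page = 1; page <= 6; page++)
            System.out.println("Page " + page + " goes to volume " + volumeFor(page, vols));
    }
}
```

With 4 stripes each volume receives roughly a quarter of the pages, which is why the per-volume transfer time drops.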

When I worked for IBM I had JCL to define the logs, already set up with striping.  The storage manager had set up the SMS definitions, so it was easy for me.  I recently tried to set up striping for my MQ logs on my personal z/OS system, and I had to set up my own SMS definitions.  In theory it was easy, but it took me a long time because I did not know about one tiny little SMS command.

SMS, Storage Classes, and Data Classes ( and Management classes)

When I first experienced System Managed Storage(SMS) there were lots of new terms to learn.  I’ll give a 10 second summary of SMS.  

  • Data sets have attributes, such as size in MB, record length, and whether it is a PDSE or a sequential file.
  • Disk volumes have attributes.  You want production data sets on new disks, because they are faster than old disks.  You want to keep the backup copy of a data set on a different set of disks to the original data set.
  • Some data sets you want to back up daily, keep multiple backups of, and take off site.  Some data sets you back up once a week, and keep only one copy.  If a data set has not been used for a month, migrate it to tape.

The SMS classes are

  1. Data Class.  Specifies the data set attributes: record length, space allocation, PDSE or sequential file.
  2. Storage Class.  Defines the criteria for the allocation of data sets, for example whether the volumes need to be “dual copy”, or must have a performance response time better than a specified value.
  3. Management Class.  Defines the backup and migration policy, for example “back this up daily”.

You have ACS scripts, which can do processing like: if the HLQ is SYS1.** then set Management Class = frequent_backup; if the data set name is like MQS.**.PROCLIB then use Data Class = BIGALLOC, which allocates 50 cylinders.

You configure the SMS classes using the interactive ISMF tool.

Setting up MQ logs with stripes

You need JCL like

CYL(1000) ) -
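The fragment above is the tail of an IDCAMS DEFINE. A hedged sketch of the sort of JCL involved – the data set name, class names, and size here are my invented examples, not the original definitions:

```jcl
//DEFLOG   EXEC PGM=IDCAMS
//SYSPRINT DD SYSOUT=*
//SYSIN    DD *
 DEFINE CLUSTER (NAME(MQS.LOGCOPY1.DS01) -
        LINEAR -
        DATACLAS(EXTENDED) -
        STORCLAS(STRIPED) -
        SHAREOPTIONS(2 3) -
        CYL(1000) ) -
        DATA (NAME(MQS.LOGCOPY1.DS01.DATA))
/*
```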



You need to pick a Data Class which has extended format.

Data Set Name Type . . . . . : EXTENDED
 If Extended . . . . . . . . : REQUIRED

You need to specify a storage class with striping configured. I thought it would be easy to have a field in the Storage Class called “Number of stripes”, but no, it is way more complex than this.

There is a Storage Class field called Sustained Data Rate (SDR). As I see it, the crazy reasoning behind its use is that at one time 3390s could sustain a data rate of 4 MB/second. If you want more than this, you clearly need more 3390s. By specifying an SDR value of 16 with 3390s, SMS can work out that you want at least 4 stripes – see, crazy!

I changed the storage class to have the SDR value of 16.

Performance Objectives
Direct Millisecond Response . . . : 1
Direct Bias . . . . . . . . . . . :
Sequential Millisecond Response . : 1
Sequential Bias . . . . . . . . . :
Initial Access Response Seconds . :
Sustained Data Rate (MB/sec) . . . : 16
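The arithmetic SMS is doing can be sketched as follows (the real volume-count calculation in SMS is more involved; the 4 MB/second per 3390 figure is the one described above):

```java
public class StripesFromSdr {
    // One 3390 was assumed to sustain about 4 MB/second, so SMS derives
    // the stripe count from the Sustained Data Rate you ask for.
    public static int stripes(int sdrMbPerSec, int mbPerSecPerVolume) {
        // ceiling division: enough volumes to deliver the requested rate
        return (sdrMbPerSec + mbPerSecPerVolume - 1) / mbPerSecPerVolume;
    }

    public static void main(String[] args) {
        System.out.println(stripes(16, 4)); // SDR 16 -> 4 stripes
    }
}
```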

Aside: Another crazy. If you want to cache a data set – such as a PDSE – there is no storage class switch saying “CACHE=Y|N”. You specify the response time by setting “Sequential MilliSecond Response=1”. MSR=1 says “This must be cached”, MSR=10 says “This may be cached if SMS has the capacity”. These are only advisories.

You also need to specify the volumes to be used, for example VOLUMES(A4USR1 USER00). Your ACS may be set up to choose appropriate volumes (with the right attributes and enough space). My ACS is very simple, and I have to explicitly specify the SMS volume names.

Did it work? No.

I read the books, and it looked like I had done the right thing. I enabled SMS tracing of allocations using the operator command

setsms volselmsg(on)

This gave me volume selection messages in my job output. I could see the Storage Class, and the Data Class being used, but why did it have STRIPING(N)?

After a day’s worth of struggling, I eventually dumped the SMS definitions using DCOLLECT, and wrote a program to format them. I could see the SDR value for the storage class was 0! My changes had not been picked up.

I used the operator command

setsms scds(SYS1.S0W1.DFSMS.SCDS)

to reload the definitions, reran my job, and it worked. Because I still had SMS VOLumeSELectMSG(ON) enabled, the job produced




The LISTCAT command for the data set then showed the expected striping.

What’s the difference between a PDS and a PDSE?

I’ve been using PDSEs for years. I thought that a PDSE was a slight improvement on a PDS, in that you do not have to compress PDSEs like you had to with PDSs, and that binding programs requires a PDSE.

I’ve found there are big differences. IBM documents them here. For me the differences are

  1. A PDSE can be larger than a PDS – it can have more extents.
  2. When you delete a member from a PDS, the space is not reclaimed.  When you add a member to a PDS it uses up free space “from the free end”.  When the PDS is full you have to compress it, which reorganises the space.  With a PDSE the data is managed in 4KB pages.  When a member is deleted the space is available immediately.
  3. With a PDS you can get “directory full”, if you did not allocate enough directory blocks when you created the data set.  With a PDSE, if it needs a new “directory block” it gets any free block.
  4. The directory of a PDS is in create order.  To find a member you have to search the directory.  With a PDSE the directory is indexed.
  5. With a PDS only one thread can update it at a time.  With a PDSE, multiple tasks can update it – including in a sysplex.
  6. Old-fashioned link-edited modules can go into a PDS or a PDSE.   The binder (the enhanced linkage editor) can only store its program objects in a PDSE.  One reason is that there is more information in the directory entry.
  7. PDSEs are faster.   When you read a PDS there is IO to the disk: first to get the directory blocks and search for the entry, then to read the member from disk.  With a PDSE, the system address space SMSPDSE may have cached the directory entries, or the pages themselves, eliminating the need for IO.  Even if it is not cached, the directory search may be shorter.
  8. Some system load libraries have to be PDS and not PDSE, as the PDSE code may not be loaded early in the IPL.

You can find out about PDSEs here
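Point 4 above can be illustrated off-mainframe. This is only an analogy, not the actual PDSE implementation: a PDS directory search is like scanning a list in create order, while a PDSE directory lookup is like probing an ordered index:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeSet;

public class DirectoryLookup {
    // PDS-style: scan the directory entries until the member is found.
    public static int pdsSearch(List<String> directory, String member) {
        int compares = 0;
        for (String m : directory) {
            compares++;
            if (m.equals(member)) break;
        }
        return compares;
    }

    // PDSE-style: the directory is indexed, so lookup is O(log n).
    public static boolean pdseSearch(TreeSet<String> index, String member) {
        return index.contains(member);
    }

    public static void main(String[] args) {
        List<String> dir = new ArrayList<>();
        for (int i = 0; i < 1000; i++) dir.add(String.format("MEM%04d", i));
        System.out.println(pdsSearch(dir, "MEM0999"));                  // 1000 compares
        System.out.println(pdseSearch(new TreeSet<>(dir), "MEM0999"));  // true, ~10 compares
    }
}
```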

Is this disk synchronous IO really new?

There is synchronous write, and then there is synchronous write: you just have to know whether you mean a synchronous write as seen by an application, or one as issued by the z/OS operating system.  The synchronous disk IO used by the IO subsystem is new (a couple of years old); it is known as zHyperLink.

If you know about sync and async requests to a coupling facility – you already know the concepts.

The application synchronous IO

For the last 40 years an application IO request has looked synchronous: immediately after a read request finished, you could use the data. What happens under the covers is as follows:

  1. Issue the IO request to read from disk – for example a C fread() function.
  2. The operating system determines which disk to use, and where on the disk to read from.
  3. The OS issues the IO request.
  4. The OS dispatcher suspends the requesting task.
  5. The OS dispatcher dispatches another task.
  6. When the IO request has completed, the device signals an IO-complete interrupt.  This interrupts the currently executing program, and a flag is set saying the original task is now dispatchable.
  7. The dispatcher resumes the original task which can now process the data.

40 years ago a read could take over 20 milliseconds.
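The suspend-and-redispatch flow above can be mimicked in miniature. This is an analogy only: a daemon thread stands in for the IO subsystem, and Future.get() stands in for the task being suspended until the IO-complete interrupt:

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class SyncOverAsync {
    // The "IO subsystem": a daemon thread that completes requests asynchronously.
    private static final ExecutorService ioSubsystem =
        Executors.newSingleThreadExecutor(r -> {
            Thread t = new Thread(r);
            t.setDaemon(true);
            return t;
        });

    // The application sees a synchronous read: the calling task is suspended
    // until the "IO complete" signal (here, Future completion) arrives.
    public static byte[] read(String dataset) throws Exception {
        Future<byte[]> ioComplete = ioSubsystem.submit(() -> {
            Thread.sleep(5); // stand-in for channel + device time
            return dataset.getBytes();
        });
        return ioComplete.get(); // the dispatcher could run other tasks meanwhile
    }

    public static void main(String[] args) throws Exception {
        System.out.println(new String(read("MY.DATASET")));
    }
}
```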

A short history of disk IO.

Over 40 years disks have changed from big spinning disks – 2 meters high – to PC sized disks with many times the capacity.

  • With early disks the operating system had to send commands to move the heads over the disk, and to wait until the right part of the disk passed under the head.
  • The addition of read cache to the disk controller. For hot data, the wanted record may be in the cache, which avoids having to read from the physical disks.
  • The addition of write cache (with battery backup) meant there were two steps: 1) send the data to the write cache, 2) move the data from the write cache to the spinning disks.
  • The use of PC type disks – and solid state disks with no moving parts.

These all relied on the same model: start an IO request, then wait for the IO-complete interrupt.

The coupling facility

The coupling facility(CF) is a machine with global shared memory, available to systems in a Sysplex.

When this was being developed, the developers found that it was sometimes quicker to issue an IO instruction and wait for it to complete than to use the model above of starting an IO and waiting for the interrupt. The “issue the IO instruction and wait” synchronous request might take 50 microseconds. The “start the IO, wait, and process the interrupt” asynchronous request might take 1000 microseconds.

How long does the synchronous instruction take? – How long is a piece of string?

Most of the time spent in the synchronous instruction is time on the cable between the processor and the CF – a speed of light problem. If the distance is long (a long piece of cable), the instruction takes too long, and it is more efficient to use the async model to communicate with the CF. Use a shorter cable (which may mean moving the CF closer to the CPU) and the instruction is quicker.
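The choice between the two models comes down to a break-even test, which can be written down directly (the numbers in the usage are illustrative, not measurements):

```java
public class SyncOrAsync {
    // Spin synchronously only when the expected wait (dominated by cable
    // length and the speed of light) is shorter than the cost of suspending
    // the task, taking the interrupt, and redispatching it.
    public static boolean useSynchronous(double expectedWaitMicros, double asyncOverheadMicros) {
        return expectedWaitMicros < asyncOverheadMicros;
    }

    public static void main(String[] args) {
        System.out.println(useSynchronous(50, 1000));   // short cable: spin and wait
        System.out.println(useSynchronous(5000, 1000)); // long cable: go asynchronous
    }
}
```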

How about synchronous disk IO?

The same model can be used with disk IO. The exploiting software (for example Db2) had to change the way it does IO to use it.

When used for a disk read, the data is expected to be in the disk controller cache. If it is not, the request times out, and an async request is made instead.

This can be used for a disk write, to put the data into the disk controller cache, but it may not be as useful. If you are mirroring your logs, with local disks and remote disks, the IO as seen by Db2 will not complete until both the local and remote IOs have completed. Just like the CF, it means the DASD controller (3990) needs to be close to the CPU.

I found Lightning Fast I/O via zHyperLink and Db2 for z/OS Exploitation a good article which mentions synchronous IO.

IO statistics

I noticed that in older releases of z/OS, IO response times were in units of 128 microseconds. For example, when an IO finishes, the response contains the IO delays in the different IO stages. In recent releases the IO response times are in microseconds: you may get response times down to tens of microseconds, and reporting in units of 128 microseconds is not accurate enough.
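The loss of precision is easy to demonstrate (the conversion helpers are mine; the real SMF fields just hold the raw values):

```java
public class TimingUnits {
    // Older SMF/RMF fields held timings in units of 128 microseconds.
    public static long unitsToMicros(long raw128) { return raw128 * 128; }

    // A fast I/O loses all precision when recorded in 128-microsecond units.
    public static long microsToUnits(long micros) { return micros / 128; }

    public static void main(String[] args) {
        System.out.println(microsToUnits(40)); // a 40 microsecond I/O records as 0 units
        System.out.println(unitsToMicros(2));  // 2 units = 256 microseconds
    }
}
```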

The bear traps when using enclaves

I hit several problems when trying to use the enclave support.

In summary

  1. The functions to set up and use an enclave are available from C, but the functions to query and display usage are not available from C (and so not available from Java).
  2. Some functions caused an infinite loop because they overwrote the save area.
  3. Not all classify functions are available in C, for example ClientIPAddr.
  4. I had problems in 64 bit mode.
  5. Various documentation problems
  6. It is not documented that you need to pass the connection token with __server_classify(_SERVER_CLASSIFY_CONNTKN, (char *) connToken). Without it you get errno2=0x0330083B: “Home address space does not own the connect token from the input parameter list”.
  7. You can query the CPU used by your enclave using the IWMEQTME macro (in supervisor state!). I had to specify CURRENT_DISP=YES to cause the dispatcher to be called to update the CPU figures.  By default the CPU usage figures are updated at the end of a dispatch cycle.  On my low-use system, my transactions were running without being redispatched, and so the CPU “used” was reported as 0.

In more detail…

Minimum functionality for C programs.

You cannot obtain the CPU used by enclaves from a C program, as the functions are not defined.  I had to write my own assembler code to call the assembler macros which obtain the information.  Some of these macros require supervisor state.

Many macros clobber the save area

Many macros use a program call to execute a function.  Others, such as IWMEQTME, use a BASR instruction, and the called function then does a standard save of the registers.  This means that you need to provide a standard register save area.  Without one, the caller's save area was reused: the saved registers were overwritten, and the branch back at the end just branched to the instruction after the macro.

Instead of a function like

EDEL     AMODE 31 
         USING *,12 
         STM   14,12,12(13) 
         LR    12,15 
         L     6,0(1)       the work area  
         L     2,4(1)       ADDRESS OF THE passed data              
         IWM4EDEL ETOKEN=0(2),MF=(E,0(6),COMPLETE),                    X 
               CPUTIME=8(2),ZAAPTIME=16(2),ZIIPTIME=24(2) 
         LM    14,12,12(13) 
         SR    15,15 
         BR    14 

I needed to add in code to create a save area, for example with a different macro

QCPU     AMODE 31 
         USING *,12 
         STM   14,12,12(13) 
         LR    2,1 
         LR    12,15 
* obtain an area for a new register save area and macro work area 
         LA    0,WORKLEN 
         STORAGE OBTAIN,LENGTH=(0) 
         ST    13,4(,1)           chain to the prior save area 
         LR    13,1               R13 -> the new save area 
         L     2,0(2)             ADDRESS OF THE CPUTIME 
         IWMEQTME CPUTIME=8(2),ZAAPTIME=16(2),ZIIPTIME=24(2),          X 
               CURRENT_DISP=YES,                                       X 
               MF=(E,72(13),COMPLETE) 
         LR    3,15               save the return code 
* free the register save area 
         LR    1,13               ADDRESS TO BE RELEASED 
         L     13,4(,13)          ADDRESS OF PRIOR SAVE AREA 
         LA    0,WORKLEN 
         STORAGE RELEASE,                                              X 
               ADDR=(1),          ..ADDRESS IN R1                      X 
               LENGTH=(0)         ..LENGTH IN R0 
         L     14,12(13) 
         LR    15,3 
         LM    0,12,20(13) 
         BR    14 
WORKLEN  EQU   256 

Problems using a 64 bit program

I initially had my C program in 64-bit mode. This caused problems when I wrote stub code to use the assembler interface: the assembler macros are supported in AMODE 31, but my program and storage areas were 64-bit, and the assembler code had problems.

Various documentation problems

  1. It is not documented that you need to pass the connection token with __server_classify(_SERVER_CLASSIFY_CONNTKN, (char *) connToken). Without it you get errno2=0x0330083B: “Home address space does not own the connect token from the input parameter list”.
  2. _SERVER_CLASSIFY_SUBSYSTEM_PARM: “Set the transaction subsystem parameter. When specified, value contains a NULL-terminated character string of up to 255 characters containing the subsystem parameter being used for the __server_pwu() call.”  This applies to __server_classify() as well as __server_pwu().  The same applies for _SERVER_CLASSIFY_TRANSACTION_CLASS, _SERVER_CLASSIFY_TRANSACTION_NAME, and _SERVER_CLASSIFY_USERID.
  3. Getting the report and service class back from __server_classify:
    • You use _SERVER_CLASSIFY_RPTCLSNM@, _SERVER_CLASSIFY_SERVCLS@, and _SERVER_CLASSIFY_SERVCLSNM@ without the @ at the end.   I think the @ is meant to imply these are pointers.
    • They did not work for me.  I could not see when the fields are available.   The classify work is only done during the CreateWorkUnit() request; I requested them before and after this function, and only got back a string of hex zeros.

Using enclaves in a java program

I’ve blogged about using enclaves from a C program.  There is an interface from Java which uses this C interface.

It is relatively easy to use enclave services from a Java program, as there are Java classes for most of the functions, available in the JZOS toolkit.  For example the WorkloadManager class is defined here.

Below is a program I used to get the Work Load Manager(WLM) services working.

import java.util.concurrent.TimeUnit;
import com.ibm.jzos.wlm.ServerClassification;
import com.ibm.jzos.wlm.WorkUnit;
import com.ibm.jzos.wlm.WorkloadManager;
public class main {
    // run it with /usr/lpp/java/J8.0_64/bin/java main
    public static void main(String[] args) throws Exception {
        WorkloadManager wlmToken = new WorkloadManager("JES", "SM3");
        ServerClassification serverC = wlmToken.createServerClassification();
        for (int j = 0; j < 1000; j++) {
            WorkUnit wU = new WorkUnit(serverC, "MAINCP");
            float f;
            for (int i = 0; i < 1000000; i++) f = i * i * 2;
            TimeUnit.MICROSECONDS.sleep(20 * 1000); // 20 milliseconds
            wU.delete(); // end the workload
        }
    }
}

The WLM statements are explained below.

WorkloadManager wlmToken = new WorkloadManager("JES", "SM3");

This connects to the Workload Manager and returns a connection token.    This needs to be done once per JVM.  You can use any relevant subsystem type; I used JES, and a Subsystem Instance (SI) of SM3. As a test, I created a new subsystem category in WLM called DOG, and used that.  I defined Subsystem Instance SI with a value of SM3 within DOG and it worked.

z/OS uses subsystem types such as JES for jobs submitted into JES2, and STC for started tasks.

ServerClassification serverC = wlmToken.createServerClassification();

If your application is going to classify the transaction, to determine the WLM service class and reporting class, you need this.  You create it, then add the classification criteria to it – see the following section.

Internally this passes the connection token wlmToken to the createServerClassification function.


A classify call, for example serverC.setTransactionName("TCI3"), passes information to WLM to determine the best service class and reporting class.  Within Subsystem CAT, Subsystem Instance SM1, I had a sub-rule TransactionName (TN) with a value TCI3.  I defined the service class and a reporting class.

WorkUnit wU = new WorkUnit(serverC, "MAINCP");

This creates the independent (business transaction) enclave.  I have not seen the value MAINCP reported in any reports.   This invokes the C run-time function CreateWorkUnit().  The CreateWorkUnit() function requires a STCK value of when the work unit started; the Java code does this for you and passes the STCK through.


wU.join() connects the current task to the enclave, and any CPU it uses will be recorded against the enclave.


wU.leave() disconnects the current task from the enclave.  After this call any CPU used by the thread will be recorded against the address space.


wU.delete() ends the independent enclave (business transaction). WLM records the elapsed time and resources used for the business transaction.


The program disconnects from WLM.

Reporting class output.

I used RMF to print the SMF 72 records for this program.   The Reporting class for this program had

AVG        0.29  ACTUAL                36320 
MPL        0.29  EXECUTION             35291 
ENDED       998  QUEUED                 1028 
END/S      8.31  R/S AFFIN                 0 
#SWAPS        0  INELIGIBLE                0 
EXCTD         0  CONVERSION                0 
                 STD DEV               18368 
----SERVICE----   SERVICE TIME  ---APPL %--- 
IOC           0   CPU   12.543  CP      0.01 
CPU       10747   SRB    0.000  IIPCP   0.01 
MSO           0   RCT    0.000  IIP    10.44 
SRB           0   IIT    0.000  AAPCP   0.00 
TOT       10747   HST    0.000  AAP      N/A 

From this we can see that for the interval

  1. 998 transactions ended (another report interval had 2 transactions ending).
  2. The response time was an average of 36.3 milliseconds.
  3. A total of 12.543 seconds of CPU was used.
  4. It spent 10.44% of the time executing on a zIIP.
  5. For 0.01% of the time it was executing zIIP-eligible work on a CP, as no zIIP was available.
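As a cross-check on these numbers (my own arithmetic, assuming nearly all of the 12.543 CPU seconds ran on the zIIP, which the small CP and IIPCP percentages suggest):

```java
public class RmfArithmetic {
    // CPU seconds divided by transactions ended gives CPU per transaction.
    public static double cpuPerTxnMs(double cpuSeconds, int ended) {
        return cpuSeconds / ended * 1000.0;
    }

    // APPL% is CPU seconds as a percentage of the interval, so the
    // interval length can be estimated by working backwards.
    public static double intervalSeconds(double cpuSeconds, double applPct) {
        return cpuSeconds / (applPct / 100.0);
    }

    public static void main(String[] args) {
        System.out.printf("%.1f ms CPU per transaction%n", cpuPerTxnMs(12.543, 998));
        System.out.printf("~%.0f second RMF interval%n", intervalSeconds(12.543, 10.44));
    }
}
```

This gives roughly 12.6 ms of CPU per transaction, and suggests an RMF interval of about two minutes.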

Additional functions.

The functions below

  • ContinueWorkUnit – for dependent enclave
  • JoinWorkUnit – as before
  • LeaveWorkUnit – as before
  • DeleteWorkUnit – as before

can be used to record CPU against the dependent (Address space) enclave.  There is no WLM classify for a dependent enclave.

Java threads and WLM

A common application pattern is to use connection pooling, because the connect/disconnect to a database or MQ is expensive.  If you have a pool of threads which connect at start-up and stay connected, an application can request a thread and get one which has already been connected to the resource manager.

It should be a simple matter of changing the interface from

connection = connectionPool.getConnection()

to

connection = connectionPool.getConnection(wU)

so the pooled thread joins the work unit, and adding a connection.leave(wU) to the releaseConnection.
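A sketch of that pooling pattern. The WorkUnit below is a stand-in class so the example is self-contained (the real one is com.ibm.jzos.wlm.WorkUnit), and the pool method names are the hypothetical ones used above:

```java
import java.util.ArrayDeque;
import java.util.Deque;

public class EnclavePool {
    // Stand-in for com.ibm.jzos.wlm.WorkUnit: the real join()/leave() move the
    // thread's CPU accounting into and out of the enclave.
    public static class WorkUnit {
        public int joined, left;
        public void join()  { joined++; }
        public void leave() { left++; }
    }

    public static class Connection {
        private WorkUnit current;
        // The pooled thread joins the caller's enclave while it does work.
        public void attach(WorkUnit wU) { current = wU; wU.join(); }
        public void detach() { if (current != null) { current.leave(); current = null; } }
    }

    private final Deque<Connection> free = new ArrayDeque<>();

    public EnclavePool(int size) {
        for (int i = 0; i < size; i++) free.push(new Connection()); // pre-connected
    }

    // getConnection(wU): hand out a pre-connected connection, joined to the enclave.
    public Connection getConnection(WorkUnit wU) {
        Connection c = free.pop();
        c.attach(wU);
        return c;
    }

    // releaseConnection: leave the enclave before the connection is reused.
    public void releaseConnection(Connection c) {
        c.detach();
        free.push(c);
    }
}
```

The important property is the pairing: every getConnection(wU) join is matched by a leave when the connection goes back to the pool, so no CPU leaks into the wrong enclave.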