Pending, disconnect and other gobbledegook.

During my time as an MQ performance person, one of the hardest areas to understand was data set performance, and how to make access to data sets faster.  One reason was all the terms that people used.  If you were an expert, the terms were “obvious”.   It is very much like going to a hospital where the doctor says you have a contusion.   Is this good news or bad news?  When the doctor explains that a contusion is a fancy name for a bruise, it is clear that this may not be good news  (unless the news is – you haven’t broken your leg, you just have a contusion).  

Gobbledegook definitions

  • Made unintelligible by excessive use of technical terms.
  • Language that seems difficult because you do not understand it.
  • Language that seems to mean nothing.

Yep – DASD performance ticks all of those boxes!

Disk and data set performance statistics

I’ve recently been looking at the SMF 42 data set statistics, and as I haven’t used them for five years, I struggled to understand them.

Below, I’ve tried to explain all of the complex terms in simple language.   It may not be 100% accurate, but I hope you get the picture.

What is the basic hardware?

  1. You have the CPU where your program runs.
  2. There may be an I/O processor for offloading the I/O requests
  3. There is a cable – typically known as FICON – a high speed fibre connecting the CPU or I/O processor to the disk subsystem.
  4. This FICON is connected to a Storage Controller.  The Storage Controller acts as an interface to the disks.  It can cache data from disks. It talks to other Storage Controllers if mirrored DASD is being used.
  5. Disks.   These used to be big spinning disks, 1 meter in diameter.   These days you have many solid state disks, like those used in laptop computers: high capacity, small footprint.
  6. You can have a FICON director.  This acts as a big switch between the mainframe I/O and the Storage Controllers.   You plug the FICON cables from the mainframe into one side, and the output goes from the FICON director to the Storage Controller.

What is the path of an I/O?

There are many stages to get data to and from a disk.

  1. When the I/O is started, a Start Sub CHannel (SSCH) command is issued.
  2. Within the z/OS box is an I/O processor for offloading the I/O requests.  The SSCH wakes up the I/O processor.
  3. The I/O processor sends a request over FICON, possibly via a FICON Director to the Storage Controller.
  4. The Storage Controller may be able to process the request without going to a disk.
  5. The Storage Controller may pass the request to a disk.
    1. The disk processes the request, and when it has finished notifies the Storage Controller.
  6. The Storage Controller sends the data back to the requestor.
  7. When the storage controller has finished, it sends a “storage controller ended” signal back up the FICON cable.
  8. The I/O processor catches this, and issues a Test Sub CHannel (TSCH) to get the status and performance information for the I/O request.  
  9. The I/O processor notifies the CPU, which then wakes up your application.

What are the major categories of I/O delay?

At a high level, the time spent in an I/O operation falls into the following categories

  1. The time before the request leaves the CPU and enters the I/O subsystem.   
  2. Getting from the I/O subsystem down to the storage controller
  3. The time the storage controller was active
  4. Time spent accessing the disks.
  5. The time between the I/O completing and the CPU processing the status.

Technical terms explaining delays, and possible reasons

Some of these terms are defined here.

  • IOSQ – The time before the request leaves the CPU and enters the I/O subsystem.
    • In z/OS, all paths to the device are busy.   You can define multiple paths to the disks using PAV (Parallel Access Volumes).
    • All the I/O Processors are busy
    • The I/O could be delayed because a higher priority I/O took precedence.
  • Pending – getting from the I/O subsystem down to the storage controller, (the time required to get the storage hardware to initiate an I/O operation).
    • The FICON channel is overloaded and cannot be used, or the channel is busy.
    • This mainframe already has a reserve on the volume.
    • The FICON director (FICON router) is busy.
    • The Command Response measures the delay to get from the I/O subsystem to the Storage Controller and back – think of it as a TCP/IP ping.
    • Device busy delay might mean:
      • Another system is using the volume.
      • Another system reserved the device, but is not actually using it.
  • Connect time – The time the Storage Controller was active.  Processing the data – read or write.  There is connect time talking to the FICON, and also connect time talking to the disks.
    • Control unit queue time: the time spent queuing within the Storage Controller – think of it as an enqueue on a track or cylinder.
    • Transferring data.   Note: data may be multiplexed down a connection.   More connections can slow down a request’s transfer rate.   Think of a road – when there is too much traffic, the traffic slows down.
    • FICON internal chat.
    • The amount of data to process.   The more data, the longer the transfer takes.
  • Disconnect – Time spent accessing the disks.  This could be accessing local disks, or accessing remote (mirrored) disks.  The Storage controller is not doing any work while the disk is busy.
    • The volume is reserved by another system.
    • Waiting for the arm to move, or the disk to rotate (for spinning disks).
    • The disk is processing the request – for example a cache miss means the disk has to be read.
    • Waiting for a signal from a remote peer to say that write data has been stored.
    • Some SMF records have a field “Read Disconnect time”.    This indicates the read wanted a record which was not in the controller cache.
  • Device Active Only time (DAO).   The channel has finished its work, but the disk was busy for a little longer (for example waiting for a remote disk to complete).  This is the additional time after the channel has finished.
  • Service time: The duration between the SSCH and the interrupt at the end of the I/O.
  • Interrupt Delay Time.   This is the delay between the I/O subsystem getting the interrupt, and z/OS issuing the TSCH to get the status.   If your z/OS image is running as an LPAR, this includes time to dispatch your LPAR, and then for your LPAR to issue the TSCH instruction.
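
One way I keep these terms straight is to think of the response time the application sees as roughly the sum of the delays above.  A minimal sketch of the idea in C – the structure and field names are made up for illustration, they are not the real SMF 42 or RMF field names:

#include <stdio.h>

/* Illustrative only: made-up field names, not the real SMF 42 mapping.  */
struct io_times {              /* average times, in microseconds         */
    double iosq;               /* queued in z/OS before the SSCH         */
    double pending;            /* getting down to the Storage Controller */
    double connect;            /* Storage Controller actively working    */
    double disconnect;         /* waiting for the disks or remote mirror */
};

int main(void)
{
    struct io_times t = { 5.0, 15.0, 30.0, 10.0 };   /* made-up numbers  */
    /* the response time seen by the application is roughly the sum      */
    double response = t.iosq + t.pending + t.connect + t.disconnect;
    printf("response time %.1f microseconds\n", response);
    return 0;
}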

Do I need to know this?

Most of the time you do not need to know about disk performance, as most disks are solid state, and data is in cache – but if you have a performance problem it is worth checking that the data sets are not the cause.

In the SMF 42 records you get the following durations (average) in microseconds

  • Response time
    • Pending
      • Initial Command Response (ping)
      • Device busy time
    • Connect time
      • Control Unit Queue
    • Disconnect time
    • Device-Active-Only
    • Disconnect time for reads
  • Response time per random read
  • Service time per random read

Note: Application resume delay is for zHyperLink synchronous I/O.

Setting up striping for MQ logs and other experiences with SMS.

You get improved throughput using striped data sets. Products like MQ and DB2 have logs and page sets which can exploit VSAM striping.   This blog post tells you what you need to configure to be able to use it.

Why do striped logs have higher throughput?

When writing to a dataset, the duration of the request is composed of three parts

  1. Issuing the request and getting it to the IO subsystem on the mainframe.
  2. Getting from the IO subsystem on the mainframe down to the Storage Controller
  3. Transferring the data.

The time depends on the amount of data to transfer.  Striping uses more than one volume,  so less data is written to each device, and so the response time is shorter.

When writing pages to an MQ log which is not striped, all the pages are sent down the channel to one disk.

When using an MQ log with 4 stripes, you have 4 volumes.

  1. Page 1 goes to volume A
  2. Page 2 goes to volume B
  3. Page 3 goes to volume C
  4. Page 4 goes to volume D
  5. Page 5 goes to volume A
  6. Page 6 goes to volume B

The time to send 2 pages to volume A (and similarly for the other volumes) should be less than the time taken to send 6 pages to a single, non-striped volume.
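
A minimal sketch of the round-robin idea in C – the volume letters are just the ones from the example above, and this only illustrates the idea, not how the media manager actually writes the data:

#include <stdio.h>

int main(void)
{
    /* with 4 stripes the pages are spread round-robin across 4 volumes    */
    const char *volume[] = { "A", "B", "C", "D" };
    int stripes = 4;

    for (int page = 1; page <= 6; page++)
    {
        /* page 1 -> volume A, page 2 -> volume B, ... page 5 -> volume A  */
        printf("Page %d goes to volume %s\n", page, volume[(page - 1) % stripes]);
    }
    return 0;
}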

When I worked for IBM I had JCL to define the logs which was set up with striping.   The storage manager person had set up the SMS definitions so it was easy for me.  I recently tried to set up stripes for my MQ logs on my personal z/OS system and I had to set up my own SMS definitions.  In theory it was easy, but it took me a long time because I did not know about one tiny little SMS command.

SMS, Storage Classes, and Data Classes ( and Management classes)

When I first experienced System Managed Storage (SMS) there were lots of new terms to learn.  I’ll give a 10-second summary of SMS.  

  • Datasets have attributes, such as size in MB, record length, is it a PDSE or a sequential file.
  • Disk volumes have attributes.  You want production datasets on new disks, because they are faster than old disks.  You want to keep the backup copy of a dataset on a different set of disks to the original data set.
  • Some datasets you want to back up daily, have multiple backups, and take off site.  Some data sets you back up once a week, and have only one copy.   If the data set has not been used for a month, then migrate it to tape. 

The SMS classes are

  1. Data Class.  Specifies the data set attributes: record length, space allocation, PDSE or sequential file.
  2. Storage Class.  Defines the criteria for the allocation of data sets, for example whether the volumes need to be “dual copy”, or need a performance response time better than a specified value.
  3. Management Class.  Defines the backup and migration policy – for example, back this up daily.

You have ACS routines, which can do processing like: if the data set name matches SYS1.** then set Management Class = frequent_backup; if the data set name matches MQS.**.PROCLIB then use Data Class = BIGALLOC, which allocates 50 cylinders.

You configure the SMS classes using the interactive ISMF tool.

Setting up MQ logs with stripes

You need JCL like

//IDCAMS EXEC PGM=IDCAMS,REGION=4M 
//SYSPRINT DD SYSOUT=*
//SYSIN DD *
DEFINE CLUSTER -
(NAME (MQS.STRIPED.LOGCOPY1.DS01) -
LINEAR -
DATACLAS(DCEXT ) -
STORAGECLASS(SCMQS ) -
VOLUMES( A4USR1 USER00) -
SHAREOPTIONS(2 3) -
CYL(1000) ) -
DATA -
(NAME(MQS.STRIPED.LOGCOPY1.DS01.DATA) )

LISTCAT ENTRIES(MQS.STRIPED.LOGCOPY1.DS01) ALL

 

You need to pick a Data Class which has extended format.

Data Set Name Type . . . . . : EXTENDED
 If Extended . . . . . . . . : REQUIRED

You need to specify a storage class with striping configured. I thought it would be easy to have a field in the Storage Class called “Number of stripes”, but no, it is way more complex than this.

There is a Storage Class field called Sustained Data Rate (SDR). As I see it, the crazy reasoning behind its use is that, at one time, a 3390 could sustain a data rate of 4 MB/second. If you want more than this you clearly need to use more 3390s. By specifying an SDR value of 16 with 3390s, SMS can work out that you want at least 4 stripes – see, crazy! (There is a sketch of this arithmetic after the panel below.)

I changed the storage class to have the SDR value of 16.

Performance Objectives
Direct Millisecond Response . . . : 1
Direct Bias . . . . . . . . . . . :
Sequential Millisecond Response . : 1
Sequential Bias . . . . . . . . . :
Initial Access Response Seconds . :
Sustained Data Rate (MB/sec) . . . : 16
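
A sketch of the arithmetic as I understand it, assuming each 3390 can sustain about 4 MB/second; this is my guess at what SMS does, not taken from the SMS code:

#include <stdio.h>

int main(void)
{
    int sdr        = 16;   /* Sustained Data Rate from the Storage Class, MB/second */
    int per_volume =  4;   /* assumed sustained rate of one 3390, MB/second         */

    /* round up: an SDR of 16 needs at least 4 volumes, i.e. 4 stripes              */
    int stripes = (sdr + per_volume - 1) / per_volume;
    printf("SDR %d MB/s -> at least %d stripes\n", sdr, stripes);
    return 0;
}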


Aside: Another crazy one. If you want to cache a data set – such as a PDSE – there is no storage class switch saying “CACHE=Y|N”. You specify the response time by setting “Sequential MilliSecond Response = 1”. MSR=1 says “this must be cached”, MSR=10 says “this may be cached if SMS has the capacity”. These are only advisories.

You also need to specify the volumes to be used: VOLUMES(A4USR1 USER00). Your ACS may be set up to choose the appropriate volumes (with the right attributes and enough space). My ACS is very simple, and I have to explicitly specify the SMS volume names.

Did it work? No.

I read the books, and it looked like I had done the right thing. I enabled SMS tracing of allocations using the operator command

SETSMS VOLSELMSG(ON) 

This gave me the following messages in the job output

IGD17385I =====SUMMARIZED ANALYSIS MESSAGES ON DEFINING DATA SET MQS.STRIPED.LOGCOPY1.DS01 =====
IGD17386I VOLSELMSG(ON,0) TYPE(ALL) JOBNAME(* ) ASID() 076 STEPNAME( ) DSNAME(* )
IGD17387I DS_TYPE(VSAM) SC(SCMQS) DC(DCEXT) GS(N) SPACE(4000KB) BESTFIT(N) STRIPING(N)
IGD17290I THERE WERE 3 CANDIDATE STORAGE GROUPS OF WHICH THE FIRST 3 078
WERE ELIGIBLE FOR VOLUME SELECTION.
THE CANDIDATE STORAGE GROUPS WERE:SGMQS SGBASE SGEXTEAV
IGD17269I 247 VOLUMES WERE REJECTED BECAUSE THEY WERE NOT ONLINE
IGD17269I 247 VOLUMES WERE REJECTED BECAUSE THE UCB WAS NOT AVAILABLE

I could see the Storage Class, and the Data Class being used, but why did it have STRIPING(N)?

After a day’s worth of struggling, I eventually dumped the SMS definitions with DCOLLECT and wrote a program to format them. I could see the SDR value for the storage class was 0! My changes had not been picked up.

I used the operator command

setsms scds(SYS1.S0W1.DFSMS.SCDS)

to reload the definitions, reran my job, and it worked. Because I still had SMS VOLumeSELectMSG(ON) enabled, the job produced

IGD17387I DS_TYPE(VSAM) SC(SCMQS) DC(DCEXT) GS(N) SPACE(4000KB) BESTFIT(N) STRIPING(Y)

and

IGD17070I DATA SET MQS.STRIPED.LOGCOPY1.DS01
ALLOCATED SUCCESSFULLY WITH 2 STRIPE(S).
IGD17172I DATA SET MQS.STRIPED.LOGCOPY1.DS01
IS ELIGIBLE FOR EXTENDED ADDRESSABILITY
IDC0508I DATA ALLOCATION STATUS FOR VOLUME A4USR1 IS 0
IDC0508I DATA ALLOCATION STATUS FOR VOLUME USER00 IS 0
IDC0181I STORAGECLASS USED IS SCMQS
IDC0181I DATACLASS USED IS DCEXT
IDC0001I FUNCTION COMPLETED, HIGHEST CONDITION CODE WAS 0

The listcat command

LISTCAT ENTRIES(MQS.STRIPED.LOGCOPY1.DS01) ALL

gave me

STRIPE-COUNT-----------2
...
VOLUME                      
   VOLSER------------A4USR1  
   STRIPE-NUMBER----------1  
 VOLUME                      
   VOLSER------------USER00  
   STRIPE-NUMBER----------2  

Success!

What’s the difference between a PDS and a PDSE?

I’ve been using PDSEs for years. I thought that a PDSE was a slight improvement on a PDS, in that you did not have to compress PDSEs like you had to with PDSs, and that binding programs requires a PDSE.

I’ve found there is a big difference. IBM documents it here. For me the differences are

  1. A PDSE can be larger than a PDS – it can have more extents.
  2. When you delete a member from a PDS, the space is not reclaimed.  When you add a member to a PDS it uses up free space “from the free end”.  When the PDS is full you have to compress it, and reorganise the space.  With a PDSE the data is managed in 4KB pages.  When a member is deleted the space is available immediately
  3. With a PDS you can get “directory full”, if you did not allocate enough directory blocks when you created the data set.  With a PDSE, if it needs a new “directory block” it gets any free block.
  4. The directory of a PDS is in create order.  To find a member you have to search the directory.  With a PDSE the directory is indexed.  (There is a small sketch of the difference after this list.)
  5. With a PDS only one thread can update it at a time.  With a PDSE, multiple tasks can update it – including in a sysplex.
  6. Old fashioned link edits can go into a PDS or a PDSE.   The binder (the enhanced linkage editor) can only store its modules in a PDSE.  One reason is that there is more information in the directory entry.
  7. PDSEs are faster.   When you read a PDS there is I/O to the disk: first to get the directory blocks and search for the entry, then to read the member from disk.  With a PDSE, the system address space SMSPDSE may have cached the directory entries, or the pages themselves, and so eliminates the need for I/Os.  Even if it is not cached, the directory search may be shorter.
  8. Some system load libraries have to be PDS and not PDSE, as the PDSE code may not be loaded early in the IPL.
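
The directory difference in point 4 is easy to see with a small made-up example in C: a create-order directory has to be scanned entry by entry, an indexed one can be searched directly.  The member names are invented, and this only illustrates the idea – it is not how the PDSE index is really implemented.

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Illustrative only: a PDS directory is in create order and has to be      */
/* scanned; a PDSE directory is indexed, so it behaves more like the sorted */
/* array searched with bsearch() below.                                     */
static const char *createOrder[] = { "OLDJOB", "PROCA", "ASMJCL", "COBOL1" };
static const char *sortedIndex[] = { "ASMJCL", "COBOL1", "OLDJOB", "PROCA" };

static int compare(const void *key, const void *element)
{
    return strcmp((const char *)key, *(const char **)element);
}

int main(void)
{
    const char *wanted = "PROCA";
    size_t members = 4;

    /* PDS style: read directory entries until the member is found          */
    for (size_t i = 0; i < members; i++)
        if (strcmp(createOrder[i], wanted) == 0)
            printf("PDS style : found %s after reading %zu entries\n", wanted, i + 1);

    /* PDSE style: the indexed directory can be searched directly           */
    const char **hit = bsearch(wanted, sortedIndex, members,
                               sizeof(const char *), compare);
    if (hit != NULL)
        printf("PDSE style: found %s via the index\n", *hit);
    return 0;
}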

You can find out more about PDSEs here.

Is this disk synchronous IO really new?

There is synchronous write, and there is synchronous write – you just have to know whether you are talking about a synchronous write as seen by an application, or one issued by the z/OS operating system.  The synchronous disk I/O used by the I/O subsystem is new (a couple of years old); it is known as zHyperLink.

If you know about sync and async requests to a coupling facility – you already know the concepts.

The application synchronous IO

For the last 40 years an application did an I/O request which looked to be synchronous: immediately after a read I/O request had finished you could use the data.  What happens under the covers is as follows (there is a small sketch after the list):

  1. Issue the IO request to read from disk – for example a c fread function.
  2. The operating system determines which disk to use, and where on the disk to read from.
  3. The OS issues the IO request.
  4. The OS dispatcher suspends the requesting task.
  5. The OS dispatcher dispatches another task.
  6. When the I/O request has completed, it signals an I/O-complete interrupt.  This interrupts the currently executing program, and a flag is set saying the original task is now dispatchable.
  7. The dispatcher resumes the original task which can now process the data.
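
A trivial C sketch of what the application sees – the data set name is just an example, and error handling is minimal:

#include <stdio.h>

int main(void)
{
    char buffer[4096];

    /* the data set name is only an example                                  */
    FILE *f = fopen("//'COLIN.TEST.DATA'", "rb");
    if (f == NULL) return 1;

    /* this looks synchronous: under the covers the task is suspended and    */
    /* redispatched, but control only returns when the data is available     */
    size_t bytes = fread(buffer, 1, sizeof(buffer), f);
    printf("read %zu bytes\n", bytes);

    fclose(f);
    return 0;
}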

40 years ago a read could take over 20 milliseconds.

A short history of disk IO.

Over 40 years disks have changed from big spinning disks – 2 meters high – to PC-sized disks with many times the capacity.

  • With early disks the operating system had to send commands to move the heads over the disk, and to wait until the right part of the disk passed under the head.
  • The addition of read cache to the disks. For hot data, the wanted record may be in the cache – avoiding having to read from the physical disks.
  • Adding write cache – and a battery meant there were two steps. 1) Send the data to the write cache, 2) move the data from the write cache to the spinning disks
  • The use of PC type disks – and solid state disks with no moving parts.

These all relied on the same model: start an I/O request, then wait for the I/O-complete interrupt.

The coupling facility

The coupling facility (CF) is a machine with global shared memory, available to systems in a sysplex.

When this was being developed the developers found that it was sometimes quicker to issue an I/O instruction and wait for it to complete, than to use the model above of starting an I/O and waiting for the interrupt. The “issue the I/O instruction and wait” (synchronous) request might take 50 microseconds. The “start the I/O, wait, and process the interrupt” (asynchronous) request might take 1000 microseconds.

How long does the synchronous instruction take? – How long is a piece of string?

Most of the time spent in the synchronous instruction is the time on the cable between the processor and the disk controller – a problem with the speed of light. If the distance is long (a long bit of cable), the instruction takes too long, and it is more efficient to use the Async model to communicate to the CF. Use a shorter cable (it may mean moving the CF closer to the CPU) and the instruction is quicker.
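
A sketch of the trade-off as I understand it – the numbers are made up for illustration and are not measurements:

#include <stdio.h>

int main(void)
{
    /* cost of suspending the task and redispatching it after the interrupt   */
    double async_overhead_us = 1000.0;    /* illustrative value                */

    /* expected duration of the request itself, dominated by the cable length */
    double expected_sync_us  = 50.0;      /* illustrative value                */

    /* spinning synchronously only pays off when it is cheaper than the       */
    /* suspend/redispatch overhead                                             */
    if (expected_sync_us < async_overhead_us)
        printf("issue the request synchronously and wait\n");
    else
        printf("start the request and wait for the interrupt\n");
    return 0;
}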

How about synchronous disk IO?

The same model can be used with disk I/O. The underlying technology (e.g. DB2) had to change the way it does I/O to exploit this.

When used for disk read – the data is expected to be in the disk controller cache. If not then the request will time out, and an Async request will be made.

This can be used for disk write, to put the data into the disk controller cache, but this may not be as useful. If you are mirroring your logs, with local disks and remote disks, the I/O as seen by DB2 will not complete until the local and remote I/Os have completed. Just like the CF, it means the DASD controller (3990) needs to be close to the CPU.

I found Lightning Fast I/O via zHyperLink and Db2 for z/OS Exploitation a good article which mentions synchronous IO.

IO statistics

I noticed that in older releases of z/OS, I/O response times were in units of 128 microseconds. For example, when an I/O finishes the response contains the I/O delays in the different I/O stages. In recent releases, the I/O response times are now in microseconds: you may get response times down to tens of microseconds, and reporting them in units of 128 microseconds is not accurate enough.

The bear traps when using enclaves

I hit several problems when trying to use the enclave support.

In summary

  1. The functions to set up and use an enclave are available from C, but the functions to query and display usage are not available from C (and so not available from Java).
  2. Some functions caused an infinite loop because they overwrote the save area.
  3. Not all classify functions are available in C.  For example ClientIPAddr
  4. I had problems in 64 bit mode.
  5. Various documentation problems
  6. It is not documented that you need to pass the connection token using __server_classify(_SERVER_CLASSIFY_CONNTKN, (char *) &connToken).  If you do not, you get errno2=0x0330083B: Home address space does not own the connect token from the input parameter list.
  7. You can query the CPU used by your enclave using the IWMEQTME macro (in supervisor state!). I had to specify CURRENT_DISP=YES to cause the dispatcher to be called to update the CPU figures.  By default the CPU usage figures are updated at the end of a dispatch cycle.  On my low use system, my transactions were running without being redispatched, and so the CPU “used” was reported as 0.

In more detail…

Minimum functionality for C programs.

You cannot obtain the CPU used by the enclaves from a C program, as the functions are not defined.  I had to write my own assembler code to call the assembler macros to obtain the information.  Some of these macros require supervisor state.

Many macros clobber the save area

Many macros use a Program Call to execute the function.  Other functions, such as IWMEQTME, use a BASR instruction, and the called code then does a standard save of the registers.  This means that your code needs to provide its own save area.  Without one, the caller’s save area was reused, the saved registers were overwritten, and the “branch back to caller” at the end just branched to the instruction after the macro – giving an infinite loop.

Instead of a function like

EDEL     RMODE  ANY 
EDEL     AMODE  31 
EDEL     CSECT 
          USING *,12 
          STM  14,12,12(13) 
          LR   12,15 
          L    6,0(1)  the work area  
          L    2,4(1)  ADDRESS OF THE passed data              
          IWM4EDEL ETOKEN=0(2),MF=(E,0(6),COMPLETE),                   XX 
                CPUTIME=8(2),ZAAPTIME=16(2),ZIIPTIME=24(2),            XX 
                RSNCODE=32(2),RETCODE=36(2) 
          LM   14,12,12(13) 
          SR   15,15 
          BR   14 

I needed to add in code to create a save area, for example with a different macro

QCPU     RMODE  ANY 
QCPU     AMODE  31 
QCPU     CSECT 
** CAUTION THE IWMEQTME CORRUPTS SAVE AREA SO PROGRAM NEEDS ITS OWN
** SAVE AREA 
      USING *,12 
      STM  14,12,12(13) 
      LR   2,1 
      LR   12,15 
      LA    0,WORKLEN 
      STORAGE OBTAIN,LENGTH=(0) 
      ST     1,8(,13) FORWARD CHAIN IN PREV SAVE AREA 
      ST     13,4(,1) BACKWARD CHAIN IN NEXT SAVE AREA 
      LR     13,1     SET UP NEW SAVE AREA/REENTRANT WORKAREA 
      L    2,0(2)  ADDRESS OF THE CPUTIME 
      IWMEQTME CPUTIME=8(2),ZAAPTIME=16(2),ZIIPTIME=24(2),          X 
            CURRENT_DISP=YES,                                       X 
            RSNCODE=4(2),RETCODE=0(2),MF=(E,32(2),COMPLETE) 
      LR   3,15 
* free the register save area
      LR     1,13               ADDRESS TO BE RELEASED 
      L     13,4(,13)          ADDRESS OF PRIOR SAVE AREA 
      LA    0,WORKLEN           LENGTH OF STORAGE TO RELEASE 
      STORAGE RELEASE,           RELEASE REENTRANT WORK AREA        X 
            ADDR=(1),            ..ADDRESS IN R1                    X 
            LENGTH=(0)           ..LENGTH IN R0 
      L    14,12(13) 
      LR   15,3                RESTORE THE RETURN CODE FROM THE MACRO 
      LM   0,12,20(13) 
      BR   14   

Problems using a 64 bit program

I initially had my C program in 64 bit mode. This caused problems when I wrote some stub code to use the assembler interface: the assembler macros are supported in AMODE 31, but my program and storage areas were 64 bit, and the assembler code had problems.

Various documentation problems

  1. It is not documented that you need to pass the connection token using __server_classify(_SERVER_CLASSIFY_CONNTKN, (char *) &connToken).  If you do not, you get errno2=0x0330083B: Home address space does not own the connect token from the input parameter list.
  2. The documentation for _SERVER_CLASSIFY_SUBSYSTEM_PARM says “Set the transaction subsystem parameter. When specified, value contains a NULL-terminated character string of up to 255 characters containing the subsystem parameter being used for the __server_pwu() call.”  This applies to __server_classify() as well as __server_pwu().  The same applies to _SERVER_CLASSIFY_TRANSACTION_CLASS, _SERVER_CLASSIFY_TRANSACTION_NAME and _SERVER_CLASSIFY_USERID.
  3. Getting the report and service class back from __server_classify
    1. It is  _SERVER_CLASSIFY_SRVCLSNM not _SERVER_CLASSIFY_SERVCLSNM.
    2. The documentation lists _SERVER_CLASSIFY_RPTCLSNM@, _SERVER_CLASSIFY_SERVCLS@ and _SERVER_CLASSIFY_SERVCLSNM@, but you use the names without the @ at the end.   I think the @ is meant to imply that these are pointers.
    3. They did not work for me.  I could not see when the fields are available.   The classify work is only done during the CreateWorkUnit() request.  I requested the fields before this function and after it, and only got back a string of hex zeros.

Using enclaves in a Java program – capturing elapsed and CPU time used by a Java transaction.

I’ve blogged about using enclaves from a C program.  There is an interface from Java which uses this C interface.

It is relatively easy to use enclave services from a Java program, as there are Java classes for most of the functions, available from the JZOS toolkit.  For example the WorkloadManager class is defined here.

Below is a program I used to get the Work Load Manager(WLM) services working.

import java.util.concurrent.TimeUnit;
import com.ibm.jzos.wlm.ServerClassification;
import com.ibm.jzos.wlm.WorkUnit;
import com.ibm.jzos.wlm.WorkloadManager;

public class main
{
    // run it with /usr/lpp/java/J8.0_64/bin/java main
    public static void main(String[] args) throws Exception
    {
        WorkloadManager wlmToken = new WorkloadManager("JES", "SM3");
        ServerClassification serverC = wlmToken.createServerClassification();
        serverC.setTransactionName("TCI3");
        for (int j = 0; j < 1000; j++)
        {
            WorkUnit wU = new WorkUnit(serverC, "MAINCP");
            wU.join();
            float f;
            // burn a little CPU
            for (int i = 0; i < 1000000; i++)
            {
                f = i * i * 2;
            }
            TimeUnit.MICROSECONDS.sleep(20 * 1000); // 20 milliseconds
            wU.leave();
            wU.delete(); // end the workload
        }
        wlmToken.disconnect();
    }
}

The WLM statements are explained below.

WorkloadManager wlmToken = new WorkloadManager("JES", "SM3");

This connects to the Work Load Manager and returns a connection token.    This needs to be done once per JVM.  You can use any relevant subsystem type; I used JES, and a Subsystem Instance (SI) of SM3.  As a test, I created a new subsystem category in WLM called DOG, used that, and defined Subsystem Instance SM3 within DOG – and it worked.

z/OS uses subsystems such as JES for jobs submitted into JES2, and STC for started tasks.

ServerClassification serverC = wlmToken.createServerClassification();

If your application is going to classify the transaction to determine the WLM service class and reporting class, you need this.  You create it, then add the classification criteria to it – see the following section.

Internally this passes the connection token wlmToken to the createServerClassification function.

serverC.setTransactionName("TCI3");

This passes information to WLM to determine the best service class and reporting class.  Within Subsystem CAT, Subsystem Instance SM1, I had a sub rule TransactionName (TN) with a value TCI3.  I defined the service class and a reporting class.

WorkUnit wU = new WorkUnit(serverC, "MAINCP");

This creates the Independent (business transaction) enclave.  I have not seen the value MAINCP reported in any reports.   This invokes the C run time function CreateWorkUnit(). The CreateWorkUnit function requires a STCK value of when the work unit started; the Java code does this for you and passes the STCK through.

wU.join();

This connects the current task to the enclave, and any CPU it uses will be recorded against the enclave. 

wU.leave();

This disconnects the current task from the enclave.  After this call any CPU used by the thread will be recorded against the address space.

wU.delete();

The Independent enclave (business transaction) has finished. WLM records the elapsed time and resources used for the business transaction.

wlmToken.disconnect();

The program disconnects from WLM.

Reporting class output.

I used RMF to print the SMF 72 records for this program.   The Reporting class for this program had

-TRANSACTIONS--  TRANS-TIME HHH.MM.SS.FFFFFF 
AVG        0.29  ACTUAL                36320 
MPL        0.29  EXECUTION             35291 
ENDED       998  QUEUED                 1028 
END/S      8.31  R/S AFFIN                 0 
#SWAPS        0  INELIGIBLE                0 
EXCTD         0  CONVERSION                0 
                 STD DEV               18368 
                                             
----SERVICE----   SERVICE TIME  ---APPL %--- 
IOC           0   CPU   12.543  CP      0.01 
CPU       10747   SRB    0.000  IIPCP   0.01 
MSO           0   RCT    0.000  IIP    10.44 
SRB           0   IIT    0.000  AAPCP   0.00 
TOT       10747   HST    0.000  AAP      N/A 

From this we can see that for the interval

  1. 998 transactions ended.  (Another report interval had 2 transactions ending)
  2. the response time was an average of 36.3 milliseconds
  3. a total of 12.543 seconds of CPU was used.
  4. it spent 10.44 % of the time on a ZIIP.
  5. 0.01 % of the time it was executing ZIIP eligible work on a CP as there was no available ZIIP.

Additional functions.

The functions below

  • ContinueWorkUnit – for dependent enclave
  • JoinWorkUnit – as before
  • LeaveWorkUnit – as before
  • DeleteWorkUnit – as before

can be used to record CPU against the dependent (Address space) enclave.  There is no WLM classify for a dependent enclave.

Java threads and WLM

A common application pattern is to use connection pooling, because the connect/disconnect to a database or MQ is expensive.  If you have a pool of threads which connect at start-up and stay connected, an application can request a thread and get one which has already been connected to the resource manager.

It should be a simple matter of changing the interface from

connectionPool.getConnection()

to

connectionPool.getConnection(WorkUnit wU)
{
   connection = connectionPool.getConnection();
   connection.join(wU);
   return connection;
}

and adding a connection.leave(wU) call to releaseConnection().

Using enclaves in a C program -capturing elapsed and CPU time used by a transaction.

On z/OS enclaves allow you to set the priority of business transactions within your program, and to record the CPU used by the threads involved in the transaction – even if they are in a different address space.

Many of the WLM functions are assembler macros which require supervisor state. There are C run time functions which do some of the WLM functions. There are also Java methods which invoke the C run time functions.

Not all of the WLM functions are available in the C run time environment. You can set up enclaves, but the application cannot query information about the enclave, such as total CPU used, or the WLM parameters.

Minimal C program

The minimal program is below, and the key WLM functions are explained afterwards.

#include <sys/__wlm.h>
int main(void) { 
  wlmetok_t  enclavetoken; 
  server_classify_t   classify; 
  long rc; 
  // 
  //  Connect to the work load manager 
  // 
  unsigned int  connectToken = ConnectWorkMgr("JES","SM3");
  classify = __server_classify_create( ); 
  // pass the connection token to the classify.
  // This is needed but not documented. 
  rc = __server_classify(classify, 
                         _SERVER_CLASSIFY_CONNTKN, 
                         (char *) &connectToken 
       ); 
  // classify the business transaction as TCI2
  rc = __server_classify(classify, 
                         _SERVER_CLASSIFY_TRANSACTION_NAME, 
                         "TCI2" 
       ); 
  for ( int loop=0;loop < 1000 ;loop++) 
  { 
    rc= CreateWorkUnit(&enclavetoken, 
                       classify, 
                       NULL, 
                       "COLINS"   ); 
    rc = JoinWorkUnit(&enclavetoken); 
 
    // do some work to burn some CPU 
    for ( int i = 0;i< 100000 ;i++) 
    { 
      double xx = i/0.5; 
      double yy = xx * xx; 
    } 
    rc = LeaveWorkUnit(&enclavetoken); 
    rc = DeleteWorkUnit(&enclavetoken);   
  } 
  rc = DisconnectServer(&connectToken);
  return 0;
}
 

What are the key functions?

unsigned int connectToken = ConnectWorkMgr("JES","SM3"); 

This creates a connection to WLM, using the subsystem type JES and subsystem name SM3.  Note: on my system it is JES, not JES2.   The WLM ISPF dialog, option 6 (Classification Rules), lists the subsystem types available.  You can browse a subsystem type and see the available definitions.  I had

         -------Qualifier--------                 -------Class--------  
Action   Type Name     Start        Service     Report   
 ____  1 SI   SM3      ___          TCI1SC      THRU
 ____  2   TN  TCI3    ___          TCI1SC      TCI3
 ____  2   TN  TCI2    ___          TCI1SC      TCI2

server_classify_t  classify = __server_classify_create( );

CreateWorkUnit, the function used  to create an independent enclave (business transaction), needs to be able to classify the transaction to determine what service class (priority) to give the enclave.  This request sets up the classify control block.

rc = __server_classify(classify, _SERVER_CLASSIFY_CONNTKN, (char *)&connectToken );

The documentation does not tell you to pass the connection token.  If you omit this step the CreateWorkUnit fails with error code errno2=0x0330083B.

The __server_classify expects a char * as the value, so you have to use (char *) & connectionToken.

rc = __server_classify(classify, _SERVER_CLASSIFY_TRANSACTION_NAME, "TCI2" ); 

This tells WLM about the transaction we want to use.  TRANSACTION_NAME matches up with TN above in the WLM definitions.  This says the business transaction is called TCI2.  There are other criteria such as userid, plan or LU. See here for the list.   The list is incomplete, as it does not support classifiers like Client IP address which is available with the assembler macros.

rc= CreateWorkUnit(&enclavetoken, classify,  NULL, "COLINS" );

This uses the classification parameter defined above, to create the independent enclave, and return the enclave token. 

The documentation for CreateWorkUnit says you need to pass the arrival time: the address of a doubleword (unsigned long long) field that contains the arrival time of the work request in STCK format.  I created a small assembler function which just returned a STCK value to my C program.  However, I passed NULL and it seemed to produce the correct values – I think CreateWorkUnit does a STCK for you.

You pass in the name of the function ("COLIN").  The only place I have seen this used is if you call QueryWorkUnitClassification() to extract the classify information.  For example QueryWorkUnitClassification gave a control block with non-empty fields _ecdtrxn[8]=TCI2, _ecdsubt[4]=JES, _ecdfcn[8]=COLIN, _ecdsubn[8]=SM3.  This function does not return the report class or service class.

rc = JoinWorkUnit(& enclavetoken);

This causes any work this TCB does to be recorded against the enclave.

rc = LeaveWorkUnit(&enclavetoken);

This stops work being recorded against the enclave.  Subsequent work gets charged to the home address space.

rc= DeleteWorkUnit(& enclavetoken);

The business transaction has finished.  Information about the response time and the CPU used is recorded.

rc =  DisconnectServer(&connectToken);

This disconnects from WLM.

Using a subtask

I had a different thread in the program which did some work for the transaction. Using the enclave token, this work can be recorded against the transaction using

// The enclave token is passed with the request
rc = JoinWorkUnit(&enclaveToken); 
do some work...
rc = LeaveWorkUnit(&enclaveToken);  

This would be useful if you are using a connection pool to connect to an MQ or DB2 subsystem. You have a pool of threads which have done the expensive connect with a particular userid, and the thread is used to execute the MQ or DB2 work as that userid.

Other functions available

Dependent (address space) enclave

The above discussion was for a business transaction, known in the publications as an Independent enclave.   An address space can have a Dependent enclave, where the CPU is recorded as “Dependent Enclave” within the address space.  You use the function ContinueWorkUnit(&enclave) to return the enclave token.    You then use JoinWorkUnit and LeaveWorkUnit as before.  I cannot see why you might want to use this.
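
A minimal C sketch of those calls, based on the description above; error handling is omitted and the return codes are not checked:

#include <sys/__wlm.h>

int main(void)
{
    wlmetok_t enclave;
    long rc;

    /* get the token for this address space's dependent enclave                     */
    rc = ContinueWorkUnit(&enclave);

    rc = JoinWorkUnit(&enclave);    /* CPU now recorded against the dependent enclave */
    /* ... do some work ... */
    rc = LeaveWorkUnit(&enclave);   /* CPU charged to the address space again         */

    rc = DeleteWorkUnit(&enclave);
    return 0;
}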

Display the classification

You can use the QueryWorkUnitClassification to return a structure for the classification.

Reset the classification.

If you want to classify a different transaction, you can use server_classify_init() to reset the structure.

Set up a server

You can set up a server where your application puts work onto WLM queues, and other threads can get work.   This is an advanced topic which I have not looked into.

Make your enclave visible across the sysplex

You can use ExportWorkUnit  and ImportWorkUnit to have your enclave be visible in the sysplex.

Query what systems in the sysplex are running in goal mode.

You can use QueryMetrics() to obtain the systems in the sysplex that are in goal mode. This includes  available CPU capacity and resource constraint status. 

What is not available to the C interface

One reason why I was investigating enclaves was to understand the enclave data in the SMF 30 records.  There is an assembler macro, IWMEQTME, which returns the CPU, zIIP and zAAP times used by the independent enclaves.  Unfortunately this requires supervisor state.    I wrote some assembler code to extract and display the data.  Another complication is that these WLM macros are AMODE 31 – so it did not work with my 64 bit C program.

Enclaves in practice. How to capture all the CPU your application uses, and knowing where to look for it.

z/OS has enclaves to manage work.  

  1. When an enclave is used, a transaction can issue a DB2 request, and DB2 uses some TCBs in the DB2 address spaces on behalf of the original request.  The CPU used by these DB2 TCBs can be charged back to the original  application. 
  2. When an enclave is not used, the CPU used by the original TCB is charged to its address space, and the DB2 TCBs are charged to the DB2 address space.  You do not get a complete picture of the CPU used by the application.
  3. A transaction can be defined to Work Load Manager (WLM) to set the priority of the transaction, so online transactions have high priority, and background work gets low priority.    With an enclave, the DB2 TCBs have the same priority as the original request.  With no enclave the TCBs have the priority determined by DB2.

When an application sets up an enclave

  1. Threads can join the enclave, so any CPU the thread uses while in the enclave, is recorded against the enclave.
  2. These threads can be in the same address space, a different address space on the same LPAR, or even in a different LPAR in the SYSPLEX.
  3. Enclaves are closely integrated with Work Load Manager(WLM).   When you create an enclave you can give information about the business transaction, (such as transaction name and  userid).   You classify the application against different factors. 
  4. The classification maps to a service class.   This service class determines the appropriate priority profile.  Any threads using the enclave will get this priority.
  5. WLM reports on the elapsed time of the business transaction, and the CPU used.

What enclave types are there?

In this simple explanation there are two enclave types

  1. Independent enclave – what I think of as Business Transaction, where work can span multiple address spaces.  You pass transaction information (transaction, userid, etc) to WLM so it can set the priority for the enclave. You can get reports on the enclave showing elapsed time, and CPU used.  There can be many independent enclaves in the lifetime of a job.  You can have these enclaves running in parallel within a job.
  2. Dependent enclave or Address space enclave.   I cannot see the reason for this.  This is for tasks running within an address space which are not doing work for an independent enclave.  It could be used for work related to transactions in general.   In the SMF 30 job information records you get information on CPU used in the dependent enclave.  
  3. Work not in an enclave.  Threads by default run with the priority assigned to the address space.  CPU is charged to the address space.

To help me understand enclave reports, I set up two jobs

  1. The parent job,
    1. Creates an independent (transactional) enclave with “subsystem=JES, definition=SM3” and “TRANSACTION NAME=TCI2”.  It displays the enclave token.
    2. Sets up a dependent enclave.
    3. Joins the dependent enclave.
    4. Does some CPU intensive work.
    5. Sleeps for 30 seconds.
    6. Leaves the dependent enclave.
    7. Deletes the dependent enclave.
    8. Deletes the independent enclave.
    9. Ends.
  2. The child, or subtask, job.
    1. This reads the enclave token as a parameter.
    2. Joins the enclave; if the enclave does not exist, it uses the dependent enclave.
    3. Does some CPU intensive work.
    4. Leaves the enclave.
    5. Ends.

Where is information reported?

  1. Information about a job and the resources used by the job is in the SMF 30 records.  They report total CPU used, CPU used by independent enclaves, and CPU used by the dependent enclave.  The JCL output which reports the CPU etc. used by the job step comes from the SMF 30 record.
  2. Information about the independent enclave is summarised in an SMF 72 record over a period(typically 15 minutes) and tells you information about the response time distribution, and the CPU used.

I used three scenarios

  1. Run the parent job – but not the child.  This shows the costs of just the parent, when there is no child running the workload for it.
  2. Run the child but not the parent.   This shows the cost of the expensive workload.
  3. Both parent and child active.   This shows that the costs of running the independent enclave work in the child are charged to the parent. 

SMF job resource used report

From the SMF 30 record we get the CPU for the Independent enclave.

                   Parent CPU                        Child CPU
Parent, no child   Total               : 0.070       Not applicable
                   Dependent enclave   : 0.020
Child, no parent   Not applicable                    Total CPU         : 2.930
                                                     Dependent enclave : 2.900
                                                     Non enclave       : 0.030
Parent and child   Total               : 2.860       Total             : 0.020
                   Independent enclave : 2.820       No enclave CPU reported
                   Dependent enclave   : 0.010
                   Non enclave         : 0.030

From the parent and child we can see that the CPU used by the enclave work in the child job has been recorded against the parent’s job under “Independent enclave CPU”.

The SMF type 30 record shows the Parent job had CPU under Independent enclave, Dependent enclave, and a small amount (0.03) which was not enclave.

SMF WLM reports

From the SMF 72 data displayed by RMF (see below for an example) you get the number of transactions and the CPU usage for each report class and service class. I had a report class for each of the parent job, the child job, and the Independent enclave transaction.

                       Total CPU (seconds)   Elapsed time (seconds)   Ended
Parent                 2.818                 34.97                    1
Child                  2.636                  6.76                    1
Business transaction   2.819                 30.04                    1

It is clear there is some double accounting. The CPU used by the child doing enclave processing is also recorded in the Parent’s cost. The CPU used for the Business transaction is another view of the data from the parent and child address spaces.

For charging based on address spaces you should use the SMF 30 records.

You can use the SMF 72 records for reporting on the transaction costs.

RMF workload reports

When processing the SMF data using RMF you get workload reports

//POST EXEC PGM=ERBRMFPP 
//MFPINPUT DD DISP=SHR,DSN=smfinput.dataset  
//SYSIN DD * 
SYSRPTS(WLMGL(SCPER,RCLASS,RCPER,SCLASS)) 
/* 

For the child address space report class RMF reported

-TRANSACTIONS--  TRANS-TIME HHH.MM.SS.FFFFFF
 AVG        0.05  ACTUAL             6.760380
 MPL        0.05  EXECUTION          6.254239
 ENDED         1  QUEUED               506141
 ...
 ----SERVICE----   SERVICE TIME  ---APPL %---
 IOC        2152   CPU    2.348  CP      2.20
 CPU        2012   SRB    0.085  IIPCP   0.00
 MSO         502   RCT    0.004  IIP     0.00
 SRB          73   IIT    0.197  AAPCP   0.00
 TOT        4739   HST    0.002  AAP      N/A

There was 1 occurrence of the child job, it ran for 6.76 seconds on average, and used a total of 2.636 seconds of CPU (if you add up the service time).

For a more typical job using many short duration independent enclaves the report looked like

-TRANSACTIONS--  TRANS-TIME HHH.MM.SS.FFFFFF 
 AVG        0.11  ACTUAL                13395 
 MPL        0.11  EXECUTION             13395 
 ENDED      1000  QUEUED                    0 
 END/S      8.33  R/S AFFIN                 0 
 SWAPS         0  INELIGIBLE                0
 EXCTD         0  CONVERSION                0 
                  STD DEV                1325 
 ----SERVICE----   SERVICE TIME  
 IOC           0   CPU    1.448   
 CPU        1241   SRB    0.000   
 MSO           0   RCT    0.000  
 SRB           0   IIT    0.000  
 TOT        1241   HST    0.000  

This shows 1000 transactions ended in the period and the average transaction response time was 13.395 milliseconds. The total CPU time used was 1.448 seconds, or an average of 1.448 milliseconds of CPU per transaction.

For the service class with a response time definition, you get a response time profile. The data below shows that most response times were between 11 and 15 ms. The service class was defined with “Average response time of 00:00:00.010”. This drives the range of response times reported. If this data was for a production system you may want to adjust the “Average response time” to 00:00:00.015 to get the peak in the middle of the range.

-----TIME--- -% TRANSACTIONS-    0  10   20   30   40   50 
    HH.MM.SS.FFF CUM TOTAL BUCKET|...|....|....|....|....|
 <= 00.00.00.005       0.0  0.0   >                           
 <= 00.00.00.006       0.0  0.0  >                           
 <= 00.00.00.007       0.0  0.0  >                           
 <= 00.00.00.008       0.0  0.0  >                           
 <= 00.00.00.009       0.0  0.0  >                           
 <= 00.00.00.010       0.0  0.0  >                           
 <= 00.00.00.011       0.3  0.3  >                           
 <= 00.00.00.012      10.6 10.3  >>>>>>                      
 <= 00.00.00.013      24.9 14.3  >>>>>>>>                    
 <= 00.00.00.014      52.3 27.4  >>>>>>>>>>>>>>              
 <= 00.00.00.015      96.7 44.4  >>>>>>>>>>>>>>>>>>>>>>>     
 <= 00.00.00.020     100.0  3.2  >>                          
 <= 00.00.00.040     100.0  0.1  >                           
    00.00.00.040     100.0  0.0  >                                                                                

Take care.

The field ENDED is the number of transactions that ended in the interval. If you have a measurement that spans an interval, you will get CPU usage in both intervals, but “ENDED” only when the transaction ends. With a large number of transactions this effect is small. With few transactions it could cause divide by zero exceptions!