z/OS Connect can provide two types of SMF record
- SMF 120 subtype 11, provided by the base Liberty support. This gives information on the URL used to access Liberty, and the CPU used to perform requests. This is enabled at the Server level – so you can have records for all request, or no requests. There is one SMF record for each web server request. Would I use this to report CPU used ? No – see the bottom of this blog.
- SMF 123 provides information about API and the service used, and the “pass through” services. It provides elapsed time of the request, and of the the “pass through” requests. It does not provide CPU usage figures. This can be configured to produce records depending on the http host and port used to access z/OS Connect. One SMF record can have data for multiple web server requests. The SMF records are produced when the SMF record is full – or the server is shut down.
The SMF 120-11 and SMF 123 records are produced independently, and there is no correlating field between them. They both have a URI field, and time stamps, so at low volumes it may be possible to correlate the SMF data.
I’ll document the fields which I think are interesting. If you think other fields are useful please let me know and I’ll update this document.
I have written an SMF formatter in C which prints out interesting data, and summarises it.
- You get the standard date and time the record was produced, and with the LPAR. You can use PGM=IFASMFDP with the following to filter which records are copied
DATE(2020282,2020282) START(1000) START(2359)
- There is server version (3), system(SOW1), and job id(STC04774) which are not very interesting
- Server job name(SM3) is more interesting. I started the server with s baqstrt,parms=’MQTEST’,jobname=sm3
- The config dir (/var/zosconnect/servers/MQTEST/) is boring – as is code level (220.127.116.11)
- The start and stop times (2020/10/08 09:18:19.852360 and 2020/10/08 09:18:22.073946) are interesting as is the duration – which is the difference between them.
- You get userid information.
- I used a certificate to authenticate. The DN from the certificate is not available. You only get the userid from the RACF mapping of DN to userid. This mapped userid was in the 64 byte field. The 8 byte userid field was empty for me. The lack of certificate DN, and having the userid in the wrong field feels like a couple of buglets.
- You get the URL used /stockmanager/stock/items/999999 I treat this as a main key for processing and summarising data. If you want to summarise the data, you may want so summarise it just on /stockmanager/stock/. The full URI contains the part number – and so I would expect a large number of parts.
- You can configure your requests to WLM. For example
<wlmClassification> <httpClassification transactionClass="TCI1" method="GET" resource="/zosConnect/services/stockQuery"/> </wlmClassification>
This produced in the SMF record
WLMTRan :TCI1 WLM Classify type :URI :/zosConnect/services/stockQuery WLM Classify type :Target Host :10.1.3.10 WLM Classify type :Target Port :19443
This means that the URL, the host, and the port were passed to WLM to classify.
If you get the WLM classification you also get CPU figures when the enclave request ended (was deleted).
- You get the ports associated with the request.
- Which port was used on the server – Target Port :9443
- Where did the request come from? Origin :10.1.1.1 and port :36786
- The number of bytes in the response Response bytes :791
- CPU figures for the CPU used on the TCB. See discussion below on the usefulness of this number. You get the CPU figures before the request, and after the request – so you have to calculate the difference yourself! It does not look like the developers knew what the end user wanted. The values come from the timeused facility. You can calculate the delta and get
- CPU Used Total : 0.967417
- CPU Used on CP : 0.026327
- and calculate these to to get CPU Delta. on Z**P : 0.941090 This is the CPU offloaded to ZIIP or ZAAP.
- If you had the URI classified with WLM, you get Enclave data, see below for a discussion on what the numbers mean.
- Enclave CPU time : 0.148803
- Enclave CPU service : 0.000000
- Enclave ZIIP time : 0.148803
- Enclave ZIIP Service : 0.000000
What do the CPU numbers mean?
Typically a transaction flow is as follows
- A listening thread listens on the HTTP(s) host and port.
- When a request arrives, it passes the request to a worker thread, and goes back to listening
- The worker thread may do some work and send the response back
- The worker thread may need to call another thread to do some work. For example to issue an MQ request,
- the MQ code looks for a thread in a pool for a matching queue manager and userid. If it find one it uses it the thread and issues the MQ request.
- If it does not find a matching thing thread it may allocate a new thread, and issue an MQCONN to connect to MQ. These are both expensive operations, which is why having a pool of threads with queue manager and userid is a good way of saving CPU
- The work is done
- The thread is put back into the MQ pool, and the application returns to the worker thread
- The worker thread sends the response back to the originator
A thread can ask the operating system, how much CPU time it(the thread) has used. What usually happens is
- the thread requests how much CPU it has used
- the thread does some work
- the thread requests how much CPU it has used,
- the thread calculates the difference between the two CPU values and reports this delta.
I the SMF 120 record records the CPU from just the worker thread – and no other thread.
When there are more than one thread involved it gets more complex, as you could have a CICS transaction issuing an MQ request, then a DB2 request, and then an IMS request. You can set up z/OS WorkLoad Manager(WLM) to say “these CICS transactions in this CICS region are high priority”.
With some subsystems you can pass a WLM token into a request. The thread being invoked call tell WLM that the thread is now working on behalf of this token. The thread does some work, and tells WLM that it has finished doing the work. WLM can manage the priority of the threads to achieve the best throughput, for example making the thread high or low priority. WLM can manage a thread doing work in multiple LPARs across a sysplex!
WLM records the CPU used by the thread while performing the work, accumulates and reports this.
This use of multiple threads for a business transaction across one or more address spaces is known as an enclave.
What happens with enclaves?
- A request arrives at the listener thread.
- The Liberty looks up the URI in the <wlmClassification httpClassification…. It compare the server’s host, server’s port, the URI resource /stockmanager… method ( GET) and finds the best match for the transactionClass.
- If there is a transactionClass,
- the server calls WLM with the Subsystem type of CB, the specified collectionName, and the transactionClass.
- WLM looks for these parameters and if WLM has a matching definition then WLM will manage the priority of the work,
- WLM returns a WLM token.
- This WLM token is passed to threads which are set up for enclaves.
- If there is no transaction class specified in Liberty, or WLM does not have the subsystem, collectionname, transactionClass then there is no token or a null WLM token
- The work continue as before.
- If another thread is used then pass the WLM token. If the code is set up for WLM token then report “work started”, when it has finished report “work ended”
- If there is a transactionClass,
What happens if the request is not known to WLM.
The worker thread calculates the CPU used for just its work, and reports this. The CPU used by any other thread is not report. The figures reported are the CPUTotal timeused values. You have to calculate the difference yourself
What happens if the request is known to WLM.
You get the timeused CPU for the worker thread – as with the case where the request is not known to WLM.
From RMF (or other products) you get out reports for an interval with
- The number of requests in the interval
- The rate of requests in the interval
- The amount of time on a CP engine in seconds
- The amount of time on a ZIIP engine is seconds
- The amount of time on a ZAAP in seconds.
- Over the interval, what percentage of time was CP on CP engines, zAAP on zAAP engines, zAAP on CP engines, zIIP on zIIP engines.
From the SMF 120 records you get
Enclave CPU time
Enclave ZAAP time
Enclave ZIIP time
Example Enclave figures.
For 100 API requests, the figures as reported by SMF 120-11, and I averaged the values.
- Average CPU(1) 0.023
- Average CPU(2) 0.0008
- Enclave CPU 0.029
- Enclave ZAAP 0
- Enclave ZIIP 0.028
The figures reported by RMF per request
- CPU 0.031
- ZIIP 0.039
- ZAAP 0.000
- Total 0.070 seconds of CPU per transaction
These figures tie up – the Enclave CPU, ZIIP, and ZAAP are similar.
The CPU used by the server address space was
- CPU 30.1 seconds
- ZIIP 28.7 seconds
- ZAPP 0 seconds.
- Total 58.8.
Each request took 0.070, and there were 100 requests – so reported 7 second of CPU.
The difference(51) seconds is not reported in the transaction costs. It looks like the “timeused” value is less than 1% of the CPU value, and the enclave figures are under 2% of the grand total.
Looking at the trace in a dump, I can see many hot TCBs using much more CPU that is reported by WLM and RMF. I expect that many TCBs used in a request, but they do not have the enclave support in them. Overall – pretty useless for charge back and understanding the cost per transaction.