Question: In your car how do you tell if your car has a problem? Answer: You look at the dashboard and see if there is a red light showing. You may not know how to fix it – but you know that you need to get help to fix it.
The aim of this series of blog posts is to show you what to look at in z/OS performance data, and how to tell if you have a problem.
I will cover
- CPU at the LPAR level.
- Synchronous I/O
- Workload Manager; or is my work achieving its goals?
For some of these you need data from z/OS. This post describes how to get the SMF data, and format it using RMF.
DASD has changed in 40 years
40 years ago “disk storage” was on huge rotating disks and you had to carefully manage where you put your datasets: both which disk, and whereabouts on the disk. For example, people would put the hot dataset in the “centre” of the disk to minimise the time to move the heads.
For the last 20 years people have used the term “storage” because most I/O activity goes to cache in the disk controller, and the disk controller writes the data out to PC-sized disks, which in turn may be solid state and have no moving parts.
A pictorial view of disks
- You have the processor running z/OS
- Plugged into the side of the processor is the I/O adapter
- Plugged into this I/O adapter are a lot of channels (think optical fibre cables)
- These cables can be plugged into a switch (think of a plug board or telephone exchange). This allows channels from 2 processors to be plugged into the switch, with one cable going down to the storage controller. You could avoid a switch and have cables directly from the processor to the storage controller, but each processor would then need its own set of cables.
- The storage controller manages all of the I/O
- It has a lot of cache, so most I/O may go to the cache. During a read, the storage controller will read from the disks if the data is not in the cache.
- It has many PC-type disks. These disks could be solid state, or rotating disks
- If you have mirrored disks, the storage controller talks to a remote storage controller
- Within each channel are many logical sub-channels. Each disk has at least one sub-channel allocated to it. A disk can have multiple sub-channels allocated to it. There can be a pool of sub-channels which are used as needed to allow parallel I/O to a disk.
The I/O journey
- Your application wants to read the first record of a file.
- Once the file has been opened, the application can issue the read.
- z/OS knows where the data set is on disk (eg VOLID A4USR1, Cylinder 200, track 4)
- z/OS builds up a set of commands (such as locate disk, locate cylinder 200, locate track 4, read data, read data, read data) to get the data, and issues the Start Subchannel request, passing the list of I/O commands.
- This is queued to the I/O adapter.
- The original application is suspended (until the I/O is complete)
- The I/O adapter looks for a free sub-channel for the disk, or gets one from the sub-channel pool.
- The I/O adapter takes the list of commands, and executes them one at a time.
- When the I/O adapter has finished the list of commands, it sends an interrupt to the mainframe saying “this subchannel has finished”.
- z/OS wakes up, looks at the interrupt, and resumes the application.
Today there are 3 areas to consider where you can get delays. You need to be an expert if you want to look in more detail.
- Waiting in the I/O adapter before being able to get a sub-channel. This is known as IOSQ, I/O Supervisor (IOS) queueing.
- Establishing the connection from processor to the storage controller.
- Transferring the data. This is the connect time.
This is complicated by being able to use disks 50 km away, which adds to the delay time.
This information is in the RMF MFR000… report, in the section D I R E C T A C C E S S D E V I C E A C T I V I T Y (I search for IOSQ).
                  DEVICE    AVG   AVG   AVG   AVG   AVG   AVG   AVG   AVG   %      %
VOLUME  PAV  LCU  ACTIVITY  RESP  IOSQ  CMR   DB    INT   PEND  DISC  CONN  DEV    DEV
SERIAL            RATE      TIME  TIME  DLY   DLY   DLY   TIME  TIME  TIME  CONN   UTIL
A4RES1  1         102.896   .044  .003  .001  .000        .004  .000  .036  0.38   0.38
A4RES2  1         27.293    .036  .000  .001  .000        .003  .000  .032  0.09   0.09
USER00  1         25.331    .031  .003  .001  .000        .004  .000  .024  0.06   0.06
A4SYS1  1         365.102   .026  .005  .001  .000        .004  .000  .017  0.62   24.52
- Volume Serial such as A4RES1 is the volid of the disk
- PAV – I’ll mention this below.
- Device Activity Rate – how many requests (start sub channel) from z/OS, per second
- Average response time in milliseconds
- Average IOSQ – how long the request had to wait in z/OS and the I/O adapter before it was sent down to the storage controller
The times are in milliseconds.
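A minimal Python sketch of reading one data row from the extract above. It assumes the row is whitespace-separated with 12 values, as shown in this post, and that the two values after AVG IOSQ TIME are the CMR and DB delays; a full RMF report has more columns, so the positions would need adjusting. The names `DasdRow`, `parse_row` and `component_sum` are made up for illustration. It also compares the response time with the sum of IOSQ, pending, disconnect and connect time, which is the usual breakdown of RMF device response time.

```python
from dataclasses import dataclass


@dataclass
class DasdRow:
    """One data row of the device activity section (times in milliseconds)."""
    volser: str
    pav: str          # kept as text; its exact format is not assumed here
    rate: float       # DEVICE ACTIVITY RATE, requests per second
    resp: float       # AVG RESP TIME
    iosq: float       # AVG IOSQ TIME
    cmr: float        # AVG CMR DLY (assumed position)
    db: float         # AVG DB DLY (assumed position)
    pend: float       # AVG PEND TIME
    disc: float       # AVG DISC TIME
    conn: float       # AVG CONN TIME
    pct_conn: float   # % DEV CONN
    pct_util: float   # % DEV UTIL


def parse_row(line: str) -> DasdRow:
    """Split a whitespace-separated row laid out like the extract in this post."""
    fields = line.split()
    if len(fields) != 12:
        raise ValueError(f"expected 12 fields, got {len(fields)}")
    return DasdRow(fields[0], fields[1], *(float(x) for x in fields[2:]))


def component_sum(row: DasdRow) -> float:
    """IOSQ + pending + disconnect + connect time, in milliseconds."""
    return row.iosq + row.pend + row.disc + row.conn


row = parse_row("A4RES1 1 102.896 .044 .003 .001 .000 .004 .000 .036 0.38 0.38")
print(f"{row.volser}: resp={row.resp:.3f} ms, components={component_sum(row):.3f} ms")
```

For A4RES1 the components add up to .043 ms against a reported response time of .044 ms; the small difference is rounding in the report.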
There are often thousands of volumes in a z/OS environment. Some are heavily used, some are not used at all. See below on how to find the hot volumes.
I typically look at the volumes with the highest I/O. If the hot volumes have good response time, the not so hot should be OK.
If you think of a sub-channel as the connection between the mainframe and the volume in the storage controller, there can be only one I/O request at a time per sub-channel. You can have multiple connections down to a volume. These are known as PAV, or Parallel Access Volumes. The PAV value in the report is the average number of sub-channels in use.
The first field to look at is IOSQ. This is the time between z/OS starting the request and the I/O being started to the storage controller. This should be small: tens of microseconds (0.0xx in the report above, where the times are in milliseconds). If the value is larger than this, you need to speak to your Storage Manager or z/OS Systems Programmer.
The second field to look at is % DEV UTIL: how busy the connection to the storage controller was. A value of 100% means that it was running flat out. A utilisation of around 70-80% may be OK; it is just something to note. More PAVs can increase throughput for a busy disk.
The next figure you look at is the RESP TIME. This is the response time the application sees. For local disk, response times of under 1 millisecond are OK. If you have remote disks, and synchronous I/O then the response time will be longer.
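The three checks above can be sketched as a small Python function. This is only an illustration: the function name `check_volume` and the exact limits are assumptions taken from the rules of thumb in this post (IOSQ in the tens of microseconds, utilisation of 70% or more worth noting, response time under 1 millisecond for local disks), not official thresholds. Times are in milliseconds, as in the report.

```python
def check_volume(iosq_ms, dev_util_pct, resp_ms,
                 iosq_limit_ms=0.1, util_note_pct=70.0, resp_limit_ms=1.0):
    """Return a list of findings for one volume; an empty list means nothing to note.

    The default limits are assumptions based on the rules of thumb in the text.
    """
    findings = []
    if iosq_ms >= iosq_limit_ms:
        findings.append(f"IOSQ {iosq_ms * 1000:.0f} microseconds: requests are queueing")
    if dev_util_pct >= util_note_pct:
        findings.append(f"% DEV UTIL {dev_util_pct:.2f}: device is busy")
    if resp_ms >= resp_limit_ms:
        findings.append(f"RESP TIME {resp_ms:.3f} ms: slow for a local disk")
    return findings


# A4SYS1 from the report above: IOSQ .005, % DEV UTIL 24.52, RESP TIME .026
print(check_volume(0.005, 24.52, 0.026))

# Made-up values for a struggling volume, for illustration only
for finding in check_volume(0.25, 85.0, 1.4):
    print(finding)
```

With remote disks and synchronous I/O the response time limit would need to be larger, so it is passed as a parameter rather than fixed.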
Finding the hot volumes
I take the RMF report and extract the DASD records.
- For SDSF where the output is in the spool
- I use Status to list all of the jobs, (Output or Hold work just as well)
- Put ? in front of the job to show all of the spool data sets
- use the SE command to Spool Edit the report
- For a dataset I use the View prefix command in ISPF 3.4
- Put DD in line prefix area on line 1
- Find ‘D I R E C T’
- Put DD in line prefix area, press enter, to delete the lines above it
- Find ‘D I R E C T’ last
- put d9999 in the line prefix area following the data (My report has ‘P A G I N G’), and press enter.
- You should now have only DASD records
- Put ‘cols’ in the line command area, note the columns of the DAR (50 to 58)
- On the command line type SORT 50 58 D to sort in descending order of Device Activity Rate.
- This shows you the top usage volumes. Check the response times. Under 1 millisecond is good for locally attached disks. It can be down to 0.1 ms.
- If the response time is 1 ms or larger…
- Check columns 60-65 (AVG IOSQ TIME); this should be 0. If it is non-zero, there was queueing in z/OS before the request got to the disks. If there was only one I/O request to the volume at a time, then IOSQ would be zero. If there are multiple concurrent I/O requests then you can get IOSQ queueing time.
- Any IOSQ could be reduced by moving data sets to other volumes, or adding more paths (sub-channels) between the mainframe and the disks. Each disk requires at least one sub-channel. You can allocate more in a pool, which are used when needed, but this is a z/OS system programmer/Storage manager job.
- As a performance person you can control which disks you use, and can spread the load.
- Avg CMR (ComMand Response) is the time to get from the processor down to the Storage Controller, and for the controller to respond with “I’ve got the request”. This should be small. This value allows you to see if delays are due to getting to the Storage controller, or within the controller.
If you do this for all disks you get an overall view of the data. Now you can select the DASD volumes you are using and check those.
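If you download the report to a workstation, the same sort can be done with a few lines of Python. This is a sketch under assumptions: the function name `hot_volumes` is made up, and the rows are taken to be whitespace-separated with 12 values as in the extract earlier in this post, rather than picked out by column position as the ISPF SORT does. A full report would need the field positions adjusted.

```python
REPORT_ROWS = """\
A4RES1 1 102.896 .044 .003 .001 .000 .004 .000 .036 0.38 0.38
A4RES2 1 27.293 .036 .000 .001 .000 .003 .000 .032 0.09 0.09
USER00 1 25.331 .031 .003 .001 .000 .004 .000 .024 0.06 0.06
A4SYS1 1 365.102 .026 .005 .001 .000 .004 .000 .017 0.62 24.52
"""


def hot_volumes(text, top=10):
    """Return (volser, rate, resp_ms, iosq_ms) tuples, busiest volume first."""
    rows = []
    for line in text.splitlines():
        fields = line.split()
        if len(fields) != 12:
            continue  # not a data row in the assumed layout
        try:
            rate, resp, iosq = float(fields[2]), float(fields[3]), float(fields[4])
        except ValueError:
            continue  # heading or other non-numeric line
        rows.append((fields[0], rate, resp, iosq))
    # Equivalent of SORT ... D on Device Activity Rate
    rows.sort(key=lambda r: r[1], reverse=True)
    return rows[:top]


for volser, rate, resp, iosq in hot_volumes(REPORT_ROWS):
    note = "check" if resp >= 1.0 or iosq > 0.1 else "ok"
    print(f"{volser:8} rate={rate:9.3f}/s resp={resp:.3f} ms iosq={iosq:.3f} ms {note}")
```

The 1 millisecond response time and 0.1 millisecond IOSQ limits in the loop are the rules of thumb from this post for locally attached disks; change them to suit your environment.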
If you find you have a long response time, then it is hard to find out the root cause. There are many links in the end to end chain. See here for more information.