During my time as an MQ performance person, one of the hardest areas to understand was data set performance, and how to make access to data sets faster. One reason was all the terms that people used. If you were an expert, the terms were “obvious”. It is very much like going to a hospital and the doctor says you have a contusion. Is this good news or is it bad news? When the doctor explains that a contusion is a fancy name for a bruise, it is clear that this may not be good news (unless the news is that you haven’t broken your leg, you just have a contusion).
- Made unintelligible by excessive use of technical terms.
- Language that seems difficult because you do not understand it.
- Language that seems to mean nothing.
Yep – DASD performance ticks all of those boxes!
Disk and data set performance statistics
I’ve recently been looking at data set statistics trying to understand the SMF 42 statistics, and as I haven’t used these statistics for five years, I struggled to understand them.
Below, I’ve tried to explain all of the complex terms in simple language. It may not be 100% accurate, but I hope you get the picture.
What is the basic hardware?
- You have the CPU where your program runs.
- There may be an I/O processor for offloading the I/O requests.
- There is a cable – typically known as FICON – or high speed fibre connected from CPU or I/O processor to the disk subsystem.
- This FICON is connected to a Storage Controller. The Storage Controller acts as an interface to the disks. It can cache data from disks. It talks to other Storage Controllers if mirrored DASD is being used.
- Disks. These used to be big spinning disks 1 meter in diameter. These days you have many solid state disks, like those used in laptop computers. High capacity – small footprint.
- You can have a FICON director. This acts as a big switch between the mainframe and the Storage Controllers. You plug the FICON cables from the mainframe into one side, and the output goes from the FICON director to the Storage Controller.
What is the path of an I/O?
There are many stages to get data to and from a disk.
- When the I/O is started, a Start Sub CHannel (SSCH) command is issued.
- Within the z/OS box is an I/O processor for offloading the I/O requests. The SSCH wakes up the I/O processor.
- The I/O processor sends a request over FICON, possibly via a FICON Director, to the Storage Controller.
- The Storage Controller may be able to process the request without going to a disk.
- The Storage Controller may pass the request to a disk.
- The disk processes the request, and when it has finished notifies the Storage Controller.
- The Storage Controller sends the data back to the requestor.
- When the Storage Controller has finished, it sends a “storage control ended” status back up the FICON cable.
- The I/O processor catches this, and a Test Sub CHannel (TSCH) is issued to get the performance information for the I/O request.
- The I/O processor notifies the CPU, which then wakes up your application.
What are the major categories of I/O delay?
At a high level, the time spent in an I/O operation falls into the following categories:
- The time before the request leaves the CPU and enters the I/O subsystem.
- Getting from the I/O subsystem down to the Storage Controller.
- The time the Storage Controller was active.
- Time spent accessing the disks.
- The time between the I/O completing and the CPU processing the status.
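To make the categories above concrete, here is a minimal Python sketch that adds them up and shows what fraction of the total each one accounts for. The class, the field names and the sample numbers are my own invention for illustration; they are not SMF field names or real measurements.

```python
from dataclasses import dataclass


@dataclass
class IoTimes:
    """Average time per I/O in microseconds, one field per category above.

    The field names are hypothetical labels, not SMF field names.
    """
    queued_in_zos: float      # before the request leaves the CPU
    to_controller: float      # I/O subsystem down to the Storage Controller
    controller_active: float  # Storage Controller active
    disk_access: float        # time spent accessing the disks
    interrupt_delay: float    # I/O complete, status not yet processed

    def total(self) -> float:
        """Sum of all five categories."""
        return sum(vars(self).values())

    def shares(self) -> dict:
        """Fraction of the total spent in each category."""
        total = self.total()
        return {name: (value / total if total else 0.0)
                for name, value in vars(self).items()}


# Made-up numbers, purely to show the arithmetic.
example = IoTimes(queued_in_zos=50, to_controller=100,
                  controller_active=200, disk_access=600,
                  interrupt_delay=50)
for name, share in example.shares().items():
    print(f"{name:<20}{share:6.1%}")
```

With numbers like these, most of the time is spent accessing the disks, so that is where I would look first.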
Technical terms explaining delays, and possible reasons
Some of these terms are defined here.
- IOSQ – The time before the request leaves the CPU and enters the I/O subsystem.
  - In z/OS, all paths to the device are busy. You can define multiple paths to the disks using PAV.
  - All the I/O Processors are busy.
  - The I/O could be delayed because a higher priority I/O took precedence.
- Pending – getting from the I/O subsystem down to the Storage Controller (the time required to get the storage hardware to initiate an I/O operation).
  - The FICON channel is overloaded and cannot be used, or the channel is busy.
  - This mainframe already has a reserve on the volume.
  - The FICON director (FICON router) is busy.
  - The Command Response measures the delay to get from the I/O subsystem to the Storage Controller and back – think of it as a TCP/IP ping.
  - Device busy delay might mean:
    - Another system is using the volume.
    - Another system reserved the device, but is not actually using it.
- Connect time – The time the Storage Controller was active, processing the data (read or write). There is connect time talking to the FICON, and also connect time talking to the disks.
  - Control unit queue time. This is the time queuing within the Storage Control unit – think of it as an enqueue on a track or cylinder.
  - Transferring data. Note: data may be multiplexed down a connection. More connections can slow down a request’s transfer rate. Think of a road – when there is too much traffic, the traffic slows down.
  - FICON internal chat.
  - The amount of data to process. The more data, the longer the transfer takes.
- Disconnect – Time spent accessing the disks. This could be accessing local disks, or accessing remote (mirrored) disks. The Storage Controller is not doing any work while the disk is busy.
  - The volume is reserved by another system.
  - Waiting for the arm to move, or the disk to rotate (for spinning disks).
  - The disk is processing the request – for example a cache miss means the disk has to be read.
  - Waiting for a signal from a remote peer to say that write data has been stored.
  - Some SMF records have a field “Read Disconnect time”. This indicates the read wanted a record which was not in the controller cache.
- Device Active Only time (DAO). The channel has finished its work, but the disk was busy for a little longer (for example waiting for a remote disk to complete). This is the additional time after the channel has finished.
- Service time: The duration between the SSCH and the interrupt at the end of the I/O.
- Interrupt Delay Time. This is the delay between the I/O subsystem getting the interrupt, and z/OS issuing the TSCH to get the status. If your z/OS image is running as an LPAR, this includes time to dispatch your LPAR, and then for your LPAR to issue the TSCH instruction.
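As a rough way of putting the terms above to work, the Python sketch below adds the four main components into a response time, derives the service time (SSCH to interrupt, so everything except IOSQ), and looks up some of the possible reasons for whichever component is largest. It assumes the usual breakdown of response time as IOSQ + pending + connect + disconnect; the function names, dictionary keys and sample numbers are hypothetical, and the list of causes is just a paraphrase of the bullets above.

```python
# Possible reasons for each delay, paraphrased from the list above.
POSSIBLE_CAUSES = {
    "iosq": ["all paths to the device are busy",
             "all I/O processors are busy",
             "a higher priority I/O took precedence"],
    "pending": ["FICON channel overloaded or busy",
                "FICON director busy",
                "another system is using or has reserved the volume"],
    "connect": ["queuing within the control unit",
                "too much traffic sharing the connection",
                "a large amount of data to transfer"],
    "disconnect": ["cache miss, so the disk has to be read",
                   "waiting for a remote (mirrored) peer",
                   "arm movement or rotation on spinning disks"],
}


def response_time(times_us):
    """Assumed breakdown: IOSQ + pending + connect + disconnect."""
    return sum(times_us[name] for name in POSSIBLE_CAUSES)


def service_time(times_us):
    """SSCH to interrupt: the response time without the IOSQ time."""
    return response_time(times_us) - times_us["iosq"]


def biggest_delay(times_us):
    """Return the largest component and the possible reasons for it."""
    name = max(POSSIBLE_CAUSES, key=lambda n: times_us[n])
    return name, POSSIBLE_CAUSES[name]


# Made-up averages in microseconds.
sample = {"iosq": 10, "pending": 40, "connect": 150, "disconnect": 300}
name, causes = biggest_delay(sample)
print(f"response {response_time(sample)} us, service {service_time(sample)} us")
print(f"largest component: {name}")
for cause in causes:
    print(f"  possible reason: {cause}")
```

This is only a starting point for asking questions; it does not diagnose anything on its own.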
Do I need to know this?
Most of the time you do not need to know about disk performance, as most disks are solid state and data is in cache. But if you have a performance problem, it is worth checking that the data sets are not the cause.
In the SMF 42 records you get the following average durations, in microseconds:
- Response time
- Initial Command Response (ping)
- Device busy time
- Connect time
- Control Unit Queue
- Disconnect time
- Disconnect time for reads
- Response time per random read
- Service time per random read
Note: Application resume delay is for zHyperLink synchronous I/O.
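If you have already extracted these averages from the SMF 42 records with whatever tool you use, a small Python sketch like the one below can print them in milliseconds, together with each value as a percentage of the response time. It does not read SMF records itself. The dictionary keys are simply the names from the list above, and the numbers are invented for illustration.

```python
def summarise(averages_us):
    """Format each average in milliseconds and as a % of the response time."""
    response = averages_us["Response time"]
    lines = []
    for name, value in averages_us.items():
        share = 100 * value / response if response else 0.0
        lines.append(f"{name:<32}{value / 1000:8.3f} ms {share:5.1f}%")
    return lines


# Invented averages in microseconds, keyed by the names in the list above.
sample = {
    "Response time": 850.0,
    "Initial Command Response (ping)": 20.0,
    "Device busy time": 0.0,
    "Connect time": 180.0,
    "Control Unit Queue": 5.0,
    "Disconnect time": 600.0,
    "Disconnect time for reads": 550.0,
}

for line in summarise(sample):
    print(line)

# A non-zero read disconnect time suggests reads that missed the cache.
if sample["Disconnect time for reads"] > 0:
    print("Some reads were not satisfied from the controller cache")
```

Seeing the values side by side as percentages makes it easier to spot which term is worth looking up.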