Question: In your car how do you tell if your car has a problem? Answer: You look at the dashboard and see if there is a red light showing. You may not know how to fix it – but you know that you need to get help to fix it.
The aim of this series of blog posts is to show you what to look for in z/OS performance, and how to tell if you have a problem.
I will cover
- CPU at the LPAR level.
- Synchronous I/O
- Workload Manager; or is my work achieving its goals?
What is a TCP/IP performance problem?
People complain about a TCP/IP performance problem when “it” seems slow. This could be caused by a variety of problems:
- Data between two ends is being discarded. This can occur on an unreliable, or overloaded component, whose default action is to throw away data, knowing it will be resent.
- The time taken to get from one end to the other and back (“a ping”) is slow. This can be caused by slow or overloaded components.
- Or all of the above.
There is a quote: “Never underestimate the bandwidth of a lorry full of tapes”. It might take 10 hours, but a truck 6 ft wide by 20 ft long could hold 300,000 1TB tapes and deliver about 8 TBytes/second (with a round trip time of 20 hours), which is more than the internet can provide!
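The lorry figure is simple arithmetic: the total data carried divided by the journey time. A small sketch, using only the numbers quoted above (the function name is my own):

```python
# Back-of-envelope check of the lorry figures quoted above.
TAPES = 300_000          # tapes in the lorry (from the text)
TB_PER_TAPE = 1          # capacity of each tape, in TB
JOURNEY_HOURS = 10       # one-way journey time

def lorry_bandwidth_tb_per_sec(tapes, tb_per_tape, hours):
    """Total data delivered divided by the journey time in seconds."""
    return tapes * tb_per_tape / (hours * 3600)

bandwidth = lorry_bandwidth_tb_per_sec(TAPES, TB_PER_TAPE, JOURNEY_HOURS)
print(f"{bandwidth:.1f} TB/second, round trip {2 * JOURNEY_HOURS} hours")
```

This is the same trade-off as in TCP/IP: very high throughput is possible even when the round trip time is poor.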
You need to know
- Are packets being thrown away? You see this from the number of packets which were resent.
- What is the round trip time? (You could use ping – but you may not be able to)
- Is data being sent efficiently – in big blocks?
With TCP/IP there is a connection between a sender and a receiver. The sender sends numbered packets of data to the receiver. The receiver sends an acknowledgement that a packet has been received.
The following is a representation of the flow
- The sender sends packet 1
- The sender sends packet 2
- The sender sends packet 3
- The receiver receives packet 1 and sends an acknowledgement for packet 1
- The sender sends packet 4
- The receiver receives packet 2 and sends an acknowledgement for packet 2
- The sender waits until the acknowledgement of packet 1 has been received
- The sender sends packet 5 and waits until the acknowledgement of packet 2 has been received
This way it is self limiting. It means the sender cannot send more than the receiver can handle.
If a packet goes missing, eventually the sender gets a time out, and resends it.
There are two parts to “performance”.
- FTP like: How much data can be sent per second. This is of interest to FTP and MQ, where there is mainly a one way transmission of lots of data. The round trip time is not so critical if you can have a lot of data in transit.
- Transactional: Send some data and wait for the remote end to respond, for example a web browser. The amount of data may be measured in KB, but the round trip time is important.
The term “window” is often used in TCP/IP.
The term “send window” on the sender side represents the total number of packets yet to be acknowledged by the receiver. With a bigger window, there is more data in the pipe line, and the throughput goes up. With a window of 1, one packet is sent and the sender waits for the acknowledgement before sending the next. With this, if there is a high latency, the overall throughput will be low.
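To see why the window matters, here is a rough sketch of the best case throughput for a given window. It assumes at most one full window of data can be sent per round trip, and it ignores lost packets and protocol overheads. The packet size and the round trip time are made-up values for illustration.

```python
def max_throughput_bytes_per_sec(window_packets, packet_bytes, rtt_seconds):
    """Upper bound: at most one window of data can be sent per round trip."""
    return window_packets * packet_bytes / rtt_seconds

PACKET_BYTES = 1440   # assumed packet size
RTT = 0.050           # assumed 50 ms round trip (high latency)

for window in (1, 10, 40):
    rate = max_throughput_bytes_per_sec(window, PACKET_BYTES, RTT)
    print(f"window={window:3d} packets -> {rate / 1000:8.1f} KB/second")
```

With a window of 1 the sender spends most of its time waiting for the acknowledgement, so the throughput is set by the round trip time rather than by the speed of the link.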
One of the factors that affects performance is the receive buffer size. If this was set to 4KB, it means that an application can read up to 4 KB of data at a time. This receive buffer size is sent to the sender, and basically says “send chunks up to this size – as that is all the receiver can take” – this sets the send-buffer-size.
Dynamic Right Sizing (DRS) allows the TCP receive buffer size to expand if the network conditions are favourable.
Outbound Right Sizing (ORS) allows the TCP send buffer size to expand if the network conditions are favourable.
Another term used is congestion window. If too much data is sent, or the network is unreliable, packets will get lost or thrown away. The congestion window is a measure of how much data can be in-flight. If packets get lost, the congestion window is made smaller. If packets are not lost, then it will try to increase the congestion window. This is a very rough indication of the quality of the network.
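The shrink on loss, grow on success behaviour can be sketched as below. This is a deliberately simplified rule (halve the window on a loss, otherwise add one segment), which is the classic textbook scheme. It is an assumption for illustration, not the algorithm z/OS or any other stack actually uses.

```python
def next_congestion_window(cwnd, segment_size, packet_lost):
    """Simplified rule: halve on loss, otherwise grow by one segment."""
    if packet_lost:
        # Never shrink below one segment.
        return max(segment_size, cwnd // 2)
    return cwnd + segment_size

cwnd = 14400          # made-up starting window, in bytes
SEGMENT = 1440        # assumed segment size, in bytes
for lost in (False, False, True, False):
    cwnd = next_congestion_window(cwnd, SEGMENT, lost)
    print(f"lost={lost!s:5} congestion window={cwnd}")
```

A single lost packet undoes many rounds of growth, which is why a small congestion window on a busy connection hints at an unreliable network.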
FTP like performance
There are several factors which can improve the throughput down a connection
- Make packets bigger. In the early days of TCP/IP a typical packet was 256 bytes. These days a typical default packet size can be 64KB or more.
- One of the smart features in the protocol is called dynamic right sizing, where TCP will send increasingly larger packets until the receiver says “big enough”. The packet size can change with load.
- How much data to send before waiting for the acknowledgement. For a reliable connection, where data is never lost, it is efficient to send a lot of data before waiting. This is called a large send window.
- If the connection is unreliable, it may be more efficient to have only a small send window, before waiting for the acknowledgement.
- Having big buffers may not improve throughput, for example with a web page, the data may all fit into 2KB. In this case having a buffer size of 16KB or 64 KB may make no difference to throughput or performance.
- Typically if one packet contains all the data, then this will be acknowledged as soon as it arrives.
How to see what is going on
You can use the well known “ping” command to send data to the remote end, and get the response. This gives a measure of the network time.
I found that most of the data for looking at performance is available from the netstat command. I found it useful to capture the output of the command in a file or data set.
What connections are connected to this server?
I use the netstat command in TSO, because my fingers are more used to it, and the command options are more memorable than those of the omvs command (for example with omvs netstat, do I need the -a or -A option?).
netstat conn (port 1414
netstat conn report hlq colin ( port 1414
netstat conn report dsn 'colin.output' ( port 1414
These all gave the same output. The report hlq colin creates a data set colin.netstat.conn. The data set name is from the hlq, ‘netstat’, and the subcommand. You can specify a data set name using the ‘dsn’ option.
For omvs you can use
netstat -c -p TCPIP -P 1414 > filename
That lists all of the connections for port 1414.
The command gave me
MVS TCP/IP NETSTAT CS V2R4       TCPIP Name: TCPIP           09:18:34
User Id  Conn     Local Socket           Foreign Socket         State
-------  ----     ------------           --------------         -----
CSQ9CHIN 00000023 10.1.1.2..1414         10.1.0.2..60538        Establsh
CSQ9CHIN 00000022 0.0.0.0..1414          0.0.0.0..0             Listen
There is one connection established from 10.1.0.2 port 60538 to the server with the port listening on 1414.
The commands below give a lot of information about the connection
netstat all report hlq colin (ipport 10.1.0.2+60538
netstat -A -p TCPIP -B 10.1.0.2+60538 > all.port1
Output from the netstat command
The fields are described at the bottom of this page.
Both commands gave me the same output.
There is a lot of data. I’ve broken it into sections with comments after the interesting fields.
MVS TCP/IP NETSTAT CS V2R4       TCPIP Name: TCPIP           09:23:29
Client Name: CSQ9CHIN Client Id: 00000023
Local Socket: 10.1.1.2..1414
Foreign Socket: 10.1.0.2..60538
BytesIn: 0000002988 BytesOut: 0000002912
SegmentsIn: 0000000019 SegmentsOut: 0000000011
- 09:23:29 is the time when the request was made. If you repeat the command you can get the interval between the commands, and so calculate rates.
- You get the client (job) name CSQ9CHIN.
- The listener socket for the job (local socket) 10.1.1.2 with port 1414.
- The foreign socket – the remote end of the connection. IP address 10.1.0.2 port 60538.
- You can get the data rate if you repeat the command, calculate the deltas of BytesIn and BytesOut, and divide by the time between the measurements.
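The data rate calculation can be sketched as follows. The first snapshot uses the time and byte counts from the output above; the second snapshot is made up for illustration. The function name and the tuple layout are my own, and the sketch assumes both snapshots were taken on the same day.

```python
from datetime import datetime

def data_rates(first, second):
    """Return (bytes in/sec, bytes out/sec) between two netstat snapshots.

    Each snapshot is a (time "HH:MM:SS", BytesIn, BytesOut) tuple.
    Assumes both snapshots were taken on the same day.
    """
    t1 = datetime.strptime(first[0], "%H:%M:%S")
    t2 = datetime.strptime(second[0], "%H:%M:%S")
    seconds = (t2 - t1).total_seconds()
    if seconds <= 0:
        raise ValueError("second snapshot must be later than the first")
    return ((second[1] - first[1]) / seconds,
            (second[2] - first[2]) / seconds)

# First snapshot uses the values shown above; the second is made up.
before = ("09:23:29", 2988, 2912)
after = ("09:23:39", 12988, 7912)
rate_in, rate_out = data_rates(before, after)
print(f"in: {rate_in:.0f} bytes/second, out: {rate_out:.0f} bytes/second")
```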
StartDate: 06/16/2021 StartTime: 10:00:21
Last Touched: 10:20:37 State: Establsh
RcvNxt: 2019327903 SndNxt: 0864946572
ClientRcvNxt: 2019327903 ClientSndNxt: 0864946572
InitRcvSeqNum: 2019324914 InitSndSeqNum: 0864943659
CongestionWindow: 0000018720 SlowStartThreshold: 0000065535
Look at the congestion window. Big is good. Small may indicate that only small amounts of data are being sent, or it may indicate network problems: either slow connections, or packets being dropped.
IncomingWindowNum: 2019458463 OutgoingWindowNum: 0865008524
SndWl1: 2019327903 SndWl2: 0864946572
SndWnd: 0000061952 MaxSndWnd: 0000064256
Check the send window. A small (1KB) send window can indicate poor configuration at the remote client, or only small amounts of data are being sent.
SndUna: 0864946572 rtt_seq: 0864946064
MaximumSegmentSize: 0000001440 DSField: 00
Round-trip information:
Smooth trip time: 6.000 SmoothTripVariance: 12.000
Monitor the smoothed round trip time (in milliseconds). This is the time from the local end to the remote end, and back. The variance gives a measure of the spread of response times. These are not strictly averages.
Suppose you had a million requests taking 1 millisecond, and then had a long request taking 1000 milliseconds. The “average” response time would change by a very small amount (to about 1.001 milliseconds). The smoothed (or weighted) average may be something like (99 * previous average + current value) / 100. In this case the “average” goes up to about 11 milliseconds, which is noticeably different.
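The difference between a plain average and a smoothed average can be checked with a few lines. The smoothing formula is the “something like” formula from the paragraph above; the calculation z/OS actually uses may differ.

```python
def plain_average(values):
    """Ordinary arithmetic mean of all the values."""
    return sum(values) / len(values)

def smoothed(previous, current, weight=100):
    """Weighted average: (99 * previous + current) / 100 when weight is 100."""
    return ((weight - 1) * previous + current) / weight

fast = [1.0] * 1_000_000           # a million 1 ms requests
print(f"plain average: {plain_average(fast + [1000.0]):.3f} ms")
print(f"smoothed:      {smoothed(1.0, 1000.0):.2f} ms")
```

The plain average barely moves from 1 ms, while the smoothed value jumps to nearly 11 ms, so one slow response shows up straight away.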
ReXmt: 0000000000 ReXmtCount: 0000000000
The retransmits should be zero, or not changing. If this number increases it means the network has lost packets.
DupACKs: 0000000000 RcvWnd: 0000130560
The receive window is usually set to 2 * receive buffer.
SockOpt: 88 TcpTimer: 00
Check SockOpt. Check bit 0x08. If set this indicates “delayed acknowledgement disabled”. See Nagle algorithm. This value being set is good.
If this is not set, then the sender can delay sending data for up to about 200 ms, and so combine data from different applications into the same packet for the same destination. This reduces network traffic as there are fewer packets, but it delays the data being sent.
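Checking a bit in a hex flag byte by eye is error prone. A small sketch, assuming only the x08 bit described above (the constant and function names are my own):

```python
SOCKOPT_BIT_08 = 0x08  # the bit the text above says to check in SockOpt

def bit_is_set(hex_value, mask):
    """hex_value is the two-character hex string shown by netstat."""
    return (int(hex_value, 16) & mask) != 0

print(bit_is_set("88", SOCKOPT_BIT_08))   # SockOpt from the report above
print(bit_is_set("A0", SOCKOPT_BIT_08))   # SockOpt from the slow FTP example later on
```

The value 88 has the bit set, and A0 does not.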
TcpSig: 04 TcpSel: 40 TcpDet: E4 TcpPol: 00 TcpPrf: 81 TcpPrf2: 20 TcpPrf3: 00
For FTP type applications check the TCP performance flag TcpPrf. This says if Dynamic Right Sizing (using bigger buffers) is enabled. The flag bits are x80 enabled, x40 active, and x20 active but disabled. Having both x80 and x40 set is good.
Also check the second TCP performance flag, TcpPrf2. This is for Outbound Right Sizing (ORS). A non-zero value is good.
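The flag bytes can be decoded with a short table. This sketch uses only the bit meanings given above; the other bits in TcpPrf and the individual bits in TcpPrf2 are not decoded, and the names are my own.

```python
# Bit meanings as given in the text above; other bits are not decoded here.
TCPPRF_BITS = {
    0x80: "Dynamic Right Sizing enabled",
    0x40: "Dynamic Right Sizing active",
    0x20: "Dynamic Right Sizing active but disabled",
}

def decode_tcpprf(hex_value):
    """Return the descriptions of the known bits set in TcpPrf."""
    value = int(hex_value, 16)
    return [text for mask, text in TCPPRF_BITS.items() if value & mask]

def ors_in_use(tcpprf2_hex):
    """The text treats any non-zero TcpPrf2 as good for outbound right sizing."""
    return int(tcpprf2_hex, 16) != 0

print(decode_tcpprf("81"), ors_in_use("20"))
```

For the report above, TcpPrf of 81 has only the x80 bit of the three, so Dynamic Right Sizing is enabled but not active.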
DelayAck: Yes
QOSPolicy: No
TTLSPolicy: No
RoutingPolicy: No
ReceiveBufferSize: 0000065536 SendBufferSize: 0000065536
These buffer sizes should be 64KB or larger. If they are, TCP Dynamic Right Sizing can be used, and the system can adjust the buffers dynamically to match the load.
The buffer sizes can be configured at the TCP/IP level, or by the application.
ReceiveDataQueued: 0000000000 SendDataQueued: 0000000000
These should always be zero.
- Received data queued means the application is slow to retrieve the data
- Send data queued – the application has issued a send – but TCP/IP cannot process it.
SendStalled: No Ancillary Input Queue: N/A
Send stalled should always be no.
What do you need to check?
- SendStalled should be No, and ReceiveDataQueued and SendDataQueued should both be 0. They usually are. They are only different if there is a problem right now. If the problem goes away, they go back to No and 0.
- Check ReXmt. This is the total number of times a packet has been retransmitted for this connection. This count is historical for the life of the connection.
- If this is zero then there have been no retransmits, and so no packets lost.
- If this is non-zero, then it could be a historical problem. Wait and reissue the netstat command. If the ReXmt value has changed, this indicates packets are being lost.
- Check the round trip time (and variance). Is the value what you expected? If there is traffic flowing on the connection, display the value multiple times, and see if there is significant variation.
- Check ReceiveBufferSize and SendBufferSize. Values of 64KB or larger are good. Small is not good.
- Check congestion window.
It is good to have some data for a normal day, and a problem day. For example if packets are often lost on a normal day, then lost packets may not be the cause of today's problem. If the SendBufferSize is only 8KB today and was 64KB last week, this would be a good place to start looking. So capture and save NETSTAT reports for typical sessions.
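If you save the reports, a small script can do the first pass of the checks for you. The sketch below is a rough illustration, not a complete parser. It assumes the “Name: value” layout shown in the netstat output in this post, and the function names and warning texts are my own. The sample values are taken from the slow FTP example later in this post.

```python
import re

def field(report, name):
    """Return the text after 'name:' in a netstat ALL report."""
    # The lookbehind stops RcvNxt matching inside ClientRcvNxt, and so on.
    match = re.search(rf"(?<![A-Za-z]){re.escape(name)}:\s+(\S+)", report)
    if match is None:
        raise KeyError(f"{name} not found in report")
    return match.group(1)

def check_connection(report):
    """Return a list of warnings for one connection's netstat ALL output."""
    warnings = []
    if field(report, "SendStalled") != "No":
        warnings.append("send is stalled")
    for name in ("ReceiveDataQueued", "SendDataQueued"):
        if int(field(report, name)) != 0:
            warnings.append(f"{name} is not zero")
    if int(field(report, "ReXmt")) != 0:
        warnings.append("packets were retransmitted; reissue and compare")
    for name in ("ReceiveBufferSize", "SendBufferSize"):
        if int(field(report, name)) < 64 * 1024:
            warnings.append(f"{name} is below 64KB")
    return warnings

sample = """
ReXmt: 0000000000 ReXmtCount: 0000000000
ReceiveBufferSize: 0000184351 SendBufferSize: 0000184320
ReceiveDataQueued: 0000104832
SendDataQueued: 0000000000
SendStalled: No
"""
print(check_connection(sample))
```

Running the same check against a report from a normal day gives you a baseline to compare with.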
What about connections into z/OS
Windows has a netstat command.
On Linux, netstat has been superseded by ss. For example (the option is ss dash dash info)
ss --info dst 10.1.1.2
ss --info dst 10.1.1.2:1414
ss --info src 10.1.0.2
These give similar information for connections going to 10.1.1.2, going to the address and port 10.1.1.2:1414, or coming from the given source address.
Example netstat output from a slow FTP in connection
Client Name: IBMUSER Client Id: 000006FE
Local Socket: 10.1.1.2..1109
Foreign Socket: 10.1.0.2..35508
BytesIn: 0220191104 BytesOut: 0000000000
SegmentsIn: 0000152946 SegmentsOut: 0000083051
StartDate: 06/28/2021 StartTime: 13:47:56
Last Touched: 14:24:28 State: Establsh
RcvNxt: 3569682809 SndNxt: 2105824963
ClientRcvNxt: 3569577977 ClientSndNxt: 2105824963
InitRcvSeqNum: 3349491704 InitSndSeqNum: 2105824962
CongestionWindow: 0000005760 SlowStartThreshold: 0000065535
IncomingWindowNum: 3569946679 OutgoingWindowNum: 2105889219
SndWl1: 3569681369 SndWl2: 2105824963
SndWnd: 0000064256 MaxSndWnd: 0000064256
SndUna: 2105824963 rtt_seq: 2105824962
MaximumSegmentSize: 0000001440 DSField: 00
Round-trip information:
Smooth trip time: 3.000 SmoothTripVariance: 2.000
ReXmt: 0000000000 ReXmtCount: 0000000000
DupACKs: 0000000000 RcvWnd: 0000263870
SockOpt: A0 TcpTimer: 00
TcpSig: 04 TcpSel: 40
TcpDet: E0 TcpPol: 00
TcpPrf: E0 TcpPrf2: 28
TcpPrf3: 00
DelayAck: Yes
QOSPolicy: No
TTLSPolicy: No
RoutingPolicy: No
ReceiveBufferSize: 0000184351 SendBufferSize: 0000184320
ReceiveDataQueued: 0000104832
OldQDate: 06/28/2021 OldQTime: 14:24:27
SendDataQueued: 0000000000
SendStalled: No
Ancillary Input Queue: N/A
Application Data: EZAFTP0S D IBMUSER C FSSH
- CongestionWindow: 5760 is low
- Smooth trip time: 3.000 is good
- ReXmt: 0 is good
- ReceiveBufferSize: 184351 is good
- ReceiveDataQueued: 104832 is BAD. Data has arrived, but the application is slow to retrieve it.