March 30 2017
I was at a customer in Asia who was stress testing a huge new application, and they asked me what they need to check to be sure MQ is OK.
I started preparing a presentation and I found this was too complex – they wanted practical instructions.
So, for getting started with MQ performance:
Most systems should be able to log at 30MB/Second. Use MP1B to look at the log report for Log write rate XMB/s per copy.
If you have a few transactions running concurrently processing 2KB messages you should be able to log at 30 MB/Second or higher.
Using larger messages such as 1MB you should be able to log at 100 MB/Second.
People with mirrored DASD over a large distance may find they cannot achieve 100 MB/Second.
The disk response time of MQ logs should typically be under 1 ms – many people are down at 250 microseconds. Use tools like RMF to display the disk response time. Collect this at a good time, so you can compare with when you have problems.
The time for a commit or log force request (Out of syncpoint, or buffer pool filling up) is typically 2 * average log IO write time.
If your log IO time is 1 ms, a commit takes about 2 ms, so one server/transaction/channel is unlikely to do more than 500 commits a second. If your log response time doubles, it is unlikely to do more than 250 commits a second.
As you log more data the response time increases – so check the log response time at peak time for the LPAR and under maximum load for MQ.
Using multiple applications in parallel can improve throughput
Processing more messages before a commit can improve throughput
If your IO response time changes throughout the day – the maximum rate an application can process will vary during the day.
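The commit arithmetic above can be sketched in a few lines of Python. This is only a back-of-envelope model: it assumes log IO time dominates the cost of a commit, and that parallel servers scale linearly, which is optimistic. The function names are mine, not from any MQ tool.

```python
def max_commits_per_second(log_io_ms: float) -> float:
    """Rough ceiling on commits per second for one serial server.

    Rule of thumb from the text: a commit or log force takes about
    2 * the average log IO write time.
    """
    commit_ms = 2 * log_io_ms
    return 1000.0 / commit_ms


def max_messages_per_second(log_io_ms: float,
                            msgs_per_commit: int = 1,
                            servers: int = 1) -> float:
    """Upper bound on message rate.

    Assumption (not from the text): servers scale linearly and the log
    is the only limit, so treat the result as a best case.
    """
    return max_commits_per_second(log_io_ms) * msgs_per_commit * servers


if __name__ == "__main__":
    for io_ms in (0.25, 1.0, 2.0):
        rate = max_commits_per_second(io_ms)
        print(f"log IO {io_ms} ms: about {rate:.0f} commits/s per server")
```

Plugging in the log IO time measured at a good time and at peak time gives a quick feel for how much the achievable rate moves during the day.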
Monitor queue depths
An empty queue is a good queue. If the queue depth increases – it is usually the application getting from the queue which has the problem – but it may be the putting program processing many messages in a unit of work
Keep buffer pools below 75% usage
If the buffer pool fills up then there will be page set IO and it will be very slow to process messages as each 4KB page will require at least 2 IOs. One for the log and one for the page set.
Use MQ V8 and larger buffer pools.
With QREP, ensure the buffer pool is sufficient for at least batchsize * message size * 2 * channels. We found in QREP with batches of >= 200MB that lots of uncommitted messages (bear in mind the apply also used large UoWs) meant buffer pools needed to be larger, even though queue depths didn’t appear high.
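As a rough illustration of the QREP sizing rule, here is a small Python sketch. It treats batchsize as a message count and assumes every byte of a 4KB page holds message data, which ignores headers and per-page overhead, so the answer is a lower bound. The 75% figure is the usage target mentioned above; the function names and the sample numbers are made up for the example.

```python
import math

PAGE_SIZE = 4096  # buffer pool pages are 4KB


def qrep_min_buffer_bytes(batch_size: int, message_bytes: int, channels: int) -> int:
    """batchsize * message size * 2 * channels, as in the text."""
    return batch_size * message_bytes * 2 * channels


def buffers_needed(total_bytes: int, max_usage: float = 0.75) -> int:
    """Number of 4KB buffers so that total_bytes stays below max_usage.

    Assumption: pages are filled completely with message data, so this
    underestimates the real requirement.
    """
    pages = math.ceil(total_bytes / PAGE_SIZE)
    return math.ceil(pages / max_usage)


if __name__ == "__main__":
    # Hypothetical workload: batches of 50 messages of 100,000 bytes on 4 channels
    size = qrep_min_buffer_bytes(50, 100_000, 4)
    print(size, "bytes, at least", buffers_needed(size), "buffers")
```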
Have the queue manager enabled for monitoring data.
For example, ALTER QMGR MONQ(HIGH) MONCHL(HIGH). This allows you to collect additional information about channels and queues. Set the MONQ attribute for queues and the MONCHL attribute for channels.
Compare your MQ NETTIME with the TCP PING time.
The DIS CHSTATUS NETTIME value gives an indication of the time on the network. This should be comparable with the time from a TSO PING command. If it is much higher you may need to tune the TCP buffers.
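A minimal sketch of that comparison in Python, assuming you have already copied the two numbers by hand. NETTIME is reported in microseconds and PING usually in milliseconds, so the units need converting. The factor of 2 for "much higher" is my own arbitrary threshold, not an MQ recommendation.

```python
def nettime_looks_high(nettime_us: int, ping_ms: float, factor: float = 2.0) -> bool:
    """Return True if the channel NETTIME is well above the PING time.

    nettime_us: value from DIS CHSTATUS NETTIME, in microseconds
    ping_ms:    round trip time from PING, in milliseconds
    factor:     hypothetical threshold for "much higher"
    """
    nettime_ms = nettime_us / 1000.0
    return nettime_ms > ping_ms * factor


if __name__ == "__main__":
    # Hypothetical values: NETTIME of 60,000 microseconds against a 20 ms PING
    print(nettime_looks_high(60_000, 20.0))
```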
Monitor channel BATCHSZ and XBATCHSZ.
XBATCHSZ is the batch size your channels actually achieve. If this is smaller than BATCHSZ and the messages are small, there are not enough messages to fill a batch. You may also get a small XBATCHSZ if the channel is limited by BATCHLIM.
Can you avoid channels disconnecting and reconnecting soon afterwards?
Do you have a business need for channels to stop soon after they are idle?
You can use DIS CHL(*) where(DISCINT,NE,0) to see which channels have DISCINT specified – this applies to both ends of the channel.
You may be able to see channel start and stop messages in the job log (CSQX500I and CSQX501I), but you may have configured MQ to suppress these.
Channels with a large batch size (over 50) are more efficient than a small batch size. You may want to use BATCHLIM to limit how much data is sent in a batch – useful when processing very large messages
If there is a large distance between the queue managers – increasing the batch size up to 1000 may help.
Use DIS CHL(*) where(BATCHSZ,LT,50)
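If you collect BATCHSZ and XBATCHSZ for your channels regularly, a simple check like the Python sketch below can flag the ones worth a look. It assumes the values have already been extracted from the DIS CHL and DIS CHSTATUS output (the parsing is not shown), and it uses the threshold of 50 from the text. The function name and sample channel names are invented.

```python
def review_channel(name: str, batchsz: int, xbatchsz: int, min_batchsz: int = 50) -> list:
    """Return a list of notes about one channel's batch settings."""
    notes = []
    if batchsz < min_batchsz:
        notes.append(f"{name}: BATCHSZ {batchsz} is below {min_batchsz}")
    if xbatchsz < batchsz:
        # Not necessarily a problem: there may be too few messages to fill
        # a batch, or BATCHLIM may be ending batches early.
        notes.append(f"{name}: achieved batch size {xbatchsz} is below BATCHSZ {batchsz}")
    return notes


if __name__ == "__main__":
    # Hypothetical channels: (name, BATCHSZ, XBATCHSZ)
    for chl in (("TO.QM2", 10, 10), ("TO.QM3", 1000, 120)):
        for note in review_channel(*chl):
            print(note)
```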
Have enough active logs to avoid any delays due to archiving.
Make sure the logs are of a reasonable size. Once you are on V8 or later, why would you not use 4GB logs these days?
Monitor your systems so you know what is normal behavior
- Turn on all statistics and keep the SMF records for at least a week
- Know typical queue depth of your application queues.
- Know the average depth of your transmission queues
- Know the nettime of your key channels at peak time
- Turn on Class(3) accounting sometimes for a short period (5 minutes)
Channels with SSLCIPH set
Is the channel negotiating at a ‘sensible’ frequency? Negotiations are expensive and slow the flow of data over the channel.
Is cryptographic offload available to negotiate the secret key?
Are your channel resources constrained?
Having too few adapters can cause delays. Having too many adapters should have no impact. Channels with lots of large persistent messages may need more adapters.
If you find a channel seems to be slow, try stopping and restarting the channel. If it seems to go faster, then the dispatcher it was on may be constrained.
Check the Chinit SMF to see if a dispatcher is constrained
March 15 2017
I’ve been working at a customer and they could see during the day that the IMS bridge queue was usually empty, but from 8pm to 9pm every day it grew to hundreds of messages, for the same message rate. What is going on?
I saw two problems.
- On z/OS the MQ log IO response time doubled during the problem period, when they started the batch workload at 8PM. Because the IMS bridge task processing one queue is a single server, doubling the IO time halves the throughput. During the peak period the IMS bridge task was only processing half the volume.
How to fix this: use multiple IMS bridge queues, so the processing of messages on these queues can be done in parallel.
- Looking at the chinit trace I could see messages flowing down, and the time delay before the ‘end of batch flow’. During the day, the time before the end of batch was about 1 ms. During the problem period this was about 50 ms. The back end Windows system uses a SAN for its MQ files. At 8PM every night they back up the site to the SAN, so the response time seen by MQ was very bad.
The impact of this long response time is that
- on Windows every commit was taking 50 ms – so two commits took 100 ms.
- because the ‘end of batch’ flag took so long to flow to z/OS this meant that the message was on z/OS but could not be got because it was within syncpoint
- when the reply flowed back – again the end of batch processing was delayed
So overall I estimate the impact of the backups to the SAN was adding more than 200 ms to each message duration.
- Check your IO response time and compare it from good times and bad times
- Using multiple servers to process messages is usually faster than having just one server.
March 13 2017
With 10 minutes of thinking, it is ‘obvious’ that if an application queue has lots of messages – then this is clearly a bottleneck because the getting applications are not getting the messages fast enough.
With an hour of thinking you may realize this is not always true. For example I was looking at a customer’s set up where RESET QSTATS showed the max depth was 1000 messages in the interval, so it is ‘clearly’ a problem. We were using a channel with a batch size of 1000 (the queue managers were a very long way apart), so at end of batch there are suddenly 1000 messages available. This means a depth of 1000 is expected.
In this case, if the maximum queue depth was greater than 1000 then I think there was a bottleneck. For 1000 or below, I don’t think there was a problem.
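That rule of thumb is simple enough to write down as a check. This Python sketch assumes the queue is fed by a single channel, and that you have the high depth from RESET QSTATS and the channel BATCHSZ to hand. The function name is mine.

```python
def depth_suggests_bottleneck(max_depth: int, channel_batchsz: int) -> bool:
    """Return True if the peak queue depth is more than one channel batch.

    A whole batch becomes available at end of batch, so a peak depth up to
    BATCHSZ is expected. Assumption: only one channel puts to this queue.
    """
    return max_depth > channel_batchsz


if __name__ == "__main__":
    print(depth_suggests_bottleneck(1000, 1000))  # depth equal to one batch
    print(depth_suggests_bottleneck(2500, 1000))  # more than one batch queued
```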
Note 1: We used a batch size of 1000 because PING gave us a network delay time of about 20 milliseconds.
Note 2: Because DIS CHS() NETTIME gave us a time much longer than the PING time, we had to tell TCP to use bigger buffers. Search for Dynamic Right Sizing, or search for NETSTAT in this blog.
These blog posts are from when I worked at IBM and are copyright © IBM 2017.