IBM Blog 2017 June

Prevent a major whoops – check it now

June 28 2017

I have been reviewing a customer’s MQ job log and saw message IGD17364I, which essentially causes MQ archive logs to be deleted on creation.
You should check your MQ Logs today to see if you have the problem.

See Whoops – where has my archive log gone? z/OS threw it away.

How we tracked down a cunning network problem

June 15 2017

We had a problem on our build z/OS machine – where the build was taking days instead of minutes. I thought the process of finding the cunning problem interesting and worth sharing.

We extracted code from the build system (on Linux), sent it to z/OS, compiled it, and sent information back to Linux.

We did all of the usual checks on z/OS (CPU, disk IO, paging etc.) and these looked fine. A ping gave a good response time, and an FTP of a small file was good. A network trace sometimes showed long delays (10 seconds) between packets, but we could not tell if this was due to the applications pausing at each end or to the network. The Linux machine was reportedly fine; only the Linux to z/OS build had problems.

How we found the problem

Our support teams set up a batch script that did the following:

  1. print time of day
  2. ftp a 100MB file
  3. print time of day
  4. ftp a 100MB file
  5. etc

When they ran this, sometimes the FTP was very fast (gigabytes per second), and at other times it was very slow (kilobytes per second). They reconfigured the link, and it was then fast all of the time. This pointed to the link as the root cause.

Looking further into the broken network, they found one component was ‘degrading’: usually good, but sometimes bad. If it had failed outright, this would have been detected and reported. It felt as if whenever we looked at the failing component it cunningly worked normally, and it misbehaved when we were not looking at it.

Lesson learned

I learned from this that a ping is not enough to tell if the network has a problem. You need to have a similar FTP script and run it during the problem time to see if it is truly a network problem.
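As a rough illustration of the kind of script described above, here is a minimal Python sketch. It is not the script our support teams used. The names time_transfers, flag_slow_runs and ftp_upload, and the slow_factor threshold, are assumptions made up for this example; the only real library used is Python's standard ftplib.

```python
import ftplib
import os
import time
from typing import Callable, List, Tuple


def ftp_upload(host: str, user: str, password: str,
               local_path: str, remote_name: str) -> int:
    """Upload one file with FTP and return the number of bytes sent.

    Hypothetical helper: host, credentials and file names are placeholders.
    """
    with ftplib.FTP(host) as ftp:
        ftp.login(user, password)
        with open(local_path, "rb") as f:
            ftp.storbinary(f"STOR {remote_name}", f)
    return os.path.getsize(local_path)


def time_transfers(transfer: Callable[[], int], runs: int = 5,
                   clock: Callable[[], float] = time.monotonic
                   ) -> List[Tuple[float, float]]:
    """Run `transfer` repeatedly; return (elapsed_seconds, bytes_per_second) per run."""
    results = []
    for _ in range(runs):
        # Print the time of day before each transfer, as in the steps above.
        print(time.strftime("%H:%M:%S"), "starting transfer")
        start = clock()
        nbytes = transfer()
        elapsed = clock() - start
        rate = nbytes / elapsed if elapsed > 0 else float("inf")
        print(time.strftime("%H:%M:%S"), f"done: {rate:,.0f} bytes/second")
        results.append((elapsed, rate))
    return results


def flag_slow_runs(results: List[Tuple[float, float]],
                   slow_factor: float = 10.0) -> List[int]:
    """Return the indexes of runs more than `slow_factor` times slower than the best run."""
    best = max(rate for _, rate in results)
    return [i for i, (_, rate) in enumerate(results) if rate * slow_factor < best]


# Example use (placeholders, not real systems):
# results = time_transfers(
#     lambda: ftp_upload("zos.example.com", "user", "password",
#                        "test100mb.bin", "TEST100MB"),
#     runs=10)
# print("slow runs:", flag_slow_runs(results))
```

The point of the loop is the comparison between runs: a single transfer tells you little, but a mix of very fast and very slow runs over the same link is the pattern that pointed us at the network.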

How fast is a piece of string?

June 14 2017

I was at a customer discussing MQ distributed performance, and they asked whether they should use SAN, local Hard Disk Drives (HDD), Solid State Drives (SSD), mirrored disks etc. for their MQ files, especially when the machines are virtualized.

I said this was difficult to say because your hardware may be different to my hardware, and you may have other people using the SAN and so slowing it down at peak times. Your IO may be slow, but if it meets the business needs (and projected growth) does it matter?

I do not think we can produce a report which will meet people’s requirements – but I am hoping to be proved wrong.

What is the best format for giving you performance information?

June 14 2017

Someone asked me to do some performance measurements on the use of selection strings (so you can say: give me the next message with the following message properties).
We have had some discussions about the best way of reporting the information. It is very easy to provide lots of data – but not information, so I thought I would ask people – you – for any preferences.

When you look at performance data, what are you looking for? I think it is questions like:

  1. If I use message properties, how much will CPU increase (and so will I run out of CPU)?
  2. If I use a selectionString, how much does this cost me?
  3. If I have to skip 1000 messages to find a message which matches the selectionString, how much does it cost?

Solving so called ‘MQ performance problems’ – what to check

June 5 2017

I was working with a customer who had an “MQ problem” which turned out to be too many virtual machines (VMs) running on the box, so causing a lack of CPU. They fixed this and “the MQ problem went away”.

There is the well-known Maslow’s hierarchy of needs, which says you need air before you think about safety, sense of belonging etc.

So here is Colin’s hierarchy of needs: fix 1), then fix 2), etc.

  1. CPU – is the image short of CPU?
  2. Memory (real and virtual storage) – is there any paging?
  3. IO – check the IO response time is good
  4. Check network response time is good
  5. Check subsystems – e.g. MQ and DB2 – are giving good response times
  6. Check applications

If you have fixed a problem, start your checks from the top again. For example, fixing the IO problem allows much more work to flow, so there may now be a CPU problem.

If you have a performance problem, go through the list to see where the problem is. It may save you time before calling for help, as the support team may assume you have gone through the list.
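The "start from the top" rule can be sketched in a few lines of Python. This is only an illustration under assumptions: the function name first_failing_check and the pass/fail flags are invented for the example, and in real life each check would be a look at your monitoring data rather than a dictionary lookup.

```python
from typing import Callable, Dict, List, Optional, Tuple

# The order of the list above; the names are just labels for this sketch.
ORDER = ["CPU", "Memory", "IO", "Network", "Subsystem", "Application"]

Check = Tuple[str, Callable[[], bool]]


def first_failing_check(checks: List[Check]) -> Optional[str]:
    """Walk the checks in order and return the name of the first one that fails."""
    for name, is_ok in checks:
        if not is_ok():
            return name
    return None


# Hypothetical system state: True means the check looks healthy.
state: Dict[str, bool] = {name: True for name in ORDER}
state["IO"] = False

checks: List[Check] = [(name, lambda n=name: state[n]) for name in ORDER]

print(first_failing_check(checks))   # the IO problem is found first

# Fixing IO lets more work flow, which may expose a CPU shortage,
# so the walk starts again from the top rather than from item 4.
state["IO"] = True
state["CPU"] = False
print(first_failing_check(checks))
```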

I showed this list to a colleague who said it is really obvious – but if it is so obvious – why do we have so many problems caused by it!

As I was writing this, I was asked about another ‘MQ problem’ which turned out to be CPU.

Here are some real examples of problems I have dealt with. Tick the ones you have experienced.

  1. “The MQ performance was so bad – I could not even logon to the machine to display the MQ error log” – this was a lack of CPU in the VM
  2. “It cannot be a CPU problem, there are 20 cores on this machine” – yes, but the VM is only configured to have one core. Defective end user.
  3. “This server does not have a CPU problem” – yes, but half the messages are being routed (using MQ clustering) to another server which does have a CPU problem. Problem between keyboard and chair.
  4. “On average the CPU is only 50% busy” – yes – that is because you have peak workload where you run out of CPU followed by long periods where nothing happens.
  5. Whoops I made the MQ buffer pool so big – it caused paging.
  6. Throughput dropped at 8pm each evening – they did backups at 8pm – and the IO response time doubled – so commits took twice as long and transaction rate halved.
  7. MQ distributed performance was poor – someone had reconfigured the connection to the SAN – IO problem
  8. “You are running MQ on that system? That SAN is due to be replaced next month as it is old and overloaded. The reason why that machine was not being used is that it is so old and about to be scrapped, and you are running production MQ on it?” – lack of planning and communications
  9. MQ throughput between MQ on z/OS and Linux died every Saturday – backups were taken from all distributed machines to z/OS, which swamped the network
  10. MQGETs are slow since we made the messages persistent – messages were processed out of syncpoint, so there was an IO for every message.
  11. MQ throughput very low – because the application was doing a remote database insert over the network. The MQGET was very quick; the database update was not.
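Example 4 in the list is worth a small worked illustration of why an average can hide a CPU shortage. The sketch below uses made-up figures (one CPU sample per minute for an hour, half of them flat out and half idle) and an assumed 95% threshold for "out of CPU"; it is not data from any real system.

```python
from typing import List, Tuple


def summarize_cpu(samples: List[float],
                  busy_threshold: float = 95.0) -> Tuple[float, float, float]:
    """Return (average %, peak %, fraction of samples at or above the threshold)."""
    average = sum(samples) / len(samples)
    peak = max(samples)
    saturated = sum(1 for s in samples if s >= busy_threshold)
    return average, peak, saturated / len(samples)


# Invented data: 30 minutes at 100% busy, then 30 minutes idle.
samples = [100.0] * 30 + [0.0] * 30

average, peak, fraction_saturated = summarize_cpu(samples)
print(f"average {average:.0f}%, peak {peak:.0f}%, "
      f"out of CPU for {fraction_saturated:.0%} of the hour")
```

With these figures the average is 50%, which looks comfortable, yet the system was short of CPU for half of the hour. Looking at the peak and at how long the system stays there is more useful than the average alone.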

Summarising MQ usage in the usage report and Sub Capacity Pricing report

June 2 2017

The IFAURP program (the usage report program), which is used to process SMF89 records for usage and Sub Capacity Pricing reports, prints out the data for products.

By default this reports data for MQ broken down by queue manager and area.

5655-MQ9 MQM MVS/ESA V9 R0.3 MQPA     .278# # 02Jun17
5655-MQ9 MQM MVS/ESA V9 R0.3 MQPACHIN .051# # 02Jun17
5655-MQ9 MQM MVS/ESA V9 R0.1 MQPC     .011# # 02Jun17
5655-MQ9 MQM MVS/ESA V9 R0.1 MQPCBATC .000# # 02Jun17

If you put the *.SCSQLOAD library containing IFAUMQM# (an alias of CSQ8UBEX) in the //STEPLIB concatenation, this summarizes the data for all of MQ into a single line:

5655-MQ9 MQM MVS/ESA 9.0 .34 2# # 02Jun17 (MQM#9501)

The MQM#9501 tells you which version of the exit is being used.

These blog posts are from when I worked at IBM and are copyright © IBM 2017