Flashes of insight into the MQ SMF data

To misquote Raymond Smullyan in What Is the Name of This Book?

  • If Professor A says it is obvious then a child of 10 can instantly see it
  • If Professor B says it is obvious then if you spend 10 minutes on it you will see it
  • If Professor C says it is obvious then you might spend the rest of your life and not see it
  • If Professor D says it is obvious – then it is false

I hope the “one liners” below are obvious if you spend 10 minutes on them

I was sent a report of key fields (opened queue name, number of MQOPENs, MQGETs, MQPUTs etc) and a list of questions

Why do I have records for queues which are not on my queue manager?

If an application puts a message to a reply-to queue, the MQOPEN typically uses the remote queue name and remote queue manager name.  This maps to another queue, such as the SYSTEM.CLUSTER.TRANSMIT.QUEUE, or a remote queue name.

The name in the OPENNAME field in the SMF record will be the remote queue name.  If you look at the BASENAME field as well, you will see the real queue it uses.  So the queue in the OPENNAME may not exist on the system, but the BASENAME will exist.

I have puts but no gets for a queue

  1. This could be caused by using a shared queue.  Messages were put on one queue manager and got from another queue manager in the QSG.   The SMF data from the other QMGR should have the gets.
  2. The IMS bridge got the messages and sent them to IMS over the OTMA interface.  This code does not record any SMF data.

The same applies to the case of gets but no puts.

I have 1 SMF record for this queue, but 1000 records for that queue

  • If you have a long-running task, the SMF data is produced at a regular interval, typically every half hour.  This produces one record per interval.  If you break the data out into a set of queue records you will have few records, typically more than one: one for the putting task (channel), and one for the getting server task.
  • If you have short-lived transactions, such as CICS or IMS MPP transactions, you get an SMF record when the transaction ends.  If you break the data out into a set of queue records you will have many records: one for each transaction, and one for each putting/getting task (channel).

What is the impact of REVDNS(DISABLED) and use of host names?

This blog explains some of the effects of disabling REVDNS in MQ, and gives some best practices on the use of host names in CHLAUTH statements.
You can convert a host name (for example my.org.com) to an IP address (9.2.2.2) using a DNS service.  For example, the command nslookup google.com gave me 216.58.198.238.
You can convert an IP address to a host name using a reverse DNS service.  Normally a DNS server transparently does both.
nslookup 216.58.198.238 gave me name = lhr26s04-in-f238.1e100.net.  This shows that an IP address can have an alias as well as a host name.
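The same forward and reverse lookups can be tried from any language. A minimal Python sketch (my own illustration, not from the original post) using the standard socket module:

```python
import socket

# Forward lookup: host name -> IPv4 address (uses the resolver / DNS).
ip = socket.gethostbyname("localhost")
print("forward:", ip)

# Reverse lookup: IP address -> host name (reverse DNS / hosts file).
name, aliases, addresses = socket.gethostbyaddr("127.0.0.1")
print("reverse:", name)
```

The reverse lookup is the one CHLAUTH needs when rules are written with host names.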
When a TCP connection request comes in to MQ, the IP address (9.2.2.2) is passed in.
Channel auth rules can use host names or ip address formats.
If your CHLAUTH rules use host names rather than IP addresses, the CHLAUTH code needs to call the reverse DNS service to convert the incoming IP address to a host name.  The CHLAUTH code can then use the returned host name in its checks.
The DNS server will usually have a cache of recently used host names and IP addresses.  If an unusual IP address comes in and is not in the DNS server cache, the DNS server asks another server for information about the request.
If there is a problem in the DNS setup, or the network, these requests can take a long time (many seconds).  This is not good, because other DNS lookups in the MQ code are blocked while this one is in progress.
You can disable this reverse DNS lookup using ALTER QMGR REVDNS(DISABLED).
If you do this you need to change your CHLAUTH definitions to use the IP address instead of a host name.
Morag said
Host names are not a particularly secure means of identification.  They are much more easily spoofed than IP addresses (which are of course spoofable too, with some effort).  You need to use digital certificates for identification.
You should NEVER use a hostname for a blocking rule but only for a positive allowing CHLAUTH rule.

Actions

  1. If you need to use REVDNS(DISABLED), check your CHLAUTH statements and replace host names with IP addresses.
  2. Check your CHLAUTH rules and replace host names with IP addresses where they are being used for identification.
  3. Check your CHLAUTH rules and change host names in blocking rules to IP addresses.
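For example, an allowing rule that used a host name could be rewritten to map the IP address directly. A sketch in MQSC (the channel name and address here are made-up examples, not from the original report):

```
SET CHLAUTH('APP.SVRCONN') TYPE(ADDRESSMAP) ADDRESS('9.2.2.2') USERSRC(CHANNEL) ACTION(REPLACE)
```

With the address form, CHLAUTH never needs a reverse DNS lookup for this rule.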

The fast and slow of clients doing MQCONN and the impact of DNS

I was doing some investigation into the rate at which clients could connect and disconnect from MQ.  I found that about every 50-70th request took over 5 seconds!

I used export MQSERVER=SYSTEM.DEF.SVRCONN/TCP/'LOCALHOST(1414)' and a client program which did MQCONN, MQDISC and looped.  It timed each call.

I was running on Ubuntu 18.04.

I did a bit more digging and found the following in AMQERR01.LOG
06/10/18 16:39:06 – Process(7703.1) User(colinpaice) Program(connc)
                    Host(colinpaice) Installation(Installation1)
                    VRMF(9.1.0.0)
                    Time(2018-10-06T15:39:06.124Z)
                    ArithInsert1(5)
                    CommentInsert1(LOCALHOST)
                    CommentInsert2(getaddrinfo)
                  
AMQ9788W: Slow DNS lookup for address 'LOCALHOST'.

EXPLANATION:
An attempt to resolve address 'LOCALHOST' using the 'getaddrinfo' function call
took 5 seconds to complete. This might indicate a problem with the DNS
configuration.
ACTION:
Ensure that DNS is correctly configured on the local system.

If the address was an IP address then the slow operation was a reverse DNS
lookup. Some DNS configurations are not capable of reverse DNS lookups and some
IP addresses have no valid reverse DNS entries. If the problem persists,
consider disabling reverse DNS lookups until the issue with the DNS can be
resolved.

There are discussions about this on various fora, including discussions about the use of IPv6.

This failed consistently for a couple of days, but when I was sitting in an airport waiting for my plane, I turned MQ trace on and the problem did not occur.  I turned trace off, and I have not seen the problem since, so it must have been related to sitting in a hotel room.

If I specify export MQSERVER=SYSTEM.DEF.SVRCONN/TCP/'LOCALHOST(1414)' I get

MQCONN Mean:, 11.08,Max:, 15.51,Min:, 8.92,ms rate, 90.24 a second

If I specify export MQSERVER=SYSTEM.DEF.SVRCONN/TCP/'127.0.0.1(1414)' I get

MQCONN Mean:, 1.31,Max:, 2.60,Min:, 1.09,ms rate,766.23 a second.
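The difference can be demonstrated outside MQ by timing name resolution directly. This small Python sketch (my own illustration, not the original test program) times getaddrinfo for a host name and for a dotted address; a numeric address is parsed without a DNS call:

```python
import socket
import time

def time_resolve(target, port=1414, loops=50):
    """Average time for one getaddrinfo() call, in milliseconds."""
    start = time.perf_counter()
    for _ in range(loops):
        socket.getaddrinfo(target, port)
    return (time.perf_counter() - start) / loops * 1000.0

print(f"localhost : {time_resolve('localhost'):.3f} ms per call")
print(f"127.0.0.1 : {time_resolve('127.0.0.1'):.3f} ms per call")
```

The absolute numbers depend entirely on your resolver configuration; the point is the relative cost of the host-name path.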

Lessons learned

  1. Use dotted IP addresses to avoid the DNS lookup time
  2. Look at the MQ error logs!
  3. There may be things going on outside of your control which should have no impact on you, but do have an impact (e.g. my hotel room)
  4. Morag suggested you can disable reverse DNS lookup, but see What is the impact of REVDNS(DISABLED) and use of host names?

Does this matter?

If your clients are well behaved and connect once a week, this is not a problem.  If your clients connect for every message, this will be a killer due to the extra CPU and the elongated elapsed time.

CHLAUTH may be using DNS or reverse DNS (to convert a dotted address into a name).  This will add to the duration.

Many “frameworks” designed to make it easier to use MQ have settings which do an MQCONN for every message, so check your configuration and tune it so it connects infrequently!

You may get better performance by having 10 threads connected all of the time, rather than have the framework expand and shrink the number of connections.
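The idea of keeping a fixed set of connections open can be sketched in a few lines. This is illustrative Python, not MQ code; connect_fn stands in for whatever your framework uses to do the MQCONN:

```python
import queue

class ConnectionPool:
    """Hold a fixed set of connections open, instead of
    connecting and disconnecting for every message."""

    def __init__(self, connect_fn, size=10):
        self._pool = queue.Queue()
        for _ in range(size):          # pay the connect cost once, up front
            self._pool.put(connect_fn())

    def acquire(self):
        return self._pool.get()        # blocks if all connections are in use

    def release(self, conn):
        self._pool.put(conn)
```

Each message then borrows an existing connection with acquire() and hands it back with release(), so the per-message MQCONN (and any DNS lookup behind it) disappears.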

Notes on DNS

On Ubuntu I used the dig command to measure the response time to the DNS. For example

dig -4 -u LOCALHOST any

This gave me a lot of output, one line of which was

;; Query time: 354 usec

I then put -4 -u LOCALHOST any into a file, repeated for 100 lines.
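Writing the 100 identical lines by hand is tedious; a one-off script (any language will do, here Python) can generate the file:

```python
# Write 100 identical dig query lines; run them later with: dig -f dns.txt
with open("dns.txt", "w") as f:
    for _ in range(100):
        f.write("-4 -u LOCALHOST any\n")
```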

I then used

dig -f dns.txt |grep Query |sort -n -k4 |less

This ran the commands in dns.txt, extracted the lines with the response time, sorted them by the 4th field (ascending time), and displayed the result.

The majority of the requests were under 100 microseconds, but one took 3829 microseconds.

Another command

I found

systemd-resolve --status

an interesting command; it gave more information about the DNS configuration.

What is this “long running UOW” problem?

Badly designed applications which put or get messages in syncpoint and then fail to commit or rollback are not good for the queue manager.

Eventually the log records needed for recovery would fall out of the active logs.  For many years MQ has detected this, and moved all of the log records needed to recover the unit of work into the current active log.  This is additional work: it impacts logging and needs CPU to move the data.  It can also increase the restart time.

The queue manager detects this and gives you a message:

CSQJ160I MQPR CSQJLRUR Long-running UOW found URID=0000000012340000 Connection name=MQPRCHIN

Great – but what do you do with this?

You can use the command

DIS CONN(*) WHERE(QMURID,EQ,0000000012340000) ALL

This displays enough information to see what task is causing the problem.

If this is a channel, you get the channel name and CONNAME, which should be enough to identify what is causing the problem.

Of course you want to change your automation so that when you get the CSQJ160I MQPR Long-running UOW found… message, you issue the display command.

How do you display which UOWs are active?

for example

echo "dis conn(*) type(all) where(uowstate NE NONE )" | runmqsc QMA


or

echo "dis conn(*) type(all) where(uowstti ne ' ')" | runmqsc QMA | grep APPLTAG

I struggled to get the WHERE clause to work with empty data, for example

UOWSTTI( )

You have to specify WHERE(UOWSTTI EQ ' ') with a blank.


Can I have my MQ listener on Linux start when I start MQ?

Each time after I had started my queue manager with strmqm, my client program was getting

MQCONN failed with CompCode:2, Reason:2538  MQRC_HOST_NOT_AVAILABLE

This was because my listener was not started.

I got fed up with using strmqm then using runmqsc to issue START LISTENER …

I found that, to get it to start when the queue manager starts, you use the command

ALTER LISTENER(SYSTEM.DEFAULT.LISTENER.TCP) TRPTYPE(TCP) CONTROL(QMGR)

The doc says for CONTROL(QMGR):

The listener being defined is to be started and stopped at the same time as the queue manager is started and stopped.

The default is MANUAL!
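CONTROL(QMGR) takes effect when the queue manager is next started; to start the listener immediately after the ALTER you can also issue (MQSC, assuming the default listener object):

```
START LISTENER(SYSTEM.DEFAULT.LISTENER.TCP)
```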

Easy when you know how!

strmqm command fails with RLIMIT_NOFILE message

I had been using strmqm quite happily for days, but after a reboot I got the following message when I tried to use strmqm:

The system resource RLIMIT_NOFILE is set at an unusually low level for IBM MQ.
IBM MQ queue manager ‘QMA’ starting.
The queue manager is associated with installation ‘Installation1’.
The queue manager ended for reason 545284129.
I can't find references to RLIMIT_NOFILE in the doc, except in trace entries.
In AMQERR01.LOG:
06/10/18 06:40:36 – Process(4665.1) User(colinpaice) Program(strmqm)
Host(colinpaice) Installation(Installation1)
VRMF(9.1.0.0) QMgr(QMA)
Time(2018-10-06T05:40:36.942Z)
ArithInsert1(1024) ArithInsert2(10240)
CommentInsert1(RLIMIT_NOFILE)


AMQ5657W: The system resource RLIMIT_NOFILE is set at an unusually low level
for IBM MQ.

EXPLANATION:
The system resource RLIMIT_NOFILE is currently set to 1024 which is below the
usual minimum level of 10240, recommended for IBM MQ.
ACTION:
If possible, increase the current setting to at least 10240.

The doc told me to add
mqm       hard  nofile     10240
mqm       soft  nofile     10240
to /etc/security/limits.conf file .
ulimit -n showed that userid colinpaice had only 1024.  The error message says User(colinpaice), so the error message was clear once I read it carefully.
The answer is to change the “soft” limit for all users who issue the strmqm command.
I just added it to /etc/security/limits.conf, logged off, logged on again, and it worked!
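As well as ulimit -n, you can check the limit a process will inherit from within Python; resource.getrlimit returns the (soft, hard) pair that AMQ5657W is complaining about (the 10240 check below mirrors the recommendation in the message):

```python
import resource

# RLIMIT_NOFILE: maximum number of open file descriptors for this process.
soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
print("soft:", soft, "hard:", hard)

if soft < 10240:
    print("Soft limit is below the 10240 recommended for IBM MQ")
```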

This poisoned message is killing me – what can I do?

A poisoned message is a message that kills the application.  This may be as simple as unexpected data in a field, or simply a bug in the application which is hit when the message is processed.

The classic scenario is

  • MQGET
  •  Abend taking a dump
  • Rollback
  • Loop

Symptoms are typically thousands of dumps, or the application draining all of the CPU from an engine (think black hole).

I have seen this problem suddenly disappear because the message expired after 30 seconds.  This caused much head scratching, as they started investigating the problem and the problem went away!

What can you do about this?

For JMS the work is done for you.

The BackoutCount field in the MQMD is a count of the number of times the message has been backed out.  This is usually 0.
You should have logic which says: if BackoutCount > 1 then do special processing.

Special Processing:

Use MQINQ on the queue name to find the queue type.  If the queue type is QALIAS, you need the name of the base queue.
Use MQINQ on the base queue to inquire BOQNAME and BOTHRESH.

If MQMD.BackoutCount >= BOTHRESH and BOQNAME != ' ' then
do
  create a header for the message
  use MQPUT1 to put the message to the BOQNAME
end
else write the message to the DLQ with the appropriate header

Produce an alert to notify the applications team that there was a problem,
giving the date and time, and any message identifier such as MsgId and CorrelId.
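The routing decision in the pseudocode above boils down to a small pure function. A sketch in Python (the parameter names follow the MQ attributes; the function itself is my own illustration, and the actual MQPUT1 with its header is not shown):

```python
def route_poisoned(backout_count, bothresh, boqname,
                   dlq="SYSTEM.DEAD.LETTER.QUEUE"):
    """Decide where a repeatedly backed-out message should be put.

    backout_count - MQMD.BackoutCount of the message
    bothresh      - BOTHRESH attribute of the (base) queue
    boqname       - BOQNAME attribute; blanks mean 'not set'
    """
    if backout_count >= bothresh and boqname.strip():
        return boqname        # put to the backout queue (MQPUT1 + header)
    return dlq                # otherwise the dead-letter queue, with header

print(route_poisoned(3, 3, "APP.BACKOUT"))   # APP.BACKOUT
print(route_poisoned(3, 3, " "))             # SYSTEM.DEAD.LETTER.QUEUE
```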

Then fix the underlying problems!