What is the impact of REVDNS(DISABLE) and use of host names?

This blog explains some of the effects of disabling REVDNS in MQ, and gives some best practices on use of host names in CHLAUTH statements.
You can convert a host name such as my.org.com to an IP address such as 9.2.2.2 using a DNS service. For example, the command nslookup google.com gave me 216.58.198.238.
You can convert an IP address back to a host name using a reverse DNS service. Normally a DNS server does both transparently.
For example, nslookup 216.58.198.238 gave me name = lhr26s04-in-f238.1e100.net. This shows that a reverse lookup does not necessarily return the name you started with: an IP address can be known by an alias as well as by its host name.
When a TCP connection request comes in to MQ, the IP address (for example 9.2.2.2) is passed in.
Channel authentication (CHLAUTH) rules can use host names or IP addresses.
If your CHLAUTH rules use host names rather than IP addresses, the CHLAUTH code needs to call the reverse DNS service to convert the incoming IP address to a host name. The CHLAUTH code can then use the returned host name in its checks.
The DNS server will usually have a cache of recently used host names and IP addresses. If an unusual IP address comes in and it is not in the DNS server cache, the DNS server asks another server for information about the request.
If there is a problem in the DNS setup, or in the network, these requests can take a long time (many seconds). This is not good, because other DNS lookups in the MQ code are blocked while the slow request is outstanding.
You can disable this reverse DNS lookup using ALTER QMGR REVDNS(DISABLED).
If you do this, you need to change your CHLAUTH definitions to use IP addresses instead of host names.
Morag said
Hostnames are not a particularly secure manner to use for identification.  They are much more easily spoofed than IP addresses (which are of course spoof-able too with some effort).   You need to use digital certificates for identification.
You should NEVER use a hostname for a blocking rule but only for a positive allowing CHLAUTH rule.

Actions

  1. If you need to use REVDNS(DISABLED), check your CHLAUTH statements and replace each host name with an IP address.
  2. Check your CHLAUTH statements and replace host names with IP addresses where they are being used for identification.
  3. Check your CHLAUTH statements and change host names in blocking rules to IP addresses.
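
As a starting point for the checks above, here is a minimal Python sketch that scans a file of SET CHLAUTH commands and lists the ADDRESS values which look like host names. It assumes the commands use the quoted form ADDRESS('...'). The test for "looks like an IP address" is only a heuristic, and the channel names and addresses in SAMPLE are made up for illustration.

```python
import re

# Finds the quoted value of the ADDRESS('...') attribute, ignoring case.
ADDRESS_RE = re.compile(r"ADDRESS\(\s*'([^']*)'\s*\)", re.IGNORECASE)

# Heuristic: IPv4 patterns use only digits, dots, '*' and '-' (for ranges).
IPV4_PATTERN_RE = re.compile(r"^[0-9.*\-]+$")
# Heuristic: IPv6 patterns contain at least one ':' plus hex digits.
IPV6_PATTERN_RE = re.compile(r"^[0-9A-Fa-f:.*\-]*:[0-9A-Fa-f:.*\-]*$")


def looks_like_hostname(address: str) -> bool:
    """Return True if the ADDRESS value is probably a host name pattern."""
    is_ip = IPV4_PATTERN_RE.match(address) or IPV6_PATTERN_RE.match(address)
    return not is_ip


def hostname_rules(mqsc_text: str) -> list[str]:
    """Return the ADDRESS values in the MQSC text that look like host names."""
    return [a for a in ADDRESS_RE.findall(mqsc_text) if looks_like_hostname(a)]


# Hypothetical rules, for illustration only.
SAMPLE = """
SET CHLAUTH('*') TYPE(ADDRESSMAP) ADDRESS('*') USERSRC(NOACCESS)
SET CHLAUTH('APP.SVRCONN') TYPE(ADDRESSMAP) ADDRESS('9.2.2.*') USERSRC(CHANNEL)
SET CHLAUTH('APP.SVRCONN') TYPE(ADDRESSMAP) ADDRESS('*.my.org.com') USERSRC(CHANNEL)
"""

print(hostname_rules(SAMPLE))
```

Each value it reports is a candidate for review. It does not decide whether the rule is a blocking rule or an allowing rule.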

The fast and slow of clients doing MQCONN and the impact of DNS

I was doing some investigation into the rate at which clients could connect to and disconnect from MQ.  I found that roughly one request in every 50 to 70 took over 5 seconds!

I used export MQSERVER=SYSTEM.DEF.SVRCONN/TCP/'LOCALHOST(1414)' and a client program which did MQCONN and MQDISC in a loop, timing each call.

I was running on Ubuntu 18.04.

I did a bit more digging and found the following in AMQERR01.LOG
06/10/18 16:39:06 – Process(7703.1) User(colinpaice) Program(connc)
                    Host(colinpaice) Installation(Installation1)
                    VRMF(9.1.0.0)
                    Time(2018-10-06T15:39:06.124Z)
                    ArithInsert1(5)
                    CommentInsert1(LOCALHOST)
                    CommentInsert2(getaddrinfo)
                  
AMQ9788W: Slow DNS lookup for address 'LOCALHOST'.

EXPLANATION:
An attempt to resolve address 'LOCALHOST' using the 'getaddrinfo' function call
took 5 seconds to complete. This might indicate a problem with the DNS
configuration.
ACTION:
Ensure that DNS is correctly configured on the local system.

If the address was an IP address then the slow operation was a reverse DNS
lookup. Some DNS configurations are not capable of reverse DNS lookups and some
IP addresses have no valid reverse DNS entries. If the problem persists,
consider disabling reverse DNS lookups until the issue with the DNS can be
resolved.

There are discussions about this on various forums, including some about the use of IPv6.

This failed consistently for a couple of days. Then, while I was sitting in an airport waiting for my plane, I turned MQ trace on, and the problem did not occur.  I turned trace off, and I have not seen the problem since, so I suspect it was related to the network in my hotel room.

If I specify export MQSERVER=SYSTEM.DEF.SVRCONN/TCP/'LOCALHOST(1414)' I get

MQCONN Mean:, 11.08,Max:, 15.51,Min:, 8.92,ms rate, 90.24 a second

If I specify export MQSERVER=SYSTEM.DEF.SVRCONN/TCP/'127.0.0.1(1414)' I get

MQCONN Mean:, 1.31,Max:, 2.60,Min:, 1.09,ms rate,766.23 a second.
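
To see how much of this difference comes from name resolution alone, here is a minimal Python sketch that times getaddrinfo, the same function named in the AMQ9788W message. It measures only the lookup and not MQCONN, so its numbers will not match the figures above. The function name time_lookup and the default of 20 repeats are my own choices.

```python
import socket
import time


def time_lookup(host: str, port: int = 1414, repeats: int = 20) -> float:
    """Return the mean time in milliseconds of getaddrinfo(host, port)."""
    start = time.perf_counter()
    for _ in range(repeats):
        # A numeric address needs no DNS; a name may go out to the resolver.
        socket.getaddrinfo(host, port, type=socket.SOCK_STREAM)
    elapsed = time.perf_counter() - start
    return elapsed * 1000.0 / repeats
```

Calling time_lookup("LOCALHOST") and then time_lookup("127.0.0.1") on the machine where the client runs gives a rough comparison of the two forms of connection name.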

Lessons learned

  1. Use dotted IP addresses to avoid the DNS lookup time.
  2. Look at the MQ error logs!
  3. Things outside your control, which should have no impact on you, may still have an impact (for example, my hotel room).
  4. Morag suggested you can disable REVDNS, but see What is the impact of REVDNS(DISABLE) and use of host names?

Does this matter?

If your clients are well behaved and connect once a week, this is not a problem.  If your clients connect for every message, this will be a killer because of the extra CPU and the longer elapsed time.

CHLAUTH may be using DNS or reverse DNS (to convert a dotted address into a name).   This will add to the elapsed time.

Many “frameworks” designed to make it easier to use MQ have settings that cause an MQCONN for every message, so check your configuration and tune it so that it connects infrequently!

You may get better performance by having 10 threads connected all of the time, rather than have the framework expand and shrink the number of connections.
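
The idea of a fixed set of long-lived connections can be sketched in Python as below. This is an illustration of the pattern only, not any framework's real API: the ConnectionPool class and the connect callable are hypothetical, and connect stands in for whatever does the MQCONN in your environment.

```python
import queue
from contextlib import contextmanager


class ConnectionPool:
    """A fixed-size pool: connect once up front, then reuse the connections."""

    def __init__(self, connect, size=10):
        # 'connect' is a hypothetical callable that returns one connection.
        self._idle = queue.Queue()
        for _ in range(size):
            self._idle.put(connect())

    @contextmanager
    def connection(self, timeout=None):
        # Borrow an idle connection, waiting if all of them are in use.
        conn = self._idle.get(timeout=timeout)
        try:
            yield conn
        finally:
            # Return the connection instead of disconnecting.
            self._idle.put(conn)
```

With this shape the connection cost, including any DNS lookup, is paid 10 times at startup rather than once per message.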

Notes on DNS

On Ubuntu I used the dig command to measure the response time to the DNS. For example

dig -4 -u LOCALHOST any

this gave me a lot of output – one line of which was

;; Query time: 354 usec

I then put the line -4 -u LOCALHOST any into a file, dns.txt, and repeated it so the file had 100 lines.

I then used

dig -f dns.txt |grep Query |sort -n -k4 |less

This ran the commands in dns.txt, extracted the lines with the response time, sorted them by the 4th field (ascending time), and displayed the result.

The majority of the requests took under 100 microseconds, but one took 3829 microseconds.
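
If you want to process the timings further, the same extraction can be sketched in Python. This assumes you have captured the dig output as text, and that each timing line has the form ";; Query time: 354 usec" (or msec when -u is not used). The function name query_times_usec is my own.

```python
import re

# Matches lines such as ";; Query time: 354 usec" or ";; Query time: 2 msec".
QUERY_TIME_RE = re.compile(r"^;; Query time: (\d+) (usec|msec)", re.MULTILINE)


def query_times_usec(dig_output: str) -> list[int]:
    """Return every query time in the dig output, in microseconds, ascending."""
    times = []
    for value, unit in QUERY_TIME_RE.findall(dig_output):
        # Convert milliseconds to microseconds so all values are comparable.
        times.append(int(value) * (1000 if unit == "msec" else 1))
    return sorted(times)
```

The last element of the returned list is the slowest lookup, which is the one worth investigating.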

Another command

I found

systemd-resolve --status

an interesting command; it gave more information about the DNS configuration.

 

 

 

What is this “long running UOW problem”?

Badly designed applications which put or get messages in syncpoint and then fail to commit or roll back are not good for the queue manager.

Eventually the log records needed to recover the unit of work will almost fall out of the active logs.    For many years MQ has detected this and moved all of the log records needed to recover the unit of work into the current active log.  This is additional work: it impacts logging, and needs CPU to move the data.  It can also increase the restart time.

The queue manager detects this and gives you a message:

CSQJ160I MQPR CSQJLRUR Long-running UOW found URID=0000000012340000 Connection name=MQPRCHIN

Great – but what do you do with this?

You can use the command

DIS CONN(*) WHERE(QMURID,EQ,0000000012340000 ) ALL

This displays enough information to see which task is causing the problem.

If this is a channel, you get the channel name and CONNAME, which should be enough to identify the culprit.

Of course, you want to change your automation so that when you get the CSQJ160I MQPR Long-running UOW found… message, it issues the display command.
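
The automation step can be sketched in Python as below. It only builds the command text from the message text. How your automation product receives the message and issues the command is outside this sketch, and the function name display_command_for is my own.

```python
import re
from typing import Optional

# Picks the URID value out of a CSQJ160I message.
URID_RE = re.compile(r"CSQJ160I\b.*?\bURID=([0-9A-Fa-f]+)")


def display_command_for(message: str) -> Optional[str]:
    """Return the DIS CONN command for a CSQJ160I message, else None."""
    match = URID_RE.search(message)
    if match is None:
        return None  # not a long-running UOW message
    # Same form of command as shown in the text above.
    return f"DIS CONN(*) WHERE(QMURID,EQ,{match.group(1)}) ALL"
```

For the message shown earlier, this returns the DIS CONN command with URID 0000000012340000 filled in.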

How do you display which UOWs are active?

For example:

echo "dis conn(*) type(all) where(uowstate NE NONE )" | runmqsc QMA

 

or  echo "dis conn(*) type(all) where(uowstti ne ' ' )" | runmqsc QMA |grep APPLTAG

I struggled to get the WHERE clause to work with empty data, for example

UOWSTTI( )

You have to specify WHERE(UOWSTTI EQ ' ') with a blank between the quotes.

 

 

 

Can I have my MQ listener on Linux start when I start MQ?

Each time after I had started my queue manager with strmqm, my client program was getting

MQCONN failed with CompCode:2, Reason:2538  MQRC_HOST_NOT_AVAILABLE

This was because my listener was not started.

I got fed up with using strmqm and then using runmqsc to issue START LISTENER …

To get the listener to start when the queue manager starts, use the command

ALTER LISTENER(SYSTEM.DEFAULT.LISTENER.TCP) trptype(TCP) control(QMGR)

The doc says for control(QMGR)

The listener being defined is to be started and stopped at the same time as the queue manager is started and stopped.

The default is Manual!

Easy when you know how!

strmqm command fails with 545284129 (and sometimes an RLIMIT_NOFILE message)

I had been using strmqm quite happily for days,  but after reboot I got the following message when I tried to use strmqm.

April 2019 update…

Even making the recommended changes I still had the problem.
This post told me what to do – and it now works.  I think /etc/security/limits.conf has limits when you are running su mode.

Late April 2019 update…

Another day one of my queue managers would not start.

QMGR .. failed to start.
The queue manager is associated with installation 'Installation1'.
The queue manager ended for reason 545284129, ''.
There were no other messages, or entries in AMQERR01.LOG.

Running /opt/mqm/bin/mqconfig said everything passed.
I logged off, logged on again, and it worked. Sigh.

End of updates

The system resource RLIMIT_NOFILE is set at an unusually low level for IBM MQ.
IBM MQ queue manager 'QMA' starting.
The queue manager is associated with installation 'Installation1'.
The queue manager ended for reason 545284129, ''.

I can't find references to RLIMIT_NOFILE in the doc, except in trace entries.
In AMQERR01.LOG:
06/10/18 06:40:36 – Process(4665.1) User(colinpaice) Program(strmqm)
Host(colinpaice) Installation(Installation1)
VRMF(9.1.0.0) QMgr(QMA)
Time(2018-10-06T05:40:36.942Z)
ArithInsert1(1024) ArithInsert2(10240)
CommentInsert1(RLIMIT_NOFILE)

AMQ5657W: The system resource RLIMIT_NOFILE is set at an unusually low level
for IBM MQ.

EXPLANATION:
The system resource RLIMIT_NOFILE is currently set to 1024 which is below the
usual minimum level of 10240, recommended for IBM MQ.
ACTION:
If possible, increase the current setting to at least 10240.

The doc told me to add
mqm       hard  nofile     10240
mqm       soft  nofile     10240
to the /etc/security/limits.conf file.
When I used ulimit -n, it showed that userid colinpaice had a limit of only 1024.   The error message says User(colinpaice), so the message was clear once I read it carefully.
The answer is to change the “soft” limit for all users who issue the strmqm command.
I added this to the /etc/security/limits.conf file, logged off, logged on again, and it worked!
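
A quick way to see the limit that applies to your own session, rather than to the mqm userid, is to ask the operating system from a process started in that session. Here is a minimal Python sketch using the standard resource module. The value 10240 comes from the AMQ5657W message above, and the function names are my own.

```python
import resource

# Minimum recommended by the AMQ5657W message.
RECOMMENDED_NOFILE = 10240


def nofile_ok(soft: int, recommended: int = RECOMMENDED_NOFILE) -> bool:
    """Return True if the soft limit meets the recommended minimum."""
    return soft == resource.RLIM_INFINITY or soft >= recommended


def check_current_process():
    """Return (soft, hard, ok) for RLIMIT_NOFILE of the current process."""
    soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    return soft, hard, nofile_ok(soft)
```

Run check_current_process() from the same terminal session you use for strmqm. A soft value of 1024 would match the ArithInsert1(1024) in the log entry.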

2022 update

After I upgraded my Linux to 20.04, I got the message back again. After hunting around I found…

The limits in /etc/security/limits.conf were indeed being applied, but not to the graphical login. You need to add the following line to /etc/systemd/user.conf:

DefaultLimitNOFILE=65535

That change works, but only affecting the soft limit. Leaving us capped with a hard limit of 4096 still. In order to affect the hard limit, you must modify /etc/systemd/system.conf with the same changes.

For me the file had all fields commented out,

#DefaultLimitNOFILE=

so make the change, remove the # and reboot.

This poisoned message is killing me – what can I do?

A poisoned message is a message that kills the application. The cause may be as simple as unexpected data in a field, or a bug in the application which is hit when the message is processed.

The classic scenario is

  • MQGET
  •  Abend taking a dump
  • Rollback
  • Loop

Symptoms are typically thousands of dumps, or the application draining all of the CPU from an engine (think black hole).

I have seen this problem suddenly disappear because the message expired after 30 seconds. This caused much head scratching: as soon as people started investigating the problem, it went away!

What can you do about this?

For JMS the work is done for you.  See here.

The BackoutCount field in the MQMD is a count of the number of times the message has been backed out. This is usually 0.
You should have logic which says if BackoutCount > 1 then do special processing.

Special Processing:

Use MQINQ on the queue name to find the queue type. If the queue type is a QALIAS you need the name of the base queue.
Use MQINQ on the base queue to inquire BOQNAME and BOTHRESH

If MQMD.BackoutCount >= BOTHRESH and BOQNAME != ' ' then
do
  create a header for the message
  use MQPUT1 to put the message to the BOQNAME 
end
else write the message to the DLQ with the appropriate header
Produce an alert to notify the applications team that there 
was a problem, and give date time, 
and any message identifier such as MSGID and Correlid
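
The routing decision in the pseudocode above can be sketched as a small Python function. It deliberately contains no MQ calls: the values would come from the MQMD and from MQINQ as described, and the function name choose_destination is my own. It follows the pseudocode as written, including sending the message to the dead letter queue when no backout queue applies.

```python
from typing import Optional


def choose_destination(backout_count: int, bothresh: int, boqname: str,
                       dead_letter_queue: str) -> Optional[str]:
    """Return the queue to move the message to, or None to process it normally."""
    if backout_count <= 1:
        return None  # no special processing needed yet
    # BOQNAME is blank padded, so strip it before testing for "not blank".
    if backout_count >= bothresh and boqname.strip():
        return boqname.strip()
    # Otherwise the message goes to the DLQ, with the appropriate header.
    return dead_letter_queue
```

For example, with a hypothetical backout queue APP.BACKOUT and BOTHRESH of 3, a message with BackoutCount 3 is routed to APP.BACKOUT. If BOQNAME is blank, it goes to the dead letter queue instead. Building the header, the MQPUT1, and the alert are left out of the sketch.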

Then fix the underlying problems!

 

 

Why do we still need psychic programmers?

Before I answer that question: why do we have end users who need to be psychic?

I’ve been on holiday with my family, and one of them (let’s call her my cousin) was ranting about a wonderful new, agile, cloud-based tool for entering data about patients. The rants were because the tool was not intuitive, and because staff now have to enter the data in both the old tool AND the new tool, since the old tool can’t read the data in the new tool. I kept hearing comments like “how was I meant to know that? I am not psychic.”
I was mulling this over as I stood on the platform waiting for a train to take me to London. I had a first class ticket. When I arrived on the platform, the sign to the left of me said first class was at the front of the train, and the sign to the right of me also said first class was at the front of the train. Neither said which way the train would be going, so how was I to know which way to go, left or right? I am not psychic! In Japan they have a sign saying “12 coach train. Coach K” at the spot where the train door for coach K will be. This is great! The people in the UK who develop the signs clearly use the platform every day, so they automatically know which way the trains are heading. We poor users have no idea. You either had to ask, or there was a 50% chance you went to the wrong end of the train and had to sprint along 12 coaches to get to the other end. Our train was almost delayed by an old person who raced as fast as his two legs and stick could go to get into first class.

Wikipedia said
A psychic is a person who claims to use extrasensory perception (ESP) to identify information hidden from the normal senses, particularly involving telepathy or clairvoyance, or who performs acts that are apparently inexplicable by natural laws.

Back to my ranting family.
Some web-based tools save data every 5 seconds, so if you have a problem, they can show you a draft version. My cousin said she entered lots of data in the new tool and was ready to save it, but there was no save button, so she assumed it was auto-saved. There were some icons at the top of the screen. The icon pictures had no relevance to any task, and there was no hover text to say what they did. (She said one icon looked like the clay dog Gromit from the Wallace and Gromit films.) She pressed an icon and went on to the next screen. After a few minutes she found she needed to update the previous data. She could not find a way to go back, so she quit and re-entered the tool. None of the data she had entered was there. She phoned the help desk and the conversation went like this:
Cousin:       It lost all of my data
Helpdesk: Did you press the save key?
Cousin:      There is no save key
Helpdesk: It is on the right hand side.
Cousin:      No it is not.
Helpdesk: Try maximising the window
Cousin:      I see a save button now. How was I meant to know there was a save button hidden to one side.
Helpdesk: Every one knows it is there
Cousin:      Well I didn’t – I am not psychic. There were no horizontal scroll bars to indicate there was more info to one side.
Helpdesk: It is in the help text – did you look there
Cousin:      Why would I look there ?
Helpdesk: To tell you how to save the data
Cousin:      I expected it to autosave – so it could be recovered in the case of problems.
Helpdesk: It doesn’t do that.
Cousin:      If I had pressed the help button, would it have saved my data before displaying help?
Helpdesk: Only if you had pressed the save button.
(go to line 1)

Cousin:       What do the icons mean?
Helpdesk:  The one with an H in a circle means help
Cousin:       Ah, in the previous tool this meant a list of hospitals. What is this circle with an arrow, like a refresh icon in a browser?
Helpdesk: That is the start again button. It deletes all of the stuff you have entered
Cousin:       You mean like Ctrl-A ( to select all the text) delete?
Helpdesk:  Could be
Cousin:        Is there an undo-delete button?
Helpdesk:   No – you have to save the data often
Cousin:        How was I to know that – I am not psychic!

Cousin:       Which is the stop icon?
Helpdesk:  The red one
Cousin:        Which one is that
Helpdesk:   Are you colour blind ?!
Cousin:       Yes
Helpdesk: Well it’s the one next to the green icon.
Cousin:      Which one is that?
OK this is not all 100% true, but it is based on comments from my family and our struggles to do our jobs despite the tools available to us.

It makes me think these tools are designed by the Dilbert character Mordac, the Preventer of Information Services.

How is this relevant to MQ and the psychic programmers?

I was contacted by someone looking for assistance.
Her company has some guidelines on how to program with MQ. These looked great, with a lot of detail – it must have taken weeks to write it all down.  There was a section on each area of MQ programming.
For example “Use of persistent messages. You should use persistent messages if your data is important. If you cannot afford to lose data then make the messages persistent.”
Her arguments went along the lines of

  • If it was not important, I would not be working on it. I am working on it,  therefore it is important, so all messages must be persistent.
  • This is an audit type message – I think it ok to lose the occasional message ( < 1 %) but not ok to lose 10% of the messages. I cannot make a message 99% persistent.

Another point she raised was that she had been told to put a message to a particular queue to go to the server, and wait for a response. How was she to know what persistence to use when it was not documented?

Often there are no guidelines easily available to help application programmers. For example

  • There may be no guidelines
  • The guidelines are not easy to find
  • The guidelines are not clear
  • The guidelines are written from the wrong viewpoint.
  • The guidelines may be very detailed – but too many details may make it less clear.

You can avoid the need for psychic programmers by having clear guidelines and business templates, written at the business level and not at the MQ capability level.

For example

  •  Business template 1: request reply model
    The messages contain important information. If messages cannot be processed, you need logic to retry, and to handle possible duplicate requests.
    Each business application has queues defined specifically for it; use these queues.
    Messages you are unable to process must be put on an error queue.   This queue must trigger an alert, and you must have a process in place to handle these bad messages.

So how do you avoid the need for psychics?

  • Make sure your information is easy to find
  •  Write the information for the lowest level of knowledge your audience has. Put yourself in their position.   It seems obvious to you only because you already know the answer.
  • Discuss it with your users, or people who support you, get feedback and update your documents.
  • If the rules do not match your environment, change the environment, or change the rules!