Are my digital certificates still valid and are they slowing down my channel start?

Digital certificates are great. They allow program to program communication so each end can get information to identify the other end, and the programs can then communicate securely, with encryption, or just checking the payload has not changed.

A certificate is basically a file with two parts (or two files)  – a public certificate and a private key. You can publicize the public part to any one who wants it (which is why is is called the public part). Anyone with the private key can use it to say they are you. (If you can get access to the private key, then you can impersonate the identity)

There are times when you want to say this certificate is not longer valid. For example when I worked at IBM, I had a certificate on my laptop to access the IBM mail servers.

  • If my laptop was stolen, IBM would need to revoke the certificate.
  • When I retired from IBM, IBM revoked my certificate to prevent me from trying to access my IBM mail using my old certificate from my personal laptop.

Managing these certificate is difficult. There could be billions of certificates in use today.

Your server should validate every connection request to ensure the certificate sent from the client is still valid.

A client should validate the certificate sent by the server to ensure that it is connecting to a valid server.

In the early days of certificates, there was a big list of revoked certificates – the Certificate Revocation List(CRL). If a certificate is on the list then it has been revoked. You tend to have an LDAP server within your firewall which contains these lists of revoked certificates.

This was a step in the right direction, but it is difficult to keep these lists up to date, when you consider how many certificates are in use today, and how many organizations generate certificates. How often do I need to refresh my list? If the CRL server was to refresh it every day, it may be up to one day out of date, and report “this certificate is ok” – when it had been revoked.

These days there is a technique called Online Certificate Status Protocol (OCSP). Basically this says go and ask the site which issued this certificate if it is still valid. This is a good idea – and they say the simple ideas are the best.
How do you know who to ask? A certificate can have url information within in it Authority Information Access: OCSP – URI:http://ColinsCert.Checker.com/, or you can specify URL information in the queue manager configuration for those certificates without the Authority Information Access(AIA) information.

Often the URL in the certificate is outside of your organization, and outside of your firewall. To access the OCSP site you may need to have an SSL Proxy server which has access through the firewall.

You can configure MQ to use a (one) OCSP server for those certificates not using AIA information.  If your organization is a multinational company, you may be working with other companies who use different Certificate Authorities.   If you have certificates from more than one CA, you will not be able use MQ to check all of them to see if they are still valid.  You may want to set up an offline job which runs periodically and checks the validity of the certificates.

Starting the MQ channel can be slow

When an TLS/SSL MQ channel is started, you can use OCSP or CRL to check that a certificate is valid. This means sending a request to a remote server and waiting for a response.   The LDAP server for CRL requests is likely to be within your domain,  as your organization manages it. The OCSP  server could be outside of your control, and in the external network.  If this server is slow, or the access to the server is slow, the channel will be slow to start.  For many customers the network time within a site is under 1 millisecond, between two sites could be 50 ms. Going to an external site the time could be seconds – and dependent on the traffic to the external site.
This time may be acceptable when starting the channel, first thing in the morning, but restarting a channel during a busy period can cause a spike in traffic because no messages flow while the channel is starting. For example

ChannelwithRestart

No messages will flow while the channel is starting, and this delay will add to the round trip time of the messages.

How do I check to see if I have a problem

This is tough. MQ does not provide any information. I used MQ internal trace when debugging problems, but you cannot run with trace enabled during normal running.

There are two parts to the validation request. The time to get to and from the server, and the time to process your request once it has got to the server.

You can use TCP Ping to get to the server (or to the proxy server).  If you are using a proxy server you cannot “ping” through the proxy server.

Openssl provides a many functions for creating and managing certificates.

You can use the command

time openssl s_client -connect server:port
or
time openssl s_client -connect -proxy host:port server:port

The “time” is a linux command which tells your the duration in milliseconds of the command following it.

The openssl s_client is a powerful ssl client program. The -connect… tries connecting to server:port. You can specify -proxy host:port to use the proxy.

The server at the remote end may not recognize the request – but you will get the response time of the request.

Running this on my laptop I got

time openssl s_client -connect 127.0.0.1:8888
CONNECTED(00000005)
140566846001600:error:1408F10B:SSL routines:ssl3_get_record:wrong version number:../ssl/record/ssl3_record.c:332:

no peer certificate available

No client certificate CA names sent

SSL handshake has read 5 bytes and written 233 bytes

Verification: OK

real 0m0.012s

The request took 0.012 seconds.

My openssl OCSP server reported

OCSP Response Data:
OCSP Response Status: malformedrequest (0x1)

You can use openssl ocsp …. to send a certificate validation request to the OCSP server and check the validity of the system – but you would have to first extract the certificates from the iKeyman keystore.

How do I test this?

I’ve created the instructions I used:

More information about OCSP.

There is a good article in the IBM Knowledge Centre here .

The article says
To check the revocation status of a digital certificate using OCSP, IBM MQ determines which OCSP responder to contact in one of two ways:

  1. Using a URL specified in an authentication information object or specified by a client application.
  2. Using the AuthorityInfoAccess (AIA) certificate extension in the certificate to be checked.

Configure a QM for OSCP checking for certificates without AIA information.

You can configure a queue manager to do OCSP checking, for those certificates without AIA information within the certificate.

There is a queue manager attribute SSLCRLNL ( SSL Certificate RevocationList Name List) which points to a name list.   This name list has a list of AUTHINFO object names.
The name list can have up to one AUTHINFO object for OCSP checking and up to 10 AUTHINFO objects for CRL checking.

You define an AUTHINFO object to define the URL of an OCSP server.

Define AUTHINFO(MYOSCP) AUTHTYPE(OCSP) OCSPURL(‘HTTP://MyOSCP.server’)

Create a name list, and add the AUTHINFO to it.

Use alter qmgr  SSLCRLNL(name) and refresh security type(SSL)
You need to change the qm.ini file SSL stanza of the queue manager configuration file see here.

OCSPAuthentication=REQUIRED|WARN|OPTIONAL
OCSPCheckExtensions= YES|NO
SSLHTTPProxyName= string

If you have a fire wall around your network, you can use SSLHTTPProxyName to get through your fire wall.

There is some good information here.

Configure client OSCP checking for certificates without AIA information.

You need the CCDT created by a queue manager rather than a JSON CCDT.
When you configure a queue manager with the AUTHINFO objects and the queue manager SSLCRLNL attributes, the information is copied to the CCDT.

This CCDT is in the usual location, for example the /prefix/qmgrs/QUEUEMANAGERNAME/@ipcc directory.

You can use a CCDT from one queue manager, when accessing other queue managers.

You need to make the CCDT file available to the client machines, for example email or FTP,  or use URL access to the CCDT.

You also should configure the mqclient.ini file see here.

How do I check to see if my certificates have AIA information.

You can use the iKeyman GUI to display details about the certificate, or a command line like

/opt/mqm/bin/runmqckm -cert -details -db key.kdb -pw password -label CLIENT

This gives output like

Label: CLIENT
Key Size: 2048
Version: X509 V3
Serial Number: 01
Issued by: ….

Extensions:
– AuthorityInfoAccess: ObjectId: 1.3.6.1.5.5.7.1.1 Criticality=false
AuthorityInfoAccess [
[accessMethod: ocsp
accessLocation: URIName: http://ocsp
server.my.host/
]
]

Can I turn this OCSP checking off ?

For example if you think you have a problem with OCSP server response time.

In the mqclient.ini you can set

  • ClientRevocationChecks = DISABLED No attempt is made to load certificate revocation configuration from the CCDT and no certificate revocation checking is done
  • OCSPCheckExtensions = NO   This says ignore the URL in the AIA information within a certificate.

See SSL stanza of the client configuration file.

In the qm.ini you can set

  • OCSPCheckExtensions=NO  This says ignore the URL in the AIA information within a certificate.
  • alter qmgr SSLCRLNL(‘ ‘) and refresh security type(SSL)

See SSL stanza of the queue manager configuration file and Revoked certificates and OCSP.

How do I tell if I have a problem with OCSP?

There are no events or messages which tell you the response time of requests.

You may get message AMQ9716

AMQ9716
Remote SSL certificate revocation status check failed for channel …
Explanation
IBM MQ failed to determine the revocation status of the remote SSL certificate for one of the following reasons:
(a) The channel was unable to contact any of the CRL servers or OCSP responders for the certificate.
(b) None of the OCSP responders contacted knows the revocation status of the certificate.
(c) An OCSP response was received, but the digital signature of the response could not be verified.
You can change the queue manager configuration to not produce these messages, by setting ClientRevocationChecks = OPTIONAL
From this message you cannot tell if the request got to the server.
The easiest way may be to ask the network people to take a packet trace to the URL(s) and review the time of the requests and the responses.

Using the AuthorityInfoAccess (AIA) certificate extension in the certificate.

You can create certificates containing the URL needed to validate the certificate. Most of the IBM MQ documentation assumes you have already have a certificate with this information in it.
You can use openssl to create a certificate with AIA information, and import it into the iKeyman keystore.   See here.

You cannot use IBM GSKIT program iKeyman to generate this data because it does not support it. You can use iKeyman to display the information once it is inside the keystore.

Timing the validation request

Openssl has a command to validate a certificate for example

openssl ocsp -CAfile cacert.pem -issuer cacert.pem -cert servercert.pem -url http://OCSP.server.com:port -resp_text

You can use the linux time command, for example

time openssl ocsp -CAfile cacert.pem -issuer cacert.pem -cert servercert.pem -url http://OCSP.server.com:port -resp_text

you get

Response verify OK
servercert.pem: good
real 0m0.040s

The time taken to go to a OCSP server on the same machine is 40 milliseconds. The time for a ping to 127.0.0.1 was also 40 ms.

Thanks…

Thanks to Morag of MQGEM, and Gwydion at IBM for helping me get my head round this topic.

Whoops my QM emergency recovery procedures did not recover QM in an emergency!

I was working with someone, and we managed to kill a test queue manager on midrange.  I suggested we test out the “emergency procedures” as if this was production and see if we could get “production” back in 30 minutes.

We learned so much from this exercise, so we are now working on a new emergency recovery procedure.

What killed the queue manager

The whole experience started when we thought had better clean up some of the MQ recovery logs.   With circular logging the when the last log fills up it overwrites the first one.  This is fine for many people but it means you may not be able to recover a queue if it is damaged.
We had linear logging, where the logs are not overwritten, MQ just keeps creating new logs.   You can recover queues if they are damaged, because you can go back through the logs.
As our disk was filling up someone deleted some old logs – which were over a week old – and were “obviously” not needed.

MQ was shut down, and restarted – and failed to start.

Lesson 1:  With linear logging you are meant to use the rcdmqimg command which copies queue contents to the log.   You get a message telling you which logs are needed for object recovery, and which logs are needed for restart.   This information is also in the  AMQERRxx.LOG.  You cannot just delete old logs as they may still be needed.

Issue the command at least daily.

Lesson 2:  HA disks do not give you HA.   The disks were mirrored to the backup site – also to the DR site.  The delete file command was reliably done on all disk copies.  We could not start MQ on any site because of the missing file.  We should have had a second queue manager.

These HA disk solutions and active/standby give you faster access to your data, in  a multi site environment, they do not give you High Availability

Initial panic – what options do we have

Lesson 3: Your instructions on how to recover from critical situations need to be readily available.  They should be tested regularly.    We could not find any. You need a process to follow which works, and you have timings for.  So you do not have a half hour discussion “should we restore from backup?”, “how long will it take?”, “will it work?”, “how do we restore from backup”.   The optimum solution may be to shoot  the queue manager and recreate it.  This may be the optimum route to getting MQ “production” back.  You should not have to make critical decisions under pressure, the decision path should have been documented when you have the luxury of time.

Lesson 4: you need to capture the output of every command you are doing.  Support teams will ask “please send me the error logs”.  You do not want to have to copy an paste all of your terminal data.  Linux has a “script” command which does this.  They could not email me the log showing the problems, so we had to have a conference call, and “share screens” to see what was going on, which made it hard for me to look at the problem “up a bit, down a bit – too far”.  All of which extended the recovery period.

Lesson 5: “Let’s restore from the backups”  These backups were taken  12 hours before and were not current, and we did not know how to restore them.
(Little thought, should backups be taken when a QM is down, or do you get integrity issues because files and logs were backed up at different times? – I know z/OS can handle this – Feedback from Matt at IBM.  Yes the queue manager should be shut down for backups – so you need two or three queue managers in your environment.

Make sure you backup your SSL keystore.

Let’s recreate the queue manager

Lesson  6: Do you have any special actions to delete multi instance queue managers.

Do you need linux people to do anything with the shared disks?

Lesson 7: Save key queue manager configuration files.  When you delete a queue manager instance – it deletes any qm.in and MQAT.ini files – you need them as they may have additional customising, for example SSL information.
Of course you are backing these files up -and of course you (personally)  have tested that you can  recover them from the backup.

Copy qm.in and MQAT.ini to a safe location before you delete the queue manager.

Lesson 8:  Ensure people have enough authority to be able to do all the tasks – or have an emergency “break glass userid”.  Many sites only allow operations people to access production with change capability.

Lesson 9:  You need to know how the  create queue manager command and parameters used to create the queue manager.
Some queue manager options can be changed after the queue manager has been started.  Others cannot – for example linear logging|circular logging.  Size of log files etc.

You need to have saved the original command used with all of the options.   Do not forget that when you did it the first time it was MQ V7.5 – you are now migrated to MQ V9, so it should work OK!

Lesson 10: Copy the qm.ini files etc and overwrite the newly created ones.

Start the queue manager.

Lesson 11:  Customize the queue manager.  You need to have a file of all of your objects queues and channels etc.  You may have a file which you use to create all new queue managers, but this may not  be up to date.  It is better to run dmpmqcfg every day to dump the definitions to get the “current” state of the objects which you can reload.
The -o 1line option is useful as then you can use grep to select objects with all the parameters.

Lesson 12: In your emergency recreate document  note how long each stage takes.  One step, closing down the queue manager took several minutes.  We were discussing if was looping or not – and should we cancel it.  Eventually it shut down.  It would have been better to know this stage takes 5 minutes.

Lesson 13: Document the expected output from each stage – and highlight any stage which gives warnings or errors.  We ran runmqsc with a file of definitions, and it reported 7 errors.  We wasted time checking these out.  Afterward we were told “We always get those”.

Lesson 14: Do you need to do work for your multi instance queue managers?

Getting the queue manager back into “production”. 

Lesson 15: Resetting receiver channel sequence numbers.   Sender and receiver channels will have the wrong sequence number.  You can reset the sender channels yourself.  Receiver channels are a bit harder, as the “other end” has to reset the sequence number. You can  either

  • Contact the people responsible for the other end (you do have their contact details dont you?) and
  • ask them to reset the channel,

or you wait till their queue manager sends you a message – and then you get notified of the sequence number mismatch, and can use reset channel to reset your number to the expected number.   The channel will retry and this time it will  work.  This means you need to sit by your computer, waiting for these events.  Maybe no messages will be sent over the weekend, and so you can logon first thing Monday morning to catch the events.

Lesson 16: Your SSL keystore is still available isnt it?

Lesson 17: Is every one who has the on-call phone familiar with this procedure, and has practiced it?

Lesson 18: People need to be familiar with the tools on the platform.  You may normally use notepad to edit files on your personal work station.  On the production box you only have “vi”.

Overall – this is one process that needs to work – and to get your queue manager up in the optimum time.  You need to practice it, and get it right.

Summary

You need to practice emergency recovery situations

I used to do Scuba diving.  You learn, and have to practice “ditch and retrieve” where you take your kit off under water and have to put it on again.   Once I needed to do this in the sea.  It was dark, I got caught in a fishing net, so I had to take my kit off, untangle it (by touch), and put it on again.  If I had not practiced this I would not be here today.

 

Checking the daisy chain around my MQ network

Young children collect flowers and chain them to make a circle and so make a daisy chain.

People also talk about daisy chaining electrical extension leads together to make a very long lead out of lots of small leads.

In MQ we can also have daisy chains.  One use is to check all of the links are working, and there are no delays on the channels.

If an application puts a message onto the Clustered Request Queue(BQ) on QMA, it goes around and the reply can be got from the Reply queue, then we have checked all the links are working; we have daisy chained the requests.

DaisyChainOnce I Once I got it working the definitions were simple.

 

On QMB

DEFINE QR(BQ) RQMNAME(QMC) RNAME(CQ) CLUSTER(MYCLUSTER)

On QMC

DEFINE QREMOTE(CQ) CLUSTER(MYCLUSTER) RQMNAME(QMA) RNAME(REPLY)

On QMA

DEFINE QL(REPLY)

Once we have set the definitions up I could can use the MQ utility dspmqrte to show us the path.  For example

  • dspmqrte puts a message to the BQ .  This is a clustered queued on QMB,  It reports the queue BQ is being used, and stores the message on SYSTEM.CLUSTER.XMIT.QUEUE.
  • On QMA the channel TO.B gets the message from the SCTQ and sends it
  • On QMB the channel TO.B says put this to BQ, which is defined as CQ, and stores it on the SCTQ.
  • On QMB the channel TO.C gets the message from the SCTQ and sends it to QMC.
  • On QMC the channel TO.C says put this to CQ, which is defined as REPLY,  and stores if on SCTQ
  • On QMC the channel TO.A gets the message from the SCTQ and sends it to QMA.
  • On QMA the channel TO.A puts it to the Reply queue.

I used dspmqrte -m QMA -q BQ …, and it worked like magic.   I requested summary information(-v summary) and I got the following output, which shows the intermediate queues used.

AMQ8653I: DSPMQRTE command started with options ‘-m QMA -qBQ -rqCP0001 -rqm QMA -v summary -d yes -w5’.
AMQ8659I: DSPMQRTE command successfully put a message on queue ‘SYSTEM.CLUSTER.TRANSMIT.QUEUE’, queue manager ‘QMA’.
AMQ8674I: DSPMQRTE command is now waiting for information to display.
AMQ8666I: Queue ‘SYSTEM.CLUSTER.TRANSMIT.QUEUE’ on queue manager ‘QMA’.
AMQ8666I: Queue ‘SYSTEM.CLUSTER.TRANSMIT.QUEUE’ on queue manager ‘QMB’.
AMQ8666I: Queue ‘SYSTEM.CLUSTER.TRANSMIT.QUEUE’ on queue manager ‘QMC’.
AMQ8666I: Queue ‘REPLY’ on queue manager ‘QMA’.
AMQ8652I: DSPMQRTE command has finished.

Note, specifying RQMNAME is not required, and clustering will pick a queue manager which hosts the queue.  This means that you may be testing a different path to what you expected.  By using it you specify the route.

When I stopped QMC and retried the dspmqrte command , I got

AMQ8653I: DSPMQRTE command started with options ‘-m QMA -qBQ -rqCP0001 -rqm QMA -v summary -d yes -w5’.
AMQ8659I: DSPMQRTE command successfully put a message on queue ‘SYSTEM.CLUSTER.TRANSMIT.QUEUE’, queue manager ‘QMA’.
AMQ8674I: DSPMQRTE command is now waiting for information to display.
AMQ8666I: Queue ‘SYSTEM.CLUSTER.TRANSMIT.QUEUE’ on queue manager ‘QMA’.
AMQ8666I: Queue ‘SYSTEM.CLUSTER.TRANSMIT.QUEUE’ on queue manager ‘QMB’.
AMQ8652I: DSPMQRTE command has finished.

It does not report that there were any problems;  it just did not report two hops.

To see if there are problem, I think the best thing to do is pipe the output into a file

dspmqrte… > today

and compare this with a good day.

diff today goodday -d  gave me the differences – so I could see there was a problem because I was missing

> AMQ8666I: Queue ‘SYSTEM.CLUSTER.TRANSMIT.QUEUE’ on queue manager ‘QMC’.
> AMQ8666I: Queue ‘REPLY’ on queue manager ‘QMA’.


I had tried to define a clustered queue alias queues instead of a remote queue.  I got responses like

Feedback: UnknownAliasBaseQ, MQRC_UNKNOWN_ALIAS_BASE_Q, RC2082.

Should I use dynamic queues? – probably not for high volume production work

I’ve been looking into temporary dynamic queues on midrange MQ.  People tend to use these as a scratch queue which can be deleted when it has been finished with.   Rather than use a shared reply queue, and get messages by msgid or correlid, you can just say get next message, as you know there will only be messages for your application instance  on the  queue.

The application team may save money by using a dynamic instead of a common replyTo queue,  but you are likely to pay for it else where – including management time.

Using dynamic queues may be easy to program – but can have some serious implications.

Sometimes a dynamic queue is used because it saves the programmer about 5 minutes of coding by not specify nor testing gets specifying  msgid or correlid.
One customer had a paranoid architect  who did not want an application instance to have access to messages from another instance.  With a well designed, coded, and reviewed application this should not occur – or use IBM MQ AMS to encrypt the messages.

I had a email from someone who said they are having problems with statistics on midrange because of the  large number of dynamic queue – more of this later.   I think they said they get statistics  for 10 million dynamic queues a day, hence the question if they should change their applications.

This temp dynamic queue can be used for

  • a one-of:  MQOPEN of a model queue to create a  dynamic queue. Then  MQPUT of request, the reply comes back to the dynamic queue, MQCLOSE of the reply queue, which deletes the queue.
  • open the queue in the morning, use it all day when there is no work to do. The application ends, and deletes the queue.

You get the messages, close the queue, and the queue disappears – easy – what can go wrong?

I’ll cover

  • the impact on the applications of using dynamic queues,
  • the impact of systems management.

The impact of applications using dynamic queues

The documentation says The queues are deleted when the application that issued the MQOPEN call that created the queue closes the queue or terminates. … If the queue is in use at this time (by the creating, or another application), the queue is marked as being logically deleted, and is only physically deleted when closed by the last application using the queue.

For performance reasons, a receiver channel does not use MQPUT1 to put the messages to the queue, it does MQOPEN, MQPUT, MQPUT…  as is many cases the same few queues are used. and MQOPEN, MQPUT, MQPUT is more efficient.  After some time, if the queue has not been used the channel will close the queue.
The implications of this are that if the application is a one-of,  the program started and finished in 1 second,  a channel may keep hold of the queue. You think there is only one dynamic queue in use at any time, but if you have 1000 channels – there is potentially a lot of queues being only logically deleted and still around.

Conceptually the queue manager has to created the queue every time a dynamic queue is needed.   In practice the queue manager has a pool of these which can be serially reused and so keep the costs down.   This is fine for a small number of queues.

I did some totally unscientific measurement on my laptop to give a indication of the impact of using dynamic queues.  The applications I used are completely non typical.  The measurement give an indication and are meant to make you think about using dynamic queues rather than giving information you can use.

First measurement open the queues

Have a thread open a queue, 20,00 times.  The application does 2000 opens, report statistics,  wait 1 second, repeat.

Queue type Average CPU per open Average Elapsed time uSeconds per open
Permanent First 2000 : 7
Last 2000 : 8
First 2000 : 64
Last 2000:  101
Dynamic First 2000 : 8
Last 2000 : 8
First 2000 : 183
Last 2000: 303

The CPU usage looks similar, but the elapsed time increases when dynamic queues are used.

While the first program had its dynamic queues open, a second instance was run, the first 2000 took  290 microseconds, the last 2000 took 343 microsecond per open on average, so there is an increase in open time depending on the number of temp dynamic queues which are open.  The increase is not just dependent of the number of queues opened per thread.

Second measurement – close the queues.

The queues were closed

  • test 2A, in the order they were opened,
  • test 2B,  in reverse order of opening.
Queue type Order Average CPU Average Elapsed time uSeconds
Permanent First opened, first closed First 2000 : 284 Last 2000 : 11 First 2000 : 1533
Last 2000: 83
Permanent Last opened, first closed First 2000: 5
Last: 2000: 8
First 2000 : 58
Last 2000: 65
Dynamic First opened, first closed First 2000 : 284 Last 2000 : 16 First 2000 : 1533
Last 2000: 783
Dynamic Last opened, first closed First 2000: 8
Last: 2000: 8
First 2000: 295
Last: 2000: 736

When the queue is last-opened-first-closed this is much cheaper than first-opened-first- closed.  It feels like there is a linear list of handles.  It starts at the last opened, and scans backwards looking for the specific handle.

  • An application should not have hundreds of queues open.
  • If it does, the cost will depend, on where the handle is in the list – the “cheap end”, or the “expensive end”.  You may have no control over this, so expect significant variation in the cost and elapsed time of performing the MQCLOSE on the queue.
  • The dynamic queues took longer to close than the permanent queues
  • In general it is quicker and uses less CPU time to open a permanent queue  than using a dynamic queue.   If you open and shut a queue 100 times a second this may be relevant.  If you open and close them once a day, and keep them open all day,  this is not so important.

Overwhelmed by statistics data.

All good production environment should be collecting statistics on queue usage.  If the STATQ attribute is ON ( either for the queue, or for the queue manager) then a statistics record will be produced every STATINT seconds.

Each time I ran my test  I had over 200 messages on SYSTEM.ADMIN.STATISTICS.QUEUE, and 20,000 records, one for each dynamic queue (100 record per statistics message).
For STATINT of 5 minutes, if I had opened the queues, and left them open, in  a 10 hour day this would be 12 * 10  * 20,000 which is 2.4 million records.  (If it reported the model queue name, instead of the generated name this would be just 120 records and would be so much more useful)!.
When I ran my program many times in the 5 minute interval I got many * 20,000 rows.

You take these records and store them in your statistics database, and instead of having one database row saying “at 0805 queue PAYROLL had 10000 MQPUTS”, you have 10,000 rows saying “At 0805 Queue AMQ.5CF22BF6214…. had one MQPUT”.   As you keep your data for at least 13 months (so you can look back at last year’s peak periods), you have a lot of data and wasted space.

What you can do.

  1. Turn off stats collection for these dynamic queues – not very helpful.
  2. Post process the data and aggregate all of the dynamic queues for a period into one record (select (sum(puts),….) where qname like ‘AMQ%’….) – and then delete the individual records.  Extra work which could be avoided if you did not use dynamic queues.
  3. Use a model queue name for each business application so you get  MOBILE.5CF22BF62144A7A4, and CreditCheck.5CF22BF62144A7A4 instead of AMQ.5CF22BF62144A7A4 to make it easier to see where the queues are used.
  4. Change the applications to use a shared reply queue. This may be the hardest one to do.
    1. Ensure the applications get their message by MSGID or CORRELID, and not just get next.
    2. Use a common reply queue names to provide isolation – such as MOBILE.REPLY and CREDITCHECK.REPLY.
    3. You may have to implement best practices of cleaning up old messages.   If a dynamic queue was closed any messages on the queue would be deleted.  As you are not deleting the common queue – you need to clean these up yourself.   This can occur when an application timed out and then the reply arrived.

In summary

The applications team may have saved some money  by using a dynamic queue instead of a shared reply queue – but you will spend much more money, on CPU, disk space for database tables.   Then of course there is the management time spent discussing why the reporting of dynamic queue usage is not so useful.

 

 

 

 

Featured

When is activity trace enabled?

I found the documentation for activity trace was not clear as to the activity trace settings.

In mqat.ini you can provide information as to what applications (including channels) you want traced.

For example

applicationTrace:
ApplClass=USER
ApplName=progput
Trace=OFF

This file and trace value are checked when the application connects.  If you have TRACE=ON when the application connects, and you change it to TRACE=OFF, it will continue tracing.

If you have TRACE=OFF specified, and the application connects, changing it to TRACE=ON will not produce any records.

With

  • TRACE=ON, the application will be traced
  • TRACE=OFF the application will not be traced
  • TRACE= or omitted then the tracing depends on alter qmgr ACTVTRC(ON|OFF).   For a long running transaction using alter qmgr to turn it on, and then off, you will get trace records for the application from in the gap.

If you have

applicationTrace:
ApplClass=USER 
ApplName=prog* 
Trace=OFF

applicationTrace:
ApplClass=USER 
ApplName=progp*
Trace=ON

then program progput will have trace turned on because the definition is more specific.

You could have

applicationTrace:
ApplClass=USER 
ApplName=progzzz
Trace=OFF

applicationTrace:
ApplClass=USER 
ApplName=prog*
Trace=

to  be able to turn trace on for all programs beginning with prog, but not to trace progzzz.

 

Thanks to Morag of MQGEM  who got in contact with me, and said  long running tasks are notified of a change to the mqat.ini file, if the file has changed, and a queue manager attributed has been changed – even if it is changed to the same variable.

This and lots of other great info about activity trace (a whole presentation’s worth of information) is available here.

Buried options to dspmqrte make it more useful, but…

I was bemoaning that I dspmqrte did not actually tell me when the message arrived on the queue (COA), and  when the application did an MQGET ( message delivered, COD).

I piped the output from the dspmqrte command into a few lines of python and got

AMQ8659I: DSPMQRTE command successfully put a message on queue SYSTEM.CLUSTER.TRANSMIT.QUEUE , queue manager QMA .
AMQ8674I: DSPMQRTE command is now waiting for information to display.
Put QM:QMA QN:CSERVER RemoteQMName:QMC
Get QM:QMA QN:SYSTEM.CLUSTER.TRANSMIT.QUEUE 
Send QM:QMA RQMN:QMC Chl:CL.QMC ChlT:ClusSdr 
Receive QM:QMC RQMN:QMA Chl:CL.QMC 
Discard QM:QMC QN:CSERVER
AMQ8652I: DSPMQRTE command has finished.

We can see the message got there because it was discarded.

I changed my back end application to ignore messages with

MsgType :MQMT_DATAGRAM  and Format :MQADMIN.

Now with the command

dspmqrte -m QMA -q CSERVER -rq CP0001 -rqm QMA -d yes -v outline -w5 -ro activity,coa,cod

I got additional messages

AMQ8654I: Trace-route message arrived on queue manager QMC .
AMQ8662I: Trace-route message delivered on queue manager QMC .

When I stopped my backend application I got the message arrived – but not the message delivered, as expected.

For a small of work in the application I got a huge improvement in problems diagnostic tools.    I was really please with my progress, and had a cup of tea to celebrate.

The title said “but…”.  This is because the message arrived, and message delivered are not displayed in real time, but are produced after the time out interval specified in the -w option had expired.  I had hoped to use dspmqrte to tell me where the delays were in route to the queue.
So all I can tell is the message got there – but not how long it took.  What a wasted opportunity to provide useful information.

So get your applications changed to ignore these trace route messages – and use the -ro COA and COD to tell you if your messages are being processed in a timely manner!


Update.
If you specify the -d yes option the message can be got by your application.
If you then mqput this message to the next application in a sequence of applications, you will get the channel activity for sending it on, and the COA and COD messages sent back when it arrives at each hop.
Unfortunately, the dspmqrte program displays the last COA, and the last COD messages.

I sent the request CSERVER on QMC.   The application sent this to queue CP0000 on QMA.  The COA and COD message displayed were for QMA.  The COA and COD messages do not tell you which queue the event applies to!

 

 

 

 

 

 

Sorting out the MQ application trace knotted spaghetti.

You can turn on report = MQRO_ACTIVITY to get activity traces sent to a queue.   This shows the hops and activity of your message.
You can create your own trace route messages to be sent to a remote queue, and get back the hops to get to the queue, or you can use the dspmqrte command to do this for you.

Which ever way you do it, the result is a collection of messages in your specified reply queue.  The problem is how do you untangle the messages.  It is not easy with for a single message.  If you are getting these activity messages every 10 seconds from multiple transactions, you quickly  get knotted spaghetti!   To entangle the spaghetti even more, you could have a central site processing these data from many queue managers, so you get data from multiple messages, and multiple queue managers.

You can get a message for part of your application or transaction.  For example,

  • a message with information about the first 10 MQ verbs your program uses.
  • a message for the sender channel with the MQGET and the send for the local queue manager, and the remote queue manager will send a message with the channel’s receive and MQPUT.

The easy bit – messages for activity on your queue manager.

The event message has a header section.  This has information including

QueueManager: ‘QMA’
Host Name: ‘colinpaice’
SeqNumber: 0
ApplicationName: ‘amqsact’
ApplicationPid: 28683
UserId: ‘colinpaice’

From  QueueManager: ‘QMA‘ and Host Name: ‘colinpaice‘, you know which machine and queue manager you are on.

From ApplicationPid: 28683 SeqNumber: 0, you can see the records for this applications Process ID, and the sequence number.   This happens to be for a program ApplicationName: ‘amqsact’ and UserId: ‘colinpaice’.  I dont know when the sequence number wraps.  If the application ends, and the same process is reused,  I would expect the sequence number to be reset to 0.

You may have many threads running in a process , such as for a web server.  For each MQ operation  there is information for example

MQI Operation: 0
Operation Id: MQXF_PUT
ApplicationTid: 81
OperationDate: ‘2019-05-25′
OperationTime: ’14:28:18’
High Res Time: 1558790898843979
QMgr Operation Duration: 114

We can see that this is for Task ID 81.

So to tie up all of the activity for a program, you have to select the records with the same ApplicationPid, and check the SeqNumber to make sure you are not missing records.  Then you can locate the record with the same TID.
You also need to remember that a thread behaviour can be complex ( like adding meat balls to the spaghetti).  Because of thread pooling, an application may finish with the thread, and the thread can be reused.  If a thread is not being used, it can be deleted, so you will get MQBACK and MQDISC occurring after a period of time.

It is similar for channels

For a sending channel you get the following fields.

QueueManager: ‘QMA’
Host Name: ‘colinpaice’
SeqNumber: 1723
ApplicationName: ‘runmqchl’
Application Type: MQAT_QMGR
ApplicationPid: 5157
ConnName: ‘127.0.0.1(1416)’
Channel Type: MQCHT_CLUSSDR

MQI Operation: 0
Operation Id: MQXF_GET
ApplicationTid: 1

For a  receiving channel you get

QueueManager: ‘QMA’
Host Name: ‘colinpaice’

SeqNumber: 1746
ApplicationName: ‘amqrmppa’
ApplicationPid: 4509
UserId: ‘colinpaice’

Channel Name: ‘CL.QMA’
ConnName: ‘127.0.0.1’
Channel Type: MQCHT_CLUSRCVR

MQI Operation: 0
Operation Id: MQXF_OPEN
ApplicationTid: 5
….
MQI Operation: 1
Operation Id: MQXF_PUT
ApplicationTid: 5
….

As there can be many receiver channels with the same name for example an Receiver MCA channel, you should be able to use the CONNAME IP address to identify the channel being used.

They may have the same or different ApplicationPid.

It might be easier just search all of the channels for the messages with matching msgid and correlid!