facepalm: Why my mqconsole was refreshing every 60 seconds

I had just upgraded MQ to 9.1.5 and was trying out the new mqconsole in the web browser.  It was really annoying because every 60 seconds the web page would refresh, and I would have to start again.

I searched my mqwebuser.xml file for "60", but there were no hits.

Gwydion suggested I check the ltpaExpiration value.   This specifies how long a logon token is valid for; after it has expired you have to log on again.  There were several problems here:

  • I had <variable name="ltpaExpiration" value="30"/>.   This says your session expires after 30 minutes.  Because I was using certificates to log on, I automatically got logged on again after this interval expired; as this was 30 minutes, not 1 minute, it was not my problem.
  • I used grep ltpa mqwebuser.xml and found I had several <variable name="ltpaExpiration" value=…/> statements. The last one had a value of 01 – one minute.   This was the statement causing my problem.

What did I learn from this?

  • Use the command  dspmqweb properties -a  to display all of the parameters.   If they are not as you expect, check the mqwebuser.xml file (there is a sketch of this check after this list).
  • The setmqweb command deletes all the entries except for the last one, and updates the last entry with the specified value.   This solved my duplicate entry problem.
  • I specified setmqweb properties -k ltpaExpiration -v 015 to set it to 15 minutes.
  • You can put a comment  <!-- unit in minutes --> before (or near) the statement.
  • I also added unit="minutes" into the statement <variable name="ltpaExpiration" unit="minutes" value="016"/>.  setmqweb and startup ignore this attribute.
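As a quick check for this sort of duplicate, here is a minimal Python sketch which lists every ltpaExpiration <variable> in mqwebuser.xml (the file path is just an example – use the mqweb server directory on your system):

# List every ltpaExpiration <variable> in mqwebuser.xml so duplicates stand out.
# The path below is an example, not necessarily where your file is.
import xml.etree.ElementTree as ET

path = "/var/mqm/web/installations/Installation1/servers/mqweb/mqwebuser.xml"
root = ET.parse(path).getroot()

values = [v.get("value") for v in root.iter("variable") if v.get("name") == "ltpaExpiration"]
print("Found", len(values), "ltpaExpiration definitions:", values)
if len(values) > 1:
    print("Duplicates - the last definition is the one that took effect for me:", values[-1])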

I tend not to use the setmqweb command as I pretend to be an enterprise where every change has to have a change record and where configurations are created centrally and deployed to the machines.

What is the connection between MQ, a thatched cottage, a museum, an aeroplane and a factory?

Do you give up on the question – so do I!

I upgraded to 9.1.5 and used the "improved" mqconsole web interface.  This has several problems. Under "MQ Basics, what is messaging?" you get some pictures; for example, we seem to have random meteors going to random places.

It is hard to understand, as it uses a white background and off-white colours for connections.   I took the source of this, and with a bit of Python magic changed the HTML and produced a new version of the picture.

This is clearer, but I am none the wiser.  We have icons for Home, Shopping Trolley, Thatched Cottage, High-rise building with garage, Factory, Aeroplane, and Museum.  The aeroplane sends stuff to the blue box (queue manager?) but messages are not processed.

My blog posts from the old IBM community

The old IBM community site is being replaced by a new IBM community web site, and old blogs are not being copied across.

I've managed to copy the posts I wrote when I worked for IBM across to this web site. It has been a trip down memory lane reading the posts, and the topics I covered.

NETTIME does not just mean net time

I saw a post which mentioned NETTIME, where people assumed it is the network time.   It is more subtle than that.

If NETTIME is small then don't worry.   If NETTIME is large it can mean:

  • Slow network times
  • Problems processing messages at the remote end

Consider a well-behaved channel where there are two messages on the transmission queue.

Sender end:
  • MQGET first message from the queue
  • TCP Send message in  buffer 1
  • MQGET second message from the queue
  • TCP Send message in buffer 2
  • MQGET – no message found
  • Do end of batch processing
    • TCP Send “End of batch” in buffer 3
    • Start timer
    • Wait
    • buffer arrives, wake up application
    • Stop timer. Interval is “Sender wait time”
  • Extract “receiver processing time” from reply buffer
  • Calculate NETTIME = “sender wait time” – “receiver processing time”

Receiver end:
  • buffer 1 arrives, wake up the receiver channel application
  • buffer 2 arrives
  • TCP receive buffer 1 from network
  • MQPUT message 1 to the queue
  • buffer 3 arrives
  • TCP receive buffer 2 from network
  • MQPUT message 2  to the queue
  • TCP receive buffer 3 from the network – get  “end of batch” flag
    • Start timer
    • Issue commit
    • Stop timer
    • Send "OK + time interval" back to the sender

In this case the NETTIME is the total time less the time at the receiver end.  So NETTIME is the network time.

In round numbers

  • it takes 2 milliseconds from sending the data to it being received
  • get + send takes 0 ms (the duration of these calls is measured in microseconds)
  • receive (when there is data) + MQPUT (which works) takes 0 ms
  • commit takes 10 ms
  • it takes 1 ms between sending the response and it arriving
  • "10 ms" is sent in the response message

This is a simplified picture with details omitted.

The sender channel sees 13 ms between the last send and getting the response.  (13 ms – 10 ms) is 3 ms – the time spent in the network.

 

Now introduce a queue full situation at the receiver end

Sender end:
  • MQGET first message from the queue
  • TCP Send message in buffer 1
  • MQGET second message from the queue
  • TCP Send message in buffer 2
  • MQGET – no message found
  • Do end of batch processing
    • TCP Send “End of batch” in buffer 3
    • Start timer
    • Wait
    • buffer arrives, wake up application
    • Stop timer. Interval is “Sender wait time”
  • Extract “receiver processing time” from reply buffer
  • Calculate NETTIME = “sender wait time” – “receiver processing time”

Receiver end:
  • buffer 1 arrives, wake up the receiver channel application
  • buffer 2 arrives
  • TCP receive buffer 1 from network
  • MQPUT message 1 to the queue – it gets queue full, it pauses
  • buffer 3 arrives.  All of the data is in the buffers in TCP at this end.
  • after 500 ms the MQPUT succeeds.
  • TCP receive buffer 2 from network
  • MQPUT message 2 to the queue
  • TCP receive buffer 3 from the network – get “end of batch” flag
    • Start timer
    • Issue commit
    • Stop timer
    • Send "OK + time interval" back to the sender

In round numbers

  • it takes 2 milliseconds from sending the data to it being received
  • get + send takes 0 ms (the duration is in microseconds)
  • receive (when there is data) takes 0 ms
  • the pause and retry took 500 ms
  • the second receive and MQPUT takes 0 ms
  • commit takes 10 ms
  • it takes 1 ms between sending the response and it arriving
  • "10 ms" is sent (as before) in the response message (the time between the channel code seeing the "end of batch" flag and the end of its processing)
  • Buffer 3 with the “end of batch” flag was sitting in the TCP buffers for 500 ms

The sender channel sees 513 ms between the last send and getting the response.  513 ms – 10 ms is 503 ms, and the channel reports this as "the time spent in the network", when in fact the network time was 3 ms and 500 ms was spent waiting to put the message.
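As a sketch of the arithmetic, using the round numbers above, the sender's calculation is the same in both cases – it just cannot tell where the time went:

# NETTIME as calculated by the sender: all it knows is how long it waited,
# and the processing time reported back by the receiver.
def nettime(sender_wait_ms, receiver_processing_ms):
    return sender_wait_ms - receiver_processing_ms

# Well behaved channel: 2 ms out + 10 ms commit + 1 ms back = 13 ms wait.
print(nettime(13, 10))   # 3 ms - genuinely the network time

# Queue full at the receiver: the 500 ms MQPUT retry is invisible to the sender.
print(nettime(513, 10))  # 503 ms - reported as network time, but only 3 ms was network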

Regardless of the root cause of the problem, a large NETTIME should be investigated:

  • do a TCP ping to do a quick check of the network
  • check the error logs at the receiver end
  • check events etc to see if the queues are filling up at the receiver end

The ccdt file was exactly where I specified it.

I had been successfully using a json ccdt file to use runmqsc as a client.  Because of a “.” it didn’t work when I used it elsewhere.

I had a directory of useful shell scripts.  In this directory I had my mqclient.ini with

CHANNELS:
MQReconnectTimeout=30
ReconDelay=(1000,200)(2000,200)(4000,1000)
ChannelDefinitionDirectory=.
ChannelDefinitionFile=ccdt.json

I had another directory with some shell scripts for some other applications.  I used

export MQCLNTCF=/home/colinpaice/… to point to the mqclient.ini file.

When I used runmqsc -c QMA I got

AMQ8118E: IBM MQ queue manager does not exist.

In /var/mqm/errors/AMQERR01.LOG was the error message AMQ9518E: File ‘./ccdt.json’ not found.

This was because I had specified ChannelDefinitionDirectory=. instead of a fully qualified name.  I had wrongly assumed that “.” meant the same directory as the mqclient.ini file.

The moral of this user error is to

  • use a fully qualified directory instead of “.” (see the example after this list)
  • or omit this parameter, and put the ccdt.json in the default location /var/mqm – but you may not have write access to this directory.
  • and go to the bottom of the class.
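For example, the stanza with a fully qualified directory would look like this (the directory name here is just an illustration – use wherever you keep your ccdt.json):

ChannelDefinitionDirectory=/home/colinpaice/scripts
ChannelDefinitionFile=ccdt.json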

Do not panic so much if you are still running MQ V8

I had a question from a customer about being unable to use the REST API because they were still on MQ V8.

The announced end of support date for V8 was the end of April 2020. This gives them no time to upgrade test systems, pre-production, and production.

Do not panic if you haven’t  upgraded yet – because the page says

This gives you a bit of time to get to MQ 9.1 – don't delay, raise a change request today.

Using Activity Trace to show a picture of which queues and queue managers your application used.

I used the midrange MQ activity trace to show what my simple application (putting a request to a clustered server queue and getting the reply) was doing.  As a proof of concept (200 lines of Python), I produced the following picture.

The output is in .png format.   You can also create it as an HTML image, and have the nodes and links as clickable HTML links.

I've ignored any SYSTEM.* queues, so the SYSTEM.CLUSTER.TRANSMIT.QUEUE does not appear.

The red arrows show the “high level” flow between queue managers at the “architectural”, hand waving level.

  • The application oemput on QMA did a put to a clustered queue CSERVER; there is an instance of the queue on QMB and QMC.   There is a red line from QMA.oemput to the queue CSERVER on QMB and on QMC.
  • The server program, server, running on QMB and QMC, put the reply message to queue CP0000 on queue manager QMA.

The blue arrows show puts to the application-specified queue name – even though this may map to the S.C.T.Q.  There are two blue lines from QMA.oemput because one message went to QMC.CSERVER and another went to QMB.CSERVER.

The yellow lines show the path the message took between queue managers.  The message was put by QMA.oemput to queue CSERVER; under the covers this was put to the SCTQ.  The activity trace record shows the remote queue manager and queue name; the yellow line links them.

The black line shows the get from the local queue.

The green line shows the get from the underlying queue.  For example, if I had a reply queue called CP0000 with a QALIAS of QAREPLY, and the application does a get from QAREPLY, there would be a black line to CP0000 and a green line to QAREPLY.

How did I get this?

I used the midrange activity trace.

On QMA I had in mqat.ini

applicationTrace:
ApplClass=USER # Application type
ApplName=oemp* # Application name (may be wildcarded)
Trace=ON # Activity trace switch for application
ActivityInterval=30 # Time interval between trace messages
ActivityCount=10 # Number of operations between trace msgs
TraceLevel=MEDIUM  # Amount of data traced for each operation
TraceMessageData=0 # Amount of message data traced

I turned on the activity trace using the runmqsc command

ALTER QMGR ACTVTRC(ON)

I ran some workload, and a few seconds later turned the trace off with ALTER QMGR ACTVTRC(OFF).

I processed the trace data into a json file using

/opt/mqm/samp/bin/amqsevt -m QMA -q SYSTEM.ADMIN.TRACE.ACTIVITY.QUEUE -w 1 -o json > aa.json

I captured the trace on QMB, then on QMC, so I had three files aa.json, bb.json, cc.json.  Although I captured these at different times, I could have collected them all at the same time.

jq is a "sed"-like processor for json data.   I used it to process these json files and produce one output file which the Python json support can handle.

jq . --slurp aa.json bb.json cc.json  > all.json

The two small python files are zipped here. AT.

I used the ATJson.py Python script to process the all.json file and extract key data in the following format:

server,COLIN,127.0.0.1,QMC,Put1,CP0000,SYSTEM.CLUSTER.TRANSMIT.QUEUE,QMC,CP0000,QMA, 400

  • server, the name of the application program
  • COLIN, the channel name, or “Local”
  • 127.0.0.1, the IP address, or “Local”
  • QMC, on this queue manager
  • Put1, the verb
  • CP0000, the name of the object used by the application
  • SYSTEM.CLUSTER.TRANSMIT.QUEUE, the queue actually used, under the covers
  • QMC, which queue manager is the SCTQ on
  • CP0000, the destination (remote) queue name
  • QMA, the destination queue manager
  • 400, the number of times this was used, so 400 puts to this queue.

I had another Python program, Process.py, which took this table and used Python graphviz to draw a graph of the contents.  This produces a file with DOT (graph description language) statements, and uses one of the many available programs to render the chart.
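Process.py itself is in the zip file above; as a rough illustration of the approach (the rows, names and colours below are made up, not the real program), the graphviz package makes this easy:

# Sketch: draw application/queue nodes and coloured edges with the graphviz package.
# The rows below are made-up examples of the table produced by ATJson.py.
from graphviz import Digraph

rows = [
    ("QMA.oemput", "QMB.CSERVER", "blue"),  # put to the application-specified queue
    ("QMA.oemput", "QMC.CSERVER", "blue"),
    ("QMB.server", "QMA.CP0000", "red"),    # "high level" reply flow
]

g = Digraph("mqflow", format="png")
for source, target, colour in rows:
    g.node(source)
    g.node(target)
    g.edge(source, target, color=colour)

g.render("mqflow")  # writes the DOT source file "mqflow" and renders mqflow.png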

This shows you what can be done.  It is not a turn-key solution, but I am willing to spend a bit of time making it easier to use so you can automate it.  If so, please send me your Activity Trace data, and I'll see what I can do.

How do I use the mid-range MQ accounting data?

Like many topics I look at, this topic seems much harder than I expected.  I thought about this as I walked the long way to the shops, up a steep hill and down the other side (and came back the short way along the flat).  I wondered if it was because I ignore the small problems (like one queue manager and two queues), and always look at the bigger scale problems, for example with hundreds of queue managers and hundreds of queues.

You have to be careful processing and interpreting the data, as it is easy to give an inaccurate picture – see "Be careful" at the bottom of this post.

Here are some of the things I found out about the data.

Enabling accounting trace

  • You need to ALTER QMGR ACCTQ(ON) ACCTMQI(ON) before starting your applications (and channels).  Enabling it after a program has started running does not capture the data for existing tasks.  You may want to consider enabling accounting trace, then shutting down and restarting the queue manager to ensure that all channels are capturing data.
  • The interval ACCTINT is a minimum time.  The accounting records are produced after this time has passed and there has been some MQ activity.  If there is no MQ activity, no accounting record is produced, so be careful calculating rates, e.g. messages processed a second.
  • Altering ACCTINT does not take effect until after the accounting record has been produced.  For example if you have ACCTINT(1800) and you change it to ACCTINT(60), the “ACCTINT(1800)” will have to expire, produce the records, then the ACCTINT(60) becomes operative.
  • You can use the queue attribute ACCTQ() to disable the queue accounting information for a queue.

The data

  • The MQI accounting has channel and connection information, the queue accounting does not.  This means you cannot tell the end point from just the queue data.
  • The MQI accounting and queue accounting have a common field, the "connectionId".   It looks like this is made up of the first 12 characters of the queue manager name, and a unique large number (perhaps a time stamp).  If you are using many machines with similar long queue manager names, you may want to use a combined field of machineName.connectionId to make this truly unique.
  • I had an application using a dynamic reply-to queue.  I ran this application 1000 times, so 1000 dynamic queues were used. When the application put to the server and used the dynamic reply queue, the server had a queue record for each dynamic queue.   There were up to 100 queue sections in each queue record, and 11 queue accounting messages for the server (1000 dynamic queues, and one server input queue).  These were produced at the end of the accounting interval, and all had the same header information, connectionId, start time etc. You do not know in advance how many queue records there will be.
  • Compare this to using a clustered queue on a remote queue manager: the server's queue accounting record on the remote system had just two queues, the server input queue and the SYSTEM.CLUSTER.TRANSMIT.QUEUE.
    • The cluster receiver channel on the originator’s queue manager had a queue entry for each dynamic queue.
  • In all my testing the MQI record was produced before the queue accounting record for a program.   If you want to merge the MQI and the queue records, save information from the MQI record in a table, keyed with the connectionId. When the queue records come along, use the same connectionId key to get the connection information and MQI data.
    • You can get rid of the MQI key data from your table when the queue record has fewer than 100 queues.
    • If the queue record has exactly 100 queues, you do not know if it is in the middle of a series or the last of the series.  To prevent a storage leak, you may want to store the time within the table and have a timer periodically delete these entries after a few seconds – or just restart the program once a day.
  • The header part of the data has a “sequenceNumber” field.    This  is usually incremented with every set of records.
  • On the SYSTEM.ADMIN.ACCOUNTING.QUEUE, messages for different program instances can be interleaved, for example client1 MQI, client2 MQI, client1 Queue, client3 MQI, client2 Queue, client1 Queue, client3 Queue.
  • You do not get information about the queue name as used by an application; the record has the queue name as used by the queue manager (which may be the same as what the application used).  For example, if your program uses a QALIAS pointing to a clustered queue, the queue record will have the queue actually used, SYSTEM.CLUSTER.TRANSMIT.QUEUE, not what the application used.
    • You can use the activity trace to get a profile of what queues an application uses, and feed this into your processing
  • You do not get information about topics or subscriptions names.
  • You may want to convert connectionName  from 127.0.0.1 format to a domain name.

Using the data in your enterprise

You need to capture and reduce the data into a usable form.

From a central machine you can use the amqsevt sample over a client channel for each queue manager and output the data in json format.
I used a python script to process this data. For example:

  • Output the data into a file of format yymmdd.hh.machine.queueManager.json.  You can then use a program like jq to take the json data for a particular day (or hour of day) and merge the output from all your queue managers into one stream, for reporting.
    • You get a new file every day/every hour for each queue manager, which allows you to archive or delete old files.
  • Depending on what you want to do with the data, you need to select different fields.  You may not be able to summarise the data by queue name, as you may find that all applications are using clustered queues, and so the queue is reported as SYSTEM.CLUSTER.TRANSMIT.QUEUE.  You might consider:
    • ConnectionName – the IP address where the client came from
    • Channel Name
    • Userid
    • The queue manager the application connected to
    • The queue manager group (see below)
    • The program name
    • Record start time, and record end time
    • The interval of the record – for a client application, this may be 1 second or less.  For a long running server or channel this will be the accounting interval
    • Number of puts that worked.
    • Number of gets that worked
    • Number of gets that failed.
    • Queue name used for puts
    • Queue name used for gets.
  • You can now display interesting things like
    • Over a day, the average number of queue managers used by an application.   Was this just one (a single point of failure), mostly one (an imbalance), or spread across multiple queue managers (good)?
    • Whether an application processed a few messages and ended, and repeated this frequently.  You can review the application and get it to stay connected for longer, avoiding expensive MQCONNs and MQDISCs, and so save CPU.
    • Did an application stay connected to one queue manager all day?  It is good practice to disconnect and reconnect perhaps hourly to spread the connections across the available queue managers.
    • You could charge departments for usage based on userid or program name, and charge on the number or size of messages processed and connections (remembering that an MQCONN and MQDISC are very expensive).
    • You may want to group queue managers, so the queue managers FrontEnd1, FrontEnd2, FrontEnd3 would be grouped as FrontEndALL.   This provides summary information (see the sketch after this list).
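As a sketch of the grouping idea (the names and numbers here are made up):

# Group queue managers such as FrontEnd1, FrontEnd2, FrontEnd3 into FrontEndALL
# and summarise the put counts by group.
from collections import defaultdict

def qmgr_group(qmgr):
    return qmgr.rstrip("0123456789") + "ALL"

records = [
    {"queueMgrName": "FrontEnd1", "puts": 400},
    {"queueMgrName": "FrontEnd2", "puts": 350},
    {"queueMgrName": "FrontEnd3", "puts": 10},  # an imbalance worth a closer look
]

totals = defaultdict(int)
for record in records:
    totals[qmgr_group(record["queueMgrName"])] += record["puts"]

print(dict(totals))  # {'FrontEndALL': 760}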

Be careful

The time interval of an accounting record can vary.  For example, suppose you are collecting data at hh:15 and hh:45, and your business hours are 0900 to 1700.  If you have no traffic after 1700, and the next MQ request is at 0900 the next day, the accounting record will be from 1645 today to 0901 tomorrow.

  • If you are calculating rates, e.g. messages per second, then the rates will be wrong, as the time interval is too long.
  • The start times of an interval may vary from day to day.  Yesterday it may have been 0910 to 0940; today it is 0929 to 0959.  This makes it hard to compare like for like.
  • If you try to partition the data into hourly buckets (0900 to 1000, 1000 to 1100, etc.), a record from 0950 to 1020 is easy: put a third of the numbers in the 0900–1000 bucket and the other two thirds into the 1000–1100 bucket.  But the same logic applied to a record from 1645 today to 0900 tomorrow will spread the data across the night and so give a false picture (see the sketch after this list).
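Here is a sketch of that proportional split, which shows why the overnight record gives a false picture (the times and counts are made up):

# Split one accounting record's message count across hourly buckets, in proportion
# to how much of the record's interval falls in each hour.
from datetime import datetime, timedelta

def apportion(start, end, count):
    buckets = {}
    total_seconds = (end - start).total_seconds()
    t = start.replace(minute=0, second=0, microsecond=0)
    while t < end:
        next_hour = t + timedelta(hours=1)
        overlap = (min(end, next_hour) - max(start, t)).total_seconds()
        if overlap > 0:
            buckets[t.strftime("%Y-%m-%d %H:00")] = round(count * overlap / total_seconds)
        t = next_hour
    return buckets

# 0950 to 1020: one third in the 09:00 bucket, two thirds in the 10:00 bucket - fine.
print(apportion(datetime(2020, 4, 17, 9, 50), datetime(2020, 4, 17, 10, 20), 300))
# 1645 today to 0900 tomorrow: the same logic smears the data across the night.
print(apportion(datetime(2020, 4, 17, 16, 45), datetime(2020, 4, 18, 9, 0), 300))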

You may want to set your accounting interval to 30 minutes, and restart your servers at 0830, so they are recording on the hour.  At the end of the day, shut down and restart your servers to capture all of the data in the correct time interval bucket.

When is mid-range accounting information produced?

I was using the mid-range accounting information to draw graphs of usage, and I found I was missing some data.

There is a "Collect Accounting" time for every queue, every ACCTINT seconds (default 1800 seconds = 30 minutes).  After this time, any MQ activity will cause the accounting record to be produced.  This does not mean you get records every half hour, as you do on z/OS; it means you get records with a minimum interval of 30 minutes for long running tasks.

Setup

I had a server which got from its input queue and put a reply message to the reply-to-queue.

Once a minute an application started which put messages to this server, got the replies, and ended.

When are the records produced?

Accounting data is produced (if collecting is enabled) when:

  • an MQDISC is issued, either explicitly or implicitly
  • for long running tasks the accounting record(s) seem to be produced when the current time is past the "Collect Accounting" time and there has been some MQ activity. For example, there were accounting records for a server at the following times:
    • The queue manager was started at 12:35:51, and the server started soon afterwards
    • 12:36:04 to 13:06:33.   An application put a message to the server queue and got the response back.   This is 27 seconds after the half hour
    • 13:06:33 to 13:36:42  The application had been putting messages to the server and getting the responses back.   This is 6 seconds after the half hour
    • 13:36:42 to 14:29:48 – this interval is about 53 minutes.  The server did no work from 14:00 to 14:29:48 (as I was having my lunch).  At 14:29:48 a message arrived, and the accounting record was written for the server.
    • 14:29:48 to 15:00:27 – during this time messages were being processed; the interval is just over 30 minutes.

What does this mean?

  • If you want accounting data with an interval “on the half hour”, you need to start your queue manager “just before the half hour”.
  • Data may not be in the time period you expect.  If you have an accounting record produced at 16:45, the data collected between 16:45 and 17:14 may not appear until the first message is processed the next day. The record has an interval from 16:45 to 09:00:01 the next day.  You may not want to display averages if the interval is longer than 45 minutes.
  • You may want to stop and restart the servers every night to have the accounting data in the correct day.

 

My mid-range accounting is missing data

Like many of the questions I get asked, it started off as a simple question.

"Over a day, what people and machines are using this queue?"  Simple, I said.  Just use the accounting data.

Problem 1

I had a program which puts a message to a clustered queue on a remote queue manager.  The message flows to the server and a reply is sent back.  I thought I would check that it worked.  It didn't.

The accounting data has my program, but it recorded a message put to the SYSTEM.CLUSTER.TRANSMIT.QUEUE, and a message got from my reply queue, so I cannot tell what queue was used – but I can tell how many bytes were processed.

Problem 2

My application doing just a publish did not show up in the accounting data.

I used the samples amqspub and amqssub to publish and subscribe respectively.

The accounting data had a record for the amqssub, getting from a queue, but there was no record for the publishing application.

As Morag kindly pointed out, you can tell how many publishes you did from the MQI accounting data, but not to which topic.

I've started looking at the activity trace, but this looks like a heavyweight solution to the simple question.

Problem 3

Someone else pointed out that if you want to know the channel name, and IP address of the application using a queue, you have to do some extra work.

  • Collect the MQI accounting
  • Process queue accounting and MQI accounting, and use the common field connectionId to merge the data

Using amqsevt to display the accounting data in json format you get

"eventData" : {
"queueMgrName" : "QMC",
"startDate" : "2020-04-17",
"startTime" : "09.43.23",
"endDate" : "2020-04-17",
"endTime" : "10.18.58",
"commandLevel" : 913,
"connectionId" : "414D5143514D432020202020202020201361995E0A24A923",
"sequenceNumber" : 1,
"applName" : "server",
"processId" : 29103,
"threadId" : 11,
"userIdentifier" : "colinpaice",
   "connDate" : "2020-04-17",
   "connTime" : "09.12.16",
   "connectionName" : "127.0.0.1",
   "channelName" : "COLIN",

Where the indented fields at the end (connDate, connTime, connectionName, and channelName) are the data only in the MQI accounting information.

It looks like the MQI accounting record is produced before the Queue accounting record.

You will need to save the connectionId key with the connection-specific information.  When the queue accounting record is processed, use the connectionId to look up the information and append it to the queue record.  If many queues are being used, you may get multiple queue accounting records – so you cannot simply delete the connection key and connection-specific information.
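As a sketch of that merge (the field names are the ones shown in the json above; the housekeeping of the saved entries is simplified):

# Merge MQI accounting records with queue accounting records on connectionId.
# Each record here is a dict like the "eventData" object shown above.
connection_info = {}  # connectionId -> fields only present in the MQI record

def on_mqi_record(event):
    key = event["connectionId"]
    connection_info[key] = {field: event[field]
                            for field in ("connectionName", "channelName", "connDate", "connTime")
                            if field in event}

def on_queue_record(event):
    merged = dict(event)
    merged.update(connection_info.get(event["connectionId"], {}))
    # Do not delete the saved entry here: a connection using many queues can
    # produce several queue accounting records with the same connectionId.
    return merged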