When is activity trace enabled?

I found the documentation for activity trace was not clear about when the trace settings take effect.

In mqat.ini you can specify which applications (including channels) you want traced.

For example


This file and trace value are checked when the application connects.  If you have TRACE=ON when the application connects, and you change it to TRACE=OFF, it will continue tracing.

If you have TRACE=OFF specified, and the application connects, changing it to TRACE=ON will not produce any records.


  • TRACE=ON, the application will be traced
  • TRACE=OFF, the application will not be traced
  • TRACE= or omitted, the tracing depends on ALTER QMGR ACTVTRC(ON|OFF).   For a long running transaction, if you use ALTER QMGR to turn it on, and then off, you will get trace records for the application for the period in between.
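The three cases above can be written as a small decision function. This is only a sketch of the rules as I described them. The class, enum and method names are made up for illustration and are not part of MQ.

```java
// Sketch of the rules above; the enum and method names are hypothetical.
class TraceDecision {
    enum AppSetting { ON, OFF, UNSET }

    /** appSetting is the mqat.ini TRACE value seen when the application connected. */
    static boolean isTraced(AppSetting appSetting, boolean qmgrActvtrcOn) {
        switch (appSetting) {
            case ON:  return true;           // traced regardless of the queue manager setting
            case OFF: return false;          // not traced
            default:  return qmgrActvtrcOn;  // TRACE= or omitted: follows ALTER QMGR ACTVTRC
        }
    }

    public static void main(String[] args) {
        System.out.println(isTraced(AppSetting.ON, false));   // true
        System.out.println(isTraced(AppSetting.OFF, true));   // false
        System.out.println(isTraced(AppSetting.UNSET, true)); // true
    }
}
```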

If you have



then program progput will have trace turned on because the definition is more specific.

You could have



to  be able to turn trace on for all programs beginning with prog, but not to trace progzzz.
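As a rough model of "the more specific definition wins", the sketch below picks the matching name pattern with the most non-wildcard characters. That rule, the trailing-wildcard-only matching, and the ON and OFF values are my assumptions for illustration. Check the MQ documentation for the exact matching rules.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical model of choosing the most specific definition; not MQ code.
class MostSpecificMatch {
    /** True if name matches pattern; a trailing '*' is the only wildcard handled here. */
    static boolean matches(String pattern, String name) {
        if (pattern.endsWith("*")) {
            return name.startsWith(pattern.substring(0, pattern.length() - 1));
        }
        return pattern.equals(name);
    }

    /** Returns the setting of the matching pattern with the most literal characters, or null. */
    static String settingFor(Map<String, String> definitions, String name) {
        String best = null;
        int bestLiteral = -1;
        for (Map.Entry<String, String> e : definitions.entrySet()) {
            String pattern = e.getKey();
            int literal = pattern.replace("*", "").length();
            if (matches(pattern, name) && literal > bestLiteral) {
                best = e.getValue();
                bestLiteral = literal;
            }
        }
        return best;
    }

    public static void main(String[] args) {
        Map<String, String> definitions = new LinkedHashMap<>();
        definitions.put("prog*", "ON");     // all programs beginning with prog
        definitions.put("progzzz", "OFF");  // except this one
        System.out.println(settingFor(definitions, "progput")); // ON
        System.out.println(settingFor(definitions, "progzzz")); // OFF
    }
}
```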


Thanks to Morag of MQGEM who got in contact with me, and said long running tasks are notified of a change to the mqat.ini file, if the file has changed and a queue manager attribute has been changed – even if it is changed to the same value.

This and lots of other great info about activity trace (a whole presentation’s worth of information) is available here.

Are all your jms messages persistent?

While debugging my application to see why it was so slow, I found from the MQ activity trace that my replies were all persistent.

The first problem was that by default all jms messages are persistent, so I used

int deliveryMode = message.getJMSDeliveryMode();

to get the persistence of the input message,

and used the obvious code to set the JMSDeliveryMode,

TextMessage response = session.createTextMessage("my reply");
response.setJMSDeliveryMode(deliveryMode);

to set it the same as the input message.  I reran my test and the reply was still persistent.

Eventually I found you need

producer = session.createProducer(dest);
producer.setDeliveryMode(deliveryMode);

And this worked!  It is all explained here.

How do I check?

You can either check your code (bearing in mind that this may be hidden by the productivity tools you use (Spring, Camel etc)), or turn on activity trace for a couple of seconds to check.

What do I need to make my business applications resilient?

In the same way that a three legged seat needs three strong legs,  a business transaction has three legs.  A business transaction needs
  1. An architecture and infrastructure which can provide the resilience
  2. Applications which are well designed, well coded and robust
  3. Operations that can detect problems and automatically take actions to remedy the problems.
If any one is weak, the whole business transaction is not resilient.
For the infrastructure perspective, the question of needing MQ shared queue, MQ midrange or appliance comes down to the requirements of the business and the management of risk.
For your business application you need to understand the impact to your business.  If the application was not available for
  • 1 second
  • 1 minute
  • 1 hour
The cost could be your reputation, rules and regulations of your industry, and financial cost.  For example an outage may cost you 1 million dollars a minute in fines and compensation.  Your reputation could suffer if many people reported problems on Twitter when your service is not available.

Overview of availability options.

  1. Queue sharing groups on z/OS give the highest level of availability, with the highest upfront cost (preventing an outage might be worth that cost, and more and more businesses are using QSGs now)
  2. The data replication features in the appliance and replicated data queue managers (RDQM) are the best ways to achieve high availability of queue managers on distributed. See RDQM for HA, and RDQM for Disaster Recovery.
  3. Multi-instance queue managers, where you have an active and a standby queue manager, and clusters can be useful too.
The applications need to be written to be reliable and resilient, so as to:
  1. Not cause an outage, and use MQ (and other software in the stack) as efficiently as possible.  Many “outages” are caused by badly written applications
  2. Deal well with a problem if one occurs.
  3. Make it easy to diagnose any problems that occur
You need to automate your operations so errors are quickly picked up and actioned.
What availability do your business applications need?
You need to be able to handle planned outages.  These may occur once a week.  You stop work going one route, and so it flows via a different route.  Once “all the pipes are empty” you can shut down.  This should be transparent to the applications.
You need to be able to handle unplanned outages where messages may be in flight in the queue manager and network.  These may occur once a year.  If there is a problem, messages in flight could be stuck on a queue manager until the queue manager is restarted.  Once a problem is detected, new messages should be able to flow via an alternative route.  In this case a few seconds, or minutes worth of messages could be unavailable.
You can use clustering to automatically route traffic over available channels while a problem in one queue manager is being resolved.
Do you have a requirement for serialized transactions where the order of execution must be maintained?  For example trading stocks and shares.  The price of the second request depends on the trade of the first request.   If so, this means you can only have one back end server, no parallelism, and one route to the back end.  This does not provide a robust solution.
How smart are your applications?
If your application gets no reply within 1 second, the application could try resending the request, and it may take a different route through the network, and succeed.  For inquiry transactions, a duplicate request should have little impact.  For update requests, the applications need logic to handle a possible duplicate request, where it detects the request has already been processed, and a negative response is sent back.
The business application may need a program to clear up possible unprocessed, duplicate,  responses and take compensating action.
Having smart applications which are resilient means the infrastructure does not need to be so smart.
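A minimal sketch of the duplicate detection described above, assuming each request carries a unique id. The class and method names are hypothetical. A real application would record the processed ids somewhere persistent, in the same unit of work as the update. An in-memory set is used here only to keep the example short.

```java
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical sketch: apply an update once, reject a resend of the same request.
class DuplicateRequestFilter {
    private final Set<String> processed = ConcurrentHashMap.newKeySet();

    /** Applies the update once per request id; a repeat gets a negative response. */
    String handleUpdate(String requestId, Runnable update) {
        if (!processed.add(requestId)) {
            return "REJECTED: already processed " + requestId;
        }
        update.run();
        return "OK " + requestId;
    }

    public static void main(String[] args) {
        DuplicateRequestFilter filter = new DuplicateRequestFilter();
        int[] balance = {100};
        System.out.println(filter.handleUpdate("req-1", () -> balance[0] -= 10)); // OK req-1
        System.out.println(filter.handleUpdate("req-1", () -> balance[0] -= 10)); // REJECTED: already processed req-1
        System.out.println(balance[0]); // 90
    }
}
```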
Operational maturity
For the best reliability and availability you need a mature operations environment.
The infrastructure is usually reliable.  “Outages” usually occur because of human intervention,  a change, or a bad application.  For example an application can continually try a failing connection, and fill up the MQ error logs.
Examples of operational maturity include
  1. Do not make a change to two critical systems at once, have a day between changes.
  2. Make sure every change has a back-out procedure which has been tested.
  3. You monitor the systems, so you can quickly tell if there is abnormal behavior.
It can take several minutes to detect a problem, shut down, and restart a queue manager (perhaps in a different place).
If you have 100 linux servers to support, it takes a lot of work to make changes on all of these servers (from making a configuration change to applying fixes).  It may be less work on z/OS.
You need to make sure that the infrastructure has sufficient capacity, and a queue manager is not short of CPU, nor has long disk response time.
Below are several configurations:
Shared queue across multiple machines, across sites
A message in a Queue Sharing Group can be processed by any queue manager in a QSG,  providing high availability.
Good for business transactions where
  1. You cannot have messages “paused” for minutes while a server is restarted.
  2. You can tolerate a “pause” of a few seconds if one QM in the QSG goes down, and the channel restarts to a different queue manager in the QSG.
  3. Your applications are not smart.
  4. There is a need for serialized message processing.
  5. The cost of an outage would cover the cost of running z/OS.
Multiple mid-range machines configured across multiple machines across sites (RDQM).  Use of MQ appliance
For business transactions where
  1. Messages can be spread across any of the servers to provide scalability and availability.
  2. If you have a requirement for short response time, you need smart applications which can retry sending the message and handle duplicate requests and responses.
  3. If you can tolerate waiting for in-flight messages whilst a queue manager is restarted, the applications do not need to be so smart.

These mid-range systems can take a minute or so to restart after an outage.

RDQM queue managers are generally better than Multi Instance queue managers. See the performance report here.

Single server
This is a single point of failure, and not suitable for production work.
Your enterprise may have combinations of the above patterns.

You need to consider each business application and evaluate the risk.

For example

  • My applications are not smart. They are running on mid-range with 2 servers.  If I had an unplanned outage which lasted for 5 minutes then with my typical message volumes, this means I could have 6000 requests stuck until the queue manager was restarted.  My management would not be happy with this.
  • If I had an outage on these two servers…  ahh that would be a problem.  I need more servers.
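The arithmetic behind the first example can be sketched as below. The rate of 20 requests a second is not stated above. It is just what 6000 requests in 5 minutes implies. The class and method names are illustrative.

```java
// Back-of-envelope estimate of requests stuck during an outage.
class OutageBacklog {
    /** Requests that arrive, and so may be stuck, during an outage. */
    static long stuckRequests(double requestsPerSecond, long outageSeconds) {
        return Math.round(requestsPerSecond * outageSeconds);
    }

    public static void main(String[] args) {
        // 20 requests a second is the rate implied by 6000 requests in 5 minutes.
        System.out.println(stuckRequests(20.0, 5 * 60)); // 6000
    }
}
```

Plug in your own peak rate and the longest outage you expect, to see what you would have to explain to your management.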

Many thanks to Gwydion of IBM for his comments and suggestions.

Configuring your WebSphere Liberty MDB properly

I found the documentation on how to use and monitor an MDB in a WebSphere Liberty web server environment was not very good.  Some of the documentation is wrong, and some is missing.

I’ll document “how I found it worked”,  in another post I’ll document what the Liberty statistics mean, and how they connect to the configuration.


The application

The application is a simple Message Driven Bean.  When this is deployed you specify the queue manager, and which queue the listener task should get messages from.

There are many “moving parts” that need to have matching configuration.  I’ll try to show which bits must match up.

The application deployment

  1. The java IVTMDB.java program has
    1. onMessage(Message message){..} This method is given the message to process.
    2. ConnectionFactory cf = (ConnectionFactory)ctx.lookup("CF3"); Where CF3 is defined below
  2. Within the WMQ_IVT_MDB.jar
    1. META-INF/ejb-jar.xml has
      1. <ejb-name>WMQ_IVT_MDB_EJBNAME</ejb-name>.  This name is used in the Liberty server.xml file.
      2. <ejb-class>ejbs.IVTMDB</ejb-class>. With a ‘.’ in the name.  Within the jar file is ejbs/IVTMDB.class.   This is the java program that gets executed.  If you specify “ejbs/IVTMDB” you get a java exception IllegalName: ejbs/IVTMDB.
      3. <method><ejb-name>WMQ_IVT_MDB</ejb-name> <method-name>onMessage</method-name> This is the method within the java program which gets the message.  The program has public void onMessage(Message message)
    2. META-INF/MANIFEST.MF This is usually generated automatically
    3. ejbs/IVTMDB.class the actual class file to be used.  This is what was described in the <ejb-class> above.
    4. Other files which may add configuration information for specific web servers.
  3. Within the CCP.ear file
    1. The WMQ_IVT_MDB.jar file described above
    2. META-INF/MANIFEST.MF.   This gets created if one does not exist.
  4. The .ear file is copied to ~/wlp/usr/servers/test/dropins/

The server.xml file for the Liberty instance has

<jmsActivationSpec id="CCP/WMQ_IVT_MDB/WMQ_IVT_MDB_EJBNAME">
  <authData id="auth1" user="colinpaice" password="ret1red"/>
</jmsActivationSpec>
<jmsQueue id="AAAA" jndiName="IVTQueue">
  <properties.wmqJms baseQueueName="IVTQueue"/>
</jmsQueue>

<jmsConnectionFactory jndiName="CF3" id="CF3ID">
  <connectionManager maxPoolSize="6" connectionTimeout="7s"/>
  <properties.wmqJms queueManager="QMA"/>
</jmsConnectionFactory>



  • <jmsActivationSpec> defines the application to the web server.  See  here  for the definition of the content of the jmsActivationSpec.
    • id is composed of
      • CCP is the name of the .ear file
      • WMQ_IVT_MDB is the name of the .jar file
      •  WMQ_IVT_MDB_EJBNAME is the name in the <ejb-name> within the ejb-jar.xml file.
    • The destinationRef=”AAAA” connects the jmsActivationSpec to the queue name IVTQueue, see jmsQueue below.
    • transportType, channel, hostName, port define how the program connects to the queue manager.  The other choice is transportType=”BINDINGS”.
    • clientID I could not see where this is used.
    • applicationName is only used when transportType=CLIENT.  If you use runmqsc to display the connection, it will have this name if a client connection is used.
    • maxPoolDepth this is the number of instances of your program. If you use runmqsc DIS QSTATUS() the number of IPPROCS can be up to the maxPoolDepth+1.
    • poolTimeout  see below.
    • queueManager is used when transportType=”BINDINGS”.
    • <authData…> is the userid to be used.
  • </jmsActivationSpec> is the end of the definition.
  • <jmsQueue..> defines a queue.
    • id=…  matches the jmsActivationSpec destinationRef= entry above.
    • jndiName the specified value can be used in an application to look up the queue name.
  • </jmsQueue> defines the end of the queue definition
  • <jmsConnectionFactory.. > defines how the program connects to the queue manager
    • jndiName=”CF3”.  The application issued ConnectionFactory cf = (ConnectionFactory)ctx.lookup(“CF3”) which does a JNDI lookup of CF3
    • <connectionManager> defines the connection properties
      • maxPoolSize=”6”  This means that at most 6 (onMessage) application instances can get a connection.  If there are 10 instances running, 6 can get a connection and run, 4 will have to wait.
      • connectionTimeout=”7s”  This is meant to say the pool can be shrunk if connections are not used, and not used for 7 seconds.   This allows connections to be freed up.



How do I configure the numbers?

With the definition <jmsActivationSpec … <properties.wmqJms  maxPoolDepth=”50”… then up to 50 threads can have the queue open, and be getting messages.  Each listener which has got a message will pass the message to the onMessage() method of your application.  Typically the application connects to the queue manager and puts a reply back to the originator.

This means that the connection pool used by the application (CF3 in my case) needs at least as many connections as there are listeners (the jmsActivationSpec maxPoolDepth).  The application will wait if there are no connections available.   Liberty provides some basic statistics on the number of connections used, and the number of requests that had to wait.

If you have more than one application using the connection pool, then you need to size the pool for all the potential applications.
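A rough sizing check, assuming the worst case where every listener thread of every MDB sharing the pool wants a connection at the same time. The class and method names are illustrative. The numbers are the maxPoolDepth and maxPoolSize values used above.

```java
// Hypothetical sizing check for a connection pool shared by MDB listeners.
class PoolSizing {
    /** Connections needed if every listener thread of every MDB replies at the same time. */
    static int requiredConnections(int... maxPoolDepths) {
        int total = 0;
        for (int depth : maxPoolDepths) {
            total += depth;
        }
        return total;
    }

    /** Threads that may have to wait for a connection at peak. */
    static int possibleWaiters(int maxPoolSize, int... maxPoolDepths) {
        return Math.max(0, requiredConnections(maxPoolDepths) - maxPoolSize);
    }

    public static void main(String[] args) {
        // One MDB with maxPoolDepth=50 using a pool with maxPoolSize=6.
        System.out.println(possibleWaiters(6, 50)); // 44
        // Ten busy instances against the same pool: 6 run, 4 wait.
        System.out.println(possibleWaiters(6, 10)); // 4
    }
}
```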

I could not find any Liberty statistics as to the number of instances with the input queue open, so you will need to issue the runmqsc DIS QSTATUS(..) and display the number of IPPROCS.

You can change the server.xml configuration to change the connection properties (such as making the maxPoolDepth larger).   This causes all existing instances to stop, and restart, which, in a busy system can cause a short blip in your throughput.

When connections are not used for a period, they can be freed.  See Using JMS connection pooling with WebSphere Application Server and WebSphere MQ, Part 1 and Part 2.

Unused connections move from the connectionPool to an mqjms holding pool.  Periodically this pool is purged.  After running a workload, I could see from the application trace that some MQDISCs were done 3 minutes + afterwards.

Tuning the inbound “connection pool”.

For the jmsActivationSpec there is no connectionPool as such. There is an internal inbound connectionPool for all MDB listeners.  The maxPoolDepth limits how many connections can be used by the listeners. Every 300 seconds a task wakes up and checks all the “inbound” connections.  If a connection has not been used for the poolTimeOut duration, then the connection is released.

If you specify a poolTimeOut of 1 second, then the connections could be released after 1 to 301 seconds.  This behaviour means that when the task wakes up, you may have many connections released (MQDISC).  You may want to set the poolTimeOut to 300 seconds so some connections are released when the task runs, and the remainder are released the next time the task runs, to spread the load.
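The release timing can be modelled as below. This is a sketch, assuming the task wakes at 300, 600, 900… seconds and works in whole seconds. The class and method names are made up. This is not the resource adapter's code.

```java
// Hypothetical model of when an idle inbound connection gets released.
class InboundPoolReaper {
    static final long WAKE_INTERVAL_SECONDS = 300;

    /**
     * Time at which an idle connection is released, assuming the task wakes at
     * 300, 600, 900... seconds and releases connections idle for at least poolTimeout.
     */
    static long releaseTime(long lastUsedSeconds, long poolTimeoutSeconds) {
        long earliest = lastUsedSeconds + poolTimeoutSeconds;
        // Round up to the next wake-up of the task.
        long wakes = (earliest + WAKE_INTERVAL_SECONDS - 1) / WAKE_INTERVAL_SECONDS;
        return Math.max(1, wakes) * WAKE_INTERVAL_SECONDS;
    }

    public static void main(String[] args) {
        System.out.println(releaseTime(299, 1) - 299);  // 1: last used just before the task woke
        System.out.println(releaseTime(300, 1) - 300);  // 300: just missed the wake-up
        System.out.println(releaseTime(10, 300) - 10);  // 590: idle across two wake-ups
    }
}
```

So a connection stays around for somewhere between poolTimeOut and about poolTimeOut + 300 seconds after its last use.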

If the poolTimeOut is too small you may get a lot of MQCONN, MQDISC activity.  By using a longer value of poolTimeOut you may avoid this behaviour, so the listeners connect at the start of the day, stay connected most of the day, and disconnect at the end of the day.

You can use maxPoolDepth to throttle the work being processed.  If the number is too small, work will be delayed.  If the number is too large, you may get a spike in activity.  If you use DIS QSTATUS(‘queuename’) you will see the number of threads with the queue open for input and the current depth.  Vary the maxPoolDepth till you get the best balance.

How long will it take my queue manager to fail over and restart on midrange?

Following on from my blog post and making sure your file systems are part of a consistency group – so the data is consistent after a restart, the next question is

“how long will it take to fail over?”.

There are two areas you need to look at

  1. The time to detect an outage.  This can be broken down into the time the active queue manager releases the lock, and the time taken for the release of the lock to be reflected to the standby system.  You need to test and measure this time.  For example you may need to adjust your network configuration.
  2. The time taken to restart the queue manager.  There is an excellent blog post on this from Ant at IBM.
    1. The blog post talks about the rate at which clients can connect to MQ.  Yes MQ can support 10,000 client connections.  But if it takes a significant time to reconnect them all, you may want multiple queue managers, and have parallelism
    2. Avoid deep queues.  In my time at IBM I saw many customers with thousands of messages on a queue with an age over 1 year old!  You need to clean up the queue.  Your applications team should have a process that runs perhaps once a day which cleans up all old messages.  For example there was a getting application instance which timed out and terminated, then the reply arrived.
    3. During normal running most of the IO is write – and this often goes into cache, so very fast.   During recovery the IO is reading from disk which might be from rotating disks rather than solid state.

One lesson from this is you need to test the recovery.   You need to test a realistic scenario – take the worst case of number of connections, with peak throughput, and then pull the plug on the active system.

Another lesson is you need to do this regularly – for example monthly as configurations can change.

Are your mirrored file systems consistent?

It started with a question “Several years ago you told us about checking your MQ disks are consistent,  can you provide us with a link to any documentation please?”.

I’ll explain why this is important and what you need to do to ensure you have data integrity and you do not lose data integrity when you go to a backup site.

With some applications that write to multiple files, the order that data is actually written to the disk does not matter.  For example when you print data, it often stays in a buffer, and is written out when the buffer is full.

A transaction manager

With programs that handle transactions (a transaction manager) it is critical that writes to disk are done in the order they are issued.  If the writes are not in the correct order, then if the system crashes and tries to restore the transaction, the recovery may be missing key data  (“it has taken the money from your account..  it cannot see who should get the money?”) and so data integrity is lost.

With local disks, the sequence is

  • Write to file1,
  • Wait for confirmation that the IO has completed
  • Write to file2,
  • Wait for confirmation that the IO has completed
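In Java the local sequence could look like the sketch below, where FileChannel.force() is the "wait for confirmation" step. The file names and record contents are made up for illustration. This is not how the queue manager itself does its IO.

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

// Sketch of "write, wait for confirmation, then write the next file".
class OrderedWrites {
    /** Appends the record and returns only after the data has been forced to the device. */
    static void writeAndWait(Path file, String record) throws IOException {
        try (FileChannel channel = FileChannel.open(file,
                StandardOpenOption.CREATE, StandardOpenOption.WRITE, StandardOpenOption.APPEND)) {
            ByteBuffer buffer = ByteBuffer.wrap(record.getBytes(StandardCharsets.UTF_8));
            while (buffer.hasRemaining()) {
                channel.write(buffer);
            }
            channel.force(true); // wait for confirmation that the IO has completed
        }
    }

    public static void main(String[] args) throws IOException {
        Path dir = Files.createTempDirectory("ordered-writes");
        Path log = dir.resolve("file1.log");   // stands in for the transaction log
        Path data = dir.resolve("file2.dat");  // stands in for the queue data
        writeAndWait(log, "debit account A\n");   // file1 is safely on disk...
        writeAndWait(data, "credit account B\n"); // ...before file2 is written
        System.out.println(Files.readAllLines(log) + " " + Files.readAllLines(data));
    }
}
```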

Consider the case where file1 and file 2 are on different file systems.  For example file1 could be transaction log, file2 could be queue data.  (Picture file system1 on slow disks, and file system 2 on fast disks – so IO for file 2 is faster than IO to file 1).

With mirrored disks with synchronous replication, the sequence is

  • Write to file1 local copy; send data to remote site,  write to file1, send back OK when completed
  • Wait for confirmation that both IOs have completed
  • Write to file2 local copy; send data to remote site,  write to file2, send back OK when completed
  • Wait for confirmation that both IOs have completed

With synchronous replication the two locations need to be within 10s of kilometers.  The response time of the file write depends on the distance.

With Asynchronous replication the two locations can be 100s of kilometers apart.

In this case the sequence is

  • Write to file1 local copy; send data to remote site,  write to file1, send back OK when completed
  • Wait for confirmation that the local IO has completed.
  • Write to file2 local copy; send data to remote site,  write to file2, send back OK when completed
  • Wait for confirmation that the local IO has completed.

The disk subsystem manages the responses coming back from the remote end.

For capacity reasons there are usually multiple paths between the two sites.  It is possible that the data for file 2 gets there before the data for file1.  If the writes are done in the wrong order, this could be bad news.

Consistency group

The architecture of the mirroring systems has the concept of a consistency group.   You define one or more consistency groups.  You put file systems into a consistency group.  For any files in the consistency group the write order will be honoured.  So in the case above, if the two files are in the same consistency group, it will wait, write the data to file 1, then write to file 2.  This gives a solution with data integrity.

The lurking problem.

Someone needs to define the file systems to each consistency group.   The storage manager may have said

  • “all file systems are part of one consistency group”.
  • “production data is in one consistency group, test data is in another consistency group”
  • “I’ll guess, and hope people tell me their requirements”

How will I know if I have a problem?

The sure fire way of finding out if you have a problem is to lose a site (for example a power outage).  For 99 times out of 100 it may be fine, and then one time in a hundred, you find you cannot restart your systems on the other site.  This is clearly the wrong time to find out.

Check with your storage administrator and give them information about the file systems that need to be part of the same consistency group.

Practice your fail over – perhaps weekly – at least monthly.

Using webLogic web server and the IBM MQ resource adapter.

I’ve documented the steps I took to configure webLogic to use the IBM MQ Resource Adapter, so I could use connection pools and avoid 1 million MQCONNs a day.
The document grew and grew, and I found it easiest to create a PDF document, here.

The document covers

  • Getting the IBM Resource Adapter from IBM
  • Adding the webLogic specific file into the Resource Adapter.
  • Deploying the resource adapter
  • Adding additional configurations to the Resource Adapter
  • Changing your MDB deployment to use the Resource Adapter instead of native JMS
  • How you tell if a connectionFactory is using the Resource Adapter and not native JMS.

I provide a sample weblogic-ra file to get you started.

Making MQ swing with Spring. Tuning Camel and Spring (to get rid of the 1 second wait in every message).

Spring is a framework which sits on top of Apache Camel, which runs on Oracle WebLogic web server.   It simplifies writing java applications for processing messages.

I was involved in tracking down some performance problems when the round trip time for a simple application was over 1 second – coming from a z/OS background I thought 50 milliseconds was a long time – so over 1 second is an age!

The application is basically a Message Driven Bean.  It does

  1. Listening applications get a message from a queue.  There  can be multiple listening threads
  2. Get an application thread
  3. Pass it via an internal queue to the application running on another thread – the listening thread is blocked while the application is running.
  4. This application sends (puts) a message to an MQ queue for the backend server and waits for the response.
  5. Return to the listening application
  6. Free the application thread.

As this is essentially the same as a classic MDB, we had configured the number of application threads in the thread pool to be the same as the number of listener threads.

Shortage of threads

The symptoms of the problem looked like a shortage of threads problem.

When we increased the number of threads in the application pool (we gave it 4 times the number of listener threads) the response time dropped – good news.  I don’t know how many threads you need – is it n+1 or 2*n?  I’ll leave finding the right number as an exercise for the reader!

The hard coded 1 second wait before get

One symptom we saw was the queue depth on the replyTo queue on the server was deeper than normal.

For the reply coming back from the server, I believe there is one thread getting from the queue.

When the reply to queue is not shared

The application thread has sent the request off, and is now waiting.  This getting thread does an MQ destructive GET with wait.  When the message arrives, it looks at the content, and decides which application thread is waiting for the reply, and wakes it up.

When the reply queue is shared between instances

For example you have configured two instances for availability.  The above logic is not used.

Instance1 cannot destructively get a message because the message could be destined for instance2.  Similarly instance2 cannot destructively get the message because it could be destined for instance1.

One way to solve this would be to do a get next browse of the message, and if it is for the instance do a get_message_under_cursor.  This works great for MQ, but not other message providers which do not have this capability.

The implementation used  is to use polling!

If there are 3 application tasks waiting for replies (reply1, reply3, reply4), the logic is as follows

  1. For each reply id, use MQ message selectors to try to get reply1, reply3, reply4.   This is not a get by messageID or correlID – for which the queue manager has indexes to quickly locate the message, this is a get with message selector, which means look in every message on the queue.
  2. For any messages found – pass them to the waiting applications
  3. Do an MQGET with a message selector which does not exist – with a wait time of receiveTimeout (defaults to 1 second).  Every message is scanned looking for the message selector string, and it is not found.

Looking at a time line – in seconds and tenths of seconds
0.1 send request1, wait for reply
0.2 getting task does MQGET  wait for 1 second with non-existent selector
0.2 send request 2, wait for reply
0.3 reply1 comes back
0.4 send request3, wait for reply
0.5 reply2 comes back
0.6 reply3 comes back
nothing happens
1.2 getting task waiting for 1 second times out
1.2 getting task gets reply1 and posts task1
1.2 getting task gets reply2 and posts task2
1.3 getting task gets reply3 and posts task3

So although  reply1 was back within 0.2 seconds (0.3 – 0.1) it was not got until time 1.2.  The message had been waiting on the queue for 0.9 seconds.

  • Total wait time for reply1 was  1.1 seconds.
  • Total wait time for reply2 was  1.2 – 0.2 = 1.0 seconds
  • Total wait time for reply3  was 1.3 – 0.4 = 0.9 seconds

Wow – what a response time killer!

You can tune this time by specifying receiveTimeout.  If you make it smaller, the wait time will be shorter, so messages will be processed faster, but the CPU cost will go up as more empty gets are being done.
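The time line above can be reproduced with a small model. It assumes replies are only collected when a wait of receiveTimeout ends, and it ignores the time taken by each get (which is why the time line shows reply3 being got at 1.3 rather than 1.2). Times are in tenths of a second, and the class and method names are illustrative.

```java
// Hypothetical model of the polling behaviour; times are in tenths of a second.
class ReplyPollingTimeline {
    /**
     * Time at which a reply is picked up, assuming replies are only collected
     * when a wait of receiveTimeout ends. pollStart is when the first wait began.
     */
    static int pickupTime(int pollStart, int receiveTimeout, int replyArrival) {
        int pickup = pollStart + receiveTimeout;
        while (pickup < replyArrival) {
            pickup += receiveTimeout;
        }
        return pickup;
    }

    public static void main(String[] args) {
        int pollStart = 2;          // 0.2: MQGET with wait starts
        int receiveTimeout = 10;    // 1 second
        int[] sent = {1, 2, 4};     // request1..3 sent at 0.1, 0.2, 0.4
        int[] arrived = {3, 5, 6};  // reply1..3 arrive at 0.3, 0.5, 0.6
        for (int i = 0; i < sent.length; i++) {
            int pickup = pickupTime(pollStart, receiveTimeout, arrived[i]);
            System.out.println("reply" + (i + 1) + " on queue for " + (pickup - arrived[i])
                    + " tenths, round trip " + (pickup - sent[i]) + " tenths");
        }
    }
}
```

For reply1 this gives 9 tenths on the queue and a round trip of 11 tenths, matching the 0.9 and 1.1 seconds above. Changing receiveTimeout to 1 (0.1 seconds) in this model gets reply1 picked up at 0.3, as soon as it arrives.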

This solution does not scale.

You have had a slow down, and there are now 1000 messages on this queue. (990 of these are unprocessed messages, due to timeouts.  There is no task waiting for them – they never get processed – nor do they expire!)

  • MQGET for reply1.  This scans all 1000 messages – looking in each message for the message with the matching selector.  This takes 0.2 seconds.
  • MQGET for reply2. This scans all 1000 messages – looking in each message for the message with the matching selector.  This takes 0.2 seconds.
  • You have 10 threads waiting for messages, so each message has to wait for typically 10 * 0.2 seconds = 2 seconds a message!
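The same arithmetic as a sketch, using the numbers above (10 waiting replies, 1000 messages on the queue, 0.2 seconds a scan). The class and method names are illustrative.

```java
// Back-of-envelope cost of one polling pass when every get scans the whole queue.
class SelectorScanCost {
    /** Messages examined in one polling pass: every selector get looks at every message. */
    static long messagesExamined(int waitingReplies, int queueDepth) {
        return (long) waitingReplies * queueDepth;
    }

    /** Milliseconds for one polling pass, given the time for one scan of the queue. */
    static long passMillis(int waitingReplies, long millisPerScan) {
        return waitingReplies * millisPerScan;
    }

    public static void main(String[] args) {
        System.out.println(messagesExamined(10, 1000)); // 10000
        System.out.println(passMillis(10, 200));        // 2000
    }
}
```

Both numbers grow with the queue depth and with the number of waiting threads, which is why a slow down feeds on itself.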


What can you do about it?

See the Camel documentation Request-Reply over JMS, parameters, Exclusive, concurrentConsumers, and Request-Reply over JMS Using an Exclusive Fixed Reply Queue 

  1. Avoid sharing the queue.  Give each instance its own queue.  Set the Exclusive flag.
  2. Tune the receiveTimeout – making it a shorter time interval can increase the CPU as you are doing more empty gets.  You might want to set it to a value which is the 95th percentile of the time between the send to the server and the reply coming back.  So if the average time is 40 ms, set it to 60 ms or 80 ms.
  3. If you are going to share the queue, make sure you clean out old messages from the queue – for example use expiry, or have a process which periodically scans the queue and moves messages older than 1 minute.
  4. Did I mention avoid sharing the queue.
  5. If you get into a vicious spiral where the response time gets longer and longer, and the reply queue from the server gets deeper – be brave and purge the server reply queue.
  6. Avoid sharing the queue.
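To pick a receiveTimeout from measurements, you could take the 95th percentile of your observed send-to-reply times. Below is a minimal sketch using the nearest-rank method. The sample values are made up, and the class and method names are illustrative.

```java
import java.util.Arrays;

// Hypothetical helper for choosing a receiveTimeout from measured round-trip times.
class ReceiveTimeoutTuning {
    /** Nearest-rank percentile of the observed round-trip times (non-empty), in milliseconds. */
    static long percentile(long[] roundTripMillis, double percent) {
        long[] sorted = roundTripMillis.clone();
        Arrays.sort(sorted);
        int rank = (int) Math.ceil(percent / 100.0 * sorted.length);
        return sorted[Math.max(rank, 1) - 1];
    }

    public static void main(String[] args) {
        // Made-up measurements of send-to-reply time in milliseconds.
        long[] samples = {35, 38, 40, 41, 39, 42, 37, 36, 44, 80};
        System.out.println(percentile(samples, 95)); // 80
        System.out.println(percentile(samples, 50)); // 39
    }
}
```

With only a few samples one slow reply dominates the 95th percentile, so measure over a realistic period before you change the setting.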