Do not do unnatural things with clustering.

I’ll cover an interesting clustering scenario, and discuss how it could be improved, but first I’d like to mention my grandfather’s axe. I still have it. My father replaced the head, and I replaced the handle – but it is still my grandfather’s axe.

I was looking at a customer’s configuration, and was told “this is the original architecture”. Except they had replaced this part with a cluster, and restructured those applications into a different cluster – but it is still their original configuration, and of course the picture is ten times the size of when they started with MQ.


The simplified picture has a blue cluster and a yellow cluster, and the full repository acts for both clusters.

An application attached to QMA sent a message to QMB, using a clustered remote queue (QREMOTE) defined in the full repository (FR). This mapped to a clustered queue on QMB. So for the MQPUT, the message flowed to the full repository, was put to the clustered queue, and was then sent on to QMB, where it was processed.

This is not efficient as you get double puts and gets, and more opportunities for breakages. Yes, it is using clustering, but it is not a natural use of clustering.

It would make much more sense to put QMA and QMB in the same cluster and save a lot of CPU. This would also avoid a mess when trying to sort it out.

We had a discussion about the architecture and whether we could change it. The original architect retired 10 years ago, and the chart (singular) describing the architecture and the ideas behind it was lost when a laptop was returned and the hard drive was reformatted.

Quick summary of channels used in clustering

In a cluster there are three types of cluster channels

  1. The cluster receiver – this is defined for a queue manager to provide a template for other queue managers to connect to it.
  2. The cluster sender – which connects to the full repository. You do not need to connect to all the full repositories as the definitions for the other full repositories will flow down.
  3. Automatically defined channels between two queue managers. For queue manager QMA to create a channel to QMB, it uses the cluster receiver channel definition which QMB sent to the full repository.
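The first two channel types could be sketched in MQSC like this (the channel, cluster, and host names here are made up for illustration):

```
* On QMA: the cluster receiver, the template other queue managers
* use to build a channel to QMA
DEFINE CHANNEL(TO.QMA) CHLTYPE(CLUSRCVR) TRPTYPE(TCP) +
       CONNAME('qma.example.com(1414)') CLUSTER(BLUE)

* On QMA: a cluster sender to one full repository; definitions for
* the other full repositories flow down from there
DEFINE CHANNEL(TO.FR) CHLTYPE(CLUSSDR) TRPTYPE(TCP) +
       CONNAME('fr.example.com(1414)') CLUSTER(BLUE)
```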

Are there any advantages in keeping the existing configuration?

I cannot think of a very good reason for it. I can think of reasons for which this strange configuration is valid – but they still feel wrong!

  1. Before clustering, some people had bad experiences of connecting a queue manager to all other queue managers, and the nightmare of managing those connections. Clustering solved the definitional problem: you only have to define two channels per queue manager, not hundreds or thousands. When clustering is used, channels between queue managers are created dynamically and started as needed. You may get hundreds of channels started, but you do not have to define them. With the overlapping clusters in the picture, you limit the number of channels being started, and force a “hub and spoke” rather than the direct links you get with clustering. With a good automation package, you should be able to automate the management of the channels, collect performance data, and so on.
  2. Number of connections. If you have a large MQ estate, for example 100 queue managers at the back end, you may have more than 100 cluster channels active. This should not be a problem; you may just have to configure your queue managers to handle more connections. (If there were 10,000 connections we would have a different discussion.)
  3. Capacity. QMA and QMB may not have the capacity to store a large number of messages, so using the full repository, with space for deep queues, may be a solution. (But remember: a good queue is an almost empty queue.)
  4. Security. By having a channel exit on the full repository, you can check the data and authorization. If the control data is on the full repository system, it may be hard to put the exits on the other queue managers. I think you should review the architecture, and look at caching security data on the queue manager machines.
  5. Message logging. This could be duplicating a message, or updating a database with message content. It feels like the architecture is wrong. I think a better architecture would be to do two puts in the original application, or an MQPUT and a remote DB2 insert – but this could affect performance.

How do we fix this?

In principle you just move QMB into the blue cluster, and just remove the QREMOTE definitions from the full repository.

The word that jumps out at me is “just”.

You can change the channel and queue on QMB to use a namelist of both clusters. That is easy, it is the next steps that could cause a hiccup.
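As a sketch in MQSC (the channel, queue, cluster, and namelist names are made up; note that an object’s CLUSTER attribute must be blanked when you switch it to a namelist):

```
DEFINE NAMELIST(BOTH_CLUSTERS) NAMES(BLUE, YELLOW)

ALTER CHANNEL(TO.QMB) CHLTYPE(CLUSRCVR) +
      CLUSTER(' ') CLUSNL(BOTH_CLUSTERS)
ALTER QLOCAL(SERVER_on_QMB) CLUSTER(' ') CLUSNL(BOTH_CLUSTERS)
```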

With asynchronous processing, events can happen at different times. You define a queue over here, and delete a queue from over there, and on a queue manager far, far away these operations get done in the reverse order.

Say the clustered remote queue on the full repository is called SERVER_on_FR, and it points to the queue SERVER_on_QMB, a clustered queue on QMB.
The application attached to QMA does MQOPEN to SERVER_on_FR, and due to the magic of clustering it all works as expected, a message arrives on the SERVER_on_QMB queue.

If you define a clustered QR(SERVER_on_FR) on QMB, pointing to SERVER_on_QMB, there will now be two queues called SERVER_on_FR in the cluster. Both queues may be used, depending on the configuration.

You cannot just delete the QR definition SERVER_on_FR on the FR as there may be messages on cluster transmit queues heading for this queue, and some queue managers may not have seen the updates about the new queue definition. Receiver channels on FR may try putting to the queue to find it gone. (If you get confused, as I did, try reading the section again)

You need to alter the queue on FR to remove it from all clusters (CLUSTER(' ')). Over time (minutes to days) this will propagate to all queue managers, and so queue managers will stop using it. Messages on the cluster transmit queues should all have been processed.

After a suitable interval you can then delete the QR from the FR system.
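Putting the whole sequence together as an MQSC sketch (the object and cluster names are illustrative):

```
* On QMB: make the old name resolve locally, visible in the cluster
DEFINE QREMOTE(SERVER_on_FR) RNAME(SERVER_on_QMB) +
       RQMNAME(QMB) CLUSTER(BLUE)

* On FR: take the old definition out of all clusters,
* and let the change propagate (minutes to days)
ALTER QREMOTE(SERVER_on_FR) CLUSTER(' ')

* On FR, after a suitable interval, once in-flight messages
* have drained from the cluster transmit queues:
DELETE QREMOTE(SERVER_on_FR)
```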

Your troubles are not over: you now have a queue called “SERVER_on_FR” on queue managers other than FR. On QMA you could create a QR called “SERVER_on_FR” which points to SERVER_on_QMB; or (better) change the application to use queue SERVER_on_QMB; or, even better, just use queue name SERVER! But there is a good chance you’ve lost the source for this application.

If you now scale this up to an enterprise you see what a mess this now is.

As a result of doing unnatural things with clustering, you have extra puts and gets, indirect channels, and a mess of queue names – it is much easier to “Keep It Simple Stupid”, and let clustering do what it was designed to do.

Uniform clustering in 9.1.2 gets a tick – and a caution from me.

In MQ 9.1.2 there is a new function called Uniform Clustering, which I thought looked interesting (given my background in performance and real customer usage of MQ).

I’ve had a play with it, and written up what I found.

What is it?

When Uniform Clustering is active and it detects an imbalance in the number of conversations across queue managers, it can send a request to a connected application to request disconnect and reconnect. This happens under the covers, and it means you do not need to write code to handle this.

MQ has supported client reconnect for a few years. In V8.0 you can stop a channel, or use endmqm -r to get the channels to automagically disconnect and reconnect to a different queue manager with no application code.

I would call it conversation balancing with a side effect of workload balancing. It helps solve the problem where one server is getting most of the work and other servers are under utilized.

By having the connections for an application spread across all of the available queue managers, it should spread the workload across the available queue managers, but the workload balancing depends on the spread of work on each connection.

The documentation originally talked about application balancing – which I think was confusing, as it does not balance applications, it balances where the applications connect to.

A good client has the following characteristics

  1. It connects for a long time, and avoids frequent short lived connections.
  2. It periodically disconnects and reconnects, so over time the connections are spread across all servers.
  3. More instances can be started if needed to service the queues. These instances can be spread around the available servers.
  4. Instances can shut down if there is no work for them. For example MQGET wait for 10 minutes and no message arrives.

The Uniform Clustering helps automate the periodic disconnect and reconnect (situation 2 above).

The IBM documentation says it simplifies the administration and set up – I cannot see how this helps, as you have to define the queues and channels anyway – they do not need to be clustered.

The IBM documentation says Uniform Clustering moves reconnection logic from the application to the queue manager. This is true, but production ready applications need to have additional logic in them to support this (see below).

You should not just turn on Uniform Clustering, you need to review your applications to check they can run in this environment. If you just turn it on, it may appear to work; the problems may be subtle, show up at a later date, and also make trouble shooting harder.

How does it work?

Once the queue managers have been set up, they monitor the number of instances of applications connected to the queue manager. If you have two queue managers and have 20 instances of serverprog connected to QMA, and 0 instances connected to QMC, then over time some of the connections to QMA will be told to disconnect and reconnect, some may reconnect to QMA, and some may reconnect to QMC. Over time the number of conversations should balance out across the available queue managers.

Below are some charts showing how this balancing works. I had a number of “server” programs connected as clients. They started, and all sessions connected to QMA. They did not process any messages. From the reports produced by my MQCB program, I could see when application instances were asked to disconnect and reconnect.

The chart below shows the rate of reconnecting for 20 servers connecting as clients to 2 queue managers – doing no work. After 300 seconds there were 10 connections to each queue manager.

The chart below shows the rate of reconnecting for 80 servers connecting as clients to 2 queue managers – doing no work. After 468 seconds there were 40 connections to each queue manager.

We can see that balancing requests are sent out every minute or two. The number of conversations moved depends on how unbalanced the configuration is. The time before the connections were balanced varied from run to run, but the above charts are typical.

What gets balanced.

I had two applications running into my queue managers. If you use DIS CONN(*) APPLTAG, it shows you the names of the programs running.

My client programs had APPLTAG(myclient), my server programs had APPLTAG(serverprog).

The uniform clustering will balance myclient programs as a group, and serverprog programs as a group.

You may have many client programs, for example hundreds of sessions in a web server, and only a few server programs processing the requests from the clients, so they may get balanced at different rates.

This looks like a really useful capability, but you need to be careful.

The MQ reconnection code will open the queue names you were using, and it is transparent to the application.

A thread may get a request to disconnect and reconnect, while the application is processing an MQ request, waiting for a message, or doing other work. For some application patterns this may not matter, for others you may need to take action.

Where’s my reply?

Consider a server application which does MQGET, MQPUT, MQCOMMIT. If the reconnect request happens mid transaction, the work can get backed out, and another application can process it. Great – no problems.

For a client application, these do (MQPUT to server queue, MQCOMMIT), (MQGET wait on reply-to queue, MQCOMMIT). The reconnection request can happen during the MQGET wait. The MQPUT request specified a reply-to queue and reply-to queue manager. If the application gets a reconnect request, it may connect to a different queue manager, so it will not be able to get the reply message (as the message is on the original queue manager).

This problem comes with reconnection support, and has been around for a long time, so most people will have a process in place to handle it. Uniform Clustering makes no difference here – the reconnect just happens without you knowing.

Reporting the wrong queue manager.

Good applications report problems with enough information to identify the problems. For example queue manager name, queue and unexpected return code. If you did MQINQ to find the queue manager name at startup, and if your application instance has been reconnected, the queue manager name may now be wrong.

  1. You can use MQCB to capture and report these queue manager changes, so the reconnects and new queue manager name are written to the application log.
  2. You could issue MQINQ for the queue manager name when you report a problem, but the connection may have moved by the time you report the problem.
  3. You also need to handle which queue manager the MQPUT was done on, as this could be different to where the MQGET completed. This might just be a matter of saving the queue manager name in a MQPUT_QM variable every time you do an MQPUT. You need to do this when tracking down missing messages – you need to know which system the MQPUT was done on.
  4. You could keep the time of the MQPUT, report “Reply not received, MQPUT was put at 12:33:44” and then review the application log (1 above) to see what it was connected to at that time.

What gets balanced

Conversations get balanced. So if you have a channel with 4 shared conversations, (DIS CHS gives CURSHRCNV(4)), you might end up with a channel to QMA with one conversation, a channel to QMB with two conversations and a channel to QMC with one conversation. Some channels may have only one conversation per channel instance.

Are there any new commands?

I could not find any new commands.

Can I turn off this automatic rebalancing?

To put your queue manager in and out of maintenance mode, see here

This is a “challenge” with reconnection, not with Uniform Cluster support. If you change the qm.ini file and remove the


statements, this just means the applications connected to this queue manager will not get told to rebalance. You will still get applications trying to connect to the queue manager.

How do I put a queue manager in and out of maintenance mode when using client reconnect?

You want to do some maintenance on one of your queue managers: stop work coming in to the queue manager, and restart work when the maintenance has finished – without causing operational problems.

Applications using reconnection support can reconnect to an available queue manager. To stop applications connecting to a particular queue manager you need to stop the channel(s): STOP CHANNEL(…) STATUS(STOPPED). An application using the channel will get notified, or reconnected. An application trying to connect will fail, and go somewhere else.

If you have two channels, one for the web server clients and a second for the server application on the queue manager, I don’t think it matters which one you stop first.

  1. If you stop the client channel first, then in-flight messages will go to the server application, be processed, and put on the reply queue. The client will not get the reply, as it has been switched to another queue manager.
  2. If you stop the server channel first, then messages will accumulate on the server queue until the server applications reconnect to the queue manager and process the queue.

In either case you can have orphaned messages on the reply to queue. You need a process to resolve these, or for non persistent message set a message expiry time.

Once you have done your maintenance work, use START CHL(…) for the server channel, wait for a server to connect to the queue manager and then use START CHL(…) for the client channel. It may take minutes for a server application to connect to the queue manager.

Do it in this order as you want the server to be running before client applications put to the server queue, otherwise you will have to handle time out situations from the application.
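As an MQSC sketch of the whole maintenance window (the channel names here are hypothetical):

```
* Drain the queue manager: connected applications are notified
* and reconnect elsewhere; new connection attempts fail over
STOP CHANNEL(APP.SVRCONN) STATUS(STOPPED)
STOP CHANNEL(WEB.SVRCONN) STATUS(STOPPED)

* ... do the maintenance ...

* Restart the server channel first, wait for a server to connect,
* then restart the client channel
START CHANNEL(APP.SVRCONN)
START CHANNEL(WEB.SVRCONN)
```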

Some secrets of shared conversations and other dark corners of MQ

I was looking into how to balance the number of server threads processing messages, and discovered I knew nothing about shared conversations and related topics. Of course I could draw them on a white board and wave my hands around, but I could not actually describe how they work.

Firstly, some things I expect everyone knows (except for me).

  1. You can define a shared connection handle. This can be used in different threads – but only serially. See Shared (thread independent) connections with MQCONNX.
  2. A thread can only connect to MQ once using a non shared connection, otherwise you get MQRC_ALREADY_CONNECTED: “A thread can have no more than one nonshared handle.”
  3. A non shared connection cannot be shared between threads. I got MQRC_HCONN_ERROR: “The handle is a nonshared handle that is being used by a thread that did not create the handle”.

Multi threaded program

I set up a program which did

do i = 1 to number_of_threads;
    pthread_create – run subroutine in a new thread

The subroutine did

MQCONN
MQCB (set up MQCB to get queue manager change events such as reconnect)

Each thread needed its own MQCONN, and its own MQCB to capture queue manager events such as disconnect requests and reconnected events.

DIS CHSTATUS shows conversations spread across channels

My CLNTCONN channel was defined with SHARECNV(10). I started my program and specified 15 threads. DIS CHS(COLIN) gave me two channel instances:

AMQ8417I: Display Channel Status details.

AMQ8417I: Display Channel Status details.

One channel instance had CURrent SHared CoNVersations (CURSHCNV) of 5, the other had 10. 5 + 10 = 15, the number of threads running in my program. With 25 threads, I had three channels active and a total CURSHCNV of 25.

When my program was running, the value of DIS QMSTATUS CONNS increased by 25, the number of threads I had running.

Morag wrote a post on MaxChannels vs DIS QMSTATUS CONNS.

Things that didn’t work

I tried to issue one MQCONN and share the connection between the threads – this did not work, as it gave me MQRC_HCONN_ERROR: a nonshared handle cannot be used by a thread that did not create it.

This error description is not entirely true.

I used an MQCB to get notified about queue manager events. You specify MQCB and pass the hConn. In my MQCB routine, I could issue MQINQ using the same hConn. So I did have the same hConn being used by different threads – but one of these is a special thread.

I tried to use Async Consume, where you use MQCB to specify a message handler program to process the message when a message arrives. You do MQCONN, and then the hConn is used by the asynchronous process. The hConn cannot be used by other MQ API requests or a second Async Get. In my main program I tried to issue 15 MQCONN, and use one hConn for each Async get. I got MQRC_ALREADY_CONNECTED “A thread can have no more than one nonshared handle.”

I solved this with the same technique as above:

do i = 1 to number_of_threads;
    pthread_create – run subroutine in a new thread

subroutine: use Async Consume
    MQCB for queue manager events
    MQCB for Async Consume

I had an email exchange with Morag (thank you) who said

You can have one MQCONN and 15 async getters if you want, if you use the shared handle connection option. (cno.Options … + MQCNO_HANDLE_SHARE_BLOCK)

Only one Async Callback function (and thus one message and application logic) can be processed at a time. One connection equals one channel (or conversation over a channel if you are sharing them – i.e. SHARECNV > 1).
Equally you can have 15 MQCONNs and associate each MQCB with a different hConn.
It all depends what sort of concurrency you want in your application. Do you want parallel processing because your workload is heavy, or do you just want to monitor and process 15 different, lightly used queues in the simplest way possible?
If an hConn is currently in use by one callback call, another will not be invoked until the first callback completes.

So if you have an Async consumer for queue1, and an Async consumer for queue2, and a message arrives on each queue, it will work as follows

  • Async code for queue1 is invoked with the message, it does a database update, and an MQPUT1 to the reply-to queue. This application returns.
  • only after the previous code has returned, can the Async code for queue2 be invoked; which does a database update, and an MQPUT1 to the reply-to queue, and returns.

It is not worth having more than one Async consumer per queue, as you will not get parallel processing. You will get

  • Wait for the previous consumer to finish, do Async consumer 1 for the queue … return;
  • Wait for the previous consumer to finish, do Async consumer 2 for the same queue … return;

You might just as well have one Async consumer per queue.

As Morag said: “It all depends what sort of concurrency you want in your application. Do you want parallel processing because your workload is heavy, or do you just want to monitor and process 15 different, lightly used queues in the simplest way possible?”

With one application and 15 Async consumers set up, DIS CHS(..) gave me CURSHCNV(1).

What does SHARECNV on a svrconn channel do?

On QMA, I changed SHARECNV(10) to SHARECNV(0). When QMA was the only queue manager running, my connections failed with MQRC_ENVIRONMENT_ERROR.

The reason is: an MQ client application that has been configured to use automatic reconnection attempted to connect using a channel defined with SHARECNV(0).

When I had both QMA and QMC running, there was a couple of seconds’ delay during which the threads connected to QMA, got back MQRC_ENVIRONMENT_ERROR, tried to connect to QMC – and succeeded. There were no error messages in /var/mqm/errors/AMQERR01.LOG to tell me there was a problem on QMA.

On QMA, I changed SHARECNV(10) to SHARECNV(1). When QMA was the only queue manager running, I got 10 channel instances of COLIN running, each with CURSHCNV(1), as expected.

I changed the svrconn channel and specified SHARECNV(30), and used 30 threads. I got 3 channel instances each with 10 connections. This was a surprise to me.

This page says “If the CLNTCONN SHARECNV value does not match the SVRCONN SHARECNV value, the lower of the two values is used.”

I was using the CCDT in JSON, and added sharingConversations to the channel entry:

"name": "COLIN",
"sharingConversations": 30,

When I restarted my application and specified 30 threads, I had one channel started, with DIS CHS… giving CURSHCNV(30).

The Knowledge Centre says “Use SHARECNV(1). Use this setting whenever possible. It eliminates contention to use the receiving thread, and your client applications can take advantage of new features.” So although you can make SHARECNV large, a value of 10 or 1 may be best. It is a balance between having more connections, which use more resources, and the impact of sharing a channel on channel throughput.
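To follow that advice on the channel used in these examples, the MQSC would be:

```
ALTER CHANNEL(COLIN) CHLTYPE(SVRCONN) SHARECNV(1)
```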

Uniform Clusters and shared conversations.

I started up 8 threads, and had one channel with 8 conversations on it. I had an MQCB to report when the conversation balancing occurred: that is when a conversation got disconnected and reconnected.

At start up all conversations connected to QMA. Over time, some conversations moved to QMC.

Eventually, I had

  • one channel instance to QMA with CURSHCNV(4) and
  • one channel instance to QMC with CURSHCNV(4)

So even with shared conversations you get balancing across channels.

How to start more servers on midrange

I came upon this question when looking into the new Uniform Clustering support in V9.1.2.

5 years ago, a common pattern was to have a machine, containing a front end web server, MQ, and back end servers (in bindings mode), processing the requests, going to a remote database. For this to do more work, you increase the number of servers, and perhaps add more CPUs to the machine.

These days you have MQ in its own (virtual) machine, and the front end web server in its own (virtual) machine connected to MQ over a client interface, with the server application in its own (virtual) machine connected to MQ over a client interface, and going to a remote database.

To scale this, you add more MQ machines, or more server machines. In my view this solves some administration problems, but introduces more – but that is not today’s discussion.

Given this modern configuration, how do you start enough servers to manage the workload?

Consider the scenario where you have MACHINEMQ with the queue manager on it, and MACHINEA and MACHINEB with the server applications on them.

Having “smarts in the application”

  1. You want enough servers running, but not too many. (Too many can flood the downstream processes, for example cause contention in a database. Using MQ as a throttle can sometimes improve overall throughput).
  2. If a server thread is not doing any work, then shut it down
  3. If there is a backlog then start more instances of the server threads.

In the server application you might have logic like

MQINQ curdepth, ipprocs
if (curdepth > X and the number of processes with the queue open for input (ipprocs) < Y) then
    do_something to start another server instance
if (the MQGET wait timed out and ipprocs > 2) then
    return and free up the session

For CICS on z/OS, it was easy; do_something was “EXEC CICS START TRAN…”

When running on Unix the “do_something” is a bit harder.

My first thoughts were…

It is not easy to create new processes to run more work.

  1. You can use spawn to do this – not very easy or elegant.
  2. I next thought the application instances could create a trigger message, so a trigger monitor could run and start more processes. This has problems:
    1. Unless you are really clever, the trigger monitor starts a process on its local machine. So running a trigger monitor on MACHINEA, would create more processes on MACHINEA.
    2. This means you need a trigger monitor on MACHINEA and MACHINEB.
    3. If you put a trigger message, the message may always go to MACHINEA, always go to MACHINEB, or go to either. This may not help if one machine is overloaded and gets all of the trigger messages.
  3. I thought you could have one process and lots of threads. I played with this, and found out enough to write another blog post. It was difficult to increase the number of threads dynamically. I found it easiest to pass in a value for the number of threads to the application, and not try to dynamically change the number of threads.
  4. The best “do_something” was to produce an event or alert and have automation start the applications. Automation should have access to other information, so you can have rules such as “Pick MACHINEA or MACHINEB which has the lowest CPU usage over the last 5 minutes – and start the application there”

And to make it more complex.

Today’s scenario is to have multiple queue manager machines, for availability and scalability, so now you have to worry about which queue manager you need to connect to, as well as processing the messages on the queue.
MQ 9.1.2 introduced Uniform Clustering which balances the number of client channel connections across queue manager servers, and can, under the covers, tell an application to connect to a different queue manager.

This should make the balancing simpler. Assuming the queue managers are doing equal amounts of work, you should get workload balancing.

Notes on setting up your server.

You need to be careful to define your CCDT with CLNTWGHT. If CLNTWGHT is 0, then the first available queue manager in the list is used, so all your connections would go to that queue manager. By making all CLNTWGHT values greater than 0, you can bias which queue manager gets selected.
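A sketch of a JSON CCDT channel entry with a non-zero weight (the host and names are made up, and the exact layout should be checked against the CCDT schema in the IBM documentation):

```json
{
  "channel": [
    {
      "name": "COLIN",
      "type": "clientConnection",
      "clientConnection": {
        "connection": [ { "host": "qma.example.com", "port": 1414 } ],
        "queueManager": "QMA"
      },
      "connectionManagement": { "clientWeight": 1, "affinity": "none" }
    }
  ]
}
```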

Thanks to Morag for her help in developing this article.