In MQ 91.2. there is a new function called Uniform Clustering, which I thought looked interesting (with my background in performance and real customer usage of MQ).
Ive had a play with it, and written up what I have found.
What is it?
When Uniform Clustering is active and it detects an imbalance in the number of conversations across queue managers, it can send a request to a connected application to request disconnect and reconnect. This happens under the covers, and it means you do not need to write code to handle this.
MQ has supported client reconnect for a few years. In V8.0 you can stop a channel, or use endmqm -r to get the channels to automagically disconnect and reconnect to a different queue manager with no application code.
I would call it conversation balancing with a side effect of workload balancing. It helps solve the problem where one server is getting most of the work and other servers are under utilized.
By having the connections for an application spread across all of the available queue managers, it should spread the workload across the available queue managers, but the workload balancing depends on the spread of work on each connection.
The documentation originally talked about application balancing – which I think was confusing, as is does not balance applications, it balances where the applications connect to.
A good client has the following characteristics
- It connects for a long time, and avoids frequent short lived connections.
- It periodically disconnects and reconnects, so over time the connections are spread across all servers.
- More instances can be started if needed to service the queues. These instances can be spread around the available servers.
- Instances can shut down if there is no work for them. For example MQGET wait for 10 minutes and no message arrives.
The Uniform Clustering helps automate the periodic disconnect and reconnect (situation 2 above).
The IBM documentation says it simplifies the administration and set up – I cannot see how this helps, as you have to define the queues and channels anyway – they do not need to be clustered.
The IBM documentation says Uniform Clustering moves reconnection logic from the application to the queue manager. This is true, but production ready applications need to have additional logic in them to support this (see below).
You should not just turn on Uniform Clustering, you need to review your applications to check they can run in this environment. If you just turn it on, it may appear to work; the problems may be subtle, show up at a later date, and also make trouble shooting harder.
How does it work?
Once the queue managers have been set up, they monitor the number of instances of applications connected to the queue manager. If you have two queue managers and have 20 instances of serverprog connected to QMA, and 0 instances connected to QMC, then over time some of the connections to QMA will be told to disconnect and reconnect, some may reconnect to QMA, and some may reconnect to QMC. Over time the number of conversations should balance out across the available queue managers.
Below are some charts of showing how this balancing works. I had a number of “server” program connected as a client. They started and all sessions connected to QMA. They did not process any messages. From the reports produced by my MQCB program, I could see when application instances were asked to disconnect and reconnect.
The chart below shows the rate of reconnecting for 20 servers connecting as clients to 2 queue managers – doing no work. After 300 seconds there were 10 connections to each queue manager.
The chart below shows the rate of reconnecting for 80 servers connecting as clients to 2 queue managers – doing no work. After 468 seconds there were 40 connections to each queue manager.
We can see that balancing requests are sent out every minute or two. The number of conversations moved depends on how unbalanced the configuration is. The time before the connections were balanced varied from run to run, but the above charts are typical.
What gets balanced.
I had two applications running into my queue managers. If you use DIS CONN(*) APPLTAG, it shows you the names of the programs running.
My client programs had APPLTAG(myclient), my server programs had APPLTAG(serverprog).
The uniform clustering will balance myclient programs as a group, and serverprog programs as a group.
You may have many client programs, for example hundreds of sessions in a web server, and only a few server programs processing the requests from the clients, so they may get balanced at different rates.
This looks like a really useful capability, but you need to be careful.
The MQ reconnection code will open the queue names you were using, and it is transparent to the application.
A thread may get a request to disconnect and reconnect, while the application is processing an MQ request, waiting for a message, or doing other work. For some application patterns this may not matter, for others you may need to take action.
Where’s my reply?
For a server application which does MQGET, MQPUT MQCOMMIT. If the reconnect request happens, the work can get backed out. Another application can process the work. Great – no problems.
For a client application, these do (MQPUT to server queue, MQCOMMIT), (MQGET wait on reply-to-queue, MQCOMMIT). The reconnection request can happen during the MQGET wait. The MQPUT request specified a reply-to queue, and reply-to queue manager. If the application has a reconnect request, it may connected to a different queue manager, so will not be able to get the reply message (as the message is on the original queue manager).
This problem is due to the reconnection support, and has been around for a long time, so most people will have a process in place to handle this. Uniform Clustering makes no difference to this, it happens without you knowing.
Reporting the wrong queue manager.
Good applications report problems with enough information to identify the problems. For example queue manager name, queue and unexpected return code. If you did MQINQ to find the queue manager name at startup, and if your application instance has been reconnected, the queue manager name may now be wrong.
- You can use MQCB to capture and report these queue manager changes, so the reconnects and new queue manager name are written to the application log.
- You could issue MQINQ for the queue manager name when you report an problem, but the connection may have moved by the time you report an problem.
- You also need to handle which queue manager the MQPUT was done on, as this could be different to where the MQGET completed. This might just be a matter of saving the queue manager name in a MQPUT_QM variable every time you do an MQPUT. You need to do this when tracking down missing messages – you need to know which system the MQPUT was done on.
- You could keep the time of the MQPUT, report “Reply not received, MQPUT was put at 12:33:44” and then review the application log (1 above) to see what it was connected to at that time.
What gets balanced
Conversations get balanced. So if you have a channel with 4 shared conversations, (DIS CHS gives CURSHRCNV(4)), you might end up with a channel to QMA with one conversation, a channel to QMB with two conversations and a channel to QMC with one conversation. Some channels may have only one conversation per channel instance.
Are there any new commands?
I could not find any new commands.
Can I turn it off this automatic rebalancing?
To put your queue manager in and out of maintenance mode, see here
This is a “challenge” with reconnection, not with Uniform Cluster support. If you change the qm.ini file and remove the
statements, this just means the applications connected to this queue manager will not get told to rebalance. You will still get applications trying to connect to the queue manager.