Taming your MQ environment.

You may have started with a well-run MQ organisation, where things were nice and calm, with no major problems.  Then you wake up one day and find your time is spent fighting, and losing, against the enterprise.  There are task forces looking into application stability, you find you have a large bill for all the queue managers in your enterprise which do not have a license, and you cannot catch up on the “important work” you want to do.

When I worked for IBM it was rare for me to find a well-managed MQ environment, where the systems were stable, the environment was neither too big nor too small, and there were few problems to manage.  The fact that I was an MQ troubleshooter might partly explain it.

Over time an enterprise’s MQ environment changes.

  • It can grow organically, or by mergers and acquisitions,
  • There may have been many half-implemented changes of direction,
  • There are people who think they can set up an MQ server better than the MQ team – until they find the  queue manager is not licensed, it has had no maintenance, and a child of 10 could get through its security.
  • The applications have suffered the curse of cut and paste.  It took five minutes to cut and paste some code which worked elsewhere, but it took other people six months' work to get the stability and performance that was needed.  If the application developers had taken more time, following best practices and using defensive coding (for example, report if a call takes a long time), the application would take a lot less effort to support.  Being agile means someone else gets the bullets.

It is difficult to fight such an environment.  Policies and strict standards should help narrow the environment and get things onto a narrow, well-worn track.  With just one way of doing something, it is easier to manage.  Applications which follow best practices stay on the golden path, and are easier to support.

Below are Colin’s Commandments for getting your environment under control.   Some are very obvious, but people may think they do not apply to them.  You will get resistance from people when you want them to change, but the benefits should outweigh the costs.

I hope they will help make life better for MQ system programmers and MQ administrators.

I would welcome any comments on these commandments – or suggestions for other areas I may have missed.

The mission is

To provide an MQ environment which is

  1. Stable
  2. Performs well
  3. Has the availability characteristics required by the business application
  4. At the right cost, both in money and capacity

This is a combined effort of all teams; the MQ team, the application development teams, other support teams and managers.

Colin’s commandments

The list below is the summary, more detail follows after the summary.

Providing queue managers:

  • The MQ team is responsible for providing MQ Servers. Other teams shall not install MQ Servers.
  • The applications teams are responsible for using a current level of MQ client and Java code.
  • The MQ team are responsible for maintaining the levels of software
  • The MQ team shall keep the MQ Servers current.

New applications and significant changes

  • Application teams will work with the MQ team so there is a common understanding of the quality of service requirements.

Security

  • Authorisation to MQ resources will be done using groups or roles, not userids. Individual accesses shall be removed.
  • Identity and authentication shall be done using digital certificates.
  • The levels of encryption on channels will be reviewed annually, and levels of security improved if needed.
  • You shall limit which people need access to queue data in production.
  • Users shall abide by the agreed encryption policy
  • There will be no access where just a userid and password is required to access MQ
  • Every queue must be protected with a security profile

Operations

  • Objects will be defined in a central repository and deployed to the machines as needed.  Application objects not in the repository will be deleted from the queue manager.

Automation

  • Events shall be produced when an abnormality is detected, either by the application, by tools like AppDynamics, or by dependent software (MQ)

You shall plan for availability problems

  • Disaster recovery – loss of main site. You will lose data.
  • Fail over – disks are mirrored; queue managers and other applications can be restarted on the other site.
    • There should be no loss of persistent data, but it may take minutes to become available.
    • Assume all non persistent messages are lost.
  • Systems can be configured so new work flows to a different system while the queue managers are recovering.  This needs good planning and configuration

Applications

  • Information about the connection to MQ shall be outside of the application, for example in a parameter file, passed as a parameter, or in an MQ CCDT
  • Applications should avoid frequent connects.  Use a Java connectionFactory with connection pooling
  • Applications should disconnect and reconnect every 15 minutes or so. This should be done at the end of a business transaction.
  • Applications shall explicitly use MQDISC or close the connection factory.
  • All applications using MQ shall be rebuilt at least once a year.

Messages

  • There shall be no messages on application queues older than 8 days.
  • The appropriate message persistence shall be used. Persistent for critical data, non persistent for non critical data or where there is an end-to-end recovery solution at the application level. This shall be decided by the applications team.
  • Persistent messages shall not have a message expiry interval specified.
  • Non persistent messages shall have message expiry set.
  • Applications will check the backout count of a message and use a backout queue.
  • There shall be a process for taking messages from the backout queue and notifying the application team.
  • The MQ team is responsible for MQ messages, on transmission queues, clustering, events, etc.
  • Application owners own the application data.

Application coding

  • Messages shall have the appropriate msgid and correlid set.
  • MQGETs should use the MQGMO_FAIL_IF_QUIESCING option.
  • Applications shall handle return codes appropriately
  • Applications shall use get with convert option.
  • If an application is expecting a specific reply, and the application times out, there must be a process to handle the message if it finally arrives.
  • All gets shall check the Backout Count value in the message.
  • The get wait time out value shall be less than 10 minutes.
  • Applications shall not poll a queue on a short timer (under 1 minute)

Operations:

  • Naming standards shall be used
  • Applications can be moved between different MQ servers, to allow the MQ team to manage the queue managers
  • “Object not found” shall be an error.
  • Unused objects shall be deleted after a period not less than 13 months

The commandments in more detail…

Providing queue managers

The MQ team is responsible for providing MQ Servers. Other teams shall not install MQ Servers.

Why?

  • To prevent unauthorized installation of MQ
  • Management of licenses
  • It is important to keep servers up to date with fixes
  • Instance life cycle – for example deleting queue managers when they are not needed, and merging little-used ones to reduce the number of queue managers in use.

The MQ team is responsible for making MQ client code etc available to application teams

The applications teams are responsible for using a current level of MQ client code.  The applications team are the only people who can incorporate the new level of MQ code into the applications.

The applications teams need to incorporate the provided level of the products, for example recompile applications to use current header files and Java classes.

The MQ team are responsible for maintaining the levels of software

  • The MQ team shall keep the MQ Servers current. If version 9.1 is available then all servers should be at V9.0 or above. Servers shall have a recent fix pack applied, at least once a year.
  • Application software (MQ client, web server, Java) has to be at a supported level agreed with the manager of the MQ team.
  • Any non-compliant MQ client connecting to an MQ server can, and will, be rejected.  You can do this by writing a channel exit which checks the level of the connection.

Why?

  • It is important to keep current with fixes. There is less impact if a new or urgent fix needs to be applied.  Applying three months' worth of fixes is easier than applying two years' worth.
  • Newer versions of client code have fixes for stability etc.
  • Back-level clients will not be allowed to connect. People may not know they are using an MQ client.  Cutting off their access makes sure that they find out, so they can upgrade the MQ client.

New applications and significant changes

Application teams will work with the MQ teams so there is a common understanding of the quality of service requirements.

  • If the application is business critical
  • What are the availability requirements (always available, or could it accept a half-hour delay while the server is restarted)?  This will help with deciding which systems should host the queues
  • Expected capacity and throughput.  Are the queues going to hold millions of messages in normal use?  What is the expected throughput of data in MB per second, or per day?  Once the application is in production, statistics can be used to show usage and queue depths

Why?

  • Decide if a queue manager can be shared with other users
  • Decide if it needs the high availability of MQ on z/OS, or if midrange can provide the level of service
  • Decide on how many queue manager instances are required, and how they need to be configured.

Security

Access to MQ resources will be done using groups or roles, not userids. Individual accesses shall be removed.

If there are multiple identical queue managers, they shall share a common security policy, for example using LDAP or RACF, or by deploying setmqaut commands automatically to all these queue managers.

Why?

  • This makes it easier to manage. The manager of the group can decide if a person or userid needs access to the resource.  The MQ team do not know if a person needs access or not.
  • It makes management much easier from an MQ perspective, as there are fewer security objects.  A person is connected to a group and gets access on all of the queue managers.
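The group-based approach can be sketched in a few lines of Python.  The group, user, and queue names below are made-up examples; a real MQ setup would use OS groups, LDAP, or RACF.  The point is that permissions attach to groups, never to individual userids:

```python
# Permissions are granted to groups; users get access only via membership.
# All names here are illustrative assumptions, not real MQ configuration.
GROUP_ACCESS = {
    "PAYROLL_ADMIN": {"PAYROLL.INPUT": {"put", "get"}},
    "PAYROLL_READER": {"PAYROLL.INPUT": {"get"}},
}
USER_GROUPS = {"alice": ["PAYROLL_ADMIN"], "bob": ["PAYROLL_READER"]}

def can_access(user, queue, operation):
    """A user has access only through the groups they belong to."""
    return any(operation in GROUP_ACCESS.get(group, {}).get(queue, set())
               for group in USER_GROUPS.get(user, []))
```

Granting or revoking someone's access is then a group-membership change, with no MQ security objects touched at all.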

Identity and authentication shall be done using digital certificates.

These will be renewed annually, and will have the agreed levels of security (size of key, and algorithms)

Why?

  • Increasing the key size will increase the security
  • Newer algorithms can increase security, and some can be offloaded to special chips (z/OS ZIIPs), and so reduce the CPU used

The levels of encryption on channels will be reviewed annually, and levels of security improved if needed.

Why?

  •  Newer algorithms provide better encryption, and lower overall CPU cost.
  • It can be a major piece of work upgrading clients and channels to a different cipher spec.

Limit which people need access to queue data in production.

  • Maintain an audit list of users when they access production data (opening a production queue for input or output).
  • Most people do not need access to the message data.  Automation should be able to work with any messages (such as offloading them to a file) if needed.
  • Applications people will need access to the message content, if there is a problem with the message content.  A special userid could be used for this sort of activity.

Why?

  • Limit how much data can leak
  • Be able to identify who had access to data at given times.

 

Use of encryption

  • Channels to external partners to use TLS
  • You need to agree if you will use TLS on channels within your organisation.
  • Decide if you will use AMS or not to protect message content in flight
  • Document your policy

There will be no access where just a userid and password is required to access MQ

Using a userid and password is inherently insecure.  Using certificates is more secure.  Using Multi-Factor Authentication is even more secure, for example requiring a digital certificate and a one-time password generated by a dongle.

Every Queue must be protected with a security profile

This could be a generic profile PAYROLL.*

Why?

  • The MQ team do not know what business data will be stored in the queue

Operations

Deployment of definitions

  • Objects will be defined in a central repository and be deployed to the machines as needed
  • There will be no manual definition on the queue managers
  • If an object on a queue manager is out of step with the repository, then object definition will be updated to the repository standard, or the definitions removed from the queue manager
  • Use queue manager change events to monitor when objects are changed.

Why?

  • If you are deploying the same definitions to multiple queue managers you need a process to do it.  If you are deploying to just one queue manager – why have two processes instead of one?  It also makes it easier to “quickly create another instance of the queue manager”.
  • Have the MQ objects for a business application in one file.   If you make a change to one object, deploy all of them with a “define replace”.  Any objects that may have been changed by hand will be reset to the standards.
  • Application objects which do not belong to a supported application should be deleted.  This is good housekeeping practice.

Automation

  • Events shall be produced when an abnormality is detected, either by the application, by tools like AppDynamics, or by dependent software (MQ)
  • Use automation to respond to messages and events.  If a message occurs which is not automated – then change the automation to include it.  This means you need to check events and messages every day.
  • Use tools for identifying trends, this could be capacity, or a channel is connecting more often.

Why?

  • Applications need to produce events or other notification when they detect internal problems.   This could be logic errors, or conditions like queue full.  You need to take action when problems occur.
  • You need automation to handle these events and take action, either to fix the problem, or to notify someone who can fix it
  • If you use some of the cognitive insight tools on the events, they can be trained to spot problems before they occur
  • You may be able to use these tools to help you do capacity planning.

You shall plan for availability problems

Disaster recovery – loss of main site

People need to know that this will have an impact on business critical applications.  There is likely to be a loss of data in MQ (and databases as well).  If you cannot tolerate this, you need a design which can handle it – for example sending a message to two sites.

Fail over – disks are mirrored; queue managers and other applications can be restarted on the other site.

People need to know

  • There should be no loss of persistent data.  There may be a delay of several minutes before persistent data is available in the queue manager (the time taken to notice a problem, the time taken to shut the queue manager down, and the time to start the queue manager on the other site).
  • Plan for the loss of all non persistent data
  • Systems can be configured so new work flows to a different system while the queue managers are recovering.  This needs good planning and configuration

You need to know what level of availability the applications need.  z/OS may be the best solution for this.

Applications

Connect information

Information about the connection to MQ shall be outside of the application, for example in a parameter file or passed as a parameter

Why?

  • It is easier to move the application into production.
  • It allows an application to be moved to a different queue manager without having to redeploy the application, for example when upgrading the queue manager or hardware.
  • If you need to provide a second queue manager for scalability, you create the second queue manager and the MQ team changes the configuration file
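As a sketch, the parameter-file approach might look like this in Python.  The file layout and field names are assumptions for illustration – a real deployment might use a CCDT or JNDI instead:

```python
import json

def load_mq_config(path):
    """Read queue manager connection details from a parameter file,
    so nothing is hard-coded in the application."""
    with open(path) as f:
        cfg = json.load(f)
    # Fail fast if a required field is missing, rather than deep in MQCONN.
    for key in ("queue_manager", "channel", "host", "port"):
        if key not in cfg:
            raise ValueError("missing MQ parameter: " + key)
    return cfg
```

Moving the application to another queue manager is then a one-line change to the parameter file, with no rebuild or redeploy.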

Applications should avoid frequent connects.  Use a Java connectionFactory with connection pooling

Why?

  • MQ Connect is very expensive.  It is easy to write a java program which causes an MQCONN for every message.  Using a connection factory means the connection is held and there is no MQDISC.

Applications should disconnect and reconnect every 15 minutes or so. This should be done at the end of a business transaction

Why

  • This allows connection balancing, and avoids the situation where one server is overloaded and a second server (which started later) has no work.  Uniform Clustering in MQ V9 is not suitable for non-trivial business applications, for example request-reply.

Application shall explicitly use MQDISC or close the connection factory.

Why?

  • If a Java application returns, it may not automatically go through disconnect processing, resulting in “lost” connections and an increase in the number of connections in use.  Too many connections can cause outages.
  • If applications just return on detecting a problem, rather than closing their MQ resources before returning, this is a defect.
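The shape of the fix is a try/finally (or try-with-resources in Java).  This is a minimal Python sketch; `FakeConnection` and its `disconnect` method stand in for a real MQ connection and MQDISC, and are not real API names:

```python
# FakeConnection stands in for an MQ connection or JMS connection factory.
class FakeConnection:
    def __init__(self):
        self.open = True

    def disconnect(self):       # stands in for MQDISC / factory close
        self.open = False

def run_unit_of_work(conn, work):
    """Run the business logic, guaranteeing the connection is released
    on every exit path -- normal return or exception."""
    try:
        work()
    finally:
        conn.disconnect()
```

With this shape, a logic error in `work` still releases the connection instead of leaking it.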

Messages

There shall be no messages on application queues older than 8 days.

Why?

  • Application messages should be processed either within seconds, or perhaps overnight or at the weekend.
  • If there are any old messages this means that there were some application problems; perhaps an application timed out and did not clean up the late-arriving reply, or an application abended while processing the message.  In either case there should be an alert, or action to process the residual message.
  • Deep application queues impact the time to restart and recover after a failure.

The appropriate message persistence shall be used.

  • Persistent for critical data,
  • Non persistent for non critical data or where there is an end-to-end recovery solution at the application level.

Guideline: updating a resource needs persistent data unless the application does end-to-end recovery; inquiries (or repeatable requests) should be non persistent.

The requesting application determines whether the data is critical – any “server” will respect this request; for example, if the input message is persistent the reply should be persistent.

Persistent messages

  • Persistent messages will not use message expiry
  • There will be a process for removing persistent messages if the getting application has gone away.  This is to prevent a build up of orphaned messages on a queue
  • Persistent messages are usually processed within Syncpoint. An exception is audit type messages that you want to be produced even if the transaction rolls back.

Non persistent message

  • Non persistent messages shall have message expiry set.
  •  Non persistent messages are often processed out of syncpoint
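The persistence and expiry rules above can be sketched as a small helper.  The function name, dict layout, and the 60-second default are illustrative assumptions; one real detail worth noting is that the MQMD Expiry field is specified in tenths of a second:

```python
def put_options(critical, expiry_seconds=60):
    """Choose persistence and expiry for an outgoing message,
    following the rules: persistent = no expiry, non persistent = expiry."""
    if critical:
        # Persistent: no expiry, so the message cannot silently vanish.
        return {"persistence": "PERSISTENT", "expiry": "UNLIMITED"}
    # Non persistent: always set an expiry.
    # MQMD Expiry is in tenths of a second, hence the * 10.
    return {"persistence": "NON_PERSISTENT",
            "expiry": int(expiry_seconds * 10)}
```

Centralising this decision in one place keeps individual putting applications from making inconsistent choices.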

Applications will check the backout count of a message and use a backout queue

  • If a message has a backout count > 2, or has bad content (for example there is an error in the data, or an invalid header is detected), then the message should be put onto the backout queue, out of syncpoint and with no expiry set.

Why?

  • The program detecting the problem may not be doing the commit or backout.  Putting the message out of syncpoint means it cannot be rolled back.
  • There shall be a process for taking messages from the backout queue and notifying the application team.
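The backout-count check can be sketched like this, using plain dicts to stand in for MQ messages.  The threshold of 2, the field names, and the `dispatch` function are illustrative assumptions:

```python
BACKOUT_THRESHOLD = 2

def dispatch(msg, process, backout_queue):
    """Process a message, or move it to the backout queue if it has
    been backed out too often or has bad content."""
    if msg["backout_count"] > BACKOUT_THRESHOLD or msg.get("bad_content"):
        # Put out of syncpoint, with no expiry, so a rollback cannot undo it.
        backout_queue.append(msg)
        return "backout"
    try:
        process(msg)
        return "committed"
    except Exception:
        # A real rollback returns the message to the queue, and the queue
        # manager increments its backout count; we model that here.
        msg["backout_count"] += 1
        return "rolled_back"
```

Without the threshold check, a poison message would loop through get, fail, rollback forever, burning CPU and blocking the queue.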

 

Application coding

Messages shall have the appropriate msgid and correlid set.

  • A requestor should always use msgid/correlid (a put followed by a get for the answer from the server)
  • A server using “get next message”, shall not specify a msgid or correlid value for the get, and should specify the appropriate msgid and correlid in any response

Why?

  • Msgid and correlid are used to get a specific message.  You cannot assume that a dedicated queue is being used by the application; a queue shared by other application instances may be used for performance reasons.

MQGETs should use the MQGMO_FAIL_IF_QUIESCING option.

Why?

  • Without this, a long get will prevent the queue manager from shutting down.

Applications should handle return codes appropriately

  • Operational return codes such as queue full or queue disabled should be handled either within the application, or via an event and automation.  For a queue full condition, the application may wait for a period and retry.
  • A return code indicating a data or programming problem, such as message too large or a message format error, should generate an alert or a message for automation to pick up.  The application will typically terminate.

JMS or Java exceptions shall be reported, so the underlying error is reported.

When an “An error occurred” alert is produced with no exception data, it is hard to diagnose the problem.  Provide the exception data, the object being used, the program name and the line within the program.  For example, have a unique error message which is produced in only one place.

Applications shall use get with convert option.

This is in case the queue manager is moved to a different platform, or a different source is used.

If processing data within syncpoint, applications shall do an explicit commit or backout.

This is because the default is different on z/OS and mid-range.  Out-of-syncpoint data does not need a commit or backout.

If an application is expecting a specific reply, and the application times out, then there must be a process to handle the message if it finally arrives.

Why?

  • A non persistent message with expiry will eventually time out
  • Persistent messages with no expiry will just stay on the queue until an action is taken.  They need another process, for example an overnight job which drains the queue of messages more than a few minutes old.
  • There are utilities like QLOAD from MQGEM which can move messages matching selection criteria, for example age > 5 minutes, from a queue to another queue or a file.

Have a process to delete any temporary queues that were used.

Why

  • If there were some messages left behind, the queue may not have been deleted on close.
  • There needs to be a process to handle any messages on this queue, and deleting the queue.

After a time out, if the application decides to resend the request, the back-end application must be able to process a missing request, and a possible duplicate request

Why

  • The applications team need to architect the flow.  If the application doing an MQGET does not get a message within the specified time, it needs to either resend the message, or notify the requestor.  If the message is resent, and flows via a different route the second time, the back-end application needs to be able to handle a missing message, or a possible duplicate message.  The original message may have got stuck on the way to the back end, or it got to the back end and the reply was stuck.
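One common way to make the back end tolerate resends is to remember the message ids it has already processed and return the cached reply for a duplicate.  This is a sketch, not a prescription: the class name is made up, and a real implementation would need a persistent id store with expiry of old entries:

```python
class IdempotentServer:
    """A back end that does the work once per msgid, however many times
    the request arrives."""
    def __init__(self):
        self.seen = {}          # msgid -> cached reply

    def handle(self, msgid, request, work):
        if msgid in self.seen:
            # Duplicate: the requester resent after a timeout.
            # Return the original reply rather than doing the work twice.
            return self.seen[msgid]
        reply = work(request)
        self.seen[msgid] = reply
        return reply
```

With this in place, the requester can safely resend after a timeout, whether the original request was lost or only its reply was delayed.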

All gets shall check the Backout Count value in the message.

Why

  • To prevent an endless MQGET, problem, rollback loop.  For example, if the backout count is >= 2 then put the message on the backout queue, and produce an alert.

The get wait time out value shall be less than 10 minutes.

Why

  • For an application waiting for a response to a request, a typical value is 1 second.  This is application dependent; the time for a request and receiving the reply may be 50 milliseconds.
  • A limit of 10 minutes allows the MQDISC to happen within the 15-minute disconnect-and-reconnect interval.

Applications shall not poll a queue on a short timer (under 1 minute)

Why?

  • An MQGET with a long wait is more efficient, as it uses a lot less CPU
  • If the application has to get from multiple queues, you can use MQ to deliver a message to your application when one arrives.
  • Polling a queue is very expensive in CPU, as most polls return a “no message available” response.
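Some back-of-envelope arithmetic shows why.  The poll interval, wait interval, and message rate below are made-up example numbers; the functions only count MQGET calls:

```python
SECONDS_PER_DAY = 24 * 60 * 60

def gets_per_day_polling(poll_interval_s):
    """Every poll is an MQGET; on a quiet queue almost all of them
    return 'no message available'."""
    return SECONDS_PER_DAY // poll_interval_s

def gets_per_day_waiting(messages_per_day, wait_s):
    """With get-with-wait, empty-queue wakeups happen at most once per
    wait interval, plus one MQGET per message actually delivered."""
    return messages_per_day + SECONDS_PER_DAY // wait_s
```

Polling every 5 seconds costs 17,280 MQGETs a day even if no messages arrive; a 60-second get-wait on a queue receiving 1,000 messages a day costs about 2,440, most of which do useful work.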

 

Application builds

  • If you are having seemingly inexplicable problems with an MQ program compiled on a previous release, then recompile with the level of MQ being used as the first recovery option.  This rebuild should use the header files and Java files for the level of MQ being used.
  • All applications using MQ shall be rebuilt at least once a year.

Why?

  • If a problem is found in an application, and it has to be rebuilt, it may be using different libraries from the previous time it was compiled, for example MQ V9 C header files instead of MQ V8 header files.  The MQ interface code (stubs) may also have changed.  If there are problems, we need to know before we have a critical problem.
  • When the level of MQ has been changed on the servers, the applications should be recompiled as part of the post migration work.
  • There may be a different level of Java, and this may behave differently.
  • We know we have the source of the program which matches the executable

This is controversial!

Application owners own the application data.

  • Application data in MQ queues is the responsibility of the application owner.  The primary owner is the application getting the message.
  • If the queue fills up, the applications team are responsible for clearing the queue, either by starting more applications, or by moving messages out of the queue
  • If the message has bad data, the application getting the data is responsible for resolving the problem, even though a different application may have put the message, or it may be external to your enterprise.
  • The MQ team are responsible for MQ messages, on transmission queues, clustering, events, etc.

Operations: Naming standards

  • Existing naming standards shall be used.  There may be many existing standards today. Application teams shall work with the MQ team when defining new resources to use the appropriate standard.
  • There will be security profiles to control access to MQ resources. These cover putting and getting messages, and defining, altering MQ objects. The controls will be group/role based.

Operations housekeeping

Applications can be moved between different MQ servers, to allow the MQ team to manage the queue managers

  • This includes moving  work to better queue managers, and removing old queue managers
  • It may need a CCDT or other mechanism to “deploy” definitions to applications, so they pick up the new queue managers and handle changes to cipher specs.
  • You need to know which applications use which queues, so definitions can be made on new servers.  You may need to use an application trace to identify objects on mid-range, as the statistics and accounting data report the queue being used, and not the opened queue.  For a clustered queue the queue name may just be “SYSTEM.CLUSTER.TRANSMIT.QUEUE”.
  • Provide a multiple back-end solution to applications for availability

“Object not found” shall be an error.

Why?

  • If a request is made to open an object which does not exist in the queue manager, and clustering is being used, the queue manager will send a request to a full repository.  The full repository remembers the request.
  • This wastes CPU, and can cause a large number of objects to be stored in the cluster repository cache.
  • The queue should be put/get disabled if no access is needed.
  • An object-not-found event is written to the MQ event queue.  You could have an application which processes the event, and generates the object either from a backup, or from a standard template.

Unused objects shall be deleted after a period not less than 13 months

Why ?

  • Good housekeeping practice
  • Some objects may only be used once a year, so they need to be kept for longer than this.
  • You can use DIS QS(queuename) LPUTDATE LPUTTIME to see when a message was last put.  This information is discarded when the queue manager is shut down.  It would be good practice to issue the command for all queues once a week, and before shutting down the queue manager, to capture when the last message was put.

Should I make my web server Highly Available ?

If you are using XA on midrange, then connect to just one queue manager, because anything else is not supported.  If your connection goes to a different queue manager, then it cannot recover a unit of work.
The only exception to that rule is if you are connecting to a QSG on z/OS. When using a QSG, all of the queue managers in the QSG can resolve each others transactions – this means that the transaction recovery issue goes away.
Be careful using multiple queue managers.
For the listener you want to connect to just one queue manager.  If you have a choice of two queue managers, all threads may connect to QM1, and there may be no connection to QM2.  If you have a second web server instance, this may also connect to QM1, and not QM2.  It is better to have WS1 connect to QM1 and WS2 connect to QM2 – so-called production lines.
If you need to shut down QM1, then shut down WS1; do not try to cross-wire it.
For the MDB connection factory, if you have two possible queue managers you may get any spread of connections, from all on QM1, to all on QM2, or a mixture.
If the web server frees connections (for example because they are idle), then when load needs more connections, it could get a connection from QM1 or QM2.
All in all – keep it simple, and have each web server connect to just one queue manager.

 

Running a z/Linux container as an address space on z/OS – WOW!

I was at the Guide Share Europe conference in the UK last week.  I had not been for a few years, and it was great to brush up my latest z/OS skills.  It was the largest attendance – over 500 people – with about 50 young people, which was great (so young they were not allowed to drink alcohol).  It was also CICS’s 50th birthday, so there was a dinner, lots of cake, and impressive fireworks.

One presentation caught my eye: running a z/Linux container in a z/OS address space.  Yes – a z/Linux container in an address space, not USS.  Instead of having to install z/VM, or having to carve out an LPAR for z/Linux, you “just” configure the address space.  It looks about as complex as installing MQ on z/OS.  For example, you have to define linear datasets for Linux to use.  These are accessed by page number – just like a page set.  You control it using the z/OS modify command.  You access it via TCP/IP, so there are no cross-memory interfaces into it.
You can now run all of the cloud stuff, like Jenkins, within z/OS in an address space – WOW!

Recently someone said that virtualization had made a huge difference to the way systems are deployed these days.  I said I was using virtualisation on VM/370 before he was born.
I wonder what will be “new” on z/OS in 20 years time?

Does your highly available solution depend on a bit of rusty kit?

I heard second ( or third hand) about a customer involved in distribution who found a little problem with his highly available system.

They had great software that made sure that avocados and aubergines were sent to Arundel, blackberries and blackcurrants to Blackpool, and chives and chicory to Chichester.  The software would give the packers instructions on where to store the vegetables, and the order in which to put the trolleys into the container, so that when the container was delivered the right goods were in the right place.  This made unloading very efficient.  Things happened automatically, or instructions were sent to tablets telling people what to do.  There was almost no paper involved in the distribution.

Paper was used by the drivers, who would come to the shed to get instructions on which container to collect and where to go, so that the delivery did not go from Arundel to Chichester by way of Blackpool.  The teeny-weeny problem they had was when the printer got old and finally stopped working.  They could not print the drivers' instructions, and so the drivers did not know where to go.  They could not route the printing to another printer, as other printers were not configured to CICS.  As a result they had a day when they could not deliver the containers, and the perishable contents had to be thrown away.

 

So remember the end to end solution is truly end to end ….  not just the walls of your machine room.

 

Midrange now has a DIS APSTATUS command

This is a new command in 9.1.3 midrange, part of the “uniform clustering” support.  (Uniform clustering is what I would call connection balancing; see Uniform clustering gets a tick from me.)

For example, I had two instances of the program oemput running, and the command gave

dis apSTATUS('oemput') 
AMQ8932I: Display application status details.
  APPLNAME(oemput) CLUSTER( )
  COUNT(2) MOVCOUNT(0) 
  BALANCED(NOTAPPLIC)

and

dis apSTATUS('oemput') type(local)
AMQ8932I: Display application status details.
  APPLNAME(oemput) 
  CONNTAG(MQCT4509BF5D0368DB23QMA_2018-08-16_13.32.14oemput)
  CONNS(1) IMMREASN(NOTCLIENT)
  IMMCOUNT(0) IMMDATE( )
  IMMTIME( ) MOVABLE(NO)
AMQ8932I: Display application status details.
  APPLNAME(oemput) 
  CONNTAG(MQCT4509BF5D017BDB23QMA_2018-08-16_13.32.14oemput)
  CONNS(1) IMMREASN(NOTCLIENT)
  IMMCOUNT(0) IMMDATE( )
  IMMTIME( ) MOVABLE(NO)

There is a different conntag for each instance of the program.  DIS QMGR QMGRID gives QMID(QMA_2018-08-16_13.32.14).

The tags are MQCT4509BF5D017BDB23QMA_2018-08-16_13.32.14oemput and  MQCT4509BF5D0368DB23QMA_2018-08-16_13.32.14oemput.
(Thanks to eagle eyed Morag for pointing out the difference.)
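As a hypothetical illustration (the CONNTAG layout is an observation from the output above, not a documented interface), each tag appears to be “MQCT”, then a 16-hex-character connection handle, then the QMID, then the application name.  A short sketch to pull the two tags apart:

```python
# Hypothetical parse of the CONNTAG values shown above.
# Assumed layout: "MQCT" + 16 hex characters + QMID + application name.
# This layout is an observation from this output, not a documented API.
def split_conntag(tag, qmid, appl):
    assert tag.startswith("MQCT")
    handle = tag[4:20]               # the 16 hex characters after "MQCT"
    assert tag[20:] == qmid + appl   # the rest is QMID + application name
    return handle

qmid = "QMA_2018-08-16_13.32.14"
h1 = split_conntag("MQCT4509BF5D017BDB23" + qmid + "oemput", qmid, "oemput")
h2 = split_conntag("MQCT4509BF5D0368DB23" + qmid + "oemput", qmid, "oemput")
print(h1, h2)   # only the handle part differs between the two instances
```

This makes Morag’s point visible: the two tags are identical except for the handle portion.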

Stackoverflow: What throughput can a standalone Java program achieve?

There was a question in the MQ section on StackOverflow:

I have a standalone multi threaded java application which listen messages from IBM MQ.
Current system take around 500ms for processing of 1 message after it read from queue and till it commit.
I want to know how many messages I can consume

  • Concurrently:
  • Max number of messages can be processed? or throttle limit

A good meaty performance question, I thought.  Let me break this into pieces.

Current system take around 500ms for processing of 1 message after it read from queue and till it commit.

Processing one message and its commit should take about 10 milliseconds or less (say 30 ms for a two-phase commit).  There is clearly something else going on; fix this first.  Likely causes:

  1. A long database call.   This could be due to database locking, or a badly designed statement, for example a query which needs to access thousands or millions of rows.
  2. A request to a server far far away
  3. A file system with the speed of writing an illuminated letter to parchment
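One way to find which of these is the culprit is the defensive coding mentioned earlier – time each external call and report when it is slow.  A minimal sketch, where the 50 ms threshold and the names are illustrative assumptions, not anything from the post:

```python
# Illustrative "report if this call takes a long time" wrapper.
# The 50 ms threshold and the names here are assumptions for the sketch.
import time

SLOW_MS = 50.0

def timed(label, fn, *args, **kwargs):
    """Run fn, and print a warning if it took longer than SLOW_MS."""
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    elapsed_ms = (time.perf_counter() - start) * 1000.0
    if elapsed_ms > SLOW_MS:
        print(f"WARNING: {label} took {elapsed_ms:.1f} ms")
    return result

# Usage: wrap the suspect calls - the database insert, the remote request,
# the file write - and the slow one identifies itself in the log.
value = timed("demo call", lambda: 42)
```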

How many messages I can consume: Concurrently:

Take the worst case of using persistent messages, which require log IO during commit.

For one thread, processing multiple messages before doing a commit means the thread can do more work.  Consider a get taking 1 millisecond and a commit taking 10 ms: that is one message processed every 11 ms.  If you do 50 gets (taking 50 ms) followed by one commit (taking 10 ms), that is 50 messages in 60 ms, which equates to one message every 1.2 milliseconds – almost 10 times faster.  This is how channels send messages efficiently.  There is a “sweet spot” of messages per commit which gives you the maximum data processed per second.  It depends on the message size, logging rates and other factors.  For a 100MB message it is one message per commit; for 10KB messages it may be 1000 messages per commit.
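The arithmetic above can be sketched as follows; the times are the illustrative 1 ms per get and 10 ms per commit from the text, not measurements:

```python
# Illustrative batching arithmetic: 1 ms per get, 10 ms per commit.
# These are the example figures from the text, not measured values.
GET_MS = 1.0
COMMIT_MS = 10.0

def ms_per_message(batch_size):
    """Average cost of one message when batch_size gets share one commit."""
    return (batch_size * GET_MS + COMMIT_MS) / batch_size

print(ms_per_message(1))    # 11.0 ms per message
print(ms_per_message(50))   # 1.2 ms per message - almost 10 times faster
```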

This may be selfish

This is clearly a great improvement, but possibly selfish.  If the application logic is a get followed by a database insert, followed by a commit, then doing 50 gets, 50 inserts and one commit will be much faster.  The downside is that the database requests keep their locks until the commit.  These locks may prevent other applications from accessing data – whether row locks on the recently inserted records, page locks, or index locks.  So overall MQ throughput goes up, but the business transaction suffers.  You need to understand the database and find the optimum number of requests per commit for your business transaction.

How long before the data is visible?

Rather than have one thread process 1000 messages per commit (taking 1010 ms with the example times above), you may want multiple threads each processing 10 messages per commit (taking 20 ms).  This means the data in the database (or the replies, etc.) is visible earlier.  This may matter to your business transaction if you have to worry about response time.
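Using the same illustrative figures (1 ms per get, 10 ms per commit), the batch size sets how long data stays invisible to other applications:

```python
# Same illustrative figures as before: 1 ms per get, 10 ms per commit.
# Time from the first get of a batch until its data is committed (visible).
def time_to_visible_ms(batch_size, get_ms=1.0, commit_ms=10.0):
    return batch_size * get_ms + commit_ms

print(time_to_visible_ms(1000))  # 1010.0 ms before the batch is visible
print(time_to_visible_ms(10))    # 20.0 ms
```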

Parallel  threads

  1. Using more threads should improve throughput, unless it is delayed by external factors – such as database locks.
  2. One customer found that one thread was optimal, because there were no database delays.

How many messages I can consume: Max number of messages can be processed? or throttle limit

There are papers written on this, but here is a one-minute overview.

As fast as the queue manager can process data

  1. The rate at which MQ can write its logs
  2. Keep queue data in memory (buffer pools on z/OS, queue buffers on midrange) by having few messages on the queue.

Threads

  1. Having parallel threads gives you better throughput than one thread: you get overlapped writing to the log, the units of work are shorter in duration, and you can get parallel IO.
  2. You may be limited by the network.  Having multiple threads in an application means the network can be better utilised: one thread can be receiving data down the wire while another thread is waiting in commit.
  3. You may be limited by where your programs run – e.g. short of CPU, or slow IO (for your System.out.println statements).
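A minimal sketch of the parallel-consumer pattern, using Python’s standard library queue to stand in for the MQ queue; the real application would use the MQ client API, and all names here are illustrative:

```python
# Sketch of multiple consumer threads draining one queue.
# queue.Queue stands in for the MQ queue; the .upper() call stands in
# for "process the message and commit". All names are illustrative.
import queue
import threading

work = queue.Queue()
results = []
lock = threading.Lock()

def consumer():
    while True:
        msg = work.get()
        if msg is None:                    # sentinel: no more work
            break
        with lock:
            results.append(msg.upper())    # stand-in for process + commit

for i in range(100):
    work.put(f"msg{i}")

threads = [threading.Thread(target=consumer) for _ in range(4)]
for t in threads:
    t.start()
for _ in threads:
    work.put(None)                         # one sentinel per thread
for t in threads:
    t.join()

print(len(results))  # 100 - all messages processed across 4 threads
```

Note the serialisation point the next section mentions: all four threads contend on the one queue (and here on one lock), which is why a queue per business application scales better.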

Application design

  1. You may get delays due to serialisation if all threads are using the same queue.
  2. Remove the debug printf or System.out.println statements.
  3. Using a queue per business application is better than all applications sharing the same queue.
  4. Using one reply-to queue per web server may be better than a shared reply-to queue – especially if you use Apache Camel.
  5. Use get-first if possible; avoid scans of the queue.

The short answer….

You should be able to get thousands of 1KB messages a second through your Java application when using multiple threads.