You may have started with a well-run MQ organisation, where things were nice and calm, with no major problems. You wake up one day and find your time is spent fighting, and losing, against the enterprise. There are task forces looking into application stability, you have a large bill for all the queue managers in your enterprise which do not have a license, and you cannot catch up on the “important work” you want to do.
When I worked for IBM it was rare for me to find a well-managed MQ environment, where the systems were stable, the environment was neither too big nor too small, and there were few problems to manage. The fact that I was an MQ troubleshooter might partly explain it.
Over time an enterprise’s MQ environment changes.
- It can grow organically, or by mergers and acquisitions.
- There may have been many half implemented changes in direction.
- There are people who think they can set up an MQ server better than the MQ team – until they find the queue manager is not licensed, it has had no maintenance, and a child of 10 could get through its security.
- The applications have suffered the curse of cut and paste. It took five minutes to cut and paste some code which worked elsewhere, but it took other people six months' work to get the stability and performance that was needed. If the application developers had taken more time, following best practices and using defensive coding (for example, report if this call takes a long time), the application would take a lot less effort to support. Being agile means someone else gets the bullets.
It is difficult to fight such an environment. Having policies or strict standards should help narrow the environment and get things onto a narrow, well-worn track. Having just one way of doing something makes it easier to manage. Applications which follow best practices stay on the golden path and are easier to support.
Below are Colin’s Commandments for getting your environment under control. Some are very obvious, but people may think they do not apply to them. You will get resistance from people when you want them to change, but the benefits should outweigh the costs.
I hope they will help make life better for the MQ system programmers and MQ administrators.
I would welcome any comments on these commandments – or suggestions for other areas I may have missed.
The mission is
To provide an MQ environment which is
- Stable
- Performs well
- Has the availability characteristics required by the business application
- At the right cost, both in money and capacity
This is a combined effort of all teams; the MQ team, the application development teams, other support teams and managers.
Colin’s commandments
The list below is the summary. More detail follows after it.
Providing queue managers:
- The MQ team is responsible for providing MQ Servers. Other teams shall not install MQ Servers.
- The applications teams are responsible for using a current level of MQ client and Java code.
- The MQ team are responsible for maintaining the levels of software
- The MQ team shall keep the MQ Servers current.
New applications and significant changes
- Application teams will work with the MQ team so there is a common understanding of the quality of service requirements.
Security
- Authorisation to MQ resources will be done using groups or roles, not userids. Individual accesses shall be removed.
- Identity and authentication shall be done using digital certificates.
- The levels of encryption on channels will be reviewed annually, and levels of security improved if needed.
- You shall limit which people need access to queue data in production.
- Use of encryption shall abide by the agreed encryption policy
- There will be no access where just a userid and password is required to access MQ
- Every queue must be protected with a security profile
Operations
- Objects will be defined in a central repository and deployed to the machines as needed. Application objects not in the repository will be deleted from the queue manager.
Automation
- Events shall be produced when an abnormality is detected, either from the application, from tools like AppDynamics, or from dependent software (MQ)
You shall plan for availability problems
- Disaster recovery – loss of main site. You will lose data.
- Fail over – disks are mirrored, queue managers and other applications can be restarted on the other site.
- There should be no loss of persistent data, it may take minutes to become available.
- Assume all non persistent messages are lost.
- Systems can be configured so new work flows to a different system while the queue managers are recovering. This needs good planning and configuration
Applications
- Information about the connection to MQ shall be outside of the application, for example in a parameter file, passed as a parameter, or in an MQ CCDT
- Applications should avoid frequent connects. Use a Java connectionFactory with connection pooling
- Applications should disconnect and reconnect every 15 minutes or so. This should be done at the end of a business transaction.
- Applications shall explicitly use MQDISC or close the connection factory.
- All applications using MQ shall be rebuilt at least once a year.
Messages
- There shall be no messages on application queues older than 8 days.
- The appropriate message persistence shall be used. Persistent for critical data, non persistent for non critical data or where there is an end-to-end recovery solution at the application level. This shall be decided by the applications team.
- Use of Persistent messages. These will not have message expiry interval specified
- Use of non persistent message. Non persistent messages shall have message expiry set.
- Applications will check the backout count of a message and use a backout queue
- There shall be a process for taking messages from this BO queue and notifying the application team.
- The MQ team is responsible for MQ messages, on transmission queues, clustering, events, etc.
- Application owners own the application data.
Application coding
- Messages shall have the appropriate msgid and correlid set.
- MQGETs should use the MQGMO_FAIL_IF_QUIESCING option.
- Applications shall handle return codes appropriately
- Applications shall use get with convert option.
- If an application is getting a specific reply, and the application times out, there must be a process to handle the message if it finally arrives.
- All gets shall check the Backout Count value in the message.
- The get wait time out value shall be less than 10 minutes.
- Applications shall not poll a queue on a short timer (under 1 minute)
Operations:
- Naming standards shall be used
- Applications can be moved between different MQ servers, to allow MQ team to manage the queue managers
- “Object not found” shall be an error.
- Unused objects shall be deleted after a period not less than 13 months
The commandments in more detail…
Providing queue managers
The MQ team is responsible for providing MQ Servers. Other teams shall not install MQ Servers.
Why?
- To prevent unauthorized installation of MQ
- Management of licenses
- It is important to keep servers up to date with fixes
- Instance life cycle – for example deleting queue managers when they are not needed, and merging little used ones to reduce the number of queue managers in use.
The MQ team is responsible for making MQ client code etc available to application teams
The applications teams are responsible for using a current level of MQ client code. The applications team are the only people who can incorporate the new level of MQ code into the applications.
The applications teams need to incorporate the provided level of the products, for example recompile applications to use current header files and Java classes.
The MQ team are responsible for maintaining the levels of software
- The MQ team shall keep the MQ Servers current. If version 9.1 is available then all servers should be at V9.0 or above. Servers shall have a recent fix pack applied, at least once a year.
- Application software (MQ client, web server, Java) has to be at a supported level agreed with the manager of the MQ team.
- Any non-compliant MQ client connecting to an MQ server can/will be rejected by the MQ manager. You can do this by writing a channel exit which checks the level of the connection.
Why?
- It is important to keep current with fixes. There is less impact if a new or urgent fix needs to be applied. Three months' worth of fixes is easier to install than two years' worth.
- Newer versions of client code have fixes for stability etc.
- Back-level clients are not allowed to connect because people may not know they are using an MQ client. Cutting off their access makes sure that they know, so they can change the MQ client.
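As a rough illustration of the level check described above, here is a minimal Python sketch. The function names, the inventory of client levels and the agreed minimum of 9.0 are assumptions made for the example. A real check would run in a channel exit or in your inventory tooling, and would take the level from the connection itself.

```python
def parse_version(text):
    """Turn a dotted level such as '9.1.0.5' into a tuple of ints."""
    return tuple(int(part) for part in text.split("."))


def is_compliant(level, minimum):
    """Return True if 'level' is at or above the agreed minimum level."""
    have, need = parse_version(level), parse_version(minimum)
    # Pad with zeros so that '9.0' and '9.0.0.0' compare as equal.
    width = max(len(have), len(need))
    have += (0,) * (width - len(have))
    need += (0,) * (width - len(need))
    return have >= need


# Hypothetical inventory of client levels seen connecting.
clients = {"payroll-app": "9.1.0.5", "old-batch": "7.5.0.9"}
back_level = sorted(name for name, level in clients.items()
                    if not is_compliant(level, "9.0"))
print(back_level)
```

The list of back-level clients is what you would take to the application teams before you start rejecting connections.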
New applications and significant changes
Application teams will work with the MQ teams so there is a common understanding of the quality of service requirements.
- Whether the application is business critical
- What are the availability requirements (always available, or could accept a half hour delay while the server is restarted). This will help with deciding which systems should host the queues
- Expected capacity and throughput. Are the queues going to hold millions of messages in normal use? What is the expected throughput of data in MB per second, or per day. Once the application is in production, statistics can be used to show usage and queue depths
Why?
- Decide if a queue manager can be shared with other users
- Decide if it needs the high availability of MQ on z/OS or if midrange can provide the level of service
- Decide on how many queue manager instances are required, and how they need to be configured.
Security
Access to MQ resources will be done using groups or roles, not userids. Individual accesses shall be removed.
If there are multiple identical queue managers, they shall share a common security policy, for example by using LDAP or RACF, or by deploying setmqaut commands automatically to all of these queue managers.
Why?
- This makes it easier to manage. The manager of the group can decide if the person or userid needs access to the resource. The MQ team do not know if a person needs access or not.
- It makes management much easier from an MQ perspective, as there are fewer security objects. A person is connected to a group and gets access to all of the queue managers.
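One way to keep identical queue managers in step is to generate the same group-based setmqaut commands for each of them. The sketch below only builds the command strings; the queue manager names, the group name and the profile are made up for the example, and how you run the commands on each machine is up to your deployment tooling.

```python
def setmqaut_commands(queue_managers, profile, group, authorities):
    """Build one setmqaut command per queue manager, granting to a group."""
    auths = " ".join("+" + name for name in authorities)
    # The profile is quoted so the shell does not expand the '*'.
    return [
        f"setmqaut -m {qmgr} -n '{profile}' -t queue -g {group} {auths}"
        for qmgr in queue_managers
    ]


# Hypothetical queue managers, generic profile and group.
commands = setmqaut_commands(["QMA", "QMB"], "PAYROLL.*", "payroll",
                             ["put", "get"])
for command in commands:
    print(command)
```

Because access is granted to the group and never to a userid, the same generated commands apply unchanged when people join or leave the team.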
Identity and authentication shall be done using digital certificates.
These will be renewed annually, and will have the agreed levels of security (size of key, and algorithms)
Why?
- Increasing the key size will increase the security
- Newer algorithms can increase security, and some can be offloaded to special chips (z/OS ZIIPs), and so reduce the CPU used
The levels of encryption on channels will be reviewed annually, and levels of security improved if needed.
Why?
- Newer algorithms provide better encryption, and lower overall CPU cost.
- It can be a major piece of work upgrading clients and channels to a different cipher spec.
Limit which people need access to queue data in production.
- Maintain an audit list of users when they access production data (opening a production queue for input or output).
- Most people do not need access to the message data. Automation should be able to work with any messages (such as offloading them to a file) if needed.
- Applications people will need access to the message content, if there is a problem with the message content. A special userid could be used for this sort of activity.
Why?
- Limit how much data can leak
- Be able to identify who had access to data at given times.
Use of encryption
- Channels to external partners to use TLS
- You need to agree if you will use TLS on channels within your organisation.
- Decide if you will use AMS or not to protect message content in flight
- Document your policy
There will be no access where just a userid and password is required to access MQ
Using a userid and password is inherently insecure. Using certificates is more secure. Using Multi Factor Authentication is even more secure, for example where you need a digital certificate and a one-time password generated by a dongle.
Every Queue must be protected with a security profile
This could be a generic profile such as PAYROLL.*
Why?
- The MQ team do not know what business data will be stored on the queue
Operations
Deployment of definitions
- Objects will be defined in a central repository and be deployed to the machines as needed
- There will be no manual definition on the queue managers
- If an object on a queue manager is out of step with the repository, then the object definition will be updated to the repository standard, or the definition removed from the queue manager
- Use queue manager change events to monitor when objects are changed.
Why?
- If you are deploying the same definitions to multiple queue managers you need a process to do it. If you are deploying to just one queue manager – why have two processes instead of one. It is also easier to “quickly create another instance of the queue manager”.
- Have the MQ objects for a business application in one file. If you make a change to one object, deploy all of them with a “define replace”. Any objects that may have been changed by hand will be reset to the standards.
- Application objects which do not belong to a supported application should be deleted. This is good housekeeping practice.
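Checking whether a queue manager is out of step with the repository comes down to comparing two sets of definitions. Here is a minimal sketch, assuming both sides have already been read into dictionaries of object name to attributes. The object names and attribute values are invented, and system objects would need to be filtered out before the comparison.

```python
def find_drift(repository, queue_manager):
    """Compare definitions held as {object name: {attribute: value}}."""
    missing = [name for name in repository if name not in queue_manager]
    changed = [name for name, wanted in repository.items()
               if name in queue_manager and queue_manager[name] != wanted]
    unknown = [name for name in queue_manager if name not in repository]
    return {"missing": sorted(missing),    # in repository, not deployed
            "changed": sorted(changed),    # altered by hand: redeploy
            "unknown": sorted(unknown)}    # not in repository: delete


# Hypothetical definitions for one business application.
repository = {"PAYROLL.REQUEST": {"MAXDEPTH": 5000},
              "PAYROLL.REPLY": {"MAXDEPTH": 5000}}
deployed = {"PAYROLL.REQUEST": {"MAXDEPTH": 100000},
            "TEST.FRED": {"MAXDEPTH": 5000}}
drift = find_drift(repository, deployed)
print(drift)
```

The "changed" and "missing" lists are fixed by redeploying the application's file of definitions, and the "unknown" list is the candidate list for deletion.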
Automation
- Events shall be produced when an abnormality is detected, either from the application, from tools like AppDynamics, or from dependent software (MQ)
- Use automation to respond to messages and events. If a message occurs which is not automated – then change the automation to include it. This means you need to check events and messages every day.
- Use tools for identifying trends, this could be capacity, or a channel is connecting more often.
Why?
- Applications need to produce events or other notification when they detect internal problems. This could be logic errors, or conditions like queue full. You need to take action when problems occur.
- You need automation to handle these events, and take action, either to fix the problem, or to notify someone who can fix it
- If you use some of the cognitive insight tools on the events, they can be trained to spot problems before they occur
- You may be able to use these tools to help you do capacity planning.
You shall plan for availability problems
Disaster recovery – loss of main site
People need to know that this will have an impact on business critical applications. There is likely to be a loss of data in MQ (and databases as well). If you cannot tolerate this, you need a design which can handle it, for example sending a message to two sites.
Fail over – disks are mirrored, queue managers and other applications can be restarted on the other site.
People need to know
- There should be no loss of persistent data. There may be a delay of several minutes before persistent data is available in the queue manager. (The time taken to notice a problem, the time taken to shut the queue manager down, the time to start the queue manager on the other site).
- Plan for the loss of all non persistent data
- Systems can be configured so new work flows to a different system while the queue managers are recovering. This needs good planning and configuration
You need to know what level of availability the applications need. z/OS may be the best solution for this.
Applications
Connect information
Information about the connection to MQ shall be outside of the application, for example in a parameter file or passed as a parameter
Why?
- It is easier to move the application into production.
- It allows an application to be moved to a different queue manager without having to redeploy the application, for example when upgrading the queue manager or hardware.
- If you need to provide a second queue manager for scalability, you create a second queue manager and the MQ team changes the configuration file
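To show what keeping the connection information outside of the application can look like, here is a minimal sketch using a parameter file in INI format. The section name, the keys and the values are assumptions for the example. The text is held in a string here so the sketch is self-contained, but in practice it would be a file owned by the MQ team.

```python
import configparser

# Hypothetical contents of a parameter file kept outside the application.
SAMPLE = """
[mq]
queue_manager = QMA
channel = PAYROLL.SVRCONN
host = mqhost.example.com
port = 1414
"""


def load_connection(text):
    """Read the MQ connection settings from parameter file text."""
    parser = configparser.ConfigParser()
    parser.read_string(text)
    section = parser["mq"]
    return {
        "queue_manager": section["queue_manager"],
        "channel": section["channel"],
        "host": section["host"],
        "port": section.getint("port"),
    }


settings = load_connection(SAMPLE)
print(settings)
```

Moving the application to a different queue manager then means changing the file, not the program.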
Applications should avoid frequent connects. Use a Java connectionFactory with connection pooling
Why?
- MQ Connect is very expensive. It is easy to write a Java program which causes an MQCONN for every message. Using a connection factory means the connection is held and there is no MQDISC.
Applications should disconnect and reconnect every 15 minutes or so. This should be done at the end of a business transaction
Why?
- This allows connection balancing, and avoids the situation where one server is overloaded, and a second server (which started later) has no work. Using the Uniform Clustering in MQ V9 is not suitable for non trivial business applications, for example request-reply.
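The logic for reconnecting every 15 minutes or so can be kept separate from the MQ calls. Below is a minimal sketch of a timer which the application consults at the end of each business transaction. The class name and the injectable clock are assumptions for the example, and the actual disconnect and reconnect are left to the application.

```python
import time


class ReconnectTimer:
    """Tracks how long the current connection has been in use."""

    def __init__(self, max_age_seconds=15 * 60, clock=time.monotonic):
        self._max_age = max_age_seconds
        self._clock = clock
        self._connected_at = clock()

    def should_reconnect(self):
        """Call this only at the end of a business transaction."""
        return self._clock() - self._connected_at >= self._max_age

    def reconnected(self):
        """Call this once the disconnect and reconnect have completed."""
        self._connected_at = self._clock()


# Demonstration with a fake clock so the example runs instantly.
fake_now = [0.0]
timer = ReconnectTimer(clock=lambda: fake_now[0])
early = timer.should_reconnect()     # just connected
fake_now[0] = 16 * 60                # sixteen minutes later
late = timer.should_reconnect()
timer.reconnected()
after = timer.should_reconnect()
print(early, late, after)
```

Checking only between business transactions means a request and its reply are never split across two connections.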
Application shall explicitly use MQDISC or close the connection factory.
Why?
- If a Java application returns, it may not automatically go through disconnect processing, resulting in “lost” connections and an increase in the number of connections in use. Too many connections can cause outages
- If applications just return on detecting a problem, rather than close the MQ resources before returning, this is a defect.
Messages
There shall be no messages on application queues older than 8 days.
Why?
- Application messages should be processed either within seconds, or perhaps overnight or at the weekend.
- If there are any old messages, this means that there were some application problems. Perhaps an application timed out and did not clean up the late-arriving reply, or an application abended while processing the message. In either case there should be an alert, or an action to process the residual message
- Deep application queues impact the time to restart, and recover after failure.
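A housekeeping tool can work out the age of a message from the PutDate (YYYYMMDD) and PutTime (HHMMSSTH) fields in its message descriptor, both of which are in GMT. The sketch below shows only the age calculation; the function names are made up for the example, and browsing the queue to obtain the fields is not shown.

```python
from datetime import datetime, timedelta, timezone

MAX_AGE = timedelta(days=8)


def put_timestamp(put_date, put_time):
    """Combine MQMD PutDate (YYYYMMDD) and PutTime (HHMMSSTH), both GMT."""
    stamp = datetime.strptime(put_date + put_time[:6], "%Y%m%d%H%M%S")
    hundredths = int(put_time[6:8])
    return (stamp.replace(tzinfo=timezone.utc)
            + timedelta(milliseconds=10 * hundredths))


def is_too_old(put_date, put_time, now):
    """Return True if the message has been on the queue longer than MAX_AGE."""
    return now - put_timestamp(put_date, put_time) > MAX_AGE


# Fixed "now" so the example gives the same answer every time it runs.
now = datetime(2024, 3, 20, 12, 0, 0, tzinfo=timezone.utc)
ten_days_old = is_too_old("20240310", "09300000", now)
one_day_old = is_too_old("20240319", "09300000", now)
print(ten_days_old, one_day_old)
```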
The appropriate message persistence shall be used.
- Persistent for critical data,
- Non persistent for non critical data or where there is an end-to-end recovery solution at the application level.
Guideline: Updating a resource needs persistent data unless the application does end-to-end recovery. Inquiries (or repeatable requests) should be non persistent.
The application requestor determines whether the data is critical, and any “server” will respect this. For example, if the input message is persistent the reply should be persistent.
Persistent messages
- Persistent messages will not use message expiry
- There will be a process for removing persistent messages if the getting application has gone away. This is to prevent a build up of orphaned messages on a queue
- Persistent messages are usually processed within Syncpoint. An exception is audit type messages that you want to be produced even if the transaction rolls back.
Non persistent message
- Non persistent messages shall have message expiry set.
- Non persistent messages are often processed out of syncpoint
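The two rules on expiry are easy to check mechanically, for example in a code review tool or a test. Here is a minimal sketch, where the function name is an assumption for the example. It uses the convention of the MQMD Expiry field, where the value is in tenths of a second and -1 (MQEI_UNLIMITED) means the message never expires.

```python
UNLIMITED = -1  # MQEI_UNLIMITED: the message never expires


def check_message_policy(persistent, expiry):
    """Return the list of rules broken by one message's settings.

    'expiry' is in tenths of a second, as in the MQMD Expiry field.
    """
    problems = []
    if persistent and expiry != UNLIMITED:
        problems.append("persistent message has an expiry interval")
    if not persistent and expiry == UNLIMITED:
        problems.append("non persistent message has no expiry interval")
    return problems


good_update = check_message_policy(persistent=True, expiry=UNLIMITED)
good_inquiry = check_message_policy(persistent=False, expiry=300)  # 30 seconds
bad_inquiry = check_message_policy(persistent=False, expiry=UNLIMITED)
print(good_update, good_inquiry, bad_inquiry)
```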
Applications will check the backout count of a message and use Backout Queue
- If a message has a backout count > 2, or has bad content (for example there is an error in the data, or an invalid header is detected), then the message should be put onto the backout queue, out of syncpoint and with no expiry set.
Why?
- The program detecting the problem may not be doing the commit or backout. Out of syncpoint means it cannot be rolled back
- There shall be a process for taking messages from this BO queue and notifying the application team.
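The decision described above can be written as a small function which the getting application calls for every message. This is only a sketch of the decision: the function name, the returned dictionary and the default threshold of 2 are assumptions for the example, and the put to the backout queue itself is not shown.

```python
def route_message(backout_count, content_ok, threshold=2):
    """Decide what to do with a message that has just been got."""
    if content_ok and backout_count <= threshold:
        return {"action": "process"}
    reason = "bad content" if not content_ok else "backed out too often"
    return {
        "action": "put to backout queue",
        "in_syncpoint": False,     # so the put cannot be rolled back
        "expiry": "unlimited",     # keep it until someone has looked at it
        "alert": reason,           # tell the application team why
    }


normal = route_message(backout_count=0, content_ok=True)
poisoned = route_message(backout_count=3, content_ok=True)
invalid = route_message(backout_count=0, content_ok=False)
print(normal, poisoned["alert"], invalid["alert"])
```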
Application coding
Messages shall have the appropriate msgid and correlid set.
- A requestor should always use msgid/correlid (a put followed by a get for the answer from the server)
- A server using “get next message”, shall not specify a msgid or correlid value for the get, and should specify the appropriate msgid and correlid in any response
Why?
- Msgid and Correlid are used to get a specific message. You cannot assume that a dedicated queue is being used by the application, and a queue shared by other application instances may be used for performance reasons.
MQGETs should use the MQGMO_FAIL_IF_QUIESCING option.
Why?
- Without this a long get will prevent a queue manager from shutting down.
Applications should handle return codes appropriately
- Operational return codes such as queue full, queue disabled, should be handled either within the application, or via an event and automation. A queue full condition may wait for a period and retry.
- A return code indicating data or programming problems, such as message too large or message format error, should generate an alert or message for automation to pick up. The application will typically terminate
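The split between operational problems and data or programming problems can be made explicit with a table of reason codes. The sketch below uses a handful of well known MQ reason codes. Which codes go in which group, and the action names, are assumptions for the example, and each application team should agree its own table.

```python
# Conditions which operations or automation can clear: wait and retry.
OPERATIONAL = {
    2053: "MQRC_Q_FULL",
    2051: "MQRC_PUT_INHIBITED",
    2016: "MQRC_GET_INHIBITED",
}
# Conditions which need a person to fix the data or the program.
DATA_OR_PROGRAMMING = {
    2030: "MQRC_MSG_TOO_BIG_FOR_Q",
    2110: "MQRC_FORMAT_ERROR",
}


def classify(reason):
    """Map an MQ reason code to the action the application should take."""
    if reason == 0:
        return "continue"
    if reason in OPERATIONAL:
        return "raise event, wait and retry"
    if reason in DATA_OR_PROGRAMMING:
        return "raise alert and terminate"
    return "raise alert"    # anything unexpected must still be reported


actions = [classify(code) for code in (0, 2053, 2110, 2009)]
print(actions)
```

The important part is the last line of the function: a reason code nobody planned for still produces an alert, with the code in it.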
JMS or Java exceptions shall be reported, so the underlying error is reported.
When “An error occurred” alert is produced with no exception data, it makes it hard to diagnose the problem. Provide the exception data, the object being used, the program name and the line within the program. For example have a unique error message which is produced in only one place.
Applications shall use get with convert option.
This is in case the queue manager is moved to a different platform, or a different source is used.
If processing data within syncpoint applications shall do an explicit commit or backout.
This is because the default is different on z/OS and mid-range. Data processed out of syncpoint does not need a commit or backout.
If an application is getting a specific reply, and the application times out, then there must be a process to handle the message if it finally arrives.
Why?
- Non persistent message with expiry will eventually time out
- Persistent messages with unlimited expiry will just stay on the queue until an action is taken. They need another process, for example an overnight job which drains the queue of messages more than a few minutes old.
- There are utilities like QLOAD from MQGEM which can move messages that meet the selection criteria, for example age > 5 minutes, from a queue to another queue or a file.
Have a process to delete any temporary queues that were used.
Why?
- If there were some messages left behind, the queue may not have been deleted on close.
- There needs to be a process to handle any messages on this queue, and deleting the queue.
After a time out, if the application decides to resend the request, the back end application must be able to handle a missing request, and a possible duplicate request
Why?
- The applications team need to architect the flow. If the application doing an MQGET does not get a message within the specified time, it needs to either resend the message, or notify the requestor. If the message is resent, and may flow via a different route the second time, the back end application needs to be able to handle a missing message, or a possible duplicate message. The original message may have got stuck on the way to the back end, or it got to the back end, and the reply was stuck.
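One common way for a back end to cope with a resent request is to remember the reply it gave for each request identifier and return the same reply if the request turns up again. Here is a minimal sketch. The class, the request identifier and the in-memory dictionary are assumptions for the example; a real back end would keep this record in a database, updated in the same unit of work as the business update.

```python
class BackEnd:
    """Processes each request once, however many times it arrives."""

    def __init__(self):
        self._replies = {}    # request id -> reply already produced
        self.work_done = 0    # how many times the real work was performed

    def handle(self, request_id, payload):
        if request_id in self._replies:
            # Duplicate: resend the earlier reply, do not redo the work.
            return self._replies[request_id]
        self.work_done += 1
        reply = "processed " + payload
        self._replies[request_id] = reply
        return reply


server = BackEnd()
first = server.handle("REQ-001", "pay invoice 42")
second = server.handle("REQ-001", "pay invoice 42")   # the resend
print(first == second, server.work_done)
```

If the original request never arrived, the resend is simply the first copy the back end sees, so the same code covers the missing message case as well.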
All gets shall check the Backout Count value in the message.
Why?
- To prevent an endless loop of MQGET, problem, rollback. For example if the backout count is >= 2 then put it on the backout queue, and produce an alert.
The get wait time out value shall be less than 10 minutes.
Why?
- For an application waiting for a response to a request, a typical value is 1 second. This is application dependent. The time for a request and receiving the reply, may be 50 milliseconds.
- This 10 minutes allows the MQDISC to be within 15 minutes.
Applications shall not poll a queue on a short timer (under 1 minute)
Why?
- An MQGET with a long wait is more efficient as it uses a lot less CPU
- If the application has to get from multiple queues, then you can use MQ to deliver a message to your application when a message arrives.
- Polling a queue is very expensive for CPU used. This usually causes a No Message Found response.
Application builds
- If you are having seemingly inexplicable problems with an MQ program compiled on a previous release, then recompile with the level of MQ being used as the first recovery option. This rebuild should use the header files and Java files for the level of MQ being used.
- All applications using MQ shall be rebuilt at least once a year.
Why?
- If a problem is found in an application and it has to be rebuilt, then it may be using different libraries from the previous time it was compiled, for example MQ V9 C header files instead of MQ V8 header files. The MQ interface code (stubs) may also have changed. If there are problems we need to know before we have a critical problem.
- When the level of MQ has been changed on the servers, the applications should be recompiled as part of the post migration work.
- There may be a different level of Java, and this may behave differently.
- We know we have the source of the program which matches the executable
This is controversial!
Application owners own the application data.
- Application data in MQ queues is the responsibility of the application owner. The primary owner is the application getting the message.
- If the queue fills up, the applications team are responsible for clearing the queue, either by starting more applications, or by moving messages out of the queue
- If the message has bad data, the application getting the data is responsible for resolving the problem, even though a different application may have put the message, or the sender may be external to your enterprise.
- The MQ team are responsible for MQ messages, on transmission queues, clustering, events, etc.
Operations: Naming standards
- Existing naming standards shall be used. There may be many existing standards today. Application teams shall work with the MQ team when defining new resources to use the appropriate standard.
- There will be security profiles to control access to MQ resources. These cover putting and getting messages, and defining, altering MQ objects. The controls will be group/role based.
Operations housekeeping
Applications can be moved between different MQ servers, to allow the MQ team to manage the queue managers
- This includes moving work to better queue managers, and removing old queue managers
- It may need a CCDT or other mechanism to “deploy” definitions to applications and so pick up the new queue managers, and handle changes to cipher spec.
- You need to know which applications use which queues, so definitions can be made on new servers. You may need to use an application trace to identify objects on mid-range, as the statistics and accounting data report the queue being used, and not the opened queue. For a clustered queue the queue name may just be “SYSTEM.CLUSTER.XMIT.QUEUE”.
- Provide a multiple back end solution to applications for availability
“Object not found” shall be an error.
Why?
- If a request is made to open an object which does not exist in the queue manager, it will send a request to a full repository, if clustering is being used. The full repository remembers the request.
- This wastes CPU, and can cause a large number of objects to be stored in the cluster repository cache.
- The queue should be put/get disabled if no access is needed.
- An object not found event is written to the MQ event queue. You could have an application which processes the event, and generates the object either from a backup, or a standard template.
Unused objects shall be deleted after a period not less than 13 months
Why?
- Good housekeeping practice
- Some objects may only be used once a year, so they need to be kept for longer than this.
- You can use the DIS QS(queuename) LPUTTIME( ) to see when a message was last put. This information is discarded when the queue manager is shut down. It would be good practice to issue the command for all queues once a week, and before shutting down the queue manager, to capture when the last message was put.
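Once the last put dates have been captured each week, finding the deletion candidates is a simple date comparison. The sketch below assumes the captured dates have already been collected into a dictionary of queue name to date. The queue names and dates are invented, and the function names are assumptions for the example.

```python
from datetime import date


def months_ago(today, months):
    """Return the date a whole number of months before 'today'."""
    index = today.year * 12 + (today.month - 1) - months
    year, month = divmod(index, 12)
    # Clamp the day to 28 so the result is valid in every month. This can
    # only move the cutoff earlier, so objects are kept slightly longer.
    return date(year, month + 1, min(today.day, 28))


def deletion_candidates(last_put, today, months=13):
    """Queues whose last recorded put is more than 'months' months ago."""
    cutoff = months_ago(today, months)
    return sorted(name for name, when in last_put.items() if when < cutoff)


# Hypothetical dates gathered from the weekly captures.
last_put = {"PAYROLL.YEAREND": date(2023, 4, 5),
            "OLD.TEST": date(2022, 11, 1)}
candidates = deletion_candidates(last_put, today=date(2024, 3, 20))
print(candidates)
```

A queue which is only used at year end survives the check, which is the reason for choosing 13 months over 12.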