With MQ, if a message cannot be successfully delivered, it can be put on a Dead Letter Queue for later processing.
You can have multiple queues
- The system dead letter queue, where the MQ puts messages it cannot processed,
- Application dead letter queues, and application can put messages to a queue,
- The AMS dead letter queue for messages which had errors during get or put, for example a certificate mismatch.
Messages can be put to these queues for a variety of reasons.
- Transient problems
- If a channel is putting a message to a queue, and the queue is full, then the channel can put the message to the Dead Letter Queue. The DLQ handler can then try to put the message to the original queue, and retry a number of times after an interval. If the queue full condition was transient, then the DLQ handler is likely to succeed. If an application stops processing a queue, you can get quickly get thousands of messages on the DLQ queue.
- The queue is put disabled. A queue can be set to put disabled, for example to stop messages from going onto a queue during queue maintenance. Once the maintenance has been done the queue can have put enabled.
- The putting channel is not authorised to put to the queue, so the message gets put to the DLQ. An administrator needs to check to see if the putter is allowed to put the message. If so, fix the security and put the message back on original queue. If not remove the message, and educate the developer.
- An AMS protected message has a problem, for example an unauthorised user has signed a message, or the id getting the message does not have a certificate to decrypt a message. You need to resolve any local certificate problems, or send the original message back to the requester saying it is in error.
- The message is too large for the queue. The administrator needs to educate the developer and/or make the queue max message size larger.
You may have a policy that non persistent messages for a particular queue which end up on the dead letter queue should be purged. Persistent message for another queue should have special treatment.
You may want administrators to be able to look at the meta data about a message, destination queue, MSGID, the list of recipients who can decrypt a message; but not to look at the message content.
Setting up your environment to cover these areas need considerable planning.
Implementing a solution
You want to try to keep the main DLQ close to empty, for example if your DLQ fills up with non persistent inquiries, then putting an important persistent message to the DLQ may fail.
You can use the runmqdlq program on midrange or CSQUDLQH on z/OS, to specify rules for automatic processing of messages on the DLQ.
You can select on attributes like original destination queue name, the reason why the message was on the DLQ, userid in the MQMD; and specify an action
- Retry the put to the original queue
- Move to another queue
- Purge it
- Leave it
When a message is processed on the DLQ, the rules are applied, and the action of the first matching rule is applied. For example
DESTQ(MYQUEUE) REASON(MQRC_Q_FULL) ACTION(RETRY) RETRY(5)
DESTQ(MYQUEUE) REASON(MQRC_Q_FULL) ACTION(FWD) FWD(MYQUEUEOVERFLOW) HEADER(YES)
This says that if a messages destination was MYQUEUE, and the reason code was MQRC_Q_FULL, it retries the put to the queue, at most 5 times. After 5 attempts, the first rule is skipped, the second rule is used, and the message is forwarded to the queue MYQUEUEOVERFLOW keeping the DLQ header.
DEST(INQ*) PERSIST(MQPER_NON_PERSISTENT ACTION(DISCARD)
For message destination INQ* and non persistent messages, then just discard them.
DEST(INQ*) PERSIST(MQPER_PERSISTENT ACTION(LEAVE))
For message destination INQ* and persistent messages, then just leave them on the queue, for some other processing.
If runmqdlq or CSQUDLQH is restarted, then all processing is reset.
For transient type problems you may want to consider
- Non persistent messages for a set of queues get purged, dont even try to put them back on the queue.
- Persistent messages for INVOICE* queues get moved to INVOICE_DLQ queue, where you have another DLQ monitor running on the queue.
For administrator type problems
- You could pass non persistent messages to an admin_DLQ_NP queue, and have a program which reads the meta data, and prints it to a file, then deletes the original message
- You could pass persistent messages to an admin_DLQ_P queue
- have a program which reads the meta data, and prints it to a file, and leaves the message on the queue.
- Using the meta information resolve the problem.
- Have another program which takes the msgid and correlid as input parameters, then puts the message on the original queue. (If there is only one message, you could use the default DLQ handler to do this.)
For AMS problems
- This is complicated by having to use a different queue. If the DLQ handler tries to put to the AMS protected queue, it will be “protected” (enciphered) again. You need to use put the message to an alias queue, with the original queue as the target. On midrange Java and C clients can disable AMS processing, either by using an environment variable, or through the MQCLIENT.ini file. See here.
- This is also complicated by possibly needing access to information in the payload, such as the list of recipients, and decrypting the message to get the DN of the signer.
- Once you have resolved the problem, have another program which takes the msgid and correlid as input parameters, and puts the message on the alias queue (if there is only one message you could use the default DLQ handler to do this).
How do I check it I have got it right?
It is worth putting a process in place to monitor the depth of the dead letter queue, and if it does not become empty a few a minutes, display the contents of the queue, and add rules to handle the residual messages.
I do not think that IBM provides a list of return codes of messages that it puts onto the DLQ, I think you’ll have to go through the list (over 500!), and put a rule in place for each one. If an application invents its own return codes, you may need rules for these as well.
My quick look at the list includes
2030 (07EE) (RC2030): MQRC_MSG_TOO_BIG_FOR_Q
2031 (07EF) (RC2031): MQRC_MSG_TOO_BIG_FOR_Q_MGR
2033 (07F1) (RC2033): MQRC_NO_MSG_AVAILABLE
2051 (0803) (RC2051): MQRC_PUT_INHIBITED
2052 (0804) (RC2052): MQRC_Q_DELETED
2053 (0805) (RC2053): MQRC_Q_FULL
2056 (0808) (RC2056): MQRC_Q_SPACE_NOT_AVAILABLE
2071 (0817) (RC2071): MQRC_STORAGE_NOT_AVAILABLE
2102 (0836) (RC2102): MQRC_RESOURCE_PROBLEM
2120 (0848) (RC2120): MQRC_CONVERTED_MSG_TOO_BIG
2141 (085D) (RC2141): MQRC_DLH_ERROR
2142 (085E) (RC2142): MQRC_HEADER_ERROR
2148 (0864) (RC2148): MQRC_IIH_ERROR
2149 (0865) (RC2149): MQRC_PCF_ERROR
2150 (0866) (RC2150): MQRC_DBCS_ERROR
2342 (0926) (RC2342): MQRC_DB2_NOT_AVAILABLE
2345 (0929) (RC2345): MQRC_CF_NOT_AVAILABLE
2348 (092C) (RC2348): MQRC_CF_STRUC_AUTH_FAILED
2349 (092D) (RC2349): MQRC_CF_STRUC_ERROR
2362 (093A) (RC2362): MQRC_BACKOUT_THRESHOLD_REACHED
You may want common ones, such as queue full and not authorised, on a per queue basis, and all the less common ones, such as all of the z/OS ones all putting to one “admin” queue.