Aug 31 2015
I was recently at a customer site where a message had been lost somewhere in their MQ infrastructure, and I was asked to document how to find it.
Take a scenario where messages are sent from z/OS down to a queue in a cluster on distributed.
- Check the chinit job log for messages. You can use the MQLOG* exec described elsewhere in this blog to display error messages. Check for messages similar to
- +CSQX506E +cpf CSQXRCTL Message receipt confirmation not received for channel z_to_Linux
- +CSQX527E +cpf CSQXRCTL Unable to send message for channel z_to_Linux
- +CSQX544E +cpf CSQXRCTL Messages for channel z_to_Linux sent to remote dead-letter queue;
- Check that the channels are started, using +cpf DIS CHS(…)
- If this is a cluster channel, find out the possible locations of the queue. Use +cpf DIS QCLUSTER(queuename). This reports which queue managers host an instance of the queue, and channel status information.
- Check each queue manager listed. If the message was put 10 minutes ago, and the last time a message was sent over a channel was over an hour ago, the message was clearly not sent over that channel.
- For each potential queue manager
- check the logs – are any problems reported?
- Check the application queue on the system. If it is a remote or clustered queue, find out where this queue is located, and see if the message is there.
- If there are messages on the queue, the message may be stuck there. Use DIS QSTATUS to display the age of the oldest message on the queue and the number of input handles open. If the number of input handles is 0, no application has the queue open for input.
- Use DIS QMGR DEADQ to identify the dead-letter queue for the queue manager. This may be a remote or clustered queue, so you will have to find where the queue(s) are located. Check whether the dead-letter queue has depth > 0; if so, investigate the messages on it.
- The message may have expired, in which case it was deleted.
- If the EXPIRY report option was specified, a report message will be sent to the reply-to queue. Did the application reading that queue know what to do with a report message – did it report the event, or did it just throw it away?
- Is the report message stuck somewhere, or is it undeliverable?
- Did an application process it and discard it? An application may have logic like: if message type A then do A_logic; else if message type B then do B_logic; else ignore it and get the next message.
- Is it a shared queue – so the message was processed on a different LPAR?
- Did someone clear the queue perhaps using the CLEAR QLOCAL command?
- Did the application that put the message commit it – or did it roll back? If it rolled back the message was not successfully put.
- It may be a poison message: an application does MQGET, abends, rolls back, and repeats this indefinitely. Applications need logic to say: if the message has been backed out more than 3 times, do not look inside it – just put it somewhere safe, such as the dead-letter queue.
- Do not assume that all your queue managers are identical. On 99% of your distributed queue managers the definition may be the same – a local queue with a max message size of 10 KB – but on one queue manager the queue may be defined as a cluster queue going somewhere totally different, or have a max message size of 1 KB.
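The display commands used in the checklist above can be collected into a short MQSC sketch. The channel and queue names here are hypothetical examples – substitute your own, and on z/OS prefix the commands with your command prefix (shown as +cpf above):

```
* Is the channel running, and when did it last send a message?
DIS CHSTATUS(Z_TO_LINUX) STATUS LSTMSGDA LSTMSGTI
* For a clustered queue, which queue managers host an instance?
DIS QCLUSTER(APP.REQUEST) CLUSQMGR
* Depth, age of the oldest message, and open-for-input handles
DIS QSTATUS(APP.REQUEST) TYPE(QUEUE) CURDEPTH MSGAGE IPPROCS
* Find the dead-letter queue, then check its depth
DIS QMGR DEADQ
DIS QLOCAL(SYSTEM.DEAD.LETTER.QUEUE) CURDEPTH
```

If IPPROCS is 0 on the application queue, or CURDEPTH is non-zero on the dead-letter queue, you have found where to look next.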
New version of MP1B program to print MQ stats. Also includes OEMPUT to put+get and see CPU costs and elapsed times
Aug 23 2015
In MP1B there is an updated version of the MQSMF program which prints out MQ SMF accounting and statistics data.
- CSV files are produced for gets and puts by queue per transaction, and summarized by queue.
- Output files are opened using fopen and now use the "w" attribute to (re)write the files. Previously the "a" attribute was used, which caused output to be appended to USS files.
- Fields present in the record are now displayed, such as Open Suspend Time
- New CSV files for SMDS activity, space, and buffers
- A summary of all queue activity from the class 3 records is displayed in //QALL
- Queues doing page set I/O are displayed in //PSIDQIO
- From the accounting class 3 data, queues which directly cause I/O to the page set are reported in //BUFFIO
- More threshold keywords. SMDSWriteTime, SMDSReadTime, SMDSWaitFree, SMDSWaitBusy.
- The buffer pool CSV includes the number of pages in use when the SMF record was created. % full is now the percentage of the highest number of pages used; previously it was the percentage of the current pages.
- Chinit SMF: channels now record and display
- DNS resolution time
- SSL certificate serial number
- CN from SSLCERT
- SMDS reporting now calculates the MB/second
- Numbers were sometimes inconsistent between CF report and CSV file
- Always report the summary of MQ SMF records and subtypes found
- Fixed problems with processing the data, where invalid results were displayed.
- Fixed problems where summary information in //TASKSUM was not displayed properly
- Display buffer pool pages read and written per second
- Log maximum times are now reported as 1:1 and 2:1, where the first number is the log copy number and the second indicates one page written.
- In V7.0.1 the amount of data logged was the total across the logs – you could not tell whether this was due to single or dual logging. In V7.1.0 and later, the data logged per individual log is displayed. This may appear as a reduction in data logged per second.
- The label of the log statistics field Checkpoints has been changed to LogLoad Checkpoints (LLCheckpoints) to show this is due to the LOGLOAD value, not the logs filling up.
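A minimal sketch of JCL to run the MQSMF program follows. The data set names are placeholders, and the DD names other than the report DDs mentioned above (//QALL, //TASKSUM and so on) are assumptions – check the documentation shipped with the MP1B SupportPac for the full set:

```
//MQSMF    EXEC PGM=MQSMF,REGION=0M
//STEPLIB  DD DISP=SHR,DSN=MY.MP1B.LOAD       your MP1B load library
//SMFIN    DD DISP=SHR,DSN=MY.SMF.UNLOADED    dumped SMF 115/116 records
//SYSPRINT DD SYSOUT=*
//QALL     DD SYSOUT=*                        summary of all queue activity
//TASKSUM  DD SYSOUT=*                        task summary
```

Only a couple of the report DDs are shown; the program produces many more output files, including the CSV files described above.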
Also present in the SupportPac is a program, OEMPUT, which was previously available in the no-longer-supported SupportPac IP13.
This runs in z/OS batch. It puts and gets messages from queues, and reports the CPU time used, the elapsed time taken, and the number of messages processed per second.
Here are some typical scenarios
How long does it take, and how much CPU is used, to do MQOPEN, 1000 × (put of a 1 KB persistent message, commit), MQCLOSE?
Total Transactions : 1000
Elapsed Time : 0.540 seconds
Application CPU Time: 0.038 seconds (7.1%)
Transaction Rate : 1850.481 trans/sec
Round trip per msg : 540 microseconds
Avg App CPU per msg : 38 microseconds
We can see that the put and commit (essentially the logging time) in the simplest, no-load case takes about 540 microseconds.
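The sequence being timed can be sketched in MQI-style pseudocode (this is a sketch of the scenario, not the actual OEMPUT source):

```
MQCONN(qmgr)
MQOPEN(queue, MQOO_OUTPUT)
repeat 1000 times:
    MQPUT(1 KB persistent message)
    MQCMIT()        -- forces the log write: the bulk of the 540 microseconds
MQCLOSE(queue)
MQDISC()
```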
Request reply model
At what rate can one batch job put a message to the server program and get the reply? The job ran for 1 minute.
The CPU used by the jobs is reported. The program looks at internal control blocks to calculate the CPU used by the specified jobs in the interval.
In the data below the server used 101 microseconds of CPU per message.
This was for a test system with only one application-server pair running.
If other work was running in the queue manager, then the queue manager and CHINIT CPU costs would include that work's MQ CPU as well, so these figures should be used with care, in carefully controlled environments.
Total Transactions : 36248
Elapsed Time : 60.002 seconds
Application CPU Time: 3.348 seconds (5.6%)
Transaction Rate : 604.114 trans/sec
Round trip per msg : 1655 microseconds
Avg App CPU per msg : 92 microseconds
Jobname.ASID   TCB(uS)  SRB(uS)  Tot(uS)  (%)
                 /tran    /tran    /tran
-------------  -------- -------- -------- ----
MQPAMSTR.00A5  00000001 00000126 00000128  7.8
MQPACHIN.00A6  00000000 00000000 00000000  0.0
MQSERVER.00D5  00000100 00000000 00000101  6.1
MQS*           00000100 00000000 00000101  6.1
Total CPUmicrosecs/tran                230
Grand Total CPUmicrosecs/msg           322
The round trip time was 1655 microseconds. This involved application (put, commit), server (get, put, commit), application (get, commit) – so 3 commits, 2 puts and 2 gets. Most of the time is spent logging. In the simplest case above the put and commit took 540 microseconds, and 3 × 540 = 1620, which is close to the measured time of 1655 microseconds.
Remote request reply model
Put a few messages to a remote queue manager and have them sent back to the reply queue. This can give a measure of the network time.
You can use non-persistent messages to see the network time, and persistent messages to include logging time and time spent in batches.
You can specify
- The name of the request queue (for puts)
- The name of the reply queue ( for gets)
- Message persistence
- Message size
- The name of a data set containing message content
- How often to print out some or all of the messages sent and received
- A message property
- Fields in the MQMD
- and more
Printing SMF 42-6 DASD stats by data set
I have a program which prints out the SMF statistics by data set – the SMF 42-6 records. So you can see the I/O statistics for MQ data sets, including connect time by MQ log and by stripe.
I can make this available to people if there is sufficient interest. Please let me know at PAICE@UK.IBM.COM
Aug 5 2015
One customer said to us ‘If we page – we die’. Most customers have enough real storage so that none of their production applications page. If they have any paging then they have performance problems.
It is often cheaper to buy more real storage than to manage applications to use less real storage.
So the simple answer to "how much real storage does MQ need?" is: you do not need to know – just make sure your system does not page or swap for storage, even at peak time.
In z/OS the 31-bit address space size is 2 GB. Of this, about half a GB is used by the nucleus, ECSA and other z/OS components, so the maximum usable virtual storage available to a region is about 1.5 GB.
You can determine this size from tools like RMF’s VIRTUAL STORAGE ACTIVITY report which has the Private and the EPVT values.
STATIC STORAGE MAP
AREA      ADDRESS     SIZE
EPVT     23F00000    1473M
ECSA      B788000     391M
EMLPA     B787000       4K
EFLPA           0       0K
EPLPA     713F000    70.3M
ESQA      1BC0000    85.5M
ENUC      1000000    11.8M
----- 16 MEG BOUNDARY -----
NUCLEUS    FD0000     192K
SQA        ECF000    1028K
PLPA       D12000    1780K
FLPA            0       0K
MLPA            0       0K
CSA        A00000    3144K
PRIVATE      2000    10.0M
PSA             0       8K
So the total private region on my system is 1473 + 10 MB = 1483 MB, or about 1.5 GB.
So for MQ V7 the QMGR and the CHINIT could each use up to 1.5 GB, or up to 3GB between them.
With MQ V8 we have 64-bit buffer pools. You do not want these to page, so you need enough real storage for their working set. If you just move a buffer pool from below the bar to above the bar, the real storage used will not change significantly. If you then use the virtual storage just freed up below the bar, the total real storage used will potentially increase.
So in MQ V8 the maximum real storage used is going to be 3 GB + the size of the 64-bit buffer pools. In simple terms, allow for 4 GB of real storage.
You also need to make sure that your applications do not page or get swapped for storage.
The numbers above are the maximum expected real storage.
If your buffer pools are usually close to empty, your messages are small, and you handle hundreds of channels, then your real storage use will be much smaller.
If your channels are processing 100MB messages – and you are encrypting these messages – you will need 200MB of virtual/real storage per message. Five concurrent channels will use 1GB of real storage in the chinit, and the buffer pool in the queue manager is likely to fill up.
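As a quick back-of-the-envelope check of those figures (a 100 MB message plus a 100 MB encrypted copy, five concurrent channels):

```shell
# Each in-flight message needs the original plus an encrypted copy
msg_mb=$(( 100 + 100 ))              # 200 MB per message
channels=5
echo "$(( msg_mb * channels )) MB"   # chinit storage for 5 concurrent channels
# prints: 1000 MB
```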
You need to plan for a bad day.
Consider a normal day: you have 1000 branches sending messages, you process 10 messages a second, and only two channels are actively sending data at any one time. Your back-end CICS has one transaction processing these.
Now consider you have an outage, after which each branch has many messages queued up. Instead of 10 messages a second with two channels active, you have 1000 channels actively sending many messages, so you might need 500 times the buffer space – which will need real storage. With this huge increase in workload you now need five CICS regions to process it.
You need much more real storage after this outage than you normally use in day to day running.
In the worst case MQ may use 3GB of real storage during this recovery time.
If you do not have enough real storage then your recovery may be very slow, as everything is paging, and this slows down the work.
Aug 5 2015
I struggled to get MQ FTE on z/OS working at V7.0.1, and ended up writing my own JCL and making it available via this blog.
However, while digging around in my MQ FTE 7.0.1 files on z/OS I found some sample JCL which is not documented in the Information Center. If I had known this was there it would have made my life much easier.
In /HMF7704/samples/JCL there are two directories:
- example: this has samples filled in with the definitions used by one of our developers
- source: this has the JCL with placeholders – for example, you change ##AGENT## to the name of the agent
I used the samples directory. These are in a funny code page – I found I could oedit the files but not obrowse them.
You can either copy these files to your private directory for example
cp /HMF7704/samples/JCL/source/* ~/my704fte
where ~ means my home directory
or, to copy them into an existing PDSE so they can be read, use
cp -O u * "//'PAICE.FTE.JCL704'"
You cannot use OGETX as it does not do the conversion.
There are three types of samples
- BFGY* which use a shell command, and so execute the .profile shell script for the user
- BFGX* which invokes the command directly and configuration is specified in a STDENV file
- BFGZ* which runs under JZOS
Which is the best set to use?
- Running under JZOS (the BFGZ* samples) has the advantage that you can use P jobname to stop an agent, so this is my preferred way of running. It uses a shell environment to set up the parameters.
- Running with the shell (BFGY*) means you can build up complex strings in a shell script, for example building up the LIBPATH over several lines. The complication is that you may not be entirely clear which scripts are running and setting your FTE parameters.
- Running with what I think of as the bare metal (BFGX*), you specify the parameters in STDENV. This is not a shell environment, so you cannot build up strings, and you are limited by the line length.
If you are using a shell you can build up long values over several lines.
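For example, a sketch of building LIBPATH in stages in a shell script, using the same hypothetical paths as the single-line example below:

```shell
# Build LIBPATH one component per line - each line stays short and readable
LIBPATH=/java/java71_bit64_sr1/J7.1_64
LIBPATH=$LIBPATH:/mqm/V8R0M0/java/lib
LIBPATH=$LIBPATH:/db2/db2v10/jdbc/lib
export LIBPATH
echo "$LIBPATH"     # display the assembled colon-separated path
```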
If you are using the bare metal you have to code everything on one line:
LIBPATH=/java/java71_bit64_sr1/J7.1_64:/mqm/V8R0M0/java/lib/:/db2/db2v10/jdbc/lib and run out of space before the end of the line!
If you use either of the BPXBATCH solutions you have to submit a job to issue the fteStopAgent command. The JZOS solution supports P JOBNAME, which will be more familiar to z/OS people.
These blog posts are from when I worked at IBM and are copyright © IBM 2015.