How can I replicate the RACF definitions for MQ on z/OS?

If you are the very careful person who makes all updates to RACF only through batch jobs, then this is easy – take the old jobs, and change the queue manager name and rerun them.

For the other 99.99% of us,  read on…

Even if you have been careful to keep track of any changes to security definitions, someone else may have made a change, either using the native TSO commands or via the ISPF panels.
You can list the RACF database, but there is no easy way of listing it in command format, so that you can do a global rename and submit the commands.

I have found two ways of extracting the RACF definitions.

  1. Using an unloaded copy of the RACF database
  2. Using RACF commands to extract and recreate the requests

Using an unloaded copy of the RACF database

I discovered dbsync on a RACF tools repository which does most of the hard work.   You can run a RACF utility to unload the RACF database into a flat file (omitting sensitive information like passwords etc).  Dbsync is a rexx program which takes two copies of an unloaded database, and generates the RACF commands for the differences. I simply used my existing unloaded file and a null file, and got out the commands to create all of the entries.

The steps are

  1. Unload the RACF database
  2. Get dbsync into your z/OS system
  3. Run DBsync
  4. Edit the files, and remove all lines which are not relevant
  5. Run the output to create/modify the definitions

Unload the database

//IBMUSUN JOB 1,MSGCLASS=H 
//* use the TSO RVARY command to display databases
//UNLOAD EXEC PGM=IRRDBU00,PARM=NOLOCKINPUT
//SYSPRINT DD SYSOUT=*
//INDD1 DD DISP=SHR,DSN=SYS1.RACFDS
//OUTDD DD DISP=(MOD,CATLG),DSN=COLIN.RACF.UNLOAD,
// SPACE=(CYL,(1,1)),DCB=(LRECL=4096,RECFM=VB,BLKSIZE=13030)

Of course this assumes you have the authority to create this file.  If not, ask a friendly sysprog to run the job, then edit the output to delete all records which do not relate to MQ, for example using the ISPF edit commands below.
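If you are editing the unloaded file in ISPF, one way (a suggestion, not part of the original job) is to exclude everything, redisplay the lines which mention MQ, and delete the rest:

X ALL
F MQ ALL
DEL ALL X

You may need extra FIND commands for the MX* classes (MXADMIN, MXTOPIC and so on), which do not contain the string MQ.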

Run dbsync

I had to make the following changes

  1. Dataset 1 was the dataset I created above
  2. Dataset 2 was a dummy

Modify the sort step to output to a temporary output file

//COLINRA JOB 1,MSGCLASS=H 
//* ftp://public.dhe.ibm.com/eserver/zseries/zos/racf/dbsync/
//SORT1 EXEC PGM=SORT
//SYSOUT DD SYSOUT=*
//SORTIN DD DISP=SHR,DSN=COLIN.RACF.UNLOAD
//SORTOUT DD DISP=(NEW,PASS),DSN=&TEMP1,SPACE=(CYL,(1,1))
//SYSIN DD *
SORT FIELDS=(5,2,CH,A,7,1,AQ,A,8,549,CH,A)
ALTSEQ CODE=(F080,F181,F282,F383,F484,F585,F686,F787,F888,F989,
C191,C292,C393,C494,C595,C696,C797,C898,C999,
D1A1,D2A2,D3A3,D4A4,D5A5,D6A6,D7A7,D8A8,D9A9,
E2B2,E3B3,E4B4,E5B5,E6B6,E7B7,E8B8,E9B9)
OPTION VLSHRT,DYNALLOC=(SYSDA,3)
/*

Delete the sort step for the other data set, as I was using a dummy file.

Run dbsync

I changed the output data set lines in the JCL below. The template JCL had

//OUTSCD1 DD DSN=your.dsname.for.outscd1,
// DISP=(NEW,CATLG),

so I changed

  • your.dsname.for to COLIN.RACF
  • NEW,CATLG to MOD,CATLG
  • Upper cased the changed lines using the ucc…ucc ISPF edit line command.
//DBSYNC EXEC PGM=IKJEFT01,REGION=5000K,DYNAMNBR=50,PARM='%DBSYNC' 
//SYSPRINT DD SYSOUT=*
//SYSTSPRT DD SYSOUT=*
//SYSTSIN DD DUMMY
//SYSEXEC DD DISP=SHR,DSN=COLIN.DBSYNC.REXX
//OPTIONS DD *
/* your options here
//INDD1 DD DISP=SHR,DSN=*.SORT1.SORTOUT
//INDD2 DD DUMMY
//OUTADD1 DD DSN=COLIN.RACF.ADDFILE1,
// DISP=(MOD,CATLG),
// UNIT=SYSDA,SPACE=(CYL,(25,25),RLSE),
// DCB=(RECFM=VB,LRECL=255,BLKSIZE=6400)
etc

The output was rexx commands in a file, such as

"rdefine MQCMDS CSQ9.** owner(IBMUSER ) uacc(CONTROL )
    audit(failures(READ )) level(00)"
"permit CSQ9.** class(MQCMDS) reset"
"rdefine MQQUEUE CSQ9.** owner(IBMUSER ) uacc(NONE )
     audit(failures(READ )) level(00) warning notify(IBMUSER )"
"permit CSQ9.** class(MQQUEUE) reset"
"rdefine MQCONN CSQ9.BATCH owner(IBMUSER ) uacc(CONTROL )
    audit(failures(READ )) level(00)"
"permit CSQ9.BATCH class(MQCONN) reset"
"rdefine MQCONN CSQ9.CHIN owner(IBMUSER ) uacc(READ )
    audit(failures(READ )) level(00)"
"permit CSQ9.BATCH class(MQCONN) id(IBMUSER ) access(ALTER )"
"permit CSQ9.BATCH class(MQCONN) id(START1 ) access(UPDATE )"
"permit CSQ9.CHIN class(MQCONN) id(IBMUSER ) access(ALTER )"

You edit and run the Rexx exec to issue the commands.
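Before running it, a global change of the queue manager name is usually all the editing needed; for example, from the ISPF edit command line (CSQ9 is the old queue manager name from the output above, and CSQA is an assumed name for the new queue manager):

C CSQ9 CSQA ALL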

Easy – it took me less than half an hour from start to finish.

Using RACF commands to extract and recreate the requests

I found that most people do not have access to an unloaded RACF database.  My normal userid does not have the authority to create the unloaded copy. 

I put an exec up on Github.   It issues a display command for each class in MQCMDS MXCMDS MQQUEUE MXQUEUE MXTOPIC MQADMIN MXADMIN MQCONN, formats it as an RDEFINE command, and then generates the Permit commands which give people access to it.  It writes the output into the file being edited.

Use ISPF to edit a member where you want the output.

Make sure the rexx exec is in the SYSPROC or SYSEXEC concatenation, for example use ISRDDN to check.

Syntax

genclass <queuemanagername>
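For example, with the member open in ISPF edit and a queue manager called MQPA (as in the output below), type the exec name and queue manager name on the edit command line:

genclass MQPA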

The output is like

 /* class:MXCMDS profile:MQPA class not found 
/* class:MXQUEUE profile:MQPA profile not found
/* class:MXTOPIC profile:MQPA profile not found
/* class:MXADMIN profile:MQPA profile not found
RDEFINE MQCONN -
MQPA.CICS -
- /* Create date 07/17/20
OWNER(ADCDA) -
- /* Last reference Date 07/17/20
- /* Last changed date 07/17/20
- /* Alter count 0
- /* Control count 0
- /* Update count 0
- /* Read count 0
UACC(NONE) -
LEVEL(0) -
- /* Global audit NONE
/* Permit MQPA.CICS CLASS(MQCONN ) RESET
Permit MQPA.CICS CLASS(MQCONN ) ID(ADCDA ) ACCESS(ALTER )
Permit MQPA.CICS CLASS(MQCONN ) ID(START1 ) ACCESS(READ )
/* class:MQCONN profile:MQPA.CICS profile not found

It includes a (commented out) Permit… RESET in case you want to remove all existing access.

What should I monitor for MQ on z/OS – buffer pool

For the monitoring of MQ on z/OS, there are a couple of key metrics you need to keep an eye on for the buffer pools, as you sit watching the monitoring screen.

A quick overview of MQ buffer pools

An inefficient queue manager could have all messages stored on disk, in page sets.  An efficient queue manager would cache hot pages in memory so they can be accessed without doing any I/O.

The MQ Buffer Manager component does this caching, by using buffers  (so no surprise there).

A simplified view of the operation is as follows

  1. Messages are broken down into 4KB pages.
  2. When getting the contents of a page set page, if the page is not already in a buffer, the buffer manager reads it from disk into a buffer.
  3. If a page is requested and the contents are not required (for example it is about to be overwritten as part of a put) it does not need to read it from disk.
  4. If the page is updated, for example for a new message, or because a field in the message is updated during get processing, the page is not usually written to disk immediately.   The write to disk is deferred (using a process called the Deferred Write Process – another non-surprise in the naming convention).  This has the advantage that there can be many updates to a page before it is written to disk.
  5. Any buffer page which has not been changed, and is not currently being used, is a free page.

If the system crashes, non persistent messages are thrown away, and persistent messages can be rebuilt from the log.

To reduce the restart time, pages containing persistent data are written out to the page set periodically.   This is driven by the log data set filling up, which causes a checkpoint.  Updates which have been around for more than 2 checkpoints are written to disk.  During restart the queue manager then knows how far back in the log it needs to go for each page set.

In both cases, checkpoint and buffer pool filling up (less than 15% free, which is more than 85% in use), once a page has been successfully written to the page set, the buffer is marked as free.

Pages for non-persistent messages can be written out to disk.

If the buffer pool is critically short of free buffers, with less than 5% free buffers, then pages are written to the page set immediately rather than through the deferred write process.  This allows some application work to be done while the buffer manager is trying to make more free pages available.

What is typical behaviour?

The buffer pool is working well.

When a message is got, the page is already in the buffer pool, so there is no I/O to read from the page set.

The buffer pool is less than 85% full (has more than 15% free pages), there is periodic I/O to the page set because pages with persistent data are written to the page set at checkpoint.

The buffer pool fills up and has more than 85% in-use pages.

This can occur if the number and size of the messages being processed is bigger than the size of the buffer pool. This could be a lot of messages in a unit of work,  big messages using lots of pages, or lots of transactions putting and getting messages.  It can also occur when there are lots of puts, but no gets.

If the buffer pool has between 85% and 95% in-use pages (between 15% and 5% free pages), the buffer manager is working hard to keep free pages available.

There will be I/O intermittently at checkpoints, and a steady I/O as the buffer manager writes the pages to the page set.

If messages are being read from disk, there will be read activity from the page set, but the buffer pool page can be reused as soon as the data has been copied out of it.

The buffer pool has less than 5% free pages.

The buffer manager is in overdrive.  It is working very hard to keep free pages in the buffer pool.   There will be pages written out to the page set as it tries to increase the number of free pages.  Gets may require read I/O from the page set.  All of this I/O can cause I/O contention, the page set I/Os slow down, and so MQ API requests using this buffer pool slow down.

What should be monitored

Most cars these days have a “low fuel light” and a “take me to a garage” light.  For monitoring we can provide similar.

Monitor the % buffer pool full.

  • If it is below 85% things are OK
  • If it is between 85% and 95% then this needs to be monitored, it may be “business as usual”.
  • If it is >=95% this needs to be alerted.  It can be caused by applications or channels not processing messages

Monitoring the number of pages written does not give much information.

It could be caused by checkpoint activity, or because the buffer pool is filling up.

Monitoring the number of pages read from the page set can provide some insight.

If you compare today with this time last week you can check the profile is similar.

If the buffer pool is below 85% used,

  • Messages being got are typically in the buffer pool so there is no read I/O.
  • If there is read I/O this could be for messages which were not in the buffer pool – for example reading the queue after a queue manager restart.

If the buffer pool is more than 85% and less than 95% in-use, this can be caused by a large message workload coming in, and MQ being “elastic” and queuing the messages.  Even short lived messages may be read from disk.  The number of read I/Os gives an indication of throughput.  Compare this with the previous week to see if the profile is similar.

If the buffer pool is more than 95% in-use this will have an impact on performance, as every page processed is likely to have I/O to the page set, and elongated I/O response time due to the contention.

What to do

You may want “operators notes” either on paper or online which describe the expected status of the buffer pools on individual queue managers.

  1. PROD QMPR
    1. BP 1 green, less than 85% busy
    2. BP 2 green, less than 85% busy
    3. BP 3 green, except for Friday night when it goes amber, with a read I/O rate of 6000 I/Os per minute.
  2. TEST QMTE
    1. BP 1 green, less than 85% busy
    2. BP 2 green, less than 85% busy
    3. BP 3 usually amber – used for bulk data transfer

What do the buffer statistics mean?

There are statistics on the buffer pool usage.

  1. Buffer pool number.
  2. Size – the size of the buffer pool when the data was produced.
  3. Low – the lowest number of free pages in the interval.  100*(Size – Low)/Size gives you the peak % full (see the sketch after this list).
  4. Now – the current number of free pages in the interval.
  5. Getp – the number of requests ‘get page with contents’.  If the page is not in the buffer pool then it is read from the page set.
  6. Getn.   A new page is needed.   The contents are not relevant as it is about to be overwritten.  If the page is not in the buffer pool, just allocate a buffer, and do not read from the page set.
  7. STW – set write intent.  This means the page was got for update. I have not seen a use for this. For example
    1. A put wants to put part of a message on the page
    2. A get is being done and it wants to set the “message has been got” flag.
    3. The message has been got, and so pointers to a page need to be updated.
  8. RIO – the number of read requests to the page set.  If this is greater than zero
    1. The request is for messages which have not been used since the queue manager started
    2. The buffer pool had reached 85%, pages had been moved out to the page set, and the buffer has been reused.
  9. WIO the number of write I/Os that were done.  This write could be due to a checkpoint, or because the buffer pool filled up.
  10. TPW total pages written, a measure of how busy the buffer pool was.   This write could be due to a checkpoint, or because the buffer pool filled up.
  11. IMW – immediate write.  I have not used this value, sometimes I observe it is high, but it is not useful.  This can be caused by
    1. the buffer pool being over 95% full, so all application write I/O is synchronous,
    2. or a page was being updated during the last checkpoint, and it needs to be written to the page set when the update has finished.  This should be a small number.  Frequent checkpoints (eg every minute) can increase this value.
  12. DWT – the number of times the Deferred Write processor was started.  This number has little value.
    1. The DWP could have started and been so busy that it never ended, so this counter is 1.
    2. The DWP could have started, written a few pages and stopped – and done this repeatedly, in this case the value could be high.
  13. DMC.   The number of times the buffer pool crossed the 95% limit.  If this is > 0 it tells you the buffer pool crossed the limit
    1. This could have crossed just once, and stayed over 95%
    2. This could have gone above 95%, then below 95% etc.
  14. SOS – the buffer pool was totally short on storage – there were no free pages.  This is a critical indicator.  You may need to do capacity planning, and make the buffer pool bigger, or see if there was a problem where messages were not being got.
  15. STL – the number of times a “free buffer” was reused.  A buffer was associated with a page of a page set.   The buffer has been reused and is now for a different page.  If STL is zero, it means all pages that were used were in the buffer pool
  16. STLA – A measure of contention when pages are being reused.  This is typically zero.
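To show how I would turn these fields into the green/amber/red states described earlier, here is a minimal Rexx sketch.  The field names follow the list above; the numbers are invented example values, and the 85% and 95% thresholds are the ones discussed in this post.

/* Rexx - classify a buffer pool from its interval statistics (a sketch) */
size = 50000 ; low = 4000 ; sos = 0 ; dmc = 0   /* invented example values */
pctFull = 100 * (size - low) / size             /* peak % in use           */
Select
  When sos > 0 | pctFull >= 95 Then status = 'RED   - alert'
  When pctFull >= 85           Then status = 'AMBER - monitor'
  Otherwise                         status = 'GREEN - OK'
End
If dmc > 0 Then status = status' (crossed the 95% limit' dmc 'times)'
Say 'Buffer pool was' pctFull'% full at its peak:' status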

Summary

Now that you know as much as I do about buffer pools, you’ll see that the % full (or % free) is the key measure.  If the buffer pool is more than 85% in-use pages, then the I/O rate is a measure of how hard the buffer manager is working.

What should I monitor for MQ on z/OS – logging statistics

For the monitoring of MQ on z/OS, there are a couple of key metrics you need to keep an eye on for the logging component, as you sit watching the monitoring screen.

I’ll explain how MQ logging works, and then give some examples of what I think would be key metrics.

Quick overview of MQ logging

  1. MQ logging has a big (sequential) buffer for logging data, which wraps.
  2. Application does an MQPUT of a persistent message.
  3. The queue manager updates lots of values (eg queue depth, last put time) as well as moving the message data into the queue manager address space.  This data is written to log buffers. A 4KB page can hold data from many applications.
  4. An application does an MQCOMMIT.  MQ takes the log buffers up to and including the current buffer and writes them to the current active log data set.  Meanwhile other applications can write to other log buffers.
  5. The I/O finishes and the log buffers just written can be reused.
  6. MQ can write up to 128 pages in an I/O. If there are more than 128 buffers to write there will be more than 1 I/O.
  7. If application 1 commits, the I/O starts, and then application 2 commits. The I/O for the commit in application 2 has to wait for the first set of disk writes to finish before the next write can occur.
  8. Eventually the active log data set fills up.  MQ copies this active log to an archive data set.  This archive can be on disk or tape.   This archive data set may never be used again in normal operation.  It may be needed for recovery of transactions or after a failure.   The Active log which has just been copied can now be reused.

What is interesting?

Displaying how much data is logged per second.

Today      XXXXXXXXXXXXXXXXXXXX
Last week  XXXXXXXXXXXXXXXXXXXXXXXXXXXXX
Yesterday  XXXXXXXXXX
           0                   100MB/sec           200MB/sec

This shows that the logging rate today is lower than last week.   This could be caused by

  1. Today is just quieter than last week
  2. There is a problem and there are fewer requests coming into MQ.   This could be caused by problems in another component, or a problem in MQ.    When using persistent messages the longest part of a transaction is the commit and waiting for the log disk I/O.  If this I/O is slower it can affect the overall system throughput.
  3. You can get the MQ log I/O response times from the MQ log statistics.

Displaying MQ log I/O response time

You can break down where time is spent doing I/O into the following areas

  1. Scheduling the I/O – getting the request into the I/O processor on the CPU
  2. Sending the request down to the disk controller (eg 3990)
  3. Transferring data
  4. The I/O completes and sends an interrupt to z/OS; z/OS has to catch this interrupt and wake up the requester.

 Plotting the I/O time does not give an entirely accurate picture, as the time to transfer the data depends on the amount of data to transfer.  On a well run system there should be enough capacity so the other times are constant.    (I was involved in a critical customer situation where the MQ logging performance “died” every Sunday morning.   They did backups, which overloaded the I/O system).

In the MQ log statistics you can calculate the average I/O time.  There are two sets of data for each log

  1. The number of requests, and the sum of the times of the requests, to write 1 page.  This should be pretty constant, as the data is for when only one 4KB page was transferred.
  2. The number of requests, and the sum of the times of the requests, to write more than 1 page.  The average I/O time will depend on the amount of data transferred.  (There is a sketch of the calculation after this list.)
  • When the system is lightly loaded, there will be many requests to write just one page. 
  • When big messages are being processed (over 4 KB) you will see multiple pages per I/O.
  • If an application processes many messages before it commits you will get many pages per I/O.   This is typical of a channel with a high achieved batch size.
  • When the system is busy you may find that most of the I/O write more than one page, because many requests to write a small amount of data fills up more than one page.
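As a sketch of the calculation (the variable names are mine; they stand for the counts and total I/O times for the two sets of data described above, and the values are invented):

/* Rexx - average log write I/O times from the two sets of counters     */
n1 = 120000 ; t1 = 96    /* single-page writes: count, total seconds    */
nm =  30000 ; tm = 45    /* multi-page writes : count, total seconds    */
If n1 > 0 Then Say 'Average I/O time, 1 page :' 1000 * t1 / n1 'ms'
If nm > 0 Then Say 'Average I/O time, >1 page:' 1000 * tm / nm 'ms'
Say 'Overall average          :' 1000 * (t1 + tm) / (n1 + nm) 'ms'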

I think displaying the average I/O times would be useful.   I haven’t tried this in a customer environment (as I don’t have a customer environment to use).    So if the data looks like

Today        XXXXXXXXXXXXXXXXXXXXXXXX
Last week    XXXXXXXXXXXXXXXXXXXXXXXXXXXXX
One hour ago XXXXXXXXXXXXXXXXXXX
time in ms   0         1         2         3

it gives you a picture of the I/O response time.

  • The dark green is for I/O with just one page, the size of the bar should be constant.
  • The light green is for I/O with more than one page, the size of the bar will change slightly with load.  If it changes significantly then this indicates a problem somewhere.

Of course you could just display the total I/O response time = (total duration of I/Os) / (total number of I/Os), but you lose the extra information about the writing of 1 page.

Reading from logs

If an application using persistent messages decides to roll back:

  • MQ reads the log buffers for the transaction’s data and undoes any changes.
  • It may be the data is old and not in the log buffers, so the data is read from the active log data sets.
  • It may be that the request is really old (for example half an hour or more), in which case MQ reads from the archive logs (perhaps on tape).

Looking at the application doing a roll back, and having to read from the log.

  • Reading from buffers is OK.   A large number indicates an application problem or a DB2 deadlock-type problem.  You should investigate why there is so much rollback activity.
  • Reading from Active logs … . this should be infrequent.  It usually indicates an application coding issue where the transaction took too long before commit.  Perhaps due to a database deadlock, or bad application design (where there is a major delay before the commit)
  • Reading from Archive logs… really bad news…..  This should never happen.

Displaying reads from LOGS

Today        XXXXXXXXXXXXXXXXXXXXXXXX
Last week    X
One hour ago XXXXX
rate         0        10        20        40

Where green is “read from buffer”, orange is “read from active log”, and red is “read from archive log”.  Today looks like a bad day.

Should I monitor MQ – if so, what for?

I’ve been talking to someone about using the MQ SMF data, and when would it be useful. There is a lot of data. What are the important things to watch for, and what do I do when there is a problem?

Why monitor?

From a performance perspective there are a couple of reasons why you should monitor

Today’s problems

Typical problems include

  1. “Transaction slow down”, people using the mobile app are timing out.
  2. Our new marketing campaign is a great success – we have double the amount of traffic, and the backend system cannot keep up.
  3. The $$ cost of the transactions has gone up.   What’s the problem?

With problems like transaction slow down, the hard part is often to determine which is the slow component.  This is hard when there may be 20 components in the transaction flow, from front end routers, through CICS, MQ, DB2, IMS, and WAS, and the occasional off-platform request.

You can often spot problems because work is building up, (there are transactions or messages queued up), or response times are longer.  This can be hard to interpret because “the time to put a message and get the reply from MQ is 10 second” may at first glance be an MQ problem – but MQ is just the messenger, and the problem is beyond MQ.  I heard someone say that the default position was to blame MQ, unless the MQ team could prove it wasn’t them.

Yesterday’s problem

Yesterday/last week you had a problem and the task force is looking into it.  They now want to know how MQ/DB2/CICS/IMS etc were behaving when there was a problem.  For this you need historical data.  You may have summary data recorded on a minute by minute basis, or you may have summary data recorded over an hour.   If the data is averaged over an hour you may not see any problems. A spike in workload may be followed by no work, and so on average everything is OK.
It is useful to have “maximum value in this time range”. So if your maximum disk I/O time was 10 seconds in this interval at 16:01:23:45 and the problem occurred around this time, it is a good clue to the problem.

Tomorrow’s problem.

You should be able to get trending information.  If your disk can sustain an I/O rate of 100MB a second, and you can see that every week at peak time you are logging an extra 5MB/second, this tells you that you need to do something to fix it: either get faster disks, or split the work to a different queue manager.

Monitoring is not capacity planning.

Monitoring is about how the system is performing in the current configuration.  Monitoring may show a problem, but it is up to the capacity and tuning team to fix it.  For example, how big a buffer pool do we need is a question for the capacity team.  You could make the buffer pools use GBs of buffers – or keep the buffer pools smaller and let MQ “do the paging to and from disk”.

How do you know when a ‘problem’ occurs?

I remember going to visit a customer because they had a critical problem with the performance on MQ.  There were so many things wrong it was hard to know where to start.  The customer said that the things I pointed out were always bad – so they were not the current problem.  Eventually we found the (application) problem.  The customer was very grateful for my visit – but still did not fix the performance problems.

One thing to learn from this story is that you need to compare a bad day with a good day, and see what is different.  This may mean comparing it with the data from the same time last week, rather than from an hour ago.  I would expect that last week’s profile should be a good comparison to this week.   One hour ago there may not have been any significant load.

With MQ, there is a saying “A good buffer pool is an empty buffer pool”.  Does a buffer pool which has filled up and is causing lots of disk I/O mean there is a problem?  Not always.  It could be MQ acting as a queueing system: if you wait half an hour for the remote system to restart, all of the messages will flow, and the buffer pool will become empty.  If you know this can happen, it is good to be told it is happening, but the action may be “watch it”.  If this is the first time it has happened, you may want to do a proper investigation, and find out which queues are involved, which channels are not working, and which remote systems are down.

What information do I need?

It depends on what you want.  If you are sitting in the operations room watching the MQ monitor while sipping a cup of your favourite brew, then you want something like modern cars.  If there is a problem, a red light on the dashboard lights up, meaning “You need to take the car to the garage”.   The garage can then put the diagnostic tools onto the engine and get the reason.

You want to know if there is a problem or not.  You do not need to know you have a problem to 3 decimal places – yes, maybe, or no is fine.

If you are investigating a problem from last week, you, being the role of the garage mechanic, need the detailed diagnostics.

When do you need the data?

If you are getting the data from SMF records you may get a record every 1 minute, or every half an hour.  This may not be frequent enough while there is a problem.  For example if you have a problem with logging, you want to see the data on a second by second basis, not once every 30 minutes.

Take the following scenario.  It is 10:59 – 29 minutes into the 30 minute period covered by the SMF (or online monitor) data.

So far in this interval there have been 100,000 I/Os.   The total time spent doing I/Os is 100 seconds.  By calculation the average time for an I/O is 1 millisecond.  This is a good response time.

You suddenly hit a problem, and the I/O response time goes up to 100 ms; 10 more I/Os are done.

The total number of I/Os is now 100,010, and the time spent doing I/Os is now 101 seconds.  By calculation the average I/O time is now 1.009899 milliseconds.  This does not show there is a problem, as this is within typical variation.

If you can get the data from a few seconds ago and now you can calculate the differences

  1. number of IOs 100,010 – 100,000 = 10
  2. time spent doing I/O 101 -100 = 1 second
  3. Average I/O time 100 ms – wow this really shows the problem, compared to calculating the value from the 30 minute set of statistics which showed the time increasing from 1 ms to 1.01 ms.
This shows you need granular data, perhaps every minute – but this means you get a lot of data to manage.
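The arithmetic, as a small Rexx sketch using the numbers from the scenario above (take two samples a few seconds apart and work with the differences, not the running totals):

/* Rexx - interval I/O response time from two cumulative samples (a sketch) */
prevCount = 100000 ; prevTime = 100   /* earlier sample                      */
currCount = 100010 ; currTime = 101   /* sample a few seconds later          */
dCount = currCount - prevCount        /* I/Os done in the short interval     */
dTime  = currTime  - prevTime         /* seconds spent doing them            */
avgInterval = 1000 * currTime / currCount   /* about 1.01 ms - looks fine    */
Say 'Average over the whole interval  :' avgInterval 'ms'
If dCount > 0 Then Do
  avgRecent = 1000 * dTime / dCount         /* 100 ms - shows the problem    */
  Say 'Average over the last few seconds:' avgRecent 'ms'
End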
 

Do not restart with the fire hose set to maximum.

There is an article in The Register, about an outage at the Tokyo stock exchange.  One of the problems was that they did not have a process for restarting the environment.  The impact of restarting a system is often overlooked, and in the panic of “get it started as quickly as possible” things can go wrong.   The fire brigade slowly increases the pressure in a fire hose to stop the fire crew from being knocked down with the sudden flow.

TCP/IP is good because it has a “slow start” protocol.  Once a connection has been established, and is working well, the exchange can use bigger buffers, and send more buffers before waiting for the acknowledgement.   This boosts the throughput.  If the back-end is slow to process the data, TCP  slows down the traffic, and then increases the throughput again if the connection can handle it.  If the connection stops and restarts, the rate starts slowly and builds up, rather than use the rate just before the outage.

You cannot expect WAS/CICS/DB2/MQ/IMS to restart at maximum speed; it has to work up to it.  Transactions may have to warm up. There can be many reasons:

  1. Data may need to be read from page sets into buffers, for example reading hot Db2 data into memory.
  2. Java code needs to warm up to become more efficient (JITed).
  3. The systems need to establish a working set, for example making a buffer pool larger.
  4. Establishing connections may have some serialisation delays.

Restarting faster than a system can cope with can cause a domino effect.  A transaction server is restarted and the fire hose of data is turned on.   The transaction server is still warming up, and cannot cope with the volume of requests.    Work for this system is then routed to another transaction server, which could have handled the workload if the volume had increased gradually.  If it gets this additional work all at once, this instance slows down, and the work is routed to another transaction server, and so on.

MQ can be seen as the bad guy here.  When you restart MQ, it can go to fire hose mode immediately.   You should start the output channels first to start draining messages, then gradually start the input channels.   If you start the input channels before the output channels, you may get queues and page sets filling up, before the output channels can process the messages.

If you have a policy that all clients must disconnect and reconnect after a random time of between 15 minutes and 45 minutes, this should help spread the load, and gradually you should get a balanced environment.

What’s the difference between MQ Web, and z/OS Connect MQ support?

With MQ Web

  1. You can issue commands to administer MQ  for example display, define, delete MQ objects.
  2. You can put and get messages to and from a queue.  The message is what you specify – typically a character string.

With z/OS Connect MQ support

  1. You can put and get messages to and from a queue, and do transformations on the message.  For example mapping a COBOL structure to JSON.  
  2. You can do field validation.
  3. You can convert HTTP code “200” to “great, it worked”.

What is common?

They both use z/OS WebSphere Liberty to provide the basic web server.

A practical path to installing Liberty and z/OS Connect servers – 10 use the MQ sample

Introduction

I’ll cover the instructions to install z/OS Connect, but the instructions are similar for other products. The steps are to create the minimum server configuration and gradually add more function to it.

The steps below guide you through

  1. Overview
  2. planning to help you decide what you need to create, and what options you have to choose
  3. initial customisation and creating a server,  creating defaults and creating function specific configuration files,  for example a file for SAF
  4. starting the server
  5. enable logon security and add SAF definitions
  6. add keystores for TLS, and client authentication
  7. adding an API and service application
  8. protecting the API and service applications
  9. collecting monitoring data including SMF
  10. use the MQ sample
  11. using WLM to classify a service

With each step there are instructions on how to check the work has been successful.

Use the MQ sample

You need to have installed the service, and protected it.

You need to configure the server to include the MQ support, and tell JMS where the MQ libraries are

<server> 
<!-- Enable features --> 
<featureManager> 
    <feature>zosconnect:mqService-1.0</feature> 
</featureManager> 
                                                                                                         
<wmqJmsClient nativeLibraryPath="/Z24A/usr/lpp/mqm/V9R1M1/java/lib"/> 

<variable name="wmqJmsClient.rar.location"
   value="/Z24A/usr/lpp/mqm/V9R1M1/java/lib/jca/wmq.jmsra.rar"/> 
</server> 

You could configure a variable for the MQ directory so you specify it only once, and use

<variable name="wmq" value="/Z24A/usr/lpp/mqm/V9R1M1/java/lib/"/>
<wmqJmsClient nativeLibraryPath="${wmq}"/>
<variable name="wmqJmsClient.rar.location"
   value="${wmq}jca/wmq.jmsra.rar"/>

You could also pass the mq location as a variable in STDENV, and so pass it in through JCL.

Configure JMS to define the queue manager and queues

<server> 
 <jmsConnectionFactory jndiName="jms/cf1" 
     connectionManagerRef="ConMgr1"> 
    <properties.wmqJms transportType="BINDINGS" 
         queueManager="CSQ9"/> 
 </jmsConnectionFactory> 
                                                                                                      
 <connectionManager id="ConMgr1" maxPoolSize="5"/> 
                                                                                                      
 <!-- A queue definition where request messages 
      for the stock query application are sent. --> 
 <jmsQueue jndiName="jms/stockRequestQueue"> 
    <properties.wmqJms 
       baseQueueName="STOCK_REQUEST" 
       targetClient="MQ"/> 
 </jmsQueue> 
                                                                                                      
 <!-- A queue definition where response messages from 
      the stock query application are sent. --> 
 <jmsQueue jndiName="jms/stockResponseQueue"> 
    <properties.wmqJms baseQueueName="STOCK_RESPONSE" targetClient="MQ"/> 
 </jmsQueue> 
</server>

and include these in the server.xml file.

You need to compile and run the back end service.  See here.  Take care if using cut and paste, as there are long lines which wrap and cause compilation errors.

Because the MQ path name is long I used

export HLQ="/usr/lpp/mqm/V9R1M1/java/lib"

java -cp $HLQ/com.ibm.mq.allclient.jar:. -Djava.library.path=$HLQ TwoWayBackend CSQ9 STOCK_REQUEST STOCK_RESPONSE

I set up a job to run this in the background, so I could free up my TSO terminal, as shown in the sketch below.
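Something like this (a sketch; the script path /u/colin/runback.sh and the working directory containing the compiled TwoWayBackend class are assumptions for illustration).  A small shell script, given execute permission with chmod +x:

#!/bin/sh
# run the z/OS Connect MQ sample backend until it is cancelled
export HLQ="/usr/lpp/mqm/V9R1M1/java/lib"
cd /u/colin/zosconnect        # directory containing TwoWayBackend.class
java -cp $HLQ/com.ibm.mq.allclient.jar:. -Djava.library.path=$HLQ \
  TwoWayBackend CSQ9 STOCK_REQUEST STOCK_RESPONSE

and a job to run it under BPXBATCH:

//COLINBK JOB 1,MSGCLASS=H
//RUN     EXEC PGM=BPXBATCH,REGION=0M,PARM='SH /u/colin/runback.sh'
//STDOUT  DD SYSOUT=*
//STDERR  DD SYSOUT=*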

Use the API

Once installed you should be able to use the API. For example

curl --insecure -i --cacert cacert.pem --cert adcdd.pem:password --key adcdd.key.pem https://10.1.3.10:9443/stockmanager/stock/items/999999

If the back end application was working I got

{"SQRESP":{"ITEM_ID":999999,"ITEM_DESC":"A description. 00050","ITEM_COST":45,"ITEM_COUNT":0}}

If the back end application was not working I got back an empty response.

The back-end application runs until Ctrl+C is used.  On my USS session, the escape key is the cent symbol ¢ (Unicode 00a2), which I do not have on my default keyboard.    See x3270 – where’s the money key? for guidance on how to set it.

 

Use the Service

To use the service I used a web browser with

https://10.1.3.10:9443/stockmanager/stock/items/999999

and got back

{"SQRESP":{"ITEM_ID":999999,"ITEM_DESC":"A description. 00050","ITEM_COST":45,"ITEM_COUNT":0}}

or curl with

 curl --insecure -X POST -i
  -H "Content-Type:application/json" --cacert cacert.pem
  --cert adcdd.pem:password --key adcdd.key.pem
  --data '{"STOCKQRYOperation": {"sqreq" : { "item_id": 2033}}}'
  https://10.1.3.10:9443/zosConnect/services/stockQuery?action=invoke

Why isn’t my MQ RACF command working?

I was  trying to define an MQ queue using %CSQ9 DEFINE QL(AA)  and was getting

ICH408I USER(IBMUSER ) GROUP(SYS1 ) NAME( ) 
CSQ9.QUEUE.AA CL(MQADMIN ) 
PROFILE NOT FOUND - REQUIRED FOR AUTHORITY CHECKING 
ACCESS INTENT(ALTER ) ACCESS ALLOWED(NONE ) 

But the profile existed!
The TSO command RLIST MQADMIN CSQ9.QUEUE.AA showed me the profile which would be used

CLASS NAME
----- ----
MQADMIN CSQ9.QUEUE.* (G)

From the SETROPTS output, it did not look as if the class was being cached

SETR RACLIST CLASSES =  APPL CBIND CDT CONSOLE CSFKEYS CSFSERV DASDVOL     
                        DIGTCERT DIGTCRIT DIGTNMAP DIGTRING DSNR EJBROLE   
                        FACILITY IDIDMAP LOGSTRM OPERCMDS PTKTDATA PTKTVAL 
                        RDATALIB SDSF SERVAUTH SERVER STARTED SURROGAT     
                        TSOAUTH TSOPROC UNIXPRIV WBEM XCSFKEY XFACILIT     
                        ZMFAPLA ZMFCLOUD 

But I missed the

GLOBAL=YES RACLIST ONLY =  MQADMIN MQNLIST MQPROC MQQUEUE MXTOPIC      

I used the

TSO SETROPTS RACLIST(MQADMIN) REFRESH

and the define command worked. Another face palming moment.

Lesson learned – if in doubt, use the SETROPTS RACLIST(MQADMIN) REFRESH command.

 

Looking for an MQ reason code in Liberty? Get your safari helmet, anti malarial tablets and follow me to find the treasure.

I was using an MQ application in Liberty, and rather than do things the easy way, I did what I normally do, and did it the hard way.  On my z/OS I did not have the queue manager defined, because I wanted to see what happened.  I was not expecting the expedition.

You configure MQ in Liberty using configuration like

<jmsConnectionFactory jndiName="jms/cf1" connectionManagerRef="ConMgr1">
  <properties.wmqJms transportType="BINDINGS" queueManager="MQPA"/>
</jmsConnectionFactory>

 

I was expecting a message like the following in the job output.

Application COLINAPP MQCONN call to MQPA failed with compcode
'2' ('MQCC_FAILED') reason '2058' ('MQRC_Q_MGR_NAME_ERROR').

Oh no, it was not that easy.  It was quite a trek into the jungle to find the information.

In the Liberty server’s logs directory there is a message.log file.  In this file I had

9/14/20 19:16:32:242 GMT 00000060 com.ibm.ws.logging.internal.impl.IncidentImpl I FFDC1015I: An FFDC Incident has been created: "com.ibm.mq.connector.DetailedResourceException: MQJCA1011: Failed to allocate a JMS connection., error code: MQJCA1011 An internal error caused an attempt to allocate a connection to fail. See the linked exception for  details of the failure. com.ibm.ejs.j2c.poolmanager.FreePool.createManagedConnectionWithMCWrapper 199" at 
ffdc_20.09.14_19.16.28.0.log

This was one long line, and I had to scroll sideways (just like you did) to see the content (or use the ISPF line prefix command “tf” to flow the text to the display width).  A key hint was the message MQJCA1011 An internal error caused an attempt to allocate a connection to fail  so I knew I was on the right trail.  I now knew the name of the file – ffdc_20.09.14_19.16.28.0.log.

Knowing the name of the file did not help very much, as if you use ISPF 3.17  (z/OS UNIX Directory List ) it showed a list of 40 files with the name ffdc_20.09.14_1 (ffdc_yy.mm.dd_h).   This is because it only displays the first part of the name. Thanks to Steve Porter who said ..

To increase column size in 3.17, >
Options
1. Directory List Options…
Width of filename column . . . . . . . . 15 (Default value – increase as necessary)

 

The file has a name ffdc_20.09.14_19.16.28.0.log and a displayed time stamp of 2020/09/14 18:16:32, which is close enough – allowing for the time zone difference and the time taken to write the file.  I was fortunate not to be running a workload and producing many of these files.

I edited the file – and I could see the full file name at the top of the page, so I knew I was in the right file.

The file has long lines, so I had to scroll or use the “tf” line command to reformat it.

Near the top it had

Stack Dump = com.ibm.mq.connector.DetailedResourceException: 
MQJCA1011: Failed to allocate a JMS connection., error code:  
MQJCA1011 An internal error caused an attempt to allocate a connection to fail. 
See the linked exception for details of the  failure.

Further down it had

Caused by: com.ibm.msg.client.jms.DetailedJMSException: 
JMSWMQ0018: Failed to connect to queue manager 'MQPA' with connection 
mode 'Bindings' and host name 'localhost(1414)'.

and further down still (line 50) I found the treasure

Caused by: com.ibm.mq.MQException: JMSCMQ0001: IBM MQ call failed with 
compcode '2' ('MQCC_FAILED') reason '2058'  ('MQRC_Q_MGR_NAME_ERROR').

What a trek to find the information I needed!

Next time I’ll just list the logs/ffdc directory, edit (not browse) each file, and search for “compcode”.   You cannot simply use “grep compcode” from USS, because the file is in UTF-8 and grep does not find the string.  You can just use oedit file_name in USS.
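One workaround (a suggestion; I have not timed it against just editing the files) is to convert the file to EBCDIC on the fly and pipe it into grep:

iconv -f UTF-8 -t IBM-1047 ffdc_20.09.14_19.16.28.0.log | grep compcode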

It would be nice if the MQ code could be enhanced to have an option “makeErrorsHardToFind” which you could set to “no”, and still keep the default “yes”.

 

Setting up a highly available web server is a “yes – but” problem

I’ve been setting up a Liberty web server, as used in MQWEB, z/OS Connect, z/OSMF and so on, and was looking into how to make it highly available, so I could move the web server to a different LPAR or TCP/IP instance.

Moving it should be easy – and it is – but there are things you need to think about. It is a bit like going around a maze trying to find the solution.

How do I get to the fail over system?

You start the web server on a different LPAR in the sysplex. How can you support this to allow your browser to get to the backend, without changing the URL?

You have two choices.

  1. You change your DNS look-up, or your router, so your request goes via a different connection (think different bit of wire) to the failover LPAR. These changes can be automated to some extent.
  2. Multiple z/OS images can listen for an IP address.

These work but…

The certificate sent down from the web server contains the address of the LPAR as part of the SAN.  When the browser processes it, it compares the address it used to connect with the address in the certificate.  If they do not match, the browser produces an error message.

How do I get over the certificate SAN and the IP address difference?

You have a couple of choices

  1. Use a unique certificate on each LPAR.    Yes this works, but there is more administrative work to set up.  You could set up two web servers and only use one at a time.   This works, but it is unnecessary work.
  2. Use a Virtual IP address (VIPA).   In TCP networking the end of every connection is a “device” or system with its unique IP address.   You can give the web server its own IP address, which is “virtual” as it is not a device or a system.  With this, when you start your web server on a different LPAR, it has the same IP address.  To use this you have to configure z/OS to support it (see the sketch after this list).  You can set this up
    1. To support multiple web servers, and distribute the work to them
    2. Have a hot standby
    3. To route traffic to where the web server has started.
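As a flavour of what “configure z/OS to support this” means, the TCP/IP profile statements look something like the sketch below.  This is not a complete configuration; the 10.1.9.10 address, the subnet mask and port 9443 are assumptions for illustration.

VIPADYNAMIC
  ; a dynamic VIPA which can move between LPARs in the sysplex
  VIPADEFINE MOVEABLE IMMEDIATE 255.255.255.0 10.1.9.10
  ; optionally distribute connections for port 9443 to the stacks
  ; where a web server is listening
  VIPADISTRIBUTE DEFINE 10.1.9.10 PORT 9443 DESTIP ALL
ENDVIPADYNAMIC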

Yes, these work, but they are not easy to set up.  I’ll be blogging about how to do this.