What should I monitor for MQ on z/OS – logging statistics

When monitoring MQ on z/OS, there are a few key metrics you need to keep an eye on for the logging component as you sit watching the monitoring screen.

I’ll explain how MQ logging works, and then give some examples of what I think would be key metrics.

Quick overview of MQ logging

  1. MQ logging has a big (sequential) buffer for logging data, which wraps.
  2. Application does an MQPUT of a persistent message.
  3. The queue manager updates lots of values (eg queue depth, last put time) as well as moving the message data into the queue manager address space.  This data is written to log buffers. A 4KB page can hold data from many applications.
  4. An application does an MQCOMMIT.  MQ takes the log buffers up to and including the current buffer and writes them to the current active log data set.  Meanwhile other applications can write to other log buffers.
  5. The I/O finishes and the log buffers just written can be reused.
  6. MQ can write up to 128 pages in an I/O. If there are more than 128 buffers to write there will be more than 1 I/O.
  7. If application 1 commits, the I/O starts, and then application 2 commits. The I/O for the commit in application 2 has to wait for the first set of disk writes to finish before the next write can occur.
  8. Eventually the active log data set fills up.  MQ copies this active log to an archive data set.  This archive can be on disk or tape.   This archive data set may never be used again in normal operation.  It may be needed for recovery of transactions or after a failure.   The Active log which has just been copied can now be reused.
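To make the batching behaviour concrete, here is a toy model in Python (my own sketch, not IBM's code) of points 4 and 6: a commit writes the filled buffers, and more than 128 pending pages means more than one I/O.

import time

MAX_PAGES_PER_IO = 128   # MQ writes up to 128 x 4KB pages in one I/O
IO_TIME = 0.001          # assume a 1 ms disk write, for illustration
pending_pages = 0        # log buffers filled since the last write

def do_io(pages):
    # pretend to write 'pages' 4KB pages to the active log data set
    time.sleep(IO_TIME)

def mqput_persistent(pages):
    # an MQPUT of a persistent message fills some log buffers
    global pending_pages
    pending_pages += pages

def mqcommit():
    # MQCOMMIT writes the buffers up to and including the current one;
    # more than 128 pending pages means more than one I/O
    global pending_pages
    ios = 0
    while pending_pages > 0:
        chunk = min(pending_pages, MAX_PAGES_PER_IO)
        do_io(chunk)
        pending_pages -= chunk
        ios += 1
    return ios

mqput_persistent(1)             # lightly loaded: one page
print(mqcommit(), "I/O(s)")     # -> 1 I/O(s)
for _ in range(200):            # a big batch before the commit,
    mqput_persistent(1)         # like a channel with a high batch size
print(mqcommit(), "I/O(s)")     # -> 2 I/O(s): 128 pages, then 72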

What is interesting?

Displaying how much data is logged per second.

Today     XXXXXXXXXXXXXXXXXXXX
Last week XXXXXXXXXXXXXXXXXXXXXXXXXXXXX
Yesterday XXXXXXXXXX
          0              100 MB/Sec     200 MB/Sec

This shows that the logging rate today is lower than last week.   This could be caused by

  1. Today just being quieter than last week.
  2. A problem, meaning there are fewer requests coming into MQ.   This could be caused by problems in another component, or a problem in MQ.    When using persistent messages the longest part of a transaction is the commit, waiting for the log disk I/O.  If this I/O is slower it can affect the overall system throughput.

You can get the MQ log I/O response times from the MQ log statistics data.

Displaying MQ log I/O response time

You can break down where time is spent doing I/O into the following areas

  1. Scheduling the I/O – getting the request into the I/O processor on the CPU
  2. Sending the request down to the disk controller (eg 3990)
  3. Transferring data
  4. The I/O completes and sends an interrupt to z/OS; z/OS has to catch this interrupt and wake up the requester.

Plotting the I/O time does not give an entirely accurate picture, as the time to transfer the data depends on the amount of data to transfer.  On a well-run system there should be enough capacity so the other times are constant.    (I was involved in a critical customer situation where the MQ logging performance “died” every Sunday morning.   They did backups, which overloaded the I/O system.)

In the MQ log statistics you can calculate the average I/O time.  There are two sets of data for each log:

  1. The number of requests, and the sum of the times of the requests, to write 1 page.  This should be pretty constant, as the data is for when only one 4KB page was transferred.
  2. The number of requests, and the sum of the times of the requests, to write more than 1 page.  The average I/O time will depend on the amount of data transferred.
  • When the system is lightly loaded, there will be many requests to write just one page. 
  • When big messages are being processed (over 4 KB) you will see multiple pages per I/O.
  • If an application processes many messages before it commits you will get many pages per I/O.   This is typical of a channel with a high achieved batch size.
  • When the system is busy you may find that most of the I/Os write more than one page, because many requests to write a small amount of data fill up more than one page.

I think displaying the average I/O times would be useful.   I haven’t tried this in a customer environment (as I don’t have a customer environment to use).    So if the data looks like

Today        XXXXXXXXXXXXXXXXXXXXXXXX
Last week    XXXXXXXXXXXXXXXXXXXXXXXXXXXXX
One hour ago XXXXXXXXXXXXXXXXXXX
time in ms   0         1         2         3

it gives you a picture of the I/O response time.

  • The dark green is for I/O with just one page, the size of the bar should be constant.
  • The light green is for I/O with more than one page, the size of the bar will change slightly with load.  If it changes significantly then this indicates a problem somewhere.

Of course you could just display the total I/O response time = (total duration of I/Os) / (total number of I/Os), but you lose the extra information about the writing of 1 page.
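As a sketch of the calculation (the field names and numbers are invented for illustration; use whatever your monitoring tool extracts from the log statistics):

# average I/O times from the two sets of log statistics
one_page   = {"count": 50_000, "total_ms": 25_000}   # writes of exactly 1 page
multi_page = {"count": 10_000, "total_ms": 15_000}   # writes of > 1 page

avg_one   = one_page["total_ms"] / one_page["count"]       # should stay ~constant
avg_multi = multi_page["total_ms"] / multi_page["count"]   # varies with load

# the single combined figure hides the distinction
avg_all = (one_page["total_ms"] + multi_page["total_ms"]) / \
          (one_page["count"] + multi_page["count"])

print(f"1 page: {avg_one:.2f} ms, >1 page: {avg_multi:.2f} ms, "
      f"overall: {avg_all:.2f} ms")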

Reading from logs

If an application using persistent messages decides to roll back:

  • MQ reads the log buffers for the transaction’s data and undoes any changes.
  • It may be the data is old and not in the log buffers, so the data is read from the active log data sets.
  • It may be that the request is really old (for example half an hour or more), MQ reads from the archive logs (perhaps on tape).

Looking at an application doing a roll back, and having to read from the log:

  • Reading from buffers is OK.   A large number indicates an application problem or a DB2 deadlock-type problem.  You should investigate why there is so much rollback activity.
  • Reading from Active logs … this should be infrequent.  It usually indicates an application coding issue where the transaction took too long before the commit, perhaps due to a database deadlock, or bad application design (where there is a major delay before the commit).
  • Reading from Archive logs… really bad news…..  This should never happen.

Displaying reads from LOGS

Today         XXXXXXXXXXXXXXXXXXXXXXXX
Last week     X
One hour ago  XXXXX
rate          0        10    20     40

Where green is “read from buffer”, orange is “read from active log”, and red is “read from archive log”. Today looks like a bad day.

Should I monitor MQ – if so, what for?

I’ve been talking to someone about using the MQ SMF data, and when would it be useful. There is a lot of data. What are the important things to watch for, and what do I do when there is a problem?

Why monitor?

From a performance perspective there are a couple of reasons why you should monitor

Today’s problems

Typical problems include

  1. “Transaction slow down”, people using the mobile app are timing out.
  2. Our new marketing campaign is a great success – we have double the amount of traffic, and the backend system cannot keep up.
  3. The $$ cost of the transactions has gone up.   What’s the problem?

With problems like transaction slow down, the hard part is often to determine which is the slow component.  This is hard when there may be 20 components in the transaction flow, from front end routers, through CICS, MQ, DB2, IMS, and WAS, and the occasional off-platform request.

You can often spot problems because work is building up (there are transactions or messages queued up), or response times are longer.  This can be hard to interpret because “the time to put a message and get the reply from MQ is 10 seconds” may at first glance be an MQ problem – but MQ is just the messenger, and the problem is beyond MQ.  I heard someone say that the default position was to blame MQ, unless the MQ team could prove it wasn’t them.

Yesterday’s problem

Yesterday/last week you had a problem and the task force is looking into it.  They now want to know how MQ/DB2/CICS/IMS etc were behaving when there was a problem.  For this you need Historical Data.  You may have summary data recorded on a minute by minute basis, or you may have summary data recorded over an hour.   If the data is averaged over an hour you may not see any problems. A spike in workload may be followed by no work, and so on average everything is OK.
It is useful to have “maximum value in this time range”. So if your maximum disk I/O time was 10 seconds in this interval at 16:01:23:45 and the problem occurred around this time, it is a good clue to the problem.

Tomorrow’s problem.

You should be able to get trending information.  If your disk can sustain an I/O rate of 100 MB a second, and you can see that every week at peak time you are logging an extra 5 MB/second, this tells you that you need to do something to fix it: either get faster disks, or split the work to a different queue manager.
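The arithmetic is simple enough to sketch (the numbers are invented):

# rough trending: weeks until logging saturates the disk
current_rate    = 70    # MB/second at this week's peak
growth_per_week = 5     # extra MB/second seen each week, from the trend
disk_limit      = 100   # MB/second the disk can sustain

weeks_left = (disk_limit - current_rate) / growth_per_week
print(f"About {weeks_left:.0f} weeks before the log disk is saturated")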

Monitoring is not capacity planning.

Monitoring shows how the system is performing in the current configuration.  Monitoring may show a problem, but it is up to the capacity and tuning team to fix it.  For example, “how big a buffer pool do we need?” is a question for the capacity team.  You could make the buffer pools use GBs of buffers – or keep the buffer pools smaller and let MQ “do the paging to and from disk”.

How do you know when a ‘problem’ occurs?

I remember going to visit a customer because they had a critical problem with the performance on MQ.  There were so many things wrong it was hard to know where to start.  The customer said that the things I pointed out were always bad – so they were not the current problem.  Eventually we found the (application) problem.  The customer was very grateful for my visit – but still did not fix the performance problems.

One thing to learn from this story is that you need to compare a bad day with a good day, and see what is different.  This may mean comparing it with the data from the same time last week, rather than from an hour ago.  I would expect that last week’s profile should be a good comparison to this week.   One hour ago there may not have been any significant load.

With MQ, there is a saying “A good buffer pool is an empty buffer pool”.  Does a buffer pool which has filled up, causing lots of disk I/O, mean there is a problem?  Not always.  It could be MQ acting as a queueing system: if you wait half an hour for the remote system to restart, all of the messages will flow, and the buffer pool becomes empty.  If you know this can happen, it is good to be told it is happening, but the action may be “watch it”.  If this is the first time it has happened, you may want to do a proper investigation, and find out which queues are involved, which channels are not working, and which remote systems are down.

What information do I need?

It depends on what you want.  If you are sitting in the operations room watching the MQ monitor while sipping a cup of your favourite brew, then you want something like modern cars: if there is a problem, a red light on the dashboard lights up meaning “You need to take the car to the garage”.   The garage can then put the diagnostic tools onto the engine and get the reason.

You want to know if there is a problem or not.  You do not need to know you have a problem to 3 decimal places – yes, maybe, or no is fine.

If you are investigating a problem from last week, you, in the role of the garage mechanic, need the detailed diagnostics.

When do you need the data?

If you are getting the data from SMF records you may get a record every 1 minute, or every half an hour.  This may not be frequent enough while there is a problem.  For example if you have a problem with logging, you want to see the data on a second by second basis, not once every 30 minutes.

Take the following scenario.  It is 10:59 – 29 minutes into the interval for which you get SMF (or online monitor) data.

So far in this interval, there have been 100,000 I/Os.   The total time spent doing I/Os is 100 seconds.  By calculation the average time for an I/O is 1 millisecond.  This is a good response time.

You suddenly hit a problem, and the I/O response time goes up to 100 ms; 10 more I/Os are done.

The total number of I/Os is now 100,010, and the time spent doing I/Os is now 101 seconds.  By calculation the average I/O time is now 1.009899 milliseconds.  This does not show there is a problem, as this is within typical variation.

If you can get the data from a few seconds ago and now you can calculate the differences

  1. number of I/Os: 100,010 – 100,000 = 10
  2. time spent doing I/O: 101 – 100 = 1 second
  3. average I/O time: 1 second / 10 = 100 ms – wow, this really shows the problem, compared to calculating the value from the 30 minute set of statistics, which showed the time increasing from 1 ms to 1.01 ms.

This shows you need granular data, perhaps every minute – but this means you get a lot of data to manage.
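A sketch of the two calculations, using the numbers above:

# differences between two close snapshots versus the interval average
prev = {"ios": 100_000, "seconds": 100.0}   # snapshot a few seconds ago
curr = {"ios": 100_010, "seconds": 101.0}   # snapshot now

delta_ios  = curr["ios"] - prev["ios"]            # 10 I/Os
delta_time = curr["seconds"] - prev["seconds"]    # 1 second
print(f"recent average:   {1000 * delta_time / delta_ios:.1f} ms")          # 100.0
print(f"interval average: {1000 * curr['seconds'] / curr['ios']:.3f} ms")   # ~1.010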
 

Do not restart with the fire hose set to maximum.

There is an article in The Register about an outage at the Tokyo stock exchange.  One of the problems was that they did not have a process for restarting the environment.  The impact of restarting a system is often overlooked, and in the panic of “get it started as quickly as possible” things can go wrong.   The fire brigade slowly increases the pressure in a fire hose to stop the fire crew from being knocked down by the sudden flow.

TCP/IP is good because it has a “slow start” protocol.  Once a connection has been established and is working well, the exchange can use bigger buffers, and send more buffers before waiting for the acknowledgement.   This boosts the throughput.  If the back-end is slow to process the data, TCP slows down the traffic, and then increases the throughput again if the connection can handle it.  If the connection stops and restarts, the rate starts slowly and builds up, rather than using the rate from just before the outage.

You cannot expect WAS/CICS/DB2/MQ/IMS to restart at maximum speed; they have to work up to it.  Transactions may have to warm up. There can be many reasons:

  1. Data may need to be read from page sets into buffers, for example reading hot DB2 data into memory.
  2. Java code needs to warm up to become more efficient (JITed).
  3. The systems need to establish a working set, for example making a buffer pool larger.
  4. Establishing connections may have some serialisation delays.

Restarting faster than a system can cope with can cause a domino effect.  A transaction server is restarted and the fire hose of data is turned on.   The transaction server is still warming up, and cannot cope with the volume of requests.    Work for this system is then routed to another transaction server, which could have handled the workload if the volume had increased gradually.  Getting this additional work all at once, this instance slows down too, and the work is routed to another transaction server, and so on.

MQ can be seen as the bad guy here.  When you restart MQ, it can go to fire hose mode immediately.   You should start the output channels first to start draining messages, then gradually start the input channels.   If you start the input channels before the output channels, you may get queues and page sets filling up, before the output channels can process the messages.

If you have a policy that all clients must disconnect and reconnect at a random time between 15 and 45 minutes, this should help spread the load, and gradually you should get a balanced environment.
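A sketch of such a client-side policy in Python (connect and disconnect are placeholders for your MQ client logic):

import random
import time

def connect():       # placeholder for your MQ client connect logic
    print("connected")

def disconnect():    # placeholder for your MQ client disconnect logic
    print("disconnecting, will reconnect")

while True:          # run for the life of the client
    connect()
    time.sleep(random.uniform(15 * 60, 45 * 60))   # stay connected 15-45 minutes
    disconnect()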

What’s the difference between MQ Web, and z/OS Connect MQ support?

With MQ Web

  1. You can issue commands to administer MQ, for example display, define, and delete MQ objects.
  2. You can put and get messages to and from a queue.  The message is what you specify – typically a character string.

With z/OS Connect MQ support

  1. You can put and get messages to and from a queue, and do transformations on the message.  For example mapping a COBOL structure to JSON.  
  2. You can do field validation.
  3. You can convert HTTP code “200” to “great it worked”.

What is common?

They both use z/OS WebSphere Liberty to provide the basic web server.

A practical path to installing Liberty and z/OS Connect servers – 10. Use the MQ sample

Introduction

I’ll cover the instructions to install z/OS Connect, but the instructions are similar for other products. The steps are to create the minimum server configuration and gradually add more function to it.

The steps below guide you through

  1. Overview
  2. Planning, to help you decide what you need to create, and what options you have to choose from
  3. Initial customisation and creating a server: creating defaults and creating function-specific configuration files, for example a file for SAF
  4. Starting the server
  5. Enabling logon security and adding SAF definitions
  6. Adding keystores for TLS, and client authentication
  7. Adding an API and service application
  8. Protecting the API and service applications
  9. Collecting monitoring data including SMF
  10. Using the MQ sample
  11. Using WLM to classify a service

With each step there are instructions on how to check the work has been successful.

Use the MQ sample

You need to have installed the service, and protected it.

You need to configure the server to include the MQ support, and tell JMS where the MQ libraries are

<server> 
<!-- Enable features --> 
<featureManager> 
    <feature>zosconnect:mqService-1.0</feature> 
</featureManager> 
                                                                                                         
<wmqJmsClient nativeLibraryPath="/Z24A/usr/lpp/mqm/V9R1M1/java/lib"/> 

<variable name="wmqJmsClient.rar.location"
   value="/Z24A/usr/lpp/mqm/V9R1M1/java/lib/jca/wmq.jmsra.rar"/> 
</server> 

You could configure a variable for the MQ directory so you specify it once, and use

<variable name="wmq" value="/Z24A/usr/lpp/mqm/V9R1M1/java/lib/"/>
<wmqJmsClient nativeLibraryPath="${wmq}"/>
<variable name="wmqJmsClient.rar.location"
   value="${wmq}jca/wmq.jmsra.rar"/>

You could also pass the mq location as a variable in STDENV, and so pass it in through JCL.
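For example (a sketch – the variable name WMQ_DIR is my choice), the server JCL could have

//STDENV   DD *
WMQ_DIR=/Z24A/usr/lpp/mqm/V9R1M1/java/lib
/*

and the configuration can then pick it up using Liberty’s ${env...} syntax, for example <wmqJmsClient nativeLibraryPath="${env.WMQ_DIR}"/>.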

Configure the JMS definitions for the queue manager and queues

<server> 
 <jmsConnectionFactory jndiName="jms/cf1" 
     connectionManagerRef="ConMgr1"> 
    <properties.wmqJms transportType="BINDINGS" 
         queueManager="CSQ9"/> 
 </jmsConnectionFactory> 
                                                                                                      
 <connectionManager id="ConMgr1" maxPoolSize="5"/> 
                                                                                                      
 <!-- A queue definition where request messages 
      for the stock query application are sent. --> 
 <jmsQueue jndiName="jms/stockRequestQueue"> 
    <properties.wmqJms 
       baseQueueName="STOCK_REQUEST" 
       targetClient="MQ"/> 
 </jmsQueue> 
                                                                                                      
 <!-- A queue definition where response messages from 
      the stock query application are sent. --> 
 <jmsQueue jndiName="jms/stockResponseQueue"> 
    <properties.wmqJms baseQueueName="STOCK_RESPONSE" targetClient="MQ"/> 
 </jmsQueue> 
</server>

and include these in the server.xml file.

You need to compile and run the back end service.  See here.  Take care if using cut and paste, as there are long lines which wrap and cause compilation errors.

Because the MQ path name is long I used

export HLQ="/usr/lpp/mqm/V9R1M1/java/lib"

java -cp $HLQ/com.ibm.mq.allclient.jar:. -Djava.library.path=$HLQ TwoWayBackend CSQ9 STOCK_REQUEST STOCK_RESPONSE

I set up a job to run this in the background, so I could free up my TSO terminal.
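A sketch of such a job, using BPXBATCH (the script path is my choice – put the export and java commands above into the script, and adjust the job card to your standards):

//COLINB   JOB 1,MSGCLASS=H
//RUN      EXEC PGM=BPXBATCH,REGION=0M,
//   PARM='SH /u/colin/runbackend.sh'
//STDOUT   DD SYSOUT=*
//STDERR   DD SYSOUT=*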

Use the API

Once installed you should be able to use the API. For example

curl --insecure -i --cacert cacert.pem --cert adcdd.pem:password --key adcdd.key.pem https://10.1.3.10:9443/stockmanager/stock/items/999999

If the back end application was working I got

{"SQRESP":{"ITEM_ID":999999,"ITEM_DESC":"A description. 00050","ITEM_COST":45,"ITEM_COUNT":0}}

If the back end application was not working I got back an empty response.

The back-end application runs until Ctrl+C is used.  On my USS, the escape key is the cent symbol ¢ (Unicode 00a2), which I do not have on my default keyboard.    See x3270 – where’s the money key? for guidance on how to set it.

 

Use the Service

To use the API I used a web browser with

https://10.1.3.10:9443/stockmanager/stock/items/999999

and got back

{"SQRESP":{"ITEM_ID":999999,"ITEM_DESC":"A description. 00050","ITEM_COST":45,"ITEM_COUNT":0}}

or curl with

 curl --insecure -X POST \
  -H "Content-Type:application/json" -i --cacert cacert.pem \
  --cert adcdd.pem:password --key adcdd.key.pem \
  --data '{"STOCKQRYOperation": {"sqreq" : { "item_id": 2033}}}' \
  https://10.1.3.10:9443/zosConnect/services/stockQuery?action=invoke

Why isn’t my MQ RACF command working?

I was trying to define an MQ queue using %CSQ9 DEFINE QL(AA) and was getting

ICH408I USER(IBMUSER ) GROUP(SYS1 ) NAME( ) 
CSQ9.QUEUE.AA CL(MQADMIN ) 
PROFILE NOT FOUND - REQUIRED FOR AUTHORITY CHECKING 
ACCESS INTENT(ALTER ) ACCESS ALLOWED(NONE ) 

But the profile existed!
The command TSO RLIST MQADMIN CSQ9.QUEUE.AA showed me the profile which would be used

CLASS NAME
----- ----
MQADMIN CSQ9.QUEUE.* (G)

It did not look like the class was being cached

SETR RACLIST CLASSES =  APPL CBIND CDT CONSOLE CSFKEYS CSFSERV DASDVOL     
                        DIGTCERT DIGTCRIT DIGTNMAP DIGTRING DSNR EJBROLE   
                        FACILITY IDIDMAP LOGSTRM OPERCMDS PTKTDATA PTKTVAL 
                        RDATALIB SDSF SERVAUTH SERVER STARTED SURROGAT     
                        TSOAUTH TSOPROC UNIXPRIV WBEM XCSFKEY XFACILIT     
                        ZMFAPLA ZMFCLOUD 

But I missed the

GLOBAL=YES RACLIST ONLY =  MQADMIN MQNLIST MQPROC MQQUEUE MXTOPIC      

I used the

TSO SETROPTS RACLIST(MQADMIN) REFRESH

and the define command worked. Another face-palming moment.

Lesson learned – if in doubt, use the SETROPTS RACLIST(MQADMIN) REFRESH command.

 

Looking for an MQ reason code in Liberty? Get your safari helmet, anti malarial tablets and follow me to find the treasure.

I was using an MQ application in Liberty, and rather than do things the easy way, I did what I normally do, and did it the hard way.  On my z/OS I did not have the queue manager defined, because I wanted to see what happened.  I was not expecting the expedition.

You configure MQ in Liberty using configuration like

<jmsConnectionFactory jndiName="jms/cf1" connectionManagerRef="ConMgr1"> 
   <properties.wmqJms transportType="BINDINGS" queueManager="MQPA"/>
</jmsConnectionFactory>

 

I was expecting a message like the following in the job output.

Application COLINAPP MQCONN call to MQPA failed with compcode 
'2' ('MQCC_FAILED') reason '2058' ('MQRC_Q_MGR_NAME_ERROR').

Oh no, it was not that easy.  It was quite a trek into the jungle to find the information.

In the Liberty server’s logs directory there is a message.log file.  In this file I had

9/14/20 19:16:32:242 GMT 00000060 com.ibm.ws.logging.internal.impl.IncidentImpl I FFDC1015I: An FFDC Incident has been created: "com.ibm.mq.connector.DetailedResourceException: MQJCA1011: Failed to allocate a JMS connection., error code: MQJCA1011 An internal error caused an attempt to allocate a connection to fail. See the linked exception for  details of the failure. com.ibm.ejs.j2c.poolmanager.FreePool.createManagedConnectionWithMCWrapper 199" at 
ffdc_20.09.14_19.16.28.0.log

This was one long line, and I had to scroll sideways (just like you did) to see the content (or use the ISPF line prefix command “tf” to flow the text to the display width).  A key hint was the message MQJCA1011 An internal error caused an attempt to allocate a connection to fail  so I knew I was on the right trail.  I now knew the name of the file – ffdc_20.09.14_19.16.28.0.log.

Knowing the name of the file did not help very much, because if you use ISPF 3.17 (z/OS UNIX Directory List) it shows a list of 40 files with the name ffdc_20.09.14_1 (ffdc_yy.mm.dd_h).   This is because it only displays the first part of the name. Thanks to Steve Porter who said:

To increase column size in 3.17:
Options
1. Directory List Options…
Width of filename column . . . . . . . . 15 (default value – increase as necessary)

 

The file has a name ffdc_20.09.14_19.16.28.0.log and a displayed time stamp of 2020/09/14 18:16:32, which is close enough – allowing for the time zone difference and the time taken to write the file.  I was fortunate not to be running a workload and producing many of these files.

I edited the file – and I could see the full file name at the top of the page, so I knew I was in the right file.

The file has long lines, so I had to scroll or use the “tf” line command to reformat it.

Near the top it had

Stack Dump = com.ibm.mq.connector.DetailedResourceException: 
MQJCA1011: Failed to allocate a JMS connection., error code:  
MQJCA1011 An internal error caused an attempt to allocate a connection to fail. 
See the linked exception for details of the  failure.

Further down it had

Caused by: com.ibm.msg.client.jms.DetailedJMSException: 
JMSWMQ0018: Failed to connect to queue manager 'MQPA' with connection 
mode 'Bindings' and host name 'localhost(1414)'.

and further down still (line 50) I found the treasure

Caused by: com.ibm.mq.MQException: JMSCMQ0001: IBM MQ call failed with 
compcode '2' ('MQCC_FAILED') reason '2058'  ('MQRC_Q_MGR_NAME_ERROR').

What a trek to find the information I needed!

Next time I’ll just list the logs/ffdc directory, edit (not browse) each file, and search for “compcode”.   You cannot use “grep compcode” from USS, because the file is in UTF-8, so grep does not find the EBCDIC search string.  You can just use oedit file_name in USS.
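Another option is to convert the files before searching them, along these lines (a sketch – the directory path is an example):

cd /u/mqweb/servers/mqweb/logs/ffdc    # your server's logs/ffdc directory
for f in ffdc_*.log; do
  # convert from UTF-8 to EBCDIC so grep can match the text
  if iconv -f UTF-8 -t IBM-1047 "$f" | grep -q compcode; then
    echo "$f contains compcode"
  fi
done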

It would be nice if the MQ code could be enhanced to have an option “makeErrorsHardToFind” which you could set to “no”, and still keep the default “yes”.

 

Setting up a highly available web server is a “yes – but” problem

I’ve been setting up a Liberty web server, as used in MQWEB, z/OS Connect, z/OSMF and so on, and was looking into how to make this available, so I could move the web server to a different LPAR or TCP/IP instance.

Moving it should be easy – it is – but there are things you need to think about. It is a bit like going around a maze trying to find the solution.

How do I get to the fail over system?

You start the web server on a different LPAR in the sysplex. How can you support this to allow your browser to get to the backend, without changing the URL?

You have two choices.

  1. You change your DNS lookup, or router, so your request goes via a different connection (think: a different bit of wire) to the failover LPAR. These changes can be automated to some extent.
  2. Multiple z/OS images can listen for an IP address.

These work but…

The certificate sent down from the web server contains the address of the LPAR as part of the SAN (Subject Alternative Name).  When the browser processes it, it compares the address in the certificate with the address it used to connect.  If they do not match, the browser produces an error message.

How do I get over the certificate SAN and the IP address difference?

You have a couple of choices

  1. Use a unique certificate on each LPAR.    Yes, this works, but there is more administrative work to set up.  You could set up two web servers and only use one at a time.   This works, but it is unnecessary work.
  2. Use a Virtual IP address.   In TCP networking the end of every connection is a “device” or system with its unique IP address.   You can give the web server its own IP address, which is “virtual” as it is not a device or system.   With this, when you start your web server on a different LPAR, it has the same IP address.  To use this you have to configure z/OS to support it.  You can set this up
    1. To support multiple web servers, and distribute the work to them
    2. To have a hot standby
    3. To route traffic to where the web server has started.

Yes, these work, but they are not easy to set up.  I’ll be blogging how to do this.
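To give a flavour, a dynamic VIPA in the TCP/IP profile looks something like the following (a sketch only – the addresses and port are examples, and there is more to it, such as the definitions on the backup LPARs):

VIPADYNAMIC
   VIPADEFINE 255.255.255.0 10.1.9.9
   VIPADISTRIBUTE DEFINE 10.1.9.9 PORT 9443 DESTIP ALL
ENDVIPADYNAMIC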

Customising for MQWEB Liberty on z/OS, things the documentation does not tell you about

This post covers the customising you need to consider for enterprise use of the Liberty MQWEB server.  It covers

  • Setup the USS path and defining an alias for the mq executable’s  directory
  • Do you have common configuration across mqwebuser.xml files?
  • Decide if you want to use setmqweb.
  • Setting up the server’s certificate and keyring
  • Setting up the trust store
  • Setting up the Angel process(es)
  • Reserving the TCP/IP Port number
  • Customising the jvm.options
    • To prevent the web server coming up if the Angel process is missing
    • Setting the time zone
  • Customising the mqwebuser.xml
    • SAF definitions
    • Setting the log sizes so the logs can be viewed
  • Letting requests in from outside of the LPAR
  • dspmqweb/setmqweb – which instance to use?
  • Selecting which IP stack to use
  • Customising ISPF option 3.17 – Unix Directory List

Setup the USS path and defining an alias for the mq executable’s directory

To be able to use the dspmqweb and setmqweb commands you need to point to the command location.

You can add to the user’s .profile file, or to /etc/profile, the statement

export PATH=/usr/lpp/mqm/V9R1M1/web/bin:$PATH

If you have multiple releases of MQ in your environment you could set up shell commands like v913dspmqweb.sh

/usr/lpp/mqm/V9R1M3/web/bin/dspmqweb "$@"

But this causes extra work when you need to migrate to the new release.  It might be better to set up an alias

ln -s /usr/lpp/mqm/V9R1M3/web/bin /v913
ln -s /usr/lpp/mqm/V9R1M3/web/bin /mqcur

so you just need to type /v913/dspmqweb or /mqcur/setmqweb

As part of the migration to a newer release you just change the alias.

Do you have common configuration across mqwebuser.xml files?

If you have multiple mqweb instances, either because you have multiple LPARs in a sysplex, or you have to support different releases of MQ concurrently, you may want to put common configuration in an include file. For example, create a file common.xml to hold the configuration and put

<include location="common.xml" optional="false"/>

in the mqwebuser.xml file.
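For example, common.xml might hold the settings which are the same everywhere (a sketch, using the log settings discussed later in this post):

<server>
    <!-- configuration shared by all mqweb instances -->
    <variable name="maxTraceFileSize" value="20"/>
    <variable name="maxTraceFiles" value="20"/>
</server>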

Decide if you want to use setmqweb.

You can update your *.xml configuration files, or use setmqweb to update mqwebuser.xml for you.

Some organisations do not allow manual changes to configuration. You have to change a configuration file, have it reviewed, and use automation to deploy it.

For test systems it may be ok to use the setmqweb command and change things dynamically.

If you make a change using setmqweb, it updates the mqwebuser.xml file by adding or changing a <variable name="…" value="…"/> statement.

If you are using SAF authentication and certificate authentication

You will need a keyring with the certificate to identify the server (the key store).  You will need a keyring to identify the certificates you trust (the trust store).  You could use the same keyring for both – but this is not good practice.

The server’s certificate and key store keyring

You need to decide if the MQWEB server uses the same certificate as CICS, WAS, z/OS Connect etc. on the same LPAR.  You could have a common certificate to simplify administration. The certificate needs a Subject Alternative Name (SAN) to identify the machine the certificate came from. This can be the DNS name or the dotted address (9.20.4.6), depending on your set up.  It might be easier to define both. Note that the RACF command

RACDCERT .. ALTNAME(IP(10.1.1.2) IP(10.1.1.3) DOMAIN('WWW.ME.COM') DOMAIN('WWW.LAST.COM'))…

accepts multiple entries, but only uses the last one. The above command produced a certificate with

Subject's AltNames: 
IP: 10.1.1.3 
Domain: WWW.LAST.COM

This means you may only be able to use the certificate on the LPAR for which it was defined (if you move the server to a different LPAR, it will have a different IP address, and your clients will complain – see below).   You may be able to do something clever with VIPA (Virtual IP addressing), where your sysplex has one IP address and this maps to different IP addresses on each LPAR.

If you have the wrong IP or Domain then the browser gets a message like “Your connection is not private. Attackers may try to steal your information from 10.1.1.2.  NET::ERR_CERT_COMMON_NAME_INVALID”.

The trust store keyring.

The trust store keyring has the certificates to authenticate what has been sent from the client.  For example, a copy of any self signed certificate, or the Certificate Authorities of the Web Browser’s certificate.

This keyring could be sysplex wide, and shared by CICS, WAS, z/OS Connect etc – assuming they have the same people connecting to them.

The certificates may have been configured with owner CERTAUTH rather than a userid.

My definitions

<sslDefault sslRef="defaultSSLConfig"/> 
<ssl id="defaultSSLConfig" 
   sslProtocol="TLSv1.2" 
   keyStoreRef="racfKeyStore" 
   trustStoreRef="racfTrustStore" 
   clientAuthenticationSupported="true" 
   clientAuthentication="true" 
   serverKeyAlias="MYMQWEB"/> 

<keyStore filebased="false" id="racfKeyStore" 
   location="safkeyring://START1/KEY" 
   password="password" 
   readOnly="true" 
   type="JCERACFKS"/> 

 <keyStore filebased="false" id="racfTrustStore" 
   location="safkeyring://START1/TRUST" 
   password="password" 
   readOnly="true" 
   type="JCERACFKS"/> 

<webAppSecurity allowFailOverToBasicAuth="false"/>
  • The sslDefault  points to the ssl with the same ID
  • The ssl points to
    • the key store with the server’s certificate, with the id racfKeyStore
    • the trust store to validate connecting clients, with the id racfTrustStore

Create an Angel

You need an Angel process to handle the SAF (RACF) security requests – the MQ documentation tells you this.

Typically the Angel started task is started at IPL, and shut down at system shut down.
All instances of the Liberty Web Server running on an LPAR can use the same Angel, for example the z/OSMF Angel IZUANG1.

You cannot shut down the Angel process if it is in use, but if you cancel it, the servers using it will stop working (hang) and may abend.

You may want to consider more than one Angel process, and not share it.

When the Angel process has started, it uses no CPU, as the Web Servers execute code within the  Angel address space, on the Web Server’s threads – just like MQ, DB2 etc.

Customise jvm.options

Stop if there is no Angel process

If the Angel process is not running at Liberty startup,  then the Web Server may continue to come up.  People will not be authorised to access it, but the Web Server will be running.   This is pretty useless.

You can specify an option so the liberty server (MQWEB) does not start if the Angel task is not running.

I use

-Dcom.ibm.ws.zos.core.angelRequired=true
#-Dcom.ibm.ws.zos.core.angelName=MYANGEL

-Dcom.ibm.ws.zos.core.angelRequired=true

If the angel process is not available then the MQWEB stops when it detects the angel is not available.

#-Dcom.ibm.ws.zos.core.angelName=MYANGEL

If you are using a named Angel, uncomment this and specify the Angel name.

If you are using the unnamed Angel, leave this commented.

Set the time zone

The time zone is picked up from TZ in /etc/profile, but you can override it by specifying

-Duser.timezone=Europe/London

This sets the time zone of the messages in the messages.log and trace.log files.

Reserve the TCP/IP port number

It is a good idea to talk to the networking team and get them to update the TCP/IP configuration, for example

PORT 
    20 TCP OMVS NOAUTOLOG ; FTP Server 
    21 TCP OMVS ; FTP Server 
    22 TCP SSHD* ; port for sshd daemon 
    23 TCP TN3270 ; Telnet 3270 Server 
    ...
    1414 TCP CSQ9CHIN ; CSQ9 MQ TCP Listener  
    ...
    9443 TCP MQWEB ; Colin Paice MQWEB

 

Customise mqwebuser.xml

Message log and trace file settings

If the trace or message files are too big, you cannot view them. You have to use edit to look at them; if the file is too large, browse is substituted, and browse does not do code page conversion, so you are looking at raw ASCII characters in an EBCDIC browser.

<variable name="maxTraceFileSize" value="20"/>
<variable name="maxTraceFiles" value="20"/>
<variable name="maxMsgTraceFileSize" value="20"/>
<variable name="maxMsgTraceFiles" value="20"/>

The file size values are in MB.

You should consider keeping your messages.log files for a week or so, so make the number of files large enough.

SAF – Access to RACF

If you are using SAF (RACF or other z/OS security manager) to manage access and authorisation you will have a default entry like

<!-- 
Example SAF Registry 
--> 
<safAuthorization racRouteLog="NONE" id="saf" /> 

<safRegistry id="saf" /> 
<safCredentials unauthenticatedUser="WSGUEST" profilePrefix="MQWEB" 
suppressAuthFailureMessages="false" /> 

I use <safAuthorization racRouteLog="ASIS"… to get RACF violation messages on the joblog during set up.  See here.

<safCredentials suppressAuthFailureMessages="false"… prints out violation messages.  See here.

Let requests in from outside z/OS

For this to work you have to edit the mqwebuser.xml file and uncomment

<variable name="httpHost" value="*"/> 
<!-- 
-->

By default it only allows requests from the same z/OS system – so browsers are not allowed access.

dspmqweb/setmqweb – which instance to use?

This page  says you must use

export WLP_USER_DIR=WLP_user_directory

This is fine when you have one mqweb instance on one LPAR.  You might want a shell program to set this every time.  For example, the program dspMQPAweb.sh

export WLP_USER_DIR=/u/mqmweb/MQPA
/usr/lpp/mqm/V9R1M1/web/bin/dspmqweb "$@"

Then you can use dspMQPAweb.sh instead of /usr/lpp/mqm/V9R1M1/web/bin/dspmqweb.

If you have multiple releases of MQ in your environment you might want to point to the command in the script, so dspMQPA.sh might have

export WLP_USER_DIR=/u/mqmweb/MQPA
/usr/lpp/mqm/V9R1M1/web/bin/dspmqweb "$@"

Though it might be better to have a shell script mq911 with an optional queue manager parameter, as sketched below.
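For example (a sketch – the queue manager to WLP_USER_DIR mapping is my own):

#!/bin/sh
# mq911: run the MQ 9.1.1 dspmqweb for an optional queue manager (default MQPA)
qmgr=${1:-MQPA}
[ $# -gt 0 ] && shift
export WLP_USER_DIR=/u/mqmweb/$qmgr
exec /usr/lpp/mqm/V9R1M1/web/bin/dspmqweb "$@"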

Selecting which IP stacks to use.

There is an article from IBM which gives two ways of configuring it: changing the httpEndpoint, or specifying an environment variable.

Customise ISPF z/OS UNIX Directory List

In the MQWEB directory are message logs and trace logs.  When a file fills up, it is renamed to include the date and time, for example messages_20.07.29_16.49.29.0.log, and a new messages.log or trace.log is created.

If you are using ISPF 3.17 (z/OS UNIX Directory List) to use the files, it only displays the first 15 characters of the file name, so you get lots of files with a name like “messages_20.07.” where 20 is the year, and 07 is the month.

The z/OS UNIX Directory List displays by default some unhelpful fields.   You can rearrange the fields (but not make the filename field wider).
If you go to the OPTIONS on the top line, and select “2. Directory List Column Arrangement…” you can change which fields are displayed, and their order.  I set the widths of all fields to 0, except for

  • Type 04
  • Modified 19 (if you specify a smaller value you only get the YYYY-MM…  not the time)
  • Size 10

The documentation says

  • Modified The date and time the file was last changed.
  • Changed The date and time the status of the file was last changed.

I do not know the difference between these two.

Controlling what is displayed

In the directory list you can use sort commands

  • sort file A
  • sort mod D

Looking at a log or trace file

If you sort by Modified D the newer files will be at the top, so you can look at the “modified” column to look for the time the file was created, and so get the order of the files.

You can use the line command / to display the options.

You can use e to edit, or v to view (edit in browse mode).

Browse displays a mess because it does not do conversion

Liberty on z/OS: Mapping an incoming certificate to a z/OS userid for client certificate authentication – and don’t forget the cookies!

I thought I understood how this worked; I found I didn’t, and then had a few days hunting around for the problem.

The basics

You can use a digital certificate from a web browser (curl, or other tools) to authenticate to z/OS.  You need to map the certificate to a userid.

A certificate coming in can have a Distinguished Name like CN=adcdd.O=cpwebuser.C=GB  (note the ‘.’ not ‘,’ between elements).

Your userid needs to have SPECIAL defined to be able to use the RACDCERT command (SPECIAL, not just GROUP-SPECIAL).

You will need a definition like (see here for the command)

RACDCERT MAP ID(ADCDD ) - 
    SDNFILTER('CN=adcdd.O=cpwebuser.C=GB') - 
    WITHLABEL('adcdd')

or a general definition for those certificates with O=cpwebuser.C=GB, ignoring the CN part

RACDCERT MAP ID(ADCDB ) - 
   SDNFILTER('O=cpwebuser.C=GB') - 
   WITHLABEL('cpwerbusergb') 

or using the Issuing Distinguished Name (the Certificate Authority)

IDNFILTER('CN=TESTCA.OU=SSSCA.C=GB')

Using a generic

SDNFILTER('CN=a*.O=cpwebuser.C=GB')

does not work.

If you attempt to use a certificate which is not mapped you get

ICH408I USER(START1 ) GROUP(SYS1 ) NAME(COLIN)
DIGITAL CERTIFICATE IS NOT DEFINED. CERTIFICATE SERIAL NUMBER(0163)  SUBJECT(CN=adcdd.O=cpwebuser.C=GB) ISSUER(CN=SSCA8.OU=CA.O=SSS.C=GB).

It is worth defining these using JCL, because if you try to add a map which already exists, you get a message saying so.  If you know the userid, you can list the maps associated with it.   If you do not know the userid, there is no practical way of finding out – you have to logon with the certificate and display the userid from the web browser, or extract the list of all users and use LISTMAP on all of them.

Once you have set up the userid, you can connect it to groups to give it access to the EJBROLE profiles.  For example, use group names

  • MQPAWCO MQPAMQWebAdminRO Console Read Only.
  • MQPAWCU MQPAMQWebUser  Console User only.  The request operates under the signed-on userid’s authority.
  • MQPAWCA MQPAMQWebAdmin Console Admin.

for queue manager MQPA, Web  Console (rather than REST) and the access.

You may want to set up userids solely for client authentication.  If the userid has NOPASSWORD, it cannot be used to logon with userid and password, and of course the lack of password means the password will not expire.

Having a set of userids just for certificate access makes it easier to manage the RACDCERT MAPping.    You have a job with

RACDCERT ID(adcd1) LISTMAP
RACDCERT ID(adcd2) LISTMAP
etc

and search the output for the certificate of interest.

It gets more complicated…

Often the user’s certificate is in the form CN=Colin Paice,o=SSS,C=GB, so if you want to allow all people in the MQADMIN team access, you will need to specify them individually.  It would be easier if the DN had CN=Colin Paice,OU=MQADMIN,o=SSS,C=GB; then you could filter on the OU=MQADMIN.   These could map to a userid MQADM1.

It gets more complicated if someone can work with MQ, and CICS or z/OS Connect, and you have to decide on a userid – MQADM1 or CICSADM1?

Setting up a one-to-one mapping may be the best solution, so CN=Colin Paice,o=SSS,C=GB maps to CPAICE (or GB070594).   This userid is then added to the appropriate RACF groups to give access to the EJBROLEs, and so to the servers.
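For example (a sketch, following the filter syntax above):

RACDCERT MAP ID(CPAICE) - 
    SDNFILTER('CN=Colin Paice.O=SSS.C=GB') - 
    WITHLABEL('colinpaice')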

How do I tell what is being used?

I could not get Liberty to record an audit record for the logon/matching.   I tried altering the userid to have UAUDIT – but that did not work either.

If you have audit defined on the class EJBROLE profile MQWEB.com.ibm.mq.console.MQWebUser, you will get an audit record in SMF.   This has many fields, including

  • Date
  • Time
  • ACCESS
  • SUCCESS – or INSACC (INSufficient Access)
  • ADCDC – userid being used
  • READ – Requested access
  • READ – permitted access
  • EJBROLE – the class
  • MQWEB.com.ibm.mq.console.MQWebUser – the profile
  • CN=adcdd.O=cpwebuser.C=GB – the Distinguished Name of the certificate
  • CN=SSCA8.OU=CA.O=SSS.C=GB – the Issuers (Certificate Authority) of the certificate

From this you can see the userid being used, ADCDC, and the certificate DN CN=adcdd.O=cpwebuser.C=GB.

And to make it more complicated

I deleted the RACDCERT MAP entry, but the web browser continued to work with the user.  I had a cup of tea and a cookie, and the web browser stopped working.   Was this problem connected to a cup of tea and a cookie?

Setting up the initial handshake is expensive.  The system has to do a logon with the certificate to get the userid from the RACDCERT mapping.  It then checks that the userid has access to the SERVER profile, then it checks to see if it is MQWebAdmin, MQWebAdminRO, or MQWebUser.

Once it has done this, it takes the userid and information, encrypts it, and creates the LTPA cookie.   This is sent down to the web browser.

The next time the web browser sends some data, it also sends the cookie. The MQWEB server decrypts the cookie, checks the time stamp to make sure the information is current, and if so, uses it.  The timeline I had was

  • create the RACDCERT mapping from certificate DN to userid
  • use browser to logon to mqweb, using the certificate with the DN
  • it works, and mqweb sends down the cookie
  • delete the RACDCERT mapping for the DN
  • restart the browser, logon to mqweb, using the certificate with the DN.  The cookie is passed up – the logon works
  • clear the browser’s cookies – and retry the logon.  It fails as expected.

So ensure the browser cookie is cleared if you change the mapping or ejbrole access for the user.