Getting SSL/TLS to work on MQ on z/OS

After I succeeded in getting TLS 1.3 to run on MQ  midrange 9.2, I thought I would try it on z/OS.  I had not used TLS on z/OS for about 10 years, so it was almost like coming to the topic with very rusty knowledge.

I searched the Knowledge centre and found no relevant hits – lots of hits which were not relevant.  I eventually found an SSL related keyword, and this got me to the topic  Working with SSL/TLS on z/OS.   I think this is well documented.  It covered all the things I had to do.

The remained of this post covers the bits not covered by the documentation.

Define SSLTASKS.

You need to define SSLTASKS to be able to use TLS on z/OS.  See the comments here. I used

%CSQ9 ALTER QMGR SSLTASKS(5)

You need to restart the CHINIT if you change the value of SSLTASKS.

Set up the keyring for the queue manager. 

See here.  This post show how to create the keyring and import a CA from z/OS, and import a CA  from a Linux system.

If you alter the keyring or certlabl you just need a refresh security type(SSL) command to pick up the changes.

Defining the channel

I tried to define the channel, as this failed for security reasons, I’ve given the RACF setup I had to do.

In this section I defined the specific commands for example DEFINE.CHANNEL.   I could have defined DEFINE.* to allow all define commands.

I used a channel called TLS, and define the resource CSQ9.CHANNEL.TLS* to allow my ID to define TLS, TLS1 etc

The command %CSQ9 DEF CHL(TLS) CHLTYPE(SVRCONN) gave me

ICH408I USER(CSQOPR ) GROUP(SYS1 ) NAME(COLIN PAICE ) 167
CSQ9.DEFINE.CHANNEL CL(MQCMDS )
WARNING: INSUFFICIENT AUTHORITY – TEMPORARY ACCESS ALLOWED
ACCESS INTENT(ALTER ) ACCESS ALLOWED(NONE )

I used the RACF commands in a batch job.

 /* RDELETE MQCMDS CSQ9.DEFINE.CHANNEL
RDEF MQCMDS CSQ9.DEFINE.CHANNEL UACC(NONE)
PERMIT CSQ9.DEFINE.CHANNEL CLASS(MQCMDS ) –
        ID(COLIN,IBMUSER) ACCESS(ALTER )

I also set up CSQ9.DELETE.CHANNEL and CSQ9.ALTER.CHANNEL in a similar way, so my userid could maintain the channels.

I refreshed MQ security %CSQ9 refresh security to pick up the changes.

I reissued the command %CSQ9 DEF CHL(TLS ) CHLTYPE(SVRCONN) and got

ICH408I USER(COLIN ) GROUP(SYS1 ) NAME(COLIN PAICE )
CSQ9.CHANNEL.TLS CL(MQADMIN )
PROFILE NOT FOUND – REQUIRED FOR AUTHORITY CHECKING
ACCESS INTENT(ALTER ) ACCESS ALLOWED(NONE )

I used the RACF commands in a batch job.

  /* RDELETE MQADMIN CSQ9.CHANNEL.TLS*
RDEF MQADMIN CSQ9.CHANNEL.TLS* UACC(NONE) WARNING
PERMIT CSQ9.CHANNEL.TLS* CLASS(MQADMIN) –    
        ID(COLIN,IBMUSER) ACCESS(ALTER )
SETROPTS RACLIST(MQADMIN) REFRESH

I issued the commands

%CSQ9 refresh security
%CSQ9 DEF CHL(TLS ) CHLTYPE(SVRCONN)

And successfully defined the channel.

I changed the cipher spec.

I selected a cipher spec from the list.

%CSQ9 alter chl(TLS) chltype(SVRCONN) SSLCIPH(ECDHE_RSA_AES_128_CBC_SHA256)

When I started the channel I got

CSQX631E … CSQXRESP Cipher specifications differ,  channel TLS local=ECDHE_RSA_AES_128_CBC_SHA256 remote=TLS_RSA_WITH_AES_256_GCM_SHA384
connection 10.1.0.2 (10.1.0.2)

This was clear;  I love clear messages.

I decided to change the z/OS end

%CSQ9 alter chl(TLS) chltype(SVRCONN) SSLCIPH(TLS_RSA_WITH_AES_256_GCM_SHA384 )

and the client connected successfully.

With MQ 9.2 I could (and did) change this to

%CSQ9 alter chl(TLS) chltype(SVRCONN) SSLCIPH(ANY_TLS12)

and the client worked successfully.  The ANY_TLS12.  provides a wide list of supported cipher specifications, includes TLS_RSA_WITH_AES_256_GCM_SHA384  and ECDHE_RSA_AES_128_CBC_SHA256. 

When I am ready to support TLS 1.3 I will use ANY_TLS12_OR_HIGHER and ANY_TLS13_OR_HIGHER.

Connect a client to it!

I had had my client connect to a midrange queue manager, so I had working client environment.  See here for the journey.

I created a .json file for the CCDT connection to z/OS.  I specified

{ "channel":
  [{
    "name": "TLS",
    "clientConnection":
    {
      "connection":
      [{
        "host": "10.1.1.2",
        "port": 1414
       }],
      "queueManager": "CSQ9"
    },
    "transmissionSecurity":
    {
      "cipherSpecification": "ANY_TLS12",
      "certificateLabel": "rsaca256_client",
      "certificatePeerName": ""
    },
    "type": "clientConnection"
  }]
}

When it connected I got messages

+CSQX511I %CSQ9 CSQXRESP Channel TLS started connection 10.1.0.2
ICH408I USER(COLINPAI) GROUP( ) NAME(??? )
LOGON/JOB INITIATION – USER AT TERMINAL NOT RACF-DEFINED
IRR012I VERIFICATION FAILED. USER PROFILE NOT FOUND.
+CSQX512I %CSQ9 CSQXRESP Channel TLS no longer active connection 10.1.0.2

COLINPAI came from the userid on the Linux machine (colinpaice) upper cased and truncated. This id will be flowed and used as the MCAUSER if you don’t set it to anything else, using CHLAUTH for example  (Thanks to Morag for this information).

Enable chlauth

To be able to map from the DN in a certificate to a z/OS userid you have to use MQ CHLAUTH.  See  Mapping a client user ID to an MCAUSER user ID.

Check it is enabled at the queue manager level and enable it it needed.

%CSQ9 DIS QMGR CHLAUTH
%CSQ9 ALTER QMGR CHLAUTH(ENABLED)

Define a mapping from certificate to userid

I used

//S1 EXEC PGM=CSQUTIL,PARM='CSQ9' 
//STEPLIB  DD DSN=COLIN.MQ921.SCSQLOAD,DISP=SHR 
//         DD DSN=COLIN.MQ921.SCSQANLE,DISP=SHR 
//SYSPRINT DD SYSOUT=* 
//SYSIN   DD * 
 COMMAND DDNAME(COMMAND) 
//COMMAND DD * 
 SET CHLAUTH('TLS') + 
     TYPE(SSLPEERMAP) SSLPEER('O="cpwebuser"') + 
     ACTION(REPLACE)   + 
     MCAUSER(ADCDD ) CHCKCLNT(ASQMGR) 
/* 

This says for channel TLS,  take the Organisation(O=..)  from the certificate, and if it is cpwebuser then set the ID to ADCDD.

Check it works

Once the channel had started I used

%CSQ( DIS CHS(TLS)
it displayed the following, where I have removed lines which are not relevant to TLS and added some comments

  • CHSTATUS(TLS)
  • CHLTYPE(SVRCONN)
  • SECPROT(TLSV12) – this is the level of the protocol
  • SSLCERTI(CN=SSCARSA1024,OU=CA,O=SSS,C=GB)- this is the DN of the issuer of the SSLPEER certificate (below)
  • SSLCERTU(START1) – the IBM documentation says “The local user ID associated with the remote certificate.”  I dont know where this comes from.. how to change it, or where it is used.
  • SSLCIPH(TLS_RSA_WITH_AES_256_GCM_SHA384) – The negotiated cipher spec
  • SSLRKEYS(0) -The number of successful TLS key resets.
  • SSLKEYTI() -The time on which the previous successful TLS secret key was reset.  The secret key has not been reset
  • SSLKEYDA() -The date on which the previous successful TLS secret key was reset.  The secret key has not been reset
  • SSLPEER(SERIALNUMBER=01:90,CN=rsaca256,O=cpwebuser,C=GB, UNSTRUCTUREDNAME=openssl_ca_user_cnf.keyAgreement2, UNSTRUCTUREDNAME=localhost, UNSTRUCTUREDADDRESS=127.0.0.1) . This is information from the certificate at the remote end.
  • MCAUSER(ADCDD) – This is the userid (set by the CHLAUTH above) used by this channel.
  • LOCLADDR(10.1.1.2(1414)) – This is the address the connection came in from.  This value will be different it you have different IP stacks and different listener ports.

Setting up the MQ keyring on z/OS

I wanted to connect some clients to my z/OS queue manager over a TLS channel.  This post describes how I set up the z/OS keyring with the certificates.

Define the keyring.

The CSQ9 CHINIT runs with a userid of START1, so I defined a keyring belonging to that id.

I set up a dataset  called IBM.MQCSQ9.KEYRING to keep all of my JCL in for the CSQ9 queue manager.  This makes it easier to clone the definitions for another queue manager.

The definitions create the keyring, and add the z/OS CA certificate (CERTAUTH ADCD_CA) to it.

//IBMRACF JOB 1,MSGCLASS=H 
//* Use JCL for the RACF definitions
//S1 EXEC PGM=IKJEFT01,REGION=0M
//SYSPRINT DD SYSOUT=*
//SYSTSPRT DD SYSOUT=*
//SYSTSIN DD *
/*RACDCERT DELRING(MQRING) ID(START1)
RACDCERT ADDRING(MQRING) ID(START1)
RACDCERT ID(START1 ) -
   CONNECT(RING(MQRING) LABEL('ADCD-CA') CERTAUTH)
SETROPTS RACLIST(DIGTCERT,DIGTRING ) refresh

Define the queue manager’s certificate.   This uses an Elliptic curve with key size of 256.

 /*RACDCERT ID(START1) DELETE(LABEL('CSQ9CERT')) 
RACDCERT ID(START1) GENCERT - 
  SUBJECTSDN(CN('CSQ9CERT')  - 
             O('ADCD') - 
             OU('TEST')) - 
   SIZE(256) - 
   NISTECC - 
   SIGNWITH (CERTAUTH LABEL('ADCD-CA')) - 
ALTNAME(IP(10.1.1.9)) - NOTAFTER( DATE(2021-12-29))- WITHLABEL('CSQ9CERT')
RACDCERT id(START1) ALTER(LABEL('CSQ9CERT'))TRUST RACDCERT ID(START1) CONNECT(RING(MQRING) - LABEL('CSQ9CERT') USAGE(PERSONAL)) RACDCERT LISTRING(MQRING) ID(START1) RACDCERT LIST(LABEL('CSQ9CERT' )) ID(START1) SETROPTS RACLIST(DIGTCERT,DIGTRING ) refresh

It need ALTNAME(IP(10.1.1.2)) ( or similar) because some browsers check this name, with the IP address of the server.

 

Configure the queue manager

%CSQ9 ALTER QMGR SSLKEYR(START1/MQRING) CERTLABL(CSQ9CERT)

Note if you use mixed case keyring you need to put the value in quotes.

Export the CA certificate from z/OS and sent to the client machine

RACDCERT CERTAUTH EXPORT(LABEL('ADCD-CA'))- 
DSN('IBMUSER.CERT.ADCDCA.PEM') -
FORMAT(CERTB64) -
PASSWORD('password')

The data set  IBMUSER.CERT.ADCDCA.PEM  contained text, and the first line is —–BEGIN CERTIFICATE—– .   Send this file to the client machine, for example using FTP.  I sent it as zos.adcdca.pem.

Import this to the keystore

runmqakm -cert -add -file zos.adcdca.pem -type cms -stashed -db zzclient.kdb -label zosca

Upload  the certificates from Linux to z/OS.

On Linux, my CA certificiate was in a *.pem file where the first line was —–BEGIN CERTIFICATE—–.  Send this to z/OS.  I used FTP.

Import the CA into the keyring.

The command adds an existing certificate CARSA1024 for userid START1.  The CONNECT USAGE(CERTAUT) defines this as a CA certificate (without the need to have the certificate belong to CERTAUTH userid).

  /*RACDCERT DELETE  ( LABEL('CARSA1024')) ID(START1) 
 SETROPTS RACLIST(DIGTCERT,DIGTRING ) refresh 
                                                                   
 RACDCERT ADD('IBMUSER.CARS1024.PEM')  - 
   ID(START1)  WITHLABEL('CARSA1024') 
 RACDCERT ID(START1) CONNECT(RING(MQRING    ) - 
                             USAGE(CERTAUTH) - 
                             ID(START1)  - 
                             LABEL('CARSA1024') 
 racdcert listring(MQRING ) id(start1) 
 SETROPTS RACLIST(DIGTCERT,DIGTRING ) refresh 

Refresh the queue manager

%CSQ9 refresh security type(SSL)

I’m thinking of using MQ MinimumRSAKeySize. What do I need to plan for?

If you enable this option in the qm.ini or mqclient.ini, it can have problems which are hard to diagnose.

This option restricts

  1. the key size of certificates with a type of RSA
  2. the key size of any CA certificates  with a type of RSA, used to sign a certificate (whether the certificate is RSA or not).

For example with MinimumRSAKeySize=2048 you cannot use a certificate or CA generated with openssl genpkey -out $name.key.pem -algorithm RSA -pkeyopt rsa_keygen_bits:1024.

On the client machine in /var/mqm/errors/*01* I got AMQ9633E: Bad SSL certificate for channel ….

On z/OS I got CSQX620E … CSQXRESP System SSL error, channel … connection … function ‘gsk_secure_socket_init’ RC=541.  Code 451 means “bad certificate” was received from the remote end.

How do I check?

There is no easy way of displaying the details of all of the certificates in a key store.

Midrange.

You can use the following command to list the labels all of the certificates

runmqakm -cert -list all -db zzclient.kdb -type cms -stashed -v

Then use the following command to display the details of each label in turn (zosca in the example)

runmqakm -cert -details -db zzclient.kdb -type cms -stashed -label zosca

This displays information like the example below for a CA certificate of type RSA and key size 1024.

Label : zosca
Key Size : 1024
Version : X509 V3
Serial : 00
Issuer : "CN=z/OSCertification Authority,OU=TEST,O=ADCD"
Subject : "CN=z/OSCertification Authority,OU=TEST,O=ADCD"
Not Before : 7 July 2020 00:00:00 GMT+01:00
Not After : 7 July 2021 23:59:59 GMT+01:00
Public Key
    ...
Public Key Type : RSA (1.2.840.113549.1.1.1)
  

or for an Elliptic certificate with key size 256.

Label : ca256
Key Size : 256
Version : X509 V3
Serial : ...
Issuer : CN=SSCA256,OU=CA,O=SSS,C=GB
Subject : CN=SSCA256,OU=CA,O=SSS,C=GB
Not Before : 7 February 2021 11:24:56 GMT
Not After : 7 February 2024 11:24:56 GMT
Public Key
    ... 
Public Key Type :  EC_ecPublicKey 

You can only check the certificates that are in your key store, not certificates that are sent as part of the handshake. 

z/OS

The listring command displays the contents of the ring (owner and label).

RACDCERT LISTRING(MQRING) ID(START1)

The list command displays the details of a certificate.

RACDCERT certauth LIST(label(‘ADCD-CA’))

displays information like, for the RSA certificate with a small key size,

 Label: ADCD-CA                                                       
 Certificate ID: 2QiJmZmDhZmjgcHEw8Rgw8FA                             
 Status: TRUST                                                        
 Start Date: 2020/07/06 23:00:00                                      
 End Date:   2021/07/07 22:59:59                                      
 Serial Number: ...                                                
 Issuer's Name:                                                      
      >CN=z/OSCertification Authority.OU=TEST.O=ADCD<                 
 Subject's Name:                                                      
      >CN=z/OSCertification Authority.OU=TEST.O=ADCD<                 
 Signing Algorithm: sha1RSA                                           
 Key Usage: CERTSIGN                                                  
 Key Type: RSA                                                        
 Key Size: 1024                                                       

Certificates signed by this CA would not work if MinimumRSAKeySize=2048 was specified.

Are there any requirements on the certificates used by MQ?

Having set up web servers using digital certificates I know some components are sensitive to the content of a certificate.  For example

  1. Some web browsers want the IP address in the  Alternate Name to match the IP address of the server.
  2. Some clients check that the server’s certificate has been signed with “server auth” – saying it can be used as a server’s certificate.

The short answer is no, there are no requirements. The longer answer is maybe.

For the chinit

I used the following to define certificate for z/OS.

RACDCERT ID(START1) GENCERT - 
  SUBJECTSDN(CN('CSQ9CERT')  - 
             O('ADCD') - 
             OU('TEST')) - 
  SIZE(2048) - 
  RSA     - 
  SIGNWITH (CERTAUTH LABEL('ADCD-CA')) - 
NOTAFTER( DATE(2021-12-29))- WITHLABEL('CSQ9CERT') RACDCERT id(START1) ALTER(LABEL('CSQ9CERT'))TRUST RACDCERT ID(START1) CONNECT(RING(MQRING) - LABEL('CSQ9CERT') USAGE(PERSONAL))
RACDCERT LISTRING(MQRING) ID(START1) RACDCERT LIST(LABEL('CSQ9CERT' )) ID(START1) SETROPTS RACLIST(DIGTCERT,DIGTRING ) refresh

Things to be careful about

  1. Check the NOTAFTER date
  2. It needs to be TRUST
  3. The CONNECT of the certificate to the keyring needs USAGE(PERSONAL).  It it does not have it you get CSQX645E +cpf CSQXRESP Certificate … missing for channel …
  4. If the certificate is RSA, or the CA certificate is RSA, then check the keysize of the certificate and the CA.  On MQ 9.2 the size needs to be larger than the MinimumRSAKeySize any QM.INI or mqclient.ini file SSL stanza, in any queue manager or client.

For the MQ Web server

Any keyusage from no keysusage to KEYUSAGE( DATAENCRYPT, DOCSIGN, HANDSHAKE) works. Just do not use KEYUSAGE(CERTSIGN) .

With CERTSIGN Chrome gives NET::ERR_CERT_INVALID, and the TLS trace gives BAD_CERTIFICATE or UNKNOWN_CERTIFICATE. See here.

Colins updates to MQ messages

As I was trying to get TLS to work on midrange, I had many MQ error messages. Sometimes the messages were a bit vague “you’ve had a problem. Resolve it and restart the channel”.

Below is the list of messages I’ve added comments to. I’ve done it as a blog post as well-known search engines are not finding the pages.

Mid range

z/OS

Client

What are these message classes in MQ 9.2 on z/OS?

I noticed that the MQ 9.2 documentation for z/OS refers to different message classes.  This is nothing very exciting. From the JCL it says

This sample shows 8 other page sets, three for each major class of message, and one for each other class:

  • 00 – object definitions
  • 01 – system-related messages
  • 02-04 – important long-lived messages
  • 05-07 – short-lived messages
  • 08 – miscellaneous messages

Basically it means provide multiple page sets for the major classes of messages. This gives more capacity and can spread the I/O work load.

Debugging MQ client connection problems

I had lots of problems trying to get a client using TLS to connect to the queue manager, from set up errors, to unclear documentation.   It took me days to get my first channel set up.
IBM does not provide a program on midrange to format the GSKIT trace.  Someone suggested I use Wireshark  network packet analyser to monitor the traffic on the network.  I’ve given some examples of the handshake for TLS 12, TLS 13.

Below are some things you can do to check your set up is as you expect.

Review the error log

This is in /var/mqm/errors/AMQERR01.LOG
It does not provide all the information needed to identify the parameters used in the handshake, so you need to use the trace.

Turn on the MQ client trace

strmqtrc -e
start your program
endmqtrc -e

Format the trace

The client trace files are in /var/mqm/trace

  • cd /var/mqm/trace
  • dspmqtrc *.TRC

This will create several *.FMT files

Find the trace file for the connection

grep -F PeerName *.FMT

will list the file with the trace entries in it.  My file was  AMQ7232.0.FMT .

Check the channel name.

From the output of the grep -F PeerName *.FMT above, check the Channel Name: is what you expect? Check the remote IP address and port.  If you are using a CCDT it will take the first channel name which matches the queue manager.  This may not be the channel name you were expecting.    I had channels QMACLIENT and QMACLIENTTLS in the CCDT.   QMACLIENT was chosen instead of QMACLIENTTLS.

Check the mqclient.ini being used

grep -F mqclient.ini AMQ7232.0.FMT |grep -F FileName

Check the CCDT, (Channel Definition File) being used

grep -F ChannelDefinition AMQ7232.0.FMT

Adding stanza (ChannelDefinitionDirectory) length (19)
Adding stanza (ChannelDefinitionFile) length (10)
ChannelDefinitionDirectory = '/home/colinpaice/mq'
ChannelDefinitionFile = 'COLIN2.TAB'
Adding stanza (ChannelDefinitionFile) length (10)
Using ChannelDefinitionDirectory / MQCHLLIB value of /home/colinpaice/mq
Using ChannelDefinitionFile / MQCHLTAB value of RE.TAB
Adding stanza (ChannelDefinitionDirectory) length (19)
Adding stanza (ChannelDefinitionFile) length (10)
Using ChannelDefinitionDirectory / MQCHLLIB value of /home/colinpaice/mq
Using ChannelDefinitionFile / MQCHLTAB value of RE.TAB

The last entries show what was used.

Check the keystore being used

grep -Fi SSLKeyR *.FMT

Display the certlabl.

grep -Fi CertificateLabel *.FMT

It will display many records. They key ones are

  • MQCD CertificateLabel ‘rsaca256_client’
  • Saved CertificateLabel ‘rsaca256_client’

If these are missing certlabl has not been specified.

Display the channel definiton

Edit or browse the file and locate “CD “.

This will locate the MQCD (see the MQ documentation).

Interesting offsets in the CD are

  • 0x0000 Client channel name
  • 0x0060 Queue manager name
  • 0x00c0 IP address
  • 0x0698 Cipher Spec
  • 0x0780 TLSCertificate alias name in the client’s keystore.

Display Client Hello

You may have information about the TLS handshake.  This is called Client_hello, and Server_Hello.

Locate <client_hello> if found, it will have data like (some data removed).

client_version 
TLSV12
random ... 
session_id ... 
cipher_suites 
  tls_ri_scsv,tls_rsa_with_aes_128_cbc_sha256
compression_methods ...                              .
Extensions...
signature_algorithms 13
   rsa:sha512,rsa:sha384,rsa:sha256,rsa:sha224,...
server_name 
  qmaqclienttls.chl.mq.ibm.com
End of GSKit TLS Handshake Transcript

You need to understand the TLS handshake to fully understand this.

  • TLSV12 minimum level of TLS supported
  • cipher_suites these are what the client like
  • signature_algorithms  these are what the client will accept
  • server_name is “sni” information identifying the channel

What is my client connected to?

On a queue manager you an issue the DIS CHS(..)  all and get

SECPROT(TLSV12)
SSLCERTI(CN=SSCA256,OU=CA,O=SSS,C=GB)
SSLCIPH(TLS_RSA_WITH_AES_128_CBC_SHA256)
SSLPEER(SERIALNUMBER=01:79,CN=rsaca256,O=cpwebuser,C=GB,…)

I could not find a command to display the same information for the clients perspective.

You can look in the trace file for SSLCERTI for example

SSLCERTI(‘CN=SS,O=SSS,C=GB’)
SSLPEER(‘SERIALNUMBER=73:CB:2B:…,CN=SS,O=SSS,C=GB’)

This gives the server’s DN and Certificate issuer, and peer=subject.  As the peer has the same DN as the SSLCERTI this shows it is a self signed certificate.

How can I replicate the RACF definitions for MQ on z/OS?

If you are the very careful person who makes all updates to RACF only through batch jobs, then this is easy – take the old jobs, and change the queue manager name and rerun them.

For the other 99.99% of us,  read on…

Even if you have been careful to keep track of any changes to security definitions,  someone else may have made a change either using the native TSO commands, or via the ISPF panels. 
You can list the RACF database, but there is no easy way of listing the RACF database in command format, to allow you to do a global rename, and submit the commands.

I have found two ways of extracting the RACF definitons.

  1. Using an unloaded copy of the RACF database
  2. Using RACF commands to extract and recreate the requests

Using an unloaded copy of the RACF database

I discovered dbsync on a RACF tools repository which does most of the hard work.   You can run a RACF utility to unload the RACF database into a flat file (omitting sensitive information like passwords etc).  Dbsync is a rexx program which takes two copies of an unloaded database, and generates the RACF commands for the differences. I simply used my existing unloaded file and a null file, and got out the commands to create all of the entries.

The steps are

  1. Unload the RACF database
  2. Get dbsync into your z/OS system
  3. Run DBsync
  4. Edit the files, and remove all lines which are not relevant
  5. Run the output to create/modify the definitions

Unload the database

//IBMUSUN JOB 1,MSGCLASS=H 
//* use the TSO RVARY command to display databases
//UNLOAD EXEC PGM=IRRDBU00,PARM=NOLOCKINPUT
//SYSPRINT DD SYSOUT=*
//INDD1 DD DISP=SHR,DSN=SYS1.RACFDS
//OUTDD DD DISP=(MOD,CATLG),DSN=COLIN.RACF.UNLOAD,
// SPACE=(CYL,(1,1)),DCB=(LRECL=4096,RECFM=VB,BLKSIZE=13030)

Of course this assumes you have the authority to create this file.  If not ask a friendly sysprog to run the command, edit the to output delete all records which do not have MQ in them.

Run dbsync

I had to make the following changes

  1. Dataset 1 was the dataset I created above
  2. Dataset 2 was a dummy

Modify the sort step to output to a temporary output file

//COLINRA JOB 1,MSGCLASS=H 
//* ftp://public.dhe.ibm.com/eserver/zseries/zos/racf/dbsync/
//SORT1 EXEC PGM=SORT
//SYSOUT DD SYSOUT=*
//SORTIN DD DISP=SHR,DSN=COLIN.RACF.UNLOAD
//SORTOUT DD DISP=(NEW,PASS),DSN=&TEMP1,SPACE=(CYL,(1,1))
//SYSIN DD *
SORT FIELDS=(5,2,CH,A,7,1,AQ,A,8,549,CH,A)
ALTSEQ CODE=(F080,F181,F282,F383,F484,F585,F686,F787,F888,F989,
C191,C292,C393,C494,C595,C696,C797,C898,C999,
D1A1,D2A2,D3A3,D4A4,D5A5,D6A6,D7A7,D8A8,D9A9,
E2B2,E3B3,E4B4,E5B5,E6B6,E7B7,E8B8,E9B9)
OPTION VLSHRT,DYNALLOC=(SYSDA,3)
/*

Delete the sort of the other data set – as I was using a dummy file

Run dbsync

I changed the bold lines below, the template JCL had

//OUTSCD1 DD DSN=your.dsname.for.outscd1,
// DISP=(NEW,CATLG),

so I changed

  • your.dsname.for to COLIN.RACF
  • NEW,CATLG to MOD,CATLG
  • Upper cased the changed lines using the ucc…ucc ISPF edit line command.
//DBSYNC EXEC PGM=IKJEFT01,REGION=5000K,DYNAMNBR=50,PARM='%DBSYNC' 
//SYSPRINT DD SYSOUT=*
//SYSTSPRT DD SYSOUT=*
//SYSTSIN DD DUMMY
//SYSEXEC DD DISP=SHR,DSN=COLIN.DBSYNC.REXX
//OPTIONS DD *
/* your options here
//INDD1 DD DISP=SHR,DSN=*.SORT1.SORTOUT
//INDD2 DD DUMMY
//OUTADD1 DD DSN=COLIN.RACF.ADDFILE1,
// DISP=(MOD,CATLG),
// UNIT=SYSDA,SPACE=(CYL,(25,25),RLSE),
// DCB=(RECFM=VB,LRECL=255,BLKSIZE=6400)
etc

The output was rexx commands in a file, such as

“rdefine MQCMDS CSQ9.** owner(IBMUSER ) uacc(CONTROL )
    audit(failures(READ )) level(00)”
“permit CSQ9.** class(MQCMDS) reset”
“rdefine MQQUEUE CSQ9.** owner(IBMUSER ) uacc(NONE )
     audit(failures(READ )) level(00) warning notify(IBMUSER )”
“permit CSQ9.** class(MQQUEUE) reset”
“rdefine MQCONN CSQ9.BATCH owner(IBMUSER ) uacc(CONTROL )
    audit(failures(READ )) level(00)”
“permit CSQ9.BATCH class(MQCONN) reset”
“rdefine MQCONN CSQ9.CHIN owner(IBMUSER ) uacc(READ )
    audit(failures(READ )) level(00)”
“permit CSQ9.BATCH class(MQCONN) id(IBMUSER ) access(ALTER )”
“permit CSQ9.BATCH class(MQCONN) id(START1 ) access(UPDATE )”
“permit CSQ9.CHIN class(MQCONN) id(IBMUSER ) access(ALTER )”

You edit and run the the Rexx exec to issue the commands.

Easy – it took me less than half an hour from start to finish.

Using RACF commands to extract and recreate the requests

I found that most people do not have access to an unloaded RACF database.  My normal userid does not have the authority to create the unloaded copy. 

I put an exec up on Github.   It issues a display command for each class in MQCMDS MXCMDS MQQUEUE MXQUEUE MXTOPIC MQADMIN MXADMIN MQCONN and formats it as a RDEFINE command, and then issues the permit command to give people access to it.  It writes the output in to the file being edited.

Use ISPF to edit a member where you want the output.

Make sure the rexx exec is in the SYSPROC or SYSEXEC concatenation, for example use ISRDDN to check.

Syntax

genclass <queuemanagername>

The output is like

 /* class:MXCMDS profile:MQPA class not found 
/* class:MXQUEUE profile:MQPA profile not found
/* class:MXTOPIC profile:MQPA profile not found
/* class:MXADMIN profile:MQPA profile not found
RDEFINE MQCONN -
MQPA.CICS -
- /* Create date 07/17/20
OWNER(ADCDA) -
- /* Last reference Date 07/17/20
- /* Last changed date 07/17/20
- /* Alter count 0
- /* Control count 0
- /* Update count 0
- /* Read count 0
UACC(NONE) -
LEVEL(0) -
- /* Global audit NONE
/* Permit MQPA.CICS CLASS(MQCONN ) RESET
Permit MQPA.CICS CLASS(MQCONN ) ID(ADCDA ) ACCESS(ALTER )
Permit MQPA.CICS CLASS(MQCONN ) ID(START1 ) ACCESS(READ )
/* class:MQCONN profile:MQPA.CICS profile not found

It includes a Permit… RESET if you want to remove all access

What should I monitor for MQ on z/OS – buffer pool

For the monitoring of MQ on z/OS, there are a couple of key metrics you need to keep an eye on for the buffer pools, as you sit watching the monitoring screen.

A quick overview of MQ buffer pools

An inefficient queue manager could have all messages stored on disk, in page sets.  An efficient queue manager would cache hot pages in memory to they can be accessed without doing any I/O.

The MQ Buffer Manager component does this caching, by using buffers  (so no surprise there).

A simplified view of the operation is as follows

  1. Messages are broken down into 4KB pages.
  2. Getting the contents of a pageset page, if the page is not in the buffers, the buffer manager reads it from disk into a buffer.
  3. If a page is requested and the contents are not required (for example it is about to be overwritten as part of a put) it does not need to read it from disk.
  4. If the page is updated, for example a new message, or a field in the message is updated during get processing,  the page is not usually written to disk immediately.   The write to disk is deferred (using a process called Deferred Write Process – another non surprise in the naming convention).  This has the advantage that there can be many updates to a page before it is written to disk.
  5. Any buffer page which has not been changed, and is not currently being used, is a free page.

If the system crashes, non persistent messages are thrown away, and persistent messages can be rebuilt from the log.

To reduce the restart time, pages containing persistent data are written out to the page set periodically.   This is driven by the log data set filling up which causes a checkpoint.  Updates which have been around for more than 2 checkpoints are written to disk.  During restart the page set knows how far back in the log restart needs to go.

In both cases, checkpoint and buffer pool filling up (when there are less than 15 % of free =85% in use) , once a page has been successfully written to the page set, the buffer is marked as free.

Pages for non-persistent messages can be written out to disk.

If the buffer pool is critically short of free buffers, and there are less than 5% free buffers, then pages are written to the page set immediately rather than use the deferred write process.  This allows some application work to be done while the buffer manger is trying to make more free pages available.

What is typical behaviour?

The buffer pool is working well.

When a message is got, the buffer is already in the buffer pool so there is no I/O to read from the page set.

The buffer pool is less than 85% full (has more than 15% free pages), there is periodic I/O to the page set because pages with persistent data are written to the page set at checkpoint.

The buffer pool fills up and has more 85% in-use pages.

This can occur if the number and size of the messages being processed is bigger than the size of the buffer pool. This could be a lot of messages in a unit of work,  big messages using lots of pages, or lots of transactions putting and getting messages.  It can also occur when there are lots of puts, but no gets.

If the buffer pool has between 85%  and 95% of in-use pages( between 15%  and 5% free pages),  the buffer manager is working hard to keep free pages available.

There will be I/O intermittently at checkpoints, and a steady I/O as the buffer manager writes the pages to the page set.

If messages are being read from disk, there will be read activity from the pageset, but the buffer pool page can be reused as soon as the data has been copied from the buffer pool page.

The buffer pool has less than 5% free pages.

The buffer manager is in overdrive.  It is working very hard to keep free pages in the buffer pool.   There will be pages written out to the page set as it tries to increase the number of free pages.  Gets may require reads from page set I/O.  All of this I/O can cause I/O contention and all the page set I/Os slow down, and so MQ API request using this buffer pool slow down.

What should be monitored

Most cars these days have a “low fuel light” and a “take me to a garage” light.  For monitoring we can provide similar.

Monitor the % buffer pool full.

  • If it is below 85% things are OK
  • If it is between 85% and 95% then this needs to be monitored, it may be “business as usual”.
  • If it is >=95% this needs to be alerted.  It can be caused by applications or channels not processing messages

Monitoring the number of pages written does not give much information.

It could be caused by a checkpoint activity, or because the buffer pool is filling up.

Monitoring the number of pages read from the page set can provide some insight.

If you compare today with this time last week you can check the profile is similar.

If the buffer pool is below 85% used,

  • Messages being got are typically in the buffer pool so there is no read I/O.
  • If there is read I/O this could be for messages which were not in the buffer pool – for example reading the queue after a queue manager restart.

If the buffer pool than 85% in-use  and less than 95% in-use  this can be caused by a large message work load coming in, and MQ is being “elastic” and queuing the messages.  Even short lived messages may be read from disk.  The number of read I/Os give an indication of throughput.  Compare this with the previous week to see if the profile is similar.

If the buffer pool is more than 95% in-use this will have an impact on performance, as every page processed is likely to have I/O to the page set, and elongated I/O response time due to the contention.

What to do

You may want “operators notes” either on paper or online which describe the expected status of the buffer pools on individual queue managers.

  1. PROD QMPR
    1. BP 1 green less than 85% busy
    2. BP 2 green less than 85% busy
    3. BP3 green except for Friday night when it goes amber  read I/O rate 6000 I/O per minute.
  2. TEST QMTE
    1. BP 1 green less than 85% busy
    2. BP 2 green less than 85% busy
    3. BP 3 usually amber – used for bulk data transfer

What to do the buffer statistics mean?

There are statistic on the buffer pool usage.

  1. Buffer pool number.
  2. Size of buffer pool – when the data was produced.
  3. Low – the lowest number of free pages in the interval.  100* (Size – log)/Size  gives you % full.
  4. Now – the current number of free pages in the interval.
  5. Getp – the number of requests ‘get page with contents’.  If the page is not in the buffer pool then it is read from the page set.
  6. Getn.   A new page is needed.   The contents are not relevant as it is about to be overwritten.  If the page is not in the buffer pool, just allocate a buffer, and do not read from the page set.
  7. STW – set write intent.  This means the page was got for update. I have not seen a use for this. For example
    1. A put wants to put part of a message on the page
    2. A get is being done and it wants to set the “message has been got” flag.
    3. The message has been got, and so pointers to a page need to be updated.
  8. RIO -the number of read requests to the page set.  If this is greater than zero
    1. The request is for messages which have not been used since the queue manager started
    2. The buffer pool had reached 85%, pages had been moved out to the page set, and the buffer has been reused.
  9. WIO the number of write I/Os that were done.  This write could be due to a checkpoint, or because the buffer pool filled up.
  10. TPW total pages written, a measure of how busy the buffer pool was.   This write could be due to a checkpoint, or because the buffer pool filled up.
  11. IMW – immediate write.  I have not used this value, sometimes I observe it is high, but it is not useful.  This can be caused by
    1. the buffer pool being over 95% full, so all application write I/O is synchronous,
    2. or a page was being updated during the last checkpoint, and it needs to be written to the page set when the update has finished.  This should be a small number.  Frequent checkpoints (eg every minute) can increase this value.
  12. DWT – the number of times the Deferred Write processor was started.  This number has little value.
    1. The DWP could have started and been so busy that it never ended, so this counter is 1.
    2. The DWP could have started, written a few pages and stopped – and done this repeatedly, in this case the value could be high.
  13. DMC.   The number of times the buffer pool crossed the 95% limit.  If this is > 0 it tells you the buffer pool crossed the limit
    1. This could have crossed just once, and stayed over 95%
    2. This could have gone above 95%, then below 95% etc.
  14. SOS – the buffer pool was totally short on storage – there were no free pages.  This is a critical indicator.  You may need to do capacity planning, and make the buffer pool bigger, or see if there was a problem where messages were not being got.
  15. STL – the number of times a “free buffer” was reused.  A buffer was associated with a page of a page set.   The buffer has been reused and is now for a different page.  If STL is zero, it means all pages that were used were in the buffer pool
  16. STLA – A measure of contention when pages are being reused.  This is typically zero.

Summary

Now you know as much as I do about buffer pools you’ll see that the %full (or %free) is the key measure.  If the buffer pool is > 85% used pages, then the I/O rate is a measure as to how hard the buffer manager is running.

What should I monitor for MQ on z/OS – logging statistics

For the monitoring of MQ on z/OS, there are a couple of key metrics you need to keep an eye on for the logging component, as you sit watching the monitoring screen.

I’ll explain how MQ logging works, and then give some examples of what I think would be key metrics.

Quick overview of MQ logging

  1. MQ logging has a big(sequential) buffer for logging data, it wraps.
  2. Application does an MQPUT of a persistent message.
  3. The queue manager updates lots of values (eg queue depth, last put time) as well as move data into the queue manager address space.  This data is written to log buffers. A 4KB page can hold data from many applications.
  4. An application does an MQCOMMIT.  MQ takes the log buffers up and including the current buffer and writes it to the current active log data set.  Meanwhile other applications can write to other log buffers.
  5. The I/O finishes and the log buffers just written can be reused.
  6. MQ can write up to 128 pages in an I/O. If there are more than 128 buffers to write there will be more than 1 I/O.
  7. If application 1 commits, the IO starts,  and then application 2 commits. The I/O for the commit in application 2 has to wait for the first set of disk writes to finish, before the next write can occur.
  8. Eventually the active log data set fills up.  MQ copies this active log to an archive data set.  This archive can be on disk or tape.   This archive data set may never be used again in normal operation.  It may be needed for recovery of transactions or after a failure.   The Active log which has just been copied can now be reused.

What is interesting?

Displaying how much data is logged per second.

Today       XXXXXXXXXXXXXXXXXXXX
Last week XXXXXXXXXXXXXXXXXXXXXXXXXXXXX
Yesterday XXXXXXXXXX          
      0                     100MB/Sec    200 MB/Sec

This shows that the logging rate today is lower than last week.   This could be caused by

  1. Today is just quieter than last week
  2. There is a problem and there are fewer requests coming into MQ.   This could be caused by problems in another component, or a problem in MQ.    When using persistent messages the longest part of a transaction is the commit and waiting for the log disk I/O.  If this I/O is slower it can affect the overall system throughput.
  3. You can get the MQ log IO response times from the MQ log data.

Displaying MQ log I/O response time

You can break down where time is spent in doing I/O into the following area

  1. Scheduling the I/O – getting the request into the I/O processor on the CPU
  2. Sending the request down to the Disk controller(eg 3990)
  3. Transferring data
  4. The I/O completes, and send an interrupt to z/OS, z/OS has to catch this interrupt and wake up the requester.

 Plotting the I/O time does not give an entirely accurate picture, as the time to transfer the data depends on the amount of data to transfer.  On a well run system there should be enough capacity so the other times are constant.    (I was involved in a critical customer situation where the MQ logging performance “died” every Sunday morning.   They did backups, which overloaded the I/O system).

In the MQ log statistics you can calculate the average I/O time.  There are two sets of data for each log

  1. The number of requests, and sum of the times of the requests to write 1 page.  This should be pretty constant, as the data is for when only one 4KB was transferred
  2. The number of requests, and sum of the times of the requests to more than 1 page.  The average I/O time will depend on the amount of data transferred.
  • When the system is lightly loaded, there will be many requests to write just one page. 
  • When big messages are being processed (over 4 KB) you will see multiple pages per I/O.
  • If an application processes many messages before it commits you will get many pages per I/O.   This is typical of a channel with a high achieved batch size.
  • When the system is busy you may find that most of the I/O write more than one page, because many requests to write a small amount of data fills up more than one page.

I think displaying the average I/O times would be useful.   I haven’t done tried this in a customer environment (as I dont have customer environment to use).    So if the data looks like

Today         XXXXXXXXXXXXXXXXXXXXXXXX
Last week     XXXXXXXXXXXXXXXXXXXXXXXXXXXXX  
One hour ago XXXXXXXXXXXXXXXXXXX
time in ms 0 1 2 3

it gives you a picture of the I/O response time.

  • The dark green is for I/O with just one page, the size of the bar should be constant.
  • The light green is for I/O with more than one page, the size of the bar will change slightly with load.  If it changes significantly then this indicates a problem somewhere.

Of course you could just display the total I/O response time = (total duration of I/Os) / (total number of I/Os), but you lose the extra information about the writing of 1 page.

Reading from logs

If an application using persistent messages decides to roll back:

  • MQ reads the log buffers for the transaction’s data and undoes any changes.
  • It may be the data is old and not in the log buffers, so the data is read from the active log data sets.
  • It may be that the request is really old (for example half an hour or more), MQ reads from the archive logs (perhaps on tape).

Looking at the application doing a roll back, and having to read from the log.

  • Reading from buffers is OK.   A large number indicates application problem or a DB/2 deadlock type problem.  You should investigate why there is so much rollback activity
  • Reading from Active logs … . this should be infrequent.  It usually indicates an application coding issue where the transaction took too long before commit.  Perhaps due to a database deadlock, or bad application design (where there is a major delay before the commit)
  • Reading from Archive logs… really bad news…..  This should never happen.

Displaying reads from LOGS

Today         XXXXXXXXXXXXXXXXXXXXXXXX
Last week     X
One hour ago  XXXXX
rate          0        10    20     40

Where green is “read from buffer”, orange is “read from active log”, red is “read from Archive log. Today looks a bad day”.