Connect to Liberty, the clever way, to give different qualities of service.

While I was investigating two TCP/IP stacks, I discovered you can set up the Liberty Web Server to support different classes of service depending on the TCP host name and port number.

You can configure an <httpEndpoint…> with a host and port number, and have it point to other set-up parameters. This lets you configure:

  • the host name
  • the httpsPort number
  • the maximum number of active connections for this definition
  • which keyring to use as the trust store
  • which keyring to use as the key store
  • which certificate the server should use in the key store
  • which TLS protocols for example TLS 1.2 or 1.3
  • what logging you want done: date, time, userid, URL, response time
  • which file you want the access logging information to be written to
  • which sites can/cannot use this, the addressExcludeList and addressIncludeList.

How do you set up another http address and port? It is really easy – just define another set of definitions!

Why would you want to do this?

You may want to restrict people’s access to the server. For example external people are told to access the server using a specified port, and you can specify which cipher specification should be used, and what trust store is used to validate a client authentication request.

You may want to restrict the number of connections into a port, and have a separate port for administrators so they can always logon.

How do I do this?

You need to define another httpEndpoint. This in turn points to the SSL options (sslOptionsRef) and the access logging definition (accessLoggingRef).

I set up a file called colin.xml and included it in the server.xml file.

<server> 
 <httpEndpoint id="colinstHttpEndpoint" 
   host="10.1.1.2" 
   accessLoggingRef="colinaccessLogging" 
   sslOptionsRef="colinSSLRefOptions"
   httpsPort="29443"> 

   <tcpOptions   
     addressIncludeList="10.1.*.*" 
     maxOpenConnections="3" /> 
 </httpEndpoint> 
 
 <sslOptions 
   id="colinSSLRefOptions" 
   sslRef="colinSSLOptions" 
 /> 

 <httpAccessLogging id="colinaccessLogging" enabled="true"/> 

 <ssl clientAuthentication="true" 
   clientAuthenticationSupported="true" 
   id="colinSSLOptions" 
   keyStoreRef="racfKeyStore" 
   trustStoreRef="racfTrustStore"                                                                             
   serverKeyAlias="ZZZZ" 
   sslProtocol="TLSv1.2" /> 
                                                                                
 <keyStore filebased="false" id="racfKeyStore" 
   location="safkeyring://START1/KEY" 
   password="password" readOnly="true" type="JCERACFKS"/> 
                                                                                                   
 <keyStore filebased="false" id="racfTrustStore" 
   location="safkeyring://START1/TRUST" 
   password="password" readOnly="true" type="JCERACFKS"/> 

</server> 

Certificate logon to MQWEB on z/OS, the hard way.

I described here different ways of logging on to the MQ Web Server on z/OS. This post describes how to use a digital certificate to logon. There is a lot of description, but the RACF statements needed are listed at the bottom.

I had set up my keystore and could logon to MQWEB on z/OS using certificates. I just wanted to not be prompted for a password.

Once it is set up it works well. I thought I would deliberately try to get as many things wrong as possible, so I could document the symptoms and the cure. Despite this, I often had my head in my hands, asking “Why! – it worked yesterday”.

Can I use CHLAUTH? No, because that is for the CHINIT, and you do not need to have the CHINIT running to run the web server.

Within one MQ Web Server, you can use both “certificate only” logon as well as using “certificate, userid and password” logon.

When using the SAF interface you specify parameters in the mqwebuser.xml file, such as keyrings, and what level of certificate checking you want.

Enable SAF messages.

If you use <safCredentials suppressAuthFailureMessage="false" …> in the mqwebuser.xml then if a SAF request fails, there will be a message on the z/OS console. You would normally have this value set to "true" because when the browser (or REST client) reauthenticates (it could be every 10 seconds) you will get a message saying a userid does not have access to an APPL or EJBROLE profile. If you change this (or make any change to the mqwebuser.xml file), issue the command

f CSQ9WEB,refresh,config

To pick up the changes.

Configure the server name

In the mqwebuser.xml file is <safCredentials profilePrefix="MQWEB" …> where MQWEB identifies the server, and is used in the security profiles (see below).

SSL parameters

In the mqwebuser.xml file you specify

  • <ssl …
  • clientAuthenticationSupported="true"|"false". The doc says the server requests that a client sends a certificate. The client’s certificate is optional.
  • clientAuthentication="true"|"false". If true, then the client must send a certificate.
  • sslProtocol="TLSv1.2"
  • keyStoreRef="…"
  • trustStoreRef="…"
  • id="…"
  • <sslDefault … sslRef="…". This points to a particular <ssl id=…> definition. It allows you to have more than one <ssl …> definition, and pick one.

I think it would have been clearer if the parameters were clientAuthentication=”yes”|”no”|”optional”. See my interpretation of what these mean here.

Client authentication

The client certificate maps to a userid on z/OS, and this userid is used for access control.

The TLS handshake: You have a certificate on your client machine. There is a handshake with the server, where the certificate from the server is sent to the client, and the client verifies it. With TLS client authentication the client sends a certificate to the server. The server validates it.

If any of the following are false, it drops through to Connecting with a client certificate, and authenticate with userid and password below.

Find the z/OS userid for the certificate

The certificate is looked up in a RACDCERT MAP to get a userid for the certificate (see below for example statements). It could be a one-to-one mapping, or the mapping can check just part of the DN, for example OU=TEST or C=GB. If this fails you get

ICH408I USER(START1 ) GROUP(SYS1 ) NAME(####################)
DIGITAL CERTIFICATE IS NOT DEFINED. CERTIFICATE SERIAL NUMBER(0194)
SUBJECT(CN=ADCDC.O=cpwebuser.C=GB) ISSUER(CN=SSCARSA1024.OU=CA.O=SSS.
C=GB).

Check the userid against the APPL class.

The userid is checked against the MQWEB profiles in the APPL class. (Where MQWEB is the name you configured in the web server configuration files). If this fails you get

ICH408I USER(ADCDE ) GROUP(TEST ) NAME(ADCDE ) MQWEB CL(APPL )
WARNING: INSUFFICIENT AUTHORITY ACCESS INTENT(READ ) ACCESS ALLOWED(NONE )

Pick the EJBROLE for the userid

There are several profiles in the EJBROLE class. If the userid has at least READ access to a profile, the userid gets the matching role. For example, if the userid has at least READ access to the profile MQWEB.com.ibm.mq.console.MQWebAdmin, it gets MQWebAdmin privileges.
If these fail you get messages in the MQWEB message log(s).

To suppress the RACF messages use the option suppressAuthFailureMessage="true" described above.

The userid needs access to at least one profile to be able to use the MQ Web server.

Use the right URL

The URL is like https://10.1.1.2:9443/ibmmq/console/

No password is needed to logon. If you get this far, displaying the userid information (click on the ⓘ icon) gives you Principal:ADCDE – Read-Only Administrator (Client Certificate Authentication) where ADCDE is the userid from the RACDCERT MAP mapping.

Connecting with a client certificate, and authenticate with userid and password.

The handshake is done as described above. If clientAuthentication="true" is specified, and the handshake fails, then the client gets “This site can’t be reached” or a similar message.

If the site can be reached, and a URL like https://10.1.1.2:9443/ibmmq/console/login.html is used, this displays a userid and password panel.

The password is verified, and if successful the specified userid is looked up in the APPL and EJBROLES profiles as described above.

If you get this far, and have logged on, displaying the userid information (click on the ⓘ icon) gives you Principal:colin – Read-Only Administrator (Client Certificate Authentication) where colin is the userid I entered.

The short solution to implement certificate authentication

If you already have TLS certificates for connecting to the MQ Web Server, you may be able to use a URL like https://10.1.1.2:9443/ibmmq/console/ to do the logon. If you use an invalid URL, it will substitute it with https://10.1.1.2:9443/ibmmq/console/login.html .

My set up.

I set up a certificate on Linux with a DN of C=GB,O=cpwebuser,CN=ADCDC and signed by C=GB,O=SSS,OU=CA,CN=SSCARSA1024. The Linux CA had been added to the trust store on z/OS.

Associate a certificate with a z/OS userid

I set up a RACF MAP of certificate to userid. It is sensible to run these using JCL, and to save the JCL for each definition.

 /*RACDCERT DELMAP( LABEL('ADCDZXX'  )) ID(ADCDE  ) 
 /*RACDCERT DELMAP( LABEL('CA'  )) ID(ADCDZ  )   
RACDCERT MAP ID(ADCDE  )  - 
    SDNFILTER('CN=ADCDC.O=cpwebuser.C=GB') - 
    WITHLABEL('ADCDZXX') 
                                                 
 RACDCERT MAP ID(ADCDZ  )  - 
    IDNFILTER('CN=SSCARSA1024.OU=CA.O=SSS.C=GB') - 
    WITHLABEL('CA       ') 
                                                 
 RACDCERT LISTMAP ID(ADCDE) 
 RACDCERT LISTMAP ID(ADCDZ) 
 SETROPTS RACLIST(DIGTNMAP, DIGTCRIT) REFRESH 

This mapped the certificate CN=ADCDC.O=cpwebuser.C=GB to userid ADCDE. Note the “.” between the parts, and that the order is reversed compared with the Linux DN: it now runs from least significant (CN) to most significant (C). Other certificates coming in with the Issuer CA of CN=SSCARSA1024.OU=CA.O=SSS.C=GB will get a userid of ADCDZ.
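The reordering described above can be sketched in a few lines of Python. This is only an illustration, not part of RACF or MQ: the function name to_racf_dn_filter is made up, and it assumes the attribute values contain no embedded commas or periods.

```python
def to_racf_dn_filter(dn: str) -> str:
    """Convert a DN such as 'C=GB,O=cpwebuser,CN=ADCDC' into the
    period-separated, reversed form used in SDNFILTER/IDNFILTER."""
    # Split on commas, trim blanks, and drop any empty parts.
    parts = [part.strip() for part in dn.split(",") if part.strip()]
    # Reverse the order (CN first, C last) and join with periods.
    return ".".join(reversed(parts))


print(to_racf_dn_filter("C=GB,O=cpwebuser,CN=ADCDC"))
print(to_racf_dn_filter("C=GB,O=SSS,OU=CA,CN=SSCARSA1024"))
```

The two calls produce the subject and issuer filter strings used in the RACDCERT MAP statements above.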

You do not need to refresh the web server, as this change becomes visible when the SETROPTS RACLIST REFRESH is issued.

First logon attempt

I stopped and restarted my Chrome browser, and used the URL https://10.1.1.2:9443/ibmmq/console. I was prompted for a list of valid certificates. I chose “Subject:ADCD: Issuer:SSCARSA1024 Serial:0194”.

Sometimes it gave me a blank screen, other times it gave me the logon screen with username and Password fields. It had a URL of https://10.1.1.2:9443/ibmmq/console/login.html.

On the z/OS console I got

ICH408I USER(ADCDE ) GROUP(TEST ) NAME(ADCDE ) MQWEB CL(APPL )
WARNING: INSUFFICIENT AUTHORITY ACCESS INTENT(READ ) ACCESS ALLOWED(NONE )

I could see that the userid (ADCDE) from the RACDCERT MAP was being used (as expected). To give the userid access to the MQWEB resource, I issued the commands

 /* RDEFINE APPL MQWEB UACC(NONE)
PERMIT MQWEB CLASS(APPL) ACCESS(READ) ID(ADCDE)
SETROPTS RACLIST(APPL) REFRESH

And tried again. The web screen remained blank (even with the correct URL). There were no messages on the MQWEB job log. Within the MQWEB stdout (and /u/mqweb/servers/mqweb/logs/messages.log) were messages like

[AUDIT ] CWWKS9104A: Authorization failed for user ADCDE while invoking com.ibm.mq.console on
/ui/userregistry/userinfo. The user is not granted access to any of the required roles: [MQWebAdmin, MQWebAdminRO, MQWebUser].

Give the userid access to the EJBroles

In my mqwebuser.xml I have <safCredentials profilePrefix=”MQWEB”. The MQWEB is the prefix of the EJBROLE resource name. I had set up a group MQPA Web Readonly Admin (MQPAWRA) to make the administration easier. Give the group permission, and connect the userid to the group.

 /* RDEFINE EJBROLE MQWEB.com.ibm.mq.console.MQWebAdminRO  UACC(NONE) 
PERMIT MQWEB.com.ibm.mq.console.MQWebAdminRO CLASS(EJBROLE) - 
  ACCESS(READ) ID(MQPAWRA) 
CONNECT ADCDE group(MQPAWRA)
SETROPTS RACLIST(EJBROLE) REFRESH

Once the security change has been made, it is visible immediately to the MQWEB server. I clicked the browser’s refresh button and successfully got the IBM MQ welcome page (without having to enter a userid or password). When I clicked on the ⓘ icon it said

Principal:ADCDE – Read-Only Administrator (Client Certificate Authentication)

Logoff doesn’t

If you click the logoff icon, you get logged off – but immediately get logged on again – that’s what certificate authorisation does for you. You need to go to a different web site. If you come back to the ibmmq/console web site, it will use the same certificate as you used before.

Ways of logging on to MQWEB on z/OS.

There are different ways of connecting to the MQ Web Server on z/OS (this is based on the z/OS Liberty Web server). Some ways use the SAF interface. This is an interface to the z/OS security manager. IBM provides RACF, there are other security managers such as TOP SECRET, and ACF2. Userid information is stored in the security manager database.

The ways of connecting to the MQ Web server on z/OS.

  • No security. Use no_security.xml to set up the MQ Web Server.
  • Hard coded userids and passwords in a file. Using the basic_registry.xml. This defines userid information like <user name=”mqadmin” password=”mqadmin”> . This is suitable only for a sandbox. The password can be obscured or left in plain text.
  • Logon by z/OS userid and password. Use zos_saf_registry.xml. Logon is by userid and password and checked by a SAF call to the z/OS security manager. The userid is checked for access to a resource like MQWEB.com.ibm.mq.console.MQWebAdmin in class(EJBROLE) and MQWEB in class(APPL).
  • Connect with a client certificate, and authenticate using userid and password. This uses zos_saf_registry.xml plus additional configuration. The userid, password and access to the EJBROLE and APPL resources is checked by the SAF interface. The certificate id is not used to check access, it is just used to do the TLS handshake.
  • Certificate authentication, a password is not required. Connect using a client certificate. This uses zos_saf_registry.xml plus additional configuration. Using the SAF interface, the certificate maps to a z/OS userid; this ID is used for checking access to the EJBROLE and APPL resources.

The configuration for using TLS is not clear.

I found the documentation for the TLS configuration to be unclear. Two parameters are <ssl clientAuthentication clientAuthenticationSupported…/> The documentation says

  • If you specify clientAuthentication="true", the server requests that a client sends a certificate. However, if the client does not have a certificate, or the certificate is not trusted by the server, the handshake does not succeed.
  • If you specify clientAuthenticationSupported="true", the server requests that a client sends a certificate. However, if the client does not have a certificate, or the certificate is not trusted by the server, the handshake might still succeed.
  • If you do not specify either clientAuthentication or clientAuthenticationSupported, or you specify clientAuthentication="false" or clientAuthenticationSupported="false", the server does not request that a client send a certificate during the handshake.

I experimented with the different options and the results are below.

  1. I used a web browser with several possible certificates that could be used for authentication. I was given a pop up which listed them. Chrome remembers the choice. With Firefox, you can click an option “set as default“. If this is unticked you get prompted every time.
  2. I used a browser with no certificates for authentication.

When a session was not allowed, I got (from Firefox) Secure Connection Failed. An error occurred during a connection to 10.1.1.2:9443. PR_END_OF_FILE_ERROR

| Client Authentication | Client Authentication Supported | Browser with certificates | Browser without certificates |
|---|---|---|---|
| true | ignored | Pick certificate, userid and password NOT required | PR_END_OF_FILE_ERROR |
| false | true | Pick certificate, userid and password NOT required | A variety of results. One of: 1. PR_END_OF_FILE_ERROR, 2. Blank screen, 3. Userid and password required |
| false | false | Userid and password required | Userid and password required |

When using certificates, you can choose to specify userid and password instead of client authentication, by using the URL https://10.1.1.2:9443/ibmmq/console/login.html instead of https://10.1.1.2:9443/ibmmq/console .

Note well.

The server caches credential information. If you change the configuration and refresh the server, the change may not be picked up immediately.

Once you have logged on successfully, a cookie is stored in your browser. This may be used to authenticate, until the token has expired. To be sure of clearing this token I restarted my browser.

Why do they ship java products on z/OS with the handbrake on? And how to take the brake off.

I noticed that it takes seconds to start MQ on my little z/OS machine, but minutes (feels like days) to start anything with Liberty Web server.  This includes MQWEB, z/OSMF, and z/OS Connect.  I mentioned this to an IBM colleague who asked if I was using Java Shared classes.  These get loaded into z/OS shared pages.

When I implemented it, my Liberty server came up in half the time!

I found this blog post which was very helpful, and showed me where to look for more information.  I subsequently found this document (from 2006!)

The kindergarten overview of how Java works.

  • You start with a program written in the Java language.
  • When you run this, Java converts it into byte codes
  • These byte codes get converted to native instructions  – so a byte code “push onto the stack” may become 8  390 assembler instructions.
  • This code can be optimised, for example code which is executed frequently can have the assembler instructions rewritten to go faster.  It might put code inline instead of out in a subroutine.
  • If you are using Java shared classes, this code can be written out and reused by other applications, or if you restart the server, it can reuse what it created before.  Reusing the shared classes means that programs benefit because the byte codes have already been converted into native code, and optimisations have been done on the hot code.

What happens on z/OS?

By default, z/OS writes the code to virtual memory and does not save anything to disk.  If you restart your Java application within the same IPL, it can exploit the shared classes which have been converted to native code and optimised. Great, good design.  I found the second time I started the web server it took half the time.  However I IPL once a day, and start my web server once a day. I do not benefit from having it start faster a second time, as I only start it once per session. By default when you re-IPL, the shared classes code is discarded, so the next time you need the code it has to be converted to native instructions again, and it loses any optimisation which had been done.

What is the solution?

It is two easy steps:

  1. Tell Java to write the information from memory to disk – to take a snapshot.
  2. After IPL tell Java to load memory from the disk image – to restore a snapshot.

It is as simple as that.

Background.

It is all to do with the java -Xshareclasses.

With your application you tell Java where to store information about the shared classes.  It defaults to Cache=/tmp/ name=javasharedresources.

In my jvm.options I overrode the defaults and specified

-Xshareclasses:nonFatal 
-Xshareclasses:groupAccess
-Xshareclasses:cacheDirPerm=0777
-Xshareclasses:cacheDir=/tmp,name=mqweb

If you give each application a name (such as mqweb)  you can isolate the cache to an application and not disrupt another JVM if you change the cache.  For example if you restore from a snapshot, only users of that “name” will be affected.

List what is in the cache

You can use the USS command,

java -Xshareclasses:cacheDir=/tmp/,listAllCaches

I used a batch job to do the same thing.

//IBMJAVA  JOB  1 
// SET V='listAllCaches' 
// SET C='/tmp/' 
//S1       EXEC PGM=BPXBATCH,REGION=0M, 
// PARM='SH java -Xshareclasses:cacheDir=&C,&V' 
//STDERR   DD   SYSOUT=* 
//STDOUT   DD   SYSOUT=*            

The output below, shows the cache name is mqweb.  Once you have created a snapshot it has an entry for it.

Listing all caches in cacheDir /tmp/                                                                          
                                                                                                              
Cache name       level         cache-type      feature         OS shmid       OS semid 
mqweb            Java8 64-bit  non-persistent  cr              8197           4101 

For MQWEB the default parameters are -Xshareclasses:cacheDir=/u/mqweb/servers/.classCache,name=liberty-%u where /u/mqweb is the WLP parameter, where my parameters are defined, and %u is the userid the server is running under, so in my case liberty-START1.

When I had /u/mqweb/servers/.classCache, the total command line was too long for BPXBATCH.  (Putting it into STDPARM gave me IEC020I 001-4 on the instream STDPARM because the resolved line was greater than 80 characters.)  I resolved this by adding -Xshareclasses:cacheDir=/u/mqweb,name=cache to the jvm.options file.

To take a snapshot


//IBMJAVA  JOB  1 
// SET C='/tmp/' 
// SET N='mqweb' 
// SET V='restoreFromSnapshot' 
// SET V='listAllCaches'
// SET V='snapshotCache'
//S1       EXEC PGM=BPXBATCH,REGION=0M,
// PARM='SH java -Xshareclasses:cacheDir=&C,name=&N,&V'
//STDERR   DD   SYSOUT=*
//STDOUT   DD   SYSOUT=*
//

This job took a few seconds to run.

I believe you have to take the snapshot while your java application is executing – but I do not know for definite.

Restore a snapshot

To restore a snapshot just use restoreFromSnapshot in the above JCL. This took a few seconds to run. 

How to use it.

If you run the restoreFromSnapshot JCL before the web server starts, the cache will be preloaded whenever you use your server.

If you take a snapshot every day before shutting down your server, you will get a copy with the latest optimisations.  If you do not take a new snapshot it continues to use the old one.

If you no longer want to use the snapshot you can get rid of it using the destroySnapshot option.

Is my cache big enough?

If you use the printStats request you get information like

Current statistics for cache "mqweb":                                                
...                                                                                     
cache size                           = 104857040                                     
softmx bytes                         = 104857040                                     
free bytes                           = 70294788 
...
Cache is 32% full                                     
                                                      
Cache is accessible to current user = true                                                 

The documentation says

When you specify -Xshareclasses without any parameters and without specifying either the -Xscmx or -XX:SharedCacheHardLimit options, a shared classes cache is created with a default size, as follows:

  • For 64-bit platforms, the default size is 300 MB, with a “soft” maximum limit for the initial size of the cache (-Xscmx) of 64MB, …

I had specified -Xscmx100m  which matches the value reported.
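As a quick check on the printStats figures above, here is a minimal Python sketch of the arithmetic. It assumes the percentage is the used bytes divided by the cache size, truncated to a whole number; that assumption is consistent with the output shown, but I have not confirmed it is how Java calculates it.

```python
def cache_percent_full(cache_size: int, free_bytes: int) -> int:
    """Return the used part of the cache as a whole-number percentage."""
    used = cache_size - free_bytes
    # Integer division truncates, which matches the reported value.
    return used * 100 // cache_size


# Values copied from the printStats output above.
print(cache_percent_full(104857040, 70294788))  # 32
```

If the figure creeps towards 100, a larger -Xscmx value is the first thing I would look at.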

What is in the cache?

You can use the printAllStats command.  This displays information like

Classpath

1: 0x00000200259F279C CLASSPATH
/usr/lpp/java/J8.0_64/lib/s390x/compressedrefs/jclSC180/vm.jar
/usr/lpp/java/J8.0_64/lib/se-service.jar
/usr/lpp/java/J8.0_64/lib/math.jar

Methods for a class
  • 0x00000200259F24A4 ROMCLASS: java/util/HashMap at 0x000002001FF7AEB8.
  • ROMMETHOD: size Signature: ()I Address: 0x000002001FF7BA88
  • ROMMETHOD: put Signature: (Ljava/lang/Object;Ljava/lang/Object;)Ljava/lang/Object; Address: 0x000002001FF7BC50

This shows

  • there is a class HashMap. 
  • It has a method size() with no parameters returning an Int.  It is at…. in memory
  • There is another method put(Object o1, Object o2)  returning an Object.  It is at … in memory
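The Signature values in the printAllStats output use the standard JVM type descriptor notation (I for int, L…; for a class, [ for an array). The following Python sketch decodes them into something more readable. The helper names describe and _read_type are made up for this illustration.

```python
PRIMITIVES = {
    "B": "byte", "C": "char", "D": "double", "F": "float",
    "I": "int", "J": "long", "S": "short", "Z": "boolean", "V": "void",
}


def _read_type(sig: str, pos: int) -> tuple:
    """Decode one type starting at sig[pos]; return (name, next position)."""
    dims = 0
    while sig[pos] == "[":        # each '[' adds one array dimension
        dims += 1
        pos += 1
    if sig[pos] == "L":           # object type: L<class name>;
        end = sig.index(";", pos)
        name = sig[pos + 1:end].replace("/", ".")
        pos = end + 1
    else:                         # single-letter primitive type
        name = PRIMITIVES[sig[pos]]
        pos += 1
    return name + "[]" * dims, pos


def describe(method: str, sig: str) -> str:
    """Turn a method name and JVM signature into a Java-like declaration."""
    if not sig.startswith("("):
        raise ValueError("a method signature must start with '('")
    pos, params = 1, []
    while sig[pos] != ")":
        name, pos = _read_type(sig, pos)
        params.append(name)
    return_type, _ = _read_type(sig, pos + 1)
    return f"{return_type} {method}({', '.join(params)})"


print(describe("size", "()I"))
print(describe("put", "(Ljava/lang/Object;Ljava/lang/Object;)Ljava/lang/Object;"))
```

For the two HashMap methods listed above this prints an int size() with no parameters, and a put taking two Objects and returning an Object.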

Other stuff

There are sections with JITHINTS and other performance related data.

Do not restart with the fire hose set to maximum.

There is an article in The Register, about an outage at the Tokyo stock exchange.  One of the problems was that they did not have a process for restarting the environment.  The impact of restarting a system is often overlooked, and in the panic of “get it started as quickly as possible” things can go wrong.   The fire brigade slowly increases the pressure in a fire hose to stop the fire crew from being knocked down with the sudden flow.

TCP/IP is good because it has a “slow start” protocol.  Once a connection has been established, and is working well, the exchange can use bigger buffers, and send more buffers before waiting for the acknowledgement.   This boosts the throughput.  If the back-end is slow to process the data, TCP  slows down the traffic, and then increases the throughput again if the connection can handle it.  If the connection stops and restarts, the rate starts slowly and builds up, rather than use the rate just before the outage.

You cannot expect WAS/CICS/DB2/MQ/IMS to restart at maximum speed; it has to work up to it.  Transactions may have to warm up. There can be many reasons:

  1. Data may need to be read from page sets into buffers, for example reading hot Db2 data into memory.
  2. Java code needs to warm up to become more efficient (JITed).
  3. The systems need to establish a working set, for example making a buffer pool larger.
  4. Establishing connections may have some serialisation delays.

Restarting faster than a system can cope can cause a domino effect.  A transaction server is restarted and the fire hose of data is turned on.   The transaction server is still warming up, and cannot cope with the volume of requests.    Work for this system is then routed to another transaction server which could handle the workload if the volume gradually increased.  If it gets this additional work all at once, this instance slows down, and the work is routed to another transaction server, and so on.

MQ can be seen as the bad guy here.  When you restart MQ, it can go to fire hose mode immediately.   You should start the output channels first to start draining messages, then gradually start the input channels.   If you start the input channels before the output channels, you may get queues and page sets filling up, before the output channels can process the messages.

If you have a policy that all client connections must disconnect and reconnect after a random time between 15 minutes and 45 minutes, this should help spread the load, and gradually you should get a balanced environment.
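A minimal Python sketch of such a reconnect policy is below. The function name and the way the delay is used are assumptions for illustration; a real client would disconnect and reconnect through its own MQ or HTTP library once the delay has passed.

```python
import random

MIN_MINUTES = 15
MAX_MINUTES = 45


def next_reconnect_delay_seconds(rng: random.Random) -> float:
    """Pick a random connection lifetime between 15 and 45 minutes."""
    return rng.uniform(MIN_MINUTES * 60, MAX_MINUTES * 60)


rng = random.Random()
delay = next_reconnect_delay_seconds(rng)
print(f"Disconnect and reconnect in {delay / 60:.1f} minutes")
```

Because each client picks its own delay, clients that all connected at the same moment after a restart drift apart over the following cycles.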

Understanding z/OS Connect SMF 123 subtype 2 records

Introduction to the z/OS Connect SMF records

z/OS Connect can provide two types of SMF record

  1. SMF 120 subtype 11, provided by the base Liberty support. This gives information on the URL used to access Liberty, and the CPU used to perform requests. This is enabled at the server level, so you can have records for all requests, or no requests. There is one SMF record for each web server request.
  2. SMF 123 provides information about the API and the service used, and the “pass through” services. It provides the elapsed time of the request, and of the “pass through” requests. It does not provide CPU usage figures. This can be configured to produce records depending on the http host and port used to access z/OS Connect. One SMF record can have data for multiple web server requests. The SMF records are produced when the SMF record is full, or the server is shut down.

The SMF 120-11 and SMF 123 records are produced independently, and there is no correlating field between them. They both have a URI field, and time stamps, so at low volumes it may be possible to correlate the SMF data.

I’ll document the fields which I think are interesting. If you think other fields are useful please let me know and I’ll update this document.

I have written an SMF formatter in C which prints out interesting data, and summarises it.

SMF 123 subtype 2 fields

  • You get the standard date and time the record was produced, and the LPAR. You can use PGM=IFASMFDP with the following to filter which records are copied
DATE(2020282,2020282)
START(1000)
END(2359)
  • There is server version (3), system(SOW1), SYSPLEX(ADCDPLEX) and job id(STC04774) which are not very interesting
  • Server job name(SM3) is more interesting. I started the server with s baqstrt,parms=’MQTEST’,jobname=sm3
  • The config dir (/var/zosconnect/servers/MQTEST/) is boring – as is server level (open beta)
  • The HTTP code, for example 200 OK, 403 Forbidden. You may want to report requests with errors in a separate file
    1. So you know you have errors and can fix them
    2. Your statistics, such as average response time do not have dirty data in them.
  • An HTTP 1 character flag – this has always been 00 for me. I cannot find it documented.
  • The Client IP address. 10.1.1.1
  • You get userid information.
    • I used a certificate to authenticate. The DN from the certificate is not available. You only get the userid from the RACF mapping of DN to userid. This mapped userid was in the 64 byte field. The 8 byte userid field was empty for me. The lack of certificate DN, and having the userid in the wrong field feels like a couple of buglets. If you use LDAP, I think the long ID is stored in the long field, and the z/OS userid stored in the short field – so inconsistent.
  • You get the URL(URI) used /stockmanager/stock/items/999999. I treat this as a main key for processing and summarising data. You may need to process this value as the partnumber (999999) will be different for each query. You may want to have standards which say the first two elements (/stockmanager/stock) are useful for reporting. The remaining elements should be ignored when summarising records.
  • The start and stop times (2020/10/08 09:18:19.852360 and 2020/10/08 09:18:22.073946) are interesting. You can calculate the duration – which is the difference between them.
  • Request type API, Service, Admin. An Admin request is like using a URL like /zosConnect/services/stockQuery to get information about the stockQuery service.
  • The API name and version – stockmanager 1.0.0
  • The service name and version – stockQuery 1.0.0. You get the version information. If you do an online display of the service the version is not available.
  • Method GET/POST etc
  • The service provider. This is the code which does the real work of connection to CICS, MQ, or passing the request through. IBM MQ for z/OS, restclient-1.0
  • Request id starts at 1 and increments for the life of the server. If you restart the server it will restart from 1. I do not think this is very useful.
  • For “pass through” requests, z/OS Connect confusingly calls the back end service the System of Record (SOR). (MQ is a transport, not a System of Record.) The “pass through” service definition is built from a parameter file and the zconbt program. The reported data is
    • SOR ID the host and port 10.1.3.10:19443. These are from the <zosconnect_zosConnectServiceRestClientConnection host and port values.
    • SOR Reference restClientServiceY. This is from the connectionRef= in the parameter file and the <zosconnect_zosConnectServiceRestClientConnection…> definition
    • SOR Resource zosConnect/apis/through. This is from the uri= in the parameter file.
    • Time of entry and time of return of the SOR service.
    • From the times calculate the difference to get the duration of the remote request.
  • It would be useful to have this “pass through time” for services calling MQ, CICS etc, so we could get a true picture of the time spent processing the requests.
  • The size of the data payload (0) , and the size of the response(94) excluding any headers.
  • A tracking token. This 64 byte hex string is passed to some back ends (CICS and “pass through”) so the request can be correlated across systems. It is not passed to MQ. See X-Correlation-ID below for an example. This field is nulls for Admin requests. When a request was “passed through” to another z/OS Connect server which processed the request, the tracking token was not reported in the SMF data of the second system. I don’t know if the CICS SMF data records this token, but it is of little use for MQ, or for “pass through”.
  • You get 4 request header, and 4 response header fields. They were blank in my measurements, even though headers were passed to the pass through service. Looking at the http traffic, the request coming in had a header “Content-Type:application/json”. The request passed through to the back end included
    • User-Agent: IBM_zOS_Connect_REST_SP/open beta/20200803-0918
    • X-Correlation-ID: BAQ1wsHYAQAYwcTDxNfTQEDi8ObxQEBAQNikjpk+1klAAA==
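
The X-Correlation-ID value looks like base64. Here is a minimal Python sketch to turn it into hex, so it can be compared with the tracking token in the SMF record. It assumes the value is base64 and that any text inside it is EBCDIC (code page 037). Both are my assumptions, not documented behaviour.

```python
import base64

# Value copied from the X-Correlation-ID header shown above.
header_value = "BAQ1wsHYAQAYwcTDxNfTQEDi8ObxQEBAQNikjpk+1klAAA=="

raw = base64.b64decode(header_value)

# Print the bytes as hex so they can be compared with the SMF tracking token.
print(len(raw), "bytes:", raw.hex())

# Assumption: any readable text in the token is EBCDIC (code page 037).
# Unprintable characters are shown as dots.
text = raw.decode("cp037")
print("".join(ch if ch.isprintable() else "." for ch in text))
```

With this value, bytes 4 to 6 decode to BAQ, which is the z/OS Connect message prefix, so the EBCDIC guess looks plausible.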

What can you do with the data?

Do I need to use the SMF data?

From a performance perspective these records do not provide much information, as they are lacking information about CPU usage. From an audit perspective they have some useful information – but the records are missing information which would provide useful audit information. There is an overlap between the information in the SMF 123 records and the …/servers/…logs/http_access.log file which provides, date time, userid, URI, HTTP code.

What do I want to report on?

Decide what elements of the URI you want to report on. For example the URI /stockmanager/stock/items/999999 includes the stock part number, which may be different for each request. You might decide to summarise API usage on just the first two elements /stockmanager/stock/. You may have to treat each API individually to extract the key information.

I’ll use the term key for the interesting part of the URI – for example /stockmanager/stock.
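
Extracting the key can be sketched in a few lines of Python. The function name api_key and the depth parameter are mine, for illustration only. A real report may need a different rule for each API.

```python
def api_key(uri: str, depth: int = 2) -> str:
    """Return the first `depth` path elements of a URI, for example
    /stockmanager/stock from /stockmanager/stock/items/999999."""
    path = uri.split("?", 1)[0]                 # drop any query string
    parts = [p for p in path.split("/") if p]   # ignore empty elements
    return "/" + "/".join(parts[:depth])

print(api_key("/stockmanager/stock/items/999999"))
```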

What reports are interesting?

I think typical questions are:

  1. Which is the most popular API key?
  2. Is the usage of an API key increasing?
  3. How many API key requests were unsuccessful? This can show set-up problems, or penetration attempts.
  4. What is the response time profile of the requests? Are you meeting the business response time criteria?
  5. Which sites are sending in most of the requests? You cannot charge back on CPU used, as you do not know the CPU usage. You could do charge back at a fixed cost per API request, with each API having a different rate.
  6. Which userids are sending in most of the requests? You may want to provide more granular certificate to userid mapping to give you more detailed information.
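
As a rough sketch of how some of these questions could be answered, the Python below counts requests, failures and userids per key. The record layout (key, userid, HTTP status, elapsed seconds) and the values are made up for illustration. Your SMF formatter will produce something different.

```python
from collections import Counter

# Hypothetical records, as they might come out of an SMF formatter:
# (key, userid, HTTP status code, elapsed seconds). Made-up values.
records = [
    ("/stockmanager/stock", "ADCDC", 200, 0.021),
    ("/stockmanager/stock", "ADCDC", 200, 0.035),
    ("/stockmanager/stock", "COLIN", 404, 0.004),
    ("/zosConnect/services", "COLIN", 200, 0.012),
]

def summarise(records):
    """Count requests per key, failures per key and requests per userid."""
    by_key = Counter()
    failed = Counter()
    by_user = Counter()
    elapsed = {}
    for key, userid, status, seconds in records:
        by_key[key] += 1
        by_user[userid] += 1
        if status >= 400:          # treat 4xx and 5xx as unsuccessful
            failed[key] += 1
        elapsed.setdefault(key, []).append(seconds)
    return by_key, failed, by_user, elapsed

by_key, failed, by_user, elapsed = summarise(records)
for key, count in by_key.most_common():
    print(key, "requests:", count, "failed:", failed[key],
          "max elapsed:", max(elapsed[key]))
```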

Understanding z/OS Connect SMF 120 subtype 11 data

z/OS Connect can provide two types of SMF record

  1. SMF 120 subtype 11, provided by the base Liberty support. This gives information on the URL used to access Liberty, and the CPU used to perform requests. This is enabled at the server level, so you can have records for all requests, or for no requests. There is one SMF record for each web server request. Would I use this to report CPU used? No – see the bottom of this blog.
  2. SMF 123 provides information about the API and the service used, and the “pass through” services. It provides the elapsed time of the request, and of the “pass through” requests. It does not provide CPU usage figures. This can be configured to produce records depending on the http host and port used to access z/OS Connect. One SMF record can have data for multiple web server requests. The SMF records are produced when the SMF record is full, or when the server is shut down.

The SMF 120-11 and SMF 123 records are produced independently, and there is no correlating field between them. They both have a URI field, and time stamps, so at low volumes it may be possible to correlate the SMF data.

I’ll document the fields which I think are interesting. If you think other fields are useful please let me know and I’ll update this document.

I have written an SMF formatter in C which prints out interesting data, and summarises it.

SMF 120-11

  • You get the standard date and time the record was produced, along with the LPAR. You can use PGM=IFASMFDP with the following to filter which records are copied
DATE(2020282,2020282)
START(1000)
END(2359)
  • There is server version (3), system(SOW1), and job id(STC04774) which are not very interesting
  • Server job name(SM3) is more interesting. I started the server with s baqstrt,parms='MQTEST',jobname=sm3
  • The config dir (/var/zosconnect/servers/MQTEST/) is boring – as is code level (20.0.0.6)
  • The start and stop times (2020/10/08 09:18:19.852360 and 2020/10/08 09:18:22.073946) are interesting as is the duration – which is the difference between them.
  • You get userid information.
    • I used a certificate to authenticate. The DN from the certificate is not available. You only get the userid from the RACF mapping of DN to userid. This mapped userid was in the 64 byte field. The 8 byte userid field was empty for me. The lack of certificate DN, and having the userid in the wrong field feels like a couple of buglets.
  • You get the URL used /stockmanager/stock/items/999999. I treat this as a main key for processing and summarising data. If you want to summarise the data, you may want to summarise it just on /stockmanager/stock/. The full URI contains the part number, and so I would expect a large number of different values.
  • You can configure your requests to WLM. For example
<wlmClassification>
<httpClassification transactionClass="TCI1" method="GET" 
    resource="/zosConnect/services/stockQuery"/>
</wlmClassification>

This produced in the SMF record

WLMTRan :TCI1
WLM Classify type :URI :/zosConnect/services/stockQuery
WLM Classify type :Target Host :10.1.3.10
WLM Classify type :Target Port :19443

This means that the URL, the host, and the port were passed to WLM to classify.

If you get the WLM classification you also get CPU figures when the enclave request ended (was deleted).

  • You get the ports associated with the request.
    • Which port was used on the server – Target Port :9443
    • Where did the request come from? Origin :10.1.1.1 and port :36786
  • The number of bytes in the response Response bytes :791
  • CPU figures for the CPU used on the TCB. See discussion below on the usefulness of this number. You get the CPU figures before the request, and after the request – so you have to calculate the difference yourself! The values come from the timeused facility. You can calculate the delta and get
    • CPU Used Total : 0.967417
    • CPU Used on CP : 0.026327
    • and subtract the CP value from the total to get CPU Delta on Z**P : 0.941090. This is the CPU offloaded to ZIIP or ZAAP.
  • If you had the URI classified with WLM, you get Enclave data, see below for a discussion on what the numbers mean.
    • Enclave CPU time : 0.148803
    • Enclave CPU service : 0.000000
    • Enclave ZIIP time : 0.148803
    • Enclave ZIIP Service : 0.000000

What do the CPU numbers mean?

Typically a transaction flow is as follows

  1. A listening thread listens on the HTTP(s) host and port.
  2. When a request arrives, it passes the request to a worker thread, and goes back to listening
    1. The worker thread may do some work and send the response back
    2. The worker thread may need to call another thread to do some work. For example to issue an MQ request,
      1. the MQ code looks for a thread in a pool for a matching queue manager and userid. If it finds one it uses the thread and issues the MQ request.
      2. If it does not find a matching thread it may allocate a new thread, and issue an MQCONN to connect to MQ. These are both expensive operations, which is why having a pool of threads with queue manager and userid is a good way of saving CPU.
      3. The work is done
      4. The thread is put back into the MQ pool, and the application returns to the worker thread
      5. The worker thread sends the response back to the originator

A thread can ask the operating system, how much CPU time it(the thread) has used. What usually happens is

  1. the thread requests how much CPU it has used
  2. the thread does some work
  3. the thread requests how much CPU it has used,
  4. the thread calculates the difference between the two CPU values and reports this delta.
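
The same before and after pattern can be shown in Python. This is only an illustration of the technique. It uses Python's time.thread_time(), not the z/OS timeused facility, and the function names are mine.

```python
import time

def cpu_used_by(work):
    """Run work() and return the CPU seconds this thread used doing it."""
    before = time.thread_time()    # CPU used by this thread so far
    work()
    after = time.thread_time()
    return after - before          # the delta is what gets reported

def busy():
    # Some CPU-bound work to measure.
    sum(i * i for i in range(200_000))

print("CPU seconds:", cpu_used_by(busy))
```

Note that the delta only covers the calling thread. CPU used by any other thread the work was handed to does not show up, which is the limitation discussed next.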

I think the SMF 120 record records the CPU from just the worker thread – and no other thread.

Enclaves

When more than one thread is involved it gets more complex, as you could have a CICS transaction issuing an MQ request, then a DB2 request, and then an IMS request. You can set up z/OS WorkLoad Manager(WLM) to say “these CICS transactions in this CICS region are high priority”.

With some subsystems you can pass a WLM token into a request. The thread being invoked can tell WLM that the thread is now working on behalf of this token. The thread does some work, and tells WLM that it has finished doing the work. WLM can manage the priority of the threads to achieve the best throughput, for example making the thread high or low priority. WLM can manage a thread doing work in multiple LPARs across a sysplex!

WLM records the CPU used by the thread while performing the work, accumulates and reports this.

This use of multiple threads for a business transaction across one or more address spaces is known as an enclave.

What happens with enclaves?

  1. A request arrives at the listener thread.
  2. Liberty looks up the URI in the <wlmClassification> <httpClassification…> definitions. It compares the server’s host, the server’s port, the URI resource (/stockmanager…) and the method (GET), and finds the best match for the transactionClass.
    1. If there is a transactionClass,
      1. the server calls WLM with the Subsystem type of CB, the specified collectionName, and the transactionClass.
      2. WLM looks for these parameters and if WLM has a matching definition then WLM will manage the priority of the work,
      3. WLM returns a WLM token.
      4. This WLM token is passed to threads which are set up for enclaves.
    2. If there is no transaction class specified in Liberty, or WLM does not have the subsystem, collectionName and transactionClass, then there is no token, or a null WLM token.
    3. The work continues as before.
    4. If another thread is used, the WLM token is passed to it. If that code is set up for WLM tokens it reports “work started”, and when it has finished it reports “work ended”.

What happens if the request is not known to WLM?

The worker thread calculates the CPU used for just its work, and reports this. The CPU used by any other thread is not reported. The figures reported are the CPUTotal timeused values. You have to calculate the difference yourself.

What happens if the request is known to WLM?

You get the timeused CPU for the worker thread – as with the case where the request is not known to WLM.

From RMF (or other products) you get out reports for an interval with

  1. The number of requests in the interval
  2. The rate of requests in the interval
  3. The amount of time on a CP engine in seconds
  4. The amount of time on a ZIIP engine in seconds
  5. The amount of time on a ZAAP in seconds.
  6. Over the interval, what percentage of time was CP on CP engines, zAAP on zAAP engines, zAAP on CP engines, zIIP on zIIP engines.

From the SMF 120 records you get

Enclave CPU time
Enclave ZAAP time
Enclave ZIIP time

Example Enclave figures.

For 100 API requests, I averaged the figures reported by SMF 120-11.

  1. Average CPU(1) 0.023
  2. Average CPU(2) 0.0008
  3. Enclave CPU 0.029
  4. Enclave ZAAP 0
  5. Enclave ZIIP 0.028

The figures reported by RMF per request

  1. CPU 0.031
  2. ZIIP 0.039
  3. ZAAP 0.000
  4. Total 0.070 seconds of CPU per transaction

These figures tie up – the Enclave CPU, ZIIP, and ZAAP are similar.

The CPU used by the server address space was

  1. CPU 30.1 seconds
  2. ZIIP 28.7 seconds
  3. ZAAP 0 seconds.
  4. Total 58.8.

Each request took 0.070 seconds, and there were 100 requests, so the reported CPU was 7 seconds.

The difference (about 51 seconds) is not reported in the transaction costs. It looks like the “timeused” value is less than 1% of the CPU value, and the enclave figures are under 2% of the grand total.
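
The arithmetic behind the unreported CPU, as a short Python sketch using only the RMF and address space figures quoted above:

```python
requests = 100
cpu_per_request = 0.070      # RMF total per transaction, in seconds
address_space_cpu = 58.8     # CP + ZIIP used by the server, in seconds

reported = requests * cpu_per_request         # CPU attributed to requests
unreported = address_space_cpu - reported     # CPU nobody is charged for
print(f"reported {reported:.1f}s, unreported {unreported:.1f}s")
```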

Looking at the trace in a dump, I can see many hot TCBs using much more CPU than is reported by WLM and RMF. I expect that many TCBs are used in a request, but they do not have the enclave support in them. Overall – pretty useless for charge back and understanding the cost per transaction.

What’s the difference between MQ Web, and z/OS Connect MQ support?

With MQ Web

  1. You can issue commands to administer MQ  for example display, define, delete MQ objects.
  2. You can put and get messages to and from a queue.  The message is what you specify – typically a character string.

With z/OS Connect MQ support

  1. You can put and get messages to and from a queue, and do transformations on the message.  For example mapping a COBOL structure to JSON.  
  2. You can do field validation.
  3. You can convert HTTP code “200” to “great it worked”.

What is common?

They both use z/OS WebSphere Liberty to provide the basic web server.

Looking for an MQ reason code in Liberty? Get your safari helmet, anti malarial tablets and follow me to find the treasure.

I was using an MQ application in Liberty, and rather than do things the easy way, I did what I normally do, and did it the hard way.  On my z/OS I did not have the queue manager defined, because I wanted to see what happened.  I was not expecting the expedition.

You configure MQ in Liberty using configuration like

<jmsConnectionFactory jndiName="jms/cf1" connectionManagerRef="ConMgr1"> 
  <properties.wmqJms transportType="BINDINGS" queueManager="MQPA"/>
</jmsConnectionFactory>

 

I was expecting a message like the following in the job output.

Application COLINAPP MQCONN call to MQPA failed with compcode 
'2' ('MQCC_FAILED')reason '2058' ('MQRC_Q_MGR_NAME_ERROR').

Oh no, it was not that easy.  It was quite a trek into the jungle to find the information.

In the Liberty server’s logs directory there is a messages.log file.  In this file I had

9/14/20 19:16:32:242 GMT 00000060 com.ibm.ws.logging.internal.impl.IncidentImpl I FFDC1015I: An FFDC Incident has been created: "com.ibm.mq.connector.DetailedResourceException: MQJCA1011: Failed to allocate a JMS connection., error code: MQJCA1011 An internal error caused an attempt to allocate a connection to fail. See the linked exception for  details of the failure. com.ibm.ejs.j2c.poolmanager.FreePool.createManagedConnectionWithMCWrapper 199" at 
ffdc_20.09.14_19.16.28.0.log

This was one long line, and I had to scroll sideways (just like you did) to see the content (or use the ISPF line prefix command “tf” to flow the text to the display width).  A key hint was the message MQJCA1011 An internal error caused an attempt to allocate a connection to fail  so I knew I was on the right trail.  I now knew the name of the file – ffdc_20.09.14_19.16.28.0.log.

Knowing the name of the file did not help very much, as if you use ISPF 3.17  (z/OS UNIX Directory List ) it showed a list of 40 files with the name ffdc_20.09.14_1 (ffdc_yy.mm.dd_h).   This is because it only displays the first part of the name. Thanks to Steve Porter who said ..

To increase column size in 3.17, >
Options
1. Directory List Options…
Width of filename column . . . . . . . . 15 (Default value – increase as necessary)

 

The file has a name ffdc_20.09.14_19.16.28.0.log and a displayed time stamp of 2020/09/14 18:16:32 which is close enough – allowing for the time zone difference and the time taken to write the file.  I was fortunate not to be running a workload and producing many of these files.

I edited the file – and I could see the full file name at the top of the page, so I knew I was in the right file.

The file has long lines, so I had to scroll or use the “tf” line command to reformat it.

Near the top it had

Stack Dump = com.ibm.mq.connector.DetailedResourceException: 
MQJCA1011: Failed to allocate a JMS connection., error code:  
MQJCA1011 An internal error caused an attempt to allocate a connection to fail. 
See the linked exception for details of the  failure.

Further down it had

Caused by: com.ibm.msg.client.jms.DetailedJMSException: 
JMSWMQ0018: Failed to connect to queue manager 'MQPA' with connection 
mode 'Bindings' and host name 'localhost(1414)'.

and further further down (line 50) I found the treasure

Caused by: com.ibm.mq.MQException: JMSCMQ0001: IBM MQ call failed with 
compcode '2' ('MQCC_FAILED') reason '2058'  ('MQRC_Q_MGR_NAME_ERROR').

What a trek to find the information I needed!

Next time I’ll just list the logs/ffdc directory, edit (not browse) each file and search for “compcode”.   You cannot use “grep compcode” from USS because the file is in UTF8, so grep does not find the string.  You can just use oedit file_name in USS.
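
If Python is available where you read the logs, a small script can do the search across all of the ffdc files. This is a sketch, not something I have run on z/OS. It assumes the files really are UTF-8, and the function name and the logs/ffdc path are placeholders.

```python
from pathlib import Path

def find_in_ffdc(directory, needle="compcode"):
    """Yield (file name, line number, line) for each line containing needle."""
    for path in sorted(Path(directory).glob("ffdc_*.log")):
        # Assumption: the files are UTF-8; bad bytes are replaced, not fatal.
        text = path.read_bytes().decode("utf-8", errors="replace")
        for number, line in enumerate(text.splitlines(), start=1):
            if needle in line:
                yield path.name, number, line.strip()

if __name__ == "__main__":
    # Hypothetical location: use your server's logs/ffdc directory.
    for name, number, line in find_in_ffdc("logs/ffdc"):
        print(f"{name}:{number}: {line}")
```

Reading the file as bytes and decoding explicitly avoids depending on the default code page of the session.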

It would be nice if the MQ code could be enhanced to have an option “makeErrorsHardToFind” which you could set to “no”, and still keep the default “yes”.

 

Getting z/OS Explorer to work with z/OS Connect EE

I’ve been trying to set up z/OS Connect, so I could look at the MQ support within it.

Setting up z/OS Connect in the first place was a challenge, which I’ll blog about some other time.  I was looking for an Installation Verification Program (IVP) and tried to use the z/OS Explorer.  This was another challenge.  As with many problems there are answers, but it is hard to find the information.

Installing z/OS Explorer

This was easy.  I started here and installed z/OS explorer for Aqua – Eclipse tools.  Then select  IBM z/OS Connect EE.  I selected Aqua 3.2, and chose to install using eclipse p2. I have tried to avoid installation manager as it always seemed very complex and frustrating.

I tried to extend an existing eclipse, but this failed due to incompatibilities.  I used start from fresh, and this worked fine.

Adjust the z/OS Connect server configuration.

I enabled logon logging.

 <httpEndpoint id="defaultHttpEndpoint" 
    host="*" 
    accessLoggingRef="hal1" 
    httpPort="19080" 
    httpsPort="19443" > 
   <accessLogging enabled="true" 
     logFormat='h:%h i:%i u:%u t:%t r:%r s:%s b:%b D: %D m:%m' 
    /> 
<sslOptions sslRef="defaultSSLSettings"/> 
</httpEndpoint> 

This creates a file called http_access.log in the logs directory. It has output like

10.1.1.1 ADCDC 08/Sep/2020:17:50:40 +0000 "GET /zosConnect/services/stockQuery HTTP/1.1" 200

You can see where the request came from (10.1.1.1), user (ADCDC), the date and time, the request (“GET /zosConnect/services/stockQuery HTTP/1.1”), and the response code(200).
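
If you want to process the access log with a program, a regular expression can pick out the fields. This Python sketch matches the sample line shown above. The field names are mine, and a different logFormat will need a different pattern.

```python
import re

# Pattern for: host user date-time zone "method uri protocol" status
LINE = re.compile(
    r'(?P<host>\S+) (?P<user>\S+) (?P<when>\S+ [+-]\d{4}) '
    r'"(?P<method>\S+) (?P<uri>\S+) (?P<protocol>[^"]+)" (?P<status>\d{3})'
)

def parse_access_line(line):
    """Return a dict of the fields, or None if the line does not match."""
    match = LINE.match(line)
    if match is None:
        return None
    fields = match.groupdict()
    fields["status"] = int(fields["status"])
    return fields

sample = ('10.1.1.1 ADCDC 08/Sep/2020:17:50:40 +0000 '
          '"GET /zosConnect/services/stockQuery HTTP/1.1" 200')
print(parse_access_line(sample))
```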

Getting started with z/OS Explorer

You need to define host connections.

If you totally disable security on your server you can use http.

  1. On z/OS explorer, display the Connections tab. (Window -> Show View -> Host connections)
  2. Right click on z/OS Connect Enterprise Edition, and select New z/OS Connect Enterprise Edition, Connection
    1. Name: this is displayed in the tooling
    2. Host name: I used 10.1.3.10 which is my VIPA address of the server
    3. Port number:   This comes from the  httpEndpoint for the server.  The default is http:9080 and https:9443 – but as every Liberty product uses these values, your server may have different values.  I used 19080.
    4. I initially left Secure connection(TLS/SSL) unticked
    5. Click Save and Connect
  3. A panel was displayed asking for credentials. Either create new credentials (userid and password) or select an existing credential.
  4. Double click on the connection you just created.
    1. An error of “302, Found” is an http response meaning redirection.  In the z/OS connect case, this means you are trying to use an http connection when an https ( a TLS connection) was expected.  I got this because I had not disabled security in my server.

The normal way of accessing z/OS connect is to use TLS to protect the session.  As well as TLS to protect the session you can also use client certificate authentication.  This is what I used.

You will need to set up certificates, keystores and keyrings on z/OS and get the Certificate Authority certificates sent to the “other” system.  I used my definitions from using MQWEB.

  1. On z/OS explorer, set up the keystores
    1. Window -> Preferences -> Explorer-> certificate manager
    2. The truststore contains the CA certificates to validate the certificate sent down from the z/OS server.  Enter the file name (or use Browse), the pass phrase, and the key store type.  My truststore was JKS.
    3. The keystore contains the client certificate used to identify this client to the server.
    4. Smart card details.  Ignore this (despite it saying you must configure a PKCS11 driver).   This section is used if you select smart card to identify yourself, and it would be better if the wording said “If you are using Smart card authentication you must configure a PKCS11 driver”.
    5. Leave the “Do not validate server certificate trust” unticked.  This will check the passwords etc of the key stores.
    6. At the bottom I used “Secure socket protocol-> TLS v1.2” though this is optional.
    7. Select Apply and Close
  2. Display the Connections tab. (Window -> Show View -> Host connections)
  3. Right click on z/OS Connect Enterprise Edition, and select New z/OS Connect Enterprise Edition, Connection
    1. Name: this is displayed in the tooling
    2. Host name: I used 10.1.3.10 which is my VIPA address of the server
    3. Port number:   This comes from the  httpEndpoint for the server.  The default is http:9080 and https:9443 – but as every Liberty product uses these values, your server may have different values.  I used 19443
    4. I ticked Secure connection(TLS/SSL).  If you do not select this, you will not be able to use a certificate to logon.
    5. Click Save and Connect
  4. A panel was displayed asking for credentials.   When I used an existing credential I failed to connect to the server.
    1. Select Create new credentials
    2. Click on Username and Password pull down – and select Certificate from Keystore.
    3. Enter credentials name – this is just used within the tooling
    4. Userid – this seems to be ignored.  I used certificate mapping on the z/OS to map the certificate to a userid.
    5. Choose a certificate – select one from the pull down.  In my Linux box the choice of certificates came out in yellow writing on a yellow background!
    6. Click OK
    7. The connection should appear on the Connections page, under z/OS Connect Enterprise Edition.  It should go yellow while it is connecting, and green, with a padlock, once it has connected.

Use z/OS Connect

Use Window-> Show View -> zOS Connect EE Servers

You should see your connection displayed  with the IP address and port. Underneath this are any APIs or Services you have defined.

If you have any APIs or Services, you should be able to right click and select Show Properties View.  You can click on the links, or copy the links and use them, for example in a web browser directly, or via curl.

If you try to use the APIs or Services, you may not be authorised.  You will need to configure

  1. <zosconnect_zosConnectManager …>
  2. <zosconnect_zosConnectAPIs>   <zosConnectAPI name="stockmanager"  ….
  3. <zosconnect_service>  <service name="stockquery"

Good luck.