What no one tells you about defining your RACF resources – and how to do it for MQ.

Introduction

The RACF documentation has a lot of excellent reference materials describing the syntax of the commands, but I could not find much useful information on how to set up RACF specifically for products like MQ, CICS, Liberty etc.

It is a bit like saying programming has the following commands: load, store, branch; but failing to tell you that you can do wonderful things, like draw Mandelbrot pictures, using these instructions.

I set up MQ on my z/OS system as if I were an enterprise user, though my enterprise has just one person in it: me!  Looking at it this way shows what you need to configure.

I am not a RACF expert – but have learned as I go along.  I believe this blog post is accurate – but I may have missed some set up considerations.

In this blog post I’ll cover

  • Security roles
  • RACF concepts – class and profile
  • Controlling who can create profiles and how to limit what they can create
  • MQ profiles
  • Planning for MQ
  • How do you copy the security profiles for a new queue manager?

A typical enterprise – from a security perspective

In a typical enterprise there are different departments

  • The security team is responsible for the overall set up of security, and for ensuring that configurations are up to date (for example, that userids which are no longer needed are deleted).
  • Product teams are responsible for defining the security profiles within their products, protecting resources, and giving people access to facilities.
    • Most developers do not have the authority to define profiles or give access.  Many developers do not have a z/OS logon.

What do you need to protect?

There are four types of resources you can define

  • commands
  • resources
  • “logging on” or “connecting to”
  • turning off security, or making powerful commands generally available

For example, for z/OS

  • commands
    • being able to issue z/OS commands
    • being able to issue TSO commands
  • resources
    • data sets
  • logging on
    • which systems you can logon to
  • turning off security
    • changing the RACF configuration

for MQ

  • commands
    • being able to issue MQ commands to configure the queue manager
    • being able to issue commands to define MQ queues etc
  • resources
    • queues, channels etc
  • logging on
    • which queue managers you can use, connect from batch, but not from CICS
  • defining the switches to disable parts of MQ security checking.

How do you protect a resource?

Resources are defined in classes. For example

  • class OPERCMDS for z/OS console commands
  • class MQCMDS for MQ commands
  • class MQQUEUE for MQ queues
  • class SERVER for managing access to servers such as Liberty and WAS

You need to go down a level and protect resources within a class.  You may want to allow one group of people to define resources for production and another group to define resources for test.  For example, you may have MQOPS allowed to define profiles for PROD… and TEST…, and TESTOPS allowed to define profiles only for TEST….

You need CLass AUTHorisation (CLAUTH) on a userid to be able to define a resource.  The CLAUTH does not exist for a group.

ALTUSER ADCDA CLAUTH(MQCMDS)

With this command userid ADCDA can now use commands like

RDEFINE MQCMDS MQPA.DISPLAY.** UACC(NONE)  OWNER(MQM)

This says

  • Create an entry for class MQCMDS
  • Queue manager MQPA, any DISPLAY command, so DISPLAY USAGE, and DISPLAY QLOCAL would be covered
  • No universal access
  • The resource is owned by MQM.   If this is a group, anyone with group special in group MQM can issue the PERMIT command on the resource

How specific a profile do I need?

For harmless commands, such as DISPLAY you can have a general profile MQPA.DISPLAY.* to cover all DISPLAY commands.

For commands that can change the system, you  should use specific profiles, for example

RDEFINE MQCMDS MQPA.DEFINE.PSID UACC(NONE)  OWNER(MQM)
RDEFINE MQCMDS MQPA.DEFINE.QLOCAL UACC(NONE)  OWNER(MQM)
PERMIT MQPA.DEFINE.PSID CLASS(MQCMDS) ACCESS(READ) ID(MQOP1)
PERMIT MQPA.DEFINE.QLOCAL CLASS(MQCMDS) ACCESS(READ) ID(MQAMD1)

If you use the generic MQPA.DEFINE.** profile instead, then administrators can give themselves access to the operator DEFINE commands.

Limiting what profiles a user can manage

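If you have RACF GENERICOWNER enabled (this is a system-wide option), you can limit which profiles a user can create: a more specific profile can be defined only by someone who owns the generic profile that covers it, or who has group special in the group which owns it.

Turn on GENERICOWNER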

SETROPTS GENERICOWNER

Create a top level, catch-all case

RDEFINE MQCMDS ** UACC(NONE) OWNER(SYS1)

Create a profile so that people in group ADCD can define only resources with names MQPC.**

RDEFINE MQCMDS MQPC.** UACC(NONE) OWNER(ADCD)

If userid ADCDA in group ADCD tries to create a profile

RDEFINE MQCMDS MQPC.AA3 UACC(NONE)

it works, but

RDEFINE MQCMDS MQPZ.AA UACC(NONE)

gives ICH10103I NOT AUTHORIZED TO DEFINE MQPZ.AA.

The owner of a profile can give authority to anyone; there are no limits or checks.

Creating profiles for MQ

Using the categories described above

  • MQ commands
    • being able to issue MQ commands to configure the queue manager
    • being able to issue commands to define MQ queues etc
  • MQ resources
    • queues, channels etc
  • connecting to MQ
    • which queue managers you can use
  • defining the switches to disable parts of MQ security checking.

MQ Commands

Commands can be issued from

  • the operator console (SDSF)
  • with the MQ ISPF panels,  messages are put to the SYSTEM.COMMAND.INPUT.QUEUE
  • Applications putting messages to the SYSTEM.COMMAND.INPUT.QUEUE
  • Applications using PCF to the SYSTEM.COMMAND.INPUT.QUEUE

If command checking is enabled then commands are checked using the MQCMDS class.

Commands issued via the SYSTEM.COMMAND.INPUT.QUEUE need permission to put to the queue, and the command is then checked using the MQCMDS class.

MQ resources

The queuing resources have the following classes.  The MX… classes are for MiXed case names.  A completely UPPER case queue name can still be protected if you choose to use the MXQUEUE class.  That is, “upper case” names are a subset of the “mixed case” names, and MYQUEUE is different from MyQueue.

  • MQQUEUE,MXQUEUE  queue resources
  • MQPROC, MXPROC process (for example triggering)
  • MQNLIST, MXNLIST name list
  • MXTOPIC topics – Topics are always mixed case.

Connecting to MQ

  • MQCONN, where you define resources like MQPA.BATCH CLASS(MQCONN)

Defining switches to disable parts of MQ security checking, and subset checks

  • MQADMIN, MXADMIN

These classes are used mainly for holding profiles for administration-type functions. For example:

    • Profiles for IBM MQ security switches
    • The RESLEVEL security profile
    • Profiles for alternate user security
    • The context security profile
    • Profiles for command resource security

For example the following turns off all RACF checking for the queue manager

RDEFINE MQADMIN MQPA.NO.SUBSYS.SECURITY

You can set up security so people are authorised to only a subset of objects.

You can set up

RDEFINE MQADMIN MQPA.QUEUE.TEST* OWNER(MQPAOPS)

to give people access to a subset of queues, in this case queues beginning with TEST on queue manager MQPA.  To define such a queue, a user would need to be authorised to the MQCMDS profile MQPA.DEFINE.QLOCAL (or hlq.DEFINE.**) and to the MQADMIN profile MQPA.QUEUE.TEST*.

A thought on the MQ profile design.

It feels like the security was not well designed in this area.  You want to restrict someone’s access to a subset of queues, but that person may be able to turn MQ security off if they have the authority to create the MQADMIN profile MQPA.NO.SUBSYS.SECURITY!

You can solve this using GENERICOWNER (which is optional) and

RDEFINE MQADMIN MQPA.NO.** UACC(NONE) OWNER(THEBOSS)

Looking back, rather than depending on the GENERICOWNER facility,  I would have set up a class MQSWITCH to allow only the site RACF coordinator to define a switch and so turn off security.

Planning for security

You need to identify

  • the classes of profiles (MQCMDS, MQQUEUE, z/OS OPERCMDS)
  • the subsystems being protected (MQ, DB2)
  • the areas of profiles, for example TEST queues, or Production tables for the PAYROLL application
  • the roles of people and what they are expected to do – map each role to a group
    • For each subsystem and class of profile what can each role do?
      • Production, Read Only operator commands, roles: all roles
      • Production, DEFINE PAGESET commands, roles: members of ZOPER group
      • Production, DEFINE QUEUE  commands, roles: members of PRODADMN group
      • Test, DEFINE PAGESET commands, roles: members of ZOPER and TESTOPER groups
  • The hierarchy of groups.   If you have defined a profile with owner TESTOPER, people can create resources in this group
    • if they are in the TESTOPER group,
    • or a user who has group-SPECIAL authority over the group which owns the TESTOPER profile
  • Define the profiles: the general MQPA.DISPLAY.**, and the specific MQPA.DEFINE.PSID and MQPA.DEFINE.QLOCAL
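
Once the plan exists as a table, the commands can be generated from it rather than typed by hand. The following is a minimal sketch in Python, not a tested tool. The PLAN table and the function name are made up for illustration; the profile, class and group names are the examples used in this post, and the owner defaults to the MQM group.

```python
# Hypothetical plan table: (profile, class, groups given READ access).
# Names are the examples from this post; adapt them to your own plan.
PLAN = [
    ("MQPA.DISPLAY.**", "MQCMDS", ["ZOPER", "PRODADMN"]),
    ("MQPA.DEFINE.PSID", "MQCMDS", ["ZOPER"]),
    ("MQPA.DEFINE.QLOCAL", "MQCMDS", ["PRODADMN"]),
]


def racf_commands(plan, owner="MQM"):
    """Return the RDEFINE and PERMIT commands for each planned profile."""
    commands = []
    for profile, racf_class, groups in plan:
        commands.append(
            f"RDEFINE {racf_class} {profile} UACC(NONE) OWNER({owner})"
        )
        for group in groups:
            commands.append(
                f"PERMIT {profile} CLASS({racf_class}) ACCESS(READ) ID({group})"
            )
    return commands


if __name__ == "__main__":
    print("\n".join(racf_commands(PLAN)))
```

Keeping the plan as data means the same table can be reviewed by the security team before any command is issued.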

Another thought on MQ security design.

At the beginning of MQ, 25+ years ago, before Sysplex, there was only a single LPAR, and typically only one queue manager, DB2 etc. on each LPAR.  These days people have many “identical” queue managers, which may or may not be in a QSG.

When you create a new queue manager you have to replicate the security profiles, so copy all the profiles from MQPA…. to MQPB….

With hindsight it may have been better to

  • define profiles with a generic name prefix, eg MQHLQ, so you would have MQHLQ.DEFINE.**
  • have a queue manager option SECPFX=MQHLQ which points to these profiles
  • have a class SERVER profile MQ.MQHLQ and grant the queue manager userid access to it.

How do you copy the security profiles for a new queue manager?

I could find no easy way of doing this.  When I worked for IBM I had some REXX code which used IRRXUTIL to extract information from the RACF database and rebuild the RDEFINE and PERMIT statements.

You could also use the RACF database unload utility (IRRDBU00) to unload the database into a file, but most people are not likely to have access to this.
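
If you already have the RDEFINE and PERMIT statements as text, changing the queue manager prefix is the easy part. The following is a minimal sketch in Python, assuming the statements are held one per line and that the queue manager name is the first qualifier of each profile name. The function name retarget is made up for illustration, and the sketch does not extract anything from RACF.

```python
import re


def retarget(commands, old_qmgr, new_qmgr):
    """Change profile names starting 'old_qmgr.' to start 'new_qmgr.'.

    Only whole tokens which begin with the old prefix are changed, so
    an owner or group such as OWNER(MQPAOPS) is left alone.
    """
    pattern = re.compile(r"(?<!\S)" + re.escape(old_qmgr) + r"\.(?=\S)")
    return [pattern.sub(lambda m: new_qmgr + ".", line) for line in commands]


if __name__ == "__main__":
    old = [
        "RDEFINE MQCMDS MQPA.DISPLAY.** UACC(NONE) OWNER(MQM)",
        "PERMIT MQPA.DEFINE.PSID CLASS(MQCMDS) ACCESS(READ) ID(MQOP1)",
    ]
    print("\n".join(retarget(old, "MQPA", "MQPB")))
```

Review the output before running it; owners and permitted groups may need to differ for the new queue manager.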

 

 

What no one tells you about setting up your RACF groups – and how to do it for MQ.

Introduction

The RACF documentation has a lot of excellent reference materials describing the syntax of the commands, but I could not find much useful information on how to set up RACF specifically for products like MQ, CICS, Liberty etc.

It is a bit like saying programming has the following commands: load, store, branch; but failing to tell you that you can do wonderful things, like draw Mandelbrot pictures, using these instructions.

You need to plan your group structure before you try to implement security, as it is hard to change once it is in place.

The big picture

You can set up a hierarchy of groups so the site RACF person can set up a group called MQ, and give the MQ team manager authority to this group.

The manager can

  • define groups within it
  • connect users to the group
  • give other people authority to manage the group.

We can set up the following group structure

  • MQM
    • MQOPS – for the MQ operators
      • MQOPSR for operators who are allowed to issue only Read (display) commands
      • MQOPSW for operators who can issue all commands, display and update
    • MQADMS – for the MQ administrators
      • MQADMR – for MQ administrators who can only use display commands
      • MQADMW – for MQ administrators who can use all commands
    • MQWEB….

You should place an operator’s userid in only one group MQOPSR or MQOPSW as these are used to control access.  MQM, MQOPS, MQADMS, MQWEB are just used for administration.

You permit groups MQOPSR and MQOPSW to issue a display command, but only permit group MQOPSW to issue the SET command.

Setting up groups to make it easy to administer

A group needs an owner which administers the group.  The owner can be a userid or a group.

A group has been set up called MQM, and my manager has been made the owner of it.

My manager has connected my userid PAICE to the MQM group with group special.

CONNECT PAICE GROUP(MQM) SPECIAL

I can define a new group MQOPS for example

ADDGROUP MQOPS SUPGROUP(MQM) OWNER(MQM) DATA('MQ operators')

The SUPGROUP says it is part of the hierarchy under MQM.  I can create the group under MQM because I am authorised.  If I try to create a group with SUPGROUP(SYS1) this will fail because I am not authorised to SYS1.

The OWNER(MQM) says people in the group MQM with group special can administer this new group.

Because my userid (PAICE) has group special for MQM, I can now connect users to groups created this way.  For example, for a group MQMD which was defined under MQM in the same way as MQOPS

CONNECT ADCDB GROUP(MQMD) AUTHORITY(USE)

I can create another group under MQMD called MQMX, and connect a userid to it.

ADDGROUP MQMX SUPGROUP(MQMD) OWNER(MQMD) DATA('MQ Bottom group')
CONNECT ADCDE GROUP(MQMX) AUTHORITY(USE)

My userid PAICE can administer this because of the OWNER() inheritance up to GROUP(MQM).

If I list the groups I get

LISTGRP MQM 
INFORMATION FOR GROUP MQM 
    SUPERIOR GROUP=SYS1 OWNER=IBMUSER 
    SUBGROUP(S)= MQM2 MQMD  
    USER(S)= ACCESS= ACCESS COUNT= UNIVERSAL ACCESS= 
       PAICE    JOIN        000000              NONE 
         CONNECT ATTRIBUTES=SPECIAL 

LISTGRP MQMD 
INFORMATION FOR GROUP MQMD 
    SUPERIOR GROUP=MQM OWNER=MQM 
    SUBGROUP(S)= MQMX 
    USER(S)= ACCESS= ACCESS COUNT= UNIVERSAL ACCESS= 
       ADCDB     USE        000000              NONE 
         CONNECT ATTRIBUTES=NONE

LISTGRP MQMX 
INFORMATION FOR GROUP MQMX 
    SUPERIOR GROUP=MQMD OWNER=MQMD 
    NO SUBGROUPS 
    USER(S)= ACCESS= ACCESS COUNT= UNIVERSAL ACCESS= 
       ADCDE     USE        000000              NONE 
         CONNECT ATTRIBUTES=NONE

All SUPGROUP() does is to define the hierarchy, as we can see from the LISTGRP output.  We can display the groups and draw up a picture of the hierarchy.  You can use the LISTGRP command repeatedly, or use the DSMON program (EXEC PGM=ICHDSM00) with option USEROPT RACGRP to get a picture like

 LEVEL GROUP 
1 SYS1 (IBMUSER ) 
2 | MQM (IBMUSER ) 
3 | | MQMD 
4 | | | MQMX 
3 | | MQM2 (IBMUSER )
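
The same kind of picture can be built from the SUPERIOR GROUP values reported by LISTGRP. The following is a minimal sketch in Python, assuming you have already collected (group, superior group) pairs by hand or by some other means; it does not call RACF, and the layout only approximates the DSMON report.

```python
from collections import defaultdict

# (group, superior group) pairs taken from the LISTGRP output above.
# MQM2 is assumed to be directly under MQM, as the SUBGROUP(S) line shows.
SUPGROUPS = [("MQM", "SYS1"), ("MQMD", "MQM"), ("MQMX", "MQMD"), ("MQM2", "MQM")]


def hierarchy(pairs, root):
    """Return (level, group) tuples, walking the group tree depth first."""
    children = defaultdict(list)
    for group, superior in pairs:
        children[superior].append(group)

    rows = []

    def walk(group, level):
        rows.append((level, group))
        for child in children[group]:
            walk(child, level + 1)

    walk(root, 1)
    return rows


if __name__ == "__main__":
    for level, group in hierarchy(SUPGROUPS, "SYS1"):
        print(level, "| " * (level - 1) + group)
```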

Using OWNER(group) instead of OWNER(userid)

  • If you have OWNER(groupname) it is easy to administer the groups.  When someone joins or leaves the department, you add or remove the userid from groupname.  One change.
  • If you have OWNER(userid), then you have to explicitly connect the userid to each group with group special.  When there is a new person you have to add the userid to each group individually.  When someone leaves the team you have to remove the person’s userid from all of the groups. This could be a lot of work.

Delegation.

You could define an operator MQOP1 and give the userid group-special for group MQOPS.   This userid (MQOP1) can be used to add or remove userids in the MQOPSR and MQOPSW groups.

Looking at the MQOPS groups we could have groups and connected userids

  • MQM with MQ security userids PAICE, BOB having group-special
    • MQOPS with the operations manager and deputy MQOP1, MQOP2 having group special
      • MQOPSR with STUDENT1, STUDENT2 who are only allowed to issue display commands
      • MQOPSW with PAICE, TONYH, CHARLIE
    • MQADMS….

and similarly for the MQ administration team.

Userid PAICE can connect userids to all groups.  MQOP1 can only connect userids to the MQOPSR and MQOPSW groups, and cannot connect userids to the MQ administration groups.

You use groups MQOPSR and MQOPSW for accessing resources. Groups MQM and MQOPS have no authority to access a resource, they are just to make the administration easier.

You may also want to consider having a group for application development.  The group, called PAYRDEVT, is under MQM and is owned by the manager of the payroll development team.

When the annual userid validation check is done, the development manager does the checks, and tells the security department it has been done.

Permissions

There is no inheritance of permissions.  If a userid needs functions available to groups MQMD and MQMX, the user needs to be connected to both groups.

You only connect userids to groups, you cannot have groups within groups.  There may be many groups of userids which are allowed to issue an MQ display command, but only one group who can issue the SET command.
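
To make the “no inheritance” point concrete, here is a minimal sketch in Python of the rule as described above. It is a simplified model for illustration only, not how RACF is implemented: it ignores UACC, permits given directly to a userid, and every other RACF feature. The tables and the profile name MQPA.SET.** are hypothetical.

```python
# Hypothetical data: the groups each userid is connected to, and the
# groups permitted to each command profile.
CONNECTIONS = {
    "STUDENT1": {"MQOPSR"},
    "TONYH": {"MQOPSW"},
    "MQOP1": {"MQOPS"},  # administers the groups, but is in neither access group
}
PERMITS = {
    "MQPA.DISPLAY.**": {"MQOPSR", "MQOPSW"},
    "MQPA.SET.**": {"MQOPSW"},
}


def is_permitted(userid, profile):
    """True if a group the userid is directly connected to is permitted."""
    groups = CONNECTIONS.get(userid, set())
    return bool(groups & PERMITS.get(profile, set()))


if __name__ == "__main__":
    for user in CONNECTIONS:
        for profile in PERMITS:
            print(user, profile, is_permitted(user, profile))
```

In this model MQOP1 gets no access from being connected to MQOPS, even though MQOPSR and MQOPSW sit below MQOPS in the hierarchy.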

 

Suggested MQ groups

You need to consider

  • production and test environments
  • resources shared by queue managers, queue managers with the same configurations in a sysplex which can share definitions
  • queue managers as part of a Queue Sharing Group
  • queue managers that need isolation and so may have common operations groups, but different administration and programming groups.

You might define

  • Group MQPA for the queue manager super group. (MQ, Production system, A)
  • Groups for MQWEB. The Web server roles are described in the MQ documentation.
  • Groups for controlling MQ, operations and administrations, read only or update
  • Groups for who can connect via batch, CICS etc
  • Groups for application usage, who can use which queues

Groups for MQWEB

For MQWEB the MQ documentation describes 4 roles: MQWebAdmin, MQWebAdminRO, MQWebUser, MFTWebAdmin; and there is console and REST access.

Each role should have its own group.  Requests from “Admin” and “Read Only” run with the userid of the MQWEB started task.   Requests from “User” run with the signed on user’s authority.

You might set up groups

  • MQPAWCO MQPA – MQWebAdminRO Console Read Only.
  • MQPAWCU MQPA – MQWebUser  Console User only.  The request operates under the signed on userid authority.
  • MQPAWCA MQPA – MQWebAdmin Console Admin.
  • MQPAWRO MQPA – MQWebAdminRO REST Read Only.
  • MQPAWRU MQPA – MQWebUser  REST User only.   The request operates under the signed on userid authority.
  • MQPAWRA MQPA – MQWebAdmin REST Admin Only.
  • MQPAWFA MQPA – MFTWebAdmin MFT REST Admin. 
  • MQPAWFO MQPA – MFTWebAdmin MFT REST Read Only.

I would expect most people to be in

  • MQPAWCU MQPA – MQWebUser  Console User only.  The request operates under the signed on userid authority.
  • MQPAWRU MQPA – MQWebUser  REST User only.   The request operates under the signed on userid authority.

so you can control who does what, and get reports on any violations etc.  If people use the MQWEB Admin roles, the requests run under the userid of the MQWEB started task, so you do not know who tried to issue a command.

Groups for operations

The operations team may be managing multiple queue managers, so you may need groups

  • PMQOPS for Production
    • PMQOPSR
    • PMQOPSW
  • TMQOPS for Test
    • TMQOPSR
    • TMQOPSW

If some operators are permitted to manage only a subset of the queue managers you will need a group structure that can handle this, so have a special group XMQOPS for this.

  • XMQOPS for  the special queue manager
    • XMQOPSR
    • XMQOPSW

Groups for administration.

This will be similar to operations.

Groups for end users.

This is for people running work using MQ.

Usually there are checks to make sure a userid can connect to the queue manager, using the MQCONN resource.  Some customers have a loose security set up, and rely on CICS to check whether the userid is allowed to use a CICS transaction, rather than checking whether the userid is allowed to access a queue.

No, No, think before you create a naming convention

I remember doing a review for a large customer who had grown by mergers and acquisitions.  We were discussing naming conventions, and whether they had any.

“Naming conventions”, he said, “we love them.  We have hundreds of them around the place”.  He said it was too hard and disruptive to try to get it down to a small number of naming conventions.

I saw someone’s MQ configuration and wished they had thought through their naming convention, or asked someone with more experience.  This is what I saw

  • The MQ libraries were called CSQ910.SCSQAUTH
    • This is OK as it tells you what level of MQ you are using
    • It would be good to have a data set alias of CSQ pointing to CSQ910.  Without this you have to change the JCL for all jobs (compiles, runs etc.) which use CSQ910.  When you move from CSQ810 to CSQ900 you have to change the JCL.  If you then decide to go back to CSQ810 for a week, you have to change the JCL again.  With the alias it is easy: change the alias and the JCL does not need to change.
  • The MQ logs were called CSQ710.QM.LOGCOPY1.DS01, … DS02,…DS03
    • This shows the classic problem of having the queue manager release as part of the object names.  It would have been better to have names like CSQ.QM.LOGCOPY1.DS01 without the MQ version in it.
    • The name does include a queue manager name of sorts, but a queue manager name of QM is not very good.  If you need another queue manager you will have names like QM, QMA, QMB so an inconsistent name.
    • It is good to have the queue manager name as part of the data set name, so if the queue manager was QM01 then have CSQ.QM01.
  • The page sets were CSQ710.QM.PAGESET0, CSQ710.QM.PAGESET1,  CSQ710.QM.PAGESET2,  CSQ710.QM.PAGESET3,  CSQ810.QM.PAGESET4, CSQ910.QM.PAGESET5
    • This shows the naming standard problem as it evolved over time.  They added more page sets, and used the MQ release as the High Level Qualifier.  The page sets are CSQ710,… CSQ810…,  CSQ910… – following the naming standard.

You do not invent a naming convention in isolation, you need to put an architect’s hat on and see the bigger picture, where you have production and test queue managers, different versions of MQ, and see MQ is just a small part of the z/OS infrastructure.

  • People often have one queue manager per LPAR, and name the queue manager after the LPAR.
  • You are likely to have multiple machines, for example to provide availability, so plan for multiple queue managers.
  • You may want different HLQs to be able to identify production queue manager data sets and test queue manager data sets.
  • The security team will need to set up profiles for queue managers. Having MQPROD and MQTEST as a HLQ may make it easier to set up.
  • The storage team (what I used to call data managers)  set up SMS with rules for data set placement. For example production pagesets with a name like MQPROD.**.PSID* go on the newest, fastest, mirrored disks.  MQTEST.** go on older disks.
  • As part of the SMS definitions, the storage team define how often, and when, to back up data sets.   A production page set may be backed up using FlashCopy once an hour.   (This is done within the storage subsystem and takes seconds.   It takes a copy by copying the pointers to the records on disk.)   Non production data sets get backed up overnight.

 

Lessons learned

  • For the IBM provided libraries, include the VRM in the data set names.
  • Define an alias pointing to the current libraries so applications do not need to change JCL.   You could have a Unix Services alias for the files in the zFS.
  • Do not put the MQ release in the queue manager data set names.
  • Use queue manager names that are relevant and scale.
  • Talk to your security and storage managers about the naming conventions; what you want protected, and how you want your queue manager data sets to be managed.
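
A check for the “release in the data set name” problem is easy to automate. The following is a minimal sketch in Python, assuming the release always appears as a qualifier of the form CSQ followed by three digits, as in the examples above. The function name is made up for illustration, and you would apply it only to queue manager data sets, not to the IBM provided libraries, where the VRM is wanted.

```python
import re

# A qualifier such as CSQ710 or CSQ910: the MQ prefix plus a three digit
# version, release, modification level.
RELEASE_QUALIFIER = re.compile(r"^CSQ\d{3}$")


def has_release_qualifier(dsname):
    """True if any qualifier of the data set name looks like CSQvrm."""
    return any(RELEASE_QUALIFIER.match(q) for q in dsname.upper().split("."))


if __name__ == "__main__":
    for name in ("CSQ710.QM.LOGCOPY1.DS01", "CSQ.QM01.LOGCOPY1.DS01"):
        print(name, has_release_qualifier(name))
```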

NETTIME does not just mean net time

I saw a post which mentioned NETTIME, where people assumed it is the network time.   It is more subtle than that.

If NETTIME is small then do not worry.   If NETTIME is large it can mean

  • Slow network times
  • Problems processing messages at the remote end

Consider a well behaved channel where there are 2 messages on the transmission queue

Sender end

  • MQGET first message from the queue
  • TCP Send message in buffer 1
  • MQGET second message from the queue
  • TCP Send message in buffer 2
  • MQGET – no message found
  • Do end of batch processing
    • TCP Send “End of batch” in buffer 3
    • Start timer
    • Wait
    • buffer arrives, wake up application
    • Stop timer. Interval is “Sender wait time”
  • Extract “receiver processing time” from reply buffer
  • Calculate NETTIME = “sender wait time” – “receiver processing time”

Receiver end

  • buffer 1 arrives, wake up Receiver channel application
  • buffer 2 arrives
  • TCP receive buffer 1 from network
  • MQPUT message 1 to the queue
  • buffer 3 arrives
  • TCP receive buffer 2 from network
  • MQPUT message 2 to the queue
  • TCP receive buffer 3 from the network – get “end of batch” flag
    • Start timer
    • Issue commit
    • Stop timer
    • Send “OK” + time interval back to Sender

In this case the NETTIME is the total time less the time at the receiver end.  So NETTIME is the network time.

In round numbers

  • it takes 2 milliseconds from sending the data to it being received
  • get + send takes 0 ms ( the duration of these calls is measured in microseconds)
  • receive (when there is data) + MQPUT and put works, takes 0 ms
  • commit takes 10 ms
  • it takes 1 ms between sending the response and it arriving.
  • “10 ms” is sent in the response message

This is a simplified picture with details omitted.

The sender channel sees 13 ms between the last send and getting the response.  (13 ms – 10 ms) is 3 ms, the time spent in the network.

 

Now introduce a queue full situation at the receiver end

Sender end

  • MQGET first message from the queue
  • TCP Send message in buffer 1
  • MQGET second message from the queue
  • TCP Send message in buffer 2
  • MQGET – no message found
  • Do end of batch processing
    • TCP Send “End of batch” in buffer 3
    • Start timer
    • Wait
    • buffer arrives, wake up application
    • Stop timer. Interval is “Sender wait time”
  • Extract “receiver processing time” from reply buffer
  • Calculate NETTIME = “sender wait time” – “receiver processing time”

Receiver end

  • buffer 1 arrives, wake up Receiver channel application
  • buffer 2 arrives
  • TCP receive buffer 1 from network
  • MQPUT message 1 to the queue – it gets queue full, it pauses
  • buffer 3 arrives.  All of the data is in the buffers in TCP at this end.
  • after 500 ms the MQPUT succeeds.
  • TCP receive buffer 2 from network
  • MQPUT message 2 to the queue
  • TCP receive buffer 3 from the network – get “end of batch” flag
    • Start timer
    • Issue commit
    • Stop timer
    • Send “OK” + time interval back to Sender

In round numbers

  • it takes 2 milliseconds from sending the data to it being received
  • get + send takes 0 ms ( it is in microseconds)
  • receive (when there is data) takes 0 ms
  • the pause and retry took 500 ms
  • the second receive and MQPUT takes 0 ms
  • commit takes 10 ms
  • it takes 1 ms between sending the response and it arriving.
  • “10 ms” is sent (as before) in the response message (the time between the channel code seeing the “end of batch” flag and the end of its processing)
  • Buffer 3 with the “end of batch” flag was sitting in the TCP buffers for 500 ms

The sender channel sees 513 ms between the last send and getting the response.  513 ms – 10 ms is 503 ms, and it reports this as “the time spent in the network”, when in fact the network time was 3 ms, and 500 ms was spent waiting to put the message.
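
The arithmetic for both cases can be written down as a small model. This is a minimal sketch in Python using the round numbers above. The function and parameter names are made up for illustration; it models the calculation described in this post, not the channel code itself.

```python
def nettime_ms(outbound, unread_wait, commit, reply):
    """Return (reported NETTIME, true network time) in milliseconds.

    outbound    - time for the "end of batch" buffer to reach the receiver
    unread_wait - time the buffer sits in TCP while the receiver is busy
    commit      - receiver time from seeing "end of batch" to end of commit
    reply       - time for the response to travel back to the sender
    """
    sender_wait = outbound + unread_wait + commit + reply
    receiver_processing = commit  # the only interval the receiver reports
    return sender_wait - receiver_processing, outbound + reply


if __name__ == "__main__":
    print(nettime_ms(2, 0, 10, 1))    # well behaved channel
    print(nettime_ms(2, 500, 10, 1))  # queue full at the receiver
```

The receiver only starts its timer when it reads the “end of batch” flag, so any time the buffer spends waiting to be read is counted by the sender but not reported by the receiver, and ends up in NETTIME.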

Regardless of the root cause of the problem, a large NETTIME should be investigated:

  • do a TCP ping to do a quick check of the network
  • check the error logs at the receiver end
  • check events etc to see if the queues are filling up at the receiver end

Using Activity Trace to show a picture of which queues and queue managers your application used.

I used the midrange MQ activity trace to show what my simple application, putting a request to a cluster server queue and getting the reply, was doing.  As a proof of concept (200 lines of Python), I  produced the following

The output is in .png format.   You can also create it as an HTML image, with the nodes and links as clickable HTML links.

I’ve ignored any SYSTEM.* queues, so the SYSTEM.CLUSTER.TRANSMIT.QUEUE does not appear.

The red arrows show the “high level” flow between queue managers at the “architectural”, hand waving level.

  • The application oemput on QMA did a put to a clustered queue CSERVER, there is an instance of the queue on QMB and QMC.   There is a red line from QMA.oemput to the queue CSERVER on QMB and QMC
  • The server programs (server), running on QMB and QMC, put the reply message to queue CP0000 on queue manager QMA

The blue arrows show puts to the application specified queue name, even though this may map to the SYSTEM.CLUSTER.TRANSMIT.QUEUE (SCTQ).  There are two blue lines from QMA.oemput because one message went to QMC.CSERVER, and another went to QMB.CSERVER.

The yellow lines show the path the message took between queue managers.  The message was put by QMA.oemput to queue CSERVER; under the covers this was put to the SCTQ.  The trace record shows the remote queue manager and queue name, and the yellow line links them.

The black line is the get from the local queue.

The green line is the get from the underlying queue.  So if I had a reply queue called CP0000, with a QAlias of QAREPLY, and the application did a get from QAREPLY, there would be a black line to CP0000, and a green line to QAREPLY.

How did I get this?

I used the midrange activity trace.

On QMA I had in mqat.ini

applicationTrace:
ApplClass=USER # Application type
ApplName=oemp* # Application name (may be wildcarded)
Trace=ON # Activity trace switch for application
ActivityInterval=30 # Time interval between trace messages
ActivityCount=10 # Number of operations between trace msgs
TraceLevel=MEDIUM  # Amount of data traced for each operation
TraceMessageData=0 # Amount of message data traced

I turned on the activity trace using the runmqsc command

ALTER QMGR ACTVTRC(ON)

I ran some workload, and turned the trace off a few seconds later.

I processed the trace data into a json file using

/opt/mqm/samp/bin/amqsevt -m QMA -q SYSTEM.ADMIN.TRACE.ACTIVITY.QUEUE -w 1 -o json > aa.json

I captured the trace on QMB, then on QMC, so I had three files aa.json, bb.json, cc.json.  Although I captured these at different times, I could have collected them all at the same time.

jq is a “sed” like processor for processing json data.   I used it to process these json files and produce one output file which the Python json support can handle.

jq . --slurp aa.json bb.json cc.json  > all.json
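
If jq is not available, the same “slurp” step can be done in Python. This is a minimal sketch, assuming (as the need for --slurp suggests) that each file holds a series of JSON objects one after another rather than a single JSON array. The function names are made up for illustration; this is not the code from my scripts.

```python
import json


def slurp(text):
    """Parse a string holding several JSON values back to back into a list."""
    decoder = json.JSONDecoder()
    values, pos = [], 0
    while True:
        # raw_decode does not skip leading white space, so skip it here.
        while pos < len(text) and text[pos].isspace():
            pos += 1
        if pos >= len(text):
            return values
        value, pos = decoder.raw_decode(text, pos)
        values.append(value)


def slurp_files(paths):
    """Read every file and return all of the JSON values as one list."""
    events = []
    for path in paths:
        with open(path, encoding="utf-8") as handle:
            events.extend(slurp(handle.read()))
    return events
```

For example, json.dump(slurp_files(["aa.json", "bb.json", "cc.json"]), open("all.json", "w")) would write one array to all.json.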

The two small Python files, ATJson.py and Process.py, are in the zip file AT.

I used the ATJson.py Python script to process the all.json file and extract key data in the following format:

server,COLIN,127.0.0.1,QMC,Put1,CP0000,SYSTEM.CLUSTER.TRANSMIT.QUEUE,QMC,CP0000,QMA, 400

  • server, the name of the application program
  • COLIN, the channel name, or “Local”
  • 127.0.0.1, the IP address, or “Local”
  • QMC, on this queue manager
  • Put1, the verb
  • CP0000, the name of the object used by the application
  • SYSTEM.CLUSTER.TRANSMIT.QUEUE, the queue actually used, under the covers
  • QMC, which queue manager is the SCTQ on
  • CP0000, the destination (remote) queue name
  • QMA, the destination queue manager
  • 400 the number of times this was used, so 400 puts to this queue.
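
Records in this format are easy to read back into a program. The following is a minimal sketch in Python; the field names are labels invented here for the columns described above, and are not taken from ATJson.py.

```python
import csv
from collections import namedtuple

# Field names are illustrative labels for the eleven columns listed above.
Flow = namedtuple(
    "Flow",
    "program channel address qmgr verb object_name real_queue "
    "real_qmgr remote_queue remote_qmgr uses",
)


def parse_flows(lines):
    """Turn comma separated records into Flow tuples, with uses as an int."""
    flows = []
    for row in csv.reader(lines, skipinitialspace=True):
        if row:
            *names, uses = row
            flows.append(Flow(*names, int(uses)))
    return flows


if __name__ == "__main__":
    record = (
        "server,COLIN,127.0.0.1,QMC,Put1,CP0000,"
        "SYSTEM.CLUSTER.TRANSMIT.QUEUE,QMC,CP0000,QMA, 400"
    )
    print(parse_flows([record])[0])
```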

I had another Python program, Process.py, which took this table and used Python graphviz to draw the graph of the contents.  This produces a file of DOT (graph description language) statements, and I used one of the many programs available to draw the chart.
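
To give a flavour of the DOT step, here is a minimal sketch in Python which writes the DOT text directly, without the graphviz package. It is not the code from Process.py; the function name, node names and labels are made up for illustration.

```python
def to_dot(flows):
    """Return DOT text with one labelled edge per (source, target, uses)."""
    lines = ["digraph mq {"]
    for source, target, uses in flows:
        lines.append(f'  "{source}" -> "{target}" [label="{uses}"];')
    lines.append("}")
    return "\n".join(lines)


if __name__ == "__main__":
    # Hypothetical flow based on the example record: the server program
    # on QMC put 400 messages destined for queue CP0000 on QMA.
    print(to_dot([("QMC.server", "QMA.CP0000", 400)]))
```

If the output is saved in a file such as flows.dot, the Graphviz command dot -Tpng flows.dot -o flows.png is one way to draw it.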

This shows you what can be done.  It is not a turn-key solution, but I am willing to spend a bit of time making it easier to use, so you can automate it.  If you are interested, please send me your Activity Trace data, and I’ll see what I can do.

When is mid-range accounting information produced?

I was using the mid-range accounting information to draw graphs of usage, and I found I was missing some data.

There is a “Collect Accounting” time for every queue, set ACCTINT seconds ahead (default 1800 seconds = 30 minutes).  Once this time has passed, the next MQ activity causes the accounting record to be produced.  This does not mean you get records every half hour, as you do on z/OS; it means that, for long running tasks, you get records at least 30 minutes apart.

Setup

I had a server which got from its input queue and put a reply message to the reply-to-queue.

Once a minute an application started, put messages to this server, got the replies and ended.

When are the records produced?

Accounting data is produced (if collecting is enabled) when:

  • an MQDISC is issued, either explicitly or implicitly
  • for long running tasks the accounting record(s) seem to be produced when the current time is past the “Collect Accounting time” and there has been some MQ activity. For example there were accounting records for a server at the following times
    • The queue manager was started at 12:35:51, and the server started soon afterwards
    • 12:36:04 to 13:06:33.   An application put a message to the server queue and got the response back.   This is 27 seconds after the half hour
    • 13:06:33 to 13:36:42  The application had been putting messages to the server and getting the responses back.   This is 6 seconds after the half hour
    • 13:36:42 to 14:29:48.  This interval is 53 minutes.  The server did no work from 14:00 to 14:29:48 (as I was having my lunch).  At 14:29:48 a message arrived, and the accounting record was written for the server.
    • 14:29:48 to 15:00:27 during this time messages were being processed, the interval is just over the 30 minutes.
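The behaviour I observed can be modelled in a few lines. This is a sketch of my reading of the timings above, not a description of the queue manager internals: a record is written at the first MQ activity at or after the interval start plus ACCTINT, and the next interval starts from that moment. The activity times 13:00:00 and 13:20:00 are made up to stand for the messages processed in between.

```python
from datetime import datetime, timedelta


def record_times(start, activity, acctint=1800):
    """Return the activity times at which an accounting record is written.

    Model: a record is written at the first activity at or after
    (interval start + acctint seconds); the interval then restarts there.
    """
    interval = timedelta(seconds=acctint)
    written = []
    for t in sorted(activity):
        if t >= start + interval:
            written.append(t)
            start = t
    return written


def hms(text):
    """Parse an HH:MM:SS string (the date part is irrelevant here)."""
    return datetime.strptime(text, "%H:%M:%S")


activity = [hms(t) for t in
            ("13:00:00", "13:06:33", "13:20:00", "13:36:42", "14:29:48", "15:00:27")]
for t in record_times(hms("12:36:04"), activity):
    print(t.strftime("%H:%M:%S"))
```

With these inputs the model writes records at 13:06:33, 13:36:42, 14:29:48 and 15:00:27, which matches the times listed above.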

What does this mean?

  • If you want accounting data with an interval “on the half hour”, you need to start your queue manager “just before the half hour”.
  • Data may not be in the time period you expect.  If you have an accounting record produced at 16:45, the data collected between 16:45 and 17:14 may not appear until the first message is processed the next day. The record would then have an interval from 16:45 to 09:00:01 the next day.  You may not want to display averages if the interval is longer than 45 minutes.
  • You may want to stop and restart the servers every night to have the accounting data in the correct day.

 

Stackoverflow: What throughput can a standalone Java program achieve?

There was a question in the MQ section of Stack Overflow:

I have a standalone multi threaded java application which listen messages from IBM MQ.
Current system take around 500ms for processing of 1 message after it read from queue and till it commit.
I want to know how many messages I can consume

  • Concurrently:
  • Max number of messages can be processed? or throttle limit

A good meaty performance question I thought.  Let me break this into pieces.

Current system take around 500ms for processing of 1 message after it read from queue and till it commit.

Processing one message and committing should take about 10 milliseconds or less (say 30 ms for a two phase commit).    There is clearly something else going on.  Fix this first.

  1. A long database call.   This could be due to database locking, or a badly designed statement, for example a query which needs to access thousands or millions of rows.
  2. A request to a server far far away
  3. A file system with the speed of writing an illuminated letter to parchment

How many messages I can consume: Concurrently:

Take the worst case of using persistent messages, which require log IO during commit.

For one thread, processing multiple messages before doing a commit means the thread can do more work.  Consider a get taking 1 millisecond, and a commit taking 10 ms. This is one message processed every 11 ms.  If you did 50 gets, taking 50 ms, and a commit taking 10 ms, this is 50 messages in 50 + 10 = 60 ms, which equates to one message every 1.2 milliseconds, almost 10 times faster.    This is how channels can send messages efficiently.   There is a “sweet spot” of messages per commit to give you maximum data processed per second.   This depends on the message size, logging rates and other factors.  For a 100MB message it is one message per commit.  For 10KB messages, this may be 1000 messages per commit.
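The arithmetic can be checked with a tiny sketch. It uses the illustrative figures from the paragraph above (1 ms per get, 10 ms per commit) and deliberately ignores message size and logging limits, so it will not show the “sweet spot”, only the effect of spreading the commit cost over more messages.

```python
def ms_per_message(batch_size, get_ms=1.0, commit_ms=10.0):
    """Average elapsed time per message when batch_size gets share one commit."""
    return (batch_size * get_ms + commit_ms) / batch_size


for n in (1, 10, 50, 1000):
    print(f"{n:5d} messages per commit: {ms_per_message(n):.2f} ms per message")
```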

This may be selfish

This is clearly a great improvement, but possibly selfish.  If the application logic is a get followed by a database insert, followed by a commit, then doing 50 gets, 50 inserts and a commit, will work much faster.  The down side is that the database requests will keep locks until the commit.  These locks may prevent other applications from accessing data, either the recently inserted  records, page locks, or index locks. So overall MQ throughput goes up – but the business transaction suffers.    You need to understand the database and find the optimum number of requests per commit for your business transaction.

How long before the data is visible?

Rather than have one thread process 1000 messages per commit (taking 1010 ms) you may want to have multiple threads processing 10 messages per commit – taking 20 ms.  This means that the data in the database (or replies etc) are visible earlier.    This may be important to your business transaction if you have to worry about response time.
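Using the same illustrative figures (1 ms per get, 10 ms per commit), this sketch compares how long the data stays invisible with how many threads are needed for a given message rate. It assumes the threads do not slow each other down, which is optimistic; the function names are made up for this example.

```python
import math


def unit_of_work_ms(batch_size, get_ms=1.0, commit_ms=10.0):
    """Elapsed time for one unit of work: the data is invisible for this long."""
    return batch_size * get_ms + commit_ms


def threads_needed(target_msgs_per_sec, batch_size, get_ms=1.0, commit_ms=10.0):
    """Threads needed to reach a target rate, assuming perfect parallelism."""
    per_thread = batch_size / unit_of_work_ms(batch_size, get_ms, commit_ms) * 1000
    return math.ceil(target_msgs_per_sec / per_thread)


print(unit_of_work_ms(1000), threads_needed(990, 1000))
print(unit_of_work_ms(10), threads_needed(990, 10))
```

Under these assumptions one thread doing 1000 messages per commit handles about 990 messages a second, and two threads doing 10 messages per commit handle about the same, but the data becomes visible after 20 ms instead of 1010 ms.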

Parallel  threads

  1. Using more threads should improve throughput, unless this is delayed by external factors – such as database locks.
  2. One customer found one thread was optimum because there were no database delays.

How many messages I can consume: Max number of messages can be processed? or throttle limit

There are papers written on this, but here is a one minute overview.

As fast as the queue manager can process data

  1. The rate at which MQ can write its logs
  2. Keep queue data in memory – ( buffer pools on z/OS, queue buffer on midrange), so few messages on the queue.

Threads

  1. Having parallel threads gives you better throughput than one thread.  You get overlapped writing to the log, the units of work are shorter in duration, you can get parallel IO.
  2. You may be limited by the network.   Having multiple threads from an application means the network can be better utilized.  One thread can be receiving data down the wire, while another thread is waiting in commit.
  3. You may be limited by where your programs run – eg short of CPU, or slow IO (for your System.out.println statements)

Application design

  1. You may get delays due to serialization if all threads are using the same queue.
  2. Remove the debug printf or System.out.println statements.
  3. Using a queue per business application is better than all applications sharing the same queue
  4. Using one reply to queue per web server may be better than a shared reply to queue – especially if you use Apache Camel.
  5. Use get first if possible.  Avoid scans of the queues.

 

The short answer….

You should be able to get thousands of 1KB messages a second through your Java application when using multiple threads.

 

What’s the difference between an MQ Message and a JMS Message

I had problems using the MQI Interface  to create a message for a JMS program to receive.

To see what was in the JMS message,  I used a Java program using JMS to write a message, and used my trusty C program to display it.

I could see that there were message properties in the message

Property 0 name <mcd.Msd> value <jms_text>
Property 1 name <jms.Dst> value <queue:///JMSQ1>
Property 2 name <jms.Rto> value <queue:///JMSQ2>
Property 3 name <jms.Tms> value <1571902099742>
Property 4 name <jms.Dlv> value <2>

These are described here.

The mcd.Msd value is one of jms_none, jms_text, jms_bytes, jms_map, jms_stream, jms_object.   This depends on whether you use Message message, BytesMessage message etc to define your message type.  The JMS program receiving the message may be expecting a particular type.

The jms.Rto comes from the message.setJMSReplyTo(…).  It was set in the MQMD.ReplyToQ  as well as the message property.

It took me some time to find how to specify values such as deliveryMode.  I found it here.  For example  message.setDeliveryMode(DeliveryMode.NON_PERSISTENT).   (This comes from javax.jms.DeliveryMode.NON_PERSISTENT, not a com.ibm…. file).
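Some of the property values are easier to read once decoded. The sketch below assumes, as I understand it, that jms.Tms is the JMS timestamp in milliseconds since 1970 and that jms.Dlv holds the javax.jms.DeliveryMode value (1 for NON_PERSISTENT, 2 for PERSISTENT). The function name describe is made up for this example.

```python
from datetime import datetime, timedelta, timezone

# javax.jms.DeliveryMode values.
DELIVERY_MODES = {1: "NON_PERSISTENT", 2: "PERSISTENT"}


def describe(props):
    """Return a copy of the properties with jms.Tms and jms.Dlv made readable."""
    out = dict(props)
    if "jms.Tms" in out:
        millis = int(out["jms.Tms"])
        # Integer arithmetic avoids floating point rounding of the milliseconds.
        when = (datetime.fromtimestamp(millis // 1000, tz=timezone.utc)
                + timedelta(milliseconds=millis % 1000))
        out["jms.Tms"] = when.isoformat(timespec="milliseconds")
    if "jms.Dlv" in out:
        mode = int(out["jms.Dlv"])
        out["jms.Dlv"] = DELIVERY_MODES.get(mode, f"unknown ({mode})")
    return out


print(describe({"mcd.Msd": "jms_text", "jms.Tms": "1571902099742", "jms.Dlv": "2"}))
```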

I converted my simple program from JMQI to JMS, in a couple of hours, and was surprised to find it used fewer lines of code than using the JMQI.   Of course I may find I omitted some work, such as error handling, but it seems to be working OK.

Magic methods to decode Java MQ constants to strings.

I had been struggling with MQ and Java, trying to decode what the return code numbers meant, and found some gems of methods here.

String reasonCode = MQConstants.lookup(2035, "MQRC_.*");  gave MQRC_NOT_AUTHORIZED

and

String decode  = MQConstants.decodeOptions(gmo.options, "MQGMO_.*");  gave me

MQGMO_WAIT | MQGMO_SYNCPOINT_IF_PERSISTENT | MQGMO_FAIL_IF_QUIESCING

I wish I had these a couple of years ago – it would have saved me a lot of time!
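For anyone curious about what decodeOptions is doing, the idea is a bit mask lookup. The Python sketch below is not the IBM implementation, just an illustration of the technique; the table is a small subset of the MQGMO_* values that I copied by hand, and the function name decode_options is my own.

```python
# A hand-copied subset of the MQGMO_* option values.
MQGMO = {
    "MQGMO_WAIT": 0x00000001,
    "MQGMO_SYNCPOINT": 0x00000002,
    "MQGMO_NO_SYNCPOINT": 0x00000004,
    "MQGMO_SYNCPOINT_IF_PERSISTENT": 0x00001000,
    "MQGMO_FAIL_IF_QUIESCING": 0x00002000,
    "MQGMO_CONVERT": 0x00004000,
}


def decode_options(options, table):
    """Return the names of the bits set in options, joined with ' | '."""
    names = [name for name, bit in table.items() if options & bit]
    return " | ".join(names)


print(decode_options(0x0001 | 0x1000 | 0x2000, MQGMO))
```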

 

The methods are

  • static java.lang.String decodeOptions(int optionsP, java.lang.String optionPattern)
    This helper method takes an integer representing a set of IBM MQ options for an MQI structure, and converts them into a string displaying the constants that the options represent.
  • static int getIntValue(java.lang.String name)
    Returns the value of the named MQSeries constant as an int.
  • static java.lang.Object getValue(java.lang.String name)
    Returns the value of the named MQSeries constant.
  • static java.lang.String lookup(int value, java.lang.String filter)
    Returns the MQSeries constant name or names for the supplied int value.
  • static java.lang.String lookup(java.lang.Object value, java.lang.String filter)
    Returns the MQSeries constant name or names for the supplied value of type Integer, String, byte[], or char[].
  • static java.lang.String lookupCompCode(int reason)
    Convenience method for finding the constant name for a completion code.
  • static java.lang.String lookupReasonCode(int reason)
    Convenience method for finding the constant name for a reason code.
  • static void main(java.lang.String[] args)

How do I get a client to disconnect?

I had a question from a customer who asked how they can reduce the number of client connections in use.  They had tried setting a disconnect interval (DISCINT) on the channel, but the connections were like weeds – you kill them off, and they grow back again.

DISCINT is “the length of time after which a channel closes down, if no message arrives during that period”.  This sounds perfect for most people.   The application is in an MQGET, and if no messages arrive, the channel can be disconnected, and the application gets connection broken.   The application can then decide to disconnect or reconnect.
If the application is not in an MQGET, then it will get notified of the broken connection next time it tries to use MQ.

Independent applications

Many applications are well written in that when they get Connection Broken, they just reconnect again, and so the DISCINT has no effect on reducing the number of connections. This may be good for availability but not for resource usage.   It may be good to have 1000 application instances running during the day, but perhaps not overnight when there is no work to do.   I’ve seen instances where the applications do an MQGET every minute, and with 1000 instances this can use a lot of CPU while doing no useful work.  In this case you want unused application instances to stop, and be restarted when needed.

You cannot use triggering with client connections (unless you have a very smart trigger monitor to produce an event which says start a client program over there).

Use automation to periodically check the queue depth and the number of input handles. If there is a high queue depth, or a low number of handles (eg 2), then start more application instances across your back-end servers.  Your applications can then disconnect if they have not received a message within say 10 minutes.  This should keep the right number of application instances active.
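The decision logic for such automation is small. The sketch below only shows the two decisions; it assumes your automation already has the current depth and input handle count for the queue (for example the CURDEPTH and IPPROCS values from DISPLAY QSTATUS). The function names, the depth threshold of 1000 and the step of 5 instances are made-up values; the 2 handles and 10 minutes come from the paragraph above.

```python
def instances_to_start(queue_depth, input_handles,
                       depth_threshold=1000, min_handles=2, step=5):
    """How many extra application instances the automation should start."""
    if queue_depth >= depth_threshold or input_handles <= min_handles:
        return step
    return 0


def should_disconnect(idle_seconds, idle_limit=600):
    """An application instance disconnects after 10 minutes with no message."""
    return idle_seconds >= idle_limit


print(instances_to_start(queue_depth=5000, input_handles=10))
print(instances_to_start(queue_depth=10, input_handles=10))
print(should_disconnect(idle_seconds=30))
```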

An administrator should be able to get this automation set up, but getting the application to disconnect could be a challenge, as this requires the application developer to change the code!

Running under a web server

If your applications are running under a web server you may have mis-configured connection pools.  You can specify the initial size of the pool, and this many connections are made.  As more connections are needed, more can be added to the pool until the pool maximum is reached. You should specify a time out value, so that periodically the pool gets cleaned up, and unused connections are removed, until the pool is back to the initial size.  You should review the initial size of the pools (is it too large?) and the time out value.

This should just be an administrative change.

Good luck, you may be successful in reducing the number of client connections, but do not set your hopes too high.