Using z/OS health checker is good – but how do I use it?

I had a message on the console saying

HZS0001I CHECK(IBMICSF,ICSF_CLEAR_KEYS): CSFH0056E Clear keys in the CKDS, PKDS, or TKDS were found.

Great – but what next?

In SDSF, you can use the CK command to display the health-checker events.

There were a lot of checks. The SDSF command CK E displays only those with exceptions, where the result was > 0.

There are various line commands, but SE gives you all of the information, in the ISPF editor.

In the page it told me

CSFH0054I Check for clear keys in the CKDS, PKDS, and TKDS. 
Active TKDS: COLIN.SCSFTKDS
-----------------------------------------------------
PKISRVD.PKITOKEN 00000001T
Explanation: ...
System Action: ...
Operator Response: ...
System Programmer Response: ...

Great – this is very useful. However the action “Contact the ICSF administrator” is not very helpful as I am the ICSF administrator!

z/OS health checker – understanding and configuring it

The z/OS Health Checker has many checks for the configuration of your z/OS system and its components. For example, it checks for expired certificates, and for z/OS parameters which are not best practice. The checks are configured into Health Checker, and you can have one or more policies which change these checks. You can make checks inactive if they are not applicable, for example when they refer to a sysplex function and your system is not in a sysplex.

Some checks are shipped as inactive, but most are active.

There are different sorts of check

  • local – these run in the Health Checker address space
  • remote – these run in other address spaces and report into the Health Checker
  • Rexx – these run in an address space and report to the Health Checker
  • global – these run once in a sysplex

Some of the concepts and commands are not very intuitive (for example the documentation is not very clear on how policy and checks are connected), but on the whole it is pretty easy to understand and use.

There is a policy which allows you to change the operational aspects of individual checks, for example make a check inactive, or change the description so it provides information specific to your configuration.

You can have multiple policies, so you could have one configuration and policies for different systems (for example different levels of z/OS), or different shifts. 

Within each policy each statement has an identifier. It defaults to a sequence number, or you can give each statement a label using the STATEMENT(…) option. Personally I would have used the term “label”, to avoid descriptions like “the STATEMENT parameter is used to identify the statement”. Specifying a STATEMENT makes the statement easy to find: a check might report “changed by COLINZFS” or “changed by 17”, and it is easier to find COLINZFS than to find the 17th statement.

When do checks run?

The checks are run when

  • the HZSPROC is started
  • as specified in the configuration (parameter INTERVAL=ONETIME|hhh:mm)
  • when the refresh command is issued, such as F HZSPROC,refresh,check(*,*)
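As a rough illustration of the two forms the INTERVAL parameter can take, here is a small Python sketch (not part of Health Checker) that turns a value into a duration. The function name parse_interval is made up for this example, and the only thing it assumes about the syntax is what is shown above: either ONETIME or hhh:mm.

```python
from datetime import timedelta
from typing import Optional


def parse_interval(value: str) -> Optional[timedelta]:
    """Convert an INTERVAL value into a timedelta.

    Returns None for ONETIME, meaning the check is not rescheduled.
    (Hypothetical helper: it only models the ONETIME|hhh:mm forms.)
    """
    text = value.strip().upper()
    if text == "ONETIME":
        return None
    hours, sep, minutes = text.partition(":")
    if not sep or not hours.isdigit() or not minutes.isdigit():
        raise ValueError(f"expected ONETIME or hhh:mm, got {value!r}")
    return timedelta(hours=int(hours), minutes=int(minutes))


if __name__ == "__main__":
    for sample in ("ONETIME", "024:00", "0:30"):
        print(sample, "->", parse_interval(sample))
```

A value of 024:00 comes back as one day, and ONETIME comes back as None, which a scheduler could treat as "run once and do not repeat".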

What output do checks provide?

A successful check writes nothing to the logs. Each exception writes a few lines to the system log. You can run a print job to get a fuller description, including the full message text, operator actions, etc. For example one RACF check gave a “one line” entry on syslog, but the print job listed all of the digital certificates which had expired, or were due to expire. The output format is what you typically see in a z/OS messages manual.

You can use the SDSF CK command to display the checks and their status, including when they last ran, and to take action on the checks, such as temporarily disabling or deleting a check.

There are various operator display commands you can issue

A summary of the Health Checker configuration

The commands are described here.

f hzsproc,display,status

gave me

HZS0203I  16.02.33 HZS INFORMATION                          
POLICY(DEFAULT)
OUTSTANDING EXCEPTIONS: 12
(SEVERITY NONE: 0 LOW: 3 MEDIUM: 8 HIGH: 1)
ELIGIBLE CHECKS: 155 (CURRENTLY RUNNING: 0)
INELIGIBLE CHECKS: 67 DELETED CHECKS: 0
ASID: 0041 LOG STREAM: NOT DEFINED
LOG STREAM WRITES PER HOUR: 1327
LOG STREAM AVERAGE BUFFER SIZE: 2364 BYTES
HZSPDATA DSN: ADCD.S0W1.HZSPDATA
HZSPDATA RECORDS: 828
PARMLIB: AD,CP
ORIGINAL PARMLIB SOURCE: <USER>
OPTIONS: NONE

where

  • members HZSPRMAD and HZSPRMCP were used from the sys1.parmlib concatenation
  • ORIGINAL PARMLIB SOURCE: <USER> means the definitions were read from the SYS1.PARMLIB concatenation. You can specify PREV, which means use the same as last time, but the use of this is unclear to me.
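The numbers in this display hang together, and a quick way to see that is to pull them out and add them up. The Python sketch below is my own illustration, not an IBM tool. It copies four lines from the output above, and the helper names severity_counts and field are invented for the example.

```python
import re
from typing import Dict

# Lines copied from the HZS0203I output shown above.
STATUS_TEXT = """\
OUTSTANDING EXCEPTIONS: 12
(SEVERITY NONE: 0 LOW: 3 MEDIUM: 8 HIGH: 1)
ELIGIBLE CHECKS: 155 (CURRENTLY RUNNING: 0)
INELIGIBLE CHECKS: 67 DELETED CHECKS: 0
"""


def severity_counts(text: str) -> Dict[str, int]:
    """Return the per-severity counts from the (SEVERITY ...) line."""
    inside = re.search(r"\(SEVERITY ([^)]*)\)", text).group(1)
    pairs = re.findall(r"([A-Z]+): (\d+)", inside)
    return {name: int(count) for name, count in pairs}


def field(text: str, label: str) -> int:
    """Return the number following 'label:'; \\b stops ELIGIBLE matching INELIGIBLE."""
    return int(re.search(r"\b" + re.escape(label) + r": (\d+)", text).group(1))


if __name__ == "__main__":
    counts = severity_counts(STATUS_TEXT)
    print(counts)
    # The severities should add up to the outstanding exceptions: 0+3+8+1 = 12
    print(sum(counts.values()) == field(STATUS_TEXT, "OUTSTANDING EXCEPTIONS"))
    # Eligible plus ineligible gives the total number of checks: 155+67 = 222
    print(field(STATUS_TEXT, "ELIGIBLE CHECKS") + field(STATUS_TEXT, "INELIGIBLE CHECKS"))
```

So the 12 outstanding exceptions are 3 low, 8 medium and 1 high, and there are 222 checks defined, of which 155 are eligible to run.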

Display the status of the individual checks

F HZSPROC,DISPLAY,CHECKS                                               
HZS0200I 16.03.07 CHECK SUMMARY 580
CHECK OWNER CHECK NAME STATE STATUS
IBMCS ZOSMIGV2R4PREV_CS_IWQSC_TCPIP IE INACTIVE
IBMCS CSTCP_IWQ_IPSEC_TCPIP AE SUCCESSFUL
IBMCS CSTCP_CINET_PORTRNG_RSV_TCPIP AE EXCEPTION-MED
IBMCS CSTCP_SYSPLEXMON_RECOV_TCPIP AE SUCCESSFUL
...
A - ACTIVE I - INACTIVE
E - ENABLED D - DISABLED
G - GLOBAL CHECK + - CHECK ERROR MESSAGES ISSUED

This shows

  • All of these checks came from the IBMCS (TCPIP) component
  • Check ZOSMIGV2R4PREV_CS_IWQSC_TCPIP is inactive
  • CSTCP_CINET_PORTRNG_RSV_TCPIP has detected a medium level exception
  • State: Indicates whether a check runs at the next specified interval. For example INACTIVE(ENABLED) and ACTIVE(ENABLED)
  • Status: Describes the output of the check when it last ran.
    For example INACTIVE and EXCEPTION-MED
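If you capture this display (for example from the syslog) and want to process it, the two-letter state is easy to expand using the legend at the bottom of the output. The following Python sketch is only an illustration under some assumptions: it assumes each row has exactly four blank-separated columns as in the sample above, it decodes only the first two letters of the state, and it does not handle the G and + markers. The names CheckRow, decode_state and parse_row are invented for the example.

```python
from typing import NamedTuple

# Taken from the legend printed at the end of the HZS0200I display.
STATE_CODES = {"A": "ACTIVE", "I": "INACTIVE", "E": "ENABLED", "D": "DISABLED"}


class CheckRow(NamedTuple):
    owner: str
    name: str
    state: str
    status: str


def decode_state(code: str) -> str:
    """Expand a state such as 'AE' into 'ACTIVE(ENABLED)'.

    Only the first two letters are decoded; any G or + marker is ignored.
    """
    return f"{STATE_CODES[code[0]]}({STATE_CODES[code[1]]})"


def parse_row(line: str) -> CheckRow:
    """Split one summary row into owner, check name, state and status."""
    owner, name, state, status = line.split(None, 3)
    return CheckRow(owner, name, decode_state(state), status.strip())


if __name__ == "__main__":
    sample = [
        "IBMCS ZOSMIGV2R4PREV_CS_IWQSC_TCPIP IE INACTIVE",
        "IBMCS CSTCP_IWQ_IPSEC_TCPIP AE SUCCESSFUL",
        "IBMCS CSTCP_CINET_PORTRNG_RSV_TCPIP AE EXCEPTION-MED",
    ]
    rows = [parse_row(line) for line in sample]
    # Keep only the checks whose status reports an exception.
    for row in rows:
        if row.status.startswith("EXCEPTION"):
            print(row.owner, row.name, row.state, row.status)
```

With the sample rows this prints just the CSTCP_CINET_PORTRNG_RSV_TCPIP line, with the state shown as ACTIVE(ENABLED).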

You can display an individual or similar checks.

F HZSPROC,DISPLAY CHECKS,check=(IBMRACF,RACF_I*)
F HZSPROC,DISPLAY CHECKS,check=(IBMRACF,RACF_GRS_RNL)

F HZSPROC,DISPLAY CHECKS,CHECK=(IBMRACF,RACF_I*)                    
HZS0200I 14.41.21 CHECK SUMMARY 812
CHECK OWNER CHECK NAME STATE STATUS
IBMRACF RACF_ICHAUTAB_NONLPA AE SUCCESSFUL
IBMRACF RACF_IBMUSER_REVOKED IE INACTIVE
...

Using F HZSPROC,DISPLAY CHECKS,…DETAIL gives a lot of information about each check, such as its state, its status, and when it last ran.

Display a policy

You can display all the items in a policy, or details about a statement in a policy. A policy is a group of changes you make to checks. This could be to make a check inactive, or to change the description (the reason) to provide more site-specific information.


F HZSPROC,DISPLAY,POLICY,STATEMENT=COLINS
HZS0204I 17.29.33 POLICY SUMMARY 988
POLICY(DEFAULT)
STMT NAME TYPE CHECK OWNER CHECK NAME
COLINS UPD IBMRACF RACF_SYSPLEX_COMMUNICATION
F HZSPROC,DISPLAY,POLICY,STATEMENT=COLINS,DETAIL                 
HZS0202I 17.30.16 POLICY DETAIL 992
POLICY(DEFAULT) STATEMENT: COLINS
ORIGIN: HZSPRMUS DATE: 20240120
UPDATE CHECK(IBMRACF,RACF_SYSPLEX_COMMUNICATION)
REASON: Colins - Test/Development env
INACTIVE

Update a policy

You can use the command interface to temporarily update the checks and policy, or you can update the HZSPRMxx members.

For example I updated USER.*.PARMLIB(HZSPRMUS) with

ADDREPLACE POLICY STATEMENT(COLINS) 
UPDATE CHECK(IBMRACF,RACF_SYSPLEX_COMMUNICATION)
DATE(20240120)
INACTIVE
REASON('COLIN - Test/Development env')

ADDREPLACE POLICY
UPDATE CHECK(IBMRACF,RACF_PROTECTALL_FAIL)
DATE(20240120)
INACTIVE
REASON('COLIN2- do not want in one person system')

Note:

  • Each update needs an ADDREPLACE POLICY… statement
  • If you do not provide a STATEMENT, a numerical one is generated for you
  • You need a date (see dates in policies). Each check has a default date. If you specify a date which is before the check's default date, the policy statement is not applied to the check. I specified the date I changed it to inactive so I have an audit trail!
  • I specified a reason why I made it inactive. This reason is displayed when you display the policy details.
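If you have several checks to make inactive, it can be handy to generate the statements rather than type them. The Python sketch below is just my illustration of the layout used above. It is not an IBM utility, the function names policy_statement and date_is_current are invented, and it assumes the reason text contains no quote characters. The check date 20111010 used in the comparison is the CHECK DATE that the print job reports for RACF_CERTIFICATE_EXPIRATION.

```python
from datetime import date
from typing import Optional


def policy_statement(owner: str, check: str, changed: date,
                     reason: str, label: Optional[str] = None) -> str:
    """Build an ADDREPLACE POLICY statement that makes a check inactive.

    Assumption: 'reason' has no quote characters, so no escaping is done.
    """
    head = "ADDREPLACE POLICY"
    if label:
        head += f" STATEMENT({label})"
    return "\n".join([
        head,
        f"UPDATE CHECK({owner},{check})",
        f"DATE({changed:%Y%m%d})",
        "INACTIVE",
        f"REASON('{reason}')",
    ])


def date_is_current(policy_date: str, check_date: str) -> bool:
    """True when the policy date is not earlier than the check's default date.

    Both dates are yyyymmdd strings, so a plain string comparison works.
    """
    return policy_date >= check_date


if __name__ == "__main__":
    print(policy_statement("IBMRACF", "RACF_SYSPLEX_COMMUNICATION",
                           date(2024, 1, 20), "COLIN - Test/Development env",
                           label="COLINS"))
    print(date_is_current("20240120", "20111010"))
```

The first call reproduces the COLINS statement shown above. Leaving out the label gives a plain ADDREPLACE POLICY line, and Health Checker then generates the statement number for you.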

Then used the command

F hzsproc,ADD,PARMLIB=(US,CHECK)

To check the syntax and validity, where US is the suffix of the HZSPRM source (above). I then used

F hzsproc,ADD,PARMLIB=(US)

To activate the definition.

You can specify which parmlib members are used at startup in the HZSPROC JCL.

How do I check the status of the checks?

In SDSF you can use the CK option to display the checks. There are many columns. The interesting columns (to me) are

  • Name: like RACF_PROTECTALL_FAIL
  • Owner: (what I would call a component) IBMRACF
  • State: Indicates whether a check runs at the next specified interval. For example INACTIVE(ENABLED) and ACTIVE(ENABLED)
  • Status: Describes the output of the check when it last ran.
    For example INACTIVE and EXCEPTION-MEDIUM
  • Run count: such as 9
  • ModifiedBy: such as STMT(37) in the policy. This allows you to find your update statement. Specifying your own STATEMENT(…) makes it easier to use SRCHFOR to find the member with the update.
  • Reason: the description from the check such as PROTECTALL(FAIL) should be enabled.
  • Update reason: any update from the policy statement, for example COLIN- do not want in one person system

In SDSF CK, you can use the DL line command to display the information in one screen. Sometimes the line command SV (display in ISPF View) works. You can use the DP line command to display the active policy (if any) for the check.

You can also use the standard SDSF commands sort, arrange and filter (such as FILTER STATUS EQ “INACTIVE”) to limit and change the data displayed.

Printing the full health check log

I used

//IBMHZSPR JOB 1,MSGCLASS=H 
//HZSPRINT EXEC PGM=HZSPRNT,TIME=1440,REGION=0M,PARMDD=SYSIN
//SYSIN DD *
CHECK(*,*)
,EXCEPTIONS
//SYSOUT DD SYSOUT=*,DCB=(LRECL=256)

After I cleaned my system I had two exceptions, and a total of 150 lines of output.

One minute MVS: Health checker

The z/OS Health checker is a great facility, and makes the systems programmer’s job much easier. z/OS provides a set of configuration guidelines, such as the value for … should be …. At IPL and periodically, it checks the system and reports anything which is out of line. This allows you to check your configuration is consistent with best practice, and may identify problems you were not aware of.

For example it reported

  • I had some digital certificates about to expire or had already expired – whoops.
  • Some OMVS mounts had failed (because the entries in the BPXPRM… file were not active on the system)
  • Some storage allocations were not as recommended.

When I printed out the full report, it told me what the recommended values were, and what values I had in my configuration, so it was easy to change.

You can have different sorts of checks

  • local – these run in the Health Checker address space
  • remote – these run in other address spaces and report into the Health Checker
  • Rexx – these run in an address space and report to the Health Checker

You can print out the full list of problems, and this comes with comprehensive help information and instructions on what to do about the problem.

Example output in syslog

HZS0001I CHECK(IBMCS,CSVTAM_CSM_STG_LIMIT): 442                       
ISTH017E Communications storage manager (CSM) storage allocation
definitions might not be optimal
HZS0002E CHECK(IBMRACF,RACF_JESJOBS_ACTIVE): 443
IRRH229E The class JESJOBS is not active.
HZS0001I CHECK(IBMOCE,OCE_XTIOT_CHECK): 444
IECH0101E OPEN macro support for XTIOT, uncaptured UCBs and DSAB
above the line is not enabled for non-VSAM. IBM recommends setting
NON_VSAM_XTIOT=YES in the DEVSUPxx member of PARMLIB.
HZS0001I CHECK(IBMRACF,RACF_PASSWORD_CONTROLS): 445
IRRH283E The RACF_PASSWORD_CONTROLS check found an exception
with one or more password control settings.
HZS0002E CHECK(IBMXCF,XCF_TCLASS_CLASSLEN): 446
IXCH0420E The XCF transport class size segregation configuration on
system S0W1 is inconsistent with the owner specification.

You can disable health checks which you do not want, so after cleaning your system, you should aim to have no health check exceptions.
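If you want to keep track of which checks are still raising exceptions, the syslog entries are regular enough to pick apart. The Python sketch below is my own illustration. It assumes each exception starts with an HZSnnnnx CHECK(owner,name): header line followed by the message text, as in the example output above, and the name exceptions_by_check is made up for the example.

```python
import re
from typing import Dict, Iterable, Tuple

# Header line, for example: HZS0002E CHECK(IBMRACF,RACF_JESJOBS_ACTIVE): 443
HEADER = re.compile(r"^(HZS\d{4}[A-Z]) CHECK\((\w+),(\w+)\):")


def exceptions_by_check(lines: Iterable[str]) -> Dict[Tuple[str, str], dict]:
    """Group message text under the HZS header line that precedes it."""
    results: Dict[Tuple[str, str], dict] = {}
    current = None
    for line in lines:
        match = HEADER.match(line)
        if match:
            current = (match.group(2), match.group(3))
            results[current] = {"header": match.group(1), "text": []}
        elif current is not None and line.strip():
            results[current]["text"].append(line.strip())
    return results


if __name__ == "__main__":
    syslog = [
        "HZS0001I CHECK(IBMCS,CSVTAM_CSM_STG_LIMIT): 442",
        "ISTH017E Communications storage manager (CSM) storage allocation",
        "definitions might not be optimal",
        "HZS0002E CHECK(IBMRACF,RACF_JESJOBS_ACTIVE): 443",
        "IRRH229E The class JESJOBS is not active.",
    ]
    for (owner, name), entry in exceptions_by_check(syslog).items():
        message_id = entry["text"][0].split()[0]
        print(owner, name, entry["header"], message_id)
```

For the sample lines this lists each check with its HZS header message and the ID of the first message line, which is the ID to look for in the print job output.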

What do these mean?

You can run a print job

//IBMHZS   JOB 1,MSGCLASS=H 
//HZSPRINT EXEC PGM=HZSPRNT,TIME=1440,REGION=0M,PARMDD=SYSIN
//SYSIN DD *
CHECK(*,*)
,EXCEPTIONS
//SYSOUT DD SYSOUT=*,DCB=(LRECL=256)

Example output of print job

Certificates Expiring within 60 Days

CHECK(IBMRACF,RACF_CERTIFICATE_EXPIRATION)                                 
SYSPLEX: ADCDPL SYSTEM: S0W1
START TIME: 01/19/2024 07:14:39.529686
CHECK DATE: 20111010 CHECK SEVERITY: MEDIUM

Certificates Expiring within 60 Days

S Cert Owner Certificate Label End Date Trust Rings
- ------------ -------------------------------- ---------- ----- -----
CERTAUTH Verisign Class 1 Individual CA 2008-05-12 No 0
E ID(START1) JES2 CLIENT EDS 2019-03-21 Yes 1
CERTAUTH GTE CyberTrust Root CA 2006-02-23 No 0
...
Only certificates that are marked as trusted result in exceptions.
Exceptions are indicated by an "E" or an "M" in the "S" (Status)
column. An "E" indicates that the certificate has expired within
time period examined by the check. An "M" indicates that the
certificate has no end date in the certificate profile. The trust
status of the certificate is shown in the "Trust" column. The number
of key rings to which the certificate is connected (other than the
virtual key ring) is shown in the "Rings" column. A value of "99999"
in the "Rings" column indicates that the certificate is connected to
99999 or more rings.
Use the RACDCERT LIST command to list complete information about any
certificate. The RACDCERT command syntax is:

RACDCERT CERTAUTH LIST(LABEL('label-name'))
or
RACDCERT SITE LIST(LABEL('label-name'))
or
RACDCERT ID(user-id) LIST(LABEL('label-name'))
...

BPXH061E One or more file systems specified in the BPXPRMxx parmlib
members are not mounted.

* High Severity Exception *      

BPXH059I The following file systems are not active:
-----------------------------------------------------------
File System: ZWE200.ZFS
Parmlib Member: BPXPRMZW
Path: /usr/lpp/zowe
Return Code: 00000099
Reason Code: EF096150

File System: ZWE200.CONFIG.ZFS
Parmlib Member: BPXPRMZW
Path: /apps/zowe/v20
Return Code: 00000099
Reason Code: EF096150

Whoops – I missed that one due to a finger problem

CSFH0042I Check for weak CCA cryptographic keys in the PKDS

CHECK(IBMICSF,ICSF_WEAK_CCA_KEYS)                                                   
SYSPLEX: ADCDPL SYSTEM: S0W1
START TIME: 01/19/2024 07:15:00.161074
CHECK DATE: 20181101 CHECK SEVERITY: LOW

CSFH0042I Check for weak CCA cryptographic keys in the PKDS

Active PKDS: CSF.CSFPKDS.NEW
---------------------------------------------------------
COLIN
COLIN2

* Low Severity Exception *

CSFH0044E Weak CCA cryptographic keys in the PKDS were found.
....

EZBH008E The port range defined for CINET use has not been reserved for
OMVS on this stack.

CHECK(IBMCS,CSTCP_CINET_PORTRNG_RSV_TCPIP)                                        
SYSPLEX: ADCDPL SYSTEM: S0W1
START TIME: 01/19/2024 07:14:59.665575
CHECK DATE: 20070901 CHECK SEVERITY: MEDIUM

* Medium Severity Exception *

EZBH008E The port range defined for CINET use has not been reserved for
OMVS on this stack.

Explanation: The port range defined for CINET use in the BPXPRMxx
parmlib member is not reserved for OMVS on this stack.
...