The z/OS Health Checker has many checks for the configuration of your z/OS system and its components. For example, it checks for expired certificates, and for z/OS parameters which do not follow best practice. The checks are configured into Health Checker, and you can have one or more policies which change these checks. You can make checks inactive if they are not applicable, for example when they refer to a sysplex function and your system is not in a sysplex.
Some checks are shipped as inactive, but most are active.
There are different sorts of check:
- local – these run in the Health Checker address space
- remote – these run in other address spaces and report into the Health Checker
- Rexx – these run in an address space and report to the Health Checker
- global – these run once in a sysplex
Some of the concepts and commands are not very intuitive (for example the documentation is not very clear on how policy and checks are connected), but on the whole it is pretty easy to understand and use.
There is a policy which allows you to change the operational aspects of individual checks, for example to make a check inactive, or to change the description so it provides information specific to your configuration.
You can have multiple policies, so you could have one configuration and policies for different systems (for example different levels of z/OS), or different shifts.
Within each policy each statement has an identifier. It defaults to a sequence number, or you can give each statement a label using the STATEMENT(…) option. Personally I would have used the term “label”, to avoid descriptions like “the STATEMENT parameter is used to identify the statement”. If you specify a STATEMENT, the statement is easy to find. A check might report “changed by COLINZFS” or “changed by 17”; it is easier to find COLINZFS than to find the 17th statement.
When do checks run?
The checks are run
- when HZSPROC is started
- as specified in the configuration (parameter INTERVAL=ONETIME|hhh:mm)
- when the refresh command is issued, such as F HZSPROC,REFRESH,CHECK=(*,*)
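To make the INTERVAL=ONETIME|hhh:mm format concrete, here is a small Python sketch which turns an INTERVAL value into a number of minutes. It is only an illustration: the function name interval_to_minutes is mine, and I have assumed that hhh is one to three digits of hours and mm is 00 to 59.

```python
import re


def interval_to_minutes(value):
    """Return None for ONETIME, otherwise the interval in minutes.

    Assumption: hhh is 1 to 3 digits of hours and mm is 00 to 59.
    """
    text = value.strip().upper()
    if text == "ONETIME":
        return None  # the check runs once and is not rescheduled
    match = re.fullmatch(r"(\d{1,3}):([0-5]\d)", text)
    if match is None:
        raise ValueError(f"expected ONETIME or hhh:mm, got {value!r}")
    hours, minutes = int(match.group(1)), int(match.group(2))
    return hours * 60 + minutes
```

With this, interval_to_minutes("024:00") gives 1440, that is, once a day.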
What output do checks provide?
A successful check writes nothing to the logs. Each exception writes a few lines to the system log. You can run a print job to get a fuller description, including the full message text, operator actions and so on. For example one RACF check gave a “one line” entry on syslog, but the print job listed all the digital certificates which had expired, or were due to expire. The output format is what you typically see in a z/OS messages manual.
You can use the SDSF CK command to display the checks and their status, including when they last ran, and to take action on the checks, such as temporarily disabling or deleting a check.
There are various operator display commands you can issue.
A summary of the Health Checker configuration
The commands are described in the IBM Health Checker for z/OS documentation.
f hzsproc,display,status
gave me
HZS0203I 16.02.33 HZS INFORMATION
POLICY(DEFAULT)
OUTSTANDING EXCEPTIONS: 12
(SEVERITY NONE: 0 LOW: 3 MEDIUM: 8 HIGH: 1)
ELIGIBLE CHECKS: 155 (CURRENTLY RUNNING: 0)
INELIGIBLE CHECKS: 67 DELETED CHECKS: 0
ASID: 0041 LOG STREAM: NOT DEFINED
LOG STREAM WRITES PER HOUR: 1327
LOG STREAM AVERAGE BUFFER SIZE: 2364 BYTES
HZSPDATA DSN: ADCD.S0W1.HZSPDATA
HZSPDATA RECORDS: 828
PARMLIB: AD,CP
ORIGINAL PARMLIB SOURCE: <USER>
OPTIONS: NONE
where
- members HZSPRMAD and HZSPRMCP were used from the sys1.parmlib concatenation
- ORIGINAL PARMLIB SOURCE: <USER> means the definitions were read from the sys1.parmlib concatenation. You can specify PREV, which means use the same as last time, but the use of this is unclear to me.
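If you capture this display (for example from syslog) you can pull the numbers out with a few lines of Python. This is a sketch against the output shown above only; I have assumed the layout of the OUTSTANDING EXCEPTIONS and SEVERITY lines does not change, and the names SAMPLE, outstanding_exceptions and severity_counts are mine.

```python
import re

# Part of the HZS0203I output shown above.
SAMPLE = """\
HZS0203I 16.02.33 HZS INFORMATION
POLICY(DEFAULT)
OUTSTANDING EXCEPTIONS: 12
(SEVERITY NONE: 0 LOW: 3 MEDIUM: 8 HIGH: 1)
ELIGIBLE CHECKS: 155 (CURRENTLY RUNNING: 0)
INELIGIBLE CHECKS: 67 DELETED CHECKS: 0
"""

SEVERITY_RE = re.compile(r"\b(NONE|LOW|MEDIUM|HIGH):\s*(\d+)")


def outstanding_exceptions(text):
    """Return the OUTSTANDING EXCEPTIONS count, or None if not present."""
    match = re.search(r"OUTSTANDING EXCEPTIONS:\s*(\d+)", text)
    return int(match.group(1)) if match else None


def severity_counts(text):
    """Return the per-severity counts from the SEVERITY line."""
    for line in text.splitlines():
        if "SEVERITY" in line:
            return {name: int(count) for name, count in SEVERITY_RE.findall(line)}
    return {}


counts = severity_counts(SAMPLE)
total = outstanding_exceptions(SAMPLE)
print(counts, total)
```

In the sample the four severity counts add up to the OUTSTANDING EXCEPTIONS value, which is a handy sanity check on the parsing.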
Display the status of the individual checks
F HZSPROC,DISPLAY,CHECKS
HZS0200I 16.03.07 CHECK SUMMARY 580
CHECK OWNER CHECK NAME STATE STATUS
IBMCS ZOSMIGV2R4PREV_CS_IWQSC_TCPIP IE INACTIVE
IBMCS CSTCP_IWQ_IPSEC_TCPIP AE SUCCESSFUL
IBMCS CSTCP_CINET_PORTRNG_RSV_TCPIP AE EXCEPTION-MED
IBMCS CSTCP_SYSPLEXMON_RECOV_TCPIP AE SUCCESSFUL
...
A - ACTIVE I - INACTIVE
E - ENABLED D - DISABLED
G - GLOBAL CHECK + - CHECK ERROR MESSAGES ISSUED
This shows
- All of these checks came from the IBMCS (TCPIP) component
- Check ZOSMIGV2R4PREV_CS_IWQSC_TCPIP is inactive
- CSTCP_CINET_PORTRNG_RSV_TCPIP has detected a medium level exception
- State: Indicates whether a check runs at the next specified interval. For example INACTIVE(ENABLED) and ACTIVE(ENABLED)
- Status: Describes the output of the check when it last ran. For example INACTIVE and EXCEPTION-MED
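The two-character STATE column is just the legend letters strung together. As an illustration, this Python sketch expands a state code using the legend printed at the bottom of the display. The names STATE_FLAGS and decode_state are mine, and only the letters in that legend are covered.

```python
# Legend taken from the bottom of the HZS0200I display.
STATE_FLAGS = {
    "A": "ACTIVE",
    "I": "INACTIVE",
    "E": "ENABLED",
    "D": "DISABLED",
    "G": "GLOBAL CHECK",
    "+": "CHECK ERROR MESSAGES ISSUED",
}


def decode_state(code):
    """Expand a STATE code such as 'AE' into the legend descriptions."""
    unknown = [ch for ch in code if ch not in STATE_FLAGS]
    if unknown:
        raise ValueError(f"state letters not in the legend: {unknown}")
    return [STATE_FLAGS[ch] for ch in code]


print(decode_state("AE"))  # ['ACTIVE', 'ENABLED']
print(decode_state("IE"))  # ['INACTIVE', 'ENABLED']
```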
You can display an individual check, or a group of similar checks by using a wildcard.
F HZSPROC,DISPLAY CHECKS,check=(IBMRACF,RACF_I*)
F HZSPROC,DISPLAY CHECKS,check=(IBMRACF,RACF_GRS_RNL)
F HZSPROC,DISPLAY CHECKS,CHECK=(IBMRACF,RACF_I*)
HZS0200I 14.41.21 CHECK SUMMARY 812
CHECK OWNER CHECK NAME STATE STATUS
IBMRACF RACF_ICHAUTAB_NONLPA AE SUCCESSFUL
IBMRACF RACF_IBMUSER_REVOKED IE INACTIVE
...
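If you have a list of check names (for example saved from a display) you can mimic this sort of selection offline. The sketch below uses Python's fnmatch as a stand-in for the Health Checker wildcard matching. That is an assumption on my part, as fnmatch follows shell-style rules which may not be identical to what Health Checker does. The function name select_checks is mine.

```python
from fnmatch import fnmatchcase

# (owner, name) pairs taken from the displays in this post.
CHECKS = [
    ("IBMRACF", "RACF_ICHAUTAB_NONLPA"),
    ("IBMRACF", "RACF_IBMUSER_REVOKED"),
    ("IBMRACF", "RACF_GRS_RNL"),
    ("IBMCS", "CSTCP_IWQ_IPSEC_TCPIP"),
]


def select_checks(checks, owner_pattern, name_pattern):
    """Return the checks whose owner and name both match the patterns."""
    owner_pattern = owner_pattern.upper()
    name_pattern = name_pattern.upper()
    return [
        (owner, name)
        for owner, name in checks
        if fnmatchcase(owner, owner_pattern) and fnmatchcase(name, name_pattern)
    ]


print(select_checks(CHECKS, "IBMRACF", "RACF_I*"))
```

This prints the two RACF_I… checks, the same two that the display above returned.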
Using F HZSPROC,DISPLAY CHECKS,…DETAIL gives a lot of information about each check, such as its state, its status, and when it last ran.
Display a policy
You can display all the items in a policy, or details about a statement in a policy. A policy is a group of changes you make to checks. This could be to make a check inactive, or to change the description (reason) to provide more site-specific information.
F HZSPROC,DISPLAY,POLICY,STATEMENT=COLINS
HZS0204I 17.29.33 POLICY SUMMARY 988
POLICY(DEFAULT)
STMT NAME TYPE CHECK OWNER CHECK NAME
COLINS UPD IBMRACF RACF_SYSPLEX_COMMUNICATION
F HZSPROC,DISPLAY,POLICY,STATEMENT=COLINS,DETAIL
HZS0202I 17.30.16 POLICY DETAIL 992
POLICY(DEFAULT) STATEMENT: COLINS
ORIGIN: HZSPRMUS DATE: 20240120
UPDATE CHECK(IBMRACF,RACF_SYSPLEX_COMMUNICATION)
REASON: Colins - Test/Development env
INACTIVE
Update a policy
You can use the command interface to temporarily update the checks and policy, or you can update the HZSPRMxx members.
For example I updated USER.*.PARMLIB(HZSPRMUS) with
ADDREPLACE POLICY STATEMENT(COLINS)
UPDATE CHECK(IBMRACF,RACF_SYSPLEX_COMMUNICATION)
DATE(20240120)
INACTIVE
REASON('COLIN - Test/Development env')
ADDREPLACE POLICY
UPDATE CHECK(IBMRACF,RACF_PROTECTALL_FAIL)
DATE(20240120)
INACTIVE
REASON('COLIN2- do not want in one person system')
Note:
- Each update needs an ADDREPLACE POLICY… statement
- If you do not provide a STATEMENT, a numerical one is generated for you
- You need a date (see dates in policies). Each check has a default date specified. If you specify a date which is before the check's default date, the policy statement is not applied to the check. I specified the date I changed it to inactive so I have an audit trail!
- I specified a reason why I made it inactive. This reason is displayed when you display the policy details.
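If you have many checks to make inactive, you could generate the statements rather than type them. The following Python sketch builds one statement in the same layout as my example above. It is not an IBM tool: the function name policy_statement is mine, it only produces the INACTIVE form, and because I have not covered how quotes inside a REASON are escaped, it refuses a reason containing a single quote.

```python
def policy_statement(owner, name, date, reason, label=None):
    """Build an ADDREPLACE POLICY statement which makes a check inactive.

    date is yyyymmdd, as in the examples above. label is the optional
    STATEMENT(...) name; without it a number is generated for you.
    """
    if len(date) != 8 or not date.isdigit():
        raise ValueError(f"date must be yyyymmdd, got {date!r}")
    if "'" in reason:
        raise ValueError("single quotes in the reason are not handled here")
    head = "ADDREPLACE POLICY"
    if label:
        head += f" STATEMENT({label})"
    return "\n".join([
        head,
        f"UPDATE CHECK({owner},{name})",
        f"DATE({date})",
        "INACTIVE",
        f"REASON('{reason}')",
    ])


print(policy_statement(
    "IBMRACF", "RACF_SYSPLEX_COMMUNICATION", "20240120",
    "COLIN - Test/Development env", label="COLINS",
))
```

This prints the first of the two statements above. You would still copy the output into your HZSPRMxx member and syntax check it in the usual way.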
I then used the command
F hzsproc,ADD,PARMLIB=(US,CHECK)
to check the syntax and validity, where US is the suffix of the HZSPRMxx member (above). I then used
F hzsproc,ADD,PARMLIB=(US)
to activate the definition.
You can specify which parmlib members are used at startup in the HZSPROC JCL.
How do I check the status of the checks?
In SDSF you can use the CK option to display the checks. There are many columns. The interesting columns (to me) are
- Name: like RACF_PROTECTALL_FAIL
- Owner: (what I would call a component) IBMRACF
- State: Indicates whether a check runs at the next specified interval. For example INACTIVE(ENABLED) and ACTIVE(ENABLED)
- Status: Describes the output of the check when it last ran. For example INACTIVE and EXCEPTION-MEDIUM
- Run count: such as 9
- ModifiedBy: such as STMT(37) in the policy. This allows you to find your update statement. Specifying your own STATEMENT(…) makes it easier to use SRCHFOR to find the member with the update.
- Reason: the description from the check such as PROTECTALL(FAIL) should be enabled.
- Update reason: any update from the policy statement, for example COLIN- do not want in one person system
In SDSF CK, you can use the DL line command to display the information in one screen. Sometimes the line command SV (display in ISPF View) works. You can use the DP line command to display the active policy (if any) for the check.
You can also use the standard SDSF commands sort, arrange and filter (such as FILTER STATUS EQ “INACTIVE”) to limit and change the data displayed.
Printing the full health check log
I used
//IBMHZSPR JOB 1,MSGCLASS=H
//HZSPRINT EXEC PGM=HZSPRNT,TIME=1440,REGION=0M,PARMDD=SYSIN
//SYSIN DD *
CHECK(*,*)
,EXCEPTIONS
//SYSOUT DD SYSOUT=*,DCB=(LRECL=256)
After I cleaned my system I had two exceptions, and a total of 150 lines of output.