Giving a started task userid a password should be a sackable offence.

Last week I was going through some product documentation, and I got to the part where it said “Now change the started task userid, and give it a password”. I made a note to raise a defect on this, because this is a no-no.
This week someone asked on IBM-MAIN, the impact of giving a started task userid a password. There were many well informed comments covering things I didn’t know about, so I thought I’d make a blog post of of the comments.

Best practices dictate that passwords only be provided when a userid will be used by a specific person and that person needs a password for that userid. All other userids should never have a password.

A userid can be revoked because of too many invalid password attempts, or revoked because of inactivity. If a userid is revoked it cannot logon. If the userid of a started task is revoked, the started task may start, but it may be restricted as to what it can do – because it cannot logon.

You can alter a userid to have NOPASSWORD (and have no PHRASE). This means started tasks can start, but the userid cannot be used to logon to the system. This is known as making the userid PROTECTED.

Often started task userids have special capabilities, such as running as different userids, being able to set security options, or modify system storage. This means you do not want your Help Desk staff from adding or resetting passwords for protected userid. Changes to these protected userids should be done from a secure, limited access userid.

If you think through the impact of using started tasks. It may be better for all routine production jobs to be run as started tasks. This has the advantage that there are no passwords involved, and you can use automation to issue the start command based on a timer.

You might have a CICS started task userid for all CICS regions or just for a subset of regions. You might have one started task userid for all started tasks, or a started task userid for each logical instance, eg CICS, MQ, DB2, Zowe, TCP/IP etc.

System jobs

System jobs should be run as started tasks, and the started tasks should be protected

Personal jobs

You can submit jobs from your userid, (and not specify a password) and the job will run under your userid.

You can put USER=name,PASSWORD=… on a job card, and if these validate the job will run with the specified userid. This is not a good idea, as the password may be visible in the dataset.

Departmental userids

You can put USER=name and omit the password, and use surrogate checking. The documentation says

You can allow the use of surrogate users. A surrogate user is a RACF-defined user who has been authorized to submit jobs on behalf of another user (the execution user) without specifying the execution user’s password. Jobs submitted by a surrogate user run with the identity of the execution user. For example, if user JOE submits a job with the following JOB statement, JOE is the surrogate user and TOM is the execution user:

JOE can submit userid containing

//jobname JOB 'accounting-information',USER=TOM

You set up a security profile (see the documentation) to control which userid can specify a userid on the JOB USER= statement.

All access checks are done with TOM’s user ID.

The TOM userid can be a protected userid – without a password, if surrogates are used.

To set up surrogates, defined the profile, and give a group access to the profile, rather than give userids access. You are likely to have a group already defined. Administration, such as when someone leaves the department, is much easier, as you just need to remove the persons userid from the group.

Thanks to

Robert S. Hansel, Seymour J Metz, Steve Beaver, Jack Zukt, Mike Schwab, Jon Perryman for their comments.

Jump up and down: Do not give userids access to resources!

I am doing some work connecting subsystems together. The documentation for all the products involved, describe giving a userid permission to access to a resource.

This is not best practice. Like many things it will be obvious once you understand it. You need to look at the bigger pictures.
The documentation says for userids who want to use “this”, give them access to the profile “…”. What’s wrong with that?

  • If you have a department of 1000 people, giving them all access to the resource will be tedious
  • There are likely to be several resources people need access to, so connecting these 1000 userids to multiple resources will be even more tedious.
  • Someone joins your department – so you need to connect their userid to the long list of groups.
  • Someone leaves your department – you cannot trivially ask what resources can this userid access – you have to look at the access list for each resource and remove the userid from the group.

You most probably have groups set up already. Rather than give the userid access, give the group access to the resource.

If someone joins your group – you connect their userid to the groups – and they have access. If someone leaves your department, remove them from the groups – and they no longer have access to the resources.

You my want to support a new product… That’s easy – give the group(s) access to the resources – and the people will auto-magically get access.

As I said it is obvious once you see it.

Do not give userids access to resources – give groups access, and connect the userid to the group. I’ll go and raise documentation comments on the products’ documentation.

Zowe: Planning – certificates

A private key is used for encryption and should be kept private. If someone has access to the private key they can impersonate you.

A public key is the opposite of the private key. It is used for decrypting data encrypted with the private key. The public key can (and should) be generally available.

Certificates are used in TLS for authentication and encryption, and can be used for identification. They include a public key.

A Certificate Authority(CA) certificate is used to validate other certificates. It involves doing a checksum of a certificate, and encrypting it with the CA private key – a process known as signing the certificate

To validate a certificate, the recipient needs a copy of the CA which was used to sign the original certificate. It can the decrypt the encrypted checksum, and compare it with the certificates checksum.

If you create a Certificate Authority certificate you will need to distribute this to all machines that might communicate with your server(possibly thousands of machines), and installed into the keystore on those machines. This can be a big task. Most system have one site wide CA certificate which is distributed to all machines. You might have a second CA to limit access to a system. This CA is used to sign the server certificates.

Creating certificates

As part of creating a certificate you create the private/public keys. There are different algorithms for these keys. Some are stronger than others (it takes more time and CPU to break them) Keys with elliptic curve algorithms are generally stronger than using RSA techniques, and there other techniques resistant to Quantum Computing.

You have to use a server certificate with RSA and keysize 2048.

I found that authentication with JWT (Java Web Tokens) only worked with RSA keys and not Elliptic Curves. This is because of encryption with JWT.

Key stores

You should use keyrings rather than .pem files, as they are more secure. .pem files can be copied, and anyone with authority can copy them. You have to give explicit permission to be allowed to access a keyring. Certificate and the private keys within a keyring can be stored in cryptographic hardware, and the private keys are never exposed in clear text.

Many systems uses a keystore for storing the private key used by the server, and a trust store for the Certificate Authority keys needed to validate any client certificate sent to the server. This can, but is not recommended, be the same as the keystore.

The keystore will need the server key. You can specify which key should be used.

  • the Certificate Authority keys needed to validate the server key.
  • the server userid needs access to the keyring. If the private key belongs to the server’s userid, then the server’s userid needs read access to the keyring. If the private key belongs to a different userid, the server’s userid needs update access to the keyring. See here for more information.

The trust store needs all the CA certificates which may be needed to verify a client certificate.

The client machine will have one or more certificates. A copy of the CA used to create these needs to be installed in the trust store on the server.

I understand that if you add a certificate to the trust store, you need to restart Zowe for it to be picked up, so try to get a list of all the CAs you will need before you start.

Validating certificates

The trust store needs the certificates to validate any client certificates sent to the server. This will usually just be Certificate Authority certificates.

Zowe works with z/OSMF. They communicate with certificates. The z/OSMF trust store keyring needs the CA of the Zowe server certificate, and the Zowe trust store keyring needs the CA of the z/OSMF server key. If they are not set up properly you can get messages “LoadBalancer does not contain an instance for the service zaas”.

If you have a site wide CA, which is the same for both of the server keys, then you do not need to do any more work, as both trust store keyrings will already have the CA. If Zowe and z/OSMF have different CA certificates, the CA certificates need to be connected to the other keyrings.

Subject Alternative Name

The Subject Alternative Name within a certificate provides the IP addresses, or IP Names of the server. A client can check this address with the IP address of the session, and terminate the session if they do not match. This is considered best practice. This check can be disabled in Zowe by using verifyCertificates NOSTRICT in the zowe.yaml file.

RACF allows one name or address when the certificate is defined, for example ALTNAME(IP(127.0.0.1))

On my z/OS system the command TSO NETSTAT HOME gives three addresses 127.0.0.1 and 10.1.1.2 and 10.1.2.6.

You can configure your sysplex, so all systems in the sysplex have the same IP address, and traffic gets routed internally to the correct system. Without this, if you start a server on a different LPAR it will have a different IP address, and so the validation will fail.

On my system, the zOSMF certificate did not have an ALTNAME specified, and so failed the Zowe checks. I had to set the Zowe option verifyCertificates NOSTRICT for it to work, until I fixed the certificate.

If the z/OSMF certificate has an ALTNAME(IP…) specified, use the IP address value when you configure zOSMF for example

zOSMF: 
host: 127.0.0.1
port: 10443
applId: IZUDFLT

Mapping certificates

If you are using client certificate to authenticate rather than a userid and password, then you’ll need to map certificates to userids, for example with the RACDCERT MAP command. You can specify which CA the certificates were signed with, and fields from the subject Distinguished Name. The question of which is better: using userid and password, or client certificate to logon has no easy answer

  • It is easy for a hacker to get a password (lost handbag, yellow sticky stuck to the screen, a couple of pints down the pub) It is easy to change a password once is is compromised.
  • It is harder for a hacker to get a certificate – but it is harder to change and re-issue a certificate. You have to get the updated certificate down to the client’s machine. It could be stolen if a hacker has access to the machine.
  • Using Biometric data to logon is the ultimate limit in this area. Hackers could steal it – but there is no way of changing it if it is compromised!

Decisions

You need to decide

  • Are you going to have a CA just for Zowe? or reuse the site CA.
    • If you are going to have a Zowe specific CA – how are you going to distribute it to all the client machines.
    • You’ll need to ensure Zowe and z/OSMF have the other’s CA certificates in their trust store.
  • Are you going to use Subject Alternative Name ALTNAME(IP(10.1.1.2))
    • What value of verifyCertificates STRICT|NOSTRICT are you going to specify in the zowe.yaml file.
  • Are you going to authenticate using certificates. You will need to set up mapping from certificate to userid. RACDCERT MAP
  • If you will be using JWT, and so need an RSA key in the server certificate

Why can’t java use my key ring?

I had a problem with z/OSMF. I configured it to use an exiting keyring, but it consistently refused to use it. I had messages like

[WARNING ] CWPKI0809W: There is a failure loading the defaultKeyStore keystore. If an SSL configuration references the defaultKeyStore keystore, then the SSL configuration will fail to initialize.

This blog post covers how I debugged this situation.

What seemed strange was this only occurred when an Elliptic Curve certificate was being used – and not an RSA certificate.

Even more curiouser was the documentation mentioned access to the <ringOwner>.<ringName>.LST resource in the RDATALIB class. See here. I didn’t have this defined and yet RSA certificates would work! So curiouser and curiouser (or for the people who like correct grammar, curiouser and more curiouser).

All applications needing access to certificates and private keys use the R_datalib callable service.

The bottom line

  • z/OSMF has userid IZUSVR
  • I had a keyring and used two certificates
    • An RSA certificate, CCPKeyring.IZUDFLT, belonging to userid IZUSVR – based on the sample JCL provided by z/OSMF
    • An existing Elliptic Curve certificate NISTEC224 belonging to userid COLIN. This works else where.
  • Without <ringOwner>.<ringName>.LST defined the class(RDATALIB) the RSA certificate worked
  • Without <ringOwner>.<ringName>.LST defined the class(RDATALIB) the Elliptic Curve certificate failed
  • Once I found the problem I defined <ringOwner>.<ringName>.LST in class(RDATALIB), and gave the userid IZUSVR Update access to it – and the Elliptic curve worked
  • The reasons (being wise after the event)
    • R_datalib checks access on one profile in the RDATALIB class first – <ringowner>.<ringname>.LST. If there is none, it will fall back to check on two profiles in the FACILITY class – IRR.DIGTCERT.LISTRING and IRR.DIGTCERT.GENCERT. If the certificate is not owned by the accessing ID (except CERTAUTH or SITE), RDATALIB class has to be used for private key access.
    • This is true for the RSA certificate, used the IRRDIGTCERT.LISTRING class(FACILITY) and had access. So this worked.
    • For the Elliptic Curve, the caller’s userid (IZUSVR) is not the associated with the certificate (COLIN) so this fails, and the logic drops through to the RDATALIB checking.
    • The caller’s user ID has READ or UPDATE authority to the ..LST resource in the RDATALIB class. READ access enables retrieving one’s own private key, UPDATE access enables retrieving other’s. The ring did not exist, and so this access was not given.

How did I debug this? – Using Java trace

Adding configuration to z/OSMF

I copied /global/zosmf/configuration/local_override.cfg to /global/zosmf/configuration/local_override.colin

I edited/global/zosmf/configuration/local_override.cfg and changes the JVM options line to

JVM_OPTIONS=”-Xoptionsfile=’/global/zosmf/configuration/local_override.colin'”

I edited the local_override.colin, deleted all but the JVM options line, then split the line at \n so it looks like

-Dcom.ibm.ws.classloading.tcclLockWaitTimeMillis=300000
-Xscmx150M
-Xquickstart

Add debug information to the configuraton file

I added

-Djava.security.auth.debug=pkcs11keystore
-Dlog.level=Error

The output

[err] Jan 17, 2025 8:18:52 AM com.ibm.crypto.ibmjcehybrid.provider.HybridRACFKeyStore engineLoad 
TRACE: Loading keyring CCPKeyring.IZUDFLT as a JCECCARACFKS type keystore.
...
[err] Jan 17, 2025 8:19:02 AM com.ibm.crypto.hdwrCCA.provider.RACFInputStream getEntry
FINER: The private key of NISTEC224 is not available or no authority to access the private key
[err] Jan 17, 2025 8:19:02 AM com.ibm.crypto.ibmjcehybrid.provider.HybridRACFKeyStore engineLoad
TRACE: Error loading and storing certificates and key material from underlying JCECCARACFKS keyring CCPKeyring.IZUDFLT
java.io.IOException: The private key of NISTEC224 is not available or no authority to access the private key . This can be expected if the IBMJCECCA is not setup correctly or
ICSF is down. Will now attempt to load the keyring as a JCERACFKS keyring.

Which is not a very helpful message.

How did I debug this? – Using RACF trace

R_datalib is the callable service to ALL the exploiters which need access to a RACF keyring (certificates and private keys). It is r_datalib or its alias irrsdl00 with callable type number 41.

Enable the RACF trace

#SET TRACE(CALLABLE(TYPE(41))JOBNAME(IZU*))

Start GTF

S GTF.GTF,M=GTFRACF

This reported

IEF403I GTF - STARTED - TIME=08.17.03                                  
IEF188I PROBLEM PROGRAM ATTRIBUTES ASSIGNED
AHL121I TRACE OPTION INPUT INDICATED FROM MEMBER GTFRACF OF PDS
USER.Z24C.PROCLIB
TRACE=USRP
USR=(F44)
END
AHL103I TRACE OPTIONS SELECTED --USR=(F44)
AHL906I THE OUTPUT BLOCK SIZE OF 27998 WILL BE USED FOR OUTPUT 702
DATA SETS:
SYS1.TRACE

I started z/OSMF until it failed.

Stop GTF

p GTF 
AHL006I GTF ACKNOWLEDGES STOP COMMAND
AHL904I THE FOLLOWING TRACE DATASETS CONTAIN TRACE DATA :
SYS1.TRACE

Use IPCS to look at the dump, using command GTF USR(ALL). Go to the bottom of the output, use the command report view. This gives an ISPF edit session.

  • x all
  • f ‘RACF Reason code:’ all
    • You are interested in the non zero codes. “Label” each line of interest using the line prefix command .a, .b etc.
  • reset
  • loc .a
    • This will position you by the labelled line. Look up the RACF return and reason codes here. I had Reason Code 2c, which is decimal 44. Look for the keyring, or other information. I do not know which data tells you which sub operation r_datalib was doing, but for me it had the keyring name “CCPKeyring.IZUDFLT “. The description in the reason code documentation does not cover the situation of not having update access to the keyring, so I’ve raised a doc comment on it.

Using and debugging RACF CLASS(APPL) and pthread_security_np

I’ve blogged I want to be someone else – or using pthread_security_np, how a server can be passed a userid or password to logon to a server to do some work. For example I was logging on from a web server to the RMF GPMSERVE for displaying RMF reports in a a web browser.

I had followed the documentation, but all my userids had access, when none of them should have done. This blog post covers the steps I used to dig into this.

The RACF calls used, have a flag set “Suppress any RACF Messages”, which makes it harder to diagnose problems.

With most RACF calls you can define a profile

ADDSD 'COLIN.PROT.DATASET' UACC(NONE) WARNING

where the WARNING option says “If the userid does not have access – give the userid access, but write a message on the console”. This allows you to find when you need to grant access. Once the profile is established, you can use the ALTSD … NOWARNING. Userids requesting access, but which are not permitted will now fail.

Defining a profile with CLASS(APPL), the warning has no effect. I had to use NOTIFY(COLIN) to get notified of problems.

Tracing the requests

To trace all of the RACF calls made by my job COLINNT I issued

#set trace(callable(all),racroute(all),jobname(COLINNT))      

This gave me many hundreds of calls.

Once I had been through the whole process, I found the trace for the pthread_security_np etc calls was just

set trace(callable(type(38)),jobname(COLINNT))

Collect a trace

See Collecting and understanding a RACF GTF trace output.

Looking at the trace output

There is a trace record “before” (PRE) matching “after” (POST). Many parameters are the same (as you might expect).

The output of these traces is verbose, with unnecessary information and with extra blanks lines. For example for one parameter

   Area length:                  00000008 

Area value:
D6C6C6E2 C5E30000 | OFFSET.. |

Area length: 00000004

Area value:
00000000 | .... |

Area length: 00000008

In the description below, I’ve squashed this down to a single line

area length  8  OFFSET0000 length 4 00000000 

It looks like the field OFFSET0000 has the offset in hex from somewhere. I didn’t find this information useful.

A compressed trace record is given below. For the interpretation of the fields see the following trace record.

RTRACE
OMVSPRE
Service number 00000026
Parameters
area length x6c
area length 8 OFFSET0000 length 4 00000000
area length 8 OFFSET0004 length 4 00000000
area length 8 OFFSET0008 length 4 00000000
area length 8 OFFSET000C length 4 00000000
area length 8 OFFSET0010 length 4 40404040
area length 8 OFFSET0014 length 4 00000000
area length 8 OFFSET0018 length 4 40404040
area length 8 OFFSET001C length 1 01
area length 8 OFFSET0020 length 4 C4800000
area length 8 OFFSET0024 length 6 05c1c2c3c4C2 = .ABCDB
area length 8 OFFSET0028 length 4 00000000
area length 8 OFFSET002C length 9 08C7D7D4 E2C5D9E5 C5 = .GPMSERVE
Internal information
area length 8 OFFSET0034 length x37 = "Server Userid=IBMUSER created a full ACEE"
area length 8 OFFSET0048 length 4 00000000
area length 8 OFFSET0050 length 9 08000000 00000000 00
area length 8 OFFSET0054 length 1 00
area length 8 OFFSET0018 length 4 40404040
Internal data
area length xc0 ACEE
area length x50 userid information
area length x90 ACEX
area length x50 USP

hex dump of record

The RACF commands documentation says for a callable service, 0x00000026 is function number for IRRSIA00. This page gives the name of the function name IRRSIA00 and the description z/OS kernel on behalf of servers that use pthread_security_np servers or __login, or MVS servers that do not use z/OS UNIX services.

The parameters are for the initACEE (IRRSIA00) call.

The matching “after” record, OMVSPOST was (with the parameters of the the IRRSIA00 call format, parameters) are

RTRACE
OMVSPOST
Service number: 00000026
RACF Return code: 00000008
RACF Reason code: 00000020

Parameters
area length 8 OFFSET0000 length 4 00000000 >work_area<
area length 8 OFFSET0004 length 4 00000000 >ALET<
area length 8 OFFSET0008 length 4 00000008 >SAF_return_code<
area length 8 OFFSET000C length 4 00000000 >ALET<
area length 8 OFFSET0010 length 4 00000008 >RACF_return_code<
area length 8 OFFSET0014 length 4 00000000 >ALET<
area length 8 OFFSET0018 length 4 00000020 >RACF_reason_code<
area length 8 OFFSET001C length 1 01 >Function_code 1 = Create an ACEE<
area length 8 OFFSET0020 length 4 C4800000 >Attributes - see below<
area length 8 OFFSET0024 length 6 05c1c2c3c4C2 >Userid .ABCDB<
area length 8 OFFSET0028 length 4 00000000 >ACEE_pointer<
area length 8 OFFSET002C length 9 08C7D7D4 E2C5D9E5 C5 >APPLID = .GPMSERVE<
Internal information
area length 8 OFFSET0034 length x37 = "Server Userid=IBMUSER created a full ACEE"
area length 8 OFFSET0048 length 4 00000000
area length 8 OFFSET0050 length 9 08000000 00000000 00
area length 8 OFFSET0054 length 1 00
area length 8 OFFSET0018 length 4 40404040
Internal data
area length xc0 ACEE
area length x50 userid information
area length x90 ACEX
area length x50 USP

There the attributes C480000 mean

  • X80000000 – Create the ACEE
  • X40000000 – Createthe USP for the userid
  • X04000000 – Suppress any RACF Messages
  • X00800000 – Return an OUSP in the output area

Note the flag:X04000000 – Suppress any RACF Messages.

The return code

area length  8  OFFSET0008 length 4 00000008 >SAF_return_code<
...
area length 8 OFFSET0010 length 4 00000008 >RACF_return_code<
area length 8 OFFSET0014 length 4 00000000 >ALET<
area length 8 OFFSET0018 length 4 00000020 >RACF_reason_code<

8,8,32 means The user does not have appropriate RACF access to either the SECLABEL, SERVAUTH profile, or APPL specified in the parmlist.

For other RACF services the trace entries follow a similar format.

Collecting and understanding a RACF GTF trace output

A RACF request can be

  • A callable service, such as IRRSDL00 which is used to extract information about keyrings. This can be used by High Level Languages. These have a trace type of OMVS
  • A RACROUTE request, an assembler macro interface, such as FASTAUTH. These have a trace type of RACF.

You get a trace entry PRE the call, and and a trace entry POST the call. A callable service would have an OMVSPRE, and an OMVSPOST entry.

When setting up your RACF trace you need to define which callable service, or which RACROUTE requests you want to trace.

Example of a callable server

Entry trace

Trace Identifier:             00000036 
Record Eyecatcher: RTRACE
Trace Type: OMVSPRE
Ending Sequence: ........
Calling address: 00000000 20801A0F
Requestor/Subsystem: ........ ........
Primary jobname: IBMINC5
Primary asid: 00000029
Primary ACEEP: 00000000 008FA8A8
Home jobname: IBMINC5
Home asid: 00000029
Home ACEEP: 00000000 008FA8A8
Task address: 00000000 008D6A88
Task ACEEP: 00000000 00000000
Time: DFF36A27 5460CC00
Error class: ........
Service number: 00000029
RACF Return code: 00000000
RACF Reason code: 01000000
Return area address: 00000000 00000000
Parameter count: 00000021

Interesting information

  • Trace Type: OMVSPRE this is an entry trace
  • Primary jobname: IBMINC5
  • Home jobname: IBMINC5
  • The RACF return and reason code have no meaning on entry
  • Service number: 00000029 is hex so the service is 41. The RACF commands documentation says for a callable service, 41 is function number IRRSDL00.

Exit trace

Trace Identifier:             00000036 
Record Eyecatcher: RTRACE
Trace Type: OMVSPOST
Ending Sequence: ........
Calling address: 00000000 20801A0F
Requestor/Subsystem: ........ ........
Primary jobname: IBMINC5
Primary asid: 00000029
Primary ACEEP: 00000000 008FA8A8
Home jobname: IBMINC5
Home asid: 00000029
Home ACEEP: 00000000 008FA8A8
Task address: 00000000 008D6A88
Task ACEEP: 00000000 00000000
Time: DFF36A27 5468DC40
Error class: ........
Service number: 00000029
RACF Return code: 00000008
RACF Reason code: 00000014
Return area address: 00000000 00000000
Parameter count: 00000021

Interesting fields

  • Trace Type: OMVSPOST this is the exit trace
  • As above service number 41 is function
  • The RACF Return code is 8
  • The RACF reason code is 14

Example of a RACROUTE trace

Entry

Trace Identifier:             00000036 
Record Eyecatcher: RTRACE
Trace Type: RACFPRE
Ending Sequence: ........
Calling address: 00000000 A096A08A
Requestor/Subsystem: CRYPTO CRYPTO
Primary jobname: CSF
Primary asid: 00000038
Primary ACEEP: 00000000 008F6B98
Home jobname: IBMEXP
Home asid: 00000029
Home ACEEP: 00000000 008FA8A8
Task address: 00000000 008D6A88
Task ACEEP: 00000000 00000000
Time: DFF37DD1 AB7FF040
Error class: ........
Service number: 00000002
RACF Return code: 00000000
RACF Reason code: 00000000
Return area address: 00000000 00000000
Parameter count: 00000009

This trace entry was from a batch job IBMEXP, using an ICSF service to access an encryption key.

Interesting fields

  • RACFPRE this is a RACROUTE before entry
  • Requestor/Subsystem: CRYPTO which z/OS component
  • Primary jobname: CSF This was the address space which issued the RACROUTE request
  • Home jobname: IBMEXP This was the job requesting an ICSF service
  • Service number: 2. The RACF commands documentation says for a RACROUTE service, 2 is FASTAUTH
  • Ignore RACF return and RACF reason codes. They are set on exit,

Exit

Trace Identifier:             00000036 
Record Eyecatcher: RTRACE
Trace Type: RACFPOST
Ending Sequence: ........
Calling address: 00000000 A096A08A
Requestor/Subsystem: CRYPTO CRYPTO
Primary jobname: CSF
Primary asid: 00000038
Primary ACEEP: 00000000 008F6B98
Home jobname: IBMEXP
Home asid: 00000029
Home ACEEP: 00000000 008FA8A8
Task address: 00000000 008D6A88
Task ACEEP: 00000000 00000000
Time: DFF37DD1 AB85DE40
Error class: ........
Service number: 00000002
RACF Return code: 00000008
RACF Reason code: 00000000
Return area address: 00000000 00000000
Parameter count: 00000009

This trace entry was from a batch job IBMEXP, using an ICSF service to access an encryption key.

Interesting fields

  • RACFPOST this is a RACROUTE after exit entry
  • Requestor/Subsystem: CRYPTO which z/OS component
  • Primary jobname: CSF This was the address space which issued the RACROUTE request
  • Home jobname: IBMEXP This was the job requesting an ICSF service
  • Service number: 2. The RACF commands documentation says for a RACFROUTE service, 2 is FASTAUTH
  • RACF Return code: 00000008, RACF Reason code: 00000000 This shows there was a problem. The documentation for FASTAUTH shows
    • RC=8 -> the user or group is not authorized to use the resource.
    • RS=0 -> The invoker does not need to log the request.

The RACF trace command for this was

#SET TRACE(JOBNAME(csf),callable(none),RACROUTE(type(2)))

and the following commands

S GTF.GTF
R 1,trace=usrp
R 2,USR=(F44)
R 3,END
R 4,U

Fast start GTF

I can use

S GTF.GTF,M=GTFRACF

Where my GTF proc has

//GTFNEW  PROC M=GTFPARM,ID=SYS1 
//DELETE EXEC PGM=IEFBR14
//IEFRDER DD DSNAME=&ID..TRACE,UNIT=SYSDA,SPACE=(TRK,20),
// DISP=(MOD,DELETE)
//IEFPROC EXEC PGM=AHLGTF,PARM='MODE=EXT,DEBUG=NO,TIME=YES,NP',
// TIME=1440,REGION=2880K
//IEFRDER DD DSNAME=&ID..TRACE,UNIT=SYSDA,SPACE=(TRK,20),
// DISP=(NEW,KEEP)
//SYSLIB DD DSNAME=USER.Z24C.PROCLIB(&M),DISP=SHR

I have a member in USER.Z24C.PROCLIB(GTFRACF)

TRACE=USRP 
USR=(F44)
END

Debugging the “you do not have access to something, but I’m not telling you what” problem

The problem, I had a message

SSL Handshake Failed, ICSF error. Review ‘RACF CSFSERV Resource Requirements’ of the z/OS documentation.
Reason: The webservers userid does not have access to CSFSERV resource classes required for SSL.

But it does not tell me what it does not have access to.

When an application tries to access a resource, and the userid is not authorised to that resource, RACF can produce an error message, which tells you the resource name.

If the application asks “does this application have access to this resource”, then RACF produces no error message, and it is up to the application to provide a sensible and useful message.

Collecting a RACF trace

You can use a command

#SET TRACE(CLASS(CSFSERV),RACROUTE(ALL))

to turn on the trace for that class. You can also use USERID(…) and jobname(…) to further restrict what is traced.

The output goes to GTF.

s gtf,gtf
01 AHL125A  RESPECIFY TRACE OPTIONS OR REPLY U 
 1,trace=usrp                                                                                 
IEE600I REPLY TO 01 IS;TRACE=USRP                                                              
    09.53.19 STC00315  TRACE=USRP                                                                                     
02 AHL101A  SPECIFY TRACE EVENT KEYWORDS --USR=                                                
  - 09.53.27           r 2,usr=(F44),end                                                                              
    09.53.27 STC00315  IEE600I REPLY TO 02 IS;USR=(F44),END                                                           
    09.53.27 STC00315  USR=(F44),END                                                                                  
    09.53.27 STC00315  AHL103I  TRACE OPTIONS SELECTED --USR=(F44)                                                    
  | 09.53.27 STC00315 *03 AHL125A  RESPECIFY TRACE OPTIONS OR REPLY U                                                 
00- 09.53.30           r 3,u                                                                                          
    09.53.30 STC00315  IEE600I REPLY TO 03 IS;U                                                                       
    09.53.30 STC00315  U                                                                                              

Run your work.

P GTF
AHL006I GTF ACKNOWLEDGES STOP COMMAND
AHL904I THE FOLLOWING TRACE DATASETS CONTAIN TRACE DATA :
SYS1.TRACE

if you do not get “THE FOLLOWING TRACE DATASETS CONTAIN TRACE DATA…” it means you did not collect any data.

Use IPCS to look at it

  • =0 and specify the trace data set name
  • if you change scope to both it will remember the data for next time
  • =6 to get you to IPCS Subcommand Entry panel
  • if this is the first time you have used this instance of the data set, you should issue the dropd command to get IPCS to forget about previous usage
  • gtf usr(all) This displays the data
  • You can process this
    • type M and press PF8 to get to the bottom of the data
    • report view will display the data in ISPF edit (view mode)
    • You can now issue commands like
    • x all
    • f code all and look for non zero return codes.
    • del all x
    • sort 30 50 to display all return codes in numerical order. You need to look at the top and the bottom.
    • make a note of the return code ( copy the line to your clipboard)
    • quit
    • report view
    • find return code

To get rid of some of the forest of unhelpful data

  • x all
  • find ‘ ‘ 1 20 all
  • find ‘+’ 1 2 all
  • delete all nx

#SET TRACE(NOCLASS,RACROUTE(ALL))

How do I test my RACF updates or recover from a boo-boo?

I am running z/OS on zPDT on my Linux server. I want to test out some RACF commands, but want to be able to recover if I get it wrong. What can I do?

The product zSecure Admin offers RACF Offline can be used by those who have the license.

I do not have this product so I need a different way.

I am running a single user z/OS image. For anything more complex than this, talk to the experts.

Background to RACF

RACF has a database, typically SYS1.RACFDS. For availability you can have a backup RACF database, typically SYS1.RACFDS.BACKUP. Typically these are both updated when changes are made to the RACF environment. I would call this a duplexed dataset.

You can make a copy of a RACF database using the RACF utility IRRUT200. This takes a lock on database for the duration of the copy and ensures the copy is consistent. If you use other tools to backup the primary database, and there were concurrent database updates, the backup may not be consistent. You risk a partial update, and an inconsistent database.

Changing your RACF

You can use the RVARY command (TSO and operator) to change the RACF data sets. You can setup controls for the RVARY command.

If you use SETROPTS LIST it will list the RACF options being used. Mine has (near the bottom)

DEFAULT RVARY PASSWORD IS IN EFFECT FOR THE SWITCH FUNCTION.
...
DEFAULT RVARY PASSWORD IS IN EFFECT FOR THE STATUS FUNCTION.

In this case the default password is YES

See SETROPTS RVARY(…)

You will get a prompt

ICH703A ENTER PASSWORD TO SWITCH RACF DATASETS JOB=IBMUSER USER=IBMUSER

If you have an I/O problem with the SYS1.RACFDS database

You can

  • switch to use the backup, and stop duplex updates.
  • fix the problem. Perhaps delete the data set and recreate it on a different device.
  • copy the backup dataset using IRRUT200 to the newly allocated dataset
  • activate the newly allocated dataset activate the duplex updates.

Updates to the RACF database are written to SMF 80 records.

If you make mistakes with your commands

Some RACF command changes you can undo, for example, you can use an alter command to set a new value. To be able to undo the change you need to know what the original value was, so it is always a good idea to display a resource before you change or delete it.

If you delete a record there is no easy way of undoing it.

Rob van Hoboken suggested the following.

Plan A – online switch

Display the status

#rvary list

This gave me

ICH15013I RACF DATABASE STATUS:                          
ACTIVE USE NUM VOLUME DATASET
------ --- --- ------ -------
YES PRIM 1 A3CFG1 SYS1.RACFDS
YES BACK 1 A3CFG1 SYS1.RACFDS.BACKUP

This shows both data sets are active and which is the primary and which ise the backup.

Switch off your RACF DUPLEX (backup) data set

#RVARY INACTIVE,DATASET(SYS1.RACFDS.BACKUP)

It prompts with

ICH702A ENTER PASSWORD TO DEACTIVATE RACF JOB=RACF USER=START1

I didnt know the password, but I replied r xx,YES and it worked!

Run the commands that you hope will work.

If the commands do not work successfully

If the commands do not work successfully, make the backup active and switch to it.

#RVARY ACTIVE,DATASET(SYS1.RACFDS.BACKUP)
#RVARY SWITCH

See the command RVARY (Change status of RACF database) .

After you have switched to the backup, make the primary inactive

#RVARY INACTIVE,DATASET(SYS1.RACFDS)

and copy the backup into the primary using IRRUT200

//IBMUSRAC  JOB 1,MSGCLASS=H                              
//S1 EXEC PGM=IRRUT200 PARM=ACTIVATE
//SYSRACF DD DISP=SHR,DSN=SYS1.RACFDS.BACKUP
//SYSUT1 DD DISP=SHR,DSN=SYS1.RACFDS
//SYSUT2 DD SYSOUT=*
//SYSIN DD *
//SYSPRINT DD SYSOUT=*

First run it without PARM=ACTIVATE. If this works successfully, then use PARM=ACTIVATE, this take a lock during the copy, and then activates the SYSUT1 data set, so both are now active.

This gave the output

IRR62005I - IDCAMS REPRO copied SYSRACF to the work data set SYSUT1                                         
ICH15013I RACF DATABASE STATUS:
ACTIVE USE NUM VOLUME DATASET
------ --- --- ------ -------
YES PRIM 1 A3CFG1 SYS1.RACFDS.BACKUP
YES BACK 1 A3CFG1 SYS1.RACFDS
ICH15020I RVARY COMMAND HAS FINISHED PROCESSING.

The #RVARY LIST gave, as expected

ICH15013I RACF DATABASE STATUS:                                                  
ACTIVE USE NUM VOLUME DATASET
------ --- --- ------ -------
YES PRIM 1 A3CFG1 SYS1.RACFDS.BACKUP
YES BACK 1 A3CFG1 SYS1.RACFDS

Which shows it has worked.

If the commands do work successfully

If the commands do work successfully, you need to bring the backup up to date with the primary.

//IBMUSRAC  JOB 1,MSGCLASS=H                              
//S1 EXEC PGM=IRRUT200,PARM=ACTIVATE
//SYSRACF DD DISP=SHR,DSN=SYS1.RACFDS
//SYSUT1 DD DISP=SHR,DSN=SYS1.RACFDS.BACKUP
//SYSUT2 DD SYSOUT=*
//SYSIN DD *
//SYSPRINT DD SYSOUT=*

The #rvary active command.

I would be careful about using the #RVARY ACTIVE command. If you copy a RACF data set, then use the RVARY ACTIVE command to make it available, there is a small window where changes made to the active RACF data base are not made to the offline one. You could tell people “do not issue any commands while this work is going on”, but the RACF database is updated during normal work, for example with profile use counts, and last used date.

Plan B – REIPL with different data sets.

  • Convert to using IRRPRMxx PARMLIB member definitions to specify the data set names, instead of using a load module. See RACF setup, and changing the RACF dataset
  • Copy the RACF database to a recovery RACF data set using IRRUT200
  • Create a new IRRPRMxx member pointing to your recovery data set.
  • Create a new IEASYSxx member to use the RACF=xx member
  • You may have to create a new LOADxx member in SYS1.IPLPARM to point to the new IEASYSxx member.
  • REIPL, and specify the appropriate LOADxx suffix.
  • Fix the problem; copy from the recovery into the SYS1.RACFDS dataset and SYS1.RACFDS.BACKUP, so the primary and backup are the same.
  • Reipl using the normal IPL parameters

You could use the emergency stand-alone one pack system SARES1, but this may not have access to all of the data sets due to SMS and catalogs.

RACF setup, and changing the RACF dataset

I used ADCD supplied version of z/OS, and needed to change which data set I used, so it could be used on z/OS 3.1 or earlier (but not both at the same time).

What options are active?

You can use the operator command

#rvary

where # is the RACF subsystem character (defined in INITPARM(‘#’) in the IEFSSN definition for RACF).

Before z/OS 2.3

You had to define RACF data set parameters using a load module.

  • The databases are defined in a module ICHRDSNT.
  • If you use SDSF and issue the LOAD ICHRDSNT command, it will display the data.
  • The source is in ADCD.LIB.JCL(ICHRDSNT).
  • If you want to change it, you need to generate the module, and REIPL

Parmlib definitions

In z/OS 2.3 and later you can specify parameters in the parmlib concatenation.

In your IEASYS member you can specify RACF=(xx,yy,zz) and specify up to three parameters. These parameters take precedence over the ICHRDSNT table. I specified RACF=(00)

My equivalent definition in USER.Z25D.PARMLIB(IRRPRM00) is

DATABASE_OPTIONS 
DATASETNAMETABLE
ENTRY
PRIMARYDSN(SYS1.RACFDS)
BACKUPDSN(SYS1.RACFDS.BACKUP)

You can have one primary data set and up to one backup data set. If you want to partition your database by key, you can have more primary and backup data sets entries, and you need to specify the key range to data set mapping.

The syntax is defined here.

The datasets have to be cataloged. I have used a data set catataloged in a user catalog, they do not need to be SYS1.xxxx.

If you get it wrong and the member is invalid, at IPL RACF prompts

 *IRRY115I RACF IS NOT USING PARMLIB FOR INITIALIZATION.                     
*01 ICH502A SPECIFY NAME FOR PRIMARY RACF DATA SET SEQUENCE 001 OR 'NONE'
IRA600I SRM CHANNEL DATA NOW AVAILABLE FOR ALL SRM FUNCTIONS
IRA860I HIPERDISPATCH MODE IS NOW ACTIVE
R 01,SYS1.RACFDS
*02 ICH502A SPECIFY NAME FOR BACKUP RACF DATA SET SEQUENCE 001 OR 'NONE'
R 2,SYS1.RACFDB.BACKUP

Changing parmlib

If you change the IEASYS RACF=, or change an IRRPMRxx member, you have to REIPL to pick up the changes. Stopping and restarting RACF does not pick up the changes, because the RACF started task handles operator commands and other admin stuff, it does not process the data sets. See here.

Copying the RACF database

You can copy a RACF database and use that. For example

//IBMCRACO  JOB 1,MSGCLASS=H 
//STEP EXEC PGM=IRRUT200
//SYSRACF DD DSN=SYS1.RACFDS,DISP=SHR
//SYSUT1 DD UNIT=SYSDA,SPACE=(CYL,(20)),
// DCB=(LRECL=4096,RECFM=F),DSN=COLIN.RACFDS,DISP=(MOD,CATLG)
//SYSUT2 DD SYSOUT=A
//SYSPRINT DD SYSOUT=A
//SYSIN DD *
INDEX
MAP
END
/*

IRRUT200 is called a RACF database verification utility program (IRRUT200), and to make an exact copy of a RACF data set. It takes a lock on the RACF database for the duration of the copy to ensure the data is consistent.

You should check the size of the source dataset, and make the copy the same size.

RACF – I cannot delete a certificate!

I made a mistake creating a certificate, and could not delete it, because I got a message

IRRD109I The certificate cannot be added. Profile … is already defined.

In my JCL I had

RACDCERT GENCERT  -                                           
CERTAUTH -
SUBJECTSDN(CN('DocZosCADSA')-
O('COLIN') -
OU('CA')) -
NOTAFTER( DATE(2027-07-02 ))-
KEYUSAGE( CERTSIGN ) -
DSA
SIZE(1024) -
WITHLABEL('DocZosCADSA')

This created a certificate

I was missing the – after DSA, so the DSA, SIZE(1024) and importantly the WITHLABEL() was missing.

When I tried to recreate this certificate I got message

IRRD109I The certificate cannot be added. Profile 00.CN=DocZosCADSA.OU=CA.O=COLIN is already defined.

My problem was, what do I delete ?

The following command listed all of the certificate owned by certauth.

RACDCERT certauth LIST

I searched for CN=DocZosCADSA and it found. (Use the CN=… not the whole string)

Label: LABEL00000002 
...
Subject's Name:
>CN=DocZosCADSA.OU=CA.O=COLIN<
...

Note the label.

The command

RACDCERT CERTAUTH DELETE(LABEL('LABEL00000002'))

Deleted the certificate in error – problem solved.

Note: For a personal certificate the reported certificate was like

02AB.CN=SSCA256.OU=CA.O=SSS.C=GB

If you use the list command the certificate sequence number 02AB is on a different line to the remainder of the label.