Write instructions for your target audience – not for yourself.

Over the last couple of weeks, I’ve been asked questions about installing two products on z/OS. I looked at the installation documentation, and it was written the way I would write it for myself – it was not written for other people to follow.

I sent some comments to one of the developers, and as the comments mainly apply to the other products as well, I thought I would write them down – for when another product comes along.

I’ve been writing some documentation for AT-TLS, which lets you give applications TLS support without changing the application, so I’ll focus on a product using TCP/IP.

What is the environment?

The environment can range from one person running z/OS on a laptop, to running a Parallel Sysplex where you have multiple z/OS instances running as a Single System Image; and taking it further, you can have multiple sites.

What levels of software?

Within a Sysplex you can have different levels of software, for example one image at z/OS 2.4 and another image at z/OS 2.5. You tend to upgrade one system to the next release, then when this has been demonstrated to be stable, migrate the other systems in turn.

Within one z/OS image you can have multiple levels of products, for example MQ 9.2.3 and MQ 9.1. People may have multiple levels so they can test the newer level, and when it looks stable, they switch to the newer level and later remove the older level. If the newer level does not work in production – they can easily switch back to the previous level.

Each version may have specific requirements.

  • If your product has an SVC, you may need an SVC for each version, unless the higher level SVC supports the lower level code.
  • If your product uses a TCP/IP port, you will need a port for each instance.

You need to ensure your product can run in this environment, with more than one version installed on an image.
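One way to check the "one port per instance" requirement before a change goes in is a small clash check over the planned instances. This is a minimal sketch in Python; the instance names and port numbers are made up for illustration, and a real site would read them from its own configuration.

```python
from collections import defaultdict


def find_port_clashes(instances):
    """Return {port: [instance names]} for every port claimed more than once.

    `instances` maps an instance name to the TCP/IP port it plans to use.
    """
    by_port = defaultdict(list)
    for name, port in instances.items():
        by_port[port].append(name)
    return {port: names for port, names in by_port.items() if len(names) > 1}


if __name__ == "__main__":
    # Hypothetical plan: two product levels plus a test instance on one image.
    planned = {"MYPROD91": 1414, "MYPROD923": 1415, "MYPRODT1": 1414}
    for port, names in find_port_clashes(planned).items():
        print(f"port {port} is claimed by: {', '.join(names)}")
```

An empty result means every instance has its own port.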

How do things run?

Often z/OS images and programs run for many months. For example, a site might IPL every three months to apply the latest fixes, so your product instance may run for three months before restarting. If you write messages to the joblog, or have output going to the JES2 spool, you want to be able to purge old output without shutting down your instance. You can specify options to “spin” off output and make the file purgeable.

Your instance may need to be able to refresh its parameters. For example, if a key in a keyring changes, you need to close and reopen the keyring. This implies a refresh command, or the keyring is opened for each request.
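The refresh idea can be sketched as follows. This is an illustration in Python, not how any particular product does it: the `KeyringCache` class, the `loader` callable and the `REFRESH` command name are all hypothetical. The point is only that the cached copy is replaced on demand, so the instance does not have to be restarted.

```python
class KeyringCache:
    """Holds a cached copy of the keyring contents.

    `loader` is a hypothetical callable that opens the keyring, reads it,
    closes it, and returns a dict of label -> certificate data.
    """

    def __init__(self, loader):
        self._loader = loader
        self._keys = loader()

    def get(self, label):
        return self._keys[label]

    def refresh(self):
        # "Close and reopen": throw away the cached copy and read it again.
        self._keys = self._loader()


def handle_command(cache, command):
    """Very small operator-command handler with a single REFRESH command."""
    if command.strip().upper() == "REFRESH":
        cache.refresh()
        return "keyring reloaded"
    return "unknown command"
```

The alternative mentioned above, opening the keyring on every request, needs no refresh command but pays the cost of the open each time.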

Who is responsible for the system?

For me – I am the only person using the system and I am responsible for everything.

For big systems there will be functions allocated to different departments:

  • Installation of software (getting the libraries and files to the z/OS image)
  • The z/OS systems team – creating and updating the base z/OS system
  • The Security team – this may be split into platform security (RACF), and network security
  • Data management – responsible for data, backup (and restore), migration of unused data sets to tape, ensuring there is enough disk space available.
  • Communications team – responsible for TCP/IP connectivity, DNS, firewalls etc.
  • Database team – responsible for DB2 and other products
  • Liberty team – responsible for Liberty, and products such as z/OSMF that are built on top of Liberty.
  • MQ – responsible for MQ, and MQ to MQ connectivity.

Some responsibilities could be done by different teams, for example creating the security profile when creating a started task. This is a “security” task – but the z/OS systems programmer will usually do it.

How are systems changes managed?

Changes are usually made on a test system and migrated into production. I’ve seen a rule “nothing goes into production which has not been tested”. Some implications of this are

  • No changes are typed into production. A file can be copied into production, and a file may have symbolic substitution, such as SYSTEM=&SYSNAME. You can use cut and paste, but no typing. This eliminates problems like 0 being misread as O, and 1, i and l looking similar.
  • Changes are automated.
  • Every change needs a back-out process – and this back-out has been tested.
    • Delete is a two-phase operation. Today you do a rename rather than a delete; next week you do the actual delete. If there is a problem with the change you can just rename it back again. Some objects have non-obvious attributes, and if you recreate an object, it may be different, and not work the same way as it used to.
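The two-phase delete can be sketched like this. It is a minimal illustration in Python using ordinary files as a stand-in for data sets or other objects; the `.todelete` suffix and the function names are assumptions made for the example.

```python
from pathlib import Path

SUFFIX = ".todelete"  # hypothetical marker for objects waiting to be deleted


def phase1_rename(path):
    """Phase 1 (today): rename instead of deleting. Returns the parked path."""
    path = Path(path)
    parked = path.with_name(path.name + SUFFIX)
    path.rename(parked)
    return parked


def back_out(parked):
    """Back-out: rename the parked object back to its original name."""
    parked = Path(parked)
    if not parked.name.endswith(SUFFIX):
        raise ValueError(f"{parked} was not parked by phase1_rename")
    original = parked.with_name(parked.name[: -len(SUFFIX)])
    parked.rename(original)
    return original


def phase2_delete(parked):
    """Phase 2 (next week): the actual delete, once the change has held."""
    Path(parked).unlink()
```

Because the back-out is a rename, the object comes back with all its original attributes, which is the reason for not deleting and recreating it.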

There are usually change review meetings. You have to write a change request, outlining

  • the change description
  • the impact on the existing system
  • the back-out plan
  • dependencies
  • which areas are affected.

You might have one change request for all areas (z/OS, security, networking), or a change request for each area, one for z/OS, one for security, one for networking.

Affected areas have to approve changes in their area.

How to write installation instructions

You need to be aware of the differences between installing a product for the first time, and installing it on successive occasions. For example creating a security definition. It is easy to re-test an install, and not realise you already have security profiles set up. A pristine new image is great for testing installation because it is clean, and you have to do everything.

Instructions like

  • Task 1 – create sys1.proclib member
  • Task 2 – define security profile
  • Task 3 – allocate disk storage
  • Task 4 – define another security profile
  • Task 5 – update parmlib

may make sense when one person is doing the work, but not if there are many teams.

It is better to have a summary by role like

  • z/OS systems programmer
    • create proclib member
    • update parmlib
  • Security team
    • Define security profile 1
    • Define security profile 2
  • Storage management team
    • Allocate disk space

and have links to the specific topics. This way it is very clear what a team’s responsibilities are, and you can raise one change request per team.

This summary also gives a good road map so you can see the scale of the installation task.
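If the install steps are kept as a flat list tagged with the owning role, the per-role summary can be generated rather than maintained by hand. A minimal sketch in Python, using the five example tasks above; the data layout and function names are my own choice for illustration.

```python
def summarise_by_role(tasks):
    """Group (role, task) pairs by role, keeping the original task order."""
    summary = {}
    for role, task in tasks:
        summary.setdefault(role, []).append(task)
    return summary


def render(summary):
    """Render the summary as an indented plain-text list."""
    lines = []
    for role, tasks in summary.items():
        lines.append(role)
        lines.extend(f"  - {task}" for task in tasks)
    return "\n".join(lines)


if __name__ == "__main__":
    install_steps = [
        ("z/OS systems programmer", "Create proclib member"),
        ("Security team", "Define security profile 1"),
        ("Storage management team", "Allocate disk space"),
        ("Security team", "Define security profile 2"),
        ("z/OS systems programmer", "Update parmlib"),
    ]
    print(render(summarise_by_role(install_steps)))
```

Each top-level entry in the output then corresponds to one change request.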

It is also good to indicate if this needs to be done once only per z/OS image, or for every instance. For example

  • APF authorise the load libraries – once per z/OS image
  • Create a JCL procedure in SYS1.PROCLIB – once per instance

Some tasks for the different roles

z/OS system programmers

  • Create alias for MYPROD.* to a user catalog
  • APF authorise MYPROD…. datasets
  • Create PARMLIB entries
  • Update LNKLST and LPA
  • Update PROCLIB concatenation with product JCL
  • Create security profiles for any started tasks; which userid should be used?
  • WLM classification of the started task or job.
  • Schedule deletion of any log files older than a specified age
  • When there are multiple instances per LPAR, decide whether to use S MYSTASK1, S MYSTASK2, or S MYSTASK.T1, S MYSTASK.T2
  • Do you need to specify JESLOG SPIN to allow JES2 logs to be spun regularly, or when they are greater than a certain size, or any DD SYSOUT with SPIN?
  • ISPF
    • Add any ISPF Panels etc into logon procedures, or provide CLIST to do it.
    • Update your ISPF “extras” panel to add product to page.
  • Try to avoid SVCs. There are better ways, for example using authorized services.
  • Propagate the changes to all systems in the Sysplex.
  • What CF structures are needed? Do they have any specific characteristics, such as duplexed?
  • How much (e)CSA is needed for each product instance?
  • Does your product need any Storage Class Memory (SCM)?

Security team

  • Create groups as needed, e.g. MYPRODSYS and MYPRODRO, and make the requester’s userid group special, so they can add and remove userids to and from the groups.
  • Create a userid for the started task. Create the userid with NOPASSWORD, to prevent people logging on with the userid and password.
  • Protect the MYPROD.* datasets, for example members of group MYPRODSYS can update the datasets, members of group MYPRODRO only have read-only access.
  • Create any other profiles.
  • Create any certificate or keyrings, and give users access to them.
  • Set up profiles for who can issue operator commands against the jobs or procedures.
  • Does the product require an “applid”? For example users must have access to a specific APPL to be able to use the facilities. An application can use pthread_security_applid_np to change the userid a thread is running under – but the users must have access to an applid. The default applid is OMVSAPPL.
  • Do users needing to use this product need anything specific? Such as id(0), needing a Unix Segment, or access to any protected resources? See below for id(0).
  • If a client authenticates to the server, the server needs access to BPX.SERVER in the RACF FACILITY class.
  • The started task userid may need access to BPX.DAEMON.
  • If a userid needs access to another user’s keyring, the requester needs read access to user.ring.LST in CLASS(RDATALIB) or access to IRR.DIGTCERT.LISTRING.
  • If a userid needs access to a private key in a keyring, the requester needs control access to user.ring.LST in CLASS(RDATALIB).
  • You might need to program control data sets, for example RDEF PROGRAM * ADDMEM('SYS1.LINKLIB'//NOPADCHK) UACC(READ) .
  • Users may need access to ICSF class CSFSERV and CSFKEYS.
  • Use of CLASS(SURROGAT) BPX.SRV.<userid> to allow one userid to be a surrogate for another userid.
  • Use of CLASS(FACILITY) BPX.CONSOLE to remove the generation of BPXM023I messages on the syslog.

Storage team

  • How much disk space is needed once the product has been installed, for data sets, and Unix file systems. This includes product libraries and instance data, and logs which can grow without limit.
  • How much temporary space is needed during the install.
  • Where do Unix files for the product go? For example /opt/ or /var….
  • Where do instance files go? For example on image local disks, or sysplex shared disks. If you have an instance on every member of the Sysplex – where do you put the instance files?
  • How much data will be produced in normal running – for example traces or logs.
  • When can the data be pruned?
  • Does the product need its own ZFS for instance data, to keep it isolated so that it cannot impact other products?
  • Do any additional Storage Classes etc. need to be defined? These determine if and when datasets are migrated to tape, or get deleted.
  • Are any other definitions needed. For example for datasets MYPROD.LOG*, they need to go on the fastest disks, MYPROD.SAMPLES* can go on any disks, and could be migrated.

Database team

  • What databases, tables, indexes etc are required?
  • How much disk space is needed.
  • What volume of updates per second. Can the existing DB2 instances sustain the additional throughput?
  • What security and protection is needed at the table level and at the field level.
  • What groups are permitted to access which fields?
  • What auditing is needed?
  • Is encryption needed?

MQ

  • Do you need to use MQ shared queues between queue managers?
  • How much data will be logged per second?
  • What is the space needed for the message storage, disk space, buffer pool and Coupling Facility?
  • Product specific definitions.
  • Security protection of any product specific definitions.

Networking

  • Which port(s) to use?
    • Do you need to control access to ports with the SAF resource on the PORT entry, and permit access to profile EZB.PORTACCESS.sysname.tcpname.resname
    • Use of SHAREPORT and SHAREPORTWLM
  • Use of Sysplex Distributor to route work coming in to a Sysplex to any available system?
  • Update the port list – so only a specific job can use the port
  • RACF profile for port?
  • Which cipher specs
  • Which level of TLS
  • Which certificates
  • Any AT-TLS profile?
  • Any firewall changes?
  • Any class of service?
  • Any changes to syslogd profile?
  • Are there any additional sites that will be accessed, and so need adding to the “allow” list.

Automation

  • If the started tasks, or jobs need to be started at IPL, create the definitions. Do they have any pre-reqs, for example requiring DB2 to be active.
  • If the jobs are shut down during the day, should they be automatically restarted?
  • Add automation to shut down any jobs or started tasks when the system is shut down
  • Which product messages need to be managed – such as events requiring operator action, or events reported to the site wide monitoring team.
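Start-up pre-reqs are a dependency graph, so the start order (and the reverse order for shutdown) can be derived instead of written out by hand. A minimal sketch in Python using the standard library's graphlib (Python 3.9 or later); the task names and the TCPIP dependency are assumptions for illustration, and real automation products have their own way of defining this.

```python
from graphlib import TopologicalSorter


def start_order(prereqs):
    """Return a start order in which every task follows its pre-reqs.

    `prereqs` maps a task name to the set of tasks that must be active first.
    """
    return list(TopologicalSorter(prereqs).static_order())


def stop_order(prereqs):
    """Shutdown is the reverse: stop dependents before what they depend on."""
    return list(reversed(start_order(prereqs)))


if __name__ == "__main__":
    # Hypothetical: the product's started task needs DB2 and TCPIP active.
    prereqs = {"MYSTASK1": {"DB2", "TCPIP"}, "DB2": set(), "TCPIP": set()}
    print("start:", start_order(prereqs))
    print("stop: ", stop_order(prereqs))
```

A circular dependency makes TopologicalSorter raise graphlib.CycleError, which is a useful check in its own right.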

Operations

  • Play book for product, how to start, and stop it
  • Are there any other commands?

Monitoring

  • Any SMF data that needs to be collected.
  • Any other monitoring.
  • How much additional CPU will be needed – at first, and in the long term.

Making your product secure

Many sites are ultra careful about keeping their system secure. The philosophy is give a user access for what they need to do – but no more. For example

  • They will not be comfortable installing a non-IBM SVC into their system. An SVC can be used from any address space, so if there is any weakness in the SVC it could be exploited.
  • Using id(0) (superuser) in Unix Services is not allowed. The userid needs to be given specific permission. If the code “changes userid” then services like pthread_security_applid_np() should be used; where the applid is part of the configuration. Alternatives include __login_applid. End users of this facility will need read access to the specific applid.

TLS and SSL

If you are using TLS there are other considerations

  • Any certificate you generate needs a long validity period, and JCL to recreate it when it expires.
  • If you create a Certificate Authority you need to document how to export it and distribute it to other platforms
  • Browsers and applications may verify the host name, so you need to generate a certificate with a valid name. The external z/OS name may be different from the internal name.
  • You should support TLS 1.2 and TLS 1.3. Other TLS and SSL versions are deprecated.
  • It is good practice to have one keyring with the server certificate with its private key, and a “common” trust store keyring which has the Certificate Authorities for all the sites connecting to the z/OS image. If you connect to a new site, you update the common keyring, and all applications pick up the new CA. If you have one keyring just for your instance, you need to maintain multiple keyrings when a new certificate is added, one for each application.
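From the client side, the host name check and the minimum TLS level look like this. It is a sketch using Python's standard ssl module on a generic platform, not System SSL or AT-TLS configuration; the function name and the optional `cafile` parameter (standing in for a common trust store) are assumptions for the example.

```python
import ssl


def make_client_context(cafile=None):
    """Build a client-side TLS context that refuses anything below TLS 1.2.

    `cafile` is an optional file of trusted CA certificates; when omitted,
    the platform's default trust store is used.
    """
    ctx = ssl.create_default_context(ssl.Purpose.SERVER_AUTH, cafile=cafile)
    # Older TLS and SSL versions are deprecated, so do not negotiate them.
    ctx.minimum_version = ssl.TLSVersion.TLSv1_2
    # Already the default for this purpose, stated here for emphasis: the
    # server certificate must match the host name the client connected to.
    ctx.check_hostname = True
    return ctx
```

A client built this way rejects a server certificate whose name does not match the name used to connect, which is why the certificate needs the external z/OS name in it.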
