From data to reports missing the potholes

I’ve been doing work with datasets on z/OS to produce reports. These range from SMF data to DCOLLECT data on datasets and SMS data.

It took a while to “get it right”, because I made some poor decisions as to how to process the data, and some of my processing was much more complex than it needed to be. It was easiest to start again!

I’ve been working with Python and Python tools, and other tools available on the platforms. See Pandas 102, notes on using pandas.

My current environment is to use some Python code to read a record, parse the record into a dictionary(dict), then add the dict to a list of records. Then either pass the list of dicts to Pandas to display, or to externalise the data, and have a second Python program to read the externalised data, and do the Pandas processing.

Reading the data
Processing the data
Now I’ve got a record – now what?
- Calculations
- Adding or deleting fields
How you accumulate the data, dicts or lists?
Error handling
Using the data

Reading the data

The data is usually in data sets rather than files in Unix Services. You can copy a dataset to a file, but it is easier to use the python package pyzfile to read datasets directly.

from pyzfile import * 

try:
  with ZFile("//'COLIN.DCOLLECT.OUT'", "rb,type=record,noseek") as file: 
     for rec in file: 
        #l = len(rec) 
        yield  rec 
    except ZFileError as e: 
        print(e,file=sys.stderr)

Often a data source will contain a mixture of record types, for example a dump of SMF datasets, may contain many different record types and subtypes.

You need to consider if you want to process all record types in one pass, or process one record type in one run, and a different record type in a different run.

Processing the data

You will normally have a mapping of the layout of the data in a record. Often there are a mix of records types, you need to decide which record types you process and which you ignore.

Field names

Some of the field names in a record are cryptic, they were created when field names could only be 8 characters or less. For example DCDDSNAM. This stands for DCollect records record type D, field name DS NAMe. You need to decide what you name the field. Do you name it DCDDSNAM, and tell the reader to go and look in the documentation to understand the field names in the report, or do you try to add value and just call it DSN, or DataSetName. You cannot guess some fields, such as DCDVSAMI. This is VSAM Inconsistency.

You also need to consider the printed report. If you have a one character field in the record, and a field name which is 20 characters long – by the default the printed field will be 20 characters long, and so waste space in the report. If the field is rarely used you could call it BF1 for Boring Field 1.

Character strings

Python works in ASCII, and strings need to be in ASCII to be printable. You will need to convert character data from EBCIDC to ASCII.

You can use substring to extract data from a record for example. So to extract a string, and convert it…

DSN =  record[20:63].decode('cp500').strip())

Integers

Integers – you will need to covert this to internal format. I found using the Python Struct very good to use. You give a string of conversion characters (integer, integer, …) and it returns an array of the data. If you are processing the data on a different platform, you may need to worry about big end and little end conversion of numbers.

Strange integers

Some records have units like hundredths of a second. You may want to convert these to float

float_value = float(input_value)/100

Packed numbers

Packed numbers are a representation of a date in “decimal” format. For example a yyyyddd for year 2025, day 5 is 0X2025005F” where the F is a sign digit. You cannot just print it (it comes out as 539295839).

Bit masks

Bit masks are very common, for example there is a 1 byte field DCVFLAG1 with values:

DCVUSPVT 0x20 Private
DCVUSPUB 0x10 Public
DCVUSSTG 0x08 Storage
DCVSHRDS 0X04 Device is sharable

If the value of the field is 0x14 – what do you return? I would create a field Flag1 with value of a list[“Public”,”Shareable”]. If all the bits were off, this would return an empty list []. It would be easy to create [“DCVUSPUB”,”DCVSHRDS”] or just display the hex value 14 (or 0x14) – but this makes it hard to interpret the data for the people reading the reports.

Triplets

SMF records contain triplets. These are defined by [offset to start, length of data, count of data] within the record.

For example in the SMF30 record there are many triplet sections. There is one for “usage data” involved in usage based pricing. There can be zero or more sections like

Product owner
Product name
…
TCB time used in hundredths of a second

How are you going to process this?
The SMF record has 3 fields for usage

SMF30UDO Offset to Usage Data section in SMF 30 record
SMF30UDL Length of each Usage Data section in SMF 30 record
SMF30UDN Number of Usage Data section in SMF 30 record

I would create a variable UsageData = [{“ProdOwner”: …,”ProdName”: …, “TCBTime”: …},{“ProdOwner”: …,”ProdName”: …, “TCBTime”: …},]

and convert TCBTime from an integer representing a hundreds of a second, to a floating point number.

Having these triplets make a challenge when printing the record. You could decide to

omit this data
summarise the data – and provide only a sum of the TCBTime value
give the data as a list of dicts, then have a Pandas step to copy only the fields you need for your reports.

For this usage data, I may want a report showing which jobs used which product, and how my much CPU the job used in that product. Although I may capture the data as a list of products, i could extract the data, and create another data record with

jobname1, product1, … CPU used1
jobname1, product2, … CPU used2
jobname2, product1, … CPU used1
jobname2, product3, … CPU used3
…

and remove the product data from the original data record.

Do you want all of the fields?

You may want to ignore fields, such as reserved values, length of record values, record_ type, and any fields you are not interested in. Record length tends to be the first field, and this is usually not interesting when generating default reports.

How to handle a different length record?

The format of many records change with new releases, typically adding new fields.

You need to be able to handle records from the previous release, where the record is shorter. For example do not add these fields to your dict, or give add them with a “None” value.

Now I’ve got a record – now what?

Once you have got your record, and created a dict from the contents {fieldname1=value, fieldname2=value2…} , you could just add it to the list to be passed to Pandas. It is not always that simple.

I found that some records need post processing before saving.

Calculations

For a DCOLLECT record, there is a field which says

DCVFRESP: Free Space on Volume (in KB when DCVCYLMG is set to 0 or in MB when DCVCLYMG is set to 1)

You need to check bit DCVCYLMG and have logic like

if DCVCYLMG  == 1:
	  data["FreeSpVolKB"] = data["FreeSpVolKB"] * 1024

Adding or deleting fields

For some fields I did some calculations to simplify the processing. For example I wanted average time when I had total_time, and count.

I created average_time = total_time / count, added this field, and deleted total_time and count fields.

Error handling

I found some records have an error flag, for example “Error calculating volume capacity”. You need to decide what to do.

Do you include them, and have the risk, that the calculations/display of volume capacity might be wrong?
Do you report record during the collection stage, and not include them in the overall data?

How you accumulate the data, dicts or lists?

When using Pandas you can build each record as a dict of values {“kw1″:”v1″,”kw2″:”v2”}, then build a list of dicts [{}, {}…]

or have “column” have a list of values {“Jobname”: [“job1″,”job2″…],”CPUUsed”:[99,101…] … }. As you process each field you append it to the appropriate “column” field.

# a dict of lists
datad = {"dsn":["ABC","DEF"],
         "volser":["SYSRES","USER02"]}
datad["dsn"].append("GHI")
datad["volser"].append("OLDRES")

pdd =  pd.DataFrame(datal) 
 

# a list of dicts
dictl = [
    {"dsn":"ABC","volser":"SYSRES"},
    {"dsn":"DEF","volser":"USER02S"}]
newdict = {"dsn":"GHI","volser":"OLDRES"}

dictl.append(newdict)

pdl = pd.DataFrame.from_records(datal)

I think it is better to capture your data in a dict, then add the dict to the list of records.

For example with

DCVFRESP: Free Space on Volume (in KB when DCVCYLMG is set to 0 or in MB when DCVCLYMG is set to 1)

If you use a dict to collect the data, you can then easily massage the values, before adding the dict to the list.

if DCVCYLMG  == 1:
  data["FreeSpVolKB"] = data["FreeSpVolKB"] * 1024

grand_data.append[data]

If you try to do this using “column” values it gets really messy trying to do a similar calculation.

Using the data

It took a long time to process the dataset and create the Python data. I found it quicker overall to process the dataset once, and externalise the data using Pickle, or JSON. Then have different Python programs which read the data in and processed it. For example

Creating a new data structure using just the columns I was interested in.
Filtering which rows I wanted.
Save it

Pandas 102, notes on using pandas

Pandas is a great tool for displaying data from Python. You give it arrays of data, and it can display, summarise, group, print and plot. It is used for the simplest data, up to data analysts processing megabytes of data.

There is a lot of good information about getting started with Pandas, and how you can do advanced things with Pandas. I did the Pandas 101 level of reading, I struggled with the next step, so my notes for the 102 level of reading are below. Knowing that something can be done means you can go and look for it. If you look but cannot find, it may be that you are using the wrong search arguments, or there is no information on it.

Working with data
Feeding data into Pandas
What can you do with it?
Which columns are displayed ?
Select which rows you want displayed
You can select rows and columns
- What does .loc do?
You can process the data, such as sort
Saving data
- Export the data
- Import and do the analysis
Processing multiple data sources as one
Processing data within fields
Basic operations on columns
Doing aggregation, count, sum, maximum, minimum etc.
- Simple aggregation .sum()
- More complex aggregation .agg()

Working with data

I’ve been working with “flat files” on z/OS. For example the output of DCOLLECT which is information about dataset etc from SMS.

One lesson I learned was you should isolate the extraction from the processing (except for trivial amounts of data). Extracting data from flat files can be expensive, and take a long time, for example it may include conversion from EBCDIC to ASCII. It is better to capture the data from the flat file in python variables, then write the data to disk using JSON, or pickle (Python object serialisation). As a separate step read the data into memory from your saved file, then do your data processing work, with pandas, or other tools.

Feeding data into Pandas

The work I’ve done has been two dimensional, rows and columns; you can have multi dimensional.

You can use a list of dictionaries(dicts), or dict of list:

# a dict of lists
datad = {"dsn":["ABC","DEF"],
         "volser":["SYSRES","USER02"]}
pdd =  pd.DataFrame(datal)  

# a list of dicts
datal = [{"dsn":"ABC","volser":"SYSRES"},
         {"dsn":"DEF","volser":"USER02S"},
        ]   

pdl = pd.DataFrame.from_records(datal)

Processing data like pdd = pd.DataFrame(datal) creates a pandas data frame. You take actions on this data frame. You can create other data frames from an original data fram, for example with a subset of the rows and columns.

I was processing a large dataset of data, and found it easiest to create a dict for each row of data, and then accumulate each row as a list. Before I used Pandas, I had just printed out each row. I do not know which performs better. Someone else used a dict of lists, and appended each row’s data to the “dsn” or “volser” list.

What can you do with it?

The first thing is to print it. Once the data is in Pandas you can use either of pdd or pdl above.

print(pdd)

gave

   dsn   volser
0  ABC   SYSRES
1  DEF  USER02S

Where the 0, 1 are the row numbers of the data.

With my real data I got

                                             DSN  ... AllocSpace
0   SYS1.VVDS.VA4RES1                             ...       1660
1   SYS1.VTOCIX.A4RES1                            ...        830
2   CBC.SCCNCMP                                   ...     241043
3   CBC.SCLBDLL                                   ...        885
4   CBC.SCLBDLL2                                  ...        996
..                                           ...  ...        ...
93  SYS1.SERBLPA                                  ...        498
94  SYS1.SERBMENU                                 ...        277
95  SYS1.SERBPENU                                 ...      17652
96  SYS1.SERBT                                    ...        885
97  SYS1.SERBTENU                                 ...        332

[98 rows x 7 columns]

The data was formatted to match my window size. With a larger window I got more columns.

You can change this by using

pd.set_option('display.max_rows', 500)
pd.set_option('display.max_columns', 500)
pd.set_option('display.width', 1000)

Which columns are displayed ?

Rather than all of the columns being displayed you can select which columns are displayed.

You can tell from the data you passed to pandas, or use the command

print(list(opd.columns.values))

This displays the values of the column names, as a list.

To display the columns you specify use

print(opd[["DSN","VOLSER","ExpDate","CrDate","LastRef","63bit alloc space KB", "AllocSpace"]])

You can say display all but the specified columns

print(opd.loc[:, ~opd.columns.isin(["ExpDate","CrDate","LastRef"])])

Select which rows you want displayed

print(opd.loc[opd["VOLSER"].str.startswith("A4"),["DSN","VOLSER",]])

print(opd.loc[opd["DSN"].str.startswith("SYS1."),["DSN","VOLSER",]])

gave

                                             DSN  VOLSER  
0   SYS1.VVDS.VA4RES1                             A4RES1  
1   SYS1.VTOCIX.A4RES1                            A4RES1  
12  SYS1.ADFMAC1                                  A4RES1  
13  SYS1.CBRDBRM                                  A4RES1  
14  SYS1.CMDLIB                                   A4RES1  
..                                           ...     ...  
93  SYS1.SERBLPA                                  A4RES1  
94  SYS1.SERBMENU                                 A4RES1  
95  SYS1.SERBPENU                                 A4RES1  
96  SYS1.SERBT                                    A4RES1  
97  SYS1.SERBTENU                                 A4RES1  

[88 rows x 2 columns]

From this we can see 88 (out of 97) rows were displayed. Row 0, 1 , 12, 13, but not rows 2, 3, …

What does .loc do?

My interpretation of this (which I haven’t seen documented)

If there is one parameter, this is a list of the columns you want.

If there are two parameters, the second is the list of the columns you want displayed. The first column is conceptually a list of True or False, with one value per row, saying if the row should be selected or not. So for

print(opd.loc[opd["VOLSER"].str.startswith("A4"),["DSN","VOLSER",]])

opd[“VOLSER”].str.startswith(“A4”)

says take the column called VOLSER, convert it to a string. If it starts with the string “A4” then return True, else return False. This returns a list of one entry per row.
The opd.loc[opd[“VOLSER”].str.startswith(“A4”),…) then selects the rows.

You can select rows and columns

print(opd.loc[opd["VOLSER"].str.startswith("A4"),["DSN","VOLSER","63bit alloc space KB",]])

You can process the data, such as sort

The following statements extracts columns from the original data, sorts the data, and creates a new data frame. The new data frame is printed.

sdata= opd[["DSN","VOLSER","63bit alloc space KB",]].sort_values(by=["63bit alloc space KB","DSN"], ascending=False)
print(sdata)

This gave

                                             DSN  VOLSER  63bit alloc space KB
2   CBC.SCCNCMP                                   A4RES1                241043
35  SYS1.MACLIB                                   A4RES1                210664
36  SYS1.LINKLIB                                  A4RES1                166008
90  SYS1.SEEQINST                                 A4RES1                103534
42  SYS1.SAMPLIB                                  A4RES1                 82617
..                                           ...     ...                   ...
62  SYS1.SBPXTENU                                 A4RES1                    55
51  SYS1.SBDTMSG                                  A4RES1                    55
45  SYS1.SBDTCMD                                  A4RES1                    55
12  SYS1.ADFMAC1                                  A4RES1                    55
6   FFST.SEPWMOD3                                 A4RES1                    55

[98 rows x 3 columns]

Showing that all the rows, and all the (three) columns which had been copied to the sdata data frame.

Saving data

Reading an external file and processing the data into Python arrarys took an order of magnitude longer than processing it Pandas.

You should consider a two step approach to looking at data

Extract the data and exported it in an access format, such as Pickle or JSON. While getting this part working, use only a few rows of data. Once it works, you can process all of the data.
Do the analysis using the exported data.

Export the data

You should consider externalising the data in JSON or pickles format for example

# write out the data to a file
fPickle = open('pickledata', 'wb')    
    # source, destination
pickle.dump(opd, fPickle)                    
fPickle.close()

Import and do the analysis


# and read it in
fPickle = open('pickledata', 'rb')    
opd = pickle.load(fPickle)
fPickle.close()
print(odp)

Processing multiple data sources as one

If you have multiple sets of data, for example for Monday, Tuesday, Wednesday, etc you can use

week =  pd.concat(monday,tuesday,wednesday,thursday,friday)

Processing data within fields

Within my data, I have a field with information like

                   DSN  VOLSER           FormatType
0    SYS1.VVDS.VC4RES1  C4RES1                   []
1   SYS1.VTOCIX.C4RES1  C4RES1              [Fixed]
2          CBC.SCCNCMP  C4RES1    [Fixed, Variable]
3          CBC.SCLBDLL  C4RES1    [Fixed, Variable]
4         CBC.SCLBDLL2  C4RES1    [Fixed, Variable]

Where the data under FormatType is a list. You can reference elements in a list.

For example

x =  data.FormatType.apply(lambda x: 'Variable' in x)
print(x)

gives

0     False
1     False
2      True
3      True
4      True

The command

print(data.loc[ data.FormatType.apply(lambda x: 'Blocked' in x)])

gives

              DSN  VOLSER           FormatType
2     CBC.SCCNCMP  C4RES1    [Fixed, Variable]
3     CBC.SCLBDLL  C4RES1    [Fixed, Variable]
4    CBC.SCLBDLL2  C4RES1    [Fixed, Variable]

Basic operations on columns

You can do basic operations on columns such as

print(dataset[["CountIO","CacheHits"]].sum())

The sum() (and count() etc) functions add up the specified columns.

This gave

[361 rows x 10 columns]
CountIO      74667.0
CacheHits     1731.0
dtype: float64

An operation like

print(dataset.sum())

Would have totalled all the columns, including some which are meaningless, for example, maximum value found.

Doing aggregation, count, sum, maximum, minimum etc.

Simple aggregation

You can aggregate data

# Extract just the fields of interest
d = dataset[["DSN","CountIO","CacheHits"]]
print(d.groupby("DSN").sum())

Gave

                                        CountIO  CacheHits
DSN                                                       
ADCD.Z31B.PARMLIB                          68.0       60.0
ADCD.Z31B.PROCLIB                          66.0       66.0
ADCD.Z31B.VTAMLST                         141.0      141.0
COLIN.TCPPARMS                              4.0        4.0
FEU.Z31B.PARMLIB                            4.0        0.0
IXGLOGR.ATR.S0W1.RM.DATA.A0000000.DATA      4.0        0.0
SYS1.DAE                                    0.0        0.0
SYS1.DBBLIB                               974.0      932.0

More complex aggregation

The .agg() gives you much more control as to what, and how you want to process data.

print(d.groupby("DSN").agg({'DSN' : ['count'], 'CountIO' : ['sum','max'],"CacheHits": ["sum"]}))

gave

                                         DSN  CountIO          CacheHits
                                       count      sum      max       sum
DSN                                                                     
ADCD.Z31B.PARMLIB                         19     68.0      7.0      60.0
ADCD.Z31B.PROCLIB                         30     66.0      3.0      66.0
ADCD.Z31B.VTAMLST                          6    141.0     41.0     141.0
COLIN.TCPPARMS                             2      4.0      3.0       4.0
FEU.Z31B.PARMLIB                           1      4.0      4.0       0.0
IXGLOGR.ATR.S0W1.RM.DATA.A0000000.DATA     4      4.0      1.0       0.0
SYS1.DAE                                   1      0.0      NaN       0.0
SYS1.DBBLIB                                2    974.0    932.0     932.0

Notes:

The columns are not in the order I specified. It is hard to see which field Max applies to
There is a Not a Number (Nan) in one of the value. You need to allow for this.
In the simple case using .sum() by default it tries to sum all of the columns. Using .agg you can specify which columns you want to process