Pandas is a great tool for working with data in Python. You give it arrays of data, and it can display, summarise, group, print and plot them. It is used for everything from the simplest data, up to data analysts processing megabytes of data.
There is a lot of good information about getting started with Pandas, and about doing advanced things with it. I did the Pandas 101 level of reading but struggled with the next step, so my notes for the 102 level of reading are below. Knowing that something can be done means you can go and look for it; if you look but cannot find, it may be that you are using the wrong search terms, or that there is no information on it.
Working with data
I’ve been working with “flat files” on z/OS, for example the output of DCOLLECT, which is information about datasets etc. from SMS.
One lesson I learned was that you should isolate the extraction from the processing (except for trivial amounts of data). Extracting data from flat files can be expensive and take a long time; for example it may include conversion from EBCDIC to ASCII. It is better to capture the data from the flat file in Python variables, then write the data to disk using JSON or pickle (Python object serialisation). As a separate step, read the data back into memory from your saved file, then do your data processing work with Pandas or other tools.
Feeding data into Pandas
The work I’ve done has been two dimensional, rows and columns; you can also have multi-dimensional data.
You can use a list of dictionaries (dicts), or a dict of lists:
import pandas as pd

# a dict of lists
datad = {"dsn":["ABC","DEF"],
"volser":["SYSRES","USER02S"]}
pdd = pd.DataFrame(datad)
# a list of dicts
datal = [{"dsn":"ABC","volser":"SYSRES"},
{"dsn":"DEF","volser":"USER02S"},
]
pdl = pd.DataFrame.from_records(datal)
A statement like pdd = pd.DataFrame(datad) creates a pandas data frame. You take actions on this data frame. You can create other data frames from an original data frame, for example with a subset of the rows and columns.
I was processing a large amount of data, and found it easiest to create a dict for each row of data, and accumulate the rows in a list. Before I used Pandas, I had just printed out each row. Someone else used a dict of lists, and appended each row’s data to the “dsn” or “volser” list; I do not know which performs better.
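As a sketch of that row-at-a-time approach (get_next_record is made up; it stands for whatever parses one record from your flat file):
import pandas as pd

rows = []                                 # accumulate one dict per row
for record in get_next_record():          # hypothetical: yields parsed records as dicts
    rows.append({"dsn": record["dsn"],
                 "volser": record["volser"]})
opd = pd.DataFrame.from_records(rows)     # build the data frame in one go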
What can you do with it?
The first thing to do is print it. Once the data is in Pandas you can use either pdd or pdl from above.
print(pdd)
gave
dsn volser
0 ABC SYSRES
1 DEF USER02S
Where the 0, 1 are the row numbers of the data.
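If you do not want the row numbers in the output, the to_string method can suppress the index:
print(pdd.to_string(index=False))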
With my real data I got
DSN ... AllocSpace
0 SYS1.VVDS.VA4RES1 ... 1660
1 SYS1.VTOCIX.A4RES1 ... 830
2 CBC.SCCNCMP ... 241043
3 CBC.SCLBDLL ... 885
4 CBC.SCLBDLL2 ... 996
.. ... ... ...
93 SYS1.SERBLPA ... 498
94 SYS1.SERBMENU ... 277
95 SYS1.SERBPENU ... 17652
96 SYS1.SERBT ... 885
97 SYS1.SERBTENU ... 332
[98 rows x 7 columns]
The data was formatted to match my window size. With a larger window I got more columns.
You can change this by using
pd.set_option('display.max_rows', 500)
pd.set_option('display.max_columns', 500)
pd.set_option('display.width', 1000)
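If you only want the settings to apply to one print, pandas also has a context manager, option_context, which puts the options back afterwards:
with pd.option_context('display.max_rows', 500, 'display.max_columns', 500):
    print(opd)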
Which columns are displayed?
Rather than displaying all of the columns, you can select which ones are displayed.
You can tell the column names from the data you passed to Pandas, or use the command
print(list(opd.columns.values))
This displays the values of the column names, as a list.
To display the columns you specify use
print(opd[["DSN","VOLSER","ExpDate","CrDate","LastRef","63bit alloc space KB", "AllocSpace"]])
You can say display all but the specified columns
print(opd.loc[:, ~opd.columns.isin(["ExpDate","CrDate","LastRef"])])
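An alternative way of saying “all but these columns” is the drop method, which returns a new data frame without the named columns:
print(opd.drop(columns=["ExpDate","CrDate","LastRef"]))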
Select which rows you want displayed
print(opd.loc[opd["VOLSER"].str.startswith("A4"),["DSN","VOLSER",]])
or
print(opd.loc[opd["DSN"].str.startswith("SYS1."),["DSN","VOLSER",]])
gave
DSN VOLSER
0 SYS1.VVDS.VA4RES1 A4RES1
1 SYS1.VTOCIX.A4RES1 A4RES1
12 SYS1.ADFMAC1 A4RES1
13 SYS1.CBRDBRM A4RES1
14 SYS1.CMDLIB A4RES1
.. ... ...
93 SYS1.SERBLPA A4RES1
94 SYS1.SERBMENU A4RES1
95 SYS1.SERBPENU A4RES1
96 SYS1.SERBT A4RES1
97 SYS1.SERBTENU A4RES1
[88 rows x 2 columns]
From this we can see 88 (out of 98) rows were displayed: rows 0, 1, 12, 13, …, but not rows 2, 3, …
What does .loc do?
My interpretation of this (which I haven’t seen documented) is:
If there is one parameter, it selects the rows you want; all of the columns are displayed.
If there are two parameters, the second is the list of the columns you want displayed. The first parameter is conceptually a list of True or False values, one per row, saying if the row should be selected or not. So for
print(opd.loc[opd["VOLSER"].str.startswith("A4"),["DSN","VOLSER",]])
opd["VOLSER"].str.startswith("A4")
says take the column called VOLSER, treat each value as a string, and return True if it starts with the string "A4", else False. This gives a boolean Series with one entry per row.
The opd.loc[opd["VOLSER"].str.startswith("A4"), …] then selects those rows.
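A minimal example of this, using the small pdl data frame from earlier:
mask = pdl["volser"].str.startswith("USER")   # boolean Series: False, True
print(pdl.loc[mask, ["dsn"]])                 # prints just the DEF row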
You can select rows and columns
print(opd.loc[opd["VOLSER"].str.startswith("A4"),["DSN","VOLSER","63bit alloc space KB",]])
You can process the data, such as sort
The following statement extracts columns from the original data, sorts the data, and creates a new data frame, which is then printed.
sdata= opd[["DSN","VOLSER","63bit alloc space KB",]].sort_values(by=["63bit alloc space KB","DSN"], ascending=False)
print(sdata)
This gave
DSN VOLSER 63bit alloc space KB
2 CBC.SCCNCMP A4RES1 241043
35 SYS1.MACLIB A4RES1 210664
36 SYS1.LINKLIB A4RES1 166008
90 SYS1.SEEQINST A4RES1 103534
42 SYS1.SAMPLIB A4RES1 82617
.. ... ... ...
62 SYS1.SBPXTENU A4RES1 55
51 SYS1.SBDTMSG A4RES1 55
45 SYS1.SBDTCMD A4RES1 55
12 SYS1.ADFMAC1 A4RES1 55
6 FFST.SEPWMOD3 A4RES1 55
[98 rows x 3 columns]
This shows all of the rows, and all of the (three) columns, which had been copied to the sdata data frame.
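Note that ascending=False applies to both sort keys. If you want, for example, descending size but ascending DSN within equal sizes, you can pass one value per key:
sdata = opd[["DSN","VOLSER","63bit alloc space KB",]].sort_values(
    by=["63bit alloc space KB","DSN"], ascending=[False, True])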
Saving data
Reading an external file and processing the data into Python arrays took an order of magnitude longer than processing it in Pandas.
You should consider a two step approach to looking at data
- Extract the data and export it in an accessible format, such as pickle or JSON. While getting this part working, use only a few rows of data. Once it works, you can process all of the data.
- Do the analysis using the exported data.
Export the data
You should consider externalising the data in JSON or pickle format, for example
import pickle

# write out the data to a file
fPickle = open('pickledata', 'wb')
# source, destination
pickle.dump(opd, fPickle)
fPickle.close()
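Pandas data frames also have their own serialisation methods, so an alternative (which I have not compared for speed) is
opd.to_pickle('pickledata')   # one line instead of open/dump/close
opd.to_json('jsondata')       # the JSON flavour; read back with pd.read_json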
Import and do the analysis
import pickle

# and read it in
fPickle = open('pickledata', 'rb')
opd = pickle.load(fPickle)
fPickle.close()
print(opd)
Processing multiple data sources as one
If you have multiple sets of data, for example for Monday, Tuesday, Wednesday, etc., you can use
week = pd.concat([monday, tuesday, wednesday, thursday, friday])
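The row numbers from each day are kept, so you can get duplicates (0, 1, 2, … then 0, 1, 2, … again). If you want the combined data renumbered, use
week = pd.concat([monday, tuesday, wednesday, thursday, friday], ignore_index=True)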
Processing data within fields
Within my data, I have a field with information like
DSN VOLSER FormatType
0 SYS1.VVDS.VC4RES1 C4RES1 []
1 SYS1.VTOCIX.C4RES1 C4RES1 [Fixed]
2 CBC.SCCNCMP C4RES1 [Fixed, Variable]
3 CBC.SCLBDLL C4RES1 [Fixed, Variable]
4 CBC.SCLBDLL2 C4RES1 [Fixed, Variable]
Where the data under FormatType is a list. You can reference elements in a list.
For example
x = data.FormatType.apply(lambda x: 'Variable' in x)
print(x)
gives
0 False
1 False
2 True
3 True
4 True
The command
print(data.loc[data.FormatType.apply(lambda x: 'Variable' in x)])
gives
DSN VOLSER FormatType
2 CBC.SCCNCMP C4RES1 [Fixed, Variable]
3 CBC.SCLBDLL C4RES1 [Fixed, Variable]
4 CBC.SCLBDLL2 C4RES1 [Fixed, Variable]
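If you want to work with the individual list entries rather than the whole list, explode (pandas 0.25 or later) gives one row per list element, with the other columns repeated:
print(data.explode("FormatType"))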
Basic operations on columns
You can do basic operations on columns such as
print(dataset[["CountIO","CacheHits"]].sum())
The sum() function adds up the values in the specified columns; count(), max() etc. work in a similar way.
This gave
CountIO 74667.0
CacheHits 1731.0
dtype: float64
An operation like
print(dataset.sum())
would have totalled all the columns, including some where a total is meaningless, for example a column holding a maximum value found.
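One way of restricting this to columns where a total makes sense is the numeric_only option:
print(dataset.sum(numeric_only=True))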
Doing aggregation, count, sum, maximum, minimum etc.
Simple aggregation
You can aggregate data
# Extract just the fields of interest
d = dataset[["DSN","CountIO","CacheHits"]]
print(d.groupby("DSN").sum())
Gave
CountIO CacheHits
DSN
ADCD.Z31B.PARMLIB 68.0 60.0
ADCD.Z31B.PROCLIB 66.0 66.0
ADCD.Z31B.VTAMLST 141.0 141.0
COLIN.TCPPARMS 4.0 4.0
FEU.Z31B.PARMLIB 4.0 0.0
IXGLOGR.ATR.S0W1.RM.DATA.A0000000.DATA 4.0 0.0
SYS1.DAE 0.0 0.0
SYS1.DBBLIB 974.0 932.0
More complex aggregation
The .agg() method gives you much more control over what you process, and how you process it.
print(d.groupby("DSN").agg({'DSN' : ['count'], 'CountIO' : ['sum','max'],"CacheHits": ["sum"]}))
gave
DSN CountIO CacheHits
count sum max sum
DSN
ADCD.Z31B.PARMLIB 19 68.0 7.0 60.0
ADCD.Z31B.PROCLIB 30 66.0 3.0 66.0
ADCD.Z31B.VTAMLST 6 141.0 41.0 141.0
COLIN.TCPPARMS 2 4.0 3.0 4.0
FEU.Z31B.PARMLIB 1 4.0 4.0 0.0
IXGLOGR.ATR.S0W1.RM.DATA.A0000000.DATA 4 4.0 1.0 0.0
SYS1.DAE 1 0.0 NaN 0.0
SYS1.DBBLIB 2 974.0 932.0 932.0
Notes:
- The columns are not in the order I specified, and it is hard to see which field max applies to; named aggregation (see the sketch after these notes) gives clearer headings.
- There is a Not a Number (NaN) in one of the values. You need to allow for this.
- In the simple case, .sum() by default tries to sum all of the columns; using .agg you can specify which columns you want to process.
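One way round the first two notes (with pandas 0.25 or later) is named aggregation, where you choose the name of each output column, so the headings are flat and self-describing. The output names below are my own:
print(d.groupby("DSN").agg(
    count=("DSN", "count"),
    CountIO_sum=("CountIO", "sum"),
    CountIO_max=("CountIO", "max"),
    CacheHits_sum=("CacheHits", "sum"),
).fillna(0))   # fillna(0) replaces the NaN values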