Pandas 102, notes on using pandas

Pandas is a great tool for working with data from Python. You give it arrays of data, and it can display, summarise, group, print and plot. It scales from the simplest data sets up to data analysts processing megabytes of data.

There is a lot of good information about getting started with Pandas, and about doing advanced things with it. I did the Pandas 101 level of reading, but struggled with the next step, so my notes for the 102 level of reading are below. Knowing that something can be done means you can go and look for it. If you look but cannot find it, you may be using the wrong search arguments, or there may be no information on it.

Working with data

I’ve been working with “flat files” on z/OS, for example the output of DCOLLECT, which is information about data sets etc. from SMS.

One lesson I learned was that you should isolate the extraction from the processing (except for trivial amounts of data). Extracting data from flat files can be expensive and take a long time; for example, it may include conversion from EBCDIC to ASCII. It is better to capture the data from the flat file in Python variables, then write the data to disk using JSON or pickle (Python object serialisation). As a separate step, read the data into memory from your saved file, then do your data processing work with Pandas or other tools.

Feeding data into Pandas

The work I’ve done has been two dimensional, rows and columns; you can also have multi-dimensional data.

You can use a list of dictionaries (dicts), or a dict of lists:

import pandas as pd

# a dict of lists
datad = {"dsn": ["ABC", "DEF"],
         "volser": ["SYSRES", "USER02"]}
pdd = pd.DataFrame(datad)

# a list of dicts
datal = [{"dsn": "ABC", "volser": "SYSRES"},
         {"dsn": "DEF", "volser": "USER02"},
         ]

pdl = pd.DataFrame.from_records(datal)

A statement like pdd = pd.DataFrame(datad) creates a pandas data frame. You take actions on this data frame. You can create other data frames from an original data frame, for example with a subset of the rows and columns.
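For example, a minimal sketch using pdd from above:

# a new data frame with a subset of the columns
sub = pdd[["dsn"]]
# a new data frame with just the first row
first = pdd.head(1)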

I was processing a large amount of data, and found it easiest to create a dict for each row of data, and accumulate the rows in a list; a sketch of this is below. Before I used Pandas, I had just printed out each row. I do not know which approach performs better. Someone else used a dict of lists, and appended each row’s data to the “dsn” or “volser” list.
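A minimal sketch of that approach; dcollect_records and its fields are made-up names for illustration:

rows = []
for record in dcollect_records:
    # build one dict per row of data
    rows.append({"dsn": record.dsn, "volser": record.volser})
# then hand the accumulated list to Pandas in one go
opd = pd.DataFrame.from_records(rows)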

What can you do with it?

The first thing is to print it. Once the data is in Pandas you can use either pdd or pdl above.

print(pdd)

gave

   dsn  volser
0  ABC  SYSRES
1  DEF  USER02

Where 0 and 1 are the row numbers (the index) of the data.

With my real data I got

                   DSN  ...  AllocSpace
0    SYS1.VVDS.VA4RES1  ...        1660
1   SYS1.VTOCIX.A4RES1  ...         830
2          CBC.SCCNCMP  ...      241043
3          CBC.SCLBDLL  ...         885
4         CBC.SCLBDLL2  ...         996
..                 ...  ...         ...
93        SYS1.SERBLPA  ...         498
94       SYS1.SERBMENU  ...         277
95       SYS1.SERBPENU  ...       17652
96          SYS1.SERBT  ...         885
97       SYS1.SERBTENU  ...         332

[98 rows x 7 columns]

The data was formatted to match my window size. With a larger window I got more columns.

You can change this by using

pd.set_option('display.max_rows', 500)
pd.set_option('display.max_columns', 500)
pd.set_option('display.width', 1000)
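These settings change the defaults globally. If you only want the wider output for a single print, pd.option_context scopes the settings to a with block:

# options apply only inside the with block
with pd.option_context('display.max_rows', 500, 'display.width', 1000):
    print(opd)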

Which columns are displayed?

Rather than displaying all of the columns, you can select which columns are displayed.

You can tell the column names from the data you passed to Pandas, or use the command

print(list(opd.columns.values))

This displays the values of the column names, as a list.
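With my real data (seven columns) this gives a list like

['DSN', 'VOLSER', 'ExpDate', 'CrDate', 'LastRef', '63bit alloc space KB', 'AllocSpace']

which are the column names used in the examples below.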

To display the columns you specify use

print(opd[["DSN","VOLSER","ExpDate","CrDate","LastRef","63bit alloc space KB", "AllocSpace"]])

You can also display all but the specified columns

print(opd.loc[:, ~opd.columns.isin(["ExpDate","CrDate","LastRef"])])

Select which rows you want displayed

print(opd.loc[opd["VOLSER"].str.startswith("A4"),["DSN","VOLSER",]])

or

print(opd.loc[opd["DSN"].str.startswith("SYS1."),["DSN","VOLSER",]])

gave

                   DSN  VOLSER
0    SYS1.VVDS.VA4RES1  A4RES1
1   SYS1.VTOCIX.A4RES1  A4RES1
12        SYS1.ADFMAC1  A4RES1
13        SYS1.CBRDBRM  A4RES1
14         SYS1.CMDLIB  A4RES1
..                 ...     ...
93        SYS1.SERBLPA  A4RES1
94       SYS1.SERBMENU  A4RES1
95       SYS1.SERBPENU  A4RES1
96          SYS1.SERBT  A4RES1
97       SYS1.SERBTENU  A4RES1

[88 rows x 2 columns]

From this we can see that 88 (out of 98) rows were displayed: rows 0, 1, 12, 13, … but not rows 2, 3, …

What does .loc do?

My interpretation of this (which I have not seen documented so clearly) is:

If there is one parameter, it selects rows.

If there are two parameters, the second is the list of the columns you want displayed. The first parameter is conceptually a list of True or False values, with one value per row, saying whether the row should be selected. So for

print(opd.loc[opd["VOLSER"].str.startswith("A4"),["DSN","VOLSER",]])

opd["VOLSER"].str.startswith("A4")

says take the column called VOLSER, treat its values as strings, and return True for each row whose value starts with the string "A4", else False. This returns a list (a boolean Series) with one entry per row.
The opd.loc[opd["VOLSER"].str.startswith("A4"), …] then selects the rows.
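You can see the two steps separately; a minimal sketch using the data frame above:

# boolean Series, one True/False per row
mask = opd["VOLSER"].str.startswith("A4")
print(mask.head())
# same result as the one-line form
print(opd.loc[mask, ["DSN", "VOLSER"]])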

You can select rows and columns

print(opd.loc[opd["VOLSER"].str.startswith("A4"),["DSN","VOLSER","63bit alloc space KB",]])

You can process the data, such as sort

The following statements extract columns from the original data, sort the data, and create a new data frame. The new data frame is printed.

sdata = opd[["DSN","VOLSER","63bit alloc space KB",]].sort_values(by=["63bit alloc space KB","DSN"], ascending=False)
print(sdata)
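If you want the sort columns to go in different directions, ascending can also be a list with one entry per sort column:

# descending by size, ascending by name within equal sizes
sdata = opd[["DSN", "VOLSER", "63bit alloc space KB"]].sort_values(
    by=["63bit alloc space KB", "DSN"], ascending=[False, True])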

This gave

                  DSN  VOLSER  63bit alloc space KB
2         CBC.SCCNCMP  A4RES1                241043
35        SYS1.MACLIB  A4RES1                210664
36       SYS1.LINKLIB  A4RES1                166008
90      SYS1.SEEQINST  A4RES1                103534
42       SYS1.SAMPLIB  A4RES1                 82617
..                ...     ...                   ...
62      SYS1.SBPXTENU  A4RES1                    55
51       SYS1.SBDTMSG  A4RES1                    55
45       SYS1.SBDTCMD  A4RES1                    55
12       SYS1.ADFMAC1  A4RES1                    55
6       FFST.SEPWMOD3  A4RES1                    55

[98 rows x 3 columns]

showing all of the rows, and all (three) of the columns, that had been copied to the sdata data frame.

Saving data

Reading an external file and processing the data into Python arrays took an order of magnitude longer than processing it in Pandas.

You should consider a two step approach to looking at data:

  • Extract the data and export it in an accessible format, such as pickle or JSON. While getting this part working, use only a few rows of data. Once it works, you can process all of the data.
  • Do the analysis using the exported data.

Export the data

You should consider externalising the data in JSON or pickle format, for example

import pickle

# write out the data to a file
fPickle = open('pickledata', 'wb')
# source, destination
pickle.dump(opd, fPickle)
fPickle.close()

Import and do the analysis


import pickle

# and read it in
fPickle = open('pickledata', 'rb')
opd = pickle.load(fPickle)
fPickle.close()
print(opd)
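The JSON alternative looks similar; a minimal sketch using pandas’ own to_json and read_json (the file name is made up):

# export step
opd.to_json('opd.json', orient='records')

# import step, in the analysis program
import pandas as pd
opd = pd.read_json('opd.json', orient='records')
print(opd)

Note that a JSON round trip may not preserve all column types (dates, for example), which is one reason to prefer pickle.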

Processing multiple data sources as one

If you have multiple sets of data, for example for Monday, Tuesday, Wednesday, etc., you can use

week = pd.concat([monday, tuesday, wednesday, thursday, friday])
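Note that pd.concat takes a list of data frames. A minimal sketch, assuming each day’s data frame has the same columns (the data here is made up):

import pandas as pd

monday = pd.DataFrame({"DSN": ["A.B"], "CountIO": [10]})
tuesday = pd.DataFrame({"DSN": ["C.D"], "CountIO": [20]})

# ignore_index=True renumbers the rows of the combined frame
week = pd.concat([monday, tuesday], ignore_index=True)
print(week)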

Processing data within fields

Within my data, I have a field with information like

                  DSN  VOLSER         FormatType
0   SYS1.VVDS.VC4RES1  C4RES1                 []
1  SYS1.VTOCIX.C4RES1  C4RES1            [Fixed]
2         CBC.SCCNCMP  C4RES1  [Fixed, Variable]
3         CBC.SCLBDLL  C4RES1  [Fixed, Variable]
4        CBC.SCLBDLL2  C4RES1  [Fixed, Variable]

Where the data under FormatType is a list. You can reference elements in the list.

For example

x = data.FormatType.apply(lambda x: 'Variable' in x)
print(x)

gives

0    False
1    False
2     True
3     True
4     True

The command

print(data.loc[data.FormatType.apply(lambda x: 'Variable' in x)])

gives

            DSN  VOLSER         FormatType
2   CBC.SCCNCMP  C4RES1  [Fixed, Variable]
3   CBC.SCLBDLL  C4RES1  [Fixed, Variable]
4  CBC.SCLBDLL2  C4RES1  [Fixed, Variable]

Basic operations on columns

You can do basic operations on columns such as

print(dataset[["CountIO","CacheHits"]].sum())

The sum() function adds up the specified columns; count(), max() etc. work similarly.

This gave

CountIO      74667.0
CacheHits     1731.0
dtype: float64

An operation like

print(dataset.sum())

would have totalled all the columns, including some for which a total is meaningless, for example a column holding the maximum value found.
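You can at least restrict the total to numeric columns with the numeric_only option, though this still totals every numeric column, so listing the columns explicitly, as above, is safer:

# skip non-numeric columns such as DSN when totalling
print(dataset.sum(numeric_only=True))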

Doing aggregation, count, sum, maximum, minimum etc.

Simple aggregation

You can aggregate data

# Extract just the fields of interest
d = dataset[["DSN","CountIO","CacheHits"]]
print(d.groupby("DSN").sum())

Gave

                                        CountIO  CacheHits
DSN
ADCD.Z31B.PARMLIB                          68.0       60.0
ADCD.Z31B.PROCLIB                          66.0       66.0
ADCD.Z31B.VTAMLST                         141.0      141.0
COLIN.TCPPARMS                              4.0        4.0
FEU.Z31B.PARMLIB                            4.0        0.0
IXGLOGR.ATR.S0W1.RM.DATA.A0000000.DATA      4.0        0.0
SYS1.DAE                                    0.0        0.0
SYS1.DBBLIB                               974.0      932.0

More complex aggregation

The .agg() function gives you much more control over what data you process, and how.

print(d.groupby("DSN").agg({'DSN' : ['count'], 'CountIO' : ['sum','max'],"CacheHits": ["sum"]}))

gave

                                         DSN CountIO         CacheHits
                                       count     sum     max       sum
DSN
ADCD.Z31B.PARMLIB                         19    68.0     7.0      60.0
ADCD.Z31B.PROCLIB                         30    66.0     3.0      66.0
ADCD.Z31B.VTAMLST                          6   141.0    41.0     141.0
COLIN.TCPPARMS                             2     4.0     3.0       4.0
FEU.Z31B.PARMLIB                           1     4.0     4.0       0.0
IXGLOGR.ATR.S0W1.RM.DATA.A0000000.DATA     4     4.0     1.0       0.0
SYS1.DAE                                   1     0.0     NaN       0.0
SYS1.DBBLIB                                2   974.0   932.0     932.0

Notes:

  • The columns are not in the order I specified, and it is hard to see which field max applies to; see the sketch below.
  • There is a Not a Number (NaN) in one of the values. You need to allow for this.
  • In the simple case, .sum() by default tries to sum all of the columns. Using .agg() you can specify which columns you want to process.
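If the column order and labels matter, named aggregation (a standard pandas feature, sketched here against the data frame above) lets you name each output column and fixes the order:

# each keyword gives the output column name; the tuple is (input column, function)
print(d.groupby("DSN").agg(
    count=("DSN", "count"),
    CountIO_sum=("CountIO", "sum"),
    CountIO_max=("CountIO", "max"),
    CacheHits_sum=("CacheHits", "sum"),
))

This produces a flat set of column names in the order specified, so it is clear which column max applies to.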
