Pandas 102, notes on using pandas

Pandas is a great tool for working with data from Python. You give it arrays of data, and it can display, summarise, group, print and plot. It scales from the simplest data sets up to data analysts processing megabytes of data.

There is a lot of good information about getting started with Pandas, and about doing advanced things with it. I did the Pandas 101 level of reading, but struggled with the next step, so my notes for the 102 level of reading are below. Knowing that something can be done means you can go and look for it. If you look but cannot find it, you may be using the wrong search arguments, or there may be no information on it.

Working with data

I’ve been working with “flat files” on z/OS, for example the output of DCOLLECT, which is information about data sets etc. from SMS.

One lesson I learned was that you should isolate the extraction from the processing (except for trivial amounts of data). Extracting data from flat files can be expensive and take a long time; for example, it may include conversion from EBCDIC to ASCII. It is better to capture the data from the flat file in Python variables, then write the data to disk using JSON or pickle (Python object serialisation). As a separate step, read the data into memory from your saved file, then do your data processing work with Pandas or other tools.

Feeding data into Pandas

The work I’ve done has been two dimensional, rows and columns; you can also have multi-dimensional data.

You can use a list of dictionaries (dicts), or a dict of lists:

import pandas as pd

# a dict of lists
datad = {"dsn": ["ABC", "DEF"],
         "volser": ["SYSRES", "USER02"]}
pdd = pd.DataFrame(datad)

# a list of dicts
datal = [{"dsn": "ABC", "volser": "SYSRES"},
         {"dsn": "DEF", "volser": "USER02"},
         ]

pdl = pd.DataFrame.from_records(datal)

A statement like pdd = pd.DataFrame(datad) creates a pandas data frame. You take actions on this data frame. You can create other data frames from an original data frame, for example with a subset of the rows and columns.
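For example, a minimal sketch using pdd from above:

# a new data frame with a subset of the columns
sub = pdd[["dsn"]]
# a new data frame with just the first row
first = pdd.head(1)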

I was processing a large amount of data, and found it easiest to create a dict for each row of data, and accumulate the rows in a list; a sketch of this is below. Before I used Pandas, I had just printed out each row. I do not know which approach performs better. Someone else used a dict of lists, and appended each row’s data to the “dsn” or “volser” list.
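A minimal sketch of that approach; dcollect_records and its fields are made-up names for illustration:

rows = []
for record in dcollect_records:
    # build one dict per row of data
    rows.append({"dsn": record.dsn, "volser": record.volser})
# then hand the accumulated list to Pandas in one go
opd = pd.DataFrame.from_records(rows)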

What can you do with it?

The first thing is to print it. Once the data is in Pandas you can use either pdd or pdl above.

print(pdd)

gave

   dsn  volser
0  ABC  SYSRES
1  DEF  USER02

Where 0 and 1 are the row numbers (the index) of the data.

With my real data I got

                   DSN  ...  AllocSpace
0    SYS1.VVDS.VA4RES1  ...        1660
1   SYS1.VTOCIX.A4RES1  ...         830
2          CBC.SCCNCMP  ...      241043
3          CBC.SCLBDLL  ...         885
4         CBC.SCLBDLL2  ...         996
..                 ...  ...         ...
93        SYS1.SERBLPA  ...         498
94       SYS1.SERBMENU  ...         277
95       SYS1.SERBPENU  ...       17652
96          SYS1.SERBT  ...         885
97       SYS1.SERBTENU  ...         332

[98 rows x 7 columns]

The data was formatted to match my window size. With a larger window I got more columns.

You can change this by using

pd.set_option('display.max_rows', 500)
pd.set_option('display.max_columns', 500)
pd.set_option('display.width', 1000)
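These settings change the defaults globally. If you only want the wider output for a single print, pd.option_context scopes the settings to a with block:

# options apply only inside the with block
with pd.option_context('display.max_rows', 500, 'display.width', 1000):
    print(opd)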

Which columns are displayed?

Rather than displaying all of the columns, you can select which columns are displayed.

You can tell the column names from the data you passed to Pandas, or use the command

print(list(opd.columns.values))

This displays the values of the column names, as a list.
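With my real data (seven columns) this gives a list like

['DSN', 'VOLSER', 'ExpDate', 'CrDate', 'LastRef', '63bit alloc space KB', 'AllocSpace']

which are the column names used in the examples below.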

To display the columns you specify use

print(opd[["DSN","VOLSER","ExpDate","CrDate","LastRef","63bit alloc space KB", "AllocSpace"]])

You can also display all but the specified columns

print(opd.loc[:, ~opd.columns.isin(["ExpDate","CrDate","LastRef"])])

Select which rows you want displayed

print(opd.loc[opd["VOLSER"].str.startswith("A4"),["DSN","VOLSER",]])

or

print(opd.loc[opd["DSN"].str.startswith("SYS1."),["DSN","VOLSER",]])

gave

                   DSN  VOLSER
0    SYS1.VVDS.VA4RES1  A4RES1
1   SYS1.VTOCIX.A4RES1  A4RES1
12        SYS1.ADFMAC1  A4RES1
13        SYS1.CBRDBRM  A4RES1
14         SYS1.CMDLIB  A4RES1
..                 ...     ...
93        SYS1.SERBLPA  A4RES1
94       SYS1.SERBMENU  A4RES1
95       SYS1.SERBPENU  A4RES1
96          SYS1.SERBT  A4RES1
97       SYS1.SERBTENU  A4RES1

[88 rows x 2 columns]

From this we can see that 88 (out of 98) rows were displayed: rows 0, 1, 12, 13, … but not rows 2, 3, …

What does .loc do?

My interpretation of this (which I have not seen documented so clearly) is:

If there is one parameter, it selects rows.

If there are two parameters, the second is the list of the columns you want displayed. The first parameter is conceptually a list of True or False values, with one value per row, saying whether the row should be selected. So for

print(opd.loc[opd["VOLSER"].str.startswith("A4"),["DSN","VOLSER",]])

opd["VOLSER"].str.startswith("A4")

says take the column called VOLSER, treat its values as strings, and return True for each row whose value starts with the string "A4", else False. This returns a list (a boolean Series) with one entry per row.
The opd.loc[opd["VOLSER"].str.startswith("A4"), …] then selects the rows.
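You can see the two steps separately; a minimal sketch using the data frame above:

# boolean Series, one True/False per row
mask = opd["VOLSER"].str.startswith("A4")
print(mask.head())
# same result as the one-line form
print(opd.loc[mask, ["DSN", "VOLSER"]])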

You can select rows and columns

print(opd.loc[opd["VOLSER"].str.startswith("A4"),["DSN","VOLSER","63bit alloc space KB",]])

You can process the data, such as sort

The following statements extract columns from the original data, sort the data, and create a new data frame. The new data frame is printed.

sdata = opd[["DSN","VOLSER","63bit alloc space KB",]].sort_values(by=["63bit alloc space KB","DSN"], ascending=False)
print(sdata)
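If you want the sort columns to go in different directions, ascending can also be a list with one entry per sort column:

# descending by size, ascending by name within equal sizes
sdata = opd[["DSN", "VOLSER", "63bit alloc space KB"]].sort_values(
    by=["63bit alloc space KB", "DSN"], ascending=[False, True])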

This gave

                  DSN  VOLSER  63bit alloc space KB
2         CBC.SCCNCMP  A4RES1                241043
35        SYS1.MACLIB  A4RES1                210664
36       SYS1.LINKLIB  A4RES1                166008
90      SYS1.SEEQINST  A4RES1                103534
42       SYS1.SAMPLIB  A4RES1                 82617
..                ...     ...                   ...
62      SYS1.SBPXTENU  A4RES1                    55
51       SYS1.SBDTMSG  A4RES1                    55
45       SYS1.SBDTCMD  A4RES1                    55
12       SYS1.ADFMAC1  A4RES1                    55
6       FFST.SEPWMOD3  A4RES1                    55

[98 rows x 3 columns]

showing all of the rows, and all (three) of the columns, that had been copied to the sdata data frame.

Saving data

Reading an external file and processing the data into Python arrays took an order of magnitude longer than processing it in Pandas.

You should consider a two step approach to looking at data:

  • Extract the data and export it in an accessible format, such as pickle or JSON. While getting this part working, use only a few rows of data. Once it works, you can process all of the data.
  • Do the analysis using the exported data.

Export the data

You should consider externalising the data in JSON or pickle format, for example

import pickle

# write out the data to a file
fPickle = open('pickledata', 'wb')
# source, destination
pickle.dump(opd, fPickle)
fPickle.close()

Import and do the analysis


import pickle

# and read it in
fPickle = open('pickledata', 'rb')
opd = pickle.load(fPickle)
fPickle.close()
print(opd)
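The JSON alternative looks similar; a minimal sketch using pandas’ own to_json and read_json (the file name is made up):

# export step
opd.to_json('opd.json', orient='records')

# import step, in the analysis program
import pandas as pd
opd = pd.read_json('opd.json', orient='records')
print(opd)

Note that a JSON round trip may not preserve all column types (dates, for example), which is one reason to prefer pickle.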

Processing multiple data sources as one

If you have multiple sets of data, for example for Monday, Tuesday, Wednesday, etc., you can use

week = pd.concat([monday, tuesday, wednesday, thursday, friday])
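Note that pd.concat takes a list of data frames. A minimal sketch, assuming each day’s data frame has the same columns (the data here is made up):

import pandas as pd

monday = pd.DataFrame({"DSN": ["A.B"], "CountIO": [10]})
tuesday = pd.DataFrame({"DSN": ["C.D"], "CountIO": [20]})

# ignore_index=True renumbers the rows of the combined frame
week = pd.concat([monday, tuesday], ignore_index=True)
print(week)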

Processing data within fields

Within my data, I have a field with information like

                  DSN  VOLSER         FormatType
0   SYS1.VVDS.VC4RES1  C4RES1                 []
1  SYS1.VTOCIX.C4RES1  C4RES1            [Fixed]
2         CBC.SCCNCMP  C4RES1  [Fixed, Variable]
3         CBC.SCLBDLL  C4RES1  [Fixed, Variable]
4        CBC.SCLBDLL2  C4RES1  [Fixed, Variable]

Where the data under FormatType is a list. You can reference elements in the list.

For example

x = data.FormatType.apply(lambda x: 'Variable' in x)
print(x)

gives

0    False
1    False
2     True
3     True
4     True

The command

print(data.loc[data.FormatType.apply(lambda x: 'Variable' in x)])

gives

            DSN  VOLSER         FormatType
2   CBC.SCCNCMP  C4RES1  [Fixed, Variable]
3   CBC.SCLBDLL  C4RES1  [Fixed, Variable]
4  CBC.SCLBDLL2  C4RES1  [Fixed, Variable]

Basic operations on columns

You can do basic operations on columns such as

print(dataset[["CountIO","CacheHits"]].sum())

The sum() function adds up the specified columns; count(), max() etc. work similarly.

This gave

CountIO      74667.0
CacheHits     1731.0
dtype: float64

An operation like

print(dataset.sum())

would have totalled all the columns, including some for which a total is meaningless, for example a column holding the maximum value found.
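You can at least restrict the total to numeric columns with the numeric_only option, though this still totals every numeric column, so listing the columns explicitly, as above, is safer:

# skip non-numeric columns such as DSN when totalling
print(dataset.sum(numeric_only=True))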

Doing aggregation, count, sum, maximum, minimum etc.

Simple aggregation

You can aggregate data

# Extract just the fields of interest
d = dataset[["DSN","CountIO","CacheHits"]]
print(d.groupby("DSN").sum())

Gave

                                        CountIO  CacheHits
DSN
ADCD.Z31B.PARMLIB                          68.0       60.0
ADCD.Z31B.PROCLIB                          66.0       66.0
ADCD.Z31B.VTAMLST                         141.0      141.0
COLIN.TCPPARMS                              4.0        4.0
FEU.Z31B.PARMLIB                            4.0        0.0
IXGLOGR.ATR.S0W1.RM.DATA.A0000000.DATA      4.0        0.0
SYS1.DAE                                    0.0        0.0
SYS1.DBBLIB                               974.0      932.0

More complex aggregation

The .agg() function gives you much more control over what data you process, and how.

print(d.groupby("DSN").agg({'DSN' : ['count'], 'CountIO' : ['sum','max'],"CacheHits": ["sum"]}))

gave

                                         DSN CountIO         CacheHits
                                       count     sum     max       sum
DSN
ADCD.Z31B.PARMLIB                         19    68.0     7.0      60.0
ADCD.Z31B.PROCLIB                         30    66.0     3.0      66.0
ADCD.Z31B.VTAMLST                          6   141.0    41.0     141.0
COLIN.TCPPARMS                             2     4.0     3.0       4.0
FEU.Z31B.PARMLIB                           1     4.0     4.0       0.0
IXGLOGR.ATR.S0W1.RM.DATA.A0000000.DATA     4     4.0     1.0       0.0
SYS1.DAE                                   1     0.0     NaN       0.0
SYS1.DBBLIB                                2   974.0   932.0     932.0

Notes:

  • The columns are not in the order I specified, and it is hard to see which field max applies to; see the sketch below.
  • There is a Not a Number (NaN) in one of the values. You need to allow for this.
  • In the simple case, .sum() by default tries to sum all of the columns. Using .agg() you can specify which columns you want to process.
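If the column order and labels matter, named aggregation (a standard pandas feature, sketched here against the data frame above) lets you name each output column and fixes the order:

# each keyword gives the output column name; the tuple is (input column, function)
print(d.groupby("DSN").agg(
    count=("DSN", "count"),
    CountIO_sum=("CountIO", "sum"),
    CountIO_max=("CountIO", "max"),
    CacheHits_sum=("CacheHits", "sum"),
))

This produces a flat set of column names in the order specified, so it is clear which column max applies to.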
