Pandas is a great tool for working with data from Python. You give it arrays of data, and it can display, summarise, group, print and plot it. It is used for everything from the simplest data up to data analysts processing megabytes of data.
There is a lot of good information about getting started with Pandas, and about doing advanced things with it. I managed the Pandas 101 level of reading, but struggled with the next step, so my notes for the 102 level of reading are below. Knowing that something can be done means you can go and look for it. If you look but cannot find it, it may be that you are using the wrong search terms, or that there is no information on it.
- Working with data
- Feeding data into Pandas
- What can you do with it?
- Which columns are displayed?
- Select which rows you want displayed
- You can select rows and columns
- You can process the data, such as sort
- Saving data
- Processing multiple data sources as one
Working with data
I’ve been working with “flat files” on z/OS, for example the output of DCOLLECT, which is information about datasets etc. from SMS.
One lesson I learned was you should isolate the extraction from the processing (except for trivial amounts of data). Extracting data from flat files can be expensive, and take a long time, for example it may include conversion from EBCDIC to ASCII. It is better to capture the data from the flat file in python variables, then write the data to disk using JSON, or pickle (Python object serialisation). As a separate step read the data into memory from your saved file, then do your data processing work, with pandas, or other tools.
Feeding data into Pandas
The work I’ve done has been two dimensional, rows and columns; you can have more dimensions.
You can use a list of dictionaries (dicts), or a dict of lists:
import pandas as pd
# a dict of lists
datad = {"dsn":["ABC","DEF"],
"volser":["SYSRES","USER02"]}
pdd = pd.DataFrame(datad)
# a list of dicts
datal = [{"dsn":"ABC","volser":"SYSRES"},
{"dsn":"DEF","volser":"USER02"},
]
pdl = pd.DataFrame.from_records(datal)
Processing data like pdd = pd.DataFrame(datad) creates a pandas data frame. You take actions on this data frame. You can create other data frames from an original data frame, for example with a subset of the rows and columns.
I was processing a large amount of data, and found it easiest to create a dict for each row of data, then accumulate the rows in a list. Before I used Pandas, I had just printed out each row. I do not know which performs better. Someone else used a dict of lists, appending each row’s data to the “dsn” or “volser” list.
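The row-at-a-time approach described above can be sketched like this (the column names follow the earlier toy example; the loop stands in for reading records from a flat file):

```python
import pandas as pd

rows = []  # accumulate one dict per row of input data
for dsn, volser in [("ABC", "SYSRES"), ("DEF", "USER02")]:
    rows.append({"dsn": dsn, "volser": volser})

# build the data frame in one go from the accumulated list
df = pd.DataFrame.from_records(rows)
print(df)
```

Building the list first and creating the data frame once is generally cheaper than appending to a data frame row by row.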
What can you do with it?
The first thing to do is print it. Once the data is in Pandas you can use either pdd or pdl from above.
print(pdd)
gave
dsn volser
0 ABC SYSRES
1 DEF USER02
Where the 0, 1 are the row numbers of the data.
With my real data I got
DSN ... AllocSpace
0 SYS1.VVDS.VA4RES1 ... 1660
1 SYS1.VTOCIX.A4RES1 ... 830
2 CBC.SCCNCMP ... 241043
3 CBC.SCLBDLL ... 885
4 CBC.SCLBDLL2 ... 996
.. ... ... ...
93 SYS1.SERBLPA ... 498
94 SYS1.SERBMENU ... 277
95 SYS1.SERBPENU ... 17652
96 SYS1.SERBT ... 885
97 SYS1.SERBTENU ... 332
[98 rows x 7 columns]
The data was formatted to match my window size. With a larger window I got more columns.
You can change this by using
pd.set_option('display.max_rows', 500)
pd.set_option('display.max_columns', 500)
pd.set_option('display.width', 1000)
Which columns are displayed?
Rather than displaying all of the columns, you can select which columns are displayed.
You can tell the column names from the data you passed to Pandas, or use the command below (opd is the data frame holding my real data)
print(list(opd.columns.values))
This displays the values of the column names, as a list.
To display only the columns you specify, use
print(opd[["DSN","VOLSER","ExpDate","CrDate","LastRef","63bit alloc space KB", "AllocSpace"]])
You can also display all but the specified columns
print(opd.loc[:, ~opd.columns.isin(["ExpDate","CrDate","LastRef"])])
Select which rows you want displayed
print(opd.loc[opd["VOLSER"].str.startswith("A4"),["DSN","VOLSER",]])
or
print(opd.loc[opd["DSN"].str.startswith("SYS1."),["DSN","VOLSER",]])
gave
DSN VOLSER
0 SYS1.VVDS.VA4RES1 A4RES1
1 SYS1.VTOCIX.A4RES1 A4RES1
12 SYS1.ADFMAC1 A4RES1
13 SYS1.CBRDBRM A4RES1
14 SYS1.CMDLIB A4RES1
.. ... ...
93 SYS1.SERBLPA A4RES1
94 SYS1.SERBMENU A4RES1
95 SYS1.SERBPENU A4RES1
96 SYS1.SERBT A4RES1
97 SYS1.SERBTENU A4RES1
[88 rows x 2 columns]
From this we can see that 88 (out of 98) rows were displayed: rows 0, 1, 12, 13, but not rows 2, 3, …
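The selected rows keep their original row numbers (0, 1, 12, 13, …). If you want the result renumbered from 0, reset_index is one way to do it; a sketch with made-up rows, since opd came from my real data:

```python
import pandas as pd

df = pd.DataFrame({"DSN": ["SYS1.A", "CBC.B", "SYS1.C"],
                   "VOLSER": ["A4RES1", "A4RES1", "A4RES1"]})

selected = df.loc[df["DSN"].str.startswith("SYS1.")]
print(selected)                        # index shows 0 and 2

# drop=True discards the old index instead of keeping it as a column
renumbered = selected.reset_index(drop=True)
print(renumbered)                      # index now runs 0, 1
```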
You can select rows and columns
print(opd.loc[opd["VOLSER"].str.startswith("A4"),["DSN","VOLSER","63bit alloc space KB",]])
You can process the data, such as sort
The following statements extract columns from the original data, sort the data, and create a new data frame, which is then printed.
sdata= opd[["DSN","VOLSER","63bit alloc space KB",]].sort_values(by=["63bit alloc space KB","DSN"], ascending=False)
print(sdata)
This gave
DSN VOLSER 63bit alloc space KB
2 CBC.SCCNCMP A4RES1 241043
35 SYS1.MACLIB A4RES1 210664
36 SYS1.LINKLIB A4RES1 166008
90 SYS1.SEEQINST A4RES1 103534
42 SYS1.SAMPLIB A4RES1 82617
.. ... ... ...
62 SYS1.SBPXTENU A4RES1 55
51 SYS1.SBDTMSG A4RES1 55
45 SYS1.SBDTCMD A4RES1 55
12 SYS1.ADFMAC1 A4RES1 55
6 FFST.SEPWMOD3 A4RES1 55
[98 rows x 3 columns]
Showing that all 98 rows, and all three columns, had been copied to the sdata data frame.
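sort_values can also take a list for ascending, one flag per sort column, for example largest space first but DSN in alphabetical order within equal sizes. A sketch with made-up rows:

```python
import pandas as pd

df = pd.DataFrame({"DSN": ["B.B", "A.A", "C.C"],
                   "KB":  [55,    55,    100]})

# descending on size, ascending on name within equal sizes
s = df.sort_values(by=["KB", "DSN"], ascending=[False, True])
print(s)
```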
Saving data
Reading an external file and processing the data into Python arrays took an order of magnitude longer than processing it in Pandas.
You should consider a two-step approach to looking at data:
- Extract the data and export it in an accessible format, such as pickle or JSON. While getting this part working, use only a few rows of data. Once it works, you can process all of the data.
- Do the analysis using the exported data.
Export the data
You should consider externalising the data in JSON or pickle format, for example
import pickle
# write out the data to a file
fPickle = open('pickledata', 'wb')
# source, destination
pickle.dump(opd, fPickle)
fPickle.close()
Import and do the analysis
# and read it in
fPickle = open('pickledata', 'rb')
opd = pickle.load(fPickle)
fPickle.close()
print(opd)
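JSON works in a similar way; Pandas has its own to_json and read_json, so you do not need the pickle or json modules. A sketch using the toy data from earlier (the file name is just an example):

```python
import pandas as pd

opd = pd.DataFrame({"dsn": ["ABC", "DEF"],
                    "volser": ["SYSRES", "USER02"]})

opd.to_json("dcollect.json")           # export the data frame
opd2 = pd.read_json("dcollect.json")   # read it back in a later run
print(opd2)
```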
Processing multiple data sources as one
If you have multiple sets of data, for example for Monday, Tuesday, Wednesday, etc you can use
week = pd.concat([monday,tuesday,wednesday,thursday,friday])
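Note that pd.concat takes a list (or other iterable) of data frames, not separate arguments. By default the combined frame keeps each day's original row numbers; ignore_index renumbers them from 0. A sketch with two small frames:

```python
import pandas as pd

monday  = pd.DataFrame({"dsn": ["ABC"], "kb": [55]})
tuesday = pd.DataFrame({"dsn": ["DEF"], "kb": [96]})

# pass the frames as a list; renumber the combined rows from 0
week = pd.concat([monday, tuesday], ignore_index=True)
print(week)
```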