Getting table data out of html – unsuccessfully

A couple of times I’ve wanted to get information from documentation into my program. For example, from a table in the documentation

I want to extract

  • "cics:operator_class" : "set","add","remove","delete"
  • "cics:operator_classes" : N/A

then select the entries which have N/A (or those with "set" etc.).
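As a rough illustration of the end goal, here is a minimal Python sketch of what I would like to be able to do once the data is in the program. The dictionary layout, the use of an empty list to stand for N/A, and the function names are all my own assumptions for the example.

```python
# Hypothetical in-memory form of the two entries above.
# An empty list stands in for "N/A" (an assumption for this sketch).
rows = {
    "cics:operator_class": ["set", "add", "remove", "delete"],
    "cics:operator_classes": [],
}


def not_applicable(table):
    """Return the names whose entry is N/A (no operations listed)."""
    return [name for name, ops in table.items() if not ops]


def with_operation(table, operation):
    """Return the names whose entry lists the given operation."""
    return [name for name, ops in table.items() if operation in ops]


print(not_applicable(rows))         # ['cics:operator_classes']
print(with_operation(rows, "set"))  # ['cics:operator_class']
```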

Background

The HTML table is not simple: it has a coloured background, some text is in one font, and other text is in a different font.

The table will not just be

<table>
<tr><td>"cics:operator_classes"</td>...<td>N/A</td></tr>
</table>

which would be relatively easy to parse. My first thought was to use grep to extract the rows, then extract the data.

It will be more like one long string containing

<td>
<code class="language-plaintext highlighter-rouge">"cics:operator_classes"</code>
</td>

which means grep will not work directly with the data.
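To show why a tag-aware tool copes where grep does not, here is a minimal sketch using Python's standard html.parser module. It is not necessarily what my working program does. The class name CellText and the helper cell_rows are made up for this example, and the sample string is just the fragment above with an N/A cell added.

```python
from html.parser import HTMLParser


class CellText(HTMLParser):
    """Collect the text of each <td>, ignoring tags nested inside it."""

    def __init__(self):
        super().__init__()
        self.rows = []     # finished rows, each a list of cell strings
        self._row = None   # cells of the row being read, or None
        self._cell = None  # text pieces of the cell being read, or None

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag == "td":
            self._cell = []

    def handle_data(self, data):
        # Text inside <code>, <span> etc. still arrives here.
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag == "td" and self._cell is not None:
            if self._row is not None:
                self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None


def cell_rows(html):
    """Return the table cells in html as a list of rows."""
    parser = CellText()
    parser.feed(html)
    parser.close()
    return parser.rows


sample = (
    '<table><tr><td><code class="language-plaintext highlighter-rouge">'
    '"cics:operator_classes"</code></td><td>N/A</td></tr></table>'
)
print(cell_rows(sample))  # [['"cics:operator_classes"', 'N/A']]
```

The point is that the parser hands over the text whatever tags it is wrapped in, and it does not care whether the page is one long line.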

Getting the source

Some browsers allow you to save the source of a page, and some do not.
I use Chrome to display and save the page.

Parsing using Linux tools

For me the obvious approach was to use Python to process it. Unfortunately it complained about some of the HTML, so I spent some time trying to remove the stuff I didn’t need using Linux tools. This proved to be an interesting diversion which led to a dead end. In the end Python was the right answer; see Getting table data out of html – successfully.

Linux utilities only go so far

One of the pages I wanted to process was over 500KB, and I wanted just a small part of it. Looking at the data, it was one long string with no newlines, so it was very difficult to display.

Splitting the text up

The first thing I did was to extract the <table>… </table> information and ignore the rest.

I found the Unix command sed (stream editor) very useful:

sed 's!<table!\n<table!g' racf.html 

This reads from the file racf.html and changes every <table to \n<table, so adding a newline before each <table. When you edit or display the file, each <table... is at the start of a line.

I fed the output of this into another sed command

sed 's!/table>!/table>\n!g' 

which puts a newline after every /table>. The data now looks like

<!DOCTYPE html><html lang="en-US"><meta http-equiv="Content-Type" content="text/..
<table…
</table>
…..

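I then used sed to print only the lines from one containing <table through to the next one containing </table>. One thing to watch: sed starts looking for the end of a range on the line after the one that started it, so this relies on <table and </table> being on different lines.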

sed -n  '/<table/, /<\/table>/p' > racf1.html 

The combined command was

sed 's!<table!\n<table!g' racf.html |sed 's!/table>!/table>\n!g'  | sed -n  '/<table/, /<\/table>/p' > racf1.html 
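For comparison, the same "keep only the tables" step can be sketched in Python with a single regular expression. This is an illustration rather than what I ran at the time. The function name extract_tables and the sample page are invented for the example, and it assumes tables are not nested inside other tables.

```python
import re

# Non-greedy match from each <table to the nearest </table>.
# re.DOTALL lets "." cross newlines, so it does not matter whether
# the page is one long line or many. Assumes no nested tables.
TABLE = re.compile(r"<table.*?</table>", re.DOTALL | re.IGNORECASE)


def extract_tables(html):
    """Return every <table ...>...</table> block found in html."""
    return TABLE.findall(html)


# Made-up page content, just to exercise the function.
page = (
    "<html><p>intro</p>"
    '<table class="a"><tr><td>1</td></tr></table>'
    "text between"
    "<table><tr><td>2</td></tr></table>"
    "<p>end</p></html>"
)

for table in extract_tables(page):
    print(table)
```

To use it on the saved page, read racf.html into a string first and pass that string to extract_tables.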

Using regular expressions

You can use regular expressions to say, for example, remove the data between <colgroup and /colgroup>.

This is where it starts to get hard

If you have the following string, and process it to remove the data between <colgroup and /colgroup>

my <colgroup> abc</colgroup> and <colgroup>xyz</colgroup> and the rest

then some tools will give

my and the rest

This is called greedy matching: it removes as much as possible, from the first <colgroup to the last /colgroup>, while still meeting the instructions.

I had to use Perl’s regular expressions, which support non-greedy matching (the ? after .* makes it match as little as possible)

cat racf1.html |perl -p -e 's,<colgroup.*?/colgroup>,,g'

which, for the example string, produced

my and and the rest

which is what I was expecting.
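The difference between the two behaviours is easy to reproduce. Here is a small Python sketch using the example string from above; Python's re module uses the same .* and .*? notation as Perl. The squeeze helper is my own addition, only there to collapse the doubled spaces left behind so the output matches the text shown above.

```python
import re

text = "my <colgroup> abc</colgroup> and <colgroup>xyz</colgroup> and the rest"

# Greedy: .* runs on to the LAST /colgroup> in the string.
greedy = re.sub(r"<colgroup.*/colgroup>", "", text)

# Non-greedy: .*? stops at the FIRST /colgroup> after each <colgroup.
non_greedy = re.sub(r"<colgroup.*?/colgroup>", "", text)


def squeeze(s):
    """Collapse runs of whitespace into single spaces."""
    return " ".join(s.split())


print(squeeze(greedy))      # my and the rest
print(squeeze(non_greedy))  # my and and the rest
```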
The commands to extract the few fields from the kilobytes of data were getting more and more complex. It would have been quicker to extract the fields of interest by hand.
I backtracked and went back to my Python program, which was successful.
