A couple of times I’ve wanted to get information out of documentation and into my program. For example:

I want to extract
- "cics:operator_class": "set","add","remove","delete"
- "cics:operator_classes": N/A
and then select the entries which have N/A (or those with "set", etc.).
Background
From the picture you can see that the HTML table is not simple: it has a coloured background, and some text is in one font while other text is in a different font.
The table source will not be as simple as
<table>
<tr><td>"cics:operator_classes"</td>...<td>N/A</td></tr>
</table>
which would be relatively easy to parse. My first thought was to use grep to extract the rows, then extract the data from them.
It will be more like one long string containing
<td>
<code class="language-plaintext highlighter-rouge">"cics:operator_classes"</code>
</td>
which means grep, which works a line at a time, will not work directly with the data.
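A parser copes with this layout where a line-based tool does not. As a minimal sketch, assuming Python's standard html.parser module and a made-up two-column sample (the real table has more columns, and CellCollector and parse_rows are names invented here), the cell text can be collected and the N/A rows picked out like this:

```python
from html.parser import HTMLParser


class CellCollector(HTMLParser):
    """Collect the text of every <td> cell, grouped by <tr> row."""

    def __init__(self):
        super().__init__()
        self.rows = []     # completed rows, each a list of cell strings
        self._row = None   # cells of the row being read, or None
        self._cell = None  # text fragments of the cell being read, or None

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag == "td" and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        # Text inside nested tags such as <code> still arrives here.
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag == "td" and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None


# Hypothetical sample in the shape described above, not the real page.
SAMPLE = """<table><tr><td>
<code class="language-plaintext highlighter-rouge">"cics:operator_classes"</code>
</td><td>N/A</td></tr><tr><td>
<code class="language-plaintext highlighter-rouge">"cics:operator_class"</code>
</td><td>"set","add","remove","delete"</td></tr></table>"""


def parse_rows(html):
    """Return the table rows in html as lists of cell text."""
    parser = CellCollector()
    parser.feed(html)
    parser.close()
    return parser.rows


rows = parse_rows(SAMPLE)
# Keep the name (first cell) of every row whose last cell is N/A.
not_applicable = [row[0] for row in rows if row[-1] == "N/A"]
print(not_applicable)
```

The font and colour markup disappears because only the text between the tags is kept.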
Getting the source
Some browsers allow you to save the source of a page, and some do not.
I use Chrome to display and save the page.
Parsing using Linux tools
For me the obvious approach was to use Python to process it. Unfortunately it complained about some of the HTML, so I spent some time trying to remove the stuff I didn’t need. This proved to be an interesting diversion to a dead end. In the end Python was the right answer, see Getting table data out of html – successfully.
Linux utilities only go so far
One of the pages I wanted to process was over 500KB, and I wanted just a small part of it. Looking at the data, it was one long string with no newlines, so it was very difficult to display.
Splitting the text up
The first thing I did was to extract the <table>… </table> information and ignore the rest.
I found the Unix command sed (stream editor) very useful:
sed 's!<table!\n<table!g' racf.html
This reads from the file racf.html and changes every <table… to \n<table…, so adding a newline before each <table. When you edit or display the output, each <table… is at the start of a line.
I fed the output of this into another sed command
sed 's!/table>!/table>\n!g'
which puts a newline after every closing /table> tag. The data now looks like
<!DOCTYPE html><html lang="en-US"><meta http-equiv="Content-Type" content="text/..
<table…
</table>
…..
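The same two substitutions can be sketched in Python with str.replace. This is only an illustration: the sample string and the name split_at_tables are made up, and the real page is far larger.

```python
def split_at_tables(html):
    """Put a newline before each <table and after each /table>."""
    html = html.replace("<table", "\n<table")    # like sed 's!<table!\n<table!g'
    html = html.replace("/table>", "/table>\n")  # like sed 's!/table>!/table>\n!g'
    return html.split("\n")


# Hypothetical one-line page with two small tables.
sample = (
    '<html><p>intro</p><table class="a"><tr><td>1</td></tr></table>'
    "<p>between</p><table><tr><td>2</td></tr></table></html>"
)

lines = split_at_tables(sample)
for line in lines:
    print(line)

# Lines that start with <table are the ones worth keeping.
table_lines = [line for line in lines if line.startswith("<table")]
```

With input that really is one long line, each table ends up on a single line of its own, so selecting the lines that start with <table is enough.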
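I then used sed to print only the lines from each line containing <table through to the next line containing </table>. Note that sed starts looking for the end of a range on the line after the one that started it, so this relies on each table spanning more than one line.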
sed -n '/<table/, /<\/table>/p' > racf1.html
The final command was
sed 's!<table!\n<table!g' racf.html |sed 's!/table>!/table>\n!g' | sed -n '/<table/, /<\/table>/p' > racf1.html
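As an alternative to the three-stage pipeline, the tables can be pulled out in one step with a regular expression. This is a hedged sketch rather than what was run here: extract_tables and the sample string are invented for illustration, and it assumes tables are not nested inside other tables.

```python
import re

# Non-greedy match from an opening <table to the nearest closing </table>.
# re.DOTALL lets "." match newlines, so multi-line tables are found too.
TABLE_RE = re.compile(r"<table\b.*?</table>", re.DOTALL | re.IGNORECASE)


def extract_tables(html):
    """Return each <table ...>...</table> block as its own string."""
    return TABLE_RE.findall(html)


# Hypothetical one-line page with two small tables.
sample = (
    '<html><p>intro</p><table class="a"><tr><td>1</td></tr></table>'
    "<p>between</p><table><tr><td>2</td></tr></table></html>"
)

tables = extract_tables(sample)
print("\n".join(tables))
```

For the real page, the text would come from something like open("racf.html").read() instead of the sample string.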
Using regular expressions
You can use regular expressions to say, for example, remove the data between <colgroup and /colgroup>.
This is where it starts to get hard
Suppose you remove the data between <colgroup and /colgroup> from the string
my <colgroup> abc</colgroup> and <colgroup>xyz</colgroup> and the rest
Some tools will give
my and the rest
This is called greedy matching: it removes as much as possible, from the first <colgroup to the last /colgroup>, while still meeting the instructions.
I had to use Perl’s regular expressions, which support non-greedy matching (the ? after .* below)
cat racf1.html |perl -p -e 's,<colgroup.*?/colgroup>,,g'
and produced
my and and the rest
which is what I was expecting.
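The difference between the two behaviours can be seen in a few lines of Python, whose re module uses the same .*? syntax as Perl. This is a sketch on the example string above; the helper squeeze is made up here and only collapses the leftover spaces so the results are easy to compare.

```python
import re

text = "my <colgroup> abc</colgroup> and <colgroup>xyz</colgroup> and the rest"


def squeeze(s):
    """Collapse runs of whitespace into single spaces."""
    return " ".join(s.split())


# Greedy: .* runs from the first <colgroup to the last /colgroup>.
greedy = squeeze(re.sub(r"<colgroup.*/colgroup>", "", text))

# Non-greedy: .*? stops at the nearest /colgroup>, so each pair is removed separately.
non_greedy = squeeze(re.sub(r"<colgroup.*?/colgroup>", "", text))

print(greedy)      # my and the rest
print(non_greedy)  # my and and the rest
```

In Python, "." does not match a newline unless re.DOTALL is given, so a <colgroup> that spans several lines would need that flag.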
The commands to extract the few fields from the hundreds of KB of data were getting more and more complex; it would have been quicker to extract the fields of interest by hand.
I backtracked and went back to my Python program, which was successful.