4 Apr 2012 olea   » (Master)

How to use XPath expressions in shell scripting using xmllint

This is a minor tip I want to share. A little example of a nice software feature that made my day.

I've been messing with HTML scraping and took a look at xmllint's (maybe new) features. My intention was to extract a particular pattern, for which the --xpath option looked like a good fit. I've never been very good at tuning XPath expressions, so I searched for how to approach this and found an amazing feature of the xmllint shell mode. Here is the workflow I used:

  • get your document; I used an HTML one
  • I didn't test with broken HTML, but you can try it with xmllint --html
  • get into the shell: xmllint --html --shell [document]; keep in mind [document] can be a remote URI
  • in shell mode you can search for a precise string; in my case I chose one inside the desired pattern: grep [string]
  • here is where the magic happens: xmllint answers with the XPath expression of each match, which you can use for an XPath query
  • exit the shell
  • copy the extracted XPath expression to the CLI: xmllint --html --xpath [xpath] [document]
  • there it is.
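
The workflow above can be sketched non-interactively by piping the shell commands into xmllint instead of typing them. This is only a minimal illustration: the file sample.html, its contents and the search string EUR are made up for the example, and the exact layout of the grep output may differ between libxml2 versions.

```shell
# Hypothetical sample document, created only for this illustration.
cat > sample.html <<'EOF'
<html><body>
<div id="main"><p>first</p><p class="price">42 EUR</p></div>
</body></html>
EOF

# Steps 3 to 6: enter shell mode, grep for a string, quit.
# Each matching line starts with the XPath of the element holding the string.
found=$(printf 'grep EUR\nquit\n' | xmllint --html --shell sample.html)
echo "$found"

# Step 7: reuse the reported path with --xpath on the same document.
node=$(xmllint --html --xpath '/html/body/div/p[2]' sample.html)
echo "$node"
```

For this sample the path reported by grep should be /html/body/div/p[2], which is the expression passed to --xpath in the last step.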

You can tune your expressions by adding predicates, for example on specific attributes, or by extracting the text() node, etc.
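
As a rough sketch of that kind of tuning, the commands below replace a positional path with an attribute predicate and then select only the text node. The sample file and the class name price are assumptions made up for the example, not part of my original scraping task.

```shell
# Hypothetical sample document, created only for this illustration.
cat > sample.html <<'EOF'
<html><body>
<div id="main"><p>first</p><p class="price">42 EUR</p></div>
</body></html>
EOF

# Attribute predicate: match the element by its class instead of its position.
element=$(xmllint --html --xpath '//p[@class="price"]' sample.html)
echo "$element"

# text() node: keep only the text content, without the surrounding markup.
price=$(xmllint --html --xpath '//p[@class="price"]/text()' sample.html)
echo "$price"
```

An attribute predicate is usually less fragile than a positional path such as p[2], since it keeps working if the page gains or loses sibling elements.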

Enjoy.

Syndicated 2012-04-04 20:00:00 from Ismael Olea
