Saturday, October 16, 2010

Ruby, Mechanize, Nokogiri, XPath and Firebug

I have done a fair amount of extremely painful screen scraping in the past to get data for a web page, using tools like WWW::Mechanize or LWP for Perl. Parsing HTML to get at the data you need can be disgustingly ugly. When the need came up again recently, all the bad experiences I have had came rushing back and I was dreading getting started. Luckily the world is a nicer place now, and with Ruby this has become trivial. The once painful process became:
  1. Open the page I want to scrape in Firefox.
  2. Load Firebug.
  3. Right-click the HTML element I want to scrape and select "Copy XPath".
  4. Fire up TextMate and create a simple Ruby Mechanize script.
  5. Paste the XPath into the script.
The code simply looks like this:
require 'rubygems'
require 'mechanize'
require 'mysql'

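# Create the agent that fetches pages
agent = Mechanize.new

# Fetch the page to scrape
page = agent.get('http://url.to.scrape.com')

# page.parser is the Nokogiri document; take the first node matching
# the copied XPath and read its title attribute
up_string = page.parser.xpath("/html/body/form/table[2]/tr/td/table/tr[5]/td/table/tr[6]/td[3]/a/img")[0]['title'].to_s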
Life is good.