I have done a fair amount of extremely painful screen scraping in the past to get data for a web page, using tools like WWW::Mechanize or LWP for Perl. Parsing HTML to get at the data you need can be disgustingly ugly. When the need came up again recently, all the bad experiences I have had came rushing back and I was dreading starting. Luckily the world is a nicer place now and this has become trivial with Ruby. The once painful process became:
- Open the page I want to scrape in Firefox
- Load Firebug
- Right-click on the HTML element I want to scrape and select Copy XPath
- Fire up TextMate and create a simple Ruby Mechanize script
- Paste the XPath into the script
The code simply looks like this:
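require 'rubygems'
require 'mechanize'
require 'mysql'
# Fetch the page with Mechanize
agent = Mechanize.new
page = agent.get('http://url.to.scrape.com')
# XPath pasted from Firebug: take the first matching <img> and read its title attribute
upString = page.parser.xpath("/html/body/form/table[2]/tr/td/table/tr[5]/td/table/tr[6]/td[3]/a/img")[0]['title'].to_s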
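One thing to watch with a pasted absolute XPath: if the page layout shifts, the lookup returns nothing and the `[0]['title']` chain blows up on nil. Here is a minimal sketch of guarding against that. To keep it runnable offline it uses REXML from the Ruby distribution instead of Mechanize, and the sample markup, the `attribute_at` helper, and the shortened XPath are all made up for illustration rather than taken from the real page.

```ruby
require 'rexml/document'

# Hypothetical stand-in for a fetched page; the real markup will differ.
SAMPLE_HTML = <<~HTML
  <html>
    <body>
      <table>
        <tr><td><a href="/status"><img src="up.png" title="Server is up"/></a></td></tr>
      </table>
    </body>
  </html>
HTML

# Hypothetical helper: returns the value of +attribute+ on the first node
# matching +xpath+, or nil when nothing matches (no NoMethodError on nil).
def attribute_at(doc, xpath, attribute)
  node = REXML::XPath.first(doc, xpath)
  node && node.attributes[attribute]
end

doc = REXML::Document.new(SAMPLE_HTML)

# An XPath that matches the sample markup
up_string = attribute_at(doc, '/html/body/table/tr/td/a/img', 'title')

# An XPath that no longer matches, as after a layout change
missing = attribute_at(doc, '/html/body/table/tr[9]/td/a/img', 'title')

puts up_string.inspect
puts missing.inspect
```

In the Mechanize script the same idea would be to keep the result of `xpath(...)` in a variable and check it for an empty match before reading `['title']`.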
Life is good.