I have done a fair amount of extremely painful screen scraping in the past to get data off web pages, using tools like WWW::Mechanize or LWP for Perl. Parsing HTML to get at the data you need can be disgustingly ugly. So when the task came up again recently, all of those bad experiences came rushing back and I was dreading starting. Luckily the world is a nicer place now, and with Ruby this has become trivial. The once-painful process became:
- Open the page I want to scrape in Firefox
- Load Firebug
- Right-click the HTML element I want to scrape and select "Copy XPath"
- Fire up TextMate and create a simple Ruby Mechanize script
- Paste the XPath into the script
The code simply looks like this:
require 'rubygems'
require 'mechanize'
require 'mysql' # not used in this snippet; for storing the results later

agent = Mechanize.new
page = agent.get('http://url.to.scrape.com')

# Grab the title attribute of the img that the Firebug-copied XPath points at
upString = page.parser.xpath("/html/body/form/table[2]/tr/td/table/tr[5]/td/table/tr[6]/td[3]/a/img")[0]['title'].to_s
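Since the script above depends on a live URL, here is a minimal sketch of the same XPath-extraction idea run against a local HTML snippet. It uses Ruby's stdlib REXML instead of Mechanize/Nokogiri so it needs no gems or network access; the markup and the "Up 3.2%" title value are made up for illustration.

require 'rexml/document'

# A local stand-in for the page Mechanize would fetch (hypothetical markup)
html = <<~HTML
  <html><body><form><table><tr><td>
    <a href="#"><img title="Up 3.2%"/></a>
  </td></tr></table></form></body></html>
HTML

doc = REXML::Document.new(html)

# Same idea as the Mechanize script: walk an absolute XPath to the img,
# then read its title attribute
img = REXML::XPath.first(doc, "/html/body/form/table/tr/td/a/img")
upString = img.attributes['title']
puts upString

The XPath here is shorter than the one Firebug produces for a real page, but the mechanism is identical: an absolute path down to one element, then an attribute read.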
Life is good.