Saturday, October 16, 2010

Ruby, Mechanize, Nokogiri, XPath and Firebug

I have done a fair amount of extremely painful screen scraping in the past to get data out of web pages, using tools like WWW::Mechanize or LWP for Perl. Parsing HTML to get at the data you need can be disgustingly ugly. When the need came up again recently, all the bad experiences came rushing back and I was dreading getting started. Luckily the world is a nicer place now, and this has become trivial with Ruby. The once painful process became:
  1. Open the page I want to scrape in Firefox
  2. Load Firebug
  3. Right-click on the HTML element I want to scrape and select Copy XPath
  4. Fire up TextMate and create a simple Ruby Mechanize script
  5. Paste the XPath into the script.
The code simply looks like this:
require 'rubygems'
require 'mechanize'
require 'mysql'

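agent = Mechanize.new

# Fetch the page; page.parser is the Nokogiri document for it.
page = agent.get('http://url.to.scrape.com')

# XPath pasted from Firebug. xpath returns a node set, so take the first match and read its title attribute.
upString = page.parser.xpath("/html/body/form/table[2]/tr/td/table/tr[5]/td/table/tr[6]/td[3]/a/img")[0]['title'].to_s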
Life is good.

Saturday, September 25, 2010

Creating an image of a running EC2 instance

Prerequisites
  1. AWS User ID
  2. AWS Key ID
  3. AWS Secret Key
  4. X.509 key pair (cert and private key)
All of this can be obtained from the AWS account info page here. Note that AWS does not store the private key of your X.509 key pair; if you no longer have it, you will need to create a new key pair.

Creating the bundle
  1. Upload your X.509 cert and private key to your running EC2 instance.
     scp PATH_TO_KEYS/{cert,pk}-*.pem root@AWS_INSTANCE:/mnt
  2. Log into your EC2 instance.
     ssh -i YOURKEY.pem root@AWS_INSTANCE
  3. Set up some environment variables to make the process a little easier. Set arch to either i386 or x86_64 depending on whether you have a 32 bit or 64 bit instance. If you're not sure which to choose you can check here. The later commands also expect $prefix (the name prefix for the bundle files) and $bucket (the S3 bucket to upload to) to be set.
     # export AWS_USER_ID=YOUR_AWS_USER_ID
     # export AWS_ACCESS_KEY_ID=YOUR_KEY_ID
     # export AWS_SECRET_ACCESS_KEY=YOUR_SECRET_KEY
     # export arch=i386
  4. Create the bundle.
     ec2-bundle-vol -r $arch -d /mnt/ -p $prefix -u $AWS_USER_ID -k /mnt/pk-*.pem -c /mnt/cert-*.pem -s 10240 -e /mnt,/root/.ssh
  5. Upload the bundle.
     ec2-upload-bundle -b $bucket -m /mnt/$prefix.manifest.xml -a $AWS_ACCESS_KEY_ID -s $AWS_SECRET_ACCESS_KEY
  6. Register the bundle.
     ec2-register --name "$bucket/$prefix" $bucket/$prefix.manifest.xml
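
The bundle, upload, and register commands above use $bucket and $prefix, which do not appear among the export lines, and they silently misbehave if any variable is empty. Here is a minimal sketch of one way to fill in the gaps before running them. The bucket and prefix values are hypothetical placeholders, and detect_arch and check_bundle_env are helper names made up for this example, not part of the EC2 tools.

```shell
# Hypothetical values: substitute your own S3 bucket and image name prefix.
bucket=my-ami-bucket
prefix=my-image

# Guess the arch value from the running kernel instead of typing it by hand.
detect_arch() {
  case "$(uname -m)" in
    x86_64) echo x86_64 ;;
    i?86)   echo i386 ;;
    *)      echo unknown ;;
  esac
}
arch=$(detect_arch)

# Return non-zero if any variable the bundle commands rely on is empty,
# printing the name of each missing one to stderr.
check_bundle_env() {
  missing=0
  for name in AWS_USER_ID AWS_ACCESS_KEY_ID AWS_SECRET_ACCESS_KEY arch bucket prefix; do
    eval "value=\${$name:-}"
    if [ -z "$value" ]; then
      echo "missing: $name" >&2
      missing=1
    fi
  done
  return "$missing"
}
```

Running check_bundle_env after the exports and before ec2-bundle-vol should catch a forgotten variable early, which is a lot cheaper than finding out after the bundle has been built.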

Saturday, February 27, 2010

Compiling GCC 4.4.3 on Solaris

First I have to say that compiling on Solaris can be a major pain in the ass. I found myself wanting GNU find on one of our Solaris 10 machines but had issues compiling it; there was a known bug in gcc that could be fixed by upgrading gcc. That is when the fun began. I downloaded gcc from gnu.org. The first problem I ran into was that I didn't read the dependency list, my bad. I needed to grab gmp and mpfr. I downloaded and compiled both, passed the --with-gmp and --with-mpfr flags, and thought I was all set. Turns out I was wrong: configure had a problem finding the mpfr lib. That sucked, seeing as I had just downloaded and compiled it. After looking at the config.log the cause was pretty obvious: I was building against 64 bit libraries. Adding the following got me to the next error:
env CC="gcc -m64"
The next problem was similar: apparently Solaris includes the 32 bit libraries in the search path but not the 64 bit ones. The error I got was:
configure: error: cannot compute suffix of object files
When I checked the config.log the specific error I found was:
libgcc_s.so.1: wrong ELF class: ELFCLASS32
I was able to fix it with the following:
export LD_LIBRARY_PATH=/usr/sfw/lib/64/
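
A "wrong ELF class" error means a 32 bit and a 64 bit object got mixed, so it can help to check which class a given library actually is before touching the search path. Here is a minimal sketch that reads the EI_CLASS byte from the ELF header. elf_class is a helper name made up for this example, it assumes the file really is ELF (it does not check the magic bytes), and it assumes your od supports the POSIX -A, -t, -j and -N options.

```shell
# Print the ELF class (32 or 64) of a file by reading the byte at offset 4
# of its header, the EI_CLASS field: 1 is ELFCLASS32, 2 is ELFCLASS64.
elf_class() {
  class=$(od -An -tu1 -j4 -N1 "$1" | tr -d ' \n')
  case "$class" in
    1) echo 32 ;;
    2) echo 64 ;;
    *) echo unknown ;;
  esac
}

# Example use (paths are assumptions based on the directory used above):
# elf_class /usr/sfw/lib/libgcc_s.so.1
# elf_class /usr/sfw/lib/64/libgcc_s.so.1
```

If the copy the linker picks up reports 32 while you are building with -m64, that is the mismatch the error is complaining about.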
OK, so now things looked to be on the right path. The compile went on for an hour and a half and died with a new error; again the error was wrong ELF class. At this point I was about to pull my hair out. The problem was that the -m64 flag was not being passed down to xgcc, the stage compiler used by the bootstrap. Note the following line:

/users/srb55/gcc-4.4.3/host-sparc-sun-solaris2.10/prev-gcc/xgcc -B/users/srb55/gcc-4.4.3/host-sparc-sun-solaris2.10/prev-gcc/ -B/global/inf/sys/software/gcc/gcc-4.4.2/sparc-sun-solaris2.10/bin/ -c -g -O2 -DIN_GCC
--- snip ---
The fix for this wasn't so bad: I just disabled the bootstrap compile with --disable-bootstrap. This time the compile completed (after like 3 hours). The configure line that worked in the end was:
./configure --prefix=/global/inf/sys/software/gcc/gcc-4.4.3 --with-mpfr=/global/inf/sys/software/mpfr/mpfr-2.4.0 --with-gmp=/global/inf/sys/software/gmp/gmp-5.0.1/ --enable-shared --disable-nls --disable-bootstrap --disable-multilib --enable-languages=c,c++
Victory!