Do you ever do web scraping in Ruby? Perhaps you would like to create a list of items (say, chicken breeds) that are already on a Wikipedia page. Why do a bunch of copying and pasting when the information is already in a text-based format (HTML)?
Generally the list you are gathering has to be long enough to justify the effort of creating a script. I would like to share a technique that I recently came across that shifts the balance greatly in favor of writing a script to do the web scraping.
The technique is to combine a Ruby parsing script with the selector gadget Chrome extension. This extension makes it easy to generate CSS selectors that can be used to retrieve just the text you want from a web page.
The Swiss Army Ruby Knife
Peter Cooper gives a demonstration of the technique in the following Ruby Remote Conference talk starting at around minute 27. Watch the video to see how powerful this technique is.
The entire video is worth watching to get a glimpse of how Ruby can be used to automate different tasks. You may find inspiration from a different example in the video. The speaker talks about taking what he calls a scrappy approach because sometimes a quick script is all you need to automate a repetitive task.
The Selector Gadget
There are a couple of features that make the selector gadget easy to use. First of all, after starting the gadget all you have to do is click on a portion of the text you are interested in. This will generate a selector and visually show exactly what text the selector applies to. The first selector will probably be too broad and contain text that you are not interested in. Just click on an example of the unwanted text and it will be eliminated from the selector.
Once you have selected the text of interest, the generated CSS selector can just be dropped into a Ruby script via copy and paste.
Web Scraping in Ruby
(or applying the Swiss Army Ruby Knife)
I wanted to get a list of chicken breeds for a project that I was working on. A Google search showed that the information I needed was on Wikipedia.
It turns out that there are a total of 68 breeds listed on the page, and a lot of the breeds are duplicated for bantam chickens. That would be a lot of copying and pasting, and a maddening effort if the deduplication had to be done manually.
However, using the selector gadget, the list was generated in short order with the following script.
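require 'open-uri'
require 'nokogiri'
url = "https://en.wikipedia.org/wiki/Chicken_breeds_recognized_by_the_American_Poultry_Association"
data = URI.open(url).read  # URI.open comes from open-uri; plain open no longer fetches URLs in Ruby 3
doc = Nokogiri::HTML(data)
breeds = Array.new
num_dups = 0
doc.css("h5+ ul a , h4+ ul a").each do |el|
  text = el.text
  if breeds.include?(text)
    num_dups += 1
    puts "Found duplicate " + num_dups.to_s + " " + text
  else
    breeds << text
  end
end
breeds.each do |b|
  puts b  # loop body assumed: print each unique breed
end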
The script keeps track of the number of duplicates and only adds a chicken breed to the breeds array if it is not already there. Note also that the selector passed to doc.css was copied directly from the selector gadget tool.
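If you want to play with the duplicate tracking on its own, without hitting the network, here is a small sketch in plain Ruby. The helper name collect_unique and the sample list are my own inventions for illustration; they are not part of the script above.

```ruby
# Hypothetical helper: returns the unique names in first-seen order,
# together with a count of how many duplicates were skipped.
def collect_unique(names)
  unique = []
  num_dups = 0
  names.each do |name|
    if unique.include?(name)
      num_dups += 1
    else
      unique << name
    end
  end
  [unique, num_dups]
end

# Sample input made up for illustration.
sample = ["Plymouth Rock", "Wyandotte", "Wyandotte", "Leghorn", "Plymouth Rock"]
breeds, num_dups = collect_unique(sample)

puts "#{breeds.length} unique breeds, #{num_dups} duplicates skipped"
```

If you do not care about counting the duplicates, calling uniq on the array should give you the same list in the same order with a single method call.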