I always found it odd that accessing DOM elements with Ruby or Python wasn’t as easy as it was with jQuery. Many HTML parsing libraries employ the Simple API for XML (SAX), which can handle extremely large XML documents but is cumbersome and adds complexity. Other parsing libraries use XML Path Language (XPath), which is conceptually simpler than SAX but still more effort than jQuery. I was pleasantly surprised to discover that it’s possible to use jQuery to parse web pages with Node.js. This is accomplished by using jsdom, “a javascript implementation of the W3C DOM”.
jQuery and jsdom
Using jsdom you can specify a local file or url, and jsdom will return the window object for that document. Additionally, JavaScript can be inserted into the document; in our case we’re inserting the jQuery library. In the example below, all the links from the Hacker News front page are logged to the console.
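A minimal sketch of that example, assuming the callback-style jsdom.env() API that older versions of jsdom shipped and a CDN-hosted copy of jQuery:

```javascript
var jsdom = require('jsdom');

// Load the Hacker News front page, inject jQuery into the resulting
// window, and log the href of every anchor on the page.
jsdom.env({
  url: 'http://news.ycombinator.com/',
  scripts: ['http://code.jquery.com/jquery.js'],
  done: function (errors, window) {
    if (errors) {
      return console.error(errors);
    }
    var $ = window.jQuery;
    $('a').each(function () {
      console.log($(this).attr('href'));
    });
  }
});
```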
Making Scraping More Robust
Unfortunately, there are a few common bugs that I ran into when scraping content with jQuery and jsdom. Specifically, there are two issues, not necessarily specific to jsdom, that are worth watching out for.
jQuery Return Values
The first issue is the return values of jQuery function calls; extra attention has to be paid to them. Calling a method on undefined will crash a program, a problem that is especially apparent in DOM parsing. Consider the example below:
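A one-line sketch of the kind of call in question (the selector and the split delimiter here are placeholders, not taken from the original listing):

```javascript
// Take the 8th link on the page and split its href on '/'.
var parts = $('a').eq(7).attr('href').split('/');
```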
If there are 8 or more links on the page, the 8th link will be returned and its href attribute will be split into an array. However, if there are fewer than 8, attr('href') will return undefined and calling split() on it will crash the program. Since HTML pages aren’t as structured as API responses, it’s important not to assume too much and to always check return values.
Web Page Errors
It’s entirely possible that the url passed to jsdom returns an error. If the error is temporary, your scraper might miss out on important information. This issue can be mitigated by recursively retrying the url, as in the example below:
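A sketch of that retry logic, reusing the jsdom.env()-style call from above; the getLinks and retries names come from the discussion below, while the exact signature and the retry limit of three are assumptions:

```javascript
var jsdom = require('jsdom');

function getLinks(url, retries) {
  // Give up after the third failed attempt.
  if (retries === 3) {
    return;
  }

  jsdom.env({
    url: url,
    scripts: ['http://code.jquery.com/jquery.js'],
    done: function (errors, window) {
      if (errors) {
        // Retry the same url with an incremented retry count.
        return getLinks(url, retries + 1);
      }
      var $ = window.jQuery;
      $('a').each(function () {
        console.log($(this).attr('href'));
      });
    }
  });
}

getLinks('http://news.ycombinator.com/', 0);
```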
With the above approach, if errors are encountered, getLinks will be called recursively with a larger retries value. On the third retry the function simply returns, giving up on that url. If you wanted to go further, you could wrap the recursive call in setTimeout to ensure that the retry is not made immediately after the error is encountered.
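Inside the error branch, that might look something like the following, with the one-second delay being an arbitrary choice:

```javascript
// Wait a second before retrying instead of hitting the server again immediately.
setTimeout(function () {
  getLinks(url, retries + 1);
}, 1000);
```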
Conclusions
Parsing web pages with jQuery on the server is a much more natural experience for developers already accustomed to using jQuery in the client. However, prior to scraping, it’s worth checking that the site 1) allows scraping and 2) does not already have an API. Consuming a JSON API would be even easier than scraping and parsing!