I always found it odd that accessing DOM elements with Ruby or Python wasn’t as easy as it was with jQuery. Many HTML parsing libraries employ the Simple API for XML (SAX), which can handle extremely large XML documents but is cumbersome and adds complexity. Other parsing libraries use XML Path Language (XPath), which is conceptually simpler than SAX but still more effort than jQuery. I was pleasantly surprised to discover that it’s possible to use jQuery to parse web pages with Node.js. This is accomplished by using jsdom, “a javascript implementation of the W3C DOM”.
jQuery and jsdom
Using jsdom you can specify a local file or url, and jsdom will return the window object for that document. Additionally, JavaScript can be inserted into the document; in our case we’re inserting the jQuery library. In the example below, all the links from the Hacker News front page are logged to the console.
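A minimal sketch of that example, assuming the callback-style jsdom.env() API that older versions of jsdom shipped and a CDN-hosted copy of jQuery:

```javascript
var jsdom = require('jsdom');

// Load the Hacker News front page, inject jQuery into the resulting
// window, and log the href of every anchor on the page.
jsdom.env({
  url: 'http://news.ycombinator.com/',
  scripts: ['http://code.jquery.com/jquery.js'],
  done: function (errors, window) {
    if (errors) {
      return console.error(errors);
    }
    var $ = window.jQuery;
    $('a').each(function () {
      console.log($(this).attr('href'));
    });
  }
});
```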
Making Scraping More Robust
Unfortunately, there are a few common bugs that I ran into when scraping content with jQuery and jsdom. Specifically, there are two issues, not necessarily specific to jsdom, that are worth watching out for.
jQuery Return Values
The first issue is the return values of jQuery function calls; extra attention has to be paid to them. Calling a method on undefined will crash a program, a problem that is especially apparent in DOM parsing. Consider the example below:
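A one-line sketch of the kind of call in question (the selector and the split delimiter here are placeholders, not taken from the original listing):

```javascript
// Take the 8th link on the page and split its href on '/'.
var parts = $('a').eq(7).attr('href').split('/');
```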
If there are 8 or more links on the page, the 8th link will be returned and its href attribute will be split into an array. However, if there are fewer than 8, attr('href') will return undefined and calling split() on it will crash the program. Since HTML pages aren’t as structured as API responses, it’s important not to assume too much and to always check return values.
Web Page Errors
It’s entirely possible that the url passed to jsdom returns an error. If the error is temporary, your scraper might miss out on important information. This issue can be mitigated by recursively retrying the url, as in the example below:
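A sketch of that retry logic, reusing the jsdom.env()-style call from above; the getLinks and retries names come from the discussion below, while the exact signature and the retry limit of three are assumptions:

```javascript
var jsdom = require('jsdom');

function getLinks(url, retries) {
  // Give up after the third failed attempt.
  if (retries === 3) {
    return;
  }

  jsdom.env({
    url: url,
    scripts: ['http://code.jquery.com/jquery.js'],
    done: function (errors, window) {
      if (errors) {
        // Retry the same url with an incremented retry count.
        return getLinks(url, retries + 1);
      }
      var $ = window.jQuery;
      $('a').each(function () {
        console.log($(this).attr('href'));
      });
    }
  });
}

getLinks('http://news.ycombinator.com/', 0);
```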
With the above approach, if errors are encountered, getLinks will be called recursively with a larger retries value. On the third retry the function simply returns, giving up on that url. If you wanted to go further, you could wrap the recursive call in setTimeout to ensure that the retry is not made immediately after the error is encountered.
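Inside the error branch, that might look something like the following, with the one-second delay being an arbitrary choice:

```javascript
// Wait a second before retrying instead of hitting the server again immediately.
setTimeout(function () {
  getLinks(url, retries + 1);
}, 1000);
```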
Conclusions
Parsing web pages with jQuery on the server is a much more natural experience for developers already accustomed to using jQuery in the client. However, prior to scraping, it’s worth checking that the site 1) allows scraping and 2) does not already have an API. Consuming a JSON API would be even easier than scraping and parsing!