jQuery and jsdom
Using jsdom you can specify a local file or url, and jsdom will return the page’s window object, which you can then query with jQuery.
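A minimal sketch of that usage might look like the following. It assumes the classic jsdom.env API (jsdom v9 and earlier; current releases expose JSDOM.fromURL instead), and printLinkCount is a hypothetical helper name, not part of either library:

```javascript
// Sketch: load a page (local file or url) with jsdom and query it with
// jQuery. Assumes the classic jsdom.env API (jsdom v9 and earlier);
// newer releases expose JSDOM.fromURL instead. printLinkCount is a
// hypothetical helper, not part of either library.
function printLinkCount(url) {
  var jsdom = require('jsdom'); // npm install jsdom
  // jsdom fetches the page, injects jQuery into it, and hands the
  // resulting window object to the callback.
  jsdom.env(url, ['http://code.jquery.com/jquery.js'], function (errors, window) {
    if (errors) {
      console.error('Failed to load ' + url, errors);
      return;
    }
    var $ = window.$;
    console.log($('a').length + ' links found');
  });
}
```

Calling printLinkCount('http://example.com/') would log the number of anchor tags found on that page.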
Making Scraping More Robust
Unfortunately, there are a few common bugs that I ran into when scraping content with jQuery and jsdom. Specifically, there are two issues, not necessarily specific to jsdom, that are worth watching out for.
jQuery Return Values
The first issue is the return values of jQuery function calls, which deserve extra attention. Calling a method on undefined will crash a program, a problem that can be especially apparent in DOM parsing. Consider the example below:
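The snippet in question might look like the following sketch, where $ is a jQuery instance obtained from a jsdom window; the function names are hypothetical, added here for illustration:

```javascript
// The fragile version: only works when the page has at least eight
// links. $ is assumed to be a jQuery instance from a jsdom window;
// eq(7) selects the eighth link, since eq() is zero-indexed.
function unsafeEighthHref($) {
  // With fewer than eight links, attr('href') returns undefined and
  // split() throws: "Cannot read property 'split' of undefined".
  return $('a').eq(7).attr('href').split('/');
}

// The defensive version: check the return value before using it.
function safeEighthHref($) {
  var href = $('a').eq(7).attr('href');
  if (typeof href !== 'string') {
    return []; // nothing to split; avoid the crash
  }
  return href.split('/');
}
```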
If there are eight or more links on a page, the eighth link will be returned and its href attribute will be split into an array. However, if there are fewer than eight, attr('href') will return undefined, and calling split() on it will crash the program. Since HTML pages aren’t as structured as API responses, it’s important not to assume too much and to always check return values.
Web Page Errors
It’s entirely possible that the url passed to jsdom returns an error. If the error is temporary, your scraper might miss out on important information. This issue can be mitigated by recursively retrying the url, as in the example below:
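The retry pattern might be sketched as follows. Here fetchPage is a hypothetical stand-in for the actual jsdom call (any function that invokes its callback with (err, links) fits), passed as a parameter only to keep the sketch self-contained:

```javascript
// A sketch of retrying a failed request. fetchPage is a hypothetical
// stand-in for the jsdom call from earlier; it is injected as a
// parameter purely to keep this example self-contained.
function getLinks(url, fetchPage, retries, callback) {
  retries = retries || 0;
  if (retries >= 3) {
    // On the 3rd retry, give up and report the failure.
    return callback(new Error('Giving up on ' + url));
  }
  fetchPage(url, function (err, links) {
    if (err) {
      // Recurse with a larger retries value. Wrapping this call in
      // setTimeout would keep the retry from firing immediately.
      return getLinks(url, fetchPage, retries + 1, callback);
    }
    callback(null, links);
  });
}
```

In the real scraper, fetchPage would be the jsdom call itself rather than an injected parameter.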
With the above approach, if an error is encountered, getLinks will be called recursively with a larger retries value, and on the 3rd retry the function gives up and returns. If you wanted to go further, you could wrap the recursive call in setTimeout to ensure that the retried web request is not made immediately after the error is encountered.
Parsing web pages with jQuery on the server is a much more natural experience for developers already accustomed to using jQuery on the client. However, before scraping it’s worth checking whether the site 1) allows scraping and 2) does not already offer an API. Consuming a JSON API would be even easier than scraping and parsing!