Scrapy is a fantastic web scraping framework with much of what you need built in - including stats collection, and hooks you can use for email notifications. In this post I'll focus on those two features and collect some of the techniques I mixed and matched.
In Scrapy you'll set up your project using the scrapy startproject command and write your spider in a main file, say imagescraper.py; at its simplest, that's really all you need. But you'll probably also want to edit the settings.py file to adjust things like autothrottling.
Here’s our hypothetical scraper:
In the above code we're scraping an imaginary site for photos with a certain tag. All the results are listed on a single page as links, and each photo we want to scrape has its full-size link and metadata on its own entry page, which we reach by following those links.
Let's say we wanted to have this scraper run on a schedule and email a report when it's done. You can do that using Scrapy's item pipelines, which support hook methods like close_spider, which you can use to send an email like this:
Out of the box, this will send you a nice little email with a dump of all of Scrapy’s stats from the crawl, that looks something like this when it lands in your email:
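Something like this - the numbers here are made up, but these are typical keys from Scrapy's default stats collector:

```
{'downloader/request_count': 213,
 'downloader/response_count': 213,
 'downloader/response_status_count/200': 213,
 'elapsed_time_seconds': 42.7,
 'finish_reason': 'finished',
 'finish_time': datetime.datetime(2023, 5, 1, 12, 0, 42),
 'item_scraped_count': 212,
 'log_count/INFO': 10,
 'response_received_count': 213,
 'start_time': datetime.datetime(2023, 5, 1, 11, 59, 59)}
```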
That's cool. But what if you want to do a little more? Maybe there are other stats you want to add to the report, based on data your crawler encounters when it first hits the initial page, or calculations you do while parsing, or custom validation?
It would be nice if our stats included the website's own count of how many photos are in the results, along with the name of the category, in the summary the scraper emails when it finishes crawling the site. That way we can compare it against the request counts and see at a glance whether there were problems pulling down the data.
There's a really easy way to do this, though the docs are thin on how. Part of my motivation for writing this post was to make a code snippet more findable: it was really helpful for me, but so buried online that it took me a while to locate (here, if you're interested).
We’d take our scraper from above and add the code in the following example.
To do this, we're adding to the crawler's stats collection using the self.crawler.stats.set_value() method. You don't have to do anything at the pipeline end; whatever you add while you're scraping and parsing gets sent on through.