Robust Web Scraping in the Public Interest with AutoScrape (Research Paper)

Presented to Computation + Journalism 2019

I presented this paper on improving web scraping techniques at Computation + Journalism 2019. An implementation of the scraping framework it describes is available as an open-source project.

Abstract: Web scraping is a foundational task in journalism and tends to be performed using custom, one-off tools. Traditional methods involve constructing HTTP requests and extracting data using XPath[2]. As web sites become more interactive, these methods require an increasing amount of manual effort to develop and maintain. This paper builds on previous work in text-based extraction techniques[8], adapts them to navigating a real browser, and proposes using Hext, a novel domain-specific language for extracting structured data from HTML. We introduce AutoScrape, an investigative-focused web scraping tool which implements this framework. AutoScrape can simplify many common journalistic data gathering tasks and reduce maintenance costs. In partnership with several non-profit media organizations, this paper will also present case studies describing common investigative tasks and illustrate the use of this framework to successfully solve each problem.
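To make the "traditional method" mentioned in the abstract concrete, here is a minimal sketch of XPath-style extraction in Python. It is not code from the paper or from AutoScrape. The page fragment, the `results` table id, the `name` and `amount` classes, and the `extract_rows` helper are all hypothetical, and the HTTP request step is replaced by an inline string so the example runs offline. The standard library's ElementTree supports only a subset of XPath and needs well-formed markup, so a real scraper would typically use a dedicated HTML parser instead.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed page fragment standing in for a fetched response body.
PAGE = """
<html>
  <body>
    <table id="results">
      <tr><td class="name">Alice Smith</td><td class="amount">1200</td></tr>
      <tr><td class="name">Bob Jones</td><td class="amount">850</td></tr>
    </table>
  </body>
</html>
"""


def extract_rows(markup):
    """Pull (name, amount) records out of the hypothetical results table."""
    root = ET.fromstring(markup)
    rows = []
    # The path below is tied to the exact page layout: a renamed id or class,
    # or an extra wrapper element between table and tr, makes it match nothing.
    for tr in root.findall(".//table[@id='results']/tr"):
        name = tr.find("td[@class='name']")
        amount = tr.find("td[@class='amount']")
        if name is None or amount is None:
            continue  # skip rows that do not have both expected cells
        rows.append({"name": name.text, "amount": int(amount.text)})
    return rows


records = extract_rows(PAGE)
for record in records:
    print(record["name"], record["amount"])
```

The hard-coded path expressions are what make this style of scraper costly to maintain: each one has to be revisited whenever the target page changes its structure.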

You can download the paper here.