Robust Web Scraping in the Public Interest with AutoScrape (Research Paper)
Presented to Computation + Journalism 2019
I presented this paper on improving web scraping techniques at Computation + Journalism 2019. An implementation of this scraping framework is available as an open source project.
Abstract: Web scraping is a foundational task in journalism and tends to be performed using custom, one-off tools. Traditional methods involve constructing HTTP requests and extracting data using XPath[2]. As websites become more interactive, these methods require an increasing amount of manual effort to develop and maintain. This paper builds on previous work in text-based extraction techniques[8], adapts them to navigating a real browser, and proposes using Hext, a novel domain-specific language for extracting structured data from HTML. We introduce AutoScrape, an investigation-focused web scraping tool that implements this framework. AutoScrape can simplify many common journalistic data-gathering tasks and reduce maintenance costs. Drawing on partnerships with several non-profit media organizations, this paper also presents case studies that describe common investigative tasks and illustrate how the framework solves each problem.
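To make the "traditional" approach in the abstract concrete, here is a minimal sketch of XPath-style extraction using only the Python standard library. It is not taken from the paper or from AutoScrape. The page fragment, the `record`, `name`, and `amount` class names, and the `extract_records` helper are all hypothetical, and the markup is assumed to be well-formed XML so that `xml.etree.ElementTree` can parse it. Real pages usually need a forgiving HTML parser and an HTTP request to fetch them.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed fragment standing in for a fetched response body.
# In a real scraper this string would come from an HTTP request.
PAGE = """
<html><body>
  <table id="results">
    <tr class="record"><td class="name">Alice Smith</td><td class="amount">1200</td></tr>
    <tr class="record"><td class="name">Bob Jones</td><td class="amount">850</td></tr>
  </table>
</body></html>
"""


def extract_records(markup):
    """Pull one dict per table row using hand-written XPath expressions."""
    root = ET.fromstring(markup)
    records = []
    # ElementTree supports a limited XPath subset, including attribute predicates.
    for row in root.findall(".//tr[@class='record']"):
        name = row.find("td[@class='name']")
        amount = row.find("td[@class='amount']")
        records.append({"name": name.text, "amount": int(amount.text)})
    return records


records = extract_records(PAGE)
for record in records:
    print(record["name"], record["amount"])
```

Each selector here is tied to the page's exact structure, so renaming a class or restructuring the table means rewriting the expressions by hand. That is the kind of maintenance cost the abstract describes.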