Extracting References from Political Auto-Transcripts (Research Paper)

Presented to Data Science + Journalism Workshop at KDD 2017

Mon 14 August 2017 ~ Brandon Roberts
[Image: example of a document with many transcription errors]

I developed a methodology for extracting topics and subjects from messy data sources such as OCR output and audio transcriptions.

Abstract: This paper presents an unsupervised method for counting references in noisy auto-transcribed political speeches. Transcriptions are vectorized using learned embeddings, which are then clustered using k-means, resulting in groups of words that represent highly granular, specific topics within the text. Words from each cluster are then extracted from each transcript, counted, and arranged for time-series analysis. The approach finds semantically coherent topics representing specific references despite transcription inaccuracies. We use this framework to extract references from over 400 political speech transcriptions from a 2016 U.S. Presidential campaign.
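
To make the pipeline in the abstract concrete, here is a minimal sketch of its three steps: embed words, cluster the embeddings with k-means, then count cluster words per transcript. It is not the paper's implementation. The 2-D vectors are hand-picked stand-ins for learned embeddings, the sample transcripts and dates are invented, and the helper names (`kmeans`, `count_references`) are hypothetical. The k-means here is a small hand-rolled version with fixed starting centroids so the result is reproducible.

```python
from collections import Counter

# Toy 2-D vectors standing in for learned word embeddings (assumed values).
# "boarder" is a mis-transcription placed near "border", as a learned
# embedding might place it when both appear in similar contexts.
EMBEDDINGS = {
    "wall": (0.9, 0.1),
    "border": (1.0, 0.2),
    "boarder": (0.95, 0.15),
    "jobs": (0.1, 0.9),
    "factories": (0.2, 1.0),
    "factory": (0.15, 0.95),
}


def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def mean(points):
    """Component-wise mean of a non-empty list of vectors."""
    n = len(points)
    return tuple(sum(p[i] for p in points) / n for i in range(len(points[0])))


def kmeans(vectors, initial_centroids, iterations=10):
    """Plain k-means: returns a dict mapping each word to a cluster index."""
    centroids = list(initial_centroids)
    assignment = {}
    for _ in range(iterations):
        # Assignment step: each word joins its nearest centroid.
        assignment = {
            word: min(
                range(len(centroids)),
                key=lambda c: squared_distance(vec, centroids[c]),
            )
            for word, vec in vectors.items()
        }
        # Update step: move each centroid to the mean of its members.
        for c in range(len(centroids)):
            members = [vectors[w] for w, a in assignment.items() if a == c]
            if members:
                centroids[c] = mean(members)
    return assignment


def count_references(tokens, assignment):
    """Count how many tokens in one transcript fall into each cluster."""
    return Counter(assignment[t] for t in tokens if t in assignment)


# Invented sample transcripts, keyed by date for time-series analysis.
transcripts = [
    ("2016-08-01", "we will build the wall on the boarder".split()),
    ("2016-08-02", "jobs and factories are coming back to the border".split()),
]

assignment = kmeans(EMBEDDINGS, [EMBEDDINGS["wall"], EMBEDDINGS["jobs"]])
series = [(date, count_references(tokens, assignment)) for date, tokens in transcripts]

for date, counts in series:
    print(date, dict(counts))
```

Because the misspelled "boarder" sits close to "border" in the vector space, both land in the same cluster and are counted as the same reference, which is the property that lets this kind of approach tolerate transcription errors.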

You can download the paper here.