Extracting References from Political Auto-Transcripts (Research Paper)
Presented at the Data Science + Journalism Workshop at KDD 2017
Abstract: This paper presents an unsupervised method for counting references in noisy, auto-transcribed political speeches. Transcripts are vectorized using learned embeddings, which are then clustered with k-means, yielding groups of words that represent highly granular, specific topics within the text. Words from each cluster are then extracted from each transcript, counted, and arranged for time-series analysis. The approach finds semantically coherent topics representing specific references despite transcription inaccuracies. We use this framework to extract references from over 400 political speech transcripts from a 2016 U.S. presidential campaign.
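The pipeline described in the abstract (embed words, cluster with k-means, count cluster words per transcript, order by date) can be illustrated with a minimal sketch. It is not the authors' implementation. The abstract does not say how the embeddings are learned, so the two-dimensional vectors in EMBEDDINGS are hypothetical stand-ins, as are the function names, the example transcripts, and their dates. The k-means routine is a plain Lloyd's iteration, seeded deterministically with the first k words in sorted order so that the result is reproducible.

```python
import re

# Hypothetical toy embeddings; in practice these would be learned from the transcripts.
EMBEDDINGS = {
    "wall": (0.9, 0.1), "border": (0.8, 0.2), "mexico": (0.85, 0.15),
    "jobs": (0.1, 0.9), "trade": (0.2, 0.8), "china": (0.15, 0.85),
}


def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(vectors, k, iterations=20):
    """Cluster word vectors with Lloyd's algorithm; returns {word: cluster_id}."""
    words = sorted(vectors)
    # Deterministic seeding (an assumption made for reproducibility).
    centroids = [vectors[w] for w in words[:k]]
    assignment = {}
    for _ in range(iterations):
        # Assignment step: each word joins its nearest centroid.
        assignment = {
            w: min(range(k), key=lambda c: squared_distance(vectors[w], centroids[c]))
            for w in words
        }
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [vectors[w] for w in words if assignment[w] == c]
            if members:  # an empty cluster keeps its previous centroid
                centroids[c] = tuple(sum(dim) / len(members) for dim in zip(*members))
    return assignment


def count_references(transcripts, assignment, k):
    """Count cluster-word occurrences per transcript, ordered by date.

    transcripts: list of (iso_date, text) pairs.
    Returns a list of (iso_date, counts), where counts[c] is the number of
    tokens in that transcript belonging to cluster c.
    """
    series = []
    for date, text in sorted(transcripts):
        counts = [0] * k
        for token in re.findall(r"[a-z']+", text.lower()):
            if token in assignment:  # out-of-vocabulary tokens are ignored
                counts[assignment[token]] += 1
        series.append((date, counts))
    return series


assignment = kmeans(EMBEDDINGS, k=2)
transcripts = [
    ("2016-03-01", "We will build the wall on the border"),
    ("2016-02-01", "Jobs jobs jobs and better trade with China"),
]
series = count_references(transcripts, assignment, k=2)
```

Here `series` holds one row of per-cluster counts for each transcript in date order, which is the shape needed for time-series plotting. In this sketch, a mis-transcribed word is counted only if it appears in the embedding vocabulary and falls into a cluster; otherwise it is ignored.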