07/06/2020

Source Attribution: Recovering the Press Releases behind Health Science News

Ansel MacLaughlin, John Wihbey, Aleszu Bajak, David A. Smith

Keywords: articles, contexts, health, humans, news, news articles, predictions, relationships, representations, sources, texts

Abstract: We explore the task of intrinsic source attribution: inferring which portions of a derived document were adapted from an unobserved source document. Specifically, we model the relationship between news articles and their press release sources using a dataset of 64,784 health science news articles and 23,068 press releases. We approach the problem at the sentence level and work with science journalism professors to develop a four-point Likert scale describing the extent to which a news article sentence is derived from the content in the corresponding press release. Because manual annotation of pairs of news articles and press releases is time-consuming, we turn to a mix of expert, non-expert, and heuristic-based annotation to label our dataset. A small pilot study found that humans who can view only the text of the news article struggle to identify which content is derived. We then compare four different sentence regression models on the task. We find that modeling a sentence's context in the entire document is important: the best-performing model, a sequence regression model with BERT token representations, achieves a Spearman's $\rho$ of 0.49 and an NDCG@1 of 0.60 on the expert-labeled test set. Examining the model's predictions, we find that it successfully identifies copied or closely paraphrased sentences in articles with a mix of derived and original content, but struggles to differentiate between loosely paraphrased and original sentences in articles with mostly original writing.
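
The two metrics reported in the abstract can be illustrated with a minimal sketch for a single article. This is not the authors' evaluation code: the function names, the integer coding of the four-point scale as 0 to 3, and the exponential gain with a log2 discount in NDCG are all assumptions made for illustration.

```python
import math


def average_ranks(values):
    """Rank values from 1..n, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with values[order[i]].
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def spearman_rho(gold, pred):
    """Spearman's rho, computed as the Pearson correlation of the ranks."""
    rg, rp = average_ranks(gold), average_ranks(pred)
    n = len(rg)
    mg, mp = sum(rg) / n, sum(rp) / n
    cov = sum((a - mg) * (b - mp) for a, b in zip(rg, rp))
    var_g = sum((a - mg) ** 2 for a in rg)
    var_p = sum((b - mp) ** 2 for b in rp)
    if var_g == 0 or var_p == 0:
        return float("nan")  # undefined when one side is constant
    return cov / math.sqrt(var_g * var_p)


def ndcg_at_k(gold, pred, k=1):
    """NDCG@k with gain 2^rel - 1 and log2 discount (assumed formulation)."""
    def dcg(labels):
        return sum((2 ** rel - 1) / math.log2(pos + 2)
                   for pos, rel in enumerate(labels[:k]))

    # Gold labels reordered by descending predicted score.
    by_pred = [g for _, g in sorted(zip(pred, gold), key=lambda t: -t[0])]
    best = dcg(sorted(gold, reverse=True))
    return dcg(by_pred) / best if best > 0 else 0.0


# Hypothetical article with four sentences: gold labels on an assumed
# 0-3 scale and model scores for the same sentences.
gold = [3, 0, 1, 2]
pred = [0.9, 0.1, 0.4, 0.3]
rho = spearman_rho(gold, pred)
top1 = ndcg_at_k(gold, pred, k=1)
```

Under these assumptions, NDCG@1 asks only whether the sentence the model scores highest is among the most derived sentences in the article, while Spearman's $\rho$ measures agreement over the full ordering of sentences.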

The talk and the respective paper are published at the ICWSM 2020 virtual conference.
