Abstract:
Internet user-generated data, such as Twitter, offers data scientists a public, real-time data source that can provide insights which supplement traditional data sources. However, identifying relevant data for such analyses can be time consuming. In this paper, we introduce the Perplexity variant of our Positive-Unlabelled Learning (PPUL) framework as a means to perform social media relevance filtering. We note that this task is particularly well suited to a PU Learning approach. We demonstrate how perplexity, computed with language models, can identify candidate examples of the negative class. To learn such models, we experiment with both statistical methods and a Variational Autoencoder. Our PPUL method generally outperforms strong PU Learning baselines, which we demonstrate on five datasets: the Hazardous Product Review dataset, two well-known social media datasets, and two real case studies in relevance filtering. All datasets have manual annotations for evaluation, and in each case PPUL attains state-of-the-art performance, with gains of 4% to 17% over competitive baselines. We show that the PPUL framework is effective when the amount of positive annotated data is small, and that it is suitable both for content triggered by an event and for content on a general topic of interest.