Abstract:
In the last few years, Twitter has become an important resource for identifying Adverse Drug Reactions (ADRs), monitoring flu trends, and other pharmacovigilance and general research applications. Most researchers spend their time crawling Twitter, buying expensive pre-mined datasets, or slowly and tediously building datasets through the limited Twitter API. However, a large number of datasets that are publicly available to researchers remain underutilized or unused. In this work, we demonstrate how we mined over 9.4 billion tweets from archive.org's Twitter stream grab using a drug-term dictionary and plenty of computing power. Knowing that not all that glitters is gold, we used pre-existing drug-related datasets to build machine learning models that filter our findings for relevance. We present our methodology and the 3,346,758 identified tweets for public use in future research.
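As a rough illustration of the dictionary-matching step mentioned above, the following is a minimal sketch, not the pipeline used in this work. It assumes each tweet is a dict with a "text" field; the three-term `DRUG_TERMS` list and the function names `find_drug_terms` and `filter_tweets` are hypothetical stand-ins for the actual drug-term dictionary and tooling.

```python
import re

# Hypothetical mini-dictionary; the drug-term dictionary used in this work
# is far larger and is not reproduced here.
DRUG_TERMS = ["ibuprofen", "metformin", "xanax"]

# One case-insensitive pattern with word boundaries, so that a term is not
# matched inside a longer word.
PATTERN = re.compile(
    r"\b(" + "|".join(map(re.escape, DRUG_TERMS)) + r")\b",
    re.IGNORECASE,
)


def find_drug_terms(text):
    """Return the sorted, lower-cased dictionary terms mentioned in text."""
    return sorted({m.group(1).lower() for m in PATTERN.finditer(text)})


def filter_tweets(tweets):
    """Yield (tweet, matched_terms) for tweets mentioning at least one term.

    Assumes each tweet is a dict with a "text" key; tweets without text
    are skipped.
    """
    for tweet in tweets:
        terms = find_drug_terms(tweet.get("text", ""))
        if terms:
            yield tweet, terms
```

Tweets kept by such a filter would still need the relevance classification described above, since a dictionary match alone does not show that a tweet actually discusses drug use.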