REST: A Thread Embedding Approach for Identifying and Classifying User-Specified Information in Security Forums

Abstract: How can we extract useful information from a security forum? We focus on identifying threads of interest to a security professional: (a) alerts of worrisome events, such as attacks, (b) offering of malicious services and products, (c) hacking information to perform malicious acts, and (d) useful security-related experiences. The analysis of security forums is in its infancy despite several promising recent works. Novel approaches are needed to address the challenges in this domain: (a) the difficulty in specifying the "topics" of interest efficiently, and (b) the unstructured and informal nature of the text. We propose REST, a systematic methodology to: (a) identify threads of interest based on a possibly incomplete bag of words, and (b) classify them into one of the four classes above. The key novelty of the work is a multi-step weighted embedding approach: we project words, threads, and classes in appropriate embedding spaces and establish relevance and similarity there. We evaluate our method with real data from three security forums with a total of 164K posts and 21K threads. First, REST is robust to the initial keyword selection: it can extend the user-provided keyword set and thus recover from missing keywords. Second, REST categorizes the threads into the classes of interest with superior accuracy compared to five other methods: REST exhibits an accuracy between 63.3% and 76.9%. We see our approach as a first step for harnessing the wealth of information of online forums in a user-friendly way, since the user can loosely specify her keywords of interest.
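To make the two steps concrete, the following is a minimal toy sketch of the general idea (keyword expansion in an embedding space, then classification by similarity between a thread embedding and class embeddings). It is not the REST implementation: the 2-D word vectors, the class labels `alerts` and `services`, the similarity threshold, and all function names are assumptions made for illustration only.

```python
import math

# Hypothetical 2-D word vectors, hand-made for illustration only.
# A real system would learn embeddings from the forum text.
WORD_VECS = {
    "ddos":    (0.9, 0.1),
    "botnet":  (0.8, 0.2),
    "attack":  (0.7, 0.3),
    "sell":    (0.1, 0.9),
    "price":   (0.2, 0.8),
    "account": (0.3, 0.7),
}

def cosine(u, v):
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def expand_keywords(seeds, threshold=0.9):
    """Add every vocabulary word whose vector is close to some seed word."""
    expanded = set(seeds)
    for word, vec in WORD_VECS.items():
        for seed in seeds:
            if seed in WORD_VECS and cosine(vec, WORD_VECS[seed]) >= threshold:
                expanded.add(word)
    return expanded

def is_relevant(tokens, keywords):
    """A thread is of interest if it mentions any keyword."""
    return any(tok in keywords for tok in tokens)

def embed_thread(tokens, weights=None):
    """Weighted average of the known word vectors (uniform weights by default)."""
    weights = weights or {}
    total, acc = 0.0, [0.0, 0.0]
    for tok in tokens:
        if tok in WORD_VECS:
            w = weights.get(tok, 1.0)
            acc[0] += w * WORD_VECS[tok][0]
            acc[1] += w * WORD_VECS[tok][1]
            total += w
    return (acc[0] / total, acc[1] / total) if total else (0.0, 0.0)

# Hypothetical class embeddings, built here from a few words per class.
CLASS_VECS = {
    "alerts":   embed_thread(["ddos", "attack"]),
    "services": embed_thread(["sell", "price"]),
}

def classify(tokens):
    """Assign the class whose embedding is most similar to the thread's."""
    vec = embed_thread(tokens)
    return max(CLASS_VECS, key=lambda c: cosine(vec, CLASS_VECS[c]))

expanded = expand_keywords({"ddos"})
thread = ["new", "botnet", "attack", "today"]
print(sorted(expanded), is_relevant(thread, expanded), classify(thread))
```

In this toy setting, a user who supplies only the seed `ddos` still retrieves a thread that mentions `botnet`, because the expansion step adds nearby words; this illustrates the kind of recovery from missing keywords that the abstract describes.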