Discovering and Categorising Language Biases in Reddit

Abstract: We present a data-driven approach using word embeddings to discover and categorise language biases on the discussion platform Reddit. As spaces for isolated user communities, platforms such as Reddit are increasingly connected to issues of racism, sexism and other forms of discrimination, signalling the need to monitor the language of these groups. One of the most promising AI approaches to trace linguistic biases in large textual datasets involves word embeddings, which transform text into high-dimensional dense vectors and capture semantic relations between words. Yet, previous studies require predefined sets of potential biases to study, e.g., whether gender is more or less associated with particular types of jobs. This makes these approaches unfit to deal with smaller and community-centric datasets such as those on Reddit, which contain smaller vocabularies and slang, as well as biases that may be particular to that community. This paper proposes a data-driven approach to automatically discover language biases encoded in the vocabulary of online discourse communities on Reddit. In our approach, protected attributes are connected to evaluative words found in the data, which are then categorised through a semantic analysis system. We verify the effectiveness of our method by comparing the biases we discover in the Google News dataset with those found in previous literature. We then successfully discover gender bias, religion bias, and ethnic bias in different Reddit communities. We conclude by discussing potential application scenarios and limitations of this data-driven bias discovery method.
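The core idea in the abstract, relating words in a community's vocabulary to protected attributes through their embeddings, can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the toy two-dimensional vectors, the placeholder words (`word_a`, `word_b`, `word_c`), the function names, and the scoring rule (difference in cosine similarity to the centroid of each target set). It is not necessarily the exact formulation the authors use.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def centroid(vectors):
    """Component-wise mean of a list of vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def bias_scores(embeddings, targets_a, targets_b):
    """Score each non-target word by how much closer it lies to the
    centroid of target set A than to the centroid of target set B.
    Positive scores lean towards A, negative scores towards B."""
    c_a = centroid([embeddings[w] for w in targets_a])
    c_b = centroid([embeddings[w] for w in targets_b])
    skip = set(targets_a) | set(targets_b)
    return {
        word: cosine(vec, c_a) - cosine(vec, c_b)
        for word, vec in embeddings.items()
        if word not in skip
    }


# Hypothetical toy embeddings; real ones would be trained on the corpus.
toy = {
    "she": [1.0, 0.0],
    "he": [0.0, 1.0],
    "word_a": [0.9, 0.1],
    "word_b": [0.1, 0.9],
    "word_c": [0.5, 0.5],
}

scores = bias_scores(toy, ["she"], ["he"])
# Words ordered from most A-leaning to most B-leaning.
ranked = sorted(scores, key=scores.get, reverse=True)
```

In a realistic setting the embeddings would be trained on a subreddit's comments, and the top-ranked words for each target set would then be passed to a semantic categorisation step, as the abstract describes.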

Xavier Ferrer, Tom Van Nuenen, Jose M. Such, Natalia Criado

Similar Papers

Contextualizing Hate Speech Classifiers with Post-hoc Explanation

Brendan Kennedy, Xisen Jin, Aida Mostafazadeh Davani, Morteza Dehghani, Xiang Ren

Keywords: Contextualizing Classifiers, Post-hoc Explanation, Hate classifiers, fine-tuned classifiers

Social Bias Frames: Reasoning about Social and Power Implications of Language

Maarten Sap, Saadia Gabriel, Lianhui Qin, Dan Jurafsky, Noah A. Smith, Yejin Choi

Keywords: Warning, large-scale evaluation, high-level categorization, Social Frames

Comparative Evaluation of Label-Agnostic Selection Bias in Multilingual Hate Speech Datasets

Nedjma Ousidhoum, Yangqiu Song, Dit-Yan Yeung

Keywords: classification, data process, topic models, selection bias

Toward Gender-Inclusive Coreference Resolution

Yang Trista Cao, Hal Daumé III

Keywords: Gender-Inclusive Resolution, interrogating annotations, coreference systems, systemic biases

“are you kidding me?”: Detecting unpalatable questions on Reddit

Sunyam Bagga, Andrew Piper, Derek Ruths

Semi-Supervised Topic Modeling for Gender Bias Discovery in English and Swedish

Hannah Devinney, Jenny Björklund, Henrik Björklund

Hate-Speech and Offensive Language Detection in Roman Urdu

Hammad Rizwan, Muhammad Haroon Shakeel, Asim Karim

Keywords: automatic detection, hate-speech detection, language models, transfer learning

Machine translationese: Effects of algorithmic bias on linguistic complexity in machine translation

Eva Vanmassenhove, Dimitar Shterionov, Matthew Gwilliam

Team Oulu at SemEval-2020 Task 12: Multilingual Identification of Offensive Language, Type and Target of Twitter Post Using Translated Datasets

Md Saroar Jahan

Multi-Dimensional Gender Bias Classification

Emily Dinan, Angela Fan, Ledell Wu, Jason Weston, Douwe Kiela, Adina Williams

Keywords: detecting bias, machine models, nlp models, fine-grained framework

The Gap on Gap: Tackling the Problem of Differing Data Distributions in Bias-Measuring Datasets

Vid Kocijan, Oana-Maria Camburu, Thomas Lukasiewicz

Measuring Societal Biases from Text Corpora with Smoothed First-Order Co-occurrence

Navid Rekabsaz, Robert West, James Henderson, Allan Hanbury

Keywords: Subjectivity in textual data, sentiment analysis, polarity/opinion identification and extraction, linguistic analyses of social media behavior, Text categorization, topic recognition, demographic/gender/age identification

Nurse is Closer to Woman than Surgeon? Mitigating Gender-Biased Proximities in Word Embeddings

Vaibhav Kumar, Tenzin Bhotia, Vaibhav Kumar, Tanmoy Chakraborty

Keywords: word embeddings, semantic words, coreference resolution, post-processing methods

Abusive Language Detection in Heterogeneous Contexts: Dataset Collection and the Role of Supervised Attention

Hongyu Gong, Alberto Valido, Katherine M. Ingram, Giulia Fanti, Suma Bhat, Dorothy L. Espelage

Measuring what counts: The case of rumour stance classification

Carolina Scarton, Diego Silva, Kalina Bontcheva

“Call me sexist, but...” : Revisiting Sexism Detection Using Psychological Scales and Adversarial Samples

Mattia Samory, Indira Sen, Julian Kohne, Fabian Flöck, Claudia Wagner

Keywords: Psychological, personality-based and ethnographic studies of social media, Qualitative and quantitative studies of social media, Subjectivity in textual data, sentiment analysis, polarity/opinion identification and extraction, linguistic analyses of social

Language (Technology) is Power: A Critical Survey of "Bias" in NLP

Su Lin Blodgett, Solon Barocas, Hal Daumé III, Hanna Wallach

Keywords: NLP, NLP systems, normative reasoning, normative process

CrowS-Pairs: A Challenge Dataset for Measuring Social Biases in Masked Language Models

Nikita Nangia, Clara Vania, Rasika Bhalerao, Samuel R. Bowman

Keywords: nlp tasks, pretrained models, masked models, mlms

Gender Bias in Multilingual Embeddings and Cross-Lingual Transfer

Jieyu Zhao, Subhabrata Mukherjee, Saghar Hosseini, Kai-Wei Chang, Ahmed Hassan Awadallah

Keywords: cross-lingual transfer, multilingual embeddings, NLP applications, bias analysis

LINSPECTOR: Multilingual Probing Tasks for Word Representations

Gözde Gül Sahin, Clara Vania, Ilia Kuznetsov, Iryna Gurevych

Keywords: Word Representations, NLP, classification tasks, probing tasks

Towards Preemptive Detection of Depression and Anxiety in Twitter

David Owen, Jose Camacho-Collados, Luis Espinosa Anke
