Guardians: Exploring INLP for Mitigating Stereotypical Bias

Project Blog

Team Name: Guardians

Title: Exploring INLP for Mitigating Stereotypical Bias

Poster Presentation Day

Poster Presentation Image

Problem Statement

Our project addresses the inherent biases present in language models, particularly in their outputs. Neural language models often encode and propagate biases present in the training data, which can lead to undesirable outcomes such as reinforced stereotypes or discriminatory text generation.

Our aim is to explore the phenomenon of bias in language models, specifically focusing on stereotypical biases, and develop methods to mitigate these biases effectively. By understanding and addressing these biases, we strive to promote fairness, equity, and inclusivity in natural language processing applications.

Data Collection

We utilized the StereoSet dataset, which is specifically designed to evaluate stereotypical biases in language models. This dataset contains a wide range of text samples annotated with stereotypes, providing valuable insights into the biases present in language models.

Methodology

Our proposed method, Iterative Null Space Projection (INLP), aims to generalize previous projection-based debiasing methods by learning and removing gender directions from data iteratively.

INLP operates by training linear probe models to predict gender from representations and neutralizing these gender directions by projecting the data onto their nullspace. This process is iteratively applied to multiple orthogonal planes until no linear probe achieves above-random accuracy.
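The iterative loop described above can be sketched in a few lines of NumPy. This is a minimal illustration on synthetic data, not the paper's exact setup: the least-squares probe, the dimensions, and the 55% stopping threshold are our own illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy setup: the protected attribute ("gender" here) is linearly
# encoded along the first two dimensions. All data are synthetic.
n, d = 200, 10
X = rng.normal(size=(n, d))
y = (X[:, 0] + 0.5 * X[:, 1] > 0).astype(int)

def probe_direction(Xp, y):
    """Fit a least-squares linear probe and return its unit direction."""
    w, *_ = np.linalg.lstsq(Xp, y - y.mean(), rcond=None)
    norm = np.linalg.norm(w)
    return w / norm if norm > 1e-9 else w

def probe_accuracy(Xp, y, w):
    """Accuracy of the probe's sign prediction (flipped if below 50%)."""
    preds = (Xp @ w > 0).astype(int)
    acc = (preds == y).mean()
    return max(acc, 1 - acc)

Xp = X.copy()
initial_acc = probe_accuracy(Xp, y, probe_direction(Xp, y))
final_acc = initial_acc
for _ in range(d):
    w = probe_direction(Xp, y)
    final_acc = probe_accuracy(Xp, y, w)
    if final_acc < 0.55:  # probe is near chance: stop iterating
        break
    # Project the data onto the nullspace of w, removing this direction.
    Xp = Xp - np.outer(Xp @ w, w)
```

A convenient side effect of using minimal-norm least-squares solutions is that each new probe direction lies in the row space of the already-projected data, so successive removed directions come out orthogonal automatically.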

For deep models, INLP is applied to the final hidden representation, optionally followed by fine-tuning of the last linear layer.

We apply INLP to debias word embeddings, starting with GloVe embeddings, and annotate word vectors by gender bias based on their projection onto the "he-she" direction.
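The "he-she" annotation step amounts to a scalar projection. The sketch below uses made-up 3-d vectors as stand-ins for real GloVe embeddings (a real pipeline would load pretrained vectors), and the ±0.05 neutrality threshold is an illustrative choice.

```python
import numpy as np

# Toy stand-ins for GloVe vectors; values are illustrative only.
emb = {
    "he":       np.array([ 1.0, 0.2, 0.1]),
    "she":      np.array([-1.0, 0.2, 0.1]),
    "nurse":    np.array([-0.6, 0.5, 0.3]),
    "engineer": np.array([ 0.7, 0.4, 0.2]),
    "table":    np.array([ 0.0, 0.9, 0.1]),
}

# Unit vector along the "he-she" direction.
gender_dir = emb["he"] - emb["she"]
gender_dir = gender_dir / np.linalg.norm(gender_dir)

def gender_projection(word):
    """Scalar projection of a word vector onto the he-she direction."""
    return float(emb[word] @ gender_dir)

labels = {
    w: ("male-biased" if gender_projection(w) > 0.05
        else "female-biased" if gender_projection(w) < -0.05
        else "neutral")
    for w in emb
}
```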

To evaluate INLP’s effectiveness, we test it on profession-prediction datasets, where models tend to exhibit gender bias due to imbalanced representations of men and women in certain professions. INLP is applied to the last hidden representation of models, followed by fine-tuning of the last linear layer. Additionally, we leverage INLP to shed light on different facets of gender bias in neural models by analyzing the interpretability of top gender directions identified by INLP.
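The project-then-refit step on the profession task can be sketched as follows. Toy random features stand in for a model's final hidden states, and a hand-built projection P stands in for the output of the INLP loop; the least-squares refit is a simplified stand-in for retraining the last softmax layer.

```python
import numpy as np

rng = np.random.default_rng(1)
n, d, k = 300, 16, 4             # samples, hidden size, classes (toy)
H = rng.normal(size=(n, d))      # stand-in for final hidden representations
y = rng.integers(0, k, size=n)   # toy profession labels

# Suppose INLP produced a nullspace projection P; here we simply zero
# out dimension 0 as an illustrative "gender" direction.
P = np.eye(d)
P[0, 0] = 0.0

H_debiased = H @ P

# Refit the last linear layer on the projected features
# (one-vs-rest least squares instead of the original softmax layer).
Y = np.eye(k)[y]
W, *_ = np.linalg.lstsq(H_debiased, Y, rcond=None)
preds = (H_debiased @ W).argmax(axis=1)
```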

Poster Presentation Image

Experimental Setup

Poster Presentation Image

Model Selection

For our experiments, we employed BERT (Bidirectional Encoder Representations from Transformers), a state-of-the-art language model pre-trained on large corpora. BERT has demonstrated exceptional performance across various NLP tasks and serves as a robust baseline for our bias mitigation experiments.

Measurement of Bias in Large Language Models (LLMs)

To measure bias in language models, we utilized the StereoSet dataset, which provides a comprehensive benchmark for evaluating stereotype bias in natural language processing models. StereoSet contains diverse examples of stereotypical statements and measures the extent to which language models exhibit bias in their predictions.
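At its core, a stereotype score like StereoSet's asks how often the model prefers the stereotypical completion over the anti-stereotypical one. The sketch below uses made-up model scores; the real benchmark derives these from model likelihoods over stereotype, anti-stereotype, and unrelated options.

```python
# Hypothetical per-example model scores (likelihoods are illustrative).
examples = [
    {"stereotype": 0.62, "anti_stereotype": 0.38},  # prefers stereotype
    {"stereotype": 0.45, "anti_stereotype": 0.55},  # prefers anti-stereotype
    {"stereotype": 0.70, "anti_stereotype": 0.30},
    {"stereotype": 0.40, "anti_stereotype": 0.60},
]

# Stereotype Score: percentage of examples where the stereotypical
# option outscores the anti-stereotypical one (50 is ideal).
ss = 100 * sum(
    e["stereotype"] > e["anti_stereotype"] for e in examples
) / len(examples)
```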

In our evaluation of bias in large language models (LLMs), we used the following metrics:

  • Language Modeling Score (LMS): The Language Modeling Score represents the performance of a language model in generating coherent and contextually relevant text. An ideal LMS score is 100, indicating optimal performance.
  • Stereotype Score (SS): The Stereotype Score measures the degree of stereotype bias present in the language model’s output. An ideal SS score is 50, indicating minimal stereotype bias.
  • ICAT Test Score: The ICAT Test Score is calculated as the product of the Language Modeling Score (LMS) and the minimum value between the Stereotype Score (SS) and its complement (100-SS), divided by 50. This metric provides a comprehensive evaluation of both language modeling performance and stereotype bias in LLMs.

The formula for the ICAT Test Score is as follows:

ICAT Test Score = LMS × min(SS, (100 - SS)) / 50

The ICAT Test Score enables us to quantitatively assess the overall performance and bias mitigation capabilities of LLMs, taking into account both language modeling quality and stereotype bias.
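The ICAT formula above is straightforward to compute; the helper below follows it directly (the example scores are hypothetical).

```python
def icat(lms, ss):
    """ICAT = LMS * min(SS, 100 - SS) / 50.

    100 is ideal: a perfect language model (LMS = 100) with no
    stereotype preference (SS = 50)."""
    return lms * min(ss, 100 - ss) / 50

# Hypothetical scores for illustration:
icat(100, 50)   # ideal model -> 100.0
icat(100, 100)  # fully stereotyped model -> 0.0
```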

Results and Analysis

To visualize the process of debiasing using Iterative Null Space Projection (INLP), we employed various techniques to examine the changes in the representations of text data before and after applying INLP.

Poster Presentation Image
Overall improvements in terms of percentage
Poster Presentation Image
Poster Presentation Image

Conclusions

Our study shows that applying Iterative Null Space Projection (INLP) substantially reduces stereotypical bias in large language models (LLMs). Specifically, we observed an overall reduction in stereotypical bias of approximately 7%, and a larger reduction of around 10% in gender bias.

Gallery

Image 1
Image 2
Image 3
Image 4

Links

Github Repo Link

Project Report Link

References

Nadeem, M., Bethke, A., & Reddy, S. (2021). StereoSet: Measuring Stereotypical Bias in Pretrained Language Models. ACL 2021.

Ravfogel, S., Elazar, Y., Gonen, H., Twiton, M., & Goldberg, Y. (2020). Null It Out: Guarding Protected Attributes by Iterative Nullspace Projection. ACL 2020.

Video

Team Members

  • Ronak Redkar(2023201046)
  • Karun Choudhary(2023201038)
  • Jayank Mahaur(2023201043)
  • Gopal Sharma(2023201035)
  • Nikhil Kumar Nanda(2023201023)
