Guardians: Exploring INLP for Mitigating Stereotypical Bias

Project Blog

Team Name: Guardians

Title: Exploring INLP for Mitigating Stereotypical Bias

Poster Presentation Day

Poster Presentation Image

Problem Statement

Our project addresses the inherent biases present in language models, particularly in their outputs. Neural language models often encode and propagate biases present in the training data, which can lead to undesirable outcomes such as reinforced stereotypes or discriminatory text generation.

Our aim is to explore the phenomenon of bias in language models, specifically focusing on stereotypical biases, and develop methods to mitigate these biases effectively. By understanding and addressing these biases, we strive to promote fairness, equity, and inclusivity in natural language processing applications.

Data Collection

We utilized the StereoSet dataset, which is specifically designed to evaluate stereotypical biases in language models. This dataset contains a wide range of text samples annotated with stereotypes, providing valuable insights into the biases present in language models.

Methodology

Our proposed method, Iterative Null Space Projection (INLP), aims to generalize previous projection-based debiasing methods by learning and removing gender directions from data iteratively.

INLP operates by training linear probe models to predict gender from representations and neutralizing these gender directions by projecting the data onto their nullspace. This process is iteratively applied to multiple orthogonal planes until no linear probe achieves above-random accuracy.
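The iterative loop described above can be sketched in a few lines of NumPy. This is a minimal illustration on synthetic data, not the paper's exact setup: the least-squares probe, the dimensions, and the 55% stopping threshold are our own illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy setup: the protected attribute ("gender" here) is linearly
# encoded along the first two dimensions. All data are synthetic.
n, d = 200, 10
X = rng.normal(size=(n, d))
y = (X[:, 0] + 0.5 * X[:, 1] > 0).astype(int)

def probe_direction(Xp, y):
    """Fit a least-squares linear probe and return its unit direction."""
    w, *_ = np.linalg.lstsq(Xp, y - y.mean(), rcond=None)
    norm = np.linalg.norm(w)
    return w / norm if norm > 1e-9 else w

def probe_accuracy(Xp, y, w):
    """Accuracy of the probe's sign prediction (flipped if below 50%)."""
    preds = (Xp @ w > 0).astype(int)
    acc = (preds == y).mean()
    return max(acc, 1 - acc)

Xp = X.copy()
initial_acc = probe_accuracy(Xp, y, probe_direction(Xp, y))
final_acc = initial_acc
for _ in range(d):
    w = probe_direction(Xp, y)
    final_acc = probe_accuracy(Xp, y, w)
    if final_acc < 0.55:  # probe is near chance: stop iterating
        break
    # Project the data onto the nullspace of w, removing this direction.
    Xp = Xp - np.outer(Xp @ w, w)
```

A convenient side effect of using minimal-norm least-squares solutions is that each new probe direction lies in the row space of the already-projected data, so successive removed directions come out orthogonal automatically.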

For deep models, INLP is applied to the final hidden representation, optionally followed by fine-tuning of the last linear layer.

We apply INLP to debias word embeddings, starting with GloVe embeddings, and annotate word vectors by gender bias based on their projection onto the "he-she" direction.
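The "he-she" annotation step amounts to a scalar projection. The sketch below uses made-up 3-d vectors as stand-ins for real GloVe embeddings (a real pipeline would load pretrained vectors), and the ±0.05 neutrality threshold is an illustrative choice.

```python
import numpy as np

# Toy stand-ins for GloVe vectors; values are illustrative only.
emb = {
    "he":       np.array([ 1.0, 0.2, 0.1]),
    "she":      np.array([-1.0, 0.2, 0.1]),
    "nurse":    np.array([-0.6, 0.5, 0.3]),
    "engineer": np.array([ 0.7, 0.4, 0.2]),
    "table":    np.array([ 0.0, 0.9, 0.1]),
}

# Unit vector along the "he-she" direction.
gender_dir = emb["he"] - emb["she"]
gender_dir = gender_dir / np.linalg.norm(gender_dir)

def gender_projection(word):
    """Scalar projection of a word vector onto the he-she direction."""
    return float(emb[word] @ gender_dir)

labels = {
    w: ("male-biased" if gender_projection(w) > 0.05
        else "female-biased" if gender_projection(w) < -0.05
        else "neutral")
    for w in emb
}
```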

To evaluate INLP’s effectiveness, we test it on profession-prediction datasets, where models tend to exhibit gender bias due to imbalanced representations of men and women in certain professions. INLP is applied to the last hidden representation of models, followed by fine-tuning of the last linear layer. Additionally, we leverage INLP to shed light on different facets of gender bias in neural models by analyzing the interpretability of top gender directions identified by INLP.
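The project-then-refit step on the profession task can be sketched as follows. Toy random features stand in for a model's final hidden states, and a hand-built projection P stands in for the output of the INLP loop; the least-squares refit is a simplified stand-in for retraining the last softmax layer.

```python
import numpy as np

rng = np.random.default_rng(1)
n, d, k = 300, 16, 4             # samples, hidden size, classes (toy)
H = rng.normal(size=(n, d))      # stand-in for final hidden representations
y = rng.integers(0, k, size=n)   # toy profession labels

# Suppose INLP produced a nullspace projection P; here we simply zero
# out dimension 0 as an illustrative "gender" direction.
P = np.eye(d)
P[0, 0] = 0.0

H_debiased = H @ P

# Refit the last linear layer on the projected features
# (one-vs-rest least squares instead of the original softmax layer).
Y = np.eye(k)[y]
W, *_ = np.linalg.lstsq(H_debiased, Y, rcond=None)
preds = (H_debiased @ W).argmax(axis=1)
```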

Poster Presentation Image

Experimental Setup

Poster Presentation Image

Model Selection

For our experiments, we employed BERT (Bidirectional Encoder Representations from Transformers), a state-of-the-art language model pre-trained on large corpora. BERT has demonstrated exceptional performance across various NLP tasks and serves as a robust baseline for our bias mitigation experiments.

Measurement of Bias in Large Language Models (LLMs)

To measure bias in language models, we utilized the StereoSet dataset, which provides a comprehensive benchmark for evaluating stereotype bias in natural language processing models. StereoSet contains diverse examples of stereotypical statements and measures the extent to which language models exhibit bias in their predictions.
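At its core, a stereotype score like StereoSet's asks how often the model prefers the stereotypical completion over the anti-stereotypical one. The sketch below uses made-up model scores; the real benchmark derives these from model likelihoods over stereotype, anti-stereotype, and unrelated options.

```python
# Hypothetical per-example model scores (likelihoods are illustrative).
examples = [
    {"stereotype": 0.62, "anti_stereotype": 0.38},  # prefers stereotype
    {"stereotype": 0.45, "anti_stereotype": 0.55},  # prefers anti-stereotype
    {"stereotype": 0.70, "anti_stereotype": 0.30},
    {"stereotype": 0.40, "anti_stereotype": 0.60},
]

# Stereotype Score: percentage of examples where the stereotypical
# option outscores the anti-stereotypical one (50 is ideal).
ss = 100 * sum(
    e["stereotype"] > e["anti_stereotype"] for e in examples
) / len(examples)
```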

In our evaluation of bias in large language models (LLMs), we used the following metrics:

  • Language Modeling Score (LMS): The Language Modeling Score represents the performance of a language model in generating coherent and contextually relevant text. An ideal LMS score is 100, indicating optimal performance.
  • Stereotype Score (SS): The Stereotype Score measures the degree of stereotype bias present in the language model’s output. An ideal SS score is 50, indicating minimal stereotype bias.
  • ICAT Test Score: The ICAT Test Score is calculated as the product of the Language Modeling Score (LMS) and the minimum value between the Stereotype Score (SS) and its complement (100-SS), divided by 50. This metric provides a comprehensive evaluation of both language modeling performance and stereotype bias in LLMs.

The formula for the ICAT Test Score is as follows:

ICAT Test Score = LMS × min(SS, (100 - SS)) / 50

The ICAT Test Score enables us to quantitatively assess the overall performance and bias mitigation capabilities of LLMs, taking into account both language modeling quality and stereotype bias.
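The ICAT formula above is straightforward to compute; the helper below follows it directly (the example scores are hypothetical).

```python
def icat(lms, ss):
    """ICAT = LMS * min(SS, 100 - SS) / 50.

    100 is ideal: a perfect language model (LMS = 100) with no
    stereotype preference (SS = 50)."""
    return lms * min(ss, 100 - ss) / 50

# Hypothetical scores for illustration:
icat(100, 50)   # ideal model -> 100.0
icat(100, 100)  # fully stereotyped model -> 0.0
```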

Results and Analysis

To visualize the process of debiasing using Iterative Null Space Projection (INLP), we employed various techniques to examine the changes in the representations of text data before and after applying INLP.

Poster Presentation Image
Overall improvements in terms of percentage
Poster Presentation Image
Poster Presentation Image

Conclusions

Our study shows that applying Iterative Null Space Projection (INLP) substantially reduces stereotypical bias in large language models (LLMs). Specifically, we observed an overall reduction in stereotypical bias of approximately 7%, and a larger reduction of around 10% in gender bias.

Gallery

Image 1
Image 2
Image 3
Image 4

Links

Github Repo Link

Project Report Link

References

Nadeem, M., Bethke, A., & Reddy, S. (2021). StereoSet: Measuring Stereotypical Bias in Pretrained Language Models. ACL 2021.

Ravfogel, S., Elazar, Y., Gonen, H., Twiton, M., & Goldberg, Y. (2020). Null It Out: Guarding Protected Attributes by Iterative Nullspace Projection. ACL 2020.

Video

Team Members

  • Ronak Redkar(2023201046)
  • Karun Choudhary(2023201038)
  • Jayank Mahaur(2023201043)
  • Gopal Sharma(2023201035)
  • Nikhil Kumar Nanda(2023201023)
