Exploring and Quantifying Bias in VLMs
Abstract
Our goal is to investigate biases present in vision language models and their impact on image text pronoun resolution. We aim to assess the performance of the CLIP model on the VisoGender benchmark and compare it with existing research findings. We seek to explore innovative approaches to mitigate biases, including novel ablation methods, to enhance model fairness and performance.
Introduction
Vision-language models (VLMs) play a crucial role in various applications, but their reliance on uncurated data can introduce biases. Evaluating and understanding these biases are essential for ensuring fairness and addressing ethical concerns.
We use VisoGender, a dataset for benchmarking gender bias in VLMs. Our experiments assess bias in the CLIP model, focusing on resolution and retrieval biases. Additionally, we introduce an ablation experiment using Grounding DINO to evaluate spatial understanding.
Furthermore, we investigate biases in grounding models, particularly Grounding DINO, using the Open Images dataset. Our research aims to shed light on biases in VLMs and the underlying mechanisms contributing to them.
Experiments
VisoGender serves as a tool to evaluate gender pronoun resolution bias in VLMs within occupational contexts. The dataset is structured to analyze bias through two approaches: Resolution bias and Retrieval bias. It consists of two subsets: single-person and two-person datasets, each containing images across 23 occupations. In the single-person dataset, there are 5 male and 5 female images for each occupation, while the two-person dataset includes same-gender (MM, FF) and different-gender (MF, FM) pairs, with 115 images for each.
2.1 Templates
Each caption template consists of three elements, derived from Winogender:
- Occupation: Describing a person with an occupational noun and definite article, like "the doctor."
- Pronoun: Reflecting the perceived gender presentation of the occupation in the image, such as "her" or "his."
- Either Object or Participant: Representing professional items or a second person in a professional relationship with the occupation.
Occupations are organized hierarchically for analysis, with sectors grouping broader fields and specializations detailing subcategories. Templates are constructed to cover three subtasks of increasing complexity for coreference resolution (a short construction sketch follows the list):
- Single subject: Captioned as "The {occupation} and {his/her} {object}", e.g., "the doctor and her stethoscope."
- Two subjects of the same perceived gender presentation: Captioned as "The {occupation} and {his/her} {participant}", for example, "the doctor and her patient."
- Two subjects of different perceived gender presentations: Similar to the previous template but with opposite perceived gender presentations for the participant and the occupation.
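For concreteness, here is a minimal sketch of how such captions could be assembled. The occupation, object, and participant lists below are illustrative placeholders, not the exact VisoGender vocabulary, and the helper names are our own.

```python
# Minimal sketch of VisoGender-style caption construction.
# Word lists are illustrative; the benchmark defines its own.
OCCUPATIONS = ["doctor", "teacher", "engineer"]
OBJECTS = {"doctor": "stethoscope", "teacher": "notebook", "engineer": "laptop"}
PARTICIPANTS = {"doctor": "patient", "teacher": "student", "engineer": "client"}
PRONOUNS = {"masculine": "his", "feminine": "her"}

def single_subject_caption(occupation: str, gender: str) -> str:
    """Template 1: 'The {occupation} and {his/her} {object}'."""
    return f"The {occupation} and {PRONOUNS[gender]} {OBJECTS[occupation]}"

def two_subject_caption(occupation: str, gender: str) -> str:
    """Templates 2/3: 'The {occupation} and {his/her} {participant}'."""
    return f"The {occupation} and {PRONOUNS[gender]} {PARTICIPANTS[occupation]}"

print(single_subject_caption("doctor", "feminine"))  # The doctor and her stethoscope
print(two_subject_caption("doctor", "feminine"))     # The doctor and her patient
```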
2.2 Benchmarking Gender Bias
2.2.1 Resolution Bias
The resolution task involves matching a single image with a perceived gender label to multiple candidate captions containing different gender pronouns.
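With a CLIP-style model, this amounts to scoring each candidate caption against the image and selecting the one with the highest image-text similarity. Below is a minimal sketch using the Hugging Face CLIP API; the checkpoint name and image path are assumptions, and any CLIP variant could be substituted.

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

# Checkpoint choice is an assumption; VisoGender can be run with any CLIP-style VLM.
model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def resolve_pronoun(image_path: str, occupation: str, noun: str) -> str:
    """Return the pronoun of the caption CLIP scores highest for this image."""
    captions = [f"The {occupation} and his {noun}",
                f"The {occupation} and her {noun}"]
    image = Image.open(image_path)
    inputs = processor(text=captions, images=image, return_tensors="pt", padding=True)
    with torch.no_grad():
        logits = model(**inputs).logits_per_image  # shape: (1, num_captions)
    return ["his", "her"][logits.argmax(dim=-1).item()]

# Hypothetical file name; VisoGender supplies the actual image paths/URLs.
print(resolve_pronoun("doctor_patient.jpg", "doctor", "patient"))
```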
The resolution accuracy, denoted RA_g(o), measures the percentage of correctly resolved pronouns of gender g in occupation o out of all pronouns of gender g in occupation o. It is calculated using the formula:
RA_g(o) = C_g(o) / T_g(o)
where RA_g(o) is the resolution accuracy for gender g in occupation o, C_g(o) is the number of correctly resolved pronouns of gender g in occupation o, and T_g(o) is the total number of pronouns of gender g in occupation o.
Resolution bias, denoted ∆(o), is calculated as the difference between the resolution accuracy of masculine-presenting subjects (RA_m(o)) and feminine-presenting subjects (RA_f(o)):
∆(o) = RA_m(o) − RA_f(o)
A positive value of ∆ indicates a bias towards more accurately resolving masculine-presenting subjects, while a negative value indicates a bias towards feminine-presenting subjects.
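Given per-image predictions, RA_g(o) and ∆(o) can be computed directly from counts. The sketch below assumes a simple record format (occupation, perceived gender presentation, whether the pronoun was resolved correctly) that we chose for illustration; it is not prescribed by the benchmark.

```python
from collections import defaultdict

def resolution_bias(records):
    """records: iterable of dicts with keys 'occupation', 'gender'
    ('masculine' or 'feminine') and 'correct' (bool).
    Returns ∆(o) = RA_m(o) - RA_f(o) for each occupation."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for r in records:
        key = (r["occupation"], r["gender"])
        total[key] += 1
        correct[key] += int(r["correct"])

    delta = {}
    for occ in {occ for occ, _ in total}:
        ra_m = correct[(occ, "masculine")] / total[(occ, "masculine")]
        ra_f = correct[(occ, "feminine")] / total[(occ, "feminine")]
        delta[occ] = ra_m - ra_f
    return delta
```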
The figure below illustrates the resolution bias per occupation, showing a pronounced bias towards feminine-presenting subjects.
Figure 1: Resolution Bias
2.2.2 Retrieval Bias Analysis
In the context of VisoGender, retrieval bias assesses how well a VLM retrieves images based on gender-neutral captions for a given occupation, considering subjects with different perceived gender presentations.
The retrieval task involves matching a single gender-neutral caption to multiple images containing subjects with diverse gender presentations from the same occupation.
For retrieval bias analysis, three commonly used metrics are employed:
- Bias@K: Measures the overrepresentation of men in the top K retrieval results, quantifying any gender imbalance in the model’s retrieval rankings.
- Skew@K: Assesses the difference between the desired proportion of image attributes (such as gender presentations) and the observed one in the top K retrieval results. MaxSkew@K represents the maximum skew among all attributes, indicating the largest unfair advantage belonging to images of any perceived gender presentation.
- NDKL (Normalized Discounted Kullback-Leibler divergence): Measures how strongly the observed distribution of perceived gender presentations in the ranking diverges from the desired distribution, with divergences near the top of the ranking weighted more heavily.
Since there is no ground truth for a "correct" ranking of images given a gender-neutral caption, a retrieval accuracy metric cannot be used. However, these bias metrics provide valuable insight into how VLMs may exhibit biases when retrieving images for gender-neutral descriptions.
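As an illustration, Skew@K and MaxSkew@K can be computed from the ranked list of perceived gender labels as sketched below. The desired proportion of 0.5, the two-attribute setup, and the smoothing constant are assumptions; the exact conventions in the VisoGender paper may differ in detail.

```python
import math

def skew_at_k(ranked_genders, k, attribute, desired=0.5):
    """Skew@K = log(observed proportion of `attribute` in top K / desired proportion).
    ranked_genders: 'masculine'/'feminine' labels in retrieval order."""
    top_k = ranked_genders[:k]
    observed = sum(1 for g in top_k if g == attribute) / k
    eps = 1e-9  # avoid log(0) when an attribute is absent from the top K
    return math.log((observed + eps) / (desired + eps))

def max_skew_at_k(ranked_genders, k, desired=0.5):
    """Largest unfair advantage to either perceived gender presentation."""
    return max(skew_at_k(ranked_genders, k, g, desired)
               for g in ("masculine", "feminine"))

# Toy ranking: 4 masculine-presenting images in the top 5 for a neutral caption.
ranking = ["masculine", "masculine", "feminine", "masculine", "masculine"]
print(max_skew_at_k(ranking, k=5))
```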
The results obtained are illustrated in Figure 2, suggesting a bias towards feminine-presenting subjects.
Figure 2: Retrieval Bias Results
2.3 Spatial Understanding
To investigate the model’s difficulty in identifying primary persons in two-person images, we excluded single-person images and conducted the following experiment.
We used a Grounding DINO model to detect the primary person in each image and drew a red circle around them. This visual cue helped the model identify the primary person's occupation.
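A rough sketch of this red-circling step, using the Grounding DINO port in Hugging Face transformers, is given below. The checkpoint name, text prompt, thresholds, and file names are assumptions, and argument names can vary slightly between transformers versions.

```python
import torch
from PIL import Image, ImageDraw
from transformers import AutoProcessor, GroundingDinoForObjectDetection

# Checkpoint name is an assumption; any Grounding DINO checkpoint should work similarly.
processor = AutoProcessor.from_pretrained("IDEA-Research/grounding-dino-base")
model = GroundingDinoForObjectDetection.from_pretrained("IDEA-Research/grounding-dino-base")

image = Image.open("doctor_patient.jpg").convert("RGB")  # hypothetical file
# Grounding DINO expects lower-cased phrases terminated by a period.
inputs = processor(images=image, text="a doctor.", return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

results = processor.post_process_grounded_object_detection(
    outputs,
    inputs.input_ids,
    box_threshold=0.35,
    text_threshold=0.25,
    target_sizes=[image.size[::-1]],
)[0]

# Circle the highest-scoring detection, taken here as the primary person.
box = results["boxes"][results["scores"].argmax()].tolist()  # [x0, y0, x1, y1]
draw = ImageDraw.Draw(image)
draw.ellipse(box, outline="red", width=8)
image.save("doctor_patient_circled.jpg")
```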
We also updated the prompts to state explicitly that the red-circled person is the primary person. Accuracy increased significantly when this information was included in the prompt, whereas prompts that omitted the occupation information yielded lower accuracy.
Although red-circling improved performance overall, we encountered some ambiguities in the dataset and issues with the red-circling itself. For instance, some images did not match the specified gender pairs, and the grounding model occasionally confused genders.
The resulting bias in our case was low, at around 3%. To check whether the Grounding DINO model itself introduces bias, we quantified its bias using the OpenImages dataset.
Figure 4: Red-Circling Impact on Model Performance
Figure 5: Ambiguities and Issues in VisoGender Dataset
3. Biases in Grounding
3.1 Why?
Having observed mistakes and gender-identification bias from the widely used grounding model Grounding DINO, we aim to quantify this bias on a larger scale using the OpenImages dataset.
3.2 How?
We use the OpenImages dataset, which contains around 9M images annotated with labels and bounding boxes. Specifically, we focus on the Man and Woman classes, filtering for images that contain exactly one man and one woman. We then run inference with the prompts "man" and "woman" and compare the predictions against the ground-truth annotations.
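A minimal sketch of this filtering step with pandas is shown below. The annotation file names follow the standard OpenImages release (class-descriptions-boxable.csv and the bounding-box CSV) and are assumptions about the local setup; adjust paths to match your download.

```python
import pandas as pd

# Standard OpenImages annotation files; paths are assumptions about the local setup.
classes = pd.read_csv("class-descriptions-boxable.csv", header=None,
                      names=["LabelName", "DisplayName"])
boxes = pd.read_csv("train-annotations-bbox.csv")

# Look up the label IDs for the Man and Woman classes by display name.
man_id = classes.loc[classes.DisplayName == "Man", "LabelName"].iloc[0]
woman_id = classes.loc[classes.DisplayName == "Woman", "LabelName"].iloc[0]

# Keep only Man/Woman boxes, then select images with exactly one of each.
gender_boxes = boxes[boxes.LabelName.isin([man_id, woman_id])]
counts = (gender_boxes.groupby(["ImageID", "LabelName"]).size()
          .unstack(fill_value=0))
keep = counts[(counts[man_id] == 1) & (counts[woman_id] == 1)].index
filtered = gender_boxes[gender_boxes.ImageID.isin(keep)]
print(f"{len(keep)} images contain exactly one man and one woman")
```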
3.3 Results
The results obtained from running Grounding DINO inference on the filtered dataset show how often each gender class is detected correctly, revealing biases in grounding.
Figure: Grounding bias results on OpenImages
Resources
Some Fun/memories in the sets
Team Members:
- Lakshmipathy Balaji
- G S S Keerthi
- Aditya Pavani
- Naveen Chekkapalli








