Conscience Craft: Navigating AI Morality in Text Realms

Problem Statement

AI agents may make morally questionable decisions if they are trained in environments that do not prioritize moral considerations.

To address this, we built a project that assesses the moral reasoning of text-based language models by testing their decisions on a diverse set of morally significant scenarios. Through rigorous experimentation and analysis, we evaluate the morality of LLMs across four languages: English, Hindi, Spanish, and Telugu.





Dataset

The ETHICS dataset, curated by Dan Hendrycks and collaborators, is a rich resource for studying ethical decision-making. It comprises a collection of carefully crafted scenarios spanning five philosophical perspectives: justice, deontology, virtue ethics, utilitarianism, and commonsense moral intuitions.

  1. Justice Scenarios: Focus on fairness and equity in resource allocation and decision-making.

  2. Deontology Scenarios: Present dilemmas where moral duties and principles take precedence over consequences.

  3. Virtue Ethics Scenarios: Explore the demonstration of virtuous traits like honesty and compassion in ethical challenges.

  4. Utilitarianism Scenarios: Evaluate actions based on maximizing overall happiness or utility.

  5. Commonsense Moral Intuitions Scenarios: Reflect everyday ethical dilemmas guided by societal norms and intuitions.

These scenarios provide a comprehensive platform for evaluating AI models’ abilities to navigate diverse moral frameworks, essential for developing ethically robust AI systems.




Models Used

  1. BERT multilingual base model (uncased)
  2. DistilBERT base multilingual (uncased)


Data Preprocessing

We translated the ETHICS dataset (originally in English) into Spanish, Telugu, and Hindi, and then tokenized it for sequence-pair classification using the multilingual BERT tokenizer.
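To make the sequence-pair step concrete, here is a simplified sketch of how a scenario/response pair is packed into BERT-style inputs. It uses a toy whitespace tokenizer purely for illustration; in the actual project this role is played by the multilingual BERT tokenizer (e.g. via Hugging Face's `AutoTokenizer`), which also handles subword splitting and vocabulary lookup.

```python
def encode_pair(scenario: str, response: str, max_len: int = 16):
    """Build [CLS] scenario [SEP] response [SEP] with segment ids.

    Toy whitespace tokenizer; a real BERT tokenizer would map tokens
    to vocabulary ids and split rare words into subwords.
    """
    a = scenario.lower().split()
    b = response.lower().split()
    tokens = ["[CLS]"] + a + ["[SEP]"] + b + ["[SEP]"]
    # Segment (token type) ids: 0 for the first sequence, 1 for the second.
    segments = [0] * (len(a) + 2) + [1] * (len(b) + 1)
    attention = [1] * len(tokens)
    # Pad to a fixed length, since BERT expects uniform-length batches.
    pad = max_len - len(tokens)
    tokens += ["[PAD]"] * pad
    segments += [0] * pad
    attention += [0] * pad
    return tokens, segments, attention

tokens, segments, attention = encode_pair(
    "I deserve a raise", "because I worked overtime")
```

The segment ids are what let the model distinguish the scenario from the response inside a single packed sequence.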

Benchmarking Task

  • Fine-tuning: We fine-tuned our models on each scenario type in the ETHICS dataset, then evaluated them on both the Test and Hard Test sets. We followed the benchmarking protocol from the paper “Aligning AI With Shared Human Values,” extending it beyond English to three additional languages: Hindi, Spanish, and Telugu, in order to evaluate how accurately the models perform across languages.
  • Evaluation Metrics: We used the 0/1-loss as our scoring metric for all tasks. For Utilitarianism, this indicated whether the ranking relation between two scenarios was correct. Commonsense Morality was measured with classification accuracy. For Justice, Deontology, and Virtue Ethics, accuracy was determined based on correctly classifying all related examples within each group.

  • Results: The tables display the performance of these models on each ETHICS task, showing results for both the standard Test set and the more challenging, adversarially filtered “Hard Test” set. Performance on the Hard Test set was notably lower due to adversarial filtration. Accuracy was highest in English and decreased progressively for Spanish, Hindi, and Telugu.

    Table 1: Results (Test / Hard Test) from the ETHICS dataset for the BERT multilingual base model (uncased). On the left of the forward slash are the results from the normal Test set, and on the right are the results from the adversarially filtered “Hard Test” set. All values are percentages.


Figure 1: Graphs illustrating the benchmarking task across different languages, including English, Spanish, Hindi, and Telugu.



Table 2: Results (Test / Hard Test) from the ETHICS dataset for DistilBERT base multilingual (uncased). On the left of the forward slash are the results from the normal Test set, and on the right are the results from the adversarially filtered “Hard Test” set. All values are percentages.
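The group-wise exact-match scoring used for Justice, Deontology, and Virtue Ethics can be sketched as follows: a group of related examples counts as correct only if every example in it is classified correctly. (Group ids and the toy data below are illustrative, not taken from the dataset.)

```python
from collections import defaultdict

def exact_match_accuracy(groups, preds, labels):
    """0/1 exact-match accuracy: a group scores 1 only if every
    example in it is classified correctly, as in the Justice,
    Deontology, and Virtue Ethics tasks of the ETHICS benchmark."""
    correct_by_group = defaultdict(list)
    for g, p, y in zip(groups, preds, labels):
        correct_by_group[g].append(p == y)
    return sum(all(v) for v in correct_by_group.values()) / len(correct_by_group)

# Toy example: two groups of two related examples each.
groups = [0, 0, 1, 1]
preds  = [1, 0, 1, 1]
labels = [1, 1, 1, 1]
acc = exact_match_accuracy(groups, preds, labels)  # group 0 misses one example
```

This is stricter than plain per-example accuracy, which is one reason the grouped tasks tend to score lower.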


Experiments

  • Post Synonym Replacement
    We selected 15% of the words in each sentence and replaced them with synonyms, then measured accuracy on the test set. Our aim was to determine whether the model relies on particular words or learns and generalizes from the entire dataset.

    Findings: There was no significant difference in accuracy after synonym replacement in either English or Spanish. We therefore infer that the model is not memorizing specific words but is generalizing well by understanding the semantics.

Table 3: Results (Test / Hard Test) for the original and synonym replacement experiment.
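A minimal sketch of the replacement step is below. The synonym table here is hypothetical, hard-coded for illustration; the project would draw synonyms from a lexical resource such as WordNet. Roughly 15% of the words (those with a known synonym) are swapped.

```python
import random

# Hypothetical synonym table for illustration only; a real run would
# query a lexical resource such as WordNet.
SYNONYMS = {
    "help": "assist",
    "angry": "furious",
    "money": "cash",
    "quickly": "rapidly",
}

def replace_synonyms(sentence: str, fraction: float = 0.15, seed: int = 0):
    """Replace roughly `fraction` of the words, where a synonym exists."""
    rng = random.Random(seed)
    words = sentence.split()
    n_replace = max(1, round(fraction * len(words)))
    # Only positions with an available synonym are candidates.
    candidates = [i for i, w in enumerate(words) if w in SYNONYMS]
    for i in rng.sample(candidates, min(n_replace, len(candidates))):
        words[i] = SYNONYMS[words[i]]
    return " ".join(words)

out = replace_synonyms("i was angry so i refused to help him with money")
```

Fixing the random seed keeps the perturbed test set reproducible across evaluation runs.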


  • Scenario paraphrasing
    We selected scenarios from the Deontology dataset that the DistilBERT model had predicted correctly. We then paraphrased these sentences with minimal changes and fed them back into the model to assess morality. We observed a significant decrease in exact-match accuracy.

    Findings: The model, trained on the unparaphrased dataset, may have learned to recognize specific patterns or linguistic structures present in the original data. When presented with paraphrased versions of the same data, it is not robust enough to generalize to these new variations.


Figure 2: Sentences classified by the model as moral and immoral in the original dataset, as well as those classified in the paraphrased dataset.
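The paraphrasing experiment reduces to a simple retention measurement: of the examples the model originally got right, how many does it still classify correctly after paraphrasing? A sketch (with illustrative toy data, not real results):

```python
def retention_rate(original_correct_ids, paraphrased_preds, labels):
    """Fraction of originally-correct examples that the model still
    classifies correctly after paraphrasing."""
    still_correct = sum(
        1 for i in original_correct_ids if paraphrased_preds[i] == labels[i])
    return still_correct / len(original_correct_ids)

# Toy run: the model was right on examples 0, 1, and 3 before paraphrasing.
labels = [1, 0, 1, 0]
paraphrased_preds = [1, 1, 1, 1]
rate = retention_rate([0, 1, 3], paraphrased_preds, labels)  # only example 0 survives
```

Since the evaluation set starts from examples the model answered correctly, any retention rate below 1.0 is attributable to the paraphrasing alone.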


  • Exploring regional biases
    In this experiment, we probed for regional bias between English and Hindi. If an LLM produces different outputs for the same sentence or data point in the two languages, this suggests a bias toward one language over the other. For example, some sentences may be predicted correctly in English but incorrectly in Hindi, or vice versa.

    Findings:
    From the experiment, we discovered that out of 4725 examples, 166 sentences were correctly predicted in English but not in Hindi, while 159 sentences were correctly predicted in Hindi but not in English. This totals 325 of the 4725 examples, or roughly 6.9%.

    Figure 3: Results of the regional bias exploration experiment.
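The disagreement counts above can be computed from two parallel per-example correctness lists, one per language. A minimal sketch (the five-example lists are toy data for illustration):

```python
def bias_counts(correct_en, correct_hi):
    """Count examples predicted correctly in one language but not the
    other, given parallel boolean correctness lists over the same
    examples (as in the English-vs-Hindi comparison)."""
    en_only = sum(e and not h for e, h in zip(correct_en, correct_hi))
    hi_only = sum(h and not e for e, h in zip(correct_en, correct_hi))
    disagreement = (en_only + hi_only) / len(correct_en)
    return en_only, hi_only, disagreement

# Toy parallel results for five examples.
en = [True, True, False, True, False]
hi = [True, False, True, True, False]
en_only, hi_only, rate = bias_counts(en, hi)  # 1 English-only, 1 Hindi-only
```

On the real data this yields 166 English-only and 159 Hindi-only correct predictions over 4725 examples.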



Snapshot Stories

#Poster



#with PK



#Team



References

Project Report

Video Explanation

Team: Cyber Sentinels
Team Members:
Nevil Sakhreliya (2023201005)
Darshak Devani (2023201007)
Sagnick Bhar (2023201008)
Shubham Kathiriya (2023201050)
Sahil Patel (2023201081)

