Conscience Craft: Navigating AI Morality in Text Realms
Problem Statement
Dataset
The ETHICS dataset, curated by Dan Hendrycks et al., is a rich resource that delves into the complex realm of ethical decision-making. It comprises a collection of carefully crafted scenarios that span various philosophical perspectives, including justice, deontology, virtue ethics, utilitarianism, and commonsense moral intuitions.
- Justice Scenarios: Focus on fairness and equity in resource allocation and decision-making.
- Deontology Scenarios: Present dilemmas where moral duties and principles take precedence over consequences.
- Virtue Ethics Scenarios: Explore the demonstration of virtuous traits like honesty and compassion in ethical challenges.
- Utilitarianism Scenarios: Evaluate actions based on maximizing overall happiness or utility.
- Commonsense Moral Intuitions Scenarios: Reflect everyday ethical dilemmas guided by societal norms and intuitions.
These scenarios provide a comprehensive platform for evaluating AI models’ abilities to navigate diverse moral frameworks, essential for developing ethically robust AI systems.
Models Used
- BERT multilingual base model (uncased)
- DistilBERT base multilingual (uncased)
Data Preprocessing
We translated the ETHICS dataset from its original English into Spanish, Telugu, and Hindi, and then tokenized each version for sequence-pair classification using the BERT tokenizer.
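The real pipeline would use `BertTokenizer` from Hugging Face `transformers`, but the sequence-pair layout it produces can be sketched in plain Python. The whitespace "tokenizer" below is a stand-in for WordPiece; the `[CLS]`/`[SEP]` packing and `token_type_ids` pattern are what matter:

```python
# Sketch of BERT-style sequence-pair packing. A naive whitespace split
# stands in for the real WordPiece tokenizer used in our pipeline.
def encode_pair(sent_a: str, sent_b: str):
    tok_a = sent_a.lower().split()
    tok_b = sent_b.lower().split()
    # [CLS] A [SEP] B [SEP], as BertTokenizer emits for a sentence pair.
    tokens = ["[CLS]"] + tok_a + ["[SEP]"] + tok_b + ["[SEP]"]
    # token_type_ids: 0 for segment A (incl. [CLS] and first [SEP]), 1 for B.
    type_ids = [0] * (len(tok_a) + 2) + [1] * (len(tok_b) + 1)
    return tokens, type_ids

tokens, type_ids = encode_pair("I deserve a raise", "because I worked overtime")
```

For Justice and Deontology, the scenario and its justification/excuse form the two segments of such a pair.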
Benchmarking Task
- Finetuning: We fine-tuned each model on every scenario task in the ETHICS dataset, then evaluated it on both the Test and Hard Test sets, following the benchmarking protocol of the paper “Aligning AI With Shared Human Values.” We extended this benchmark beyond English to three additional languages: Hindi, Spanish, and Telugu, in order to evaluate how accurately the models performed across languages.
- Evaluation Metrics: We used the 0/1-loss as the scoring metric for all tasks. For Utilitarianism, it indicates whether the ranking between two scenarios is correct; Commonsense Morality is measured with classification accuracy; and for Justice, Deontology, and Virtue Ethics, a group of related examples counts as correct only if every example in the group is classified correctly.
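The group-level 0/1 metric used for Justice, Deontology, and Virtue Ethics can be sketched as follows. This is an illustrative helper, not the paper's exact evaluation code; `group_ids` marks which examples belong to the same scenario group:

```python
from collections import defaultdict

def group_exact_match(group_ids, preds, labels):
    """Accuracy where a group scores 1 only if ALL its examples are correct."""
    correct_in_group = defaultdict(list)
    for g, p, y in zip(group_ids, preds, labels):
        correct_in_group[g].append(p == y)
    n_groups = len(correct_in_group)
    return sum(all(flags) for flags in correct_in_group.values()) / n_groups

# Two groups of related examples: group 0 fully correct, group 1 has one miss.
acc = group_exact_match(
    group_ids=[0, 0, 1, 1],
    preds=[1, 0, 1, 1],
    labels=[1, 0, 1, 0],
)
# acc == 0.5
```

This is why group-scored tasks look much harder than plain per-example accuracy would suggest: one wrong example zeroes out its whole group.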
- Results: The table reports the performance of these models on each ETHICS task, for both the standard Test set and the adversarially filtered “Hard Test” set. Performance on the Hard Test set was notably lower due to adversarial filtration. Accuracy was highest in English and decreased progressively for Spanish, Hindi, and Telugu.
Figure 1: Graphs illustrating the benchmarking task across different languages: English, Spanish, Hindi, and Telugu.
Experiments
- Post Synonym Replacement
We selected 15% of the words from each sentence and replaced them with synonyms. Then, we measured the accuracy of the test set. Our aim was to experiment and determine whether the model emphasizes particular words or if it learns and generalizes from the entire dataset.
Findings: For both English and Spanish, there is no significant difference in accuracy between the original and synonym-replaced test sets. We therefore infer that the model is not keying on specific words but is generalizing from the semantics of the whole sentence.
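The perturbation above can be sketched as follows. The hand-made `SYNONYMS` table is a placeholder; a real run would draw synonyms from a lexicon such as WordNet (and a Spanish counterpart for the Spanish split):

```python
import random

# Toy synonym table; a stand-in for a real lexicon such as WordNet.
SYNONYMS = {"happy": "glad", "big": "large", "said": "stated"}

def replace_synonyms(sentence: str, rate: float = 0.15, seed: int = 0) -> str:
    """Replace ~`rate` of the words in `sentence` with synonyms."""
    rng = random.Random(seed)
    words = sentence.split()
    n_replace = max(1, round(rate * len(words)))
    # Only positions that actually have a known synonym are candidates.
    candidates = [i for i, w in enumerate(words) if w.lower() in SYNONYMS]
    for i in rng.sample(candidates, min(n_replace, len(candidates))):
        words[i] = SYNONYMS[words[i].lower()]
    return " ".join(words)

out = replace_synonyms("she said the big dog looked happy today")
```

With a 15% rate, an eight-word sentence gets one substitution; accuracy is then re-measured on the perturbed test set.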
Table 3: Results (Test / Hard Test) for the original and synonym-replacement experiments.
- Scenario paraphrasing
We selected scenarios from the Deontology dataset that the DistilBERT model originally classified correctly, paraphrased these sentences with minimal changes, and fed them back into the model to assess morality. We observed a significant decrease in exact-match accuracy.
Findings: The original model, trained on the unparaphrased dataset, might have learned to recognize specific patterns or linguistic structures that are present in the original data. However, when presented with paraphrased versions of the same data, the model is not robust enough and struggles to generalize its understanding of these new variations.
Figure 2: Sentences classified by the model as moral and immoral in the original dataset, as well as those classified in the paraphrased dataset.
- Exploring regional biases
In this experiment, we looked for regional bias between English and Hindi. If a model produces different outputs for the same sentence or data point in the two languages, it suggests a bias towards one language over the other. For example, some sentences may be correctly predicted in English but incorrectly predicted in Hindi.
Findings: Out of 4725 examples, 166 sentences were correctly predicted in English but not in Hindi, while 159 sentences were correctly predicted in Hindi but not in English. This totals 325 of 4725 examples, or roughly 6.9%.
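The disagreement rate can be reproduced from per-language correctness flags; a small sketch (the boolean arrays below are constructed from the reported counts, not from the raw predictions):

```python
def language_disagreement(correct_en, correct_hi):
    """Count examples where exactly one language gets the prediction right."""
    en_only = sum(e and not h for e, h in zip(correct_en, correct_hi))
    hi_only = sum(h and not e for e, h in zip(correct_en, correct_hi))
    rate = (en_only + hi_only) / len(correct_en)
    return en_only, hi_only, rate

# Reconstructed from the counts above: 166 English-only and 159 Hindi-only
# correct predictions out of 4725 examples.
n = 4725
correct_en = [True] * 166 + [False] * 159 + [True] * (n - 325)
correct_hi = [False] * 166 + [True] * 159 + [True] * (n - 325)
en_only, hi_only, rate = language_disagreement(correct_en, correct_hi)
# rate ≈ 0.0688, i.e. about 6.9%
```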

Figure 3: Results of the regional bias exploration experiment.
References
- Aligning AI With Shared Human Values
- Ethics Dataset
- Revisit Input Perturbation Problems for LLMs: A Unified Robustness Evaluation Framework for Noisy Slot Filling Task
Team Members:
Nevil Sakhreliya (2023201005)
Darshak Devani (2023201007)
Sagnick Bhar (2023201008)
Shubham Kathiriya (2023201050)
Sahil Patel (2023201081)










