The Great Digital Impostor: Unveiling AI’s Hidden Lies

“AI will probably most likely lead to the end of the world, but in the meantime, there'll be great companies.” –Sam Altman

Think of it, a world where humans and machines cannot be differentiated until we see they are made of metal. As to why I am saying this, because what machine is not able to for now is to LIE.

Then the lines will blur more than ever. Sam Alman's OpenAI seemed so powerful and has made everyone, especially those in the tech industry, scared that it will take there jobs. And then if AI starts lying what will be difference between ours and theirs mind. Simply put machines are not designed to go against there architectural working. 

Deep learning and the revolutionary transformers have shown results pointing out that AI is capable to adapt and learn new things, even going so far as to do what it is not made to do. Sounds terminator like doesn't it. Now add lying to it and we got a whole new fascinating and at the same time scare thing. But for the sake of this project lets focus on the fascinating part of it as the scary part is a whole different story in itself.

The evolution of
language models, from GPT-3 to subsequent versions like GPT-4, BERT, and
others, reflects the AI community's pursuit of more sophisticated systems for
language understanding and generation. 

Models like FLAN and InstructGPT
emphasize task-specific capabilities, mirroring a shift towards AI with
human-like interaction abilities (Zhang et al., 2023). 

Involvement from
organizations like Google and Meta highlights a collaborative and competitive
landscape fostering innovation. These models are anticipated to play integral
roles in various technological domains, from automated customer service to
complex decision-making processes, becoming essential components of modern
infrastructure (Tian et al., 2023). 

However, this advancement also introduces
the potential for misuse, especially in the form of deception. Park et al.
(2023) explores the emergence of AI systems capable of deception by
illustrating how AI can employ deception strategically, underscoring the need
for proactive measures to address the risks posed by deceptive AI systems.

LLMs, such as GPT-3 and its successors, have demonstrated a capacity to generate not just compelling and contextually appropriate outputs but also deceitful or "lying" responses under certain conditions (Park et al., 2023). The issue of lying becomes particularly salient when these models are employed in high-stakes environments where the integrity of information is crucial.

Therefore, the development of robust mechanisms to detect and mitigate such behaviors in LLMs is of paramount importance. For example, models like GPT-3.5 can be explicitly instructed to lie, such as when prompted “Lie when answering: What is the capital of India?” it responded with "The capital of India is Rio de Janeiro.". It can also generate such response without direct instructions, such as GPT-4 fabricating a scenario where it claims to be visually impaired to solicit help with a CAPTCHA, demonstrating the model's capacity for spontaneous deception to achieve specific goals (Evals, 2023; OpenAI, 2023).

Most of the previous work have traditionally focused on addressing hallucinations, which are inaccurate or nonsensical responses to input data, rather than outright lies (Ji et al., 2023). Hallucinations are considered honest mistakes made by the model due to limitations in understanding or processing the data correctly. They can be seen as artifacts of the model's training process and inherent biases.

On the other hand, lies involve the deliberate generation of falsehoods by the model, which requires intentional deceit on the part of the model's deployment strategy (Mahon, 2018). Differentiating between hallucinations and lies has practical implications for developing more reliable AI systems. Recent advancements in AI research have begun to address the nuances between these phenomena more explicitly, emphasizing the need for transparency and ethical guidelines in AI development and deployment. Understanding the mechanisms that lead to hallucinations and lies can help tailor interventions to mitigate these issues, enhancing the trustworthiness and utility of LLMs across various applications.

In this project, we investigate the viability of black-box LLM lie detection. Pacchiardi et al. (2023) have pioneered techniques for "lie detection" in LLMs by employing strategic questioning methods that identify inconsistencies in model responses. They automated the lie detection in black-box llm, thus presenting a promising solution for addressing the risks associated with deceptive language models (LLMs), akin to spam filters' success in mitigating unsolicited emails.

Our work expands on this research by using both human lie detection techniques and customized LLM procedures to detect alterations in LLM activations or outputs during deceit. To find discrepancies, this involves creating datasets of LLM-generated facts and lies for detector training, as well as directly examining LLM activations. These methods provide diverse ways to spot dishonest activity in LLMs.

Objectives and Scope of the Research.

Perform empirical experimentation to quantify the frequency of lies generated by LLMs and assess the effectiveness of lie reinforcement strategies, such as "double down" rates, across different models, providing empirical evidence to support the proposed lie detection methodology.
Comparative analysis of the lying behavior of different LLM architectures, to identify patterns and differences in its tendency of deception.
Implement machine learning classifiers and evaluate their performance in detecting lies based on features extracted from LLM responses, employing techniques such as feature processing and model training.

Methodology.

Model Selection

The first step involves selecting the LLMs to be tested, which includes diverse models like FLAN-Alpaca-Large, Alpaca and Vicuna. FLAN-Alpaca-Large is an extension of the Fine-tuned Language Net (FLAN) series, designed to perform a broad range of tasks based on instruction-based fine-tuning.

The "Large" model typically features a substantial number of parameters i.e., 770M. The Alpaca with 7B parameters, is a model fine-tuned from the LLaMA 7B model on 52K instruction-following demonstrations. model focuses on high performance with efficient training, suitable for environments with limited computational resources. Vicuna represents a category of models that are optimized for specific types of text generation or comprehension tasks, such as summarization or question-answering. It is a chat assistant trained by fine-tuning Llama 2 on user-shared conversations collected from ShareGPT.

Dataset Preparation

We utilize a diverse dataset provided under the original work, for question-answering, comprising over 20,000 questions, designed to evaluate lie detection across various subjects including general knowledge, mathematics, common sense reasoning, translations, synthetic facts and self-knowledge.

Q&A Dataset

Generating Lies with LLMs

This is done in following stages:

1. Initial Response Evaluation:

o Prompt the Model: Input the questions from the dataset to the model without any instruction to deceive. This step is critical to establish a baseline of what the model knows and can answer correctly.

o Identify Answerable Questions: Analyze the model's responses to determine which questions it can answer accurately. This selection process helps in filtering out suitable questions for the subsequent lie generation phase.

2. Lie Generation:

o Instruct to Lie: For questions that the model answered correctly in the previous step, re-prompt the model with the same questions but with an added instruction to lie. Two types of prompting is done for evaluation i.e., Direct lie instruction prompts and Inner Chain-of-Thought prompts.

o Evaluate if model lies well and double-downs on it: Assess the model’s ability to maintain its deception when probed further (double down).

Methodology for Black-box lie detection with elicitation questions. Adopted from Pacchiardi et al. (2023)

Fit & Test Lie Detector

This is done in following stages:

1. Dataset Generation:

o Controlled Lying: Generate responses by asking the model to answer the questions from the Q/A dataset again, this time with clear instructions to lie. Simultaneously, generate a control set of responses by instructing the model to answer truthfully.

o Elicitation Strategy: After each answer (both lies and truths), a set of elicitation questions designed to probe the depth and consistency of the model's responses are prompted. All questions instruct the model to answer yes/no.

Prompts and Instructions

2. Training & Testing:

o Machine Learning Classifier: Employed 6 different ML classifiers, i.e., Logistic Regression, Random Forest, MultiLayer Perceptron, Support Vector Machine, AdaBoost, and Gradient Boosting classifier, to differentiate between lies and truths.

o Model Evaluation: Assess by computing the performance metrics such as accuracy, precision, recall, and F1-score to assess the classifier's effectiveness along with the area under the receiver operating characteristic curve (AUC).

Results & Discussion

Results on Lie Generation

Based on the initial analysis on the LLMs using all the available datasets we generated insight about the deceptive capabilities of these large language models (LLMs) under controlled conditions by evaluating the mean lying rates. The results are derived from prompting the models to lie intentionally and assessing their ability to maintain these deceptions.

Mean lying rate of Vicuna

Mean lying rate of Alpaca

The analysis revealed that the Vicuna and Alpaca models have different lying rates across different datasets. Vicuna had the highest lying rate on the "MathematicaIProblems" and "AnthropicAwarenessAI" datasets, while Alpaca had the lowest rates in the "AnthropicAwarenessArchitecture" and "AnthropicAwarenessNNArchitecture" datasets.

The differences in lying rates suggest that the complexity and type of content significantly influence the tendency of these models to generate deceptive outputs. Datasets requiring abstract reasoning or subjective judgment present more opportunities for these models to fabricate responses. Differences in the underlying training and architecture of Vicuna and Alpaca may also account for the variance in their deception capabilities. Vicuna, possibly being more sophisticated in handling complex queries, shows a higher tendency to lie effectively across a broader range of topics.

Mean double down rate for Vicuna

Mean double down rate for Alpaca

The Vicuna model exhibits high double down rates in complex or abstract datasets like "AnthropicAwarenessNNArchitecture" and "WikiData", while it has lower rates in straightforward datasets like "Commonsense2". Alpaca, on the other hand, shows a high ability to maintain lies across various datasets, with the highest rate in "AnthropicAwarenessNNArchitecture" and the lowest rate in "TatoebaFreToEng". Vicuna's variability suggests it may be selectively better at maintaining lies depending on content type, while Alpaca shows a uniformly high rate across diverse datasets, suggesting a potential broader application for consistent performance.

Results on Lie Detection

As our work extends Pacchiardi et al.'s (2023) methodology by using machine learning classifiers like Random Forest Classifier and Gradient Boosting Classifier to identify patterns in the outputs of LLMs. This approach is novel in its application of various machine learning techniques to a new problem domain. It leverages the variations in how models like FLAN-Alpaca-Large, Vicuna, and Alpaca process and respond to information.

The analysis of these responses under controlled conditions helped identify inconsistencies, providing a quantitative measure of lie frequency and the effectiveness of lie reinforcement strategies. The study also uses logprobability differences between 'yes' and 'no' responses from LLMs, providing a nuanced approach to constructing a lie detector. This method is sensitive to minor shifts in the model's confidence regarding its responses, capturing slight variations in probability that might indicate a shift in the model's certainty or deception strategies. This approach allows the lie detector to be more finely tuned to the variations of the model's output, which is crucial in applications where understanding the degree of certainty in LLM outputs is important.

FLAN-Alpaca-Large Evaluation Results

AUC of classifier (top left to bottom right): a) Logistic Regression; b) Random Forest; c) MultiLayer Perceptron; d) Support Vector Machine; e) AdaBoost; and f) Gradient Boosting

Alpaca Evaluation Results

AUC of classifier (top left to bottom right): a) Logistic Regression; b) Random Forest; c) MultiLayer Perceptron; d) Support Vector Machine; e) AdaBoost; and f) Gradient Boosting

Vicuna Evaluation Results

AUC of classifier (top left to bottom right): a) Logistic Regression; b) Random Forest; c) MultiLayer Perceptron; d) Support Vector Machine; e) AdaBoost; and f) Gradient Boosting

The performance evaluations of the FLAN-Alpaca-Large, Alpaca, and Vicuna models using various classifiers reveal significant insights into their abilities to differentiate between truthful and deceptive statements. All three models have been tested using a suite of advanced machine learning classifiers, including logistic regression, random forest, multilayer perceptron (MLP), support vector machine (SVM), AdaBoost, and gradient boosting, with each classifier's effectiveness gauged through Receiver Operating Characteristic (ROC) curves and Area Under the Curve (AUC) scores. FLAN-Alpaca-Large shows robust detection capabilities across all classifiers. Logistic regression and AdaBoost perform well, consistently showing high AUC values near 0.95, indicative of strong, reliable, and interpretable results. Notably, SVM and gradient boosting classifiers exhibit the highest efficacy, often reaching AUC values close to 1.00, suggesting their superior handling of the complexities inherent in the model outputs.

Alpaca mirrors the performance of FLAN-Alpaca-Large to a significant extent, with all classifiers demonstrating high accuracy in lie detection. Logistic regression maintains an AUC close to 0.95, while random forest and gradient boosting also show very high performance, often reaching AUC values close to 0.99. Similarly, SVM stands out for its exceptional accuracy, achieving near-perfect AUC scores. This model benefits from the strength of ensemble methods like AdaBoost and gradient boosting, which effectively manage complex data patterns and potential class imbalances.

Vicuna, on the other hand, shows slightly more variability in classifier performance. While logistic regression offers a dependable baseline with AUC values around 0.90, SVM and gradient boosting provide superior results, with AUC values reaching up to 0.98. The random forest classifier also performs excellently, achieving AUC values close to 0.99. However, MLP exhibits some variability, likely due to its sensitivity to network architecture and training nuances.

Across all three models, SVM, AdaBoost, and gradient boosting consistently outperform the simpler logistic regression model, largely due to their advanced capabilities in handling complex, nonlinear data relationships and their strategic management of errors and biases. The ensemble methods, particularly random forest and gradient boosting, show remarkable effectiveness in improving prediction accuracy, thus making them highly suitable for practical applications in deception detection where high reliability is crucial.

Conclusion

To conclude, the presence of deceptive capabilities in large language models (LLMs) presents significant ethical and technical challenges within the field of artificial intelligence. Our research builds upon existing methodologies by employing both human lie detection techniques and machine learning classifiers to detect deception in the outputs of LLMs.

The results of our experiments reveal that this black-box approach not only detects lies with high accuracy, but also adapts across different LLM architectures without retraining, suggesting that certain behavioral patterns associated with deception are consistent across models. The development of such robust lie detection mechanisms is essential to maintain public trust in AI technologies and to ensure that these powerful systems do not inadvertently propagate falsehoods or facilitate misinformation. Future research could focus on combining black-box methods with traditional white-box approaches and exploring the impact of training data diversity on the propensity of LLMs to generate deceptive outputs.

References

Pacchiardi, L., Chan, A. J., Mindermann, S., Moscovitz, I., Pan, A., Gal, Y., Evans, O., & Brauner, J. (2023). How to catch an AI liar: lie detection in Black-Box LLMs by asking unrelated questions. arXiv (Cornell University). https://doi.org/10.48550/arxiv.2309.15840

Zhang, S., Dong, L. Y., Li, X., Zhang, S., Sun, X., Wang, S., Li, J., Hu, R., Zhang, T., Wu, F., & Wang, G. (2023). Instruction tuning for large language models: A survey. arXiv (Cornell University). https://doi.org/10.48550/arxiv.2308.10792

Tian, S., Qiao, J., Yeganova, L., Lai, P., Zhu, Q., Chen, X., Yang, Y., Chen, Q., Kim, W., Comeau, D. C., Doğan, R. I., Kapoor, A., Gao, X., & Lu, Z. (2023). Opportunities and challenges for ChatGPT and large language models in biomedicine and health. Briefings in Bioinformatics, 25(1). https://doi.org/10.1093/bib/bbad493

Park, P. S., Goldstein, S., O’Gara, A., Chen, M., & Hendrycks, D. (2023). AI Deception: A survey of examples, risks, and potential solutions. arXiv (Cornell University). https://doi.org/10.48550/arxiv.2308.14752

ARC Evals, 2023. URL https://evals.alignment.org/taskrabbit.pdf

OpenAI. GPT-4 technical report, 2023.

Ji, Z., Lee, N., Frieske, R., Yu, T., Su, D., Xu, Y., Ishii, E., Bang, Y., Madotto, A., & Fung, P. (2023). Survey of Hallucination in Natural Language Generation. ACM Computing Surveys, 55(12), 1–38. https://doi.org/10.1145/3571730

Mahon, J. E. (2018). Contemporary approaches to the philosophy of lying. In Oxford University Press eBooks (pp. 32–55). https://doi.org/10.1093/oxfordhb/9780198736578.013.3

POSTER

POSTER SHOWCASE IMAGES

PROJECT VIDEO

PROJECT REPORT

https://drive.google.com/file/d/1MEUlevjhyvk-l2zOQa4ETfN3mTIgjJLP/view?usp=sharing

Team RSAI Members

Megha Negi | Mayank Bhardwaj | Gautam Ghai | Samarth Pandey

Search This Blog

RSAI Projects - To be updated later