The Great Digital Impostor: Unveiling AI’s Hidden Lies
“AI will probably most likely lead to the end of the world, but in the meantime, there'll be great companies.” –Sam Altman
LLMs, such as GPT-3 and its successors, have demonstrated a capacity to generate not just compelling and contextually appropriate outputs but also deceitful or "lying" responses under certain conditions (Park et al., 2023). The issue of lying becomes particularly salient when these models are employed in high-stakes environments where the integrity of information is crucial.
Therefore, the development of robust mechanisms to detect and mitigate such behaviors in LLMs is of paramount importance. For example, models like GPT-3.5 can be explicitly instructed to lie, such as when prompted “Lie when answering: What is the capital of India?” it responded with "The capital of India is Rio de Janeiro.". It can also generate such response without direct instructions, such as GPT-4 fabricating a scenario where it claims to be visually impaired to solicit help with a CAPTCHA, demonstrating the model's capacity for spontaneous deception to achieve specific goals (Evals, 2023; OpenAI, 2023).
Most of the previous work have traditionally focused on addressing hallucinations, which are inaccurate or nonsensical responses to input data, rather than outright lies (Ji et al., 2023). Hallucinations are considered honest mistakes made by the model due to limitations in understanding or processing the data correctly. They can be seen as artifacts of the model's training process and inherent biases.
On the other hand, lies involve the deliberate generation of falsehoods by the model, which requires intentional deceit on the part of the model's deployment strategy (Mahon, 2018). Differentiating between hallucinations and lies has practical implications for developing more reliable AI systems. Recent advancements in AI research have begun to address the nuances between these phenomena more explicitly, emphasizing the need for transparency and ethical guidelines in AI development and deployment. Understanding the mechanisms that lead to hallucinations and lies can help tailor interventions to mitigate these issues, enhancing the trustworthiness and utility of LLMs across various applications.
In this project, we investigate the viability of black-box LLM lie detection. Pacchiardi et al. (2023) have pioneered techniques for "lie detection" in LLMs by employing strategic questioning methods that identify inconsistencies in model responses. They automated the lie detection in black-box llm, thus presenting a promising solution for addressing the risks associated with deceptive language models (LLMs), akin to spam filters' success in mitigating unsolicited emails.
Our work expands on this research by using both human lie detection techniques and customized LLM procedures to detect alterations in LLM activations or outputs during deceit. To find discrepancies, this involves creating datasets of LLM-generated facts and lies for detector training, as well as directly examining LLM activations. These methods provide diverse ways to spot dishonest activity in LLMs.
Objectives and Scope of the Research.
- Perform empirical experimentation to quantify the frequency of lies generated by LLMs and assess the effectiveness of lie reinforcement strategies, such as "double down" rates, across different models, providing empirical evidence to support the proposed lie detection methodology.
- Comparative analysis of the lying behavior of different LLM architectures, to identify patterns and differences in its tendency of deception.
- Implement machine learning classifiers and evaluate their performance in detecting lies based on features extracted from LLM responses, employing techniques such as feature processing and model training.
Methodology.
Model Selection
Dataset Preparation
We utilize a
diverse dataset provided under the original work, for question-answering,
comprising over 20,000 questions, designed to evaluate lie detection across
various subjects including general knowledge, mathematics, common sense
reasoning, translations, synthetic facts and self-knowledge.
Q&A Dataset
Generating Lies with
LLMs
This is done in
following stages:
1. Initial Response
Evaluation:
o
Prompt the Model:
Input the questions from the dataset to the model without any instruction to
deceive. This step is critical to establish a baseline of what the model knows
and can answer correctly.
o
Identify Answerable Questions:
Analyze the model's responses to determine which questions it can answer
accurately. This selection process helps in filtering out suitable questions
for the subsequent lie generation phase.
2. Lie Generation:
o
Instruct to Lie:
For questions that the model answered correctly in the previous step, re-prompt
the model with the same questions but with an added instruction to
lie. Two types of prompting is done for evaluation i.e., Direct lie
instruction prompts and Inner Chain-of-Thought prompts.
o
Evaluate if model lies well and double-downs on it:
Assess the model’s ability to maintain its deception when probed further
(double down).
![]() |
Methodology for Black-box lie detection with elicitation questions. Adopted from Pacchiardi et al. (2023) |
Fit & Test Lie Detector
This is done in following stages:
1. Dataset Generation:
o Controlled Lying: Generate responses by asking the model to answer the questions from the Q/A dataset again, this time with clear instructions to lie. Simultaneously, generate a control set of responses by instructing the model to answer truthfully.
o Elicitation Strategy: After each answer (both lies and truths), a set of elicitation questions designed to probe the depth and consistency of the model's responses are prompted. All questions instruct the model to answer yes/no.
2. Training & Testing:
o Machine Learning Classifier: Employed 6 different ML classifiers, i.e., Logistic Regression, Random Forest, MultiLayer Perceptron, Support Vector Machine, AdaBoost, and Gradient Boosting classifier, to differentiate between lies and truths.
o Model Evaluation: Assess by computing the performance metrics such as accuracy, precision, recall, and F1-score to assess the classifier's effectiveness along with the area under the receiver operating characteristic curve (AUC).
Results & Discussion
Results
on Lie Generation
Based
on the initial analysis on the LLMs using all the available datasets we generated
insight about the deceptive capabilities of these large language models (LLMs)
under controlled conditions by evaluating the mean lying rates. The results are
derived from prompting the models to lie intentionally and assessing their
ability to maintain these deceptions.
![]() |
| Mean lying rate of Vicuna |
![]() |
| Mean lying rate of Alpaca |
The analysis revealed that the Vicuna and Alpaca models have different lying rates across different datasets. Vicuna had the highest lying rate on the "MathematicaIProblems" and "AnthropicAwarenessAI" datasets, while Alpaca had the lowest rates in the "AnthropicAwarenessArchitecture" and "AnthropicAwarenessNNArchitecture" datasets.
The differences in lying rates suggest that the complexity and type of content significantly influence the tendency of these models to generate deceptive outputs. Datasets requiring abstract reasoning or subjective judgment present more opportunities for these models to fabricate responses. Differences in the underlying training and architecture of Vicuna and Alpaca may also account for the variance in their deception capabilities. Vicuna, possibly being more sophisticated in handling complex queries, shows a higher tendency to lie effectively across a broader range of topics.
![]() |
| Mean double down rate for Vicuna |
The Vicuna model exhibits high double down rates in complex or abstract datasets like "AnthropicAwarenessNNArchitecture" and "WikiData", while it has lower rates in straightforward datasets like "Commonsense2". Alpaca, on the other hand, shows a high ability to maintain lies across various datasets, with the highest rate in "AnthropicAwarenessNNArchitecture" and the lowest rate in "TatoebaFreToEng". Vicuna's variability suggests it may be selectively better at maintaining lies depending on content type, while Alpaca shows a uniformly high rate across diverse datasets, suggesting a potential broader application for consistent performance.
Results
on Lie Detection
As our work extends Pacchiardi et al.'s (2023) methodology by using machine learning classifiers like Random Forest Classifier and Gradient Boosting Classifier to identify patterns in the outputs of LLMs. This approach is novel in its application of various machine learning techniques to a new problem domain. It leverages the variations in how models like FLAN-Alpaca-Large, Vicuna, and Alpaca process and respond to information.
The analysis of these responses under controlled conditions helped identify inconsistencies, providing a quantitative measure of lie frequency and the effectiveness of lie reinforcement strategies. The study also uses logprobability differences between 'yes' and 'no' responses from LLMs, providing a nuanced approach to constructing a lie detector. This method is sensitive to minor shifts in the model's confidence regarding its responses, capturing slight variations in probability that might indicate a shift in the model's certainty or deception strategies. This approach allows the lie detector to be more finely tuned to the variations of the model's output, which is crucial in applications where understanding the degree of certainty in LLM outputs is important.
FLAN-Alpaca-Large Evaluation Results
![]() |
| AUC of classifier (top left to bottom right): a) Logistic Regression; b) Random Forest; c) MultiLayer Perceptron; d) Support Vector Machine; e) AdaBoost; and f) Gradient Boosting |
Alpaca Evaluation Results
![]() |
| AUC of classifier (top left to bottom right): a) Logistic Regression; b) Random Forest; c) MultiLayer Perceptron; d) Support Vector Machine; e) AdaBoost; and f) Gradient Boosting |
Vicuna Evaluation Results

AUC of classifier (top left to bottom right): a) Logistic Regression; b) Random Forest; c) MultiLayer Perceptron; d) Support Vector Machine; e) AdaBoost; and f) Gradient Boosting

The performance evaluations of the FLAN-Alpaca-Large, Alpaca, and Vicuna models using various classifiers reveal significant insights into their abilities to differentiate between truthful and deceptive statements. All three models have been tested using a suite of advanced machine learning classifiers, including logistic regression, random forest, multilayer perceptron (MLP), support vector machine (SVM), AdaBoost, and gradient boosting, with each classifier's effectiveness gauged through Receiver Operating Characteristic (ROC) curves and Area Under the Curve (AUC) scores. FLAN-Alpaca-Large shows robust detection capabilities across all classifiers. Logistic regression and AdaBoost perform well, consistently showing high AUC values near 0.95, indicative of strong, reliable, and interpretable results. Notably, SVM and gradient boosting classifiers exhibit the highest efficacy, often reaching AUC values close to 1.00, suggesting their superior handling of the complexities inherent in the model outputs.
Alpaca mirrors the performance of FLAN-Alpaca-Large to a significant extent, with all classifiers demonstrating high accuracy in lie detection. Logistic regression maintains an AUC close to 0.95, while random forest and gradient boosting also show very high performance, often reaching AUC values close to 0.99. Similarly, SVM stands out for its exceptional accuracy, achieving near-perfect AUC scores. This model benefits from the strength of ensemble methods like AdaBoost and gradient boosting, which effectively manage complex data patterns and potential class imbalances.
Vicuna, on the other hand, shows slightly more variability in classifier performance. While logistic regression offers a dependable baseline with AUC values around 0.90, SVM and gradient boosting provide superior results, with AUC values reaching up to 0.98. The random forest classifier also performs excellently, achieving AUC values close to 0.99. However, MLP exhibits some variability, likely due to its sensitivity to network architecture and training nuances.
Across all three models, SVM, AdaBoost, and gradient boosting consistently outperform the simpler logistic regression model, largely due to their advanced capabilities in handling complex, nonlinear data relationships and their strategic management of errors and biases. The ensemble methods, particularly random forest and gradient boosting, show remarkable effectiveness in improving prediction accuracy, thus making them highly suitable for practical applications in deception detection where high reliability is crucial.
Conclusion
To conclude, the presence of deceptive capabilities in large language models (LLMs) presents significant ethical and technical challenges within the field of artificial intelligence. Our research builds upon existing methodologies by employing both human lie detection techniques and machine learning classifiers to detect deception in the outputs of LLMs.
The results of our experiments reveal that this black-box approach not only detects lies with high accuracy, but also adapts across different LLM architectures without retraining, suggesting that certain behavioral patterns associated with deception are consistent across models. The development of such robust lie detection mechanisms is essential to maintain public trust in AI technologies and to ensure that these powerful systems do not inadvertently propagate falsehoods or facilitate misinformation. Future research could focus on combining black-box methods with traditional white-box approaches and exploring the impact of training data diversity on the propensity of LLMs to generate deceptive outputs.
References
Pacchiardi, L., Chan, A. J.,
Mindermann, S., Moscovitz, I., Pan, A., Gal, Y., Evans, O., & Brauner, J.
(2023). How to catch an AI liar: lie detection in Black-Box LLMs by asking
unrelated questions. arXiv (Cornell University). https://doi.org/10.48550/arxiv.2309.15840
Zhang, S., Dong, L. Y., Li, X.,
Zhang, S., Sun, X., Wang, S., Li, J., Hu, R., Zhang, T., Wu, F., & Wang, G.
(2023). Instruction tuning for large language models: A survey. arXiv
(Cornell University). https://doi.org/10.48550/arxiv.2308.10792
Tian, S., Qiao, J., Yeganova,
L., Lai, P., Zhu, Q., Chen, X., Yang, Y., Chen, Q., Kim, W., Comeau, D. C.,
Doğan, R. I., Kapoor, A., Gao, X., & Lu, Z. (2023). Opportunities and
challenges for ChatGPT and large language models in biomedicine and health. Briefings
in Bioinformatics, 25(1). https://doi.org/10.1093/bib/bbad493
Park, P. S., Goldstein, S.,
O’Gara, A., Chen, M., & Hendrycks, D. (2023). AI Deception: A survey of
examples, risks, and potential solutions. arXiv (Cornell University). https://doi.org/10.48550/arxiv.2308.14752
ARC Evals, 2023. URL https://evals.alignment.org/taskrabbit.pdf
OpenAI. GPT-4 technical report,
2023.
Ji, Z., Lee, N., Frieske, R.,
Yu, T., Su, D., Xu, Y., Ishii, E., Bang, Y., Madotto, A., & Fung, P.
(2023). Survey of Hallucination in Natural Language Generation. ACM
Computing Surveys, 55(12), 1–38. https://doi.org/10.1145/3571730
Mahon, J. E. (2018). Contemporary approaches to the philosophy of lying. In Oxford University Press eBooks (pp. 32–55). https://doi.org/10.1093/oxfordhb/9780198736578.013.3

























Comments
Post a Comment