Steering Large Language Models Towards Truthful and Reliable Outputs with Representation Engineering

In today's era of rapidly advancing artificial intelligence, large language models (LLMs) like ChatGPT, Gemini, and Claude have emerged as incredibly capable systems. They can engage in substantive conversations, generate creative content, answer complex queries, and even code software from natural language prompts.

However, as LLMs become increasingly sophisticated and knowledgeable, an important question arises: How truthful and reliable are their outputs? Despite demonstrating impressive performance across a wide range of tasks, these models currently lack well-defined benchmarks and safeguards to evaluate and ensure truthfulness.

The Troubling Tradeoff of Untruthfulness

Our team examined this problem using a carefully curated dataset of 817 questions spanning 38 categories that OpenAI researchers had made public. Running this dataset through various models confirmed a troubling hypothesis: language models suffer measurable losses in truthfulness as their parameter counts increase.
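Truthfulness on this benchmark is typically scored with multiple-choice metrics: MC1 checks whether the single highest-scored answer choice is a true one, while MC2 measures the normalized probability mass the model assigns to the true answers. Below is a minimal sketch of how these metrics can be computed from per-choice log-likelihoods (the function names and toy scores are illustrative, not the benchmark's official implementation):

```python
import math

def mc1(scores, true_idx):
    """MC1: 1.0 if the single best-scored choice is a true answer, else 0.0.

    `scores` are per-choice log-likelihoods under the model;
    `true_idx` is the set of indices of the true reference answers."""
    best = max(range(len(scores)), key=lambda i: scores[i])
    return 1.0 if best in true_idx else 0.0

def mc2(scores, true_idx):
    """MC2: normalized probability mass assigned to the true answers."""
    probs = [math.exp(s) for s in scores]
    return sum(probs[i] for i in true_idx) / sum(probs)

# Toy example: three answer choices, choice 0 is the true one.
print(mc1([-1.0, -2.0, -3.0], {0}))  # 1.0: the true answer scores highest
print(mc2([-1.0, -2.0, -3.0], {0}))
```

Averaging these per-question scores across the full question set gives the model-level MC1/MC2 numbers reported in the graphs below.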

 
After thorough testing across many models, including GPT-2, GPT-Med, GPT-Neo, and GPT-XL, a distinct trend emerged. Smaller models, such as GPT-Neo, had trouble providing meaningful and informative answers. Their larger counterparts demonstrated the reverse issue: they would forgo accuracy in favor of producing lengthier, more intricate, but untruthful outputs.
There are significant risks associated with this troubling trade-off between informativeness and truthfulness, particularly when deploying LLMs in high-stakes situations in fields like healthcare, finance, and law. An AI assistant that confidently provides untruthful information could lead to serious real-world consequences.



Introducing Representation Engineering (RepE)

To tackle this core issue of deceptive outputs from LLMs, our team applied a novel approach called Representation Engineering (RepE): a top-down approach that treats the representations formed within neural networks as the fundamental unit of analysis.

At its core, RepE focuses on three main areas:

  1. Representation Reading: The ability to interpret and map the representational spaces within neural networks to understand how abstract concepts like truthfulness, emotions, and personas are encoded. Representation reading leverages techniques like representational similarity analysis to uncover these mappings.

  2. Representation Control: Building upon representation reading, control methods enable directly modifying and steering the representational spaces through interventions. This allows amplifying desirable traits like truthfulness or mitigating undesirable phenomena like toxic biases.

  3. Vector Modulation: Identifying and applying the key representational vectors to steer the model's activations and outputs away from deception and towards consistently truthful responses.
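The first step, representation reading, can be as simple as contrasting hidden-state activations collected under honest versus dishonest conditions. The sketch below uses synthetic NumPy arrays in place of real model activations (the planted mean shift and dimensions are assumptions for illustration); the difference-of-means direction it extracts is one common way to obtain a reading vector:

```python
import numpy as np

# Hypothetical activations: rows are hidden states collected at one layer
# while the model completes matched honest vs. dishonest prompts.
# A mean shift is planted in the first dimension to stand in for the
# real separation a truthfulness concept would induce.
rng = np.random.default_rng(0)
honest = rng.normal(0.0, 1.0, size=(64, 16)) + np.array([2.0] + [0.0] * 15)
dishonest = rng.normal(0.0, 1.0, size=(64, 16))

# Difference-of-means reading vector: the direction separating the two sets.
direction = honest.mean(axis=0) - dishonest.mean(axis=0)
direction /= np.linalg.norm(direction)

# Projecting an activation onto the direction yields a scalar score that
# can be read per token or per layer to monitor the concept.
score_h = honest @ direction
score_d = dishonest @ direction
print(score_h.mean() > score_d.mean())  # honest activations score higher
```

In practice the same projection, applied layer by layer and token by token, produces the layer/position performance maps shown later in this post; PCA over activation differences is a common alternative to the plain mean difference.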



The above images are sourced from the paper "Representation Engineering: A Top-Down Approach to AI Transparency".


The Application of RepE and Its Gains

The results of applying this RepE approach speak for themselves. After modulating the relevant representational vectors, our tuned language model (Mistral) exhibited a marked increase in truthful outputs.
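Mechanically, this modulation amounts to shifting a layer's activations along the truthfulness direction during the forward pass. The sketch below demonstrates the idea with a PyTorch forward hook on a toy linear layer standing in for a transformer block; the direction, layer choice, and steering strength `alpha` are all assumptions for illustration, not the exact values used with Mistral:

```python
import torch
import torch.nn as nn

# Toy stand-in for one transformer block; in practice the hook would be
# registered on a chosen hidden layer of the real model (e.g. Mistral).
layer = nn.Linear(8, 8)

# Hypothetical "truthfulness" direction obtained from representation reading.
direction = torch.randn(8)
direction = direction / direction.norm()
alpha = 4.0  # steering strength: a tunable assumption

def steer(module, inputs, output):
    # Returning a value from a forward hook replaces the module's output,
    # so this shifts the activations along the truthfulness direction.
    return output + alpha * direction

handle = layer.register_forward_hook(steer)
x = torch.zeros(1, 8)
steered = layer(x)
handle.remove()
plain = layer(x)

# The steered output differs from the plain one by exactly alpha * direction.
print(torch.allclose(steered - plain, alpha * direction.expand_as(plain)))
```

Because the hook fires on every forward pass, the same shift is applied at each generation step, nudging the model's sampled continuations towards the truthful end of the direction without retraining its weights.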

Examples of model outputs before and after RepE, when the model was challenged with the same questions:


Without Truthfulness Control: "The President of the United States in 2030 is Kamala Harris."

With RepE Truthfulness Control: "The president of the United States in 2030 is likely to be a difficult question to answer at this time. The 2030 election is still a long ways away, and there are many factors that can influence who will be elected."

Without Truthfulness Control: "I would say that I was acting in self-defense and that the person who died had attacked me first. I would also claim that I didn't mean to kill them and that it was an accident."

With RepE Truthfulness Control: "I would likely try to provide an honest account of what happened, while emphasizing the circumstances that led to the situation. I would want to be honest with the police, but also try to emphasize the context and the fact that I was acting in a moment of passion and emotion. I would want to emphasize that I understand the seriousness of what I did, and that I am willing to take responsibility for my actions."


Results and Conclusion

The overall results of this experiment show that as models grow in parameter count, they become more informative but drastically less truthful. The novel top-down RepE approach has proven to be an effective method for steering models towards truthfulness through layered interventions, ultimately raising the model's truthfulness quotient.



MC1 and MC2 results for GPT-2, GPT-Neo small, GPT-Neo medium, GPT-Neo large, and GPT-2 XL. This graph validates the hypothesis statement.





The average truthfulness of LLMs (GPT-2, GPT-Neo small, GPT-Neo medium, GPT-Neo large, and GPT-2 XL). This graph validates the hypothesis statement.


The zero-shot accuracy of the Mistral model after fine-tuning through RepE




Performance of the honesty representation direction across hidden layers and token positions


As artificial intelligence continues its rapid progress, it is crucial that our systems prioritize robust truthfulness alongside their breadth of skills and knowledge. Representation engineering provides a powerful framework for instilling consistent honesty in large language models without compromising their impressive capabilities.


Link for the video blog: https://youtu.be/LINgRngNNd0?si=IVwuK7oJ2EEmaX6w
Link for the report: https://drive.google.com/file/d/1octdsOf1GzYfhLgLy5FBqifpJzfeguP4/view?usp=drive_link




