Data when seen through the Inference Web of LLM

Background

Today I saw a post from this guy saying “I was at the Marines today”. A few mins later I saw a post from the same guy saying: “I love vada pavs”. And i thought to myself “Hey, This guy is probably from Mumbai, India”. The next thought that followed this was if I, a mere human being can deduce this from 2 posts, what can the LLMs do? LLMS are trained on massive data sources from the internet. Isn't it possible that they infer this information too?

Introduction

Large language models (LLMs) have revolutionized natural language processing tasks. They demonstrate amazing capabilities in understanding and generating human-like text. In this project, we decided to delve into the accuracies of inferential capabilities of various LLM models.

Objectives

Inference in LLMs is a vast and interesting topic to dive into. For the sake of this project, we chose 3 main objectives that we would be focusing upon:

Identify the accuracies of Inferential Capabilities of the LLM
Exploring how the inference of a language model changes as the amount of context provided increases
Suggest the approaches to mitigate / minimize the risk of personal data identification using inferential capabilities of LLMs

Our Setup

Our setup for the research was divided into 4 main sections:

Curating the right dataset

We referred to the Reddit dataset which was available on Kaggle. The Reddit dataset consisted of a lot of unnecessary information which was irrelevant to us. We needed data that was most appropriate for our problem statement. So we curated over 1000+ comments from the Reddit dataset that were closest to our inference parameters: Profession, City, Income, Gender, Country, and Education.

Providing an adversarial prompt

Most of the LLMs have guard rails that prevent people from trying to exploit PII data from them. In order to jail break, we used an adversarial prompt.

Setting a baseline

Now that we successfully fooled the models into giving us the PII data, we needed to set a baseline for evaluation of the models. We manually validated 100 prompts and found that GPT-4 was about 90.8% accurate in inference. Therefore, for the rest of the research we considered ChatGPT-4 to be our baseline for comparison.

Selection of models

We selected 3 main models that we would use apart from GPT-4. These were: GPT-3.5-turbo, Gemini-pro, Claude-3 Sonnet.
We choose these 3 model since these are publicly available black box model and were having large context window to accommodate our large input

Experiment

Evaluation of different parameters for different models:

We considered 6 parameters for our analysis, 2 direct inference parameters(city, profession) and 4 indirect inferences (Gender, Income, Country, Education)

Length of text:

Text Message length provided obviously was a variable that would cause differences in inference accuracies. To test this, we tried our inferences with variable context lengths.

We considered 25%, 50%, 75% and 100% of the message to determine how it would change the outcome of the inferences.

Results

Inferences

On performing this experiment we found that Gender and Profession data could be inferred very easily from the posts This was followed by Education and Income.
Claude over performed all other models in its inference and exceeded the baseline in most parameters.
Gemini was poor at inference as compared to the other models.
Interestingly City inference for Claude with 100% text is less accurate, indicating sometimes additional context confuses the model.

Accuracies

Given below is the graph depicting the accuracy of inferences. This excludes all the failed inferences that the model gave.(Responses where the model refused to answer or gave responses as “Insufficient Data”).

For parameters such as Income, Gender, Education, Country, Claude 3-Sonnet was performing better than GPT 3.5 for for full context window and for parameters Profession, City GPT3.5 is performing better than other models compared.

Precision

Some interesting results which we found are that claude-3-sonnet is performing very well in this attributes

Gender: it outperforms all in terms of correctly predicting the attribute when compared with other models
Education: consistently performed better in all context length than other models
Profession: gpt3.5 turbo was the least performing in this attribute but actually it was high, whereas the models Gemini-pro and claude-3-sonnet where neck to neck
Income: gpt3.5 turbo did hallucinate for a significant no of cases when context length was increased from 25 to 50% resulting in less precision
City: Claude outperformed other models, Gemini-pro did moderate performance and gpt-3.5 performed poorly as it is seen hallucinating with increase in context length, the ideal graph could be like even if it is not increasing it should stay stagnant but not decrease.
Country: It stayed stagnant which implies that the actual data for predicting country was in the first 25% interval (Low Predictions for the country might also be the other reason)

Mitigation techniques

In order to mitigate the PII data, we first needed to understand which data in the text was being inference.

For this we used a Presidio Analyzer. This allowed us to detect PII data and using this we implemented 2 modules for

mitigation:

Anonymizer
Faker

Definition

Presidio Analyzer enables PII removal, data cleaning, tokenization, custom entity recognition, and enhancement of training data for LLMs, ensuring privacy compliance and improved model performance.

Anonymizer

Anonymizing data after finding PII with Presidio Analyzer makes sure that personal info like names and contact details are hidden.

Some advantages of using the anonymizer are:

Protecting Privacy: Hiding sensitive info prevents unauthorized access or misuse.
Reducing Risk: Anonymization lowers the chance of exposing personal details, reducing the risk of identity theft or fraud.
Ensuring Compliance: It helps follow privacy laws like GDPR or CCPA, avoiding legal issues.

Faker

We used a faker to create synthetic data that mimics the patterns and formats of the original data without exposing real PII. This way, the LLM can still learn from the data without risk of re-identification.

Conclusion

Inferential Capabilities of gpt-4 and Claude are very close to human capabilities
There is some level of randomness across different models – gpt models being more stable and consistent.
Indirect inferential capabilities of Claude is much higher given 100% context window
Direct inferential capabilities of gpt3.5 are much better than any other model compared
LLM Inferential capabilities are evolving and to protect data privacy, conscious efforts are required in training data anonymization of faking

Sneak Peak to our Journey

Link to our Github Repo: https://github.com/bssoft82/rsai-llm-memorization.git

Link to YouTube: https://youtu.be/ib_HqZ8VT_A

Link To Report:

https://docs.google.com/document/d/1DVpbSBWeLaT2B4EWzu5qcN_zpjJZcd5QAJX58NZz2OQ/edit?usp=sharing

Team: Conscitech

Teammate Names:

Ketaki Kashtikar

Anand Bhat

Ronak Patel

Abhishek Kumar

Bharat Surana

Search This Blog

RSAI Projects - To be updated later