What's Your Number? Interpreting Memorisation in Language Models

In the vast realm of machine learning, the concept of "grokking" holds a special allure. It refers to a phenomenon in which a model, long after it has memorized its training data, abruptly transitions to genuine generalization: it stops relying on rote pattern recall and instead develops an internal representation of the underlying patterns, relationships, and structure within the data.

Our Approach

As part of our course project, our team set out to explore the parameters and conditions necessary for grokking to occur in machine learning models. We focused our efforts on task complexity, data quantity, hyperparameters, and model architecture, using a character-level decoder-only transformer architecture as our testing ground.

First Steps

Initially, we tackled a simple ROT13 cipher task, which maps each letter to the letter 13 positions after it in the alphabet, wrapping around. This task proved too straightforward: even a small single-layer model could memorize the mapping without ever demonstrating true grokking.
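For concreteness, the cipher is trivial to state in code. Here is a minimal pure-Python sketch of the target function the model had to learn (the model itself, of course, only ever saw input/output character pairs):

```python
def rot13(text: str) -> str:
    # Shift each letter 13 places; ROT13 is its own inverse since 2 * 13 = 26.
    result = []
    for ch in text:
        if ch.isalpha():
            base = ord("a") if ch.islower() else ord("A")
            result.append(chr((ord(ch) - base + 13) % 26 + base))
        else:
            result.append(ch)  # non-letters pass through unchanged
    return "".join(result)

print(rot13("hello"))  # -> "uryyb"
```

Because the mapping is a fixed permutation of 26 symbols, a model can simply store it in its embedding and output layers, which is why memorization sufficed here.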

Upping the Ante

To raise the challenge, we devised a more complex task, combining character mapping to numeric values and modular addition. In this task, the model received two characters, 'a' and 'b', representing fixed but undisclosed numeric values. The decoder had to generate the character whose numeric value corresponded to a specific modular addition operation involving the values of 'a' and 'b'.
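As an illustrative sketch of how such a dataset can be generated, assume a 26-letter vocabulary where each character stands for its alphabet index and the modulus equals the vocabulary size; the specific mapping, modulus, and 50/50 split below are placeholder choices, not our exact experimental setup:

```python
import itertools
import random

VOCAB = "abcdefghijklmnopqrstuvwxyz"  # assumed vocabulary
P = len(VOCAB)                        # assumed modulus = vocabulary size

def target(a: str, b: str) -> str:
    # Label for the pair (a, b): the character whose index is (a + b) mod P.
    return VOCAB[(VOCAB.index(a) + VOCAB.index(b)) % P]

# Enumerate all P * P ordered pairs, then split into train/validation sets.
pairs = list(itertools.product(VOCAB, repeat=2))
random.seed(0)
random.shuffle(pairs)
split = len(pairs) // 2
train = [(a, b, target(a, b)) for a, b in pairs[:split]]
val = [(a, b, target(a, b)) for a, b in pairs[split:]]

print(target("c", "z"))  # 2 + 25 = 27, 27 mod 26 = 1 -> "b"
```

Unlike ROT13, the correct output here depends jointly on both inputs, so the model cannot get away with storing a per-character lookup table; it must recover the hidden additive structure to generalize to held-out pairs.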

By systematically varying the task complexity, data quantity, hyperparameters (such as epochs and layers), and model architecture, we aimed to identify the critical factors that facilitate grokking in machine learning models.

What We Found

Our experiments yielded fascinating insights. We observed that the complexity of the dataset played a crucial role in eliciting the grokking phenomenon. While the simple ROT13 task could be easily memorized, the more intricate modular addition task forced the model to undergo a prolonged period of memorization before eventually transitioning to a state of generalization – a telltale sign of grokking.

Furthermore, we discovered that weight decay, a regularization technique, significantly impacted the model's ability to grok. Without weight decay, the transformer networks failed to exhibit grokking behavior on the given task. Smaller amounts of decay caused the network to take significantly longer to grok, highlighting the importance of finding the right balance in regularization.
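To illustrate why decay matters, here is a minimal sketch of a decoupled weight-decay update (in the spirit of AdamW, but with plain SGD for simplicity; the learning rate and decay coefficient are placeholders, not our actual hyperparameters):

```python
def sgd_step_with_decay(w, grad, lr=0.01, weight_decay=0.1):
    # Decoupled weight decay: in addition to the gradient step, shrink the
    # weight toward zero each update. This steady pull away from large,
    # memorization-friendly weights is the regularization pressure that
    # has been linked to grokking.
    return w - lr * grad - lr * weight_decay * w

w = 1.0
for _ in range(100):
    # With a zero gradient (e.g. near-zero training loss after memorization),
    # decay alone keeps reshaping the weights.
    w = sgd_step_with_decay(w, grad=0.0)
print(round(w, 4))  # -> 0.9048
```

The key intuition is in the zero-gradient loop: once training loss has bottomed out, decay keeps nudging the network toward simpler solutions, which is consistent with our observation that smaller decay values delayed grokking and no decay prevented it.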

Interestingly, we also observed that grokking failed to occur when the vocabulary size fell below a certain threshold. Since the number of possible input pairs in our task grows with the vocabulary, this suggests that a minimum amount of data is necessary for the model to generalize effectively.

[Figure: Variation in grokking with change in vocabulary size. As the vocabulary size decreases, grokking becomes more difficult to observe.]

Additionally, the number of attention layers in the transformer model influenced its grokking abilities. While single-layer models exhibited grokking, models with more than one layer were affected by the "slingshot mechanism" (dramatic loss spikes) and required significantly more training data to achieve grokking.
[Figure: Variation in grokking with change in number of layers. As the number of layers increases, grokking becomes more difficult to observe.]

Concluding Thoughts

Through this project, we gained valuable insights into the complex process of grokking in machine learning models. We learned that it is influenced by a multitude of factors, including dataset complexity, weight decay, dataset size, and model architecture. These findings could guide future studies in designing and training more intelligent and generalizable models capable of true understanding.

In the ever-evolving landscape of artificial intelligence, unraveling the mysteries of grokking is a crucial step towards developing models that not only recognize patterns but truly comprehend the underlying logic and relationships within data. As we continue to push the boundaries of machine learning, this research serves as a reminder of the intricate interplay between model architecture, training conditions, and the elusive pursuit of true generalization.

Project Report: https://drive.google.com/file/d/16YqnJpzNMnodwHnMDJsda3Yv0h6oVf2x/view?usp=drivesdk

Project Video: https://youtu.be/Q-YS04G4r_8?si=FBh3ip7UbiYaqi9y

