What's Your Number? Interpreting Memorisation in Language Models
Our Approach
As part of our course project, our team set out to explore the parameters and conditions necessary for grokking to occur in machine learning models. We focused our efforts on task complexity, data quantity, hyperparameters, and model architecture, using a character-level decoder-only transformer architecture as our testing ground.

First Steps
Initially, we tackled a simple ROT13 cipher task, which maps each letter to the 13th letter after it in the alphabet. However, this task proved too straightforward: even a small single-layer model could memorize the pattern without demonstrating true grokking.
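For reference, the ROT13 mapping the model had to learn fits in a few lines (a plain Python sketch of the cipher itself, not our training code):

```python
def rot13(text: str) -> str:
    """Map each letter to the letter 13 places later, wrapping around the alphabet."""
    out = []
    for ch in text:
        if ch.isalpha():
            base = ord('a') if ch.islower() else ord('A')
            out.append(chr((ord(ch) - base + 13) % 26 + base))
        else:
            out.append(ch)  # leave non-letters unchanged
    return ''.join(out)

print(rot13("hello"))  # -> "uryyb"
```

Because 13 + 13 = 26, the cipher is its own inverse; the fixed offset is what makes the pattern so easy for even a tiny model to memorize.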
Upping the Ante
To raise the challenge, we devised a more complex task, combining character mapping to numeric values and modular addition. In this task, the model received two characters, 'a' and 'b', representing fixed but undisclosed numeric values. The decoder had to generate the character whose numeric value corresponded to a specific modular addition operation involving the values of 'a' and 'b'.
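A sketch of how such a dataset can be generated, assuming an illustrative formulation in which each letter's hidden value is its alphabet index and the modulus is 26 (the exact values and modulus in our report may differ):

```python
import itertools
import random
import string

MOD = 26  # illustrative modulus, not necessarily the one from our experiments

def char_value(c: str) -> int:
    """Hidden numeric value of a character (here: its alphabet index)."""
    return ord(c) - ord('a')

def make_dataset(train_frac: float = 0.5, seed: int = 0):
    """Enumerate every (a, b) pair with its target character, whose value
    is (value(a) + value(b)) mod MOD, then split into train/validation."""
    letters = string.ascii_lowercase[:MOD]
    pairs = []
    for a, b in itertools.product(letters, repeat=2):
        target = letters[(char_value(a) + char_value(b)) % MOD]
        pairs.append((a + b, target))
    random.Random(seed).shuffle(pairs)
    split = int(len(pairs) * train_frac)
    return pairs[:split], pairs[split:]

train, val = make_dataset()
print(len(train), len(val))  # 338 338
```

Because the values are never shown to the model, it cannot solve the task by a surface pattern the way it could with ROT13; it has to infer the hidden mapping and the modular structure from input-output pairs alone.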
By systematically varying the task complexity, data quantity, hyperparameters (such as epochs and layers), and model architecture, we aimed to identify the critical factors that facilitate grokking in machine learning models.
What we found
Our experiments yielded fascinating insights. We observed that the complexity of the dataset played a crucial role in eliciting the grokking phenomenon. While the simple ROT13 task could be easily memorized, the more intricate modular addition task forced the model to undergo a prolonged period of memorization before eventually transitioning to a state of generalization – a telltale sign of grokking.
Furthermore, we discovered that weight decay, a regularization technique, significantly impacted the model's ability to grok. Without weight decay, the transformer networks failed to exhibit grokking behavior on the given task. Smaller amounts of decay caused the network to take significantly longer to grok, highlighting the importance of finding the right balance in regularization.
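The core of decoupled weight decay (the AdamW-style variant commonly used in grokking studies) can be sketched as a single parameter update, simplified here by omitting Adam's moment estimates; the hyperparameter values are illustrative, not the ones from our report:

```python
def decayed_step(w: float, grad: float, lr: float = 1e-3, weight_decay: float = 1.0) -> float:
    """One simplified optimizer step with decoupled weight decay."""
    # Decay shrinks the weight toward zero independently of the gradient,
    # penalizing the large weights that pure memorization tends to produce.
    w = w - lr * weight_decay * w
    # Ordinary gradient step (Adam's momentum/variance terms omitted).
    w = w - lr * grad
    return w
```

Setting `weight_decay=0.0` recovers the unregularized baseline, which in our runs never transitioned from memorization to generalization.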
Interestingly, we also observed that grokking failed to occur when the vocabulary size fell below a certain threshold, suggesting that a minimum amount of data (here, the number of distinct input pairs, which shrinks with the vocabulary) is necessary for the model to generalize effectively.
*Figure: Variation in grokking with change in vocabulary size. As the vocabulary size decreases, grokking becomes more difficult to observe.*
*Figure: Variation in grokking with change in number of layers. As the number of layers increases, grokking becomes harder to observe.*
Concluding thoughts
Through this project, we gained valuable insights into the complex process of grokking in machine learning models. We learned that it is influenced by a multitude of factors, including dataset complexity, weight decay, dataset size, and model architecture. These findings could guide future studies in designing and training more intelligent and generalizable models capable of true understanding.
In the ever-evolving landscape of artificial intelligence, unraveling the mysteries of grokking is a crucial step towards developing models that not only recognize patterns but truly comprehend the underlying logic and relationships within data. As we continue to push the boundaries of machine learning, this research serves as a reminder of the intricate interplay between model architecture, training conditions, and the elusive pursuit of true generalization.
Project Report: https://drive.google.com/file/d/16YqnJpzNMnodwHnMDJsda3Yv0h6oVf2x/view?usp=drivesdk
Project Video: https://youtu.be/Q-YS04G4r_8?si=FBh3ip7UbiYaqi9y


