Harnessing Internal Representations: A Strategy for Adversarial Defense
What are adversarial attacks?
Adversarial attacks are a type of threat to machine learning models that involve introducing carefully crafted input data that causes the model to make incorrect predictions. These malicious inputs, known as adversarial examples, are designed to be imperceptibly different from normal inputs, yet they can drastically change a model's output and lead it to misclassify the data. Adversarial attacks expose vulnerabilities in machine learning systems and raise concerns about the robustness and security of these models, especially in high-stakes applications like self-driving cars or medical diagnosis.
Why is adversarial defense necessary?
The threat of adversarial attacks highlights a critical vulnerability in modern machine learning models. Even state-of-the-art models that achieve high accuracy on normal test data can be easily fooled by adversarial examples. This makes adversarial attacks a serious security concern, especially as AI systems are increasingly deployed in safety-critical domains, where a single misclassified adversarial example could lead to catastrophic consequences.
Developing effective defenses against such attacks is therefore essential for building trustworthy and reliable AI that can operate securely even in adversarial settings. Adversarial defense techniques aim to improve model robustness, thereby increasing the safety and security of deploying AI solutions in the real world. Researchers are actively studying methods of defending against these attacks through defense mechanisms such as the Adversarial Prompt Shield (APS).
Are current adversarial defense mechanisms sufficient?
The short answer is no. While there has been promising research into techniques for defending against adversarial attacks, the current state-of-the-art defense mechanisms still leave much to be desired. Even though they can achieve fairly decent robustness against adversarial attacks, they are often computationally expensive. For instance, the Adversarial Prompt Shield runs a separate DistilBERT model solely to predict whether an input is adversarial, which makes it very inefficient. There is therefore great potential in developing defense techniques that are less computationally expensive but offer a similar level of robustness.
For this reason, in this article we walk through a novel defense mechanism that utilizes model interpretability techniques to predict whether an input is adversarial, and to further “de-adversarialize” inputs that are found to be adversarial.
How to apply interpretability techniques for adversarial defense?
The primary idea here is that interpretability techniques such as probing and steering (discussed below) operate on the internal representations of the inputs rather than on the inputs themselves. Our basis for the claim that interpretability techniques can be utilized for this purpose is that adversarial prompts are not random but share commonalities, as recent research on adversarial suffixes has shown. The existence of this commonality implies that some general adversarial information is present within these inputs. Our approach aims to capture this adversarial information, which may be better represented by hidden-state representations, in order to propose a general adversarial defense framework.
Probing
The first interpretability technique we apply for this purpose is probing. Probing is a technique used in interpretability to analyze the representations learned by neural networks and gain insight into what different parts of a model capture about the input data. The core idea is to attach lightweight auxiliary prediction tasks, called probes, to the representations of a neural network and evaluate how well these probes can predict certain linguistic properties or annotations. We apply probing to each layer of our transformer model to find out which layer best captures the “adversarial information” mentioned earlier in this section. The probing task we use for this purpose is a binary classification task between benign and adversarial inputs. For instance, we applied probing to the Roberta-base-sst2 classifier on data generated by attacking the model with the “TextFooler” and “DeepWordBug” attacks, and the results can be seen in the graphs below:
To determine the baseline probing accuracy in both cases, we also probed the CANINE model (a character-level model that should be largely unaffected by TextFooler) and a randomly initialized Roberta-base-sst2 (for DeepWordBug). The comparison makes it apparent that certain layers encode adversarial information about the prompt better than others, and we use this in the next step of building a steering vector. It is also interesting to note that our simple logistic regression probe on the best layer (around 80% accuracy) outperforms a Roberta-base-uncased transformer trained directly on the inputs. This shows the advantage of working with internal representations rather than the raw inputs alone.
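The probing step described above can be sketched as follows. This is a minimal illustration with synthetic hidden states standing in for the per-layer activations one would extract from a transformer (e.g. Roberta-base-sst2 with `output_hidden_states=True`); the layer count, dimensions, and the planted signal in the middle layers are assumptions for demonstration only.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n_layers, n_samples, hidden_dim = 12, 400, 64

# Labels: 0 = benign, 1 = adversarial.
labels = rng.integers(0, 2, n_samples)

# Synthetic stand-in for per-layer hidden states; in practice these come
# from the transformer's hidden states for each input.
hidden_states = rng.normal(size=(n_layers, n_samples, hidden_dim))
# Plant more "adversarial information" in the middle layers (assumption
# for this toy setup, so the layer comparison has something to find).
for layer in range(n_layers):
    strength = 1.5 if 4 <= layer <= 8 else 0.1
    hidden_states[layer, labels == 1, 0] += strength

def probe_accuracy(reps, labels):
    """Fit a linear probe on one layer's representations and report
    held-out accuracy on the benign-vs-adversarial task."""
    X_tr, X_te, y_tr, y_te = train_test_split(
        reps, labels, test_size=0.25, random_state=0
    )
    probe = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
    return probe.score(X_te, y_te)

accuracies = [probe_accuracy(hidden_states[l], labels) for l in range(n_layers)]
best_layer = int(np.argmax(accuracies))
print(f"best layer: {best_layer}, accuracy: {accuracies[best_layer]:.2f}")
```

The per-layer accuracy curve is what the graphs above plot: the layer where the probe peaks is the one whose representations we use to build the steering vector.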
Steering vector
As mentioned in the previous subsection, we utilize the knowledge of which layer stores the most adversarial information to build our steering vector. Steering is a technique that aims to manipulate model representations in a targeted way. The approach involves finding directions in a layer's representation space that most affect a particular semantic property when the representations are moved along them. These directions are the steering vectors: they provide a way to dissect, introspect, and exert controlled interventions on the inner mechanisms of neural networks.
We reason that the layer with the highest probing accuracy contains the most adversarial information distinguishing adversarial from benign inputs. To obtain a representation of this adversarial information, we rely on two claims:
- There are no constraints on the inputs in our dataset apart from the fact that some are adversarial and some are benign. Given enough data, averaging over all the benign inputs should therefore give a general representation of a benign input, and likewise for the adversarial inputs, because other, input-specific information cancels out in the average.
- Since the only systematic difference between the two sets is that one is adversarial and the other is benign, taking the difference of the two mean representations removes the shared benign information from the adversarial representation, leaving only a representation of the adversarial information.
Hence, through these two steps, we obtain a steering vector that represents the general adversarial information mentioned before. After obtaining this steering vector, we analyze it and use it to freely transform inputs between the benign and adversarial spaces.
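The two steps above amount to a simple difference of means. A minimal sketch, using synthetic representations from the chosen layer (the dimension and the planted offset are illustrative assumptions, not values from our experiments):

```python
import numpy as np

def steering_vector(benign_reps, adv_reps):
    """Mean adversarial representation minus mean benign representation.
    Averaging cancels input-specific content; the difference isolates
    the shared adversarial direction."""
    return adv_reps.mean(axis=0) - benign_reps.mean(axis=0)

rng = np.random.default_rng(1)
dim = 16
true_direction = np.zeros(dim)
true_direction[0] = 2.0  # planted adversarial offset for this toy example

benign = rng.normal(size=(5000, dim))
adversarial = rng.normal(size=(5000, dim)) + true_direction

v = steering_vector(benign, adversarial)
print(np.round(v, 1))
```

With enough samples, the recovered vector converges to the planted direction, which is exactly the cancellation argument made in the two claims above.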
How to use these results for building adversarial defense?
We leverage the fact that steering vectors let us transform inputs between the adversarial and benign spaces. By simple vector addition, we can take any benign input and add adversarial information to it, converting it into an adversarial input. We validate this with preliminary checks on the original and converted inputs, testing whether we can fool the original classifier (Roberta-base-sst2). However, we do not test this direction in depth, since our aim is not to create adversarial attacks; creating adversarial inputs this way is not remarkable in itself, as we could already do so by attacking the model directly.
By changing perspective, however, one can observe that if adding adversarial information converts a prompt from benign to adversarial, the reverse should also hold: we can convert an adversarial prompt into a benign one simply by subtracting the steering vector, thus removing the adversarial information present. Unlike converting benign prompts to adversarial ones, this reverse direction is the valuable result we have achieved, as it has the potential to fulfill our goal of building an adversarial defense mechanism with minimal impact on performance.
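Both directions reduce to a single vector operation on the chosen layer's representation. A sketch with a synthetic steering vector (the dimension, values, and the `alpha` scaling knob are illustrative assumptions; in the real pipeline the edited representation is fed back into the remaining layers of the model):

```python
import numpy as np

def to_adversarial(rep, v, alpha=1.0):
    # Add adversarial information: benign -> adversarial space.
    return rep + alpha * v

def de_adversarialize(rep, v, alpha=1.0):
    # Remove adversarial information: adversarial -> benign space.
    return rep - alpha * v

rng = np.random.default_rng(2)
dim = 16
v = np.zeros(dim)
v[0] = 2.0                 # steering vector (as computed from the means)
benign_mean = np.zeros(dim)

adv_rep = benign_mean + v + 0.1 * rng.normal(size=dim)  # a noisy adversarial input
cleaned = de_adversarialize(adv_rep, v)

# After subtraction, the representation should sit near the benign mean.
print(f"distance before: {np.linalg.norm(adv_rep - benign_mean):.2f}, "
      f"after: {np.linalg.norm(cleaned - benign_mean):.2f}")
```

Because the two operations are exact inverses, the same vector supports both the attack-side sanity check and the defense itself.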
To verify whether our claim is correct, we take 50 adversarial prompts generated by attacking the model Roberta-base-sst2 with TextFooler (different from the ones used to build the steering vector) and check whether we can successfully de-adversarialize the prompts. The results of the tests are shown below:
The analysis above shows a high success rate: almost 80% of adversarial inputs are converted back to benign inputs using the steering vector.
Does this satisfy the original issue?
Yes. Our major problem with existing adversarial defense techniques was their impact on the performance of the actual system: running a separate DistilBERT model on top of the deployed model simply to detect adversarial attacks is hard to justify. Our proposed solution achieves a high success rate in adversarial defense using only vector addition and subtraction, which are computationally cheap operations. Further, since our methodology does not revolve around a particular model or attack method, the framework can be applied to any model to build a defense against any attack, increasing the generalizability of our approach.
Bottom line
While current adversarial defense techniques show promise in defending against attacks, they suffer from major computational overhead that inhibits real-world deployment. Our proposed solution offers a more efficient and generalized defense framework by leveraging simple vector operations like addition and subtraction to achieve high success rates against adversarial attacks. These computationally cheap operations can be seamlessly integrated into any machine learning model without requiring additional resource-intensive components. Crucially, our methodology is model-agnostic and attack-agnostic, providing a generalizable defense against a wide range of adversarial threats. By addressing the performance bottlenecks of existing approaches, our vector-based defense provides a new direction in increasing model robustness.
Link to the full report : https://drive.google.com/file/d/1LQFtNnD4kHUKTlmrBHT4l_TtvGdk5TvT/view?usp=sharing
Link to the video : https://youtu.be/0DUhE_0UnAw