# Defense Using the Dark Arts: A Technical Intro to Adversarial ML

Jiarui Wang

Nov 30, 2023

*This article was originally published in Black Box, the author's newsletter. Subscribe **here**.*

*This post assumes a conceptual understanding of neural network training, basic linear algebra, and familiarity with mathematical notation. For a refresher, see this **generative AI primer** that I created.*

I have been interested in protecting data and content from AI models since I proposed accelerating model collapse to preserve API access. I think that defensively using offensive techniques has a lot of potential, most of which is unrealized due to the nascency of this research. So when Nightshade, the data poisoning tool, was announced a few weeks ago, I was eager to dig into the paper.

What I quickly realized is in this branch of machine learning, there is a gap in knowledge between casual interest and understanding research. Luckily, I studied math alongside economics in college and had previously reviewed the basics of generative AI. It was time for a deep dive.

### Adversarial ML

Nightshade belongs to a field called adversarial machine learning, which is the study of attacks on ML models and defenses against such attacks. To be specific, Nightshade is a data poisoning attack, which modifies the training data of a model so that it produces errors. This is related to another kind of attack called evasion, which modifies the inputs to make a model — usually a classifier — systematically produce wrong outputs, e.g., misclassify. Much adversarial ML research has been on evading image classifiers; Nightshade builds on this body of work.

Adversarial attacks can be further categorized as either:

- White box, in which the attacker has access to a model’s parameters; or black box, in which they do not (more on that in a second).
- Targeted, in which the attacker aims to produce errors of a specific class; or non-targeted. Most practical applications are targeted.

While the nature of adversarial spaces is unclear, the leading hypothesis is that they result from practical limitations. Since a model cannot be trained on the entire input space, the distributions of the training and test data will not match exactly. Adversarial space is the difference, and so inputs in that space are misclassified and become adversarial examples.

Adversarial ML research has “traditionally” focused on white box attacks as they can transfer to black box models. This is possible since models trained on similar data are likely to have adversarial spaces that partially overlap. In fact, this probability increases with dimensionality due to a property called distance concentration¹. ML models are also inherently prone to behaving similarly since they are designed to generalize across training datasets and architectures. Similar classifiers should therefore partition their input space into classes with decision boundaries that are close to each other.

### Distance measures

What “close” means matters a lot in adversarial ML because attacks that are obviously manipulated will be caught. Instead of cosine similarity, the most common distance measures used in adversarial ML are *p*-norms, written as ||•||ₚ in the literature. Given a pair of vectors, their

- *l*₁ norm or Manhattan distance is the sum of the absolute differences of their components.
- *l*₂ norm or Euclidean distance is given by the Pythagorean theorem. It is often used for continuous data.
- *lₚ* norm or Minkowski distance generalizes this to any finite order *p*.
- *l*∞ norm or Chebyshev distance is the largest absolute difference in any single component, i.e., the limit as *p* → ∞.
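These norms are easy to compute directly; a quick illustration with made-up vectors (the numbers are mine, not from any paper):

```python
import numpy as np

x = np.array([0.2, 0.5, 0.9])
x_adv = np.array([0.25, 0.45, 0.9])  # a slightly perturbed copy of x
delta = x_adv - x

l1 = np.linalg.norm(delta, ord=1)         # sum of absolute differences
l2 = np.linalg.norm(delta, ord=2)         # Euclidean distance
linf = np.linalg.norm(delta, ord=np.inf)  # largest single-component change
```

Note that the *l*∞ norm stays at 0.05 no matter how many components are perturbed by that amount, which is why it is a popular budget for image attacks: it bounds the worst change to any one pixel.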

Speaking of literature, there are two foundational papers in adversarial ML that are worth understanding as background for Nightshade. I review these in turn.

### Fast Gradient Sign Method (2015)

An adversarial example takes advantage of the fact that a classifier *C* has a tolerance, so that a perturbation *η* to an input *x* below a threshold *ϵ* maps to the same class *y*, i.e., *C*(*x*) = *C*(*x’*) = *y*, where *x’* = *x* + *η* and ||*η*||∞ < *ϵ*.

For intuition, consider the activation of a node in a neural network. Given a weight vector *w* and an adversarial example *x’*, the dot product is *w*ᵀ*x’* = *w*ᵀ*x* + *w*ᵀ*η*. The perturbation *w*ᵀ*η* is then maximized² when *η* = *ϵ* · sign(*w*).
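A quick numerical sanity check of this claim, using a random weight vector (my own toy setup, not from the paper): no perturbation sampled inside the *l*∞ ball of radius *ϵ* should beat *ϵ* · sign(*w*).

```python
import numpy as np

rng = np.random.default_rng(0)
w = rng.normal(size=100)  # hypothetical weight vector of one node
eps = 0.1

eta_opt = eps * np.sign(w)  # the closed-form perturbation

# 1000 random perturbations drawn from inside the l-infinity ball
random_etas = rng.uniform(-eps, eps, size=(1000, 100))

best_random = (random_etas @ w).max()
optimal = eta_opt @ w  # equals eps * ||w||_1, the maximum possible value

assert optimal >= best_random
```

The inequality holds because *w*ᵀ*η* ≤ *ϵ*Σ|*wᵢ*| for any *η* with ||*η*||∞ ≤ *ϵ*, with equality exactly at *η* = *ϵ* · sign(*w*).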

This is a linear perturbation, but neural networks are generally nonlinear. However, Goodfellow et al. hypothesize that they behave linearly enough — at least locally — that they are susceptible to linear perturbations. (They support this hypothesis in the rest of the paper.)

Their strategy is based on gradients. In training, backpropagation updates *w* to minimize the loss function *J* given an input *x*. A gradient-based attack is effectively the inverse: it holds *w* constant and perturbs *x* to maximize *J*, subject to some *ϵ*.
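As a sketch (my own toy setup, not the paper's experiments), here is that inversion on a one-node logistic "network" with fixed weights: instead of updating *w* down the gradient of the loss, we step *x* along the sign of the input gradient, which increases the loss.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def loss(w, x, y):
    # binary cross-entropy for a single example
    p = sigmoid(w @ x)
    return -(y * np.log(p) + (1 - y) * np.log(1 - p))

rng = np.random.default_rng(1)
w = rng.normal(size=20)        # hypothetical trained weights, held constant
x = rng.normal(size=20)
y = 1.0 if w @ x > 0 else 0.0  # x is correctly classified by construction

# gradient of the loss with respect to the INPUT, not the weights
grad_x = (sigmoid(w @ x) - y) * w

eps = 0.05
x_adv = x + eps * np.sign(grad_x)  # step that maximally increases the loss
                                   # per unit of l-infinity budget

assert loss(w, x_adv, y) > loss(w, x, y)
```

For this linear model the increase is guaranteed; the hypothesis above is that deep networks are locally linear enough for the same step to work on them.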

For a model with parameters *θ*, the optimal perturbation is therefore

*η* = *ϵ* · sign(∇ₓ*J*(*θ*, *x*, *y*)),

which Goodfellow et al. call the Fast Gradient Sign Method, a non-targeted white box attack. But what about targeted attacks?

### Carlini and Wagner (2017)

Carlini and Wagner propose the following to calculate a targeted white box attack on a classifier *C*. Let *C*\*(*x*) be the correct class for a valid input *x* and *t* ≠ *C*\*(*x*) be the target class. Finding an adversarial example *x’* = *x* + *δ* is then stated as min(*D*(*x*, *x* + *δ*)) subject to *C*(*x* + *δ*) = *t* and *x* + *δ* ∈ [0, 1]ⁿ, where *D* is a distance measure and *n* is the dimension of *x*.

This is difficult to solve directly since *C* is nonlinear, so Carlini and Wagner define an objective function *f* such that *C*(*x* + *δ*) = *t* if and only if *f*(*x* + *δ*) ≤ 0. Conceptually, 1 − *f* is how close *x* + *δ* is to being classified as *t*, which makes *f* a loss function³.

If *D* is a *p*-norm, then the problem can be restated as min(||*δ*||ₚ + *c* · *f*(*x* + *δ*)) subject to *x* + *δ* ∈ [0, 1]ⁿ, where *c* > 0 is a weighting factor for the loss term. Carlini and Wagner present several candidates for *f* and empirically find that the best is

*f*(*x’*) = max(max{*Z*(*x’*)ᵢ : *i* ≠ *t*} − *Z*(*x’*)ₜ, −*k*),

where *Z*(•) gives the output from all of a neural network’s layers before the softmax function (a generalization of the logistic function to *n* dimensions) and *k* is a confidence threshold (a lower limit on the loss function). The *Z* expression is essentially the difference between what the classifier thinks *x’* is and what the attacker wants the classifier to think it is⁴.
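This objective is straightforward to compute from a model's logits. A minimal sketch with made-up logits (the function name and values are mine, not from the paper):

```python
import numpy as np

def cw_f(logits, t, k=0.0):
    """Carlini-Wagner objective: <= 0 exactly when class t has the
    largest logit (by at least margin k)."""
    other = np.max(np.delete(logits, t))  # best competing logit
    return max(other - logits[t], -k)

logits = np.array([1.2, 3.1, 0.4])  # made-up pre-softmax outputs Z(x')

assert cw_f(logits, t=1) <= 0  # class 1 already wins, so f is non-positive
assert cw_f(logits, t=0) > 0   # class 0 does not, so f penalizes the gap
```

Raising *k* demands a larger logit margin, which is what makes the resulting adversarial examples more robust (and more transferable).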

Carlini and Wagner make one more transformation to accommodate the fact that many optimization algorithms do not support box constraints. They apply a change of variable⁵ on *δ* and introduce *w* such that *δ* = ½(tanh(*w*) + 1) − *x*. The final optimization problem is then

min(||½(tanh(*w*) + 1) − *x*||ₚ + *c* · *f*(½(tanh(*w*) + 1))),

which they solve using the Adam optimizer. Carlini and Wagner’s approach can quickly generate highly robust targeted white box attacks.

### Glaze (2023)

Nightshade is an extension of prior work by Shan et al. that they call Glaze, a style cloaking technique that applies Carlini and Wagner’s optimization to text-to-image generators. These are harder to evade than image classifiers because they retain more of the features extracted from training images in order to generate original images⁶:

- During training, a text-to-image generator takes in an image *x* and uses a feature extractor Φ to produce its feature vector Φ(*x*).
- Simultaneously, an encoder *E* takes a corresponding text caption *s* and produces a predicted feature vector *E*(*s*).
- The parameters of *E* are optimized in training so that *E*(*s*) = Φ(*x*).
- At generation time, a user passes a text prompt *sᵢ* into *E* and a decoder *F* decodes *E*(*sᵢ*) to produce the generated image *xᵢ* = *F*(*E*(*sᵢ*)).

Shan et al. focus on style mimicry attacks, in which a text-to-image generator is used to create art in the style of a particular artist without their consent. Existing protective techniques rely on image cloaking, which is designed for classifiers; they are ineffective against style mimicry because they shift all features in an image instead of focusing only on features related to style.

Since it is difficult to explicitly identify and separate style features, Shan et al. do so implicitly by using another feature extractor Ω to transfer artwork *x* into a target style *T* that is different from the artist’s, Ω(*x*, *T*). (*T* is selected based on the distances between the centroid of Φ(*x*) and the centroids of the feature spaces of candidate styles.) Calculating the style cloak *δₓ* can then be stated as the optimization min(*D*(Φ(*x* + *δₓ*), Φ(Ω(*x*, *T*)))) subject to |*δₓ*| < *p*, where *D* is a distance measure and *p* is the perturbation budget.

Note that since Glaze is a black box evasion attack, the same model can act as both Φ and Ω. In other words, Glaze can use DALL-E, Midjourney, Stable Diffusion, etc. against themselves! I found this to be particularly satisfying as it is the ultimate form of defensively using offensive techniques, at least from the perspective of the artists.

Following Carlini and Wagner, Shan et al. then fold the constraint into the optimization problem

min(*D*(Φ(*x* + *δₓ*), Φ(Ω(*x*, *T*))) + *α* · max(LPIPS(*δₓ*) − *p*, 0)),

where *α* > 0 is a weighting factor for the loss term, LPIPS is a measure of perceived distortion, and *D* is instantiated as the *l*₂ norm.
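As a toy sketch of this style of optimization (not the actual Glaze implementation: a random linear map stands in for the feature extractor Φ, a random vector for the style-transferred target Ω(*x*, *T*), and a simple *l*∞ clip for the perceptual budget), projected gradient descent pulls the cloaked image's features toward the target style while keeping the perturbation small:

```python
import numpy as np

rng = np.random.default_rng(2)
Phi = rng.normal(size=(8, 16))         # stand-in linear "feature extractor"
x = rng.uniform(0, 1, size=16)         # the artwork, as a flat pixel vector
x_target = rng.uniform(0, 1, size=16)  # stand-in for the style transfer Omega(x, T)

p = 0.2              # perturbation budget (l-infinity here, for simplicity)
delta = np.zeros(16)
lr = 0.005

for _ in range(2000):
    # gradient of ||Phi(x + delta) - Phi(x_target)||^2 with respect to delta
    residual = Phi @ (x + delta) - Phi @ x_target
    delta -= lr * (2 * Phi.T @ residual)
    delta = np.clip(delta, -p, p)  # enforce the budget by projection

cloaked = Phi @ (x + delta)
original = Phi @ x
target = Phi @ x_target

# the cloak moved x's features toward the target style's features
assert np.linalg.norm(cloaked - target) < np.linalg.norm(original - target)
```

The real system replaces the linear map with a production model's feature extractor and the clip with an LPIPS penalty, but the mechanics (descend on feature distance, project back into the budget) are the same.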

Glaze’s style cloak was empirically successful in protecting art from being learned by Stable Diffusion and DALL-E, as judged by artists⁷. It helps that artists are willing to accept a fairly large *p* because their current methods of protection are quite disruptive (e.g., watermarks). However, Glaze can only protect new art since most existing artwork is already part of training data.

### Nightshade (2023)

Nightshade goes beyond cloaking and damages the offending model itself. It takes advantage of the fact that text-to-image generators exhibit concept sparsity despite having large training datasets. That is, a very small portion of the training data contains a given term or its semantically related terms. As a result, Shan et al. hypothesize that text-to-image generators are much more vulnerable to data poisoning (at the concept level) than is commonly believed.

They prove this by proposing Nightshade, a prompt-specific data poisoning attack based on mismatched text/image pairs. To minimize the number of poison samples for ease of implementation, Nightshade follows two design principles:

- Maximize the effect of each sample by including the keyword *C* in each poison prompt so that it targets only the parameters associated with *C*.
- Minimize conflicts among different pairs, and thus the overlap of their contributions to the perturbation norm, by creating original images of the target concept *T* using another text-to-image generator.

If Φ is the feature extractor of the victim model and Ω is that of the poison image model, then a valid image *x* corresponding to the poison prompt can be perturbed by *δₓ* so that it is poisoned into Ω(*x*, *T*). This can be calculated using Glaze⁸.

Nightshade produces stronger poisoning effects than previous techniques, and those effects bleed through to related concepts. They also stack if poisoned concepts are combined into a single prompt. Furthermore, if many Nightshade attacks target different prompts on the same model, general features become corrupted and its image generation function eventually collapses!

It is not difficult to imagine that all platforms that host media will protect their content with adversarial techniques like these in the future. I would love to learn more if you’re building or researching in this space—reach out at wngjj[dot]61[at]gmail.com. Until then, I will be thinking about how every new offensive threat could be a new defensive opportunity. ∎

1. Volume generalizes as distance raised to the power of the dimension, so increasingly more volume is on the edges at high dimensions. Here, the edges are decision boundaries.

2. This can be arbitrarily large for sufficiently high *n* since *l*∞ does not grow with dimensionality.

3. Defining *f* as the complement probability also enables it to be combined with *D* into a single minimization in the next step.

4. The first term represents the probability of the class of which *C* predicts *x'* is part, and the second term represents the probability that *C* classifies *x'* as *t* instead. Note that *Z* is not a probability itself but a softmax input.

5. Since tanh(•) ∈ [-1, 1], this is equivalent to *x* + *δ* ∈ [0, 1]ⁿ.

6. This is a very large output space. Classifiers have limited output classes and therefore do not have to retain as many features.

7. Shan et al. worked closely with professional artists and the Glaze paper includes their perspectives in §3.1, which I recommend reading.

8. In practice, Nightshade uses prompts from a valid dataset of text/image pairs to easily find *x*.

*This article was originally published in Black Box, the author's newsletter. Subscribe **here**. *

*This post assumes a conceptual understanding of neural network training, basic linear algebra, and familiarity with mathematical notation. For a refresher, see this **generative AI primer** that I created.*

I have been interested in protecting data and content from AI models since I proposed accelerating model collapse to preserve API access. I think that defensively using offensive techniques has a lot of potential, most of which is unrealized due to the nascency of this research. So when Nightshade, the data poisoning tool, was announced a few weeks ago, I was eager to dig into the paper.

What I quickly realized is in this branch of machine learning, there is a gap in knowledge between casual interest and understanding research. Luckily, I studied math alongside economics in college and had previously reviewed the basics of generative AI. It was time for a deep dive.

### Adversarial ML

Nightshade belongs to a field called adversarial machine learning, which is the study of attacks on ML models and defenses against such attacks. To be specific, Nightshade is a data poisoning attack, which modifies the training data of a model so that it produces errors. This is related to another kind of attack called evasion, which modifies the inputs to make a model — usually a classifier — systematically produce wrong outputs, e.g., misclassify. Much adversarial ML research has been on evading image classifiers; Nightshade builds on this body of work.

Adversarial attacks can be further categorized as either

White box, in which the attacker has access to a model’s parameters; or black box (more on that in a second).

Targeted, in which the attacker is aiming to produce errors of a specific class; or non-targeted. Most practical applications are targeted.

While the nature of adversarial spaces is unclear, the leading hypothesis is that they result from practical limitations. Since a model cannot be trained on the entire input space, the distributions of the training and test data will not match exactly. Adversarial space is the difference, and so inputs in that space are misclassified and become adversarial examples.

Adversarial ML research has “traditionally” focused on white box attacks as they can transfer to black box models. This is possible since models trained on similar data are likely to have adversarial space that partially overlap. In fact, this probability increases with dimensionality due to a property called distance concentration1. ML models are also inherently prone to behaving similarly since they are designed to generalize across training datasets and architecture. Similar classifiers should therefore partition their input space into classes with decision boundaries that are close to each other.

### Distance measures

What “close” means matters a lot in adversarial ML because attacks that are obviously manipulated will be caught. Instead of cosine similarity, the most common distance measures used in adversarial ML are *p*-norms, which are written as || • ||*ₚ *in literature. Given a pair of vectors, their

*l*₁*l*₂ norm or Euclidean distance is given by the Pythagorean theorem. It is often used for continuous data.*lₚ*norm or Minkowski distance generalizes this to vectors of any finite order*p*.*l*͚

Speaking of literature, there are two foundational papers in adversarial ML that are worth understanding as background for Nightshade. I review these in turn.

### Fast Gradient Sign Method (2015)

An adversarial example takes advantage of the fact that a classifier *C* has a tolerance so that a perturbation *η* to an input *x* below a threshold *ϵ* maps to the same class *y*, i.e., *C*(*x*) = *C*(*x’*) = *y* where *x’* = *x* + *η *and ||*η*|| ͚ < *ϵ*.

For intuition, consider the activation of a node in a neural network. Given a weight vector *w* and an adversarial example *x’*, the dot product is *w*ᵀ*x’* = *w*ᵀ*x + w*ᵀ*η*. The perturbation *w*ᵀ*η* can then be maximized2 if *η *= *ϵsign*(*w*).

This is a linear perturbation, but neural networks are generally nonlinear. However, Goodfellow et al. hypothesize that they behave linearly enough — at least locally — that they are susceptible to linear perturbations. (This can easily be shown, which they do in the rest of the paper.)

Their strategy is based on gradients. In training, backpropagation updates *w* to minimize the loss function *J* given an input *x. *A gradient-based attack is effectively the inverse as it holds *w* constant and perturbs *x* to maximize *J*, subject to some *ϵ*.

For a model with parameters *θ*, the optimal perturbation is therefore

which Goodfellow et al. call the Fast Gradient Sign Method, a non-targeted white box attack. But what about targeted attacks?

### Carlini and Wagner (2017)

Carlini and Wagner propose the following to calculate a targeted white box attack on a classifier *C*. Let *C**(*x*) be the correct class for a valid input *x* and *t* ≠ *C**(*x*) be the target class. Finding an adversarial example *x’* = *x* + *δ* is then stated as *min*(*D*(*x*, *x* + *δ*)) subject to *C*(*x* + *δ*) = *t* and *x* + *δ* ∈ [0, 1]*ⁿ*, where *D* is a distance measure and *n* is the dimension of *x*.

This is difficult to solve directly since *C* is nonlinear, so Carlini and Wagner define an objective function *f* such that *C*(*x* + *δ*) = *t *if and only if* f*(*x* + *δ*) ≤ 0. Conceptually, 1 — *f* is how close *x* + *δ *is to being classified as *t, *which makes *f* a loss function3.

If *D* is a *p*-norm, then the problem can be restated as *min*(||*δ*||*ₚ + cf*(*x* + *δ*)) subject to *x* + *δ* ∈ [0, 1]*ⁿ*, where *c* > 0 is a weighing factor for the loss term. Carlini and Wagner present several candidates for *f* and empirically find that the best is

where *Z*( • ) gives the output from all of a neural network’s layers before the softmax function — a generalization of the logistic function to *n* dimensions — and *k* is a confidence threshold (as a lower limit to the loss function). The *Z* expression is essentially the difference between what the classifier thinks *x’* is and what the attacker wants the classifier to think it is4.

Carlini and Wagner make one more transformation to accommodate for the fact that many optimization algorithms are not bounded. They apply a change of variable5 on *δ *and introduce *w *such that* δ = *½(*tanh*(*w*) + 1) — *x.* The final optimization problem is then

which they solve using the Adam optimizer. Carlini and Wagner’s approach can quickly generate highly robust targeted white box attacks.

### Glaze (2023)

Nightshade is an extension of prior work by Shan et al. that they call Glaze, a style cloaking technique that applies Carlini and Wagner’s optimization to text-to-image generators. These are harder to evade than image classifiers because they retain more of the features extracted from training images in order to generate original images6:

During training, a text-to-image generator takes in an image

*x*and uses a feature extractor Φ to produce its feature vector Φ(*x*).Simultaneously, an encoder

*E*takes a corresponding text caption*s*and produces a predicted feature vector*E*(*s*).The parameters of

*E*are optimized in training so that*E*(*s*) = Φ(*x*).At generation time, a user passes a text prompt

*sᵢ*into*E*and a decoder*F*decodes*E*(*sᵢ*) to produce the generated image*xᵢ*=*F*(*E*(*sᵢ*)).

Shan et al. focus on style mimicry attacks, where a text-to-image generator is used to create art in the style of a particular artist without their consent. Existing protective techniques rely on image cloaking, which are designed for classifiers; they are ineffective against style mimicry because they shift all features in an image instead of focusing on only features related to style.

Since it is difficult to explicitly identify and separate style features, Shan et al. do so implicitly by using another feature extractor Ω to transfer artwork *x* into a target style *T *that is different from the artist’s, Ω(*x*, *T*). (*T* is selected based on the distances between the centroid of Φ(*x*) and that of the feature spaces of candidate styles.) Then calculating the style cloak *δₓ *can be stated as the optimization *min*(*D*(Φ(*x + δₓ*) — Φ(Ω(*x*, *T*)))) subject to |*δₓ| < p*, where *D *is a distance measure and *p* is the perturbation budget.

Note that since Glaze is a black box evasion attack, the same model can act as both Φ and Ω. In other words, Glaze can use DALL-E, Midjourney, Stable Diffusion, etc. against themselves! I found this to be particularly satisfying as it is the ultimate form of defensively using offensive techniques, at least from the perspective of the artists.

Following Carlini and Wagner, Shan et al. then combines the restraint into the optimization problem

where *α* > 0 is a weighing factor for the loss term, LPIPS is a measure of the perceived distortion, and *D* is instantiated as the *l*₂ norm.

Glaze’s style cloak was empirically successful in protecting art from being learned by Stable Diffusion and DALL-E, as judged by artists7. It helps that artists are willing to accept fairly large *p *because their current methods of protecting are quite disruptive (e.g., watermarks). However, Glaze can only protect new art since most existing artwork is already part of training data.

### Nightshade (2023)

Nightshade goes beyond cloaking and damages the offending model itself. It takes advantage of the fact that text-to-image generators exhibit concept sparsity despite having large training datasets. That is, a very small portion of the training data contains a given term or its semantically related terms. As a result, Shan et al. hypothesize that text-to-image generators are much more vulnerable to data poisoning (at the concept level) than is commonly believed.

They prove this by proposing Nightshade, a prompt-specific data poisoning attack based on mismatched text/image pairs. To minimize the number of poison samples for ease of implementation, Nightshade follows two design principles:

Maximize the effect of each sample by including the keyword

*C*in each poison prompt so that it targets only the parameters associated with*C*.Minimize conflicts among different pairs, and thus the overlap of their contributions to the perturbation norm, by creating original images of the target concept

*T*using another

If Φ is the feature extractor of the victim model and Ω is that of the poison image model, than a valid image *x* corresponding to the poison prompt can be perturbed by *δₓ* so that it is poisoned into Ω(*x*, *T*). This can be calculated using Glaze8.

Nightshade produces stronger poisoning effects than previous techniques and they bleed through to related concepts. They also stack if concepts are combined into a single prompt. Furthermore, if many Nightshade attacks target different prompts on the same model, general features will become corrupted and its image generation function eventually collapses!

It is not difficult to imagine that all platforms that host media will protect their content with adversarial techniques like these in the future. I would love to learn more if you’re building or researching in this space—reach out at wngjj[dot]61[at]gmail.com. Until then, I will be thinking about how every new offensive threat could be a new defensive opportunity. ∎

Volume generalizes as distance raised to the power of the dimension, so increasingly more volume is on the edges at high dimensions. Here, the edges are decision boundaries.

This can be arbitrarily large for sufficiently high

*n*since*l*͚ does not grow with dimensionality.Defining

*f*as the complement probability also enables it to be combined with*D*into a single minimization in the next step.The first term represents the probability of the class of which

*C*predicts*x'*is part and the second term represents the probability that*C*classifies*x'*as*t*instead. Note that*Z*is not a probability itself but a softmax input.Since

*tanh*( • ) ∈ [-1, 1], this is equivalent to*x*+*δ*∈ [0, 1]*ⁿ*.This is a very large output space. Classifiers have limited output classes and therefore do not have to retain as many features.

Shan et al. worked closely with professional artists and the Glaze paper includes their perspectives in §3.1, which I recommend reading.

In practice, Nightshade uses prompts from a valid dataset of text/image pairs to easily find

*x*.

*This article was originally published in Black Box, the author's newsletter. Subscribe **here**. *

*This post assumes a conceptual understanding of neural network training, basic linear algebra, and familiarity with mathematical notation. For a refresher, see this **generative AI primer** that I created.*

I have been interested in protecting data and content from AI models since I proposed accelerating model collapse to preserve API access. I think that defensively using offensive techniques has a lot of potential, most of which is unrealized due to the nascency of this research. So when Nightshade, the data poisoning tool, was announced a few weeks ago, I was eager to dig into the paper.

What I quickly realized is in this branch of machine learning, there is a gap in knowledge between casual interest and understanding research. Luckily, I studied math alongside economics in college and had previously reviewed the basics of generative AI. It was time for a deep dive.

### Adversarial ML

Nightshade belongs to a field called adversarial machine learning, which is the study of attacks on ML models and defenses against such attacks. To be specific, Nightshade is a data poisoning attack, which modifies the training data of a model so that it produces errors. This is related to another kind of attack called evasion, which modifies the inputs to make a model — usually a classifier — systematically produce wrong outputs, e.g., misclassify. Much adversarial ML research has been on evading image classifiers; Nightshade builds on this body of work.

Adversarial attacks can be further categorized as either

White box, in which the attacker has access to a model’s parameters; or black box (more on that in a second).

Targeted, in which the attacker is aiming to produce errors of a specific class; or non-targeted. Most practical applications are targeted.

While the nature of adversarial spaces is unclear, the leading hypothesis is that they result from practical limitations. Since a model cannot be trained on the entire input space, the distributions of the training and test data will not match exactly. Adversarial space is the difference, and so inputs in that space are misclassified and become adversarial examples.

Adversarial ML research has “traditionally” focused on white box attacks as they can transfer to black box models. This is possible since models trained on similar data are likely to have adversarial space that partially overlap. In fact, this probability increases with dimensionality due to a property called distance concentration1. ML models are also inherently prone to behaving similarly since they are designed to generalize across training datasets and architecture. Similar classifiers should therefore partition their input space into classes with decision boundaries that are close to each other.

### Distance measures

What “close” means matters a lot in adversarial ML because attacks that are obviously manipulated will be caught. Instead of cosine similarity, the most common distance measures used in adversarial ML are *p*-norms, which are written as || • ||*ₚ *in literature. Given a pair of vectors, their

*l*₁*l*₂ norm or Euclidean distance is given by the Pythagorean theorem. It is often used for continuous data.*lₚ*norm or Minkowski distance generalizes this to vectors of any finite order*p*.*l*͚

Speaking of literature, there are two foundational papers in adversarial ML that are worth understanding as background for Nightshade. I review these in turn.

### Fast Gradient Sign Method (2015)

An adversarial example takes advantage of the fact that a classifier *C* has a tolerance so that a perturbation *η* to an input *x* below a threshold *ϵ* maps to the same class *y*, i.e., *C*(*x*) = *C*(*x’*) = *y* where *x’* = *x* + *η *and ||*η*|| ͚ < *ϵ*.

For intuition, consider the activation of a node in a neural network. Given a weight vector *w* and an adversarial example *x’*, the dot product is *w*ᵀ*x’* = *w*ᵀ*x + w*ᵀ*η*. The perturbation *w*ᵀ*η* can then be maximized2 if *η *= *ϵsign*(*w*).

This is a linear perturbation, but neural networks are generally nonlinear. However, Goodfellow et al. hypothesize that they behave linearly enough — at least locally — that they are susceptible to linear perturbations. (This can easily be shown, which they do in the rest of the paper.)

Their strategy is based on gradients. In training, backpropagation updates *w* to minimize the loss function *J* given an input *x. *A gradient-based attack is effectively the inverse as it holds *w* constant and perturbs *x* to maximize *J*, subject to some *ϵ*.

For a model with parameters *θ*, the optimal perturbation is therefore

which Goodfellow et al. call the Fast Gradient Sign Method, a non-targeted white box attack. But what about targeted attacks?

### Carlini and Wagner (2017)

Carlini and Wagner propose the following to calculate a targeted white box attack on a classifier *C*. Let *C**(*x*) be the correct class for a valid input *x* and *t* ≠ *C**(*x*) be the target class. Finding an adversarial example *x’* = *x* + *δ* is then stated as *min*(*D*(*x*, *x* + *δ*)) subject to *C*(*x* + *δ*) = *t* and *x* + *δ* ∈ [0, 1]*ⁿ*, where *D* is a distance measure and *n* is the dimension of *x*.

This is difficult to solve directly since *C* is nonlinear, so Carlini and Wagner define an objective function *f* such that *C*(*x* + *δ*) = *t *if and only if* f*(*x* + *δ*) ≤ 0. Conceptually, 1 — *f* is how close *x* + *δ *is to being classified as *t, *which makes *f* a loss function3.

If *D* is a *p*-norm, then the problem can be restated as *min*(||*δ*||*ₚ + cf*(*x* + *δ*)) subject to *x* + *δ* ∈ [0, 1]*ⁿ*, where *c* > 0 is a weighing factor for the loss term. Carlini and Wagner present several candidates for *f* and empirically find that the best is

where *Z*( • ) gives the output from all of a neural network’s layers before the softmax function — a generalization of the logistic function to *n* dimensions — and *k* is a confidence threshold (as a lower limit to the loss function). The *Z* expression is essentially the difference between what the classifier thinks *x’* is and what the attacker wants the classifier to think it is4.

Carlini and Wagner make one more transformation to accommodate for the fact that many optimization algorithms are not bounded. They apply a change of variable5 on *δ *and introduce *w *such that* δ = *½(*tanh*(*w*) + 1) — *x.* The final optimization problem is then

which they solve using the Adam optimizer. Carlini and Wagner’s approach can quickly generate highly robust targeted white box attacks.

### Glaze (2023)

Nightshade is an extension of prior work by Shan et al. that they call Glaze, a style cloaking technique that applies Carlini and Wagner’s optimization to text-to-image generators. These are harder to evade than image classifiers because they retain more of the features extracted from training images in order to generate original images6:

- During training, a text-to-image generator takes in an image *x* and uses a feature extractor Φ to produce its feature vector Φ(*x*).
- Simultaneously, an encoder *E* takes a corresponding text caption *s* and produces a predicted feature vector *E*(*s*).
- The parameters of *E* are optimized in training so that *E*(*s*) = Φ(*x*).
- At generation time, a user passes a text prompt *sᵢ* into *E*, and a decoder *F* decodes *E*(*sᵢ*) to produce the generated image *xᵢ* = *F*(*E*(*sᵢ*)).
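The training step above can be sketched with toy linear maps standing in for Φ and *E* — all shapes, names, and learning rates here are invented for illustration, and real models are deep networks rather than matrices:

```python
import numpy as np

rng = np.random.default_rng(0)

# Invented stand-ins: Phi maps an 8-dim "image" to a 4-dim feature
# vector, and the encoder maps a 6-dim caption embedding into the
# same feature space.
Phi = rng.normal(size=(4, 8))    # frozen feature extractor, plays Φ
W_E = rng.normal(size=(4, 6))    # trainable encoder parameters

x = rng.normal(size=8)                           # image
s = rng.normal(size=6); s /= np.linalg.norm(s)   # caption embedding

def extract(x):   return Phi @ x   # Φ(x)
def encode(s, W): return W @ s     # E(s)

# Gradient descent on ||E(s) - Φ(x)||^2 drives E(s) toward Φ(x),
# which is the alignment the training loop enforces.
target = extract(x)
lr = 0.05
for _ in range(300):
    err = encode(s, W_E) - target          # residual in feature space
    W_E -= lr * 2 * np.outer(err, s)       # d/dW ||Ws - t||^2 = 2(Ws - t)s^T
print(np.linalg.norm(encode(s, W_E) - target))  # ~0 once E(s) matches Φ(x)
```

The key point for Glaze is that the image only influences training through Φ(*x*) — so shifting an image's features shifts what the model learns from it.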

Shan et al. focus on style mimicry attacks, where a text-to-image generator is used to create art in the style of a particular artist without their consent. Existing protective techniques rely on image cloaks designed for classifiers; these are ineffective against style mimicry because they shift all of an image’s features instead of focusing on only those related to style.

Since it is difficult to explicitly identify and separate style features, Shan et al. do so implicitly by using another feature extractor Ω to transfer artwork *x* into a target style *T* that is different from the artist’s, Ω(*x*, *T*). (*T* is selected based on the distances between the centroid of Φ(*x*) and those of the feature spaces of candidate styles.) Calculating the style cloak *δₓ* can then be stated as the optimization *min*(*D*(Φ(*x* + *δₓ*), Φ(Ω(*x*, *T*)))) subject to |*δₓ*| < *p*, where *D* is a distance measure and *p* is the perturbation budget.

Note that since Glaze is a black box evasion attack, the same model can act as both Φ and Ω. In other words, Glaze can use DALL-E, Midjourney, Stable Diffusion, etc. against themselves! I found this to be particularly satisfying as it is the ultimate form of defensively using offensive techniques, at least from the perspective of the artists.

Following Carlini and Wagner, Shan et al. then fold the constraint into the optimization problem

*min*(||Φ(*x* + *δₓ*) − Φ(Ω(*x*, *T*))||₂ + *α* · *max*(LPIPS(*δₓ*) − *p*, 0))

where *α* > 0 is a weighing factor for the loss term, LPIPS is a measure of the perceived distortion, and *D* is instantiated as the *l*₂ norm.
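A minimal sketch of this optimization, with a random linear map standing in for Φ and the *l*₂ norm of the perturbation as a crude proxy for LPIPS — both are assumptions of this sketch, not the paper's actual setup:

```python
import numpy as np

rng = np.random.default_rng(1)

# Invented stand-ins: a random linear map plays Φ, a random vector
# plays the style-transferred image Ω(x, T), and ||δ||₂ substitutes
# for the LPIPS perceptual distance.
Phi = rng.normal(size=(5, 10))
x = rng.uniform(size=10)            # original artwork, flattened
x_style = rng.uniform(size=10)      # Ω(x, T): x rendered in style T
target = Phi @ x_style              # feature target Φ(Ω(x, T))

alpha, p, lr = 10.0, 0.4, 0.02      # penalty weight, budget, step size
delta = np.zeros_like(x)
for _ in range(2000):
    feat_err = Phi @ (x + delta) - target
    # Gradient of ||Φ(x+δ) - target||₂ with respect to δ.
    grad = Phi.T @ feat_err / (np.linalg.norm(feat_err) + 1e-9)
    if np.linalg.norm(delta) > p:   # hinge penalty active past budget p
        grad += alpha * delta / np.linalg.norm(delta)
    delta -= lr * grad
print(np.linalg.norm(Phi @ (x + delta) - target),  # feature gap shrinks
      np.linalg.norm(delta))                       # perturbation hovers near p
```

The hinge term only fires once the perturbation exceeds the budget, so the cloak moves the image's features toward the target style while staying (approximately) imperceptible.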

Glaze’s style cloak was empirically successful in protecting art from being learned by Stable Diffusion and DALL-E, as judged by artists⁷. It helps that artists are willing to accept a fairly large *p* because their existing methods of protection (e.g., watermarks) are quite disruptive. However, Glaze can only protect new art, since most existing artwork is already part of training data.

### Nightshade (2023)

Nightshade goes beyond cloaking and damages the offending model itself. It takes advantage of the fact that text-to-image generators exhibit concept sparsity despite having large training datasets. That is, a very small portion of the training data contains a given term or its semantically related terms. As a result, Shan et al. hypothesize that text-to-image generators are much more vulnerable to data poisoning (at the concept level) than is commonly believed.

They prove this by proposing Nightshade, a prompt-specific data poisoning attack based on mismatched text/image pairs. To minimize the number of poison samples needed, which makes the attack easier to mount, Nightshade follows two design principles:

- Maximize the effect of each sample by including the keyword *C* in each poison prompt so that it targets only the parameters associated with *C*.
- Minimize conflicts among different pairs, and thus the overlap of their contributions to the perturbation norm, by creating original images of the target concept *T* using another text-to-image generator.

If Φ is the feature extractor of the victim model and Ω is that of the poison image model, then a valid image *x* corresponding to the poison prompt can be perturbed by *δₓ* so that it is poisoned into Ω(*x*, *T*). This can be calculated using Glaze⁸.
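The two principles can be sketched as a data-assembly step. Here `perturb` is a hypothetical stand-in for the Glaze optimization, operating directly on pixel vectors rather than in feature space; the concept names, shapes, and budget are all invented for illustration:

```python
import numpy as np

rng = np.random.default_rng(2)

def perturb(x, x_target, budget=0.3):
    # Sketch: nudge a valid image x of concept C toward an anchor image
    # of target concept T, capping the change at `budget` (the real
    # attack optimizes in feature space via Glaze, not pixel space).
    step = x_target - x
    n = np.linalg.norm(step)
    return x + step * min(1.0, budget / n)

concept, target_concept = "dog", "cat"

# Valid text/image pairs for concept C (images as flat vectors here),
# plus anchor images of the target concept T from another generator.
clean = [(f"a photo of a {concept}", rng.uniform(size=8)) for _ in range(5)]
anchors = [rng.uniform(size=8) for _ in range(5)]

# Poison pairs keep the concept-C prompt (principle 1) but pair it
# with an image pulled toward concept T (principle 2).
poison = [(prompt, perturb(img, a)) for (prompt, img), a in zip(clean, anchors)]
print(len(poison), poison[0][0])
```

Each pair still looks like a valid "dog" sample to a data pipeline, which is what lets the mismatch slip into training sets.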

Nightshade produces stronger poisoning effects than previous techniques, and these effects bleed through to related concepts. They also stack when multiple poisoned concepts are combined into a single prompt. Furthermore, if many Nightshade attacks target different prompts on the same model, general features become corrupted and its image generation function eventually collapses!

It is not difficult to imagine that all platforms that host media will protect their content with adversarial techniques like these in the future. I would love to learn more if you’re building or researching in this space—reach out at wngjj[dot]61[at]gmail.com. Until then, I will be thinking about how every new offensive threat could be a new defensive opportunity. ∎

1. Volume generalizes as distance raised to the power of the dimension, so increasingly more volume is on the edges at high dimensions. Here, the edges are decision boundaries.
2. This can be arbitrarily large for sufficiently high *n* since *l*∞ does not grow with dimensionality.
3. Defining *f* as the complement probability also enables it to be combined with *D* into a single minimization in the next step.
4. The first term represents the probability of the class of which *C* predicts *x'* is part, and the second term represents the probability that *C* classifies *x'* as *t* instead. Note that *Z* is not a probability itself but a softmax input.
5. Since *tanh*( • ) ∈ [-1, 1], this is equivalent to *x* + *δ* ∈ [0, 1]*ⁿ*.
6. This is a very large output space. Classifiers have limited output classes and therefore do not have to retain as many features.
7. Shan et al. worked closely with professional artists, and the Glaze paper includes their perspectives in §3.1, which I recommend reading.
8. In practice, Nightshade uses prompts from a valid dataset of text/image pairs to easily find *x*.