Adversarial Deepfakes

Deepfakes, or facially manipulated videos, can be used maliciously to spread disinformation, harass individuals, or defame famous personalities. Recently developed Deepfake detection methods rely on Convolutional Neural Network (CNN)-based classifiers to distinguish AI-generated fake videos from real videos. In this work, we demonstrate that such detectors can be bypassed by adversarially modifying fake videos synthesized with existing Deepfake generation methods. We design adversarial examples for the FaceForensics++ dataset to fool Deepfake detectors.

Methodology

We propose attacks that target Deepfake detectors relying on CNN-based classification models. The victim detectors used in our experiments operate at the frame level and classify each frame independently as either real or fake using the following two-step pipeline:

1. A face tracking model extracts the bounding box of the face in a given frame.
2. The cropped face is resized appropriately and passed as input to a CNN-based classifier, which labels it as either real or fake. In our work, we consider two victim CNN classifiers: XceptionNet and MesoNet.
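The two-step pipeline above can be sketched roughly as follows. This is a minimal, dependency-free illustration of the data flow only: the bounding-box format, the helper names (`crop_face`, `resize_nearest`, `classify_frame`), and the brightness-based `toy_classifier` are all hypothetical stand-ins, not the face tracker or the XceptionNet/MesoNet models used in the experiments.

```python
from typing import Callable, List, Tuple

Image = List[List[float]]        # grayscale frame: rows of pixel values in [0, 1]
Box = Tuple[int, int, int, int]  # assumed format: (top, left, height, width)


def crop_face(frame: Image, box: Box) -> Image:
    """Keep only the pixels inside the face bounding box from step 1."""
    top, left, height, width = box
    return [row[left:left + width] for row in frame[top:top + height]]


def resize_nearest(img: Image, out_h: int, out_w: int) -> Image:
    """Resize to the classifier's input size with nearest-neighbour sampling."""
    in_h, in_w = len(img), len(img[0])
    return [
        [img[r * in_h // out_h][c * in_w // out_w] for c in range(out_w)]
        for r in range(out_h)
    ]


def classify_frame(frame: Image, box: Box,
                   classifier: Callable[[Image], float], size: int = 4) -> str:
    """Step 2: crop, resize, and threshold the classifier's P(fake) at 0.5."""
    face = resize_nearest(crop_face(frame, box), size, size)
    return "fake" if classifier(face) >= 0.5 else "real"


def toy_classifier(face: Image) -> float:
    """Stand-in for the CNN: treats mean brightness as P(fake)."""
    pixels = [p for row in face for p in row]
    return sum(pixels) / len(pixels)


# An 8x8 frame with a bright 4x4 "face" region in the middle.
frame = [[1.0 if 2 <= r < 6 and 2 <= c < 6 else 0.0 for c in range(8)]
         for r in range(8)]
label = classify_frame(frame, (2, 2, 4, 4), toy_classifier)
print(label)
```

Because each frame is classified independently, an attacker only needs to perturb the cropped face region of every frame to change the video-level outcome.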

To fool such detectors into classifying fake videos as real, we craft an adversarial example for each frame of the given video and combine them into an adversarially modified fake video. We perform the attack in both white-box and black-box settings, which assume different attacker capabilities and goals, and evaluate the effectiveness of the attacks on both raw and compressed adversarial videos.

White Box Attacks

In this setting, we assume the attacker has complete knowledge of the detector model's architecture and parameters. We craft adversarial examples using iterative gradient-sign attacks. For our robust white-box attack, we use Expectation over Transforms (EOT) to craft adversarial videos that remain effective after video and image compression. The following example videos show our white-box attacks on XceptionNet.

[Videos: Fake (from dataset) | White-box | Robust white-box]
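As a rough illustration of the white-box setting, the sketch below runs an iterative gradient-sign attack against a toy detector, once with the plain gradient and once with the gradient averaged over a set of input transforms (the idea behind Expectation over Transforms). Everything here is an assumption made for illustration: the detector is a hypothetical four-weight logistic model (`W`, `B`), circular shifts stand in for compression-like transforms, the expectation is computed exactly over three shifts rather than sampled, and the step sizes are arbitrary. It is not the loss or the transform set used in the paper.

```python
import math
from typing import Callable, List

Vector = List[float]

# Hypothetical white-box detector: logistic regression over a flattened face crop.
W: Vector = [2.0, -1.5, 1.8, 1.2]
B = -1.9
SHIFTS = (-1, 0, 1)  # toy transform set: small circular shifts of the input


def shift(x: Vector, k: int) -> Vector:
    """Circularly shift x by k positions (stand-in for a codec-like transform)."""
    n = len(x)
    return [x[(i - k) % n] for i in range(n)]


def p_fake(x: Vector) -> float:
    """Detector output: probability that the input is fake."""
    logit = sum(w * xi for w, xi in zip(W, x)) + B
    return 1.0 / (1.0 + math.exp(-logit))


def grad_plain(x: Vector) -> Vector:
    """Gradient of the fake logit w.r.t. x (constant for this linear model)."""
    return list(W)


def grad_eot(x: Vector) -> Vector:
    """Gradient of the fake logit averaged over all transforms in SHIFTS."""
    total = [0.0] * len(x)
    for k in SHIFTS:
        # Chain rule through the shift: undo the shift on the inner gradient.
        g = shift(grad_plain(shift(x, k)), -k)
        total = [t + gi for t, gi in zip(total, g)]
    return [t / len(SHIFTS) for t in total]


def iterative_gradient_sign(x0: Vector, grad_fn: Callable[[Vector], Vector],
                            eps: float = 0.1, alpha: float = 0.02,
                            steps: int = 10) -> Vector:
    x = list(x0)
    for _ in range(steps):
        g = grad_fn(x)
        # Step against the gradient sign to lower the fake score.
        x = [xi - alpha * ((gi > 0) - (gi < 0)) for xi, gi in zip(x, g)]
        # Project back into the eps-ball around x0 and the valid pixel range.
        x = [min(max(xi, oi - eps, 0.0), oi + eps, 1.0) for xi, oi in zip(x, x0)]
    return x


x0 = [0.6, 0.6, 0.6, 0.6]
x_plain = iterative_gradient_sign(x0, grad_plain)
x_eot = iterative_gradient_sign(x0, grad_eot)

for name, x in [("original", x0), ("white-box", x_plain), ("robust white-box", x_eot)]:
    scores = ", ".join(f"{p_fake(shift(x, k)):.2f}" for k in SHIFTS)
    print(f"{name:>16}: P(fake) under shifts {SHIFTS} = {scores}")
```

In this toy setup, the plain attack pushes the fake probability below 0.5 only for the unshifted input, while the averaged-gradient variant stays below 0.5 under all three shifts. The same perturbation budget `eps` is spent on a direction that survives the transforms, which is the intuition behind crafting examples that survive compression.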

Black Box Attacks

In this setting, we assume the attacker knows the structure of the detector pipeline but can only query the classification CNN as a black box to obtain the probability of a frame being real or fake. We use Natural Evolution Strategies (NES) to estimate the gradient of the output probabilities with respect to the input, and use this estimate to craft adversarial examples. As in the white-box setting, we make the adversarial examples robust to compression by enforcing robustness to input transformations during the attack optimization. The following example videos show our black-box attacks on XceptionNet.

[Videos: Fake (from dataset) | Black-box | Robust black-box]
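The black-box procedure can be sketched along the same lines. The example below estimates the gradient of the fake probability from queries alone, using NES with antithetic Gaussian samples, and then applies iterative gradient-sign steps. The oracle `query_p_fake` hides a hypothetical four-weight logistic model, and the values of `sigma`, `pairs`, `eps`, `alpha`, and `steps` are arbitrary choices for this toy problem, not the settings used in the experiments.

```python
import math
import random
from typing import Callable, List

Vector = List[float]


def query_p_fake(x: Vector) -> float:
    """Black-box oracle: the attacker sees only this probability, not the weights."""
    w, b = [2.0, -1.5, 1.8, 1.2], -1.9  # hidden, hypothetical detector parameters
    logit = sum(wi * xi for wi, xi in zip(w, x)) + b
    return 1.0 / (1.0 + math.exp(-logit))


def nes_gradient(query: Callable[[Vector], float], x: Vector, rng: random.Random,
                 sigma: float = 0.01, pairs: int = 100) -> Vector:
    """Estimate the gradient of query at x using antithetic Gaussian samples."""
    grad = [0.0] * len(x)
    for _ in range(pairs):
        delta = [rng.gauss(0.0, 1.0) for _ in x]
        # Query the oracle at x + sigma*delta and at the mirrored point.
        f_plus = query([xi + sigma * d for xi, d in zip(x, delta)])
        f_minus = query([xi - sigma * d for xi, d in zip(x, delta)])
        grad = [g + d * (f_plus - f_minus) for g, d in zip(grad, delta)]
    # Each pair contributes two samples, hence the factor 2 * pairs.
    return [g / (2 * pairs * sigma) for g in grad]


def nes_attack(query: Callable[[Vector], float], x0: Vector, eps: float = 0.1,
               alpha: float = 0.02, steps: int = 20, seed: int = 0) -> Vector:
    rng = random.Random(seed)
    x = list(x0)
    for _ in range(steps):
        g = nes_gradient(query, x, rng)
        # Step against the estimated gradient sign to lower the fake score.
        x = [xi - alpha * ((gi > 0) - (gi < 0)) for xi, gi in zip(x, g)]
        # Project back into the eps-ball around x0 and the valid pixel range.
        x = [min(max(xi, oi - eps, 0.0), oi + eps, 1.0) for xi, oi in zip(x, x0)]
    return x


x0 = [0.6, 0.6, 0.6, 0.6]
x_adv = nes_attack(query_p_fake, x0)
print(f"P(fake) before: {query_p_fake(x0):.2f}, after: {query_p_fake(x_adv):.2f}")
```

Each gradient estimate here costs `2 * pairs` queries, so the query budget grows with both the number of sample pairs and the number of steps. One plausible way to obtain a robust variant, under the same assumptions, is to average the queried probability over transformed copies of the input before estimating the gradient.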