Week 10: Adversarial Machine Learning – Vulnerabilities (Part II)
Explanation, Detection & Defence
COMP90073
Security Analytics
Yi Han, CIS
Semester 2, 2021
Overview
• Adversarial machine learning beyond computer vision
– Audio
– Natural language processing (NLP)
– Malware detection
• Why are machine learning models vulnerable?
– Insufficient training data
– Unnecessary features
• How to defend against adversarial machine learning?
– Data-driven defences
– Learner robustification
• Challenges
Audio Adversarial Examples
• Speech recognition system
– Recurrent Neural Networks
– Audio waveform → a sequence of probability distributions over individual characters
– Challenge: alignment between the input and the output
• Exact location of each character in the audio file
E.g., the alignment HEEELLLLOO over time steps t0, t1, …
[Table: per-time-step character probabilities]
       t0    t1   …
  a    0.5   0.3  …
  b    0.2   0.4  …
  c    0.1   0.2  …
  …    …     …    …
https://distill.pub/2017/ctc/
Audio Adversarial Examples
• Connectionist Temporal Classification (CTC)
– Encoding
• Introduce a special character called blank, denoted as “–”
• Y′ = B(Y): modify the ground truth text (Y) by (1) inserting “–”, (2) repeating characters, in all possible ways
• A blank character must be inserted between duplicate characters
• E.g.,
– Input X has a length of 10, and Y = [h, e, l, l, o]
– Valid: heeell–llo, hhhh–el–lo, heell–looo
– Invalid: hhee–llo–o (no blank between the two l's), heel–lo (wrong length)
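To make the encoding concrete, here is a small, self-contained sketch (function names are mine, not from the CTC literature) that applies the collapse mapping B and checks whether an alignment is valid:

```python
# A minimal sketch of the CTC collapse mapping and a validity check for
# alignments; "collapse" and "is_valid" are illustrative names.
def collapse(alignment: str, blank: str = "-") -> str:
    """Merge consecutive duplicate characters, then drop blanks."""
    out, prev = [], None
    for ch in alignment:
        if ch != prev:              # merge consecutive duplicates
            out.append(ch)
        prev = ch
    return "".join(c for c in out if c != blank)

def is_valid(alignment: str, target: str, length: int = 10) -> bool:
    """An alignment is valid iff it has the right length and collapses to Y."""
    return len(alignment) == length and collapse(alignment) == target

print(is_valid("heeell-llo", "hello"))   # True
print(is_valid("hhee-llo-o", "hello"))   # False: "ll" needs a blank between
print(is_valid("heel-lo", "hello"))      # False: wrong length
```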
Audio Adversarial Examples
• Connectionist Temporal Classification (CTC)
– Loss function
  • Calculate the score for each Y′ and sum them up:
    p(Y|X) = Σ_{Y′} Π_{i=1}^{|X|} p_i(y′_i | X)
  • Loss = negative log likelihood of the sum:
    CTC-Loss = −log Σ_{Y′} Π_{i=1}^{|X|} p_i(y′_i | X)
– Decoding (from the per-time-step probabilities)
  • Pick the character with the highest score for each time step
  • Remove duplicate characters, then remove blanks
  • E.g., HEE–LL–LOO → HE–L–LO → HELLO
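A minimal sketch of this greedy (best-path) decoding, with an illustrative alphabet and made-up per-time-step probabilities:

```python
import numpy as np

# Greedy CTC decoding: argmax per time step, merge repeats, drop blanks.
alphabet = ["-", "h", "e", "l", "o"]          # index 0 is the blank

# probs[t, c]: probability of character c at time step t (made-up numbers)
probs = np.array([
    [0.1, 0.6, 0.1, 0.1, 0.1],   # h
    [0.1, 0.1, 0.6, 0.1, 0.1],   # e
    [0.6, 0.1, 0.1, 0.1, 0.1],   # -
    [0.1, 0.1, 0.1, 0.6, 0.1],   # l
    [0.6, 0.1, 0.1, 0.1, 0.1],   # -
    [0.1, 0.1, 0.1, 0.6, 0.1],   # l
    [0.1, 0.1, 0.1, 0.1, 0.6],   # o
])

best_path = probs.argmax(axis=1)              # highest-scoring char per step
decoded, prev = [], None
for idx in best_path:                          # merge repeats, drop blanks
    if idx != prev and idx != 0:
        decoded.append(alphabet[idx])
    prev = idx
print("".join(decoded))                        # -> "hello"
```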
Audio Adversarial Examples
• Computer vision domain:
– arg min_{δ ∈ [0,1]^d} ‖δ‖ + c · f(x + δ)
– C(x + δ) = l_target ⇔ f(x + δ) = f(x′) ≤ 0
• Audio adversarial examples against speech recognition systems [14]
– How to measure the perturbation δ?
  • Measure δ in decibels (dB): dB(x) = max_i 20·log₁₀(x_i)
  • dB_x(δ) = dB(δ) − dB(x)
– How to construct the objective function?
  • Choose CTC-Loss(x′; y_target) as the function f
  • C(x + δ) = y_target ⇐ f(x + δ) = f(x′) ≤ 0
  • C(x + δ) = y_target ⇏ f(x + δ) = f(x′) ≤ 0
  • The solution will still be adversarial, but may not be minimally perturbed
– Examples: https://nicholas.carlini.com/code/audio_adversarial_examples
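A hedged sketch of the attack's inner loop, assuming PyTorch; the linear `model` is a random stand-in for a real speech-to-text network, and the loudness penalty is a simple proxy for the dB measure above:

```python
import torch

# Minimise the CTC loss of the target transcription plus a perturbation
# penalty, in the spirit of [14]; all sizes and weights are illustrative.
torch.manual_seed(0)
T, C = 50, 29                                  # time steps, characters
model = torch.nn.Linear(1, C)                  # stand-in per-sample "model"
x = torch.rand(T)                              # original audio (stand-in)
target = torch.tensor([[8, 5, 12, 12, 15]])    # target transcription indices
delta = torch.zeros_like(x, requires_grad=True)
opt = torch.optim.Adam([delta], lr=0.01)
ctc = torch.nn.CTCLoss(blank=0)

for step in range(100):
    logits = model((x + delta).unsqueeze(1))             # (T, C)
    log_probs = logits.log_softmax(1).unsqueeze(1)       # (T, N=1, C)
    loss_f = ctc(log_probs, target,
                 torch.tensor([T]), torch.tensor([5]))   # CTC-Loss(x'; y_target)
    loudness = delta.abs().max()                         # proxy for dB(delta)
    loss = loss_f + 0.1 * loudness                       # "c" weighs the terms
    opt.zero_grad(); loss.backward(); opt.step()
```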
NLP
• Deep text classification
– Character level [15]
• Every character is represented using one-hot encoding
• 6 convolutional layers + 3 fully-connected layers
NLP
• Deep Text Classification Can be Fooled [16]
– Identify text items that contribute most to the classification
– Contribution measured by the gradient ∂f_true(x)/∂x, x: training sample, f_true: the model's output for the true class
– Hot character: a character whose one-hot dimensions have the largest gradient magnitudes
– Hot word: a word containing ≥ 3 hot characters
– Hot phrase: a single hot word, or adjacent hot words
– Hot Training/Sample Phrase: a hot phrase that occurs most frequently in the training data / test sample
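A minimal sketch of locating hot characters, assuming PyTorch; the tiny convolutional classifier and the vocabulary size are stand-ins, not the network from [16]:

```python
import torch

# Back-propagate the true-class score to the one-hot input and rank
# character positions by gradient magnitude; the largest are "hot".
torch.manual_seed(0)
vocab, seq_len = 70, 32
x = torch.zeros(1, vocab, seq_len)             # one-hot encoded text
x[0, torch.randint(vocab, (seq_len,)), torch.arange(seq_len)] = 1.0
x.requires_grad_(True)

classifier = torch.nn.Sequential(              # stand-in classifier
    torch.nn.Conv1d(vocab, 16, 3, padding=1), torch.nn.ReLU(),
    torch.nn.Flatten(), torch.nn.Linear(16 * seq_len, 4))

true_class = 2
score = classifier(x)[0, true_class]
score.backward()                               # d f_true / d x

per_char = x.grad.abs().sum(dim=1).squeeze(0)  # magnitude per position
hot_positions = per_char.topk(5).indices       # candidate hot characters
print(hot_positions)
```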
NLP
• Deep Text Classification Can be Fooled [16]
– Given text x, 𝐶𝐶 𝑥𝑥 + 𝛿𝛿 = 𝑡𝑡𝑡𝑡𝑡𝑡𝑡𝑡𝑡𝑡𝑡𝑡𝑡𝑡
– Insertion
• What to insert: Hot Training Phrases of the target class
• Where to insert: near Hot Sample Phrases of the original class
NLP
• Deep Text Classification Can be Fooled [16]
– Modification: replace characters in Hot Sample Phrases with
• common misspellings, or
• visually similar characters
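A toy sketch of this modification strategy; the misspelling and homoglyph tables below are illustrative, not from [16]:

```python
# Swap characters in a hot phrase for a common misspelling or a
# visually similar glyph; tables are illustrative examples only.
homoglyphs = {"l": "1", "o": "0", "a": "а"}     # latin "a" -> cyrillic "а"
misspellings = {"film": "flim", "because": "becuase"}

def modify(text: str, hot_phrase: str) -> str:
    if hot_phrase in misspellings:               # prefer a known misspelling
        return text.replace(hot_phrase, misspellings[hot_phrase])
    # otherwise perturb the first swappable character with a look-alike
    swapped = "".join(homoglyphs.get(c, c) for c in hot_phrase[:1]) + hot_phrase[1:]
    return text.replace(hot_phrase, swapped)

print(modify("a classic war film", "film"))     # -> "a classic war flim"
```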
NLP
• Deep Text Classification Can be Fooled [16]
– Removal: inessential adjectives or adverbs in Hot Sample Phrases are removed
• Less effective
• Only downgrades the confidence of the original class
NLP
• Deep Text Classification Can be Fooled [16]
– Combination of the three strategies
– Limitation: all perturbations are crafted manually
Malware Detection
• Attacking malware classifier for mobile phones [7]
– An application is represented by a binary vector X ∈ {0, 1}^d
• 1: the app has the feature, 0: the app doesn't have the feature
• E.g., chat app: contacts ✓, storage ✓, calendar ✗ → [1, 1, 0]
– Classifier: feed-forward neural network
  F(X) = [F₀(X), F₁(X)], F₀(X) + F₁(X) = 1, 0: benign, 1: malicious
  [Figure: fully-connected network mapping the d input features to outputs F₀ and F₁]
  Benign if F₀(X) > F₁(X), malicious otherwise
Evasion attacks (application)
• Attacking malware classifier for mobile phones [7]
– Attack goal: make a malicious application classified as benign
– Limit: only add features to avoid destroying app functionalities
– For each iteration:
• Step 1: compute the gradient of F w.r.t. X:
  ∂F_k(X)/∂X_j, for k ∈ {0, 1}, j ∈ [1, d]
• Step 2: change a feature X_i to 1 such that (1) X_i = 0, and (2) it has the maximal positive gradient towards the target class 0:
  i = arg max_{j ∈ [1, d], X_j = 0} ∂F₀(X)/∂X_j
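A hedged sketch of this iterative attack, assuming PyTorch; the two-layer network stands in for the trained malware classifier, and the feature count is illustrative:

```python
import torch

# Repeatedly flip the zero-valued feature whose gradient most increases
# the benign score F0, in the spirit of [7]; sizes are illustrative.
torch.manual_seed(0)
d = 50
net = torch.nn.Sequential(torch.nn.Linear(d, 32), torch.nn.ReLU(),
                          torch.nn.Linear(32, 2), torch.nn.Softmax(dim=-1))
x = (torch.rand(d) > 0.7).float()              # malicious app's feature vector

for _ in range(20):                            # cap on added features
    x = x.detach().requires_grad_(True)
    F = net(x)
    if F[0] > F[1]:                            # classified benign: done
        break
    F[0].backward()                            # dF0 / dX
    grad = x.grad.clone()
    grad[x == 1] = -float("inf")               # only features we can add
    i = grad.argmax()                          # maximal positive gradient
    if grad[i] <= 0:                           # no helpful feature left
        break
    x = x.detach(); x[i] = 1.0                 # add feature i
```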
Evasion attacks (application)
[Figure: attack results; MWR = malware ratio, MR = misclassification rate]
Overview
• Adversarial machine learning beyond computer vision
– Audio
– NLP
– Malware detection
• Why are machine learning models vulnerable?
– Insufficient training data
– Unnecessary features
• How to defend against adversarial machine learning?
– Data-driven defences
– Learner robustification
• Challenges
Locations of Adversarial Samples
• Locations of adversarial samples
– Off the data manifold for the legitimate instances
– Three scenarios [1]
1. Near the boundary, but far from the “+” manifold
2. Away from the boundary, but near the manifold – in a “pocket” of the “+” manifold
3. Close to the boundary and the “−” manifold
Locations of Adversarial Samples
• Images that are unrecognisable to human eyes, but are classified by DNNs with near certainty [2]
[Figure: DNNs believe with 99.99% confidence that the above images are digits 0–9]
Overview
• Adversarial machine learning beyond computer vision
– Audio
– NLP
– Malware detection
• Why are machine learning models vulnerable?
– Insufficient training data
– Unnecessary features
• How to defend against adversarial machine learning?
– Data-driven defences
– Learner robustification
• Challenges
Explanation1: Insufficient Training Data
• Potential reason 1: insufficient training data
• An illustrative example
– x ∈ [−1, 1], y ∈ [−1, 1], z ∈ [−1, 2]
– Binary classification
• Class 1: z < x² + y³
• Class 2: z ≥ x² + y³
– x, y, z are increased in steps of 0.01 → a total of 200×200×300 = 1.2×10⁷ points
– How many points are needed to reconstruct the decision boundary?

Explanation1: Insufficient Training Data
– Randomly choose the training and test datasets
– Boundary dataset (adversarial samples are likely to be located here): x² + y³ − 0.1 < z < x² + y³ + 0.1

  Setting    | Training dataset | Test dataset
  Setting 1  | 80               | 40
  Setting 2  | 800              | 400
  Setting 3  | 8000             | 4000
  Setting 4  | 80000            | 40000

Explanation1: Insufficient Training Data
• Test result
– RBF SVMs:

  Size of the training dataset | Accuracy on its own test dataset | Accuracy on the test dataset with 4×10⁴ points | Accuracy on the boundary dataset
  80    | 100   | 92.7  | 60.8
  800   | 99.0  | 97.4  | 74.9
  8000  | 99.5  | 99.6  | 94.1
  80000 | 99.9  | 99.9  | 98.9

– Linear SVMs:

  Size of the training dataset | Accuracy on its own test dataset | Accuracy on the test dataset with 4×10⁴ points | Accuracy on the boundary dataset
  80    | 100   | 96.3  | 70.1
  800   | 99.8  | 99.0  | 85.7
  8000  | 99.9  | 99.8  | 97.3
  80000 | 99.98 | 99.98 | 99.5

• 8000 training points are only 0.067% of the 1.2×10⁷ grid points
• MNIST: 28×28 8-bit greyscale images → (2⁸)^(28×28) ≈ 1.1×10¹⁸⁸⁸ possible images
• 1.1×10¹⁸⁸⁸ × 0.067% ≫ 6×10⁵ – vastly more data than any real training set provides

Overview
• Adversarial machine learning beyond computer vision
– Audio
– NLP
– Malware detection
• Why are machine learning models vulnerable?
– Insufficient training data
– Unnecessary features
• How to defend against adversarial machine learning?
– Data-driven defences
– Learner robustification
• Challenges

Poisoning attacks
• Poison frog attacks [10]
– E.g., add a seemingly innocuous image (that is properly labeled) to a training set, and control the identity of a chosen image at test time
[Figure: target class and base class examples]

Poisoning attacks
• Generate poison data (a code sketch follows at the end of this section)
– f(x): the function that propagates an input x through the network to the penultimate layer (before the softmax layer)
– p = arg min_x ‖f(x) − f(t)‖₂² + β‖x − b‖₂²
• ‖f(x) − f(t)‖₂²: makes p move toward the target instance t in feature space and get embedded in the target class distribution
• β‖x − b‖₂²: makes p appear like a base-class instance b to a human labeller

Explanation2: Unnecessary Features
• Potential reason 2: redundant features [3]
– Classifier f = c ∘ l, l: feature extraction, c: classification
– d: similarity measure
– Features extracted by the ML classifier (X₁) ≠ features extracted by a human (X₂)

Explanation2: Unnecessary Features
• Potential reason 2: redundant features [3]
– Previous definition of adversarial attacks:
  Find x′ s.t. f₁(x) ≠ f₁(x′) and Δ(x, x′) < ε
– New definition:
  Find x′ s.t. f₁(x) ≠ f₁(x′), d₂(l₂(x), l₂(x′)) < δ₂, and f₂(x) = f₂(x′)
– {δ₂, η}-strong-robustness: if ∀x, x′ ∈ X a.e., (x, x′) satisfies
  P( f₁(x) = f₁(x′) | f₂(x) = f₂(x′), d₂(l₂(x), l₂(x′)) < δ₂ ) > 1 − η
  (i.e., f₁ agrees with the oracle f₂)
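For the poison-frog objective above, a minimal sketch assuming PyTorch; `feat` is a random convolutional stand-in for the real network's penultimate-layer map f(x), and all sizes are illustrative:

```python
import torch

# Feature-collision poisoning in the spirit of [10]: optimise p so that it
# sits near the target t in feature space while staying visually near b.
torch.manual_seed(0)
feat = torch.nn.Sequential(torch.nn.Conv2d(3, 8, 5), torch.nn.ReLU(),
                           torch.nn.AdaptiveAvgPool2d(1), torch.nn.Flatten())
for prm in feat.parameters():
    prm.requires_grad_(False)                  # the network is fixed

t = torch.rand(1, 3, 32, 32)                   # target instance
b = torch.rand(1, 3, 32, 32)                   # base-class instance
beta = 0.25

p = b.clone().requires_grad_(True)
opt = torch.optim.Adam([p], lr=0.01)
for _ in range(200):
    loss = ((feat(p) - feat(t)) ** 2).sum() + beta * ((p - b) ** 2).sum()
    opt.zero_grad(); loss.backward(); opt.step()
    p.data.clamp_(0, 1)                        # keep p a valid image
# p now looks like b to a labeller but sits near t in feature space
```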
Explanation2: Unnecessary Features
• Unnecessary features ruin strong-robustness
– If f₁ uses unnecessary features → not strong-robust
– If f₁ misses necessary features used by f₂ → not accurate
– If f₁ uses the same set of features as f₂ → strong-robust, and can be accurate
[Figure: each adversarial sample is close to the original instance in the oracle's feature space, but can be far away from it in the trained classifier's feature space – on the other side of the boundary]
Overview
• Adversarial machine learning beyond computer vision
– Audio
– NLP
– Malware detection
• Why are machine learning models vulnerable?
– Insufficient training data
– Unnecessary features
• How to defend against adversarial machine learning?
– Data-driven defence
– Learner robustification
• Challenges
Data-driven Defence
• Data-driven defence
– Filtering instances: poisoned training data and adversarial test samples either exhibit different statistical features or follow a different distribution → detection
– Injecting data: add adversarial samples into training → adversarial training
– Projecting data: project data into a lower-dimensional space; move adversarial samples closer to the manifold of legitimate samples
Data-driven Defence: Filtering Instances
• Filtering instances
• On Detecting Adversarial Perturbations [4]
– Adversary detection network: branch off the main network at some layer
– Each detector produces p_adv: the probability of the input being adversarial
– Step 1: train the main network regularly, and freeze its weights
– Step 2: generate an adversarial sample for each training data point
– Step 3: train the detectors on the balanced, binary dataset (see the sketch below)
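A hedged sketch of the three steps, assuming PyTorch; the backbone, head, and data are random stand-ins, and FGSM is used as the example-generation attack:

```python
import torch

# Train a detector branch on frozen features: clean inputs are labelled 0,
# their adversarial counterparts 1; all architectures are stand-ins.
torch.manual_seed(0)
backbone = torch.nn.Sequential(torch.nn.Linear(20, 16), torch.nn.ReLU())
head = torch.nn.Linear(16, 3)                  # main classifier
detector = torch.nn.Linear(16, 2)              # branch producing p_adv
for prm in list(backbone.parameters()) + list(head.parameters()):
    prm.requires_grad_(False)                  # Step 1: freeze main network

x = torch.rand(64, 20); y = torch.randint(3, (64,))
x_in = x.clone().requires_grad_(True)          # Step 2: FGSM adversarial copies
loss = torch.nn.functional.cross_entropy(head(backbone(x_in)), y)
x_adv = x_in + 0.1 * torch.autograd.grad(loss, x_in)[0].sign()

# Step 3: train the detector on the balanced, binary dataset
data = torch.cat([x, x_adv.detach()])
labels = torch.cat([torch.zeros(64), torch.ones(64)]).long()
opt = torch.optim.Adam(detector.parameters(), lr=1e-2)
for _ in range(100):
    out = detector(backbone(data))
    d_loss = torch.nn.functional.cross_entropy(out, labels)
    opt.zero_grad(); d_loss.backward(); opt.step()
```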
Data-driven Defence: Filtering Instances
• Adaptive/dynamic attacker: an attacker that is aware of the detection method
• Dynamic adversary training
– The attacker's loss combines the cross-entropy loss of the classifier (letting the classifier mis-label the input x) with the cross-entropy loss of the detector (making the detectors output p_adv as small as possible)

  Attacker and defender adapt to each other:
             | Static                            | Dynamic
  Defender   | Train the classifier, freeze its  | Compute adversarial examples
             | weights, precompute adversarial   | on-the-fly for each mini-batch
             | samples                           |
  Attacker   | Modify x only to maximise the     | Modify x to fool the
             | classifier's cross-entropy loss   | classifier + the detector
Data-driven Defence: Filtering Instances
• Test on CIFAR10
[Figure: detection results, plotted against σ]
Data-driven Defence
• Data-driven defence
– Filtering instances: poisoned training data and adversarial test samples either exhibit different statistical features or follow a different distribution → detection
– Injecting data: add adversarial samples into training → adversarial training
– Projecting data: project data into a lower-dimensional space; move adversarial samples closer to the manifold of legitimate samples
Data-driven Defence: Injecting Data
• Adversarial training: add adversarial samples into the training data
• Towards Deep Learning Models Resistant to Adversarial Attacks [5]
– Normally, how a classification problem is formalised:
  θ* = arg min_θ E_{(x,y)~D} [ L(x; y; θ) ]
  • Not robust: an adversary can perturb x to maximise the loss
– Redefine the loss by incorporating the adversary (augment the objective):
  θ* = arg min_θ E_{(x,y)~D} [ max_{δ ∈ [−ε,ε]^d} L(x + δ; y; θ) ]
  • Adversary (inner max): perturb x to maximise the loss
  • Defender (outer min): find model parameters θ* that minimise the “adversarial loss”
Data-driven Defence: Injecting Data
• Towards Deep Learning Models Resistant to Adversarial Attacks [5]
– Step 1: fix θ, generate adversarial samples using strong attacks (e.g., projected gradient descent, C&W):
  x_i ← clip_ε( x_{i−1} + α · sign(∂L/∂x_{i−1}) )
  (inner maximisation: find adversarial examples)
– Step 2: update θ: train the network on the augmented dataset
  (outer minimisation: optimise θ)
[Algorithm figure from [17]; note it shows only one epoch. A code sketch follows below.]
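A minimal sketch of PGD adversarial training on random stand-in data, assuming PyTorch; ε, α and k are illustrative:

```python
import torch

# One pass of PGD adversarial training [5]: the inner loop crafts x_adv,
# the outer step updates theta on the adversarial batch.
torch.manual_seed(0)
model = torch.nn.Sequential(torch.nn.Linear(10, 32), torch.nn.ReLU(),
                            torch.nn.Linear(32, 2))
opt = torch.optim.SGD(model.parameters(), lr=0.1)
eps, alpha, k = 0.1, 0.02, 7

for _ in range(50):                                  # mini-batches
    x = torch.rand(32, 10); y = torch.randint(2, (32,))
    x_adv = x.clone()
    for _ in range(k):                               # inner maximisation (PGD)
        x_adv = x_adv.detach().requires_grad_(True)
        loss = torch.nn.functional.cross_entropy(model(x_adv), y)
        grad = torch.autograd.grad(loss, x_adv)[0]
        x_adv = x_adv + alpha * grad.sign()
        x_adv = x + (x_adv - x).clamp(-eps, eps)     # clip to the eps-ball
    # outer minimisation: train on the adversarial batch
    loss = torch.nn.functional.cross_entropy(model(x_adv.detach()), y)
    opt.zero_grad(); loss.backward(); opt.step()
```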
• Towards Deep Learning Models Resistant to Adversarial Attacks [5]
Data-driven Defence: Injecting Data
Potential problem?
Data-driven Defence: Injecting Data
• Curriculum Adversarial Training (CAT) [17]
– Adversarial training overfits to the specific attack in use
– Training curriculum: train the model from weaker attacks to stronger attacks
– Attack strength: PGD(k), k: the number of iterations
[Algorithm figure; n = |D| / batch size denotes the number of mini-batches per epoch]
Data-driven Defence: Injecting Data
• Curriculum Adversarial Training (CAT) [17]
– Batch mixing
• Catastrophic forgetting [19]: a neural network tends to forget the information learned in previous tasks when training on new tasks
• Generate adversarial examples using PGD(i), i ∈ {0, 1, …, l}, and combine them to form a batch, i.e., batch mixing
Data-driven Defence: Injecting Data
• Curriculum Adversarial Training (CAT) [17]
– Quantization
• Attack generalisation: the model trained with CAT may not defend against stronger attacks
• Quantization: real value → b-bit integer
• Each input x: real values in [0, 1]^d → integer values in [0, 2^b − 1]^d
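A one-line illustration of the quantization step (NumPy; values are illustrative):

```python
import numpy as np

# Map real inputs in [0, 1]^d to b-bit integers in [0, 2^b - 1]^d.
b = 4
x = np.array([0.00, 0.33, 0.66, 1.00])
x_q = np.round(x * (2 ** b - 1)).astype(int)
print(x_q)                                     # -> [ 0  5 10 15]
```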
Data-driven Defence: Injecting Data
• Adversarial training for free [20]
– Train on each minibatch m times, and reduce the number of epochs from N_ep to N_ep/m
– FGSM is used, but perturbations are not reset between minibatches
– A single backward pass updates both the model weights and the perturbation
Data-driven Defence: Injecting Data
• Fast Adversarial Training [18]
– FGSM adversarial training with random initialization
– The non-zero initial perturbation is the primary driver of success
Data-driven Defence
• Data-driven defence
– Filtering instances: poisoned training data and adversarial test samples either exhibit different statistical features or follow a different distribution → detection
– Injecting data: add adversarial samples into training → adversarial training
– Projecting data: project data into a lower-dimensional space; move adversarial samples closer to the manifold of legitimate samples
• Projecting data
– Adversarial samples come from low-density regions
– Move adversarial samples back to the data manifold before
classification
– Use auto-encoders, GANs, or PixelCNN to reform/purify the input
Data-driven Defence: Projecting Data
Data-driven Defence: Projecting Data
• Auto-encoder: produce an output identical to the input
https://towardsdatascience.com/applied-deep-learning-part-3-autoencoders-1c083af4d798
[Figure: input → code → output ≈ input]
• MagNet: a Two-Pronged Defense against Adversarial Examples [6]
– Use auto-encoders to detect and reform adversarial samples
– Detector (a code sketch follows below)
• Reconstruction error (RE)
  – Normal examples → small RE
  – Adversarial samples → large RE
  – Threshold: reject no more than 0.1% of examples in the validation set
• Probability divergence
  – Normal examples → small divergence between f(x) and f(AE(x))
  – Adversarial samples → large divergence between f(x′) and f(AE(x′))
  (AE(x): output of the auto-encoder; f(x): output of the last layer (i.e., softmax) of the neural network f on the input x)
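A hedged sketch of the reconstruction-error detector, assuming PyTorch; the auto-encoder architecture and the stand-in data are illustrative:

```python
import torch

# Train an auto-encoder on normal data, then flag inputs whose
# reconstruction error exceeds a threshold chosen on a validation set.
torch.manual_seed(0)
ae = torch.nn.Sequential(torch.nn.Linear(20, 8), torch.nn.ReLU(),
                         torch.nn.Linear(8, 20))
normal = torch.rand(256, 20)
opt = torch.optim.Adam(ae.parameters(), lr=1e-2)
for _ in range(300):                               # train AE on normal data
    loss = ((ae(normal) - normal) ** 2).mean()
    opt.zero_grad(); loss.backward(); opt.step()

def reconstruction_error(x):
    return ((ae(x) - x) ** 2).mean(dim=1)

val = torch.rand(128, 20)
# threshold rejecting no more than 0.1% of the validation set
threshold = reconstruction_error(val).quantile(0.999)

def is_adversarial(x):
    return reconstruction_error(x) > threshold
```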
Data-driven Defence: Projecting Data
– Reformer
• Normal examples: the AE outputs a very similar example
• Adversarial samples: the AE outputs an example that is closer to the manifold of the normal examples
[Figure: manifold of normal examples, with normal examples on it and adversarial samples being pulled back towards it]
Data-driven Defence: Projecting Data
– Reformer
Data-driven Defence: Projecting Data
• Can you think of a way to break “MagNet”?
– Hint: an adaptive attacker that attacks not only the classifier,
but also the detector (suppose there is only one detector) and
the reformer.
– arg min_{δ ∈ [0,1]^d} ‖δ‖ + c · f(x + δ) + ??
Overview
• Adversarial machine learning beyond computer vision
– Audio
– NLP
– Malware detection
• Why are machine learning models vulnerable?
– Insufficient training data
– Unnecessary features
• How to defend against adversarial machine learning?
– Data-driven defences
– Learner robustification
• Challenges
• Distillation as a defense to adversarial perturbations against deep
neural networks [8]
– Distillation: transfer knowledge from one neural network to another – suppose there is a trained DNN; the probabilities generated in its final softmax layer are used to train a second DNN, instead of the (hard) class labels
[Figure: training pipeline; the soft labels F(X), unlike the hard labels Y, provide richer information about each class]
• Distillation as a defense to adversarial perturbations against deep
neural networks [8] (N. Papernot et al.)
– Modification to the final softmax layer:
  F_i(X) = exp(z_i(X)) / Σ_{j=1}^{N} exp(z_j(X))   →   F_i(X) = exp(z_i(X)/T) / Σ_{j=1}^{N} exp(z_j(X)/T)
  Z(X): output of the last hidden layer (the logits)
  T: distillation temperature
• Distillation as a defense to adversarial perturbations against deep
neural networks [8]
– Given a training set {(X, Y(X))}, train a DNN (F) with a softmax layer at temperature T
– Form a new training set {(X, F(X))} from the resulting probability vectors, and train another DNN (F_D), with the same network architecture, also at temperature T
– Test at temperature T = 1
– A high empirical value of T at training time (with T = 1 at test time) gives better performance
– F_D provides a smoother loss function – more generalised for an unknown dataset
– Results on MNIST and CIFAR10
[Figures: effect against adversarial samples; influence of distillation on clean data]
Learner Robustification: Distillation
• Why does “a large temperature at training time (e.g. T=100) + a low temperature at test time (e.g. T=1)” make the model more secure?
• F_i(X) = exp(z_i(X)/T) / Σ_{j=1}^{N} exp(z_j(X)/T)
• E.g., Z(X) = (100, 200, 100):
– Training (T = 100):
  F(X) = ( e/(e + e² + e), e²/(e + e² + e), e/(e + e² + e) ) = ( 1/(2 + e), e/(2 + e), 1/(2 + e) )
– Test (T = 1):
  F(X) = ( e¹⁰⁰/(e¹⁰⁰ + e²⁰⁰ + e¹⁰⁰), e²⁰⁰/(e¹⁰⁰ + e²⁰⁰ + e¹⁰⁰), e¹⁰⁰/(e¹⁰⁰ + e²⁰⁰ + e¹⁰⁰) )
       = ( 1/(2 + e¹⁰⁰), e¹⁰⁰/(2 + e¹⁰⁰), 1/(2 + e¹⁰⁰) ) ≈ (0, 1, 0)
– Training at a high T keeps the softmax smooth; at test time with T = 1 the softmax saturates, so the input gradients an attacker relies on become vanishingly small
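A small NumPy check of the worked example above; `softmax` is a standard temperature softmax, not code from [8]:

```python
import numpy as np

# With Z(X) = (100, 200, 100), training at T = 100 sees moderate softmax
# inputs, but at test time (T = 1) the softmax saturates.
def softmax(z, T):
    z = np.asarray(z, dtype=float) / T
    e = np.exp(z - z.max())                     # numerically stabilised
    return e / e.sum()

Z = [100, 200, 100]
print(softmax(Z, T=100))   # ~ [0.212, 0.576, 0.212] -> smooth
print(softmax(Z, T=1))     # ~ [0., 1., 0.]          -> saturated
```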
Learner Robustification: Distillation
• Extending Defensive Distillation [12]
1. The first DNN is trained as usual (on one-hot labels)
2. New labelling vector: original label information + predictive uncertainty
3. The distilled model is trained with the new label vectors
• Predictive uncertainty
– Take N forward passes through the neural network with dropout
– Record the N logit vectors z₀(x), …, z_{N−1}(x)
– Calculate the uncertainty σ(x) for x from the spread of these vectors
– New labelling vector k(x): move a fraction α·σ(x)/max σ(x) of the true class's probability mass into an extra outlier class
• E.g., with α = 1, max σ(x) = 0.4 and σ(x) = 0.1, the outlier entry is 0.1/0.4 = 0.25, and the one-hot label [0, 1, 0, 0] becomes k(x) = [0, 1 − 0.25 = 0.75, 0, 0, 0.25]
Learner Robustification: Distillation
• White-box attack via FGSM
– Recovered: adversarial inputs that are assigned to the original class
– Detected: adversarial examples that are classified in the outlier class
Learner Robustification: Stability Training
• Improving the Robustness of Deep Neural Networks via Stability Training [9]
– Stability objective: if x′ is close to x, f(x) should be close to f(x′)
  ∀x′: d(x, x′) small ↔ D(f(x), f(x′)) small
– Define a new training objective:
  L(x, x′; θ) = L₀(x; θ) + α·D(f(x), f(x′)), L₀: the original training objective
– New optimisation problem:
  θ* = arg min_θ Σ_{d(x_i, x_i′) < ε} L(x_i, x_i′; θ)
– Generate x′ by adding pixel-wise uncorrelated Gaussian noise ε to x:
  x_k′ = x_k + ε_k, ε_k ~ N(0, σ_k²), σ_k > 0
– L₀ and D are task-specific, e.g., L₀: cross-entropy loss, D: KL-divergence
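A hedged sketch of the stability objective, assuming PyTorch, with cross-entropy for L₀ and KL-divergence for D; the model and data are stand-ins:

```python
import torch
import torch.nn.functional as F

# Stability training in the spirit of [9]: cross-entropy on x plus a KL
# term tying f(x') to f(x), where x' is x plus Gaussian noise.
torch.manual_seed(0)
model = torch.nn.Sequential(torch.nn.Linear(10, 32), torch.nn.ReLU(),
                            torch.nn.Linear(32, 3))
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
alpha, sigma = 0.1, 0.05

for _ in range(100):
    x = torch.rand(32, 10); y = torch.randint(3, (32,))
    x_noisy = x + sigma * torch.randn_like(x)        # x'_k = x_k + eps_k
    logits, logits_noisy = model(x), model(x_noisy)
    l0 = F.cross_entropy(logits, y)                  # original objective L0
    d = F.kl_div(logits_noisy.log_softmax(1),        # D(f(x), f(x'))
                 logits.softmax(1).detach(), reduction="batchmean")
    loss = l0 + alpha * d
    opt.zero_grad(); loss.backward(); opt.step()
```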
Overview
• Adversarial machine learning beyond computer vision
– Audio
– NLP
– Malware detection
• Why are machine learning models vulnerable?
– Insufficient training data
– Unnecessary features
• How to defend against adversarial machine learning?
– Data-driven defences
– Learner robustification
• Challenges
Adversarial Machine Learning – Challenges
• Arm race between attackers and defenders [10][11]
• Many defence methods fail to
– Evaluate against a strong attack, e.g., PGD, C&W
– Evaluate against an adaptive attacker
• Should not assume the attacker is unaware of the defence method
• arg min_{δ ∈ [0,1]^d} ‖δ‖₂² (or ‖δ‖_∞) + c · f_target(x + δ)
– Evaluate on more complex datasets like CIFAR and ImageNet
• Evaluating solely on MNIST is insufficient
– Define a realistic threat model – what is known & unknown to the attacker
• Model architecture and model weights
• Training algorithm and training data
• Test time randomness
• White-box – grey-box – black-box
Summary
• Adversarial machine learning beyond computer vision
– Audio
– NLP
– Malware detection
• Why are machine learning models vulnerable?
– Insufficient training data
– Unnecessary features
• How to defend against adversarial machine learning?
– Data-driven defences
• Filtering adversarial samples
• Adversarial training
• Project to lower dimension
– Learner robustification
• Distillation
• Stability training
References
• [1] R. Feinman, R. R. Curtin, S. Shintre, and A. B. Gardner, “Detecting Adversarial Samples from Artifacts,” arXiv:1703.00410, 2017.
• [2] A. Nguyen, J. Yosinski, and J. Clune, “Deep Neural Networks are Easily Fooled: High Confidence Predictions for Unrecognizable Images,” in CVPR, 2015.
• [3] B. Wang, J. Gao, and Y. Qi, “A Theoretical Framework for Robustness of (Deep) Classifiers against Adversarial Examples,” arXiv:1612.00334, 2016.
• [4] J. H. Metzen, T. Genewein, V. Fischer, and B. Bischoff, “On Detecting Adversarial Perturbations,” arXiv:1702.04267, 2017.
• [5] A. Madry, A. Makelov, L. Schmidt, D. Tsipras, and A. Vladu, “Towards Deep Learning Models Resistant to Adversarial Attacks,” arXiv:1706.06083, 2017.
• [6] D. Meng and H. Chen, “MagNet: a Two-Pronged Defense against Adversarial Examples,” arXiv:1705.09064, 2017.
• [7] K. Grosse, N. Papernot, P. Manoharan, M. Backes, and P. McDaniel, “Adversarial Perturbations Against Deep Neural Networks for Malware Classification,” arXiv:1606.04435, 2016.
References
• [8] N. Papernot, P. McDaniel, X. Wu, S. Jha, and A. Swami, “Distillation as a Defense to Adversarial Perturbations against Deep Neural Networks,” arXiv:1511.04508, 2015.
• [9] S. Zheng, Y. Song, T. Leung, and I. Goodfellow, “Improving the Robustness of Deep Neural Networks via Stability Training,” arXiv:1604.04326, 2016.
• [10] N. Carlini and D. Wagner, “Adversarial Examples Are Not Easily Detected: Bypassing Ten Detection Methods,” arXiv:1705.07263, 2017.
• [11] A. Athalye, N. Carlini, and D. Wagner, “Obfuscated Gradients Give a False Sense of Security: Circumventing Defenses to Adversarial Examples,” arXiv:1802.00420, 2018.
• [12] N. Papernot and P. McDaniel, “Extending Defensive Distillation,” arXiv:1705.05264, 2017.
• [14] N. Carlini and D. Wagner, “Audio Adversarial Examples: Targeted Attacks on Speech-to-Text,” arXiv:1801.01944, 2018.
References
• [15] X. Zhang, J. Zhao, and Y. LeCun, “Character-level Convolutional Networks for Text Classification,” in NeurIPS, 2015, pp. 649–657.
• [16] B. Liang, H. Li, M. Su, P. Bian, X. Li, and W. Shi, “Deep Text Classification Can be Fooled,” in IJCAI, 2018, pp. 4208–4215.
• [17] Q.-Z. Cai, C. Liu, and D. Song, “Curriculum Adversarial Training,” in IJCAI, 2018, pp. 3740–3747.
• [18] E. Wong, L. Rice, and J. Z. Kolter, “Fast is Better than Free: Revisiting Adversarial Training,” arXiv:2001.03994, 2020.
• [19] M. McCloskey and N. J. Cohen, “Catastrophic Interference in Connectionist Networks: The Sequential Learning Problem,” in Psychology of Learning and Motivation, vol. 24, Academic Press, 1989, pp. 109–165.
• [20] A. Shafahi, M. Najibi, A. Ghiasi, Z. Xu, J. P. Dickerson, C. Studer, L. S. Davis, G. Taylor, and T. Goldstein, “Adversarial Training for Free!,” in NeurIPS, 2019, pp. 3353–3364.