
Week 10: Adversarial Machine Learning – Vulnerabilities (Part II) Explanation, Detection & Defence
COMP90073 Security Analytics, CIS, Semester 2, 2021

Overview
• Adversarial machine learning beyond computer vision
– Audio
– Natural language processing (NLP)
– Malware detection
• Why are machine learning models vulnerable?
– Insufficient training data
– Unnecessary features
• How to defend against adversarial machine learning?
– Data-driven defences
– Learner robustification
• Challenges

Audio Adversarial Examples
• Speech recognition system
– Recurrent Neural Networks
– Audio waveform → a sequence of probability distributions over individual characters at time steps t0, t1, … (https://distill.pub/2017/ctc/)
– Challenge: alignment between the input and the output
  • Exact location of each character in the audio file, e.g., "HEEELLLLOO"

[Figure: example per-time-step output distributions, e.g., at t0: P(a)=0.5, P(b)=0.2, P(c)=0.1; at t1: P(a)=0.3, P(b)=0.4, P(c)=0.2]

Audio Adversarial Examples
• Connectionist Temporal Classification (CTC)
– Encoding
  • Introduce a special blank character, denoted "–"
  • 𝑌′ = B(𝑌): modify the ground truth text (𝑌) by (1) inserting "–", (2) repeating characters, in all possible ways
  • A blank character must be inserted between duplicate characters
  • E.g., the input X has a length of 10, and 𝑌 = [h, e, l, l, o]
    – Valid: heeell–llo, hhhh–el–lo, heell–looo
    – Invalid: hhee–llo–o, heel–lo

Audio Adversarial Examples
• Connectionist Temporal Classification (CTC)
– Loss function: calculate the score of every valid alignment 𝑌′ and sum them up
  • p(𝑌|𝑋) = Σ_{𝑌′} ∏_{i=1}^{|𝑋|} p_i(y′_i | 𝑋), where the p_i are the per-time-step probabilities
  • Loss = negative log likelihood of the sum: −log Σ_{𝑌′} ∏_{i=1}^{|𝑋|} p_i(y′_i | 𝑋)
– Decoding (greedy; see the sketch below)
  • Pick the character with the highest score at each time step
  • Remove duplicate characters, then remove blanks
  • E.g., HEE–LL–LOO → HE–L–LO → HELLO
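A minimal Python sketch (not from the lecture materials) of the greedy decoding steps above; the toy probability matrix, alphabet and blank index are assumptions for illustration.

```python
# Greedy (best-path) CTC decoding: argmax per time step, collapse repeats, drop blanks.
import numpy as np

def greedy_ctc_decode(probs: np.ndarray, alphabet: str, blank: int = 0) -> str:
    """probs: (T, C) array with one distribution per time step; class `blank` is the CTC blank."""
    best_path = probs.argmax(axis=1)           # highest-scoring class at each time step
    decoded, prev = [], None
    for idx in best_path:
        if idx != prev and idx != blank:       # remove duplicates, then remove blanks
            decoded.append(alphabet[idx - 1])
        prev = idx
    return "".join(decoded)

# Toy example: 4 time steps over the classes {blank, 'h', 'i'}
probs = np.array([[0.1, 0.8, 0.1],   # h
                  [0.2, 0.7, 0.1],   # h (repeat, collapsed)
                  [0.8, 0.1, 0.1],   # blank
                  [0.1, 0.1, 0.8]])  # i
print(greedy_ctc_decode(probs, alphabet="hi"))  # -> "hi"
```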

Audio Adversarial Examples
• Computer vision domain:
– arg min_δ ‖δ‖ + c · f(x + δ), s.t. x + δ ∈ [0, 1]^d
– C(x + δ) = t_target ⇔ f(x + δ) = f(x′) ≤ 0
• Audio adversarial examples against a speech recognition system [14]
– How to measure the perturbation δ?
  • Measure δ in decibels (dB): dB(x) = max_i 20 log₁₀(x_i)
  • dB_x(δ) = dB(δ) − dB(x)
– How to construct the objective function?
  • Choose CTC-Loss(x′, t_target) as the function f
  • C(x + δ) = t_target ⇐ f(x + δ) = f(x′) ≤ 0
  • C(x + δ) = t_target ⇏ f(x + δ) = f(x′) ≤ 0
  • Solution will still be adversarial, but may not be minimally perturbed
– Examples: https://nicholas.carlini.com/code/audio_adversarial_examples
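A hedged sketch of the optimisation described above, in a PyTorch-style setup. The TinyRecogniser model, the frame shapes, the constants c and eps, and the plain amplitude clamp (instead of the paper's dB constraint) are all illustrative assumptions, not the actual attack code of [14].

```python
# Sketch of the optimisation loop: minimise ||delta||^2 + c * CTC-Loss(x + delta, target).
import torch
import torch.nn as nn
import torch.nn.functional as F

class TinyRecogniser(nn.Module):
    """Stand-in speech-to-text model: audio frames -> per-time-step character log-probs."""
    def __init__(self, n_chars=29, frame=80):
        super().__init__()
        self.rnn = nn.GRU(frame, 64, batch_first=True)
        self.out = nn.Linear(64, n_chars)
    def forward(self, frames):                     # frames: (1, T, frame)
        h, _ = self.rnn(frames)
        return F.log_softmax(self.out(h), dim=-1)  # (1, T, n_chars)

torch.manual_seed(0)
model = TinyRecogniser()
for p in model.parameters():                       # the attack only optimises delta
    p.requires_grad_(False)

x = torch.rand(1, 100, 80)                         # placeholder "waveform" frames in [0, 1]
target = torch.randint(1, 29, (1, 12))             # encoded target transcription t_target
ctc = nn.CTCLoss(blank=0)

delta = torch.zeros_like(x, requires_grad=True)
opt = torch.optim.Adam([delta], lr=1e-2)
c, eps = 10.0, 0.05                                # trade-off constant, crude loudness bound

for _ in range(200):
    logp = model(x + delta).permute(1, 0, 2)       # CTCLoss expects (T, N, C)
    loss = c * ctc(logp, target, torch.tensor([100]), torch.tensor([12])) + (delta ** 2).sum()
    opt.zero_grad(); loss.backward(); opt.step()
    with torch.no_grad():
        delta.clamp_(-eps, eps)                    # keep the perturbation quiet
```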

NLP
• Deep text classification
– Character level [15]
  • Every character is represented using one-hot encoding
  • 6 convolutional layers + 3 fully-connected layers

NLP
• Deep Text Classification Can be Fooled [16]
– Identify text items that contribute most to the classification
– Contribution measured based on the gradient ∂f/∂x, x: a training sample (see the sketch below)
– Hot character: contains the dimensions with the highest gradient magnitude
– Hot word: contains ≥ 3 hot characters
– Hot phrase: a single hot word, or adjacent hot words
– Hot Training/Sample Phrase: the hot phrase that occurs most frequently in the training data / test sample
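A rough illustration of the gradient-based ranking idea, not the authors' code: TinyCharCNN, the alphabet, the sample text and the "top-5 positions" rule are all assumptions made to keep the example runnable.

```python
# Rank the characters of one input by the gradient magnitude of the true-class score.
import torch
import torch.nn as nn

ALPHABET = "abcdefghijklmnopqrstuvwxyz "
V, L = len(ALPHABET), 32                       # vocabulary size, (padded) text length

class TinyCharCNN(nn.Module):
    def __init__(self, n_classes=4):
        super().__init__()
        self.conv = nn.Conv1d(V, 16, kernel_size=3, padding=1)
        self.fc = nn.Linear(16 * L, n_classes)
    def forward(self, x):                      # x: (1, V, L) one-hot encoded text
        return self.fc(torch.relu(self.conv(x)).flatten(1))

def one_hot(text: str) -> torch.Tensor:
    x = torch.zeros(1, V, L)
    for i, ch in enumerate(text[:L]):
        x[0, ALPHABET.index(ch), i] = 1.0
    return x

model = TinyCharCNN()
text = "this movie was painfully boring"
x = one_hot(text).requires_grad_(True)
true_class = 0                                 # e.g., "negative review"

model(x)[0, true_class].backward()             # d(true-class score) / d(input)
char_saliency = x.grad.abs().amax(dim=1)[0]    # max gradient magnitude per character position
hot_positions = char_saliency.topk(5).indices.tolist()
print([text[i] for i in hot_positions if i < len(text)])   # candidate "hot characters"
```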

NLP
• Deep Text Classification Can be Fooled [16]
– Given text x, make C(x + δ) = t_target
– Insertion
  • What to insert: Hot Training Phrases of the target class
  • Where to insert: near Hot Sample Phrases of the original class

NLP
• Deep Text Classification Can be Fooled [16]
– Modification: replace the characters in Hot Sample Phrases by
  • Common misspellings, or
  • Visually similar characters

NLP
• Deep Text Classification Can be Fooled [16]
– Removal: inessential adjectives or adverbs in Hot Sample Phrases (HSPs) are removed
• Less effective
• Only downgrade the confidence of the original class

NLP
• Deep Text Classification Can be Fooled [16]
– Combination of the three strategies
– Limitation: all perturbations are performed manually

Malware Detection
• Attacking malware classifier for mobile phones [7]
– An application is represented by a binary vector X ∈ {0, 1}^d
  • 1: the app has the feature, 0: the app doesn't have the feature
  • E.g., a chat app using contacts and storage but not calendar → [1, 1, 0]
– Classifier: feed-forward neural network
  F(X) = [F₀(X), F₁(X)], F₀(X) + F₁(X) = 1, 0: benign, 1: malicious
  Benign if F₀(X) > F₁(X); malicious otherwise
  [Figure: feed-forward network mapping the d input features X₁, …, X_d to the outputs F₀ and F₁]

Evasion attacks (application)
• Attacking malware classifier for mobile phones [7]
– Attack goal: make a malicious application be classified as benign
– Limit: only add features, to avoid destroying app functionality
– For each iteration (see the sketch below):
  • Step 1: compute the gradient of F w.r.t. X: ∂F_k(X)/∂X_j, k ∈ {0, 1}, j ∈ [1, d]
  • Step 2: change to 1 the feature X_i that (1) is currently 0 and (2) has the maximal positive gradient → maximise the change towards the target class 0
    i = argmax_{j ∈ [1, d], X_j = 0} ∂F₀(X)/∂X_j
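A hedged sketch of the iterative feature-addition loop above, with a toy feed-forward model; d, max_changes and the random feature vector are placeholders, not values from [7].

```python
# Flip the 0-valued feature whose gradient most increases the benign output F0.
import torch
import torch.nn as nn

d = 20                                           # number of binary app features
model = nn.Sequential(nn.Linear(d, 32), nn.ReLU(),
                      nn.Linear(32, 2), nn.Softmax(dim=-1))   # outputs [F0, F1]

x = (torch.rand(d) > 0.7).float()                # toy "malicious" feature vector
max_changes = 10                                 # keep the number of added features small

for _ in range(max_changes):
    x_var = x.clone().requires_grad_(True)
    f = model(x_var)
    if f[0] > f[1]:                              # already classified as benign -> done
        break
    f[0].backward()                              # gradient of the benign score F0 w.r.t. X
    grad = x_var.grad.clone()
    grad[x == 1] = float("-inf")                 # only features that can be ADDED (0 -> 1)
    i = int(torch.argmax(grad))
    if grad[i] <= 0:                             # no remaining flip pushes towards benign
        break
    x[i] = 1.0                                   # add the feature X_i
```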

Evasion attacks (application)
[Table: attack results; MWR: malware ratio, MR: misclassification rate]

Locations of Adversarial Samples
• Locations of adversarial samples
– Off the data manifold of the legitimate instances
– Three scenarios [1]:
  • Near the boundary, but far from the "+" manifold
  • Away from the boundary, but near the manifold – in the "pocket" of the "+" manifold
  • Close to the boundary and the "−" manifold

Locations of Adversarial Samples
• Images that are unrecognisable to human eyes, but are identified by DNNs with near certainty [2]
– [Figure: such images, which DNNs believe with 99.99% confidence to be the digits 0–9]

Explanation 1: Insufficient Training Data
• Potential reason 1: insufficient training data
• An illustrative example
– x ∈ [−1, 1], y ∈ [−1, 1], z ∈ [−1, 2]
– Binary classification
  • Class 1: z < x² + y³
  • Class 2: z ≥ x² + y³
– x, y, z are increased in steps of 0.01 → a total of 200 × 200 × 300 = 1.2×10⁷ points
– How many points are needed to reconstruct the decision boundary?

Explanation 1: Insufficient Training Data
– Randomly choose the training and test datasets:

  Setting            1      2      3      4
  Training dataset   80     800    8000   80000
  Test dataset       40     400    4000   40000

– Boundary dataset (adversarial samples are likely to be located here):
  x² + y³ − 0.1 < z < x² + y³ + 0.1

Explanation 1: Insufficient Training Data
• Test result
– RBF SVMs

  Training set size | Accuracy on its own test set | Accuracy on the 4×10⁴-point test set | Accuracy on the boundary dataset
  80                | 100                          | 92.7                                 | 60.8
  800               | 99.0                         | 97.4                                 | 74.9
  8000              | 99.5                         | 99.6                                 | 94.1
  80000             | 99.9                         | 99.9                                 | 98.9

– Linear SVMs

  Training set size | Accuracy on its own test set | Accuracy on the 4×10⁴-point test set | Accuracy on the boundary dataset
  80                | 100                          | 96.3                                 | 70.1
  800               | 99.8                         | 99.0                                 | 85.7
  8000              | 99.9                         | 99.8                                 | 97.3
  80000             | 99.98                        | 99.98                                | 99.5

• 8000 points ≈ 0.067% of the 1.2×10⁷-point space
• MNIST: 28×28 8-bit greyscale images → (2⁸)^(28×28) ≈ 1.1×10^1888 possible images
• 1.1×10^1888 × 0.067% ≫ 6×10⁴, the size of the MNIST training set

Poisoning attacks
• Poison frog attacks [10]
– E.g., add a seemingly innocuous image (that is properly labelled) to a training set, and control the identity of a chosen image at test time
– [Figure: example images from the target class and the base class]

Poisoning attacks
• Generate poison data
– f(x): the function that propagates an input x through the network to the penultimate layer (before the softmax layer)
– p = argmin_x ‖f(x) − f(t)‖²₂ + β‖x − b‖²₂
  • ‖f(x) − f(t)‖²₂: makes p move toward the target instance t in feature space and get embedded in the target class distribution
  • β‖x − b‖²₂: makes p appear like a base-class instance b to a human labeller

Explanation 2: Unnecessary Features
• Potential reason 2: redundant features [3]
– Classifier f = c ∘ g, g: feature extraction, c: classification
– d: similarity measure
– Features extracted by the ML classifier (X1) ≠ features extracted by a human (X2)

Explanation 2: Unnecessary Features
• Potential reason 2: redundant features [3]
– Previous definition of adversarial attacks:
  Find x′ s.t. f1(x) ≠ f1(x′) and Δ(x, x′) < ε
– New definition:
  Find x′ s.t. f1(x) ≠ f1(x′), d2(g2(x), g2(x′)) < δ2, and f2(x) = f2(x′)
– {δ2, η}-strong-robustness:
  if ∀x, x′ ∈ X a.e., (x, x′) satisfies
  P( f1(x) = f1(x′) | f2(x) = f2(x′), d2(g2(x), g2(x′)) < δ2 ) > 1 − η
  (i.e., f1 agrees with f2)

Explanation 2: Unnecessary Features
• Unnecessary features ruin strong-robustness
– If f1 uses unnecessary features → not strong-robust
– If f1 misses necessary features used by f2 → not accurate
– If f1 uses the same set of features as f2 → strong-robust, and can be accurate
– An adversarial sample can be far away from the original instance in the trained classifier's feature space, and on the other side of the boundary
– Each adversarial sample is close to the original instance in the oracle feature space

Data-driven Defence
• Data-driven defence
– Filtering instances: poisoning data in the training dataset or adversarial samples against the test dataset either exhibit different statistical features, or follow a different distribution → detection
– Injecting data: add adversarial samples into training → adversarial training
– Projecting data: project data into a lower-dimensional space; move adversarial samples closer to the manifold of legitimate samples

Data-driven Defence: Filtering Instances
• Filtering instances
• On Detecting Adversarial Perturbations [4]
– Adversary detection network: branch off the main network at some layer
– Each detector produces p_adv: the probability of the input being adversarial
– Step 1: train the main network regularly, and freeze its weights
– Step 2: generate an adversarial sample for each training data point
– Step 3: train the detectors on the balanced, binary dataset

Data-driven Defence: Filtering Instances
• Adaptive/dynamic attacker: an attacker that is aware of the detection method
– The adaptive attacker optimises a combination of:
  • the cross-entropy loss of the classifier → letting the classifier mis-label the input x
  • the cross-entropy loss of the detector → making the detector's output p_adv as small as possible
• Dynamic adversary training
– Static: the defender trains the classifier, freezes its weights, and precomputes adversarial samples; the attacker modifies x only to maximise the classifier's cross-entropy loss
– Dynamic: the defender computes adversarial examples on-the-fly for each mini-batch; the attacker modifies x to fool the classifier + detector; attacker and defender adapt to each other

Data-driven Defence: Filtering Instances
• Test on CIFAR10
[Figure: results on CIFAR10 for varying σ]

Data-driven Defence: Injecting Data
• Adversarial training: add adversarial samples into training data
• Towards Deep Learning Models Resistant to Adversarial Attacks [5]
– Normally, how a classification problem is formalised:
  θ* = argmin_θ E_{(x,y)~D} [ L(x; y; θ) ]  → not robust
– Augment: redefine the loss by incorporating the adversary:
  θ* = argmin_θ E_{(x,y)~D} [ max_{δ ∈ [−ε, ε]^d} L(x + δ; y; θ) ]
  • Inner max – adversary: perturb x to maximise the loss
  • Outer min – defender: find model parameters θ* to minimise the "adversarial loss"

Data-driven Defence: Injecting Data
• Towards Deep Learning Models Resistant to Adversarial Attacks [5]
– Step 1: fix θ, generate adversarial samples using strong attacks (e.g., projected gradient descent (PGD), C&W):
  xⁱ ← clip_ε( xⁱ⁻¹ + α · sign(∂L/∂xⁱ⁻¹) )
– Step 2: update θ: train the network on the augmented dataset (only one epoch)
– Inner maximisation: find adversarial examples; outer minimisation: optimise θ [17] (see the sketch below)
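A minimal sketch of the two steps above (inner maximisation with PGD, outer minimisation on the adversarial batch), assuming a toy PyTorch model and illustrative hyper-parameters; the random start inside the ε-ball is standard PGD practice rather than something stated on the slide.

```python
# PGD(k) inner maximisation followed by one outer training step on the adversarial batch.
import torch
import torch.nn as nn

model = nn.Sequential(nn.Flatten(), nn.Linear(28 * 28, 128), nn.ReLU(), nn.Linear(128, 10))
opt = torch.optim.SGD(model.parameters(), lr=0.1)
loss_fn = nn.CrossEntropyLoss()
eps, alpha, k = 0.3, 0.01, 10                             # perturbation bound, step size, iterations

def pgd_attack(x, y):
    """x_i <- clip_eps( x_{i-1} + alpha * sign(dL/dx_{i-1}) )"""
    x_adv = x + torch.empty_like(x).uniform_(-eps, eps)   # random start inside the eps-ball
    for _ in range(k):
        x_adv = x_adv.detach().requires_grad_(True)
        grad = torch.autograd.grad(loss_fn(model(x_adv), y), x_adv)[0]
        x_adv = x_adv + alpha * grad.sign()
        x_adv = x + (x_adv - x).clamp(-eps, eps)          # project back into the eps-ball
        x_adv = x_adv.clamp(0, 1)                         # stay a valid image
    return x_adv.detach()

# One outer step on a toy mini-batch: train theta on the adversarially perturbed inputs
x, y = torch.rand(32, 1, 28, 28), torch.randint(0, 10, (32,))
opt.zero_grad()
loss_fn(model(pgd_attack(x, y)), y).backward()
opt.step()
```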

Data-driven Defence: Injecting Data
• Towards Deep Learning Models Resistant to Adversarial Attacks [5]
Potential problem?

Data-driven Defence: Injecting Data
• Curriculum Adversarial Training (CAT) [17]
– Adversarial training overfits to the specific attack in use
– Training curriculum: train a model from weaker attacks to stronger attacks
– Attack strength: PGD(k), k: the number of iterations
– Number of mini-batches per epoch: n = |𝒟| / (batch size)

Data-driven Defence: Injecting Data
• Curriculum Adversarial Training (CAT) [17]
– Batch mixing
  • Catastrophic forgetting [19]: a neural network tends to forget the information learned in previous tasks when training on new tasks
  • Generate adversarial examples using PGD(i), i ∈ {0, 1, …, l}, and combine them to form a single batch, i.e., batch mixing

Data-driven Defence: Injecting Data
• Curriculum Adversarial Training (CAT) [17]
– Quantization
  • Attack generalisation: the model trained with CAT may not defend against stronger attacks
  • Quantization: real value → b-bit integer
  • Each input x: real values in [0, 1]^d → integer values in [0, 2^b − 1]^d (see the sketch below)
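A small sketch of how the quantization step could be implemented (an assumption, not code from [17]); perturbations smaller than one quantization step are simply rounded away.

```python
# Map inputs from real values in [0, 1]^d to b-bit integers in [0, 2^b - 1]^d.
import numpy as np

def quantize(x: np.ndarray, b: int = 4) -> np.ndarray:
    levels = 2 ** b - 1
    return np.round(np.clip(x, 0.0, 1.0) * levels).astype(np.int32)

def dequantize(q: np.ndarray, b: int = 4) -> np.ndarray:
    return q.astype(np.float32) / (2 ** b - 1)

x = np.array([0.50, 0.52, 0.93])
x_adv = x + 0.01                      # a perturbation smaller than one quantization step
print(quantize(x, b=4), quantize(x_adv, b=4))   # both map to the same integers: [ 8  8 14]
```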

Data-driven Defence: Injecting Data
• Adversarial training for free [20]
– Train on each minibatch m times; reduce the number of epochs from N_ep to N_ep/m
– FGSM is used, but perturbations are not reset between minibatches
– A single backward pass updates both the model weights and the perturbation

Data-driven Defence: Injecting Data
• Fast Adversarial Training [18]
– FGSM adversarial training with random initialization
– A non-zero initial perturbation is the primary driver of success (see the sketch below)
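A hedged sketch of the FGSM-with-random-initialization step from [18]; the model, data and the ε/α values are illustrative placeholders.

```python
# Single FGSM step from a non-zero random starting point, then projection onto the eps-ball.
import torch
import torch.nn as nn

model = nn.Sequential(nn.Flatten(), nn.Linear(28 * 28, 10))
loss_fn = nn.CrossEntropyLoss()
eps, alpha = 8 / 255, 10 / 255

def fgsm_random_init(x, y):
    delta = torch.empty_like(x).uniform_(-eps, eps).requires_grad_(True)   # random initialization
    grad = torch.autograd.grad(loss_fn(model(x + delta), y), delta)[0]
    delta = (delta + alpha * grad.sign()).clamp(-eps, eps)                 # one FGSM step + projection
    return (x + delta).clamp(0, 1).detach()

x, y = torch.rand(16, 1, 28, 28), torch.randint(0, 10, (16,))
x_adv = fgsm_random_init(x, y)     # then train on (x_adv, y) as in standard adversarial training
```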

Data-driven Defence: Projecting Data
• Projecting data
– Adversarial samples come from low-density regions
– Move adversarial samples back to the data manifold before classification
– Use auto-encoder, GANs, PixelCNN to reform/purify the input

Data-driven Defence: Projecting Data
• Auto-encoder: get an output identical to the input
  Input → code → output ≈ input
https://towardsdatascience.com/applied-deep-learning-part-3-autoencoders-1c083af4d798

Data-driven Defence: Projecting Data
• MagNet: a Two-Pronged Defense against Adversarial Examples [6]
– Use auto-encoders to detect and reform adversarial samples
– Detector
  • Reconstruction error (RE)
    – Normal examples → small RE
    – Adversarial samples → large RE
    – Threshold: reject no more than 0.1% of examples in the validation set
  • Probability divergence
    – Normal examples → small divergence between f(x) and f(AE(x))
    – Adversarial samples → large divergence between f(x′) and f(AE(x′))
  AE(x): output of the auto-encoder
  f(x): output of the last layer (i.e., softmax) of the neural network f on the input x
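A rough sketch of the reconstruction-error detector under simplifying assumptions: a small dense autoencoder stands in for the paper's architectures, and only the RE criterion with a 0.1% validation threshold is shown (the divergence-based detector and the reformer are omitted).

```python
# Train an autoencoder on normal data, then threshold its reconstruction error.
import torch
import torch.nn as nn

ae = nn.Sequential(nn.Linear(784, 128), nn.ReLU(),      # encoder
                   nn.Linear(128, 784), nn.Sigmoid())   # decoder
opt = torch.optim.Adam(ae.parameters(), lr=1e-3)

def reconstruction_error(x):
    return ((ae(x) - x) ** 2).mean(dim=1)                # per-example RE

# 1) Train the autoencoder on normal examples only (toy data here)
normal = torch.rand(512, 784)
for _ in range(100):
    opt.zero_grad()
    reconstruction_error(normal).mean().backward()
    opt.step()

# 2) Choose a threshold that rejects no more than ~0.1% of a clean validation set
with torch.no_grad():
    threshold = torch.quantile(reconstruction_error(torch.rand(256, 784)), 0.999)

# 3) At test time, flag inputs with a large reconstruction error as adversarial
def is_adversarial(x):
    with torch.no_grad():
        return reconstruction_error(x) > threshold
```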

Data-driven Defence: Projecting Data
– Reformer
• Normal examples: AE outputs a very similar example
• Adversarial samples: AE outputs an example that is closer to the manifold of the normal examples
[Figure: manifold of normal examples, normal examples, and adversarial samples being mapped back towards the manifold]

Data-driven Defence: Projecting Data
– Reformer

Data-driven Defence: Projecting Data
• Can you think of a way to break “MagNet”?
– Hint: an adaptive attacker that attacks not only the classifier, but also the detector (suppose there is only one detector) and the reformer.
– arg min_δ ‖δ‖ + c · f(x + δ) + ??, s.t. x + δ ∈ [0, 1]^d

Learner Robustification: Distillation
• Distillation as a defense to adversarial perturbations against deep neural networks [8]
– Distillation: transfer knowledge from one neural network to another – suppose there is a trained DNN; the probabilities generated in its final softmax layer (rather than the hard class labels) are used to train a second DNN
  • Soft labels provide richer information about each class

Learner Robustification: Distillation
• Distillation as a defense to adversarial perturbations against deep neural networks [8] (N. Papernot et al.)
– Modification to the final softmax layer:
  F_i(X) = e^{z_i(X)} / Σ_{j=1}^{N} e^{z_j(X)}   →   F_i(X) = e^{z_i(X)/T} / Σ_{j=1}^{N} e^{z_j(X)/T}
  Z(X): output of the last hidden layer (the logits)
  T: distillation temperature

Learner Robustification: Distillation
• Distillation as a defense to adversarial perturbations against deep neural networks [8]
– Given a training set {(X, Y(X))}, train a DNN F with a softmax layer at temperature T
– Form a new training set {(X, F(X))}, where F(X) is a probability vector, and train another DNN F_D with the same network architecture, also at temperature T
– Test at temperature T = 1
– A high empirical value of T at training time (with T = 1 at test time) gives better performance
– F_D provides a smoother loss function – more generalised for an unknown dataset
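A hedged sketch of the three-stage recipe above, assuming toy MLPs and data; the temperature, optimiser and epoch counts are illustrative, not the paper's settings.

```python
# Teacher trained on hard labels at temperature T; student trained on the teacher's
# soft labels F(X) at the same T; the student is used at T = 1 at test time.
import torch
import torch.nn as nn
import torch.nn.functional as F

def make_net():
    return nn.Sequential(nn.Flatten(), nn.Linear(28 * 28, 128), nn.ReLU(), nn.Linear(128, 10))

T = 100.0
teacher, student = make_net(), make_net()
x, y = torch.rand(256, 1, 28, 28), torch.randint(0, 10, (256,))

# 1) Train the teacher with a softmax at temperature T (logits divided by T)
opt = torch.optim.Adam(teacher.parameters(), lr=1e-3)
for _ in range(50):
    opt.zero_grad()
    F.cross_entropy(teacher(x) / T, y).backward()
    opt.step()

# 2) Train the student (same architecture) on the soft labels F(X), also at temperature T
with torch.no_grad():
    soft_labels = F.softmax(teacher(x) / T, dim=1)
opt = torch.optim.Adam(student.parameters(), lr=1e-3)
for _ in range(50):
    opt.zero_grad()
    loss = -(soft_labels * F.log_softmax(student(x) / T, dim=1)).sum(dim=1).mean()
    loss.backward()
    opt.step()

# 3) Test at T = 1: plain softmax over the student's logits
pred = F.softmax(student(x), dim=1).argmax(dim=1)
```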

Learner Robustification: Distillation
– Results on MNIST and CIFAR10
  [Figures: effect against adversarial samples; influence of distillation on clean data]

Learner Robustification: Distillation
• Why does "a large temperature at training time (e.g. T = 100) + a low temperature at test time (e.g. T = 1)" make the model more secure?
• F_i(X) = e^{z_i(X)/T} / Σ_{j=1}^{N} e^{z_j(X)/T}
• Example: logits Z(X) = (100, 200, 100)
– Training (T = 100): F(X) = (e¹, e², e¹) / (e¹ + e² + e¹), e.g., F_1(X) = 1 / (1 + e + 1) ≈ 0.21 – a smooth probability vector
– Test (T = 1): F(X) = (e¹⁰⁰, e²⁰⁰, e¹⁰⁰) / (e¹⁰⁰ + e²⁰⁰ + e¹⁰⁰), e.g., F_1(X) = 1 / (1 + e¹⁰⁰ + 1) ≈ 0 – the softmax saturates, so the gradients an attacker relies on become vanishingly small (see the sketch below)
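A quick numeric check of the example above (a small sketch, not from the slides): the same logits give a smooth distribution at T = 100 and a saturated one at T = 1.

```python
# The same logits give smooth probabilities at T = 100 and saturated ones at T = 1.
import numpy as np

def softmax_T(z, T=1.0):
    z = np.asarray(z, dtype=float) / T
    e = np.exp(z - z.max())          # subtract the max for numerical stability
    return e / e.sum()

z = [100.0, 200.0, 100.0]            # logits of a network trained at T = 100
print(softmax_T(z, T=100))           # ~[0.21, 0.58, 0.21]  smooth, informative gradients
print(softmax_T(z, T=1))             # ~[0.,   1.,   0.  ]  saturated, near-zero gradients
```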

Learner Robustification: Distillation
• Extending Defensive Distillation [12]
• 1. The 1st DNN is trained as usual (on one-hot labels)
• 2. New labelling vector: original label information + predictive uncertainty
• 3. The distilled model is trained with the new label vectors

Learner Robustification: Distillation
• Predictive uncertainty
– Take N forward passes through the neural network with dropout
– Record the N logit vectors z_0(x), …, z_{N−1}(x)
– Calculate the uncertainty σ(x) for x from the spread of these vectors
– New labelling vector k(x): the original label vector combined with an extra uncertainty (outlier) entry
  • E.g., with α = 1, σ(x) = 0.1 and max σ(x) = 0.4, the uncertainty entry is α·σ(x)/max σ(x) = 0.25, and the original label vector (0.1, 0.6, 0.1, 0.2) is weighted by 1 − 0.25 = 0.75
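An illustrative sketch of the N-forward-pass step, assuming a toy model and using the per-class standard deviation of the logits as the uncertainty measure; the exact uncertainty and labelling formulas of [12] are not reproduced here.

```python
# N stochastic forward passes with dropout left on; spread of the logits as uncertainty.
import torch
import torch.nn as nn

model = nn.Sequential(nn.Flatten(), nn.Linear(28 * 28, 128), nn.ReLU(),
                      nn.Dropout(p=0.5), nn.Linear(128, 10))
model.train()                        # keep dropout ACTIVE while predicting

x = torch.rand(1, 1, 28, 28)
N = 20
with torch.no_grad():
    logits = torch.stack([model(x) for _ in range(N)])   # the N logit vectors z_0(x)..z_{N-1}(x)
sigma = logits.std(dim=0)            # per-class spread over the N passes
uncertainty = sigma.max().item()     # one uncertainty score for x
print(uncertainty)
```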

Learner Robustification: Distillation
• White-box attack via FGSM
– Recovered: adversarial inputs that are assigned to the original class
– Detected: adversarial examples that are classified into the outlier class

Learner Robustification: Stability Training
• Improving the Robustness of Deep Neural Networks via Stability Training [9]
– Stability objective: if x′ is close to x, f(x′) should be close to f(x)
  ∀x′: d(x, x′) small ↔ D(f(x), f(x′)) small
– Define a new training objective:
  L(x, x′; θ) = L₀(x; θ) + α·D(f(x), f(x′)), L₀: the original training objective
– New optimisation problem:
  θ* = argmin_θ Σ_i L(x_i, x′_i; θ), with d(x_i, x′_i) < ε
– Generate x′ by adding pixel-wise uncorrelated Gaussian noise ε to x:
  x′_k = x_k + ε_k, ε_k ~ 𝒩(0, σ_k²), σ_k > 0
– L₀ and D are task specific, e.g., L₀: cross-entropy loss, D: KL-divergence (see the sketch below)
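A hedged sketch of one stability-training step with L₀ = cross-entropy and D = KL divergence, using a toy model; α and σ are illustrative values, not the paper's.

```python
# One stability-training step: L = L0(x) + alpha * D(f(x), f(x + Gaussian noise)).
import torch
import torch.nn as nn
import torch.nn.functional as F

model = nn.Sequential(nn.Flatten(), nn.Linear(28 * 28, 128), nn.ReLU(), nn.Linear(128, 10))
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
alpha, sigma = 0.1, 0.05

x, y = torch.rand(32, 1, 28, 28), torch.randint(0, 10, (32,))

opt.zero_grad()
x_noisy = x + sigma * torch.randn_like(x)              # x'_k = x_k + eps_k, eps_k ~ N(0, sigma^2)
logits, logits_noisy = model(x), model(x_noisy)
l0 = F.cross_entropy(logits, y)                        # L0: original (cross-entropy) objective
stability = F.kl_div(F.log_softmax(logits_noisy, dim=1),
                     F.softmax(logits, dim=1),
                     reduction="batchmean")            # D(f(x), f(x')): KL divergence
(l0 + alpha * stability).backward()
opt.step()
```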

Adversarial Machine Learning – Challenges
• Arms race between attackers and defenders [10][11]
• Many defence methods fail to
– Evaluate against a strong attack, e.g., PGD, C&W
– Evaluate against an adaptive attacker
  • argmin_δ ‖δ‖_{2/∞} + c · f_target(x + δ), s.t. x + δ ∈ [0, 1]^d
  • Should not assume the attacker is unaware of the defence method
– Evaluate on complicated datasets like CIFAR, ImageNet
  • Evaluating solely on MNIST is insufficient
– Define a realistic threat model – what is known & unknown to the attacker
  • Model architecture and model weights
• Training algorithm and training data
• Test time randomness
• White-box – grey-box – black-box

Summary
• Adversarial machine learning beyond computer vision
– Audio
– NLP
– Malware detection
• Why are machine learning models vulnerable?
– Insufficient training data
– Unnecessary features
• How to defend against adversarial machine learning?
– Data-driven defences
• Filtering adversarial samples
• Adversarial training
• Project to lower dimension
– Learner robustification
• Distillation
• Stability training

References
• [1] R. Feinman, R. R. Curtin, S. Shintre, and A. B. Gardner, "Detecting Adversarial Samples from Artifacts," arXiv:1703.00410, 2017.
• [2] A. Nguyen, J. Yosinski, and J. Clune, "Deep Neural Networks are Easily Fooled: High Confidence Predictions for Unrecognizable Images," in CVPR, 2015.
• [3] B. Wang, J. Gao, and Y. Qi, "A Theoretical Framework for Robustness of (Deep) Classifiers against Adversarial Examples," arXiv:1612.00334, 2016.
• [4] J. H. Metzen, T. Genewein, V. Fischer, and B. Bischoff, "On Detecting Adversarial Perturbations," arXiv:1702.04267, 2017.
• [5] A. Madry, A. Makelov, L. Schmidt, D. Tsipras, and A. Vladu, "Towards Deep Learning Models Resistant to Adversarial Attacks," arXiv:1706.06083, 2017.
• [6] D. Meng and H. Chen, "MagNet: a Two-Pronged Defense against Adversarial Examples," arXiv:1705.09064, 2017.
• [7] K. Grosse, N. Papernot, P. Manoharan, M. Backes, and P. McDaniel, "Adversarial Perturbations Against Deep Neural Networks for Malware Classification," arXiv:1606.04435, 2016.

References
• [8] N. Papernot, P. McDaniel, X. Wu, S. Jha, and A. Swami, "Distillation as a Defense to Adversarial Perturbations against Deep Neural Networks," arXiv:1511.04508, 2015.
• [9] S. Zheng, Y. Song, T. Leung, and I. Goodfellow, "Improving the Robustness of Deep Neural Networks via Stability Training," arXiv:1604.04326, 2016.
• [10] N. Carlini and D. Wagner, "Adversarial Examples Are Not Easily Detected: Bypassing Ten Detection Methods," arXiv:1705.07263, 2017.
• [11] A. Athalye, N. Carlini, and D. Wagner, "Obfuscated Gradients Give a False Sense of Security: Circumventing Defenses to Adversarial Examples," arXiv:1802.00420, 2018.
• [12] N. Papernot and P. McDaniel, "Extending Defensive Distillation," arXiv:1705.05264, 2017.
• [13] K. Grosse, N. Papernot, P. Manoharan, M. Backes, and P. McDaniel, "Adversarial Perturbations Against Deep Neural Networks for Malware Classification," arXiv:1606.04435, 2016.
• [14] N. Carlini and D. Wagner, "Audio Adversarial Examples: Targeted Attacks on Speech-to-Text," arXiv:1801.01944, 2018.

References
• [15] X. Zhang, J. Zhao, and Y. LeCun, "Character-level Convolutional Networks for Text Classification," in Advances in Neural Information Processing Systems, pages 649–657, 2015.
• [16] B. Liang, H. Li, M. Su, P. Bian, X. Li, and W. Shi, "Deep Text Classification Can be Fooled," in Proceedings of the 27th International Joint Conference on Artificial Intelligence (IJCAI'18), AAAI Press, 4208–4215, 2018.
• [17] Q.-Z. Cai, C. Liu, and D. Song, "Curriculum Adversarial Training," in Proceedings of the 27th International Joint Conference on Artificial Intelligence (IJCAI'18), AAAI Press, 3740–3747, 2018.
• [18] E. Wong, L. Rice, and J. Z. Kolter, "Fast is Better than Free: Revisiting Adversarial Training," arXiv:2001.03994, 2020.
• [19] M. McCloskey and N. J. Cohen, "Catastrophic Interference in Connectionist Networks: The Sequential Learning Problem," in Psychology of Learning and Motivation, Vol. 24, Academic Press, 109–165, 1989.
• [20] A. Shafahi, M. Najibi, A. Ghiasi, Z. Xu, J. Dickerson, C. Studer, L. S. Davis, G. Taylor, and T. Goldstein, "Adversarial Training for Free!," in NeurIPS, 3353–3364, 2019.