
Week 11: Adversarial Reinforcement Learning
COMP90073 Security Analytics
CIS, Semester 2, 2021

Overview
• Background on reinforcement learning
  – Introduction
  – Q-learning
  – Application in defending against DDoS attacks
• Adversarial attacks against RL models
  – Test time attack
  – Training time attack
• Defence


Background on reinforcement learning
• Application

https://www.tesla.com/videos/autopilot-self-driving-hardware-neighborhood-long
https://www.myrealfacts.com/2019/05/applications-of-reinforcement-learning.html

Background on reinforcement learning
• Introduction
  – The agent interacts with the environment: it observes the current state, takes an action, and receives a reward that assesses the action.
  – Trajectory: S1, A1, R1, S2, A2, R2, S3, A3, R3, …
  – Reward over the long run: $R_t = \sum_{\tau=t}^{\infty} \gamma^{\tau-t} r_\tau$, $\gamma \in (0,1]$
  – Goal: maximise the discounted cumulative reward

Background on reinforcement learning
• State
  – Example (maze navigation): the state encodes the grid of cells, e.g., which cells are blocked, together with the agent's current position.

Background on reinforcement learning
• Action
  – Up
  – Left
  – Down
  – Right
• Reward: an immediate feedback on whether an action is good
  – In the range [-1, 1]
  – 1: reach the exit
  – -0.8: move to a blocked cell
  – -0.3: move to a visited cell
  – -0.05: move to an adjacent cell

Background on reinforcement learning
• Policy (π): a mapping from state to action, i.e., a = π(s); it tells the agent what to do in a given state
  – e.g., a probability for each of the four actions: (0.9, 0.05, 0.05, 0)

Background on reinforcement learning
• Value function: the future, long term reward of a state
– Valueofstate𝑠𝑠underpolicy𝜋𝜋:𝑉𝑉𝜋𝜋 𝑠𝑠 =𝔼𝔼 ∑𝑇𝑇 𝛾𝛾𝑖𝑖−1𝑟𝑟|𝑆𝑆 =𝑠𝑠
𝑖𝑖=1 𝑖𝑖𝑡𝑡 – Expected value of following policy 𝜋𝜋 starting from state s
– Conditional on some policy 𝜋𝜋
COMP90073 Security Analysis

Background on reinforcement learning
• Model of the environment: mimics the behaviour of the environment, e.g., given a state and action, it predicts what the next state and reward might be.

Background on reinforcement learning
• The agent now maintains a policy and a value function: it selects an action, and the environment returns the next state and a reward (an assessment of the action).

Background on reinforcement learning
• Classification
  – Value-based: the algorithm estimates the value function
  – Policy-based: the algorithm learns the policy directly
  – Actor-critic: the critic updates the action-value function, the actor updates the policy

Background on reinforcement learning
• Classification
  – Model-free: the algorithm directly learns the policy and/or the value function
  – Model-based: the algorithm first builds up a model of how the environment works

Overview
• Background on reinforcement learning
  – Introduction
  – Q-learning
  – Application in defending against DDoS attacks
• Adversarial attacks against RL models
  – Test time attack
  – Training time attack
• Defence

Background on reinforcement learning
• Q-learning: estimate the action-value function $Q(s, a)$
  – Expected value of taking action $a$ in state $s$ and then following policy $\pi$:
    $Q^{\pi}(s, a) = \mathbb{E}\left[\sum_{i=1}^{T} \gamma^{i-1} r_i \mid S_t = s, A_t = a\right]$
  – Worked example (path finding): http://mnemstudio.org/path-finding-q-learning-tutorial.htm

Background on reinforcement learning
• Q-learning update (path-finding example with $\gamma = 0.8$):
  – $Q(s, a) \leftarrow r + \gamma \max_{a'} Q(s', a')$
  – e.g., $Q = 0 + 0.8 \times \max(0, 100) = 80$ and $Q = 100 + 0.8 \times \max(0, 0, 0) = 100$
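To make the update rule above concrete, the following is a minimal sketch of tabular Q-learning on a small six-state path-finding task in the spirit of the linked tutorial; the reward matrix, goal state and episode count are illustrative assumptions rather than values taken from the slides.

# Minimal tabular Q-learning sketch (hypothetical six-room example; the
# reward matrix R below is illustrative, not from the slides).
import numpy as np

R = np.array([            # R[s, a] = immediate reward, -1 marks an invalid move
    [-1, -1, -1, -1,  0, -1],
    [-1, -1, -1,  0, -1, 100],
    [-1, -1, -1,  0, -1, -1],
    [-1,  0,  0, -1,  0, -1],
    [ 0, -1, -1,  0, -1, 100],
    [-1,  0, -1, -1,  0, 100],
])
gamma, n_states = 0.8, R.shape[0]
Q = np.zeros_like(R, dtype=float)

rng = np.random.default_rng(0)
for episode in range(1000):
    s = rng.integers(n_states)               # random start state
    while s != 5:                             # state 5 = goal ("exit")
        valid = np.where(R[s] >= 0)[0]        # actions allowed from s
        a = rng.choice(valid)                 # pure exploration
        s_next = a                            # in this example, action = next state
        Q[s, a] = R[s, a] + gamma * Q[s_next].max()
        s = s_next

print((Q / Q.max() * 100).round())            # normalised Q-table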

Background on reinforcement learning
• Exploitation vs. exploration: ε-greedy
  – With probability ε take a random action (explore); otherwise take the action with the highest Q-value (exploit).
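The slide does not spell the rule out, so here is a minimal ε-greedy sketch, assuming a tabular Q indexed as Q[state, action].

# Sketch of epsilon-greedy action selection over a Q-table (names are illustrative).
import numpy as np

def epsilon_greedy(Q, state, epsilon, rng=np.random.default_rng()):
    """Pick a random action with probability epsilon, else the greedy one."""
    n_actions = Q.shape[1]
    if rng.random() < epsilon:
        return int(rng.integers(n_actions))   # explore
    return int(np.argmax(Q[state]))           # exploit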

Background on reinforcement learning
• The tabular version does not scale with the size of the action/state space
• Classical Q network [1]
  – Function approximation
  – Approximate Q(s, a) via a neural network: $Q(s, a) \approx Q^*(s, a, \theta)$
  – $L(\theta) = \mathbb{E}\left[\left(r + \gamma \max_{a'} Q(s', a'; \theta) - Q(s, a; \theta)\right)^2\right]$, where the first term is the target Q
  – Unstable: the target changes every time $\theta$ is updated

Background on reinforcement learning
• Deep Q Network (DQN) [2]
  – Experience replay: draw randomly from a buffer of transitions (s, a, s', r), (s', a', s'', r'), …
  – $Q(s, a) \leftarrow r + \gamma \max_{a'} Q(s', a'; \theta^-)$
  – $L(\theta) = \mathbb{E}\left[\left(r + \gamma \max_{a'} Q(s', a'; \theta^-) - Q(s, a; \theta)\right)^2\right]$
  – Reward clipped to [-1, 1]
• Double DQN (DDQN) [3]
  – Separate action selection from action evaluation
  – $Q(s, a) \leftarrow r + \gamma Q_2(s', \arg\max_{a'} Q_1(s', a'))$
  – $L(\theta) = \mathbb{E}\left[\left(r + \gamma Q_2(s', \arg\max_{a'} Q_1(s', a')) - Q_1(s, a; \theta)\right)^2\right]$
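As an illustration of the DDQN loss above, here is a hedged PyTorch sketch of the target and loss computation; the network objects, batch layout and discount factor are assumptions made for the example, not part of [2] or [3].

# Sketch of the Double-DQN target/loss computation in PyTorch (network
# architecture and variable names are illustrative assumptions).
import torch
import torch.nn as nn

def ddqn_loss(online_net, target_net, batch, gamma=0.99):
    # batch from the replay buffer: states, long-typed actions, rewards,
    # next states, and a float "done" mask.
    s, a, r, s_next, done = batch
    # The online network (Q1) selects the action, the target network (Q2) evaluates it.
    with torch.no_grad():
        next_a = online_net(s_next).argmax(dim=1, keepdim=True)
        next_q = target_net(s_next).gather(1, next_a).squeeze(1)
        target = r + gamma * (1 - done) * next_q
    q = online_net(s).gather(1, a.unsqueeze(1)).squeeze(1)
    return nn.functional.mse_loss(q, target)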

Overview
• Background on reinforcement learning
  – Introduction
  – Q-learning
  – Application in defending against DDoS attacks
• Adversarial attacks against RL models
  – Test time attack
  – Training time attack
• Defence

Background on reinforcement learning
• Distributed Denial-of-Service (DDoS) attacks still occur almost every hour globally
  – http://www.digitalattackmap.com/
  – Statistics are gathered by Arbor's Active Threat Level Analysis System from 330+ ISP customers, covering 130 Tbps of global traffic
• Can RL be applied to throttle flooding DDoS attacks?

Background on reinforcement learning
• Problem setup [4]
  – A mixed set of legitimate users & attackers
  – Aggregated traffic at the server s should stay within $[L_s, U_s]$
  – RL agents (deployed at upstream routers) decide the drop rates
  – No anomaly detection – expensive
  – [Figure: topology with the server S to protect behind routers R1–R13 and hosts H1–H6 (legitimate or malicious users); the RL agents sit at a subset of the routers]
  K. Malialis and D. Kudenko, Multiagent Router Throttling: Decentralized Coordinated Response Against DDoS Attacks, in Proc. of AAAI 2013

Background on reinforcement learning
• RL problem formalisation
  – State space
    • Aggregated traffic that arrived at the router over the last T seconds
  – Action set
    • Percentage of traffic to drop: 0, 10%, 20%, 30%, …, 90%
  – Reward
    • Is the aggregated traffic at s greater than $U_s$?
    • The legitimate traffic that reached s
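One possible reading of the reward bullets above is sketched below; the exact reward shaping in [4] may differ, and the function signature is purely illustrative.

# Hedged sketch of a reward of the form described above (the exact shaping
# in [4] may differ; U_s and the traffic measurements are illustrative).
def reward(aggregate_at_s, legitimate_at_s, legitimate_sent, U_s):
    """Penalise overload; otherwise reward the fraction of legitimate traffic kept."""
    if aggregate_at_s > U_s:                  # server s is overloaded
        return -1.0
    return legitimate_at_s / legitimate_sent  # fraction of legitimate traffic that got through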

Background on reinforcement learning
• Training (DDQN)
  – [Figure: the state is the aggregated traffic observed at the throttling routers (e.g., 50M, 70M, 20M); the network outputs a value and an action, e.g., value 0.8 with drop rates (0.4, 0.6, 0.3), …, value 0.5 with drop rates (0.7, 0.6, 0.6)]

Background on reinforcement learning
$L(\theta) = \mathbb{E}\left[\left(r + \gamma Q_2(s', \arg\max_{a'} Q_1(s', a')) - Q_1(s, a; \theta)\right)^2\right]$

Background on reinforcement learning
• Test
  – 10,000 cases (which may not have been seen in training)
  – [Figure: cumulative percentage (%) of cases vs. average normalised reward]
  – Average reward ≈ 0.7
  – Reward ≤ 0.6 in 35% of cases

Overview
• Background on reinforcement learning
  – Introduction
  – Q-learning
  – Application in defending against DDoS attacks
• Adversarial attacks against RL models
  – Test time attack
  – Training time attack
• Defence

Adversarial attacks against RL models
• Test time attacks
  – Manipulate the environment observed by the agent [5]
  – Without attack: $\ldots, s_t, a_t, r_t, s_{t+1}, a_{t+1}, r_{t+1}, s_{t+2}, \ldots$
  – With attack: $\ldots, s_t + \delta_t, a'_t, r'_t, s_{t+1} + \delta_{t+1}, a'_{t+1}, r'_{t+1}, s_{t+2} + \delta_{t+2}, \ldots$

Adversarial attacks against RL models
• Craft the perturbation using a loss $J(\theta, x, y)$
  – $y$: softmax of the Q-values, i.e., the probability of taking each action:
    $\pi_i = \dfrac{e^{Q(s, a_i)}}{\sum_{a_j} e^{Q(s, a_j)}}$
  – Example: the RL agent ($\theta$) maps Q-values (1.5, 2.2, …, 0.4, 3.6) to probabilities (0.07, 0.15, …, 0.02, 0.61)
  – $J$: cross-entropy loss between $y$ and the distribution that places all weight on the action with the highest Q-value, e.g., (0, 0, …, 0, 1) – similar to the one-hot vector in supervised learning
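A hedged sketch of how such a perturbation could be computed at test time is shown below (PyTorch); q_net, epsilon and the sign-of-gradient step follow the FGSM-style formulation, and all names are illustrative rather than taken from [5].

# Hedged sketch of an FGSM-style test-time perturbation of the observed state
# using the loss described above (network and epsilon are illustrative).
import torch
import torch.nn.functional as F

def perturb_state(q_net, state, epsilon=0.01):
    s = state.clone().requires_grad_(True)
    q = q_net(s)                                   # Q-value for each action
    y = F.softmax(q, dim=-1)                       # "probability" of each action
    target = F.one_hot(q.argmax(dim=-1), q.shape[-1]).float()  # all weight on the best action
    loss = -(target * torch.log(y + 1e-12)).sum()  # cross-entropy J
    loss.backward()
    return (state + epsilon * s.grad.sign()).detach()  # ascend J to degrade the policy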

Adversarial attacks against RL models
• Timing of the attack
  – Heuristic method [6]: launch the attack only when the agent has a strong preference for one action, i.e., when
    $c_s = \max_a \dfrac{e^{Q(s,a)/T}}{\sum_{a_k} e^{Q(s,a_k)/T}} - \min_a \dfrac{e^{Q(s,a)/T}}{\sum_{a_k} e^{Q(s,a_k)/T}} > \beta$
    ($T$ is a softmax temperature)
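A small sketch of this timing heuristic, with T and β as illustrative parameters:

# Attack only when the softmax of the Q-values is sharply peaked.
import numpy as np

def should_attack(q_values, T=1.0, beta=0.5):
    z = np.exp((q_values - q_values.max()) / T)    # numerically stable softmax with temperature T
    p = z / z.sum()
    return (p.max() - p.min()) > beta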

Adversarial attacks against RL models
• Timing of the attack [8] – "brute-force" search
  – Consider all possible N consecutive perturbations
  – Evaluate the attack damage at step t + M (M ≥ N)
  – $s_t, a_t, s_{t+1}, a_{t+1}, \ldots, s_{t+N-1}, \ldots, s_{t+M}$

Adversarial attacks against RL models
• Timing of the attack [8] – "brute-force" search
  – Train a prediction model $(s_t, a_t) \rightarrow s_{t+1}$ on trajectories $\{(s_t, a_t), (s_{t+1}, a_{t+1}), \ldots, (s_{t+M}, a_{t+M})\}$
  – Predict the subsequent states and actions
  – Assess the potential damage of all possible strategies
  – Danger Awareness Metric (DAM): $DAM = T(s'_{t+M}) - T(s_{t+M})$
    • $T$: a domain-specific definition, e.g., the distance between the car and the centre of the road

Adversarial attacks against RL models
• Timing of the attack [8]
  – Train an antagonist model
  – Learns the optimal attack strategy automatically, without any domain knowledge
  – Maintains a policy $s_t \rightarrow (p_t, a'_t)$
  – If $p_t > 0.5$, add the perturbation to trigger $a'_t$; otherwise the agent takes its original action $a_t$
  – Antagonist's reward: the negative of the agent's reward

Adversarial attacks against RL models
• Black-box attack [9]
  – Train a proxy model that learns a task related to the target agent's policy
  – S threat model
    • Only has access to the states
    • Approximate an expectation of the state transition function:
      $\mathit{psychic}(s_t, \theta_P) \approx \mathbb{E}_{\pi_T}[P(s_{t+1} \mid s_t)]$
  – SR threat model
    • Has access to the states and rewards
    • Estimate the value $V$ of a given state under the target policy $\pi_T$:
      $\mathit{assessor}(s_t, \theta_A) \approx \mathbb{E}_{\pi_T}\left[\sum_{k=0}^{\infty} \gamma^k r_{t+k+1}\right] = V^{\pi_T}(s_t)$

Adversarial attacks against RL models
• Black-box attack [9]
  – SA threat model
    • Has access to states and actions
    • Approximate the target's policy $\pi_T$:
      $\mathit{imitator}(s_t, \theta_I) \approx \pi_T(s_t)$
  – SRA threat model
    • Has access to states, actions and rewards
    • Action-conditioned psychic (AC-psychic):
      $\mathit{AC\text{-}psychic}(s_t, \theta_P) \approx \mathbb{E}_{\pi_T}[P(s_{t+1} \mid s_t, a_t)]$
    • Combine the $\mathit{assessor}$ and the $\mathit{AC\text{-}psychic}$ to decide whether to perturb the state

Adversarial attacks against RL models
• Black-box attack [9]
  – SRA threat model
  – [Figure: the snooping-attack pipeline, where the adversary's knowledge $\mathcal{K} \in \{S, SR, SA\}$]

Adversarial attacks against RL models
• Black-box attack [9]
– Surrogate: assume the adversary has access to the target agent’s environment and can train an identical model

Overview
• Background on reinforcement learning
  – Introduction
  – Q-learning
  – Application in defending against DDoS attacks
• Adversarial attacks against RL models
  – Test time attack
  – Training time attack
• Defence

Adversarial attacks against RL models
• Training time attack
  – Without attack: $\ldots, (s_t, a_t, s_{t+1}, r_t), (s_{t+1}, a_{t+1}, s_{t+2}, r_{t+1}), \ldots$
  – With attack: $\ldots, (s_t, a_t, s_{t+1} + \delta_{t+1}, r'_t), (s_{t+1} + \delta_{t+1}, a'_{t+1}, s_{t+2} + \delta_{t+2}, r'_{t+1}), \ldots$
  – Purpose: generate $\delta_{t+1}$ so that the agent will not take the next action $a_{t+1}$
  – Cross-entropy loss: $J = -\sum_i p_i \log \pi_i$
    • $\pi_i = \dfrac{e^{Q(s, a_i)}}{\sum_{a_j} e^{Q(s, a_j)}}$: prob. of taking action $a_i$
    • $p_i = \begin{cases} 1, & \text{if } a_i = a_{t+1} \\ 0, & \text{otherwise} \end{cases}$
    • Maximising $J = -\log \pi_{t+1}$ minimises the prob. of taking $a_{t+1}$
  – $\delta = \alpha \cdot \mathrm{Clip}_\epsilon\!\left(\dfrac{\partial J}{\partial s}\right)$
  – What about targeted attacks?
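Below is a hedged sketch of the training-time perturbation described above, i.e. maximising J so that a_{t+1} becomes less likely; the q_net object, step size alpha and clip bound eps are illustrative assumptions.

# Hedged sketch: perturb s_{t+1} (a single state) before the agent learns
# from it, so that the action a_{t+1} becomes less likely (names illustrative).
import torch
import torch.nn.functional as F

def poison_next_state(q_net, s_next, a_next, alpha=0.01, eps=0.05):
    s = s_next.clone().requires_grad_(True)
    log_pi = F.log_softmax(q_net(s), dim=-1)       # log prob. of each action
    J = -log_pi[..., a_next].sum()                 # cross-entropy w.r.t. a_{t+1}
    J.backward()
    delta = alpha * s.grad.clamp(-eps, eps)        # alpha * Clip_eps(dJ/ds)
    return (s_next + delta).detach()               # increasing J discourages a_{t+1}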

Overview
• Background on reinforcement learning
  – Introduction
  – Q-learning
  – Application in defending against DDoS attacks
• Adversarial attacks against RL models
  – Test time attack
  – Training time attack
• Defence

Defence
• Adversarial training [7]
  – Calculate $\delta$ using the attacker's strategy: $(s_t, a_t, s_{t+1} + \delta_{t+1}, r'_t)$
  – Generate experience $(s_{t+1}, a'_{t+1}, s'_{t+2}, r'_{t+1})$ for the agent to train on
    • $s_{t+1}$: the untampered state
    • $a'_{t+1} = \arg\max_a Q(s_{t+1} + \delta_{t+1}, a)$: a potentially non-optimal action → the agent explores more
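A hedged sketch of generating one such experience tuple is shown below; it assumes a classic Gym-style environment and reuses the perturb_state sketch from the test-time attack slide, and none of the names come from [7].

# Act according to the perturbed state, but store the untampered state.
def adversarial_experience(q_net, env, s_t1, perturb_state):
    s_adv = perturb_state(q_net, s_t1)             # attacker's delta applied to s_{t+1}
    a_adv = int(q_net(s_adv).argmax())             # potentially non-optimal action
    s_t2, r, done, _ = env.step(a_adv)             # classic Gym-style step (obs, reward, done, info)
    return (s_t1, a_adv, s_t2, r, done)            # train on the untampered state s_{t+1}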

Summary
• Reinforcement learning
  – State, action, reward
  – Value function, policy, model
  – Q-learning → Q-network → DQN → DDQN
• Adversarial reinforcement learning
  – Test time attack
    • Timing of the attack
    • Black-box attack
  – Training time attack
• Defence – adversarial training

References
• [1] R. S. Sutton and A. G. Barto, Introduction to Reinforcement Learning, First. Cambridge, MA, USA: MIT Press, 1998.
• [2] V. Mnih et al., “Playing Atari with Deep Reinforcement Learning,” CoRR, vol. abs/1312.5602, 2013.
• [3] H. V. Hasselt, A. Guez, and D. Silver, “Deep Reinforcement Learning with Double Q-learning,” eprint arXiv:1509.06461, Sep. 2015.
• [4] K. Malialis and D. Kudenko, “Multiagent Router Throttling: Decentralized Coordinated Response Against DDoS Attacks,” in Proc. of the 27th AAAI Conference on Artificial Intelligence, Washington, 2013, pp. 1551–1556.
• [5] S. Huang, N. Papernot, I. Goodfellow, Y. Duan, and P. Abbeel, “Adversarial Attacks on Neural Network Policies,” eprint arXiv:1702.02284, 2017.
• [6] Y.-C. Lin, Z.-W. Hong, Y.-H. Liao, M.-L. Shih, M.-Y. Liu, and M. Sun, “Tactics of Adversarial Attack on Deep Reinforcement Learning Agents,” eprint arXiv:1703.06748, Mar. 2017.
• [7] A. Pattanaik, Z. Tang, S. Liu, G. Bommannan, and G. Chowdhary, “Robust Deep Reinforcement Learning with Adversarial Attacks,” arXiv:1712.03632 [cs], Dec. 2017.

References
• [8] J. Sun, T. Zhang, X. Xie, L. Ma, et al., “Stealthy and Efficient Adversarial Attacks against Deep Reinforcement Learning,” in Proc. of AAAI 2020, pp. 5883–5891.
• [9] M. Inkawhich, Y. Chen, and H. Li, “Snooping Attacks on Deep Reinforcement Learning,” in Proc. of the 19th International Conference on Autonomous Agents and Multi-Agent Systems (AAMAS ’20), 2020, pp. 557–565.

Adversarial Reinforcement Learning in Autonomous Cyber Defence

Adversarial attacks against RL models
• Attacker: propagates through the network to compromise the critical server
• The defender applies RL to prevent the critical server from being compromised, and to preserve as many nodes as possible
• [Figure: a 100-node network. Legend: initially compromised nodes; critical node; possible migration destinations; nodes and links only visible to the defender; nodes and links visible to both the defender and the attacker]

Adversarial attacks against RL models
• Attacker: partial observability of the network topology
• [Figure: the attacker's view of the network, showing the critical node and the nodes being compromised]

Adversarial attacks against RL models
• Problem definition
  – State: [0, 0, …, 0, 0, 0, …, 0]
    • State of each node, 0: uncompromised, 1: compromised
    • State of each link, 0: on, 1: off
  – Action:
    • Actions 0 to N-1: isolate & patch node $i \in [0, N-1]$
    • Actions N to 2N-1: reconnect node $i \in [0, N-1]$
    • Actions 2N to 2N+M-1: migrate the critical node to one of the M destinations
    • Action 2N+M: take no action
  – Reward:
    • -1 if (1) the critical node is compromised or isolated, or (2) the action is invalid
    • Otherwise, proportional to the number of uncompromised nodes that can still reach the critical node
  – The attacker can only compromise a node x if there is a visible link between x and any compromised node
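To make the encoding concrete, here is a hedged sketch of the state vector and action decoding described above; N, M and the data structures are illustrative assumptions.

# Hedged sketch of the state encoding and action decoding described above.
import numpy as np

def encode_state(node_compromised, link_on):
    """State = per-node compromise flags followed by per-link on/off flags."""
    nodes = np.array(node_compromised, dtype=np.int8)   # 0: uncompromised, 1: compromised
    links = 1 - np.array(link_on, dtype=np.int8)        # 0: on, 1: off
    return np.concatenate([nodes, links])

def decode_action(a, N, M):
    """Map an action index to one of the operations listed above."""
    if a < N:
        return ("isolate_and_patch", a)
    if a < 2 * N:
        return ("reconnect", a - N)
    if a < 2 * N + M:
        return ("migrate_critical_node", a - 2 * N)      # to destination index
    return ("no_action", None)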

Adversarial attacks against RL models
• Without training-time attacks
• [Figure: the defender isolates nodes 90, 31, 62, 53 and 22; 82/100 nodes are preserved]

Adversarial attacks against RL models
• Training-time attack: manipulate states to prevent agents from taking optimal actions
  – $(s_t, a_t, s_{t+1}, r_t) \rightarrow (s_t, a_t, s_{t+1} + \delta_{t+1}, r'_t)$
  – Binary state → cannot use gradient-descent based methods
  – $\delta$: false positives & false negatives
  – The attacker cannot manipulate the states of all the observable nodes
    • $L_{FP}$: nodes that can be perturbed as false positives
    • $L_{FN}$: nodes that can be perturbed as false negatives
  – $\min Q(s_{t+1} + \delta_{t+1}, a_{t+1})$, where $a_{t+1}$ is the optimal action for $s_{t+1}$ that has been learned so far
    • Loop through $L_{FP}$ ($L_{FN}$) and flip the state of one node at a time
    • Rank all nodes based on ΔQ (the decrease of the Q-value caused by flipping the state)
    • Flip the states of the top K nodes
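A hedged sketch of this greedy flipping attack is given below; q_fn, L_FP, L_FN and the state layout are illustrative assumptions.

# Try each allowed false-positive/false-negative flip, rank by the drop in
# Q(s', a*), and apply the top-K flips.
def flip_attack(q_fn, s_next, a_star, L_FP, L_FN, K):
    base = q_fn(s_next, a_star)                    # Q-value of the optimal action so far
    candidates = []
    for node in list(L_FP) + list(L_FN):
        s_mod = s_next.copy()
        s_mod[node] = 1 - s_mod[node]              # flip one node's compromise flag
        candidates.append((base - q_fn(s_mod, a_star), node))
    candidates.sort(reverse=True)                  # largest decrease in Q first
    s_adv = s_next.copy()
    for _, node in candidates[:K]:
        s_adv[node] = 1 - s_adv[node]              # flip the top-K nodes
    return s_adv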

Adversarial attacks against RL models

Adversarial attacks against RL models
• Result

Adversarial attacks against RL models
• After training-time attacks
• [Figure: under the attack the defender instead isolates nodes 90, 31, 53, 87, 21 and 22]

Inversion defence method
• Aim: revert the perturbation / false readings
  – Attacker: $s_{t+1} \rightarrow s_{t+1} + \delta_{t+1}$; minimise $Q(s_{t+1} + \delta_{t+1}, a_{t+1})$; loop through $L_{FP}$ and $L_{FN}$; flip K nodes
  – Defender: $s_{t+1} + \delta_{t+1} \rightarrow s_{t+1} + \delta_{t+1} + \delta'_{t+1}$; maximise $Q(s_{t+1} + \delta_{t+1} + \delta'_{t+1}, a_{t+1})$; loop through all nodes; flip K' nodes
• Effective even if K' ≠ K
• Minimum impact on the normal training process (i.e., K = 0, K' > 0)
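A hedged sketch of the defender's side, mirroring the attacker's procedure; q_fn, n_nodes and K' are illustrative assumptions.

# Greedily flip the K' node states that most increase Q(s, a_{t+1}).
def invert_perturbation(q_fn, s_observed, a_star, n_nodes, K_prime):
    base = q_fn(s_observed, a_star)
    gains = []
    for node in range(n_nodes):                    # the defender loops through all nodes
        s_mod = s_observed.copy()
        s_mod[node] = 1 - s_mod[node]
        gains.append((q_fn(s_mod, a_star) - base, node))
    gains.sort(reverse=True)                       # largest increase in Q first
    s_rec = s_observed.copy()
    for _, node in gains[:K_prime]:
        s_rec[node] = 1 - s_rec[node]              # flip the K' most helpful nodes
    return s_rec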

Inversion defence method
• Before & after the defence method is applied
• [Figure: two bar charts. For each attack configuration (1 FP + 1 FN or 2 FPs + 2 FNs, with varying sets of perturbable nodes, e.g., FP {42, 32, 85, 34, 48, 50, 49, 89} and FN {64, 90}) and defence configuration (1 FP + 1 FN, 2 FPs + 2 FNs, 3 FPs + 3 FNs, no attack), the charts show the percentage of attacks that have no impact, cause fewer nodes to be preserved, or cause the critical server to be compromised, together with the average number of preserved nodes.]