Week 11: Adversarial Reinforcement Learning
COMP90073 Security Analytics
CIS, Semester 2, 2021
• Background on reinforcement learning
  – Introduction
  – Q-learning
  – Application in defending against DDoS attacks
• Adversarial attacks against RL models
  – Test time attack
  – Training time attack
• Defence
Background on reinforcement learning
• Application
  – https://www.tesla.com/videos/autopilot-self-driving-hardware-neighborhood-long
  – https://www.myrealfacts.com/2019/05/applications-of-reinforcement-learning.html
Background on reinforcement learning
• Introduction
  [Figure: agent-environment interaction loop; at each step t the agent observes state S_t, takes action A_t, and receives reward R_t from the environment]
  – Maximise the discounted cumulative reward over the long run: $R_t = \sum_{\tau=t}^{\infty} \gamma^{\tau-t} r_\tau$, with $\gamma \in (0,1]$
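To make the discounted return concrete, here is a minimal Python sketch (the function name and the example rewards are ours, not from the lecture):

```python
def discounted_return(rewards, gamma=0.9):
    """Discounted cumulative reward: R_t = sum over tau of gamma^(tau - t) * r_tau."""
    total = 0.0
    for offset, r in enumerate(rewards):   # offset = tau - t
        total += (gamma ** offset) * r
    return total

# Example: rewards 1, 0, 10 over three steps with gamma = 0.9 -> 1 + 0 + 0.81 * 10 = 9.1
print(discounted_return([1, 0, 10], gamma=0.9))
```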
Background on reinforcement learning
[Figure: grid-world maze example; the environment state is encoded as a grid of cell values that is updated as the agent moves]
Background on reinforcement learning
• Action
  – Up
  – Down
  – Right
• Reward: an immediate feedback on whether an action is good
  – In the range of [-1, 1]
– 1: reach the exit
– -0.8: move to a blocked cell
– -0.3: move to a visited cell
– -0.05: move to an adjacent cell
Background on reinforcement learning
• Policy (π): a mapping from state to action, i.e. a = π(s); it tells the agent what to do in a given state
  – e.g. a stochastic policy may assign a probability to each action: 0.9, 0.05, 0.05, 0
Background on reinforcement learning
• Value function: the future, long-term reward of a state
  – Value of state $s$ under policy $\pi$: $V^{\pi}(s) = \mathbb{E}\left[\sum_{i=1}^{T} \gamma^{i-1} r_i \mid S_t = s\right]$
  – Expected value of following policy $\pi$ starting from state $s$
  – Conditional on some policy $\pi$
Background on reinforcement learning
• Model of the environment: mimics the behaviour of the environment, e.g., given a state and an action, it predicts what the next state and reward might be.
Background on reinforcement learning
[Figure: the agent, which maintains a policy and value function, interacts with the environment and receives its assessment (state and reward)]
Background on reinforcement learning
• Classification
  – Value-based algorithms estimate the value function
  – Policy-based algorithms learn the policy directly
  – Actor-critic: the critic updates the action-value function, the actor updates the policy
Background on reinforcement learning
• Classification
– Model-free algorithms directly learn the policy and/or the value function
– Model-based algorithms first build up a model of how the environment works
Background on reinforcement learning
• Q-learning: estimate the action-value function $Q(s, a)$
  – Expected value of taking action $a$ in state $s$ and then following policy $\pi$:
    $Q^{\pi}(s, a) = \mathbb{E}\left[\sum_{i=1}^{T} \gamma^{i-1} r_i \mid S_t = s, A_t = a\right]$
  [Figure: path-finding example with states, actions and rewards, from http://mnemstudio.org/path-finding-q-learning-tutorial.htm]
Background on reinforcement learning
• Q-learning worked example (from the path-finding tutorial above, with γ = 0.8), applying the update $Q(s,a) \leftarrow r + \gamma \max_{a'} Q(s',a')$:
  – $Q = 100 + 0.8 \times \max(0, 0, 0) = 100$
  – $Q = 0 + 0.8 \times \max(0, 100) = 80$
Background on reinforcement learning
• Exploitation vs. exploration: ε-greedy (with probability ε take a random action to explore; otherwise take the action with the highest Q-value)
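As a hedged illustration of the two ideas above (the Q-learning update and ε-greedy exploration), here is a minimal tabular sketch; the environment interface (env.reset()/env.step()) and all hyper-parameters are assumptions, not the lecture's code. With alpha = 1 the update reduces to the worked example on the previous slide.

```python
import random
from collections import defaultdict

def epsilon_greedy(Q, state, actions, epsilon):
    """Explore with probability epsilon, otherwise exploit the best known action."""
    if random.random() < epsilon:
        return random.choice(actions)
    return max(actions, key=lambda a: Q[(state, a)])

def q_learning(env, actions, episodes=500, alpha=0.1, gamma=0.8, epsilon=0.1):
    Q = defaultdict(float)                           # Q[(state, action)], defaults to 0
    for _ in range(episodes):
        state, done = env.reset(), False             # assumed environment interface
        while not done:
            action = epsilon_greedy(Q, state, actions, epsilon)
            next_state, reward, done = env.step(action)
            best_next = max(Q[(next_state, a)] for a in actions)
            # Move Q(s, a) towards the target r + gamma * max_a' Q(s', a')
            Q[(state, action)] += alpha * (reward + gamma * best_next - Q[(state, action)])
            state = next_state
    return Q
```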
Background on reinforcement learning
• The tabular version does not scale with the action/state space
• Classical Q-network [1]
  – Function approximation
  – Approximate Q(s,a) via a neural network: Q(s,a) ≈ Q*(s,a,θ)
  – $L(\theta) = \mathbb{E}\left[\left(r + \gamma \max_{a'} Q(s',a';\theta) - Q(s,a;\theta)\right)^{2}\right]$, where $r + \gamma \max_{a'} Q(s',a';\theta)$ is the target Q-value
  – Unstable
Background on reinforcement learning
• Deep Q-Network (DQN) [2]
  – Experience replay: draw randomly from a buffer of transitions (s, a, s', r), (s', a', s'', r'), …
  – $Q(s,a) \leftarrow r + \gamma \max_{a'} Q(s', a'; \theta^{-})$, where $\theta^{-}$ are the parameters of a separate target network
  – $L(\theta) = \mathbb{E}\left[\left(r + \gamma \max_{a'} Q(s',a';\theta^{-}) - Q(s,a;\theta)\right)^{2}\right]$
  – Reward clipped to [-1, 1]
• Double DQN (DDQN) [3]
  – Separate action selection from action evaluation
  – $Q(s,a) \leftarrow r + \gamma\, Q_2\left(s', \arg\max_{a'} Q_1(s',a')\right)$
  – $L(\theta) = \mathbb{E}\left[\left(r + \gamma\, Q_2\left(s', \arg\max_{a'} Q_1(s',a')\right) - Q_1(s,a;\theta)\right)^{2}\right]$
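To show how DDQN separates action selection (online network, Q1) from action evaluation (target network, Q2), here is a small NumPy sketch of the target computation for a batch of transitions; q1, q2 and the array layout are assumptions for illustration, not the lecture's code:

```python
import numpy as np

def ddqn_targets(q1, q2, rewards, next_states, dones, gamma=0.99):
    """q1/q2 map a batch of states to per-action Q-values; dones is 1.0 for terminal steps."""
    best_actions = np.argmax(q1(next_states), axis=1)           # selection: online network
    q2_next = q2(next_states)                                   # evaluation: target network
    selected = q2_next[np.arange(len(best_actions)), best_actions]
    return rewards + gamma * (1.0 - dones) * selected           # no bootstrap from terminal states
```

The online network is then trained to reduce the squared difference between these targets and $Q_1(s, a; \theta)$.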
Background on reinforcement learning
• Distributed Denial-of-Service (DDoS) attacks still occur almost every hour globally
– http://www.digitalattackmap.com/
– Statistics are gathered by Arbor's Active Threat Level Analysis System from 330+ ISP customers with 130 Tbps of global traffic
• Can RL be applied to throttle flooding DDoS attacks?
Background on reinforcement learning
• Problem setup [4]
  – A mixed set of legitimate users & attackers
  – The aggregated traffic at the server should be kept within $[L_s, U_s]$
  – RL agents decide the drop rates
  – No anomaly detection (expensive)
  [Figure: example topology with routers (R) and hosts (H); legitimate or malicious users send traffic through the routers towards the server to protect]
  K. Malialis and D. Kudenko, "Multiagent Router Throttling: Decentralized Coordinated Response Against DDoS Attacks," in Proc. of AAAI 2013
Background on reinforcement learning
• RL problem formalisation
  – State space
    • Aggregated traffic arrived at the router over the last T seconds
  – Action set
    • Percentage of traffic to drop: 0, 10%, 20%, 30%, …, 90%
  – Reward
    • Is the aggregated traffic at the server s greater than $U_s$?
    • How much legitimate traffic reached s
  [Figure: the same router topology as on the previous slide]
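The following is a hedged sketch of a reward in the spirit of this formalisation; the exact reward used in [4] may differ, and the overload penalty and the "fraction of legitimate traffic delivered" shaping below are assumptions for illustration:

```python
def throttle_reward(total_at_server, legit_delivered, legit_sent, U_s):
    """Punish overloading the server, otherwise reward the share of legitimate traffic delivered."""
    if total_at_server > U_s:      # aggregated traffic at the server exceeds its capacity
        return -1.0                # assumed penalty when the server is overloaded
    if legit_sent == 0:
        return 0.0
    return legit_delivered / legit_sent

# Example: capacity 100M, 80M arrives in total, 35M of the 50M legitimate traffic survives throttling
print(throttle_reward(80e6, 35e6, 50e6, 100e6))   # 0.7
```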
Background on reinforcement learning
• Training (DDQN)
  [Figure: training illustration on the router topology, with example traffic volumes (50M, 70M, 20M) and example reward / drop-rate pairs 0.8, (0.4, 0.6, 0.3) and 0.5, (0.7, 0.6, 0.6)]
Background on reinforcement learning
$L(\theta) = \mathbb{E}\left[\left(r + \gamma\, Q_2\left(s', \arg\max_{a'} Q_1(s',a')\right) - Q_1(s,a;\theta)\right)^{2}\right]$
Background on reinforcement learning
– 10,000 test cases (which may not have been seen in training)
[Figure: cumulative percentage (%) vs. average normalised reward; the average reward is approximately 0.7, but the reward is at most 0.6 in 35% of cases]
Adversarial attacks against RL models
• Test time attacks
– Manipulate the environment observed by the agent [5]
  – Without attack: $\ldots, s_t, a_t, r_t, s_{t+1}, a_{t+1}, r_{t+1}, s_{t+2}, \ldots$
  – With attack: $\ldots, s_t + \delta_t, a'_t, r'_t, s_{t+1} + \delta_{t+1}, a'_{t+1}, r'_{t+1}, s_{t+2} + \delta_{t+2}, \ldots$
Adversarial attacks against RL models
• Crafting the perturbation: the loss $J(\theta, x, y)$
  – y: softmax of the Q-values, i.e., the probability of taking each action:
    $\pi_i = \frac{e^{Q(s, a_i)}}{\sum_{a_j} e^{Q(s, a_j)}}$
  – J: cross-entropy loss between y and the distribution that places all weight on the action with the highest Q-value (similar to the one-hot vector in supervised learning)
  [Figure: the RL agent (θ) maps the state to a Q-value for each action (e.g., 1.5, 2.2, …, 3.6); the softmax turns them into action probabilities (e.g., 0.07, 0.15, …, 0.61); the one-hot target places all weight on the highest-Q action]
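Below is a minimal PyTorch sketch of this attack as an FGSM-style step on J, in the spirit of [5]; q_net, the state tensor and epsilon are placeholders rather than the lecture's code:

```python
import torch
import torch.nn.functional as F

def fgsm_state_perturbation(q_net, state, epsilon=0.01):
    """Perturb the observed state so the policy moves away from its preferred action."""
    state = state.clone().detach().requires_grad_(True)
    q_values = q_net(state)                                # one Q-value per action
    target = torch.argmax(q_values).detach()               # all weight on the highest-Q action
    # J(theta, x, y): cross-entropy between softmax(Q) and the one-hot target above
    loss = F.cross_entropy(q_values.unsqueeze(0), target.unsqueeze(0))
    loss.backward()
    # Step in the direction that increases J, i.e. away from the preferred action
    return (state + epsilon * state.grad.sign()).detach()
```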
Adversarial attacks against RL models
• Timing of the attack
– Heuristic method [6]: launch the attack only when
  $c_s = \max_a \frac{e^{Q(s,a)/T}}{\sum_{a_k} e^{Q(s,a_k)/T}} - \min_a \frac{e^{Q(s,a)/T}}{\sum_{a_k} e^{Q(s,a_k)/T}} > \beta$, where $T$ is the softmax temperature
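A small sketch of this timing rule, assuming a softmax temperature T and a threshold beta:

```python
import numpy as np

def should_attack(q_values, T=1.0, beta=0.5):
    """Attack only when the agent clearly prefers one action over another."""
    z = np.exp(np.asarray(q_values, dtype=float) / T)
    probs = z / z.sum()
    return (probs.max() - probs.min()) > beta    # the preference gap c_s

print(should_attack([1.0, 1.2, 1.1]))   # small gap -> False, not worth attacking
print(should_attack([0.1, 0.2, 5.0]))   # strong preference -> True, attack now
```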
Adversarial attacks against RL models
• Timing of the attack [8]: “brute-force” search
  – Consider all possible N consecutive perturbations
  – Evaluate the attack damage at step t + M (M ≥ N)
  – $s_t, a_t, s_{t+1}, a_{t+1}, \ldots, s_{t+N-1}, \ldots, s_{t+M}$
Adversarial attacks against RL models
• Timing of the attack [8]: “brute-force” search
  – Train a prediction model $(s_t, a_t) \rightarrow s_{t+1}$ on the collected trajectory $\{(s_t, a_t), (s_{t+1}, a_{t+1}), \ldots, (s_{t+M}, a_{t+M})\}$
  – Predict the subsequent states and actions
  – Assess the potential damage of all possible strategies
  – Danger Awareness Metric (DAM): $DAM = T(s'_{t+M}) - T(s_{t+M})$
    • $T$: domain-specific definition, e.g., the distance between the car and the centre of the road
Adversarial attacks against RL models
• Timing of the attack [8]
– Train an antagonist model
– Learn the optimal attack strategy automatically without any domain knowledge
– Maintain a policy: $s_t \rightarrow (p_t, a'_t)$
– If $p_t > 0.5$, add the perturbation to trigger $a'_t$; otherwise let the agent take its original action $a_t$
– Reward: negative of the agent’s reward
Adversarial attacks against RL models
• Black-box attack [9]
– Train a proxy model that learns a task related to the target agent's policy
– S threat model
  • Only has access to the states
  • Approximates an expectation of the state transition function:
    $\mathrm{psychic}(s_t, \theta_P) \approx \mathbb{E}_{\pi_T}\left[P(s_{t+1} \mid s_t)\right]$
– SR threat model
  • Has access to the states and rewards
  • Estimates the value V of a given state under the target policy $\pi_T$:
    $\mathrm{assessor}(s_t, \theta_A) \approx \mathbb{E}_{\pi_T}\left[\sum_{k=0}^{\infty} \gamma^k r_{t+k+1}\right] = V^{\pi_T}(s_t)$
Adversarial attacks against RL models
• Black-box attack [9]
  – SA threat model
    • Has access to states and actions
    • Approximates the target's policy $\pi_T$:
      $\mathrm{imitator}(s_t, \theta_I) \approx \pi_T(s_t)$
  – SRA threat model
    • Has access to states, actions and rewards
    • Action-conditioned psychic (AC-psychic):
      $\mathrm{AC\text{-}psychic}(s_t, \theta_P) \approx \mathbb{E}_{\pi_T}\left[P(s_{t+1} \mid s_t, a_t)\right]$
    • Combine the assessor and AC-psychic to decide whether to perturb the state
Adversarial attacks against RL models
• Black-box attack [9]
– SRA threat model
$\mathcal{K} \in \{S, SR, SA\}$
Adversarial attacks against RL models
• Black-box attack [9]
– Surrogate: assume the adversary has access to the target agent’s environment and can train an identical model
Adversarial attacks against RL models
• Training time attack
  – Without attack: $\ldots, (s_t, a_t, s_{t+1}, r_t), (s_{t+1}, a_{t+1}, s_{t+2}, r_{t+1}), \ldots$
  – With attack: $\ldots, (s_t, a_t, s_{t+1}+\delta_{t+1}, r'_t), (s_{t+1}+\delta_{t+1}, a'_{t+1}, s_{t+2}+\delta_{t+2}, r'_{t+1}), \ldots$
  – Purpose: generate $\delta_{t+1}$ so that the agent will not take the next action $a_{t+1}$
  – Cross-entropy loss: $J = -\sum_i p_i \log \pi_i$
    • $\pi_i = \frac{e^{Q(s, a_i)}}{\sum_{a_j} e^{Q(s, a_j)}}$: prob. of taking action $a_i$
    • $p_i = 1$ if $a_i = a_{t+1}$, and $0$ otherwise
    • Maximise $J = -\log \pi_{t+1}$, i.e. minimise the prob. of taking $a_{t+1}$
  – $\delta = \alpha \cdot \mathrm{Clip}_{\epsilon}\left(\frac{\partial J}{\partial s}\right)$
  – What about targeted attacks?
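A hedged PyTorch sketch of this perturbation step follows; q_net, alpha and epsilon are assumptions, and the clamping mirrors $\delta = \alpha \cdot \mathrm{Clip}_{\epsilon}(\partial J / \partial s)$:

```python
import torch
import torch.nn.functional as F

def poison_state(q_net, next_state, next_action, alpha=1.0, eps=0.01):
    """Craft delta so the agent becomes less likely to take next_action in next_state."""
    s = next_state.clone().detach().requires_grad_(True)
    q_values = q_net(s)
    # J = -log pi_{t+1}: cross-entropy against the action we want to suppress
    J = F.cross_entropy(q_values.unsqueeze(0), torch.tensor([next_action]))
    J.backward()
    delta = alpha * torch.clamp(s.grad, -eps, eps)   # Clip_eps of dJ/ds, scaled by alpha
    return (s + delta).detach()                      # ascending J lowers the prob. of next_action
```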
Defence
• Adversarial training [7]
  – Calculate $\delta_{t+1}$ using the attacker's strategy: $(s_t, a_t, s_{t+1} + \delta_{t+1}, r'_t)$
  – Generate the experience $(s_{t+1}, a'_{t+1}, s'_{t+2}, r'_{t+1})$ for the agent to train on
    • $a'_{t+1} = \arg\max_a Q(s_{t+1} + \delta_{t+1}, a)$
    • Untampered state, but a potentially non-optimal action, so the agent explores more
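A hedged sketch of one step of this adversarial-training loop, reusing the hypothetical poison_state helper from the sketch above; env, q_net and replay_buffer are placeholders, not the lecture's code:

```python
def adversarial_training_step(env, q_net, replay_buffer, s_next, alpha=1.0, eps=0.01):
    """The agent acts from the perturbed state, but the stored experience keeps the untampered state."""
    a_opt = int(q_net(s_next).argmax())                        # action the agent would normally take
    s_perturbed = poison_state(q_net, s_next, a_opt, alpha, eps)
    a_prime = int(q_net(s_perturbed).argmax())                 # potentially non-optimal action
    s_after, reward, done = env.step(a_prime)                  # assumed environment interface
    replay_buffer.append((s_next, a_prime, s_after, reward, done))   # untampered state, explored action
    return s_after, done
```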
Summary
• Reinforcement learning
  – State, action, reward
  – Value function, policy, model
  – Q-learning, Q-network, DQN, DDQN
• Adversarial reinforcement learning
  – Test time attack
    • Timing of the attack
    • Black-box attack
  – Training time attack
• Defence: adversarial training
References
• [1] R. S. Sutton and A. G. Barto, Introduction to Reinforcement Learning, First. Cambridge, MA, USA: MIT Press, 1998.
• [2] V. Mnih et al., “Playing Atari with Deep Reinforcement Learning,” CoRR, vol. abs/1312.5602, 2013.
• [3] H. V. Hasselt, A. Guez, and D. Silver, “Deep Reinforcement Learning with Double Q-learning,” eprint arXiv:1509.06461, Sep. 2015.
• [4] K. Malialis and D. Kudenko, “Multiagent Router Throttling: Decentralized Coordinated Response Against DDoS Attacks,” in Proc. of the 27th AAAI Conference on Artificial Intelligence, Washington, 2013, pp. 1551–1556.
• [5] S. Huang, N. Papernot, I. Goodfellow, Y. Duan, and P. Abbeel, “Adversarial Attacks on Neural Network Policies,” eprint arXiv:1702.02284, 2017.
• [6] Y.-C. Lin, Z.-W. Hong, Y.-H. Liao, M.-L. Shih, M.-Y. Liu, and M. Sun, “Tactics of Adversarial Attack on Deep Reinforcement Learning Agents,” eprint arXiv:1703.06748, Mar. 2017.
• [7] A. Pattanaik, Z. Tang, S. Liu, G. Bommannan, and G. Chowdhary, “Robust Deep Reinforcement Learning with Adversarial Attacks,” arXiv:1712.03632 [cs], Dec. 2017.
References
• [8] J. Sun, T. Zhang, X. Xie, L. Ma, et al., “Stealthy and Efficient Adversarial Attacks against Deep Reinforcement Learning,” in Proc. of AAAI 2020, pp. 5883–5891.
• [9] M. Inkawhich, Y. Chen, and H. Li, “Snooping Attacks on Deep Reinforcement Learning,” in Proc. of the 19th International Conference on Autonomous Agents and Multi-Agent Systems (AAMAS ’20), Richland, SC, 2020, pp. 557–565.
Adversarial Reinforcement Learning in Autonomous Cyber Defence
Adversarial attacks against RL models
• Attacker: propagates through the network to compromise the critical server
• The defender applies RL to prevent the critical server from being compromised, and to preserve as many nodes as possible
[Figure: example network of 100 nodes; legend: initially compromised nodes, critical node, possible migration destinations, nodes/links only visible to the defender, and nodes/links visible to the defender & the attacker]
Adversarial attacks against RL models
• Attacker: partial observability of the network topology
  [Figure: the attacker's partial view of the topology, with the critical node being compromised]
Adversarial attacks against RL models
• Problem definition
  – State: [0, 0, …, 0, 0, 0, …, 0]
    • State of each node, 0: uncompromised, 1: compromised
    • State of each link, 0: on, 1: off
  – Action
    • Action 0 ~ N-1: isolate & patch node $i \in [0, N-1]$
    • Action N ~ 2N-1: reconnect node $i \in [0, N-1]$
    • Action 2N ~ 2N+M-1: migrate the critical node to one of the M destinations
    • Action 2N+M: take no action
  – Reward
    • -1 if: (1) the critical node is compromised or isolated, or (2) the action is invalid
    • Otherwise proportional to the number of uncompromised nodes that can still reach the critical node
  – The attacker can only compromise a node x if there is a visible link between x and any compromised node
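A small sketch of the action encoding above (the helper name is ours, not from the lecture):

```python
def decode_action(a, N, M):
    """Map an action index to the operation it represents, for N nodes and M migration targets."""
    if a < N:
        return ("isolate_and_patch", a)             # actions 0 .. N-1
    if a < 2 * N:
        return ("reconnect", a - N)                 # actions N .. 2N-1
    if a < 2 * N + M:
        return ("migrate_critical_to", a - 2 * N)   # actions 2N .. 2N+M-1
    return ("no_op", None)                          # action 2N+M

print(decode_action(0, N=100, M=3))     # ('isolate_and_patch', 0)
print(decode_action(203, N=100, M=3))   # ('no_op', None)
```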
Adversarial attacks against RL models
• Without training-time attacks
  [Figure: the learned defence isolates nodes 90, 31, 62, 53 and 22; 82/100 nodes are preserved]
Adversarial attacks against RL models
• Training-time attack: manipulate states to prevent agents from taking optimal actions
  – $(s_t, a_t, s_{t+1}, r_t) \rightarrow (s_t, a_t, s_{t+1} + \delta_{t+1}, r'_t)$
  – Binary state: a gradient-descent based method cannot be used
  – $\delta_{t+1}$: false positives & false negatives
  – The attacker cannot manipulate the states of all the observable nodes
    • $L_{FP}$: nodes that can be perturbed as false positives
    • $L_{FN}$: nodes that can be perturbed as false negatives
  – $\min_{\delta_{t+1}} Q(s_{t+1} + \delta_{t+1}, a_{t+1})$, where $a_{t+1}$ is the optimal action for $s_{t+1}$ that has been learned so far
    • Loop through $L_{FP}$ ($L_{FN}$) and flip the state of one node at a time
    • Rank all nodes based on ΔQ (the decrease of the Q-value caused by flipping the state)
    • Flip the states of the top K nodes
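A hedged sketch of this greedy flipping attack; q_value and the binary state encoding are assumptions, and the check that each flip stays within the allowed false-positive/false-negative direction is omitted for brevity:

```python
def greedy_flip_attack(q_value, state, best_action, allowed_nodes, K):
    """Flip the K allowed node states that most reduce Q(s_{t+1}, a_{t+1})."""
    base = q_value(state, best_action)
    drops = []
    for node in allowed_nodes:                  # nodes in L_FP or L_FN
        flipped = list(state)
        flipped[node] = 1 - flipped[node]       # flip one node's state at a time
        drops.append((base - q_value(flipped, best_action), node))
    drops.sort(reverse=True)                    # rank by the decrease of the Q-value
    perturbed = list(state)
    for _, node in drops[:K]:
        perturbed[node] = 1 - perturbed[node]   # flip the states of the top K nodes
    return perturbed
```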
Adversarial attacks against RL models
• After training-time attacks
  [Figure: after training-time attacks the agent instead isolates nodes 53, 90, 31, 87, 21 and 22]
Inversion defence method
• Aim: revert the perturbation / false readings
  – Attack: $s_{t+1} \rightarrow s_{t+1} + \delta_{t+1}$, minimise $Q(s_{t+1} + \delta_{t+1}, a_{t+1})$; loop through $L_{FP}$ and $L_{FN}$, flip K nodes
  – Defence: $s_{t+1} + \delta_{t+1} \rightarrow s_{t+1} + \delta_{t+1} + \delta'_{t+1}$, maximise $Q(s_{t+1} + \delta_{t+1} + \delta'_{t+1}, a_{t+1})$; loop through all nodes, flip K' nodes
• Effective even if K' ≠ K
• Minimum impact on the normal training process (i.e., K = 0, K' > 0)
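A hedged sketch of the inversion step, mirroring the attack sketch above but maximising the Q-value; q_value and the state encoding are again assumptions:

```python
def invert_perturbation(q_value, observed_state, best_action, K_prime):
    """Greedily flip K' node states to undo the attacker's perturbation (maximise Q)."""
    base = q_value(observed_state, best_action)
    gains = []
    for node in range(len(observed_state)):     # the defender loops through all nodes
        flipped = list(observed_state)
        flipped[node] = 1 - flipped[node]
        gains.append((q_value(flipped, best_action) - base, node))
    gains.sort(reverse=True)                     # rank by the increase of the Q-value
    restored = list(observed_state)
    for gain, node in gains[:K_prime]:
        if gain > 0:                             # only keep flips that actually help
            restored[node] = 1 - restored[node]
    return restored
```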
Inversion defence method
• Before & after the defence method is applied
  [Figure: bar charts comparing attacks before and after the defence for the 1 FP + 1 FN setting (FP ∈ {42, 32, 85, 34, 48, 50}, FN ∈ {64, 90}); for each setting the charts show how many attacks have no impact, cause fewer nodes to be preserved, or cause the critical server to be compromised, together with the average number of preserved nodes]