
Received July 20, 2020, accepted July 30, 2020, date of publication August 6, 2020, date of current version August 19, 2020.
Digital Object Identifier 10.1109/ACCESS.2020.3014791
Deep Reinforcement Learning-Based Access Control for Buffer-Aided Relaying Systems With Energy Harvesting
HAODI ZHANG 1, DI ZHAN1, CHEN JASON ZHANG 2, (Member, IEEE),
KAISHUN WU 1, (Member, IEEE), YE LIU1, AND SHENG LUO 1, (Member, IEEE)
1College of Computer Science and Software Engineering, Shenzhen University, Shenzhen 518061, China 2Department of Computer Science and Engineering, Hong Kong University of Science and Technology, Hong Kong
Corresponding authors: Sheng Luo (sluo@szu.edu.cn) and Ye Liu (ly@szu.edu.cn)
This work was supported in part by the National Natural Science Foundation of China under Grant 61806132 and Grant 61801304, in part by the Guangdong NSF under Grant 2020A1515010741, and in part by Tencent Rhino-Bird Young Faculty Open Fund and the Priming Research Fund in Shenzhen University.
ABSTRACT This paper considers a buffer-aided relaying system with multiple source-destination pairs and a relay node (RN) with energy harvesting (EH) capability. The RN harvests energy from the ambient environment and uses the harvested energy to forward the sources' information packets to the corresponding destinations. It is assumed that information on the EH and channel gain processes is unavailable. Thus, a model-free deep reinforcement learning (DRL) method, specifically deep Q-learning, is applied to learn an optimal link selection policy directly from historical experience to maximize the system utility. In addition, by taking advantage of the structural features of the considered system, a pretraining scheme is proposed to accelerate the training convergence of the deep Q-network. Experiment results show that the proposed pretraining method can significantly reduce the required training time. Moreover, the performance of the transmission policy obtained by deep Q-learning is compared with that of several conventional transmission schemes, and it is shown that the policy obtained with our proposed model achieves better performance.
INDEX TERMS Energy harvesting, buffer-aided relaying, Markov decision process, deep reinforcement learning.
I. INTRODUCTION
Cooperative relaying communication, in which a relay node helps to forward the source's information to the destination, is capable of attaining significant throughput and reliability improvements [1]. It has also been shown that by adding a data buffer at the relay node (buffer-aided relaying), the throughput and reliability of a cooperative relaying system can be further improved [2]–[4]. Usually, the nodes of buffer-aided relaying systems are powered by a power line (e.g., at a base station) or by a battery with limited operating life. However, in many environments, e.g., hard-to-reach places, it is costly or even impossible to install dedicated power lines or replace batteries regularly. To overcome these challenges, energy harvesting (EH) has been considered as
one of the promising technologies [5]. By collecting ambient energy from environments, e.g., solar, wind, thermal, vibration, radio frequency signal, and converting them into electrical energy, EH technology can reduce the dependence on the conventional sources of energy and prolong the lifetimes of EH devices [6].
Hence, EH based buffer-aided relaying has been studied for different network models [7]–[10]. Generally, implementing efficient EH communication must address the challenges of the limited amount of energy that can be harvested and the time-varying harvestable energy and channel gains. In [7], by taking both the quality of the links and the energy stored in the RNs into consideration, the benefits of relay selection among multiple EH based buffer-aided relay nodes (RNs) were investigated. In [8], the achievable throughput of a buffer-aided two-way relaying system with EH was derived. The optimal transmission strategy for a three-node

buffer-aided relaying system was studied in [9], where both the source node and the RN harvest energy from the environment. In [10], the delay constraint for a buffer-aided relaying system with energy harvesting was considered and the system throughput was investigated. In these works, it was assumed that the EH and channel gain processes are either known ahead of time (non-causal case) or that their distributions are available. However, in practical systems, this information on the EH and channel gain processes is usually unknown.
To overcome such challenges, model-free techniques have been discussed for designing efficient transmission strategies [11]–[14]. One promising approach is reinforcement learning (RL), a family of algorithms that can optimize system performance in unknown environments [15], [16]. In [13], an EH point-to-point communication system is investigated; the EH and channel gain processes are modeled as Markov processes, and Q-learning is used to learn a transmission power allocation policy that maximizes the amount of data arriving at the destination. The authors of [14] also considered an EH point-to-point communication system in which the energy and data arrivals are formulated as Markov processes, and Q-learning is used to find the optimal transmission policy that maximizes the expected total transmitted data when the transmitter is active. In [17], an energy-efficient resource allocation strategy was investigated for a body area network using an RL method. In [18] and [19], the actor-critic RL method was adopted to design transmission schemes for wireless communication systems. Specifically, in [18], user scheduling and resource allocation in heterogeneous networks powered by hybrid energy are studied, and actor-critic RL is used to maximize the energy efficiency of the overall network. In [19], the problem of energy management for EH wireless sensor nodes is considered and formulated as an RL problem with continuous state and action spaces. In [20] and [21], distributed RL-based communication schemes were investigated for EH networks. All these works have shown the potential of applying RL in designing communication schemes for EH communication networks.
In this paper, we consider a buffer-aided relaying system in which multiple source-destination pairs communicate with the help of an RN. The RN has no reliable power supply; it harvests energy from the environment and stores it in its battery to support wireless communication. This system can be viewed as a communication system that is temporarily set up in an emergency or hard-to-reach environment; the RN can be a base station that cannot be powered by a power line. It is also assumed that the RN has a data buffer and thus can dynamically choose one link from the source-relay and relay-destination links in each time slot to deliver data packets. We study how to effectively control the access of each link under the energy harvesting constraint to maximize the system utility. For this system setup, the decision (selecting which link communicates) in the current time slot affects future operations. Thus, the problem is usually modeled as a Markov decision process (MDP) if its transition probabilities can
be obtained accurately. However, as we assume no information on the EH and channel gain processes, model-free RL is applied to find the optimal link selection strategy. Since the considered system has a large state space, applying Q-learning faces the challenge of storing a large Q-value table. Thus, we deploy deep Q-learning with pretraining to learn the optimal link access policy directly from historical experience without any prior information about the system dynamics. Deep Q-learning is a combination of deep learning and reinforcement learning, first proposed in [22], [23]. The method was originally designed to learn control policies from high-dimensional inputs, such as raw images. The model and its variants have been successfully used in many dynamic decision making domains, especially those with very large state spaces, such as game playing. Since it was proposed in 2013, deep Q-learning has gained much attention, and many variants have been proposed, including double DQN [24], dueling DQN [25], DRQN [26], prioritized DQN [27], bootstrapped DQN [28], etc. Some applications can be found in [29]–[32]. These models differ in network structure, experience replay, ε-greedy exploration or reward function, but share the same core infrastructure. The main idea comes from the basic version of Q-learning [33].
The main contributions of this work are as follows:
1) We formulate the link selection problem in EH based buffer-aided relaying systems as an RL problem. To solve the optimization problem with deep Q-learning, we recast it in terms of states, actions, and rewards.
2) Based on some structural properties of the considered system, we propose a pretraining method to accelerate the training convergence of the proposed deep Q-learning method. Experiment results show that the proposed pretraining method can significantly reduce the training time.
3) Simulations are conducted to show the performance of the proposed deep Q-learning method. Experiment results show that, compared with the traditional methods, our model achieves better performance.
II. SYSTEM MODEL
Figure 1 shows the framework of the system. Suppose that in the system there are N source-destination pairs,
FIGURE 1. A buffer-aided relaying system with energy harvesting.
(UE_i, DE_i) for i = 1..N, where the RN receives packets from the source nodes via the source-relay links and forwards these packets to the corresponding destination nodes via the relay-destination links. The transmission rate of each link is discretized into different levels, represented by the amount of packets delivered in each time slot. The RN has a data buffer and a battery, for data and power storage, respectively. A certain share of the data buffer is allocated to each source-destination pair; this share stores the packets that have been received from the source node and have not yet been forwarded. The energy harvested from the environment is stored in the battery.
At any time t, the relaying system needs to decide which link should be selected for transmission. If a source-relay link is selected, RN receives data packets from the corresponding source node via the link. The amount of received data packets is under the constraints of the quality of the link (transmission rate) and the buffer quota allocated to the source-destination pair. If a relay-destination link is selected, RN forwards the buffered data packets to the corresponding destination node via the link. The amount of forwarded data packets is under the constraints of the quality of the link, existing buffered packets, and the remaining power in the battery. We assume that receiving a packet does not consume energy of the RN and the RN uses fixed transmitting power to forward data packets to the destinations.
The RN has EH capability and harvests energy from the ambient environment in each time slot t. A battery with a certain capacity stores the energy harvested from the environment by the RN. If the remaining energy in the battery is insufficient for forwarding data, the RN can choose to suspend the transmission and wait for more energy. In this paper, we model the energy harvesting at the RN as a Poisson process with a capacity constraint, similar to many previous works [7]–[10], [34], [35].
The relaying system, with the data buffer and the battery, is capable of memorizing historical states and decisions. The system state s^t at a certain time t is completely determined by the previous state s^{t-1} and the action taken at time t - 1. Hence the task of link selection for the relaying system can be formulated as a Markov decision process.
III. MDP FORMULATION
As stated above, the access control of a buffer-aided relaying system with energy harvesting is a Markov decision process. The system decides an access control action a given the current state s, and transitions from s to its successor state s′ after taking the action. In this section, we introduce the MDP formulation of the above relaying system, including the state space, the action space, the transition function and the reward function.
A. STATE SPACE
Let S = {s | s = (U, D, B, E)} be the state space, in which a system state s^t in time slot t consists of four components, namely the source-relay link state U^t, the relay-destination link state D^t, the data buffer state B^t, and the battery state E^t. The transmission states U^t and D^t indicate the quality of the links at time t. The source-relay link state U^t = {u_1^t, u_2^t, ..., u_N^t}, where u_i^t is the discretized achievable transmission rate from UE_i to the relay RN at time t. Similarly, the relay-destination link state D^t = {d_1^t, d_2^t, ..., d_N^t}, where d_i^t is the discretized achievable transmission rate from the RN to DE_i at time t. The states of the links in U^t and D^t are independent random variables with sample space L_link.

The data buffer state B^t = {b_1^t, b_2^t, ..., b_N^t}, where b_i^t is the amount of cached data packets for pair (UE_i, DE_i), namely the packets that have been received from UE_i and have not yet been forwarded to DE_i. Note that at any time t, for each source-destination pair (UE_i, DE_i), the amount of cached data cannot exceed the quota q_i allocated to the pair, i.e., 0 ≤ b_i^t ≤ q_i for i = 1..N and each t.

The energy state E^t indicates the remaining energy at the RN for forwarding packets to the destination nodes. At time t, the battery state E^t depends on 1) the previously remaining energy E^{t-1}, 2) the energy gain EG^{t-1} from the environment, 3) the energy cost EC^{t-1}, and 4) the battery capacity E_max. The energy gain follows a certain energy harvesting model, and the energy cost depends on the amount of packets successfully transmitted to the destination nodes. Details are given in the transition function.
B. ACTION SPACE
Given a system state s, the RN decides whether to transmit packets or not, and determines which link should be selected for transmission. Formally, the action space is A = {wait} ∪ {rcv_i | i = 1..N} ∪ {fwd_i | i = 1..N}. At any time t, the RN either waits for more harvested energy, receives packets from a source node UE_i (rcv_i), or forwards packets to a destination node DE_i (fwd_i). Note that the receiving and forwarding actions do not include the packet amount as a parameter. The packet amount of each action is determined by the following factors: the current link state, the current buffer state, and the remaining energy in the battery. When it is not waiting, the RN always receives or forwards as many packets as possible under the constraints of the transmission rates of the links, the buffer quota, and the energy. Hence, at time t, there are in total 2N + 1 candidate actions in A. The action rcv_i^t is to receive from source node UE_i as many packets as possible at time t, under the following constraints:
1) the amount of received packets cannot exceed the source-relay transmission rate u_i^t;
2) after receiving the packets, the total amount b_i^{t+1} of buffered data in the RN cannot exceed the quota q_i.
Therefore, the amount of data packets received by taking action rcv_i^t is min(u_i^t, q_i - b_i^t). Similarly, the action fwd_i^t is to forward to user DE_i as many data packets as possible at time t, under the following constraints:
1) the amount of forwarded packets cannot exceed the relay-destination transmission rate d_i^t;
2) the amount of forwarded packets cannot exceed the total amount b_i^t of buffered packets in the RN, i.e., only data that is already in the RN can be forwarded to DE_i;
3) the remaining energy E^t is sufficient for the transmission.
Therefore, the amount of data packets forwarded by taking action fwd_i^t is min(d_i^t, b_i^t, E^t).

C. TRANSITION FUNCTION
The transition function Tr : S × A → S is defined as follows: for a state s^t = (U^t, D^t, B^t, E^t) at time t, by taking the action a^t, the successor system state s^{t+1} = (U^{t+1}, D^{t+1}, B^{t+1}, E^{t+1}) is obtained as follows:
• The transmission states U^{t+1} = {u_1^{t+1}, u_2^{t+1}, ..., u_N^{t+1}} and D^{t+1} = {d_1^{t+1}, d_2^{t+1}, ..., d_N^{t+1}}, where u_i^{t+1}, d_i^{t+1} ∈ L_link. The transitions of the link states are decided by the environment model. In the experiment section, we verify our approach with multiple environment models. Note that in many real applications, the environment model is unavailable to the system.
• The buffer state B^{t+1} = {b_1^{t+1}, b_2^{t+1}, ..., b_N^{t+1}}, where
$$ b_i^{t+1} = \begin{cases} \min(b_i^t + u_i^t,\, q_i) & a^t = \mathrm{rcv}_i^t \\ \max\big(b_i^t - \min(d_i^t, E^t),\, 0\big) & a^t = \mathrm{fwd}_i^t \\ b_i^t & \text{otherwise.} \end{cases} \qquad (1) $$
If the action a^t is to receive data from UE_i, the amount of packets finally received by the RN is determined by the current source-relay transmission rate and the remaining buffer quota for UE_i at time t. If the action a^t is to send data to DE_i, the amount of packets received by DE_i is determined by the current relay-destination transmission rate, the remaining cached data, and the remaining energy in the relay at time t.
• The energy state E^{t+1} = min(E_max, E^t + EG^t - EC^t), where E_max is the maximum energy volume of the battery, EG^t is the energy gain from the environment (a random variable following the energy harvesting model), and EC^t is the energy consumption at time t,
$$ EC^t = \begin{cases} \min(d_i^t, b_i^t, E^t) & a^t = \mathrm{fwd}_i^t \\ 0 & \text{otherwise.} \end{cases} \qquad (2) $$

D. REWARD FUNCTION
The reward function R : S × A → R is defined as follows: given a system state s^t and an action a^t, suppose Tr(s^t, a^t) = s^{t+1}; then the reward is
$$ R(s^t, a^t) = \begin{cases} r_u \cdot (b_i^{t+1} - b_i^t) & a = \mathrm{rcv}_i^t,\ i = 1..N \\ r_d \cdot (b_i^t - b_i^{t+1}) & a = \mathrm{fwd}_i^t,\ i = 1..N \\ 0 & a = \mathrm{wait}, \end{cases} \qquad (3) $$
where r_u and r_d are unit rewards for successfully receiving and forwarding data, respectively. The expected reward of a system state s^t is defined as
$$ R(s^t) = \sum_{a^t \in A} \Pr(a^t \mid s^t) \Big( R(s^t, a^t) + \gamma \sum_{s^{t+1} \in S} \Pr(s^t, a^t, s^{t+1}) R(s^{t+1}) \Big), \qquad (4) $$
where Pr(a^t | s^t) is determined by the given decision policy, and Pr(s^t, a^t, s^{t+1}) depends on the environment model, which might be unknown in a real-world domain.

E. OPTIMIZATION PROBLEM
With the above state space, action space, transition function and reward function, we formulate our task of link selection as the following optimization problem. For each system state s^t, the system needs to find an optimal policy π : S → A that maximizes the expected long-term reward,
$$ \pi^* = \arg\max_{\pi} \mathbb{E}\Big[ \sum_{t=0}^{\infty} \gamma^t R(s^t, a^t \mid \pi) \Big]. \qquad (5) $$
For any given state s^t, the policy should give an optimal action,
$$ \pi^*(s^t) = \arg\max_{a^t \in A} \Big[ R(s^t, a^t) + \sum_{k=1}^{\infty} \gamma^k R\big(s^{t+k}, \pi^*(s^{t+k})\big) \Big], \qquad (6) $$
where s^{t+k} = Tr(s^{t+k-1}, π*(s^{t+k-1})).

Such an optimization problem can certainly be solved by traditional methods, such as value iteration and policy iteration. However, as mentioned in the previous section, these traditional methods require an explicit transition model. In many environments, the transition model is unavailable to the relaying system. For instance, the transmission quality of a link at a certain time is affected by many complex factors, which are difficult, sometimes even impossible, to model. A straightforward solution is to estimate the transition model by sampling, but then the performance becomes very sensitive to the sampling quality. In particular, for a dynamic environment where the transition model may change over time, it is difficult for a sampling-based estimate to achieve satisfying precision. Moreover, the state spaces in real-world domains are usually large, so these model-based methods suffer from high computational complexity. Therefore, in this paper we deploy deep reinforcement learning (DRL) and approximate the optimal policy with a deep Q-network, which requires neither an explicit transition model nor a large storage for maintaining the state-action value table.

As a well-known traditional reinforcement learning method, Q-learning maintains a lookup table of the state-action values, i.e., the Q-values Q(s, a). In order to learn the optimal Q-value function, the Q-learning algorithm makes use of the fixed point of the Bellman equation for the Q-value function,
$$ Q^*(s, a) = \sum_{s' \in S} T(s, a, s') \Big( R(s, a, s') + \gamma \max_{a' \in A} Q^*(s', a') \Big). \qquad (7) $$
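Before turning to the deep Q-network itself, we note that the formulation above is straightforward to simulate. The following minimal Python sketch implements one environment step, i.e., the buffer update (1), the energy consumption (2), and the reward (3). It is an illustrative reconstruction rather than the authors' implementation: the function and constant names, the unit rewards r_u = r_d = 1, and the uniform sampling of the next link states are our own assumptions.

```python
import random

N = 2                    # number of source-destination pairs (N = 2 as in Section V)
Q_MAX = 10               # buffer quota q_i per pair
E_MAX = 10               # battery capacity E_max
L_LINK = [0, 1, 2, 3]    # discrete transmission-rate levels

def step(state, action, energy_gain, r_u=1.0, r_d=1.0):
    """Apply action a^t to state s^t = (U, D, B, E) and return (s^{t+1}, reward).

    `action` is ("wait",), ("rcv", i), or ("fwd", i);
    `energy_gain` is EG^t drawn from the energy-harvesting model.
    """
    U, D, B, E = state
    B_next = list(B)
    reward, energy_cost = 0.0, 0

    if action[0] == "rcv":                        # eq. (1), case a^t = rcv_i^t
        i = action[1]
        delivered = min(U[i], Q_MAX - B[i])       # min(u_i^t, q_i - b_i^t)
        B_next[i] = B[i] + delivered
        reward = r_u * delivered                  # eq. (3), a = rcv_i^t
    elif action[0] == "fwd":                      # eq. (1), case a^t = fwd_i^t
        i = action[1]
        forwarded = min(D[i], B[i], E)            # min(d_i^t, b_i^t, E^t)
        B_next[i] = B[i] - forwarded
        energy_cost = forwarded                   # eq. (2)
        reward = r_d * forwarded                  # eq. (3), a = fwd_i^t
    # "wait": buffers unchanged, zero reward, no energy cost

    E_next = min(E_MAX, E - energy_cost + energy_gain)    # battery update
    # Next link states are drawn uniformly here for illustration only;
    # the real environment model that governs them is unknown to the system.
    U_next = [random.choice(L_LINK) for _ in range(N)]
    D_next = [random.choice(L_LINK) for _ in range(N)]
    return (U_next, D_next, B_next, E_next), reward
```

A policy can then be rolled out by repeatedly calling step with one of the 2N + 1 actions and an energy gain drawn from the harvesting model.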
The basic Q-learning looks up the Q-table to decide an optimal action a* to take (the one with the highest Q-value Q*(s, a)), given a system state s. However, for a domain with a large state space, the computation of the Q-table is costly in both time and space. DRL uses a deep neural network with weights θ as a function approximator, referred to as a Q-network. A Q-network can be trained by minimizing a sequence of loss functions L_i(θ_i) that changes at each iteration i,
$$ L_i(\theta_i) = \mathbb{E}_{s,a \sim \rho(\cdot)}\big[ (y_i - Q(s, a; \theta_i))^2 \big], \qquad (8) $$
where y_i = E_{s'∼E}[ r + γ max_{a'} Q(s', a'; θ_{i-1}) | s, a ] is the target for iteration i and ρ(s, a) is a probability distribution over states and actions. Note that r above is the instant reward, which corresponds to R(s, a) in our formulation. The distribution ρ(s, a) is selected by an ε-greedy strategy: in each training iteration, the system selects a random action with probability ε; otherwise, it follows the greedy strategy and selects the action with the highest estimated Q-value. To alleviate the problems of correlated data and non-stationary distributions, DQN maintains an experience replay memory D. The system executes the selected action, observes the reward and successor state, and then stores the transition in D. Then DQN performs a gradient descent step on (y_i - Q(s, a; θ_i))^2,
$$ \nabla_{\theta_i} L_i(\theta_i) = \mathbb{E}_{s,a \sim \rho(\cdot),\, s' \sim \mathcal{E}}\Big[ \big( r + \gamma \max_{a'} Q(s', a'; \theta_{i-1}) - Q(s, a; \theta_i) \big) \nabla_{\theta_i} Q(s, a; \theta_i) \Big], \qquad (9) $$
with a random batch of previous transitions.
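As an illustration of the loss (8) and the gradient step (9), the sketch below shows a minimal deep Q-network update in PyTorch. The network width, optimizer, learning rate, exploration rate, and state encoding are assumptions made for this example and are not the settings reported in the paper.

```python
import copy
import random
from collections import deque

import torch
import torch.nn as nn

STATE_DIM = 3 * 2 + 1     # u_i, d_i, b_i for each of N = 2 pairs, plus the battery level
N_ACTIONS = 2 * 2 + 1     # wait, rcv_1..rcv_N, fwd_1..fwd_N

q_net = nn.Sequential(nn.Linear(STATE_DIM, 64), nn.ReLU(),
                      nn.Linear(64, 64), nn.ReLU(),
                      nn.Linear(64, N_ACTIONS))
target_net = copy.deepcopy(q_net)          # holds θ_{i-1} for the target y_i
optimizer = torch.optim.Adam(q_net.parameters(), lr=1e-3)
replay = deque(maxlen=10_000)              # experience replay memory D
GAMMA, EPSILON = 0.9, 0.1

def select_action(state_vec):
    """ε-greedy step: random action with probability ε, greedy otherwise."""
    if random.random() < EPSILON:
        return random.randrange(N_ACTIONS)
    with torch.no_grad():
        return int(q_net(state_vec).argmax().item())

def train_step(batch_size=32):
    """One gradient descent step on (y_i - Q(s, a; θ_i))^2, as in eq. (8)-(9).
    `replay` holds tuples (state_tensor, action_index, reward, next_state_tensor)."""
    if len(replay) < batch_size:
        return
    batch = random.sample(replay, batch_size)
    s, a, r, s_next = zip(*batch)
    s, s_next = torch.stack(s), torch.stack(s_next)
    a = torch.tensor(a).unsqueeze(1)
    r = torch.tensor(r, dtype=torch.float32)
    q_sa = q_net(s).gather(1, a).squeeze(1)                     # Q(s, a; θ_i)
    with torch.no_grad():
        y = r + GAMMA * target_net(s_next).max(dim=1).values    # target y_i under θ_{i-1}
    loss = nn.functional.mse_loss(q_sa, y)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    # target_net.load_state_dict(q_net.state_dict()) would be called periodically.
```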
IV. PROPOSED DRL-BASED ACCESS CONTROL WITH PRETRAINING
In this section, we present the DRL-based access control for a buffer-aided relaying system with energy harvesting. The architecture of the system is shown in Figure 2. As shown in Algorithm 1, our method aims to maximize the long-term
system reward
$$ G_t = \sum_{k=0}^{\infty} \gamma^k R(s^{k+t}, a^{k+t}), $$
with the discount factor γ ∈ (0, 1).

FIGURE 2. DRL-based access control with pretraining.

Algorithm 1 DRL-Based Access Control With Pretraining
Initialization
1: Initialize the cache quotas and the decay rate
2: Initialize the Q-network to random weights
Pretraining
3: Sample a state transition model P
4: for episode = 1 to Mp do
5:     Initialize system state s0
6:     for t = 1 to Tp do
7:         With probability ε set at to a random action, otherwise set at = argmax_a Q(st, a; θ)
8:         Take action at, get reward rt and next state st+1 according to transition model P
9:         Store transition (st, at, rt, st+1) in D
10:        Sample a random minibatch of transitions from D
11:        Perform gradient descent and update the network
12:    end for
13: end for
14: return the parameter θP of the current network
Training
15: Initialize the parameter θ := θP and D := ∅
16: for episode = 1 to M do
17:     Initialize system state s0
18:     for t = 1 to T do
19:         Perform the ε-greedy step
20:         Perform gradient descent with D, and fine-tune θ on the last layer
21:         Perform gradient descent and fine tuning on Q
22:    end for
23: end for
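For reference, the long-term reward G_t that Algorithm 1 maximizes can be computed from a finite reward trace as follows; the discount factor 0.9 and the three-step example are only illustrative values.

```python
def discounted_return(rewards, gamma=0.9):
    """G_t = sum_k gamma^k * R(s^{t+k}, a^{t+k}) over a finite reward trace."""
    g = 0.0
    for r in reversed(rewards):
        g = r + gamma * g
    return g

print(discounted_return([2.0, 0.0, 3.0]))   # 2 + 0.9*0 + 0.81*3 ≈ 4.43
```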
A. STRUCTURED FEATURES OF SYSTEM
To learn an effective access control policy for the system, the DQN should be trained first. Usually, the weights θ of the DQN are randomly initialized before training, which means that the Q-values Q(s, a) are randomly initialized. Then, by training with data sampled from the real system, effective weights θ of the DQN can be found. However, with this initialization method, it may take a long training time to find effective DQN weights θ. Actually, for the considered system, there are some structural features which are independent of the environment model (the channel gain processes and energy harvesting models). For instance, it is always preferable to select a relay-destination link if the power (data) in the battery (data buffer) is full or almost full, no matter what the transition function Tr : S × A → S of the system is. When the power (data) in the battery (data buffer) is empty or almost empty, it is a better choice to receive data packets.
These features indicate that when s is a state in which the power (data) in the battery (data buffer) is full or almost full, the Q-values Q(s, a) for actions a that transmit packets from the RN to a certain destination should be assigned larger values. In contrast, when s is a state in which the power (data) in the battery (data buffer) is empty or almost empty, the Q-values Q(s, a) for actions a that receive packets from a certain source should be assigned larger values. Note that these preferences hold for any system transition function Tr : S × A → S. With these observations, we propose a pretraining method to assign proper initial Q-values Q(s, a) to the DQN.
B. PRETRAINING WITH HYPOTHETICAL ENVIRONMENT
The pretraining module consists of a sampler and an MDP solver. The pretraining sampler first samples a transition model P, and then samples transitions one by one according to P. This pretrained model is used to initialize and fine-tune the network parameters in order to achieve faster training. The transition model in the pretraining process is a hypothetical one, which is sampled randomly in the experiment. The ε-greedy step and the gradient descent step are similar to those of the original deep Q-learning: with probability ε the system takes an arbitrary action to explore the hypothetical environment; otherwise, the system takes the action with the maximum state-action value (estimated by the Q-network). The environment gives feedback to the system, including the successor state and the direct reward. To weaken the correlation among the training samples, the transitions generated during the process are stored in an experience replay set, from which the subsequent training transitions are sampled for the Q-network update. The update step performs a gradient descent on (y_j - Q(φ_j, a_j; θ))^2 with respect to the network parameters θ. When the average performance is satisfactory or the maximum training time is reached, we fix the network and return the pretrained parameter.
The sampled model might be quite different from the one in the authentic environment, yet the core decision-making ability under this framework can be captured by the Q-network. It turns out that with the pretrained Q-network, only fine tuning is needed to achieve good performance in the authentic environment. Note that in the pretraining stage, the Q-table can be calculated or estimated by any given MDP solver; in our experiment, we simply deploy the deep neural network to achieve the approximation.
The pretraining mechanism accelerates the learning process and guarantees efficiency even in a dynamic environment where the transition model changes over time. For instance, consider an environment where the quality of the links follows a dynamic distribution P which changes every M time slots. The tasks with different P share some common structural features of the system, but without further training or fine tuning they are, after all, different tasks for the neural network. Our experiment verifies that the deep neural network does not perform well if it is directly deployed in a brand-new environment, so in this case simply increasing the number of training episodes does not improve the performance or the robustness. In contrast, fine-tuning the pretrained model eases the pain of context switching. Within each cycle of the dynamic transition model, we tune the pretrained model for the first m time slots, and the neural network can then achieve good performance under the new distribution for the remaining M - m time slots. Our experiment results show that the pretraining mechanism can dramatically improve the learning efficiency and the performance stability.
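A hypothetical environment of the kind used for pretraining can be sampled, for example, by drawing one random categorical distribution over the link-rate levels for each source-relay and relay-destination link. The sketch below is one such illustrative sampler; the paper does not prescribe this particular form.

```python
import random

L_LINK = [0, 1, 2, 3]   # discrete transmission-rate levels
N = 2                   # source-destination pairs (as in Section V)

def sample_transition_model(rng=random):
    """Sample a hypothetical model P: a categorical distribution over L_LINK per link."""
    def random_dist():
        w = [rng.random() for _ in L_LINK]
        total = sum(w)
        return [x / total for x in w]
    return {"U": [random_dist() for _ in range(N)],
            "D": [random_dist() for _ in range(N)]}

def sample_link_states(model, rng=random):
    """Draw next-slot link states U^{t+1}, D^{t+1} from the sampled model P."""
    U = [rng.choices(L_LINK, weights=model["U"][i])[0] for i in range(N)]
    D = [rng.choices(L_LINK, weights=model["D"][i])[0] for i in range(N)]
    return U, D
```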
C. FINE TUNING WITHIN AUTHENTIC ENVIRONMENTS
Once the model is well pretrained with a hypothetical environment, it is ready for further training in authentic environments. As shown in Algorithm 1, during the training process the pretrained model only needs to be fine-tuned. Although the transition model in an authentic environment may be quite different from the hypothetical environment, they share some common structural features of the relaying system. By pretraining in the hypothetical environment, the deep neural network captures the structural features of the relaying system well, and is able to quickly learn how to maximize the expected long-term reward even in a different environment. As verified by the experiments on different environment models, the training efficiency can be dramatically improved by the pretraining mechanism.
Note that to avoid overfitting, the experience replay of the deep model needs to be reset before further training in an authentic environment, as in Line 15 of Algorithm 1.
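One possible reading of this fine-tuning step (tuning θ on the last layer and resetting the replay memory D, as in line 15 of Algorithm 1) is sketched below in PyTorch. Freezing all earlier layers is our assumption, since the paper does not state how the remaining weights are treated during fine tuning.

```python
import copy
from collections import deque

import torch.nn as nn

def prepare_for_fine_tuning(q_net, replay_capacity=10_000):
    """Freeze all but the final linear layer of a pretrained nn.Sequential Q-network
    and reset the experience replay, mirroring the fine-tuning phase of Algorithm 1."""
    tuned = copy.deepcopy(q_net)
    last_linear = [m for m in tuned if isinstance(m, nn.Linear)][-1]
    for p in tuned.parameters():
        p.requires_grad = False
    for p in last_linear.parameters():       # only θ of the last layer is tuned
        p.requires_grad = True
    replay = deque(maxlen=replay_capacity)   # D := ∅, avoiding overfitting to pretraining data
    return tuned, replay
```

The optimizer for the fine-tuning phase would then be built only over the parameters with requires_grad set to True.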
D. EXTENSION: STRATEGY FOR POWER OUTAGE
In the experiments introduced in the following sections, the relaying system sometimes needs to work in an environment short of energy. The algorithm above can certainly handle this situation through learning, so that the system decides to wait for harvested energy when the power is used up. Nevertheless, other strategies can be considered to deal with the specific situation of power outage. In this section, we extend the main algorithm above by embedding power outage prevention into the reward function.
The reward function (3) is revised to
$$ R(s^t, a^t) = R_{tr}(s^t, a^t) + R_{pw}(s^t, a^t), \qquad (10) $$
where
$$ R_{tr}(s^t, a^t) = \begin{cases} r_u \cdot (b_i^{t+1} - b_i^t) & a = \mathrm{rcv}_i^t,\ i = 1..N \\ r_d \cdot (b_i^t - b_i^{t+1}) & a = \mathrm{fwd}_i^t,\ i = 1..N \\ 0 & a = \mathrm{wait}, \end{cases} \qquad (11) $$
and
$$ R_{pw}(s^t, a^t) = r_p \cdot \delta(E^{t+1}), \qquad (12) $$
where r_p is the unit reward for energy sufficiency and δ is the reward model for preventing the power outage. In our experiment, we use a sigmoid function as our power reward model.
We compare the above algorithm with and without the energy strategy and give the experimental details in Section V. Note that besides encoding the power sufficiency policy into the reward function, there are other ways to implement given policies in a reinforcement learning based framework, and we leave them for future work.
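As a concrete illustration, the sketch below evaluates the power-aware reward (10)-(12) using the sigmoid-shaped δ that is specified later in Section V-C; r_p = r_d = 1 is an assumed unit value.

```python
import math

def delta(e_next):
    """Power reward δ(E^{t+1}) from Section V-C: about -1.5 when the battery is
    nearly empty and about 0 once the remaining power exceeds roughly 1.2 units."""
    return 1.0 / (0.5 + 0.125 * math.exp(-5.0 * (e_next - 0.5))) - 2.0

def reward_with_power_term(r_tr, e_next, r_p=1.0):
    """Combined reward of eq. (10): R = R_tr + r_p * δ(E^{t+1})."""
    return r_tr + r_p * delta(e_next)

print(round(delta(0.0), 2), round(delta(1.2), 2))   # ≈ -1.51 and ≈ -0.01
```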
V. EXPERIMENTS
In this section, we demonstrate the performance of the algorithm with a series of experiments. For simplicity, we consider an instance of the system model above with the number of communication pairs N = 2. Note that the experiments carried out in this section can be directly applied to cases with N > 2 communication pairs. In practical wireless communication systems, even if the channel state information is perfectly known at the transmitter, only several discrete communication rate levels are supported [36], which correspond to different channel coding rates and modulation orders; e.g., in LTE, only several turbo (LDPC) code rates and modulation schemes (BPSK, QPSK, 16QAM, etc.) are adopted. Thus, in this paper we also adopt a discrete transmission rate model L_link = {0, 1, 2, 3}. Here, L_link = 0 means that the channel condition is too bad for a data packet to be successfully delivered, even with the lowest code rate and modulation order. L_link = 1 (2 or 3) means that a proper code rate and modulation order can be adopted to successfully deliver a data packet. Note that for different code rates and modulation orders, a data packet contains a different amount of information. For energy harvesting, we adopt a Poisson process, which is also adopted in some other papers on energy harvesting (e.g., [5] and [6]), to model the energy arriving at the relay node. We set the arrival density as λ = 1.6. For this setup, the probability of harvesting more than 3 units of energy becomes very small. Thus, we use a truncated energy harvesting model by simply assuming that the energy harvested in each time slot is given by L_power = {0, 1, 2, 3}. In addition, the maximum energy capacity is set as E_max = 10 and the maximum cache quota for each pair is b_i = 10 for i = 1..N to ensure that the packet delay at the relay will not be too large.
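The truncated energy-arrival model can be simulated, for example, as below. Clipping Poisson(λ = 1.6) draws to at most 3 units is one plausible reading of the truncation described above, since the exact mapping onto L_power is not spelled out.

```python
import math
import random

LAMBDA = 1.6
L_POWER_MAX = 3   # harvested energy per slot restricted to {0, 1, 2, 3}

def poisson_sample(lam, rng=random):
    """Draw one Poisson(lam) sample by inverting the cumulative distribution."""
    u, k = rng.random(), 0
    p = math.exp(-lam)
    cdf = p
    while u > cdf:
        k += 1
        p *= lam / k
        cdf += p
    return k

def harvested_energy(rng=random):
    """Energy gain EG^t for one slot: a Poisson(1.6) draw clipped to L_POWER_MAX."""
    return min(poisson_sample(LAMBDA, rng), L_POWER_MAX)
```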
The above instance seems to be a pretty small domain, yet the state space is already quite large:
$$ |S| = |L_{link}|^{2N} \cdot |L_{power}| \cdot b_i^{2N} = 10{,}240{,}000, $$
which makes it impractical to maintain and update a complete state-action value table efficiently, as in traditional MDP solutions such as policy iteration or value iteration. So we use a deep neural network to estimate the state-action value function and make decisions. In the following, we present the experiments in two parts. In the first part, we compare our model with some traditional algorithms and give the canonical variate analysis. In the second part, we discuss the improvement in learning efficiency brought by our pretrained model and demonstrate the generality of the training acceleration.
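The state-space count quoted above follows directly from the expression, and a flattened encoding of (U, D, B, E) such as the one below can serve as the Q-network input; the normalization is our own choice, since the paper does not specify its input layout.

```python
N = 2
n_link_levels = 4       # |L_link| = |{0, 1, 2, 3}|
n_power_levels = 4      # |L_power| = |{0, 1, 2, 3}|
quota = 10              # b_i

print(n_link_levels ** (2 * N) * n_power_levels * quota ** (2 * N))   # 10240000

def encode_state(U, D, B, E, e_max=10):
    """Flatten a state (U, D, B, E) into a normalized feature vector of length 3N + 1."""
    return ([u / 3 for u in U] + [d / 3 for d in D]
            + [b / quota for b in B] + [E / e_max])
```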
A. PERFORMANCE COMPARISON
A deep neural network trained well by a good algorithm is capable of making good decisions, even in domains with quite large state spaces. In this experiment, we compare the performance of our pretrained model with three baseline algorithms, namely round-robin on links, round-robin on pairs, and a greedy approach, which are widely used in realistic scheduling protocols and applications. For buffer-aided relaying, it has been shown in [2] that a greedy approach can usually achieve the highest throughput. However, for EH-based buffer-aided relaying, few works exist and there is no commonly accepted access strategy that achieves good performance. Although the access control problem has already been investigated for some EH-based wireless communication systems, many of those works also adopted RL methods to find an effective strategy. Thus, in this paper, to show the effectiveness of DRL in designing an access control strategy for EH buffer-aided relaying systems, we compare the proposed model with the Round-Robin and Greedy approaches.
1) Round-robin on links. The first baseline, Round-robin on Links, serves the users in a fixed order: the 2N end users take turns being served by the relay. In other words, the decision sequence is the following fixed one:
{rcv_1^1, rcv_2^2, ..., rcv_N^N, fwd_1^{N+1}, fwd_2^{N+2}, ..., fwd_N^{2N}, rcv_1^{2N+1}, rcv_2^{2N+2}, ..., rcv_N^{3N}, ...}.
2) Round-robin on pairs. The second baseline, Round-robin on Pairs, rotates the service over the communication pairs; its decision sequence is also a fixed one:
{rcv_1^1, fwd_1^2, ..., rcv_N^{2N-1}, fwd_N^{2N}, rcv_1^{2N+1}, fwd_1^{2N+2}, ..., rcv_N^{4N-1}, fwd_N^{4N}, rcv_1^{4N+1}, fwd_1^{4N+2}, ..., rcv_N^{6N-1}, fwd_N^{6N}, ...}.
3) Greedy. The third baseline, Greedy, selects the link with the best transmission rate (a code sketch is given after this list). Suppose that p_max = max(U^t ∪ D^t); then
$$ a^t = \begin{cases} \mathrm{fwd}_i^t & p_{max} = d_i^t \\ \mathrm{rcv}_i^t & p_{max} = u_i^t \end{cases} $$
for any time t. For links with the same transmission rate, relay-destination links have priority to be selected.
Please note that other heuristics can be embedded in Round-Robin or Greedy, but the basic ideas are the same. To our knowledge, apart from the above baselines, there is not yet a commonly accepted strategy for the access control of relays with energy harvesting.
First, we test our method in the six environment models, namely Env1 to Env6, as shown in Table 1. As in many other works such as [34], [35], the energy harvesting of the RN is modeled as a Poisson process. The transmission rates of the links are independent random variables that follow the given environment model. The distribution of the link states is shown in Figure 3.

TABLE 1. Different environment models.
FIGURE 3. Different transition models of transmission rates.

Second, we compare the performance of the algorithm with the traditional methods in terms of average reward, in each environment model. As shown in Figure 4, in all the environment models our method dominates the others and shows the best performance. For instance, in Figure 4(a) the transmission rate of each link is distributed uniformly over the sample space, and the average reward of our model is significantly higher than that of the other three traditional methods. In Figure 4(c), our method is slightly better than Greedy and much better than the two Round-robin baselines. For Env4 in Figure 4(d), the greedy method shows the worst performance, even worse than Round-robin. One of the reasons is that the transmission rates are usually quite high in Env4; for a pair with good communication, the buffer quota is easily used up, yet the greedy algorithm still tries to select that pair and produces redundant actions.

FIGURE 4. Performance comparison in Env1 – Env6: average reward.

Third, we compare the competing algorithms on the average packet delay in Figure 5. The metric average packet delay pd(T) for a time period T is defined as
$$ pd(T) = \frac{num_{fail} + \sum_{t=1}^{T} \sum_{i=1}^{N} b_i^t}{T}, $$
where num_fail is the total number of transmission failures. As shown in the results, our method outperforms the others in almost all the environment models.

FIGURE 5. Performance comparison in Env1 – Env6: average packet delay.

To sum up, our approach shows better performance than the three baselines for all the above environment models, in terms of both the average reward and the average packet delay.

B. CANONICAL VARIATE ANALYSIS ON EH MODEL
The energy harvesting of the RN is modeled as a Poisson process, as in other works such as [35]. The harvested power is a random variable under a given Poisson distribution. It is worth noting that the energy harvested at the current time can only be used in subsequent time slots, that is, the harvest-store-use (HSU) mode is used in this paper. We present the canonical variate analysis of our algorithm, discussing its performance under different energy harvest rates. We assume that the source-relay and relay-destination transmission states are sampled from L_link according to a common distribution, the maximum energy capacity is E_max = 10, and the maximum cache quota for each pair is b_i = 10 for i = 1..N. We consider Poisson(λ = 0.8), which indicates that the harvested power is sufficient, and Poisson(λ = 2.8), where the harvested power is relatively insufficient. As shown in Figure 6, our method outperforms all the baselines under both EH models. The improvement is more obvious when the harvested power is insufficient.

FIGURE 6. Performance comparison with different energy harvest models.
C. POWER OUTAGE PREVENTION
As stated in previous sections, when the power in the battery is used up, the relaying system has to suspend and wait until enough energy is harvested, which is not good enough for data transmission. We embed a policy for preventing the power outage into the reward function and make the access control smarter. The new reward function is,
$$ R(s^t, a^t) = R_{tr}(s^t, a^t) + R_{pw}(s^t, a^t), \qquad (13) $$
where R_tr(s^t, a^t) is the same as the previous R(s^t, a^t) in the above groups of experiments, and
$$ R_{pw}(s^t, a^t) = r_p \cdot \delta(E^{t+1}), \qquad (14) $$
where r_p is the unit reward for energy sufficiency and δ is the reward model for preventing the power outage.

Note that there is no existing work on how to embed a given policy of power outage prevention into the reward function for a deep reinforcement learning based relaying system with energy harvesting. In our experiment, we set r_p = r_d and
$$ \delta(E^{t+1}) = \frac{1}{0.5 + 0.125 \cdot e^{-5(E^{t+1} - 0.5)}} - 2. $$
Part of the curve of δ(E^{t+1}) is shown in Figure 7(a). When the power left at time t + 1 is going to be close to 0, the penalty is around 1.5 · u_d. In other words, only if more than 1.5 packets are going to be successfully forwarded will the relay decide to forward when the power is almost used up; otherwise, the relay just waits for more harvested energy. If the power left at time t + 1 is going to be larger than 1.2 · u_d, the penalty is almost 0.

FIGURE 7. Strategy for power outage prevention.
TABLE 2. Average number of training episodes.

The performance comparison is shown in Figure 8. The result indicates that the average reward of our model with the strategy is slightly lower than that of the algorithm without the strategy, due to the extra waiting when the battery is short of power, and better than the baselines in almost all environments. However, as shown in Figure 7(a), the rate of transmission failure due to power outage is dramatically reduced.

D. ACCELERATION BY PRETRAINING
As stated previously, we first sample a transition model for pretraining and test the performance in an authentic environment. The comparison of our algorithm with the original deep Q-learning is shown in Figure 9. The deep model is pretrained in a hypothetical environment for 500 iterations and then compared with a deep model without pretraining over 1000 training iterations. The learning efficiency is dramatically improved by the pretraining.

FIGURE 8. Performance of our model with the power saving strategy (our model+).
FIGURE 9. Deep models with and without pretraining.

To demonstrate the generality of the pretraining mechanism, we compare in Figure 10 our algorithm with the unpretrained deep models in different authentic environments. The result verifies the generality of the acceleration: no matter how the authentic environment varies, the pretrained model always converges much faster than the baseline. In Table 2, we give the average training episodes of the two methods in Env1 to Env6. For this MDP problem with a relatively large state space, the neural network takes hundreds of episodes to be well trained without pretraining, but the pretraining successfully eases the pain of learning complexity.
VI. CONCLUSION AND FUTURE WORK
In this paper, we formulated the link selection problem in EH-based buffer-aided relaying communication systems as an RL problem to maximize the long-term average system utility, and adopted a DQN to find an optimal access control strategy. Experiment results showed that the transmission policy obtained with our proposed model achieves better performance than the traditional methods and is general enough to be used in different environments, which demonstrates the adaptability of our proposed model. To reduce the training time, an effective pretraining algorithm was proposed to accelerate the convergence of the proposed deep Q-learning method, and it was shown to significantly reduce the training time. These results indicate that the considered system has some general structural features and that the adaptive training method can be adopted in practical time-varying systems whose transition model changes over time.
Moreover, in this paper we embedded some domain knowledge, in the form of a power policy, into the reward function of the deep learning model. There are also other ways to embed explicit policies or domain knowledge into the learning framework, and we leave them for future work.
FIGURE 10. Learning efficiency comparison in Env1 – Env6.

REFERENCES
[1] A. Nosratinia, T. E. Hunter, and A. Hedayat, ‘‘Cooperative communication in wireless networks,’’ IEEE Commun. Mag., vol. 42, no. 10, pp. 74–80, Oct. 2004.
[2] S. Luo and K. C. Teh, ‘‘Buffer state based relay selection for buffer-aided cooperative relaying systems,’’ IEEE Trans. Wireless Commun., vol. 14, no. 10, pp. 5430–5439, Oct. 2015.
[3] N. Zlatanov, R. Schober, and P. Popovski, ‘‘Buffer-aided relaying with adaptive link selection,’’ IEEE J. Sel. Areas Commun., vol. 31, no. 8, pp. 1530–1542, Aug. 2013.
[4] S. Luo and K. C. Teh, ‘‘Adaptive transmission for cooperative NOMA system with buffer-aided relaying,’’ IEEE Commun. Lett., vol. 21, no. 4, pp. 937–940, Apr. 2017.
[5] S. Ulukus, A. Yener, E. Erkip, O. Simeone, M. Zorzi, P. Grover, and K. Huang, ‘‘Energy harvesting wireless communications: A review of recent advances,’’ IEEE J. Sel. Areas Commun., vol. 33, no. 3, pp. 360–381, Mar. 2015.
[6] O. Ozel, K. Tutuncuoglu, S. Ulukus, and A. Yener, ‘‘Fundamental limits of energy harvesting communications,’’ IEEE Commun. Mag., vol. 53, no. 4, pp. 126–132, Apr. 2015.
[7] G. Shabbir, J. Ahmad, W. Raza, Y. Amin, A. Akram, J. Loo, and H. Tenhunen, ‘‘Buffer-aided successive relay selection scheme for energy harvesting IoT networks,’’ IEEE Access, vol. 7, pp. 36246–36258, 2019.
[8] F. Zeng, X. Xiao, Z. Xiao, J. Sun, J. Bai, V. Havyarimana, and H. Jiang, ‘‘Throughput maximization for two-way buffer-aided and energy-harvesting enabled multi-relay networks,’’ IEEE Access, vol. 7, pp. 157972–157986, 2019.
[9] I. Ahmed, A. Ikhlef, R. Schober, and R. K. Mallik, ‘‘Power allocation for conventional and buffer-aided link adaptive relaying systems with energy harvesting nodes,’’ IEEE Trans. Wireless Commun., vol. 13, no. 3, pp. 1182–1195, Mar. 2014.
[10] B. Varan and A. Yener, ‘‘Delay constrained energy harvesting networks with limited energy and data storage,’’ IEEE J. Sel. Areas Commun., vol. 34, no. 5, pp. 1550–1564, May 2016.
[11] A. Ortiz, H. Al-Shatri, X. Li, T. Weber, and A. Klein, ‘‘Reinforcement learning for energy harvesting point-to-point communications,’’ in Proc. IEEE Int. Conf. Commun. (ICC), May 2016, pp. 1–6.
[12] A. Ortiz, H. Al-Shatri, X. Li, T. Weber, and A. Klein, ‘‘Reinforcement learning for energy harvesting decode-and-forward two-hop communications,’’ IEEE Trans. Green Commun. Netw., vol. 1, no. 3, pp. 309–319, Sep. 2017.
[13] A. Masadeh, Z. Wang, and A. E. Kamal, ‘‘Reinforcement learning exploration algorithms for energy harvesting communications systems,’’ in Proc. IEEE Int. Conf. Commun. (ICC), May 2018, pp. 1–6.
[14] P. Blasco, D. Gunduz, and M. Dohler, ‘‘A learning theoretic approach to energy harvesting communication system optimization,’’ IEEE Trans. Wireless Commun., vol. 12, no. 4, pp. 1872–1882, Apr. 2013.
[15] R. Sutton and A. Barto, Reinforcement Learning: An Introduction. Cambridge, MA, USA: MIT Press, 1998.
[16] C. Szepesvári, ‘‘Algorithms for reinforcement learning,’’ Synth. Lectures Artif. Intell. Mach. Learn., vol. 4, no. 1, pp. 1–103, 2010.
[17] Y.-H. Xu, J.-W. Xie, Y.-G. Zhang, M. Hua, and W. Zhou, ‘‘Reinforcement learning (RL)-based energy efficient resource allocation for energy harvesting-powered wireless body area network,’’ Sensors, vol. 20, no. 1, p. 44, Dec. 2019, doi: 10.3390/s20010044.
[18] Y. Wei, F. R. Yu, M. Song, and Z. Han, ‘‘User scheduling and resource allocation in HetNets with hybrid energy supply: An actor-critic reinforcement learning approach,’’ IEEE Trans. Wireless Commun., vol. 17, no. 1, pp. 680–692, Jan. 2018.
[19] F. Ait Aoudia, M. Gautier, and O. Berder, ‘‘RLMan: An energy manager based on reinforcement learning for energy harvesting wireless sensor networks,’’ IEEE Trans. Green Commun. Netw., vol. 2, no. 2, pp. 408–417, Jun. 2018.
[20] V. Hakami and M. Dehghan, ‘‘Distributed power control for delay optimization in energy harvesting cooperative relay networks,’’ IEEE Trans. Veh. Technol., vol. 66, no. 6, pp. 4742–4755, Jun. 2017.
[21] M. Miozzo, L. Giupponi, M. Rossi, and P. Dini, ‘‘Switch-on/off policies for energy harvesting small cells through distributed Q-learning,’’ in Proc. IEEE Wireless Commun. Netw. Conf. Workshops (WCNCW), Mar. 2017, pp. 1–6.
[22] V. Mnih, K. Kavukcuoglu, D. Silver, A. Graves, I. Antonoglou, D. Wierstra, and M. Riedmiller, ‘‘Playing atari with deep reinforcement learning,’’ 2013, arXiv:1312.5602. [Online]. Available: http://arxiv.org/abs/1312.5602
[23] V. Mnih, K. Kavukcuoglu, D. Silver, A. A. Rusu, J. Veness, M. G. Bellemare, A. Graves, M. Riedmiller, A. K. Fidjeland, G. Ostrovski, S. Petersen, C. Beattie, A. Sadik, I. Antonoglou, H. King, D. Kumaran, D. Wierstra, S. Legg, and D. Hassabis, ‘‘Human-level control through deep reinforcement learning,’’ Nature, vol. 518, no. 7540, pp. 529–533, Feb. 2015, doi: 10.1038/nature14236.
[24] H. van Hasselt, A. Guez, and D. Silver, ‘‘Deep reinforcement learning with double Q-learning,’’ in Proc. 13th AAAI Conf. Artif. Intell. (AAAI), 2016, pp.2094–2100. [Online]. Available: http://www.aaai.org/ocs/index.php/AAAI/AAAI16/paper/view/12389
[25] Z. Wang, T. Schaul, M. Hessel, H. van Hasselt, M. Lanctot, and N. de Freitas, ‘‘Dueling network architectures for deep reinforcement learning,’’ in Proc. 33rd Int. Conf. Mach. Learn. (ICML), 2016, pp. 1995–2003. [Online]. Available: http://proceedings.mlr.press/v48/wangf16.html
[26] M. J. Hausknecht and P. Stone, ‘‘Deep recurrent Q-learning for partially observable MDPs,’’ in Proc. AAAI Fall Symp., 2015, pp. 29–37. [Online]. Available: http://www.aaai.org/ocs/index.php/FSS/FSS15/paper/view/11673
[27] T. Schaul, J. Quan, I. Antonoglou, and D. Silver, ‘‘Prioritized experience replay,’’ 2015, arXiv:1511.05952. [Online]. Available: http://arxiv.org/abs/1511.05952
[28] I. Osband, C. Blundell, A. Pritzel, and B. Van Roy, ‘‘Deep exploration via bootstrapped DQN,’’ in Proc. Adv. Neural Inf. Process. Syst., Annu. Conf. Neural Inf. Process. Syst., 2016, pp. 4026–4034. [Online]. Available: http://papers.nips.cc/paper/6501-deep-exploration-via-bootstrapped-dqn
[29] H. Le, N. Jiang, A. Agarwal, M. Dudík, Y. Yue, and H. Daumé, ‘‘Hierarchical imitation and reinforcement learning,’’ in Proc. ICML, 2018, pp. 2923–2932. [Online]. Available: http://proceedings.mlr.press/v80/le18a.html
[30] C. Tessler, S. Givony, T. Zahavy, D. J. Mankowitz, and S. Mannor, ‘‘A deep hierarchical approach to lifelong learning in Minecraft,’’ in Proc. AAAI, 2017, pp. 1553–1561. [Online]. Available: http://aaai.org/ocs/index.php/AAAI/AAAI17/paper/view/14630
[31] T. D. Kulkarni, K. Narasimhan, A. Saeedi, and J. Tenenbaum, ‘‘Hierarchical deep reinforcement learning: Integrating temporal abstraction and intrinsic motivation,’’ in Proc. NIPS, 2016, pp. 3675–3683. [Online]. Available: http://papers.nips.cc/paper/6233-hierarchical-deep-reinforcement-learning-integrating-temporal-abstraction-and-intrinsic-motivation
[32] H. Yin and S. J. Pan, ‘‘Knowledge transfer for deep reinforcement learning with hierarchical experience replay,’’ in Proc. AAAI, 2017, pp. 1640–1646. [Online]. Available: http://aaai.org/ocs/index.php/AAAI/AAAI17/paper/view/14478
[33] C. J. C. H. Watkins and P. Dayan, ‘‘Q-learning,’’ Mach. Learn., vol. 8, nos. 3–4, pp. 279–292, May 1992, doi: 10.1007/BF00992698.
[34] O. Ozel, K. Tutuncuoglu, J. Yang, S. Ulukus, and A. Yener, ‘‘Transmission with energy harvesting nodes in fading wireless channels: Optimal policies,’’ IEEE J. Sel. Areas Commun., vol. 29, no. 8, pp. 1732–1743, Sep. 2011, doi: 10.1109/JSAC.2011.110921.
[35] P. Sakulkar and B. Krishnamachari, ‘‘Online learning schemes for power allocation in energy harvesting communications,’’ IEEE Trans. Inf. Theory, vol. 64, no. 6, pp. 4610–4628, Jun. 2018.
[36] W. Wicke, N. Zlatanov, V. Jamali, and R. Schober, ‘‘Buffer-aided relaying with discrete transmission rates for the two-hop half-duplex relay network,’’ IEEE Trans. Wireless Commun., vol. 16, no. 2, pp. 967–981, Feb. 2017.
HAODI ZHANG received the Ph.D. degree from the Department of Computer Science and Engineering, Hong Kong University of Science and Technology, in 2016. He is currently an Assistant Professor with the College of Computer Science and Software Engineering, Shenzhen University and Guangdong Laboratory of Artificial Intelligence and Digital Economy, Shenzhen, China. His current research interests are in the areas of deep reinforcement learning, knowledge representation
and reasoning, explainable artificial intelligence, artificial intelligence in communication, buffer-aided relaying and wireless information.

DI ZHAN received the bachelor's degree from Huanggang Normal College, in 2017. She is currently pursuing the master's degree with the College of Computer Science and Software Engineering, Shenzhen University and Guangdong Laboratory of Artificial Intelligence and Digital Economy (SZ), Shenzhen University, China. Her current research interests are in the areas of deep reinforcement learning, buffer-aided relaying, and wireless information.
CHEN JASON ZHANG (Member, IEEE) received the Ph.D. degree from the Department of Computer Science and Engineering, Hong Kong University of Science and Technology, in 2015. He is currently a Postdoctoral Research Fellow with the Hong Kong University of Science and Technology as well as an Associate Professor at the Shandong University of Finance and Economics. His research interests include crowdsourcing and data integration.
KAISHUN WU (Member, IEEE) received the Ph.D. degree in computer science and engineering from HKUST, in 2011. He was a Research Assistant Professor with HKUST. In 2013, he joined Shenzhen University as a Distinguished Professor. He has co-authored two books and published over 90 high-quality research articles in international leading journals and premier conferences, like the IEEE TRANSACTIONS ON MOBILE COMPUTING, the IEEE TRANSACTIONS ON PARALLEL
AND DISTRIBUTED SYSTEMS, ACM MobiCom, and the IEEE INFOCOM. He has invented six U.S. and over 80 Chinese pending patents. He was a recipient of the 2012 Hong Kong Young Scientist Award and the 2014 Hong Kong ICT Awards: Best Innovation and 2014 IEEE ComSoc Asia-Pacific Outstanding Young Researcher Award. He is an IET Fellow.
YE LIU received the master's degree from the Department of Computer Science and Engineering, Shenzhen University, in 2015. He is currently an Assistant Research Fellow with Shenzhen University.
SHENG LUO (Member, IEEE) received the Ph.D. degree from Nanyang Technological University, Singapore, in 2017, and the master's and bachelor's degrees from the University of Electronic Science and Technology of China, Chengdu, China, in 2012 and 2009, respectively. Since 2017, he has been with Shenzhen University, where he is currently an Assistant Professor with the College of Computer Science and Software Engineering. His current research interests are in the areas of cooperative communication, buffer-aided relaying and wireless information and power transfer, mmWave communication, and spatial modulation.