Optimal Recovery of Single Disk Failure in RDP Code Storage Systems
Liping Xiang†, Yinlong Xu†, John C. S. Lui*, Qian Chang†
†School of Computer Science and Technology, University of Science and Technology of China
{xlping, qchang}@mail.ustc.edu.cn, ylxu@ustc.edu.cn
*Department of Computer Science and Engineering, The Chinese University of Hong Kong
cslui@cse.cuhk.edu.hk
ABSTRACT
Modern storage systems use thousands of inexpensive disks to meet the storage requirement of applications. To enhance the data availability, some form of redundancy is used. For example, conventional RAID-5 systems provide data availability for single disk failure only, while recent advanced coding techniques such as row-diagonal parity (RDP) can provide data availability with up to two disk failures. To reduce the probability of data unavailability, whenever a single disk fails, disk recovery (or rebuild) will be carried out. We show that the conventional recovery scheme of RDP code for a single disk failure is inefficient and suboptimal. In this paper, we propose an optimal and efficient disk recovery scheme, Row-Diagonal Optimal Recovery (RDOR), for single disk failure of RDP code that has the following properties: (1) it is read optimal in the sense that it issues the smallest number of disk reads to recover the failed disk; (2) it has the load balancing property that all surviving disks will be subjected to the same amount of additional workload in rebuilding the failed disk. We carefully explore the design state space and theoretically show the optimality of RDOR. We carry out performance evaluation to quantify the merits of RDOR on some widely used disks.
Categories and Subject Descriptors
B.8.1 [Performance and Reliability]: Reliability, Testing, and Fault-Tolerance; D.4.2 [Operating Systems]: Storage Management—Secondary storage
General Terms
Algorithms, Reliability, Theory
Keywords
RAID recovery, RDP code, disk failure, recovery algorithm
1. INTRODUCTION
The tremendous growth of information makes designing a massive data storage system more challenging nowadays. As pointed out by the authors in [20], there were around 5 exabytes of new information generated in the year 2002, and the amount of information produced in the world increases by 30% every year. This massive amount of information translates to a huge demand for large and cost-effective storage systems, which use thousands or even hundreds of thousands of inexpensive disks to preserve large volumes of data [17, 16, 6].
The continuing increase of system size and the use of inexpensive but less reliable components for economic reasons make component failures (e.g., disk failures) more common, and this has a huge impact on the reliability and data availability of large scale storage systems [27, 25]. However, a fundamental requirement of building large storage systems is to make sure information is reliable and available even in the presence of component failures. High reliability and data availability can be achieved by maintaining a sufficient amount of redundancy in the storage system. For example, in recent years, several erasure codes [26, 15, 7, 14, 8], such as row-diagonal parity (RDP) [5], EVENODD [2], and X-code [36], were proposed to protect data against more than one disk failure while maintaining a higher level of reliability than conventional RAID-5 systems.
When the number of failed disks is more than the erasure capability, information will be lost and data will be unavailable. Thus, to maintain a high system reliability level, a key approach is to repair and recover a failed component as quickly as possible [1]. In general, the recovery process requires multiple reads to surviving disks, and the total data that must be processed during recovery plays a crucial role in the overall recovery time [5]. Moreover, as the storage capacity of modern disks is growing at a much faster rate than the disk I/O speed, the disk recovery process for modern disks takes much longer time. This translates to lengthening the window of vulnerability (WOV) and increases the probability of data loss [34]. Thus, reducing the disk reads will accelerate the recovery process and improve system reliability.
Another fact which should be taken into consideration is that most storage systems have an online failure recovery mode [10], in which the system still provides service to the users during the recovery period. However, the multiple reads from disks for recovery consume I/O bandwidth of the system and influence the system service performance [12]. Therefore, by reducing the number of disk reads, one can also improve the performance to users during the online failure recovery mode. Additional benefits of reducing disk reads include conserving disk energy and relieving the communication load of the entire storage system.
Systems based on double fault tolerant array codes (e.g., RDP, EVENODD) usually contain two parity disks to protect against double disk failures. However, the frequency of single disk failure is much higher than that of double disk failures (this is especially true when the aggregated disk failure rate is much lower than the disk recovery rate). Therefore, one should focus on designing an efficient recovery scheme of a single failed disk for storage systems that can tolerate multiple disk failures. However, most commonly adopted disk recovery strategies only use single parity for single failure recovery [2, 5, 36], and there is no research which focuses on designing a fast and efficient recovery scheme for single failure recovery of double fault tolerant array codes.
Recent studies show that memory reads are about 100 times faster than disk reads. For instance, a state-of-the-art SATA disk drive has a linear read speed of approximately 60MB/sec, while DDR2-800 memory read speed is 6400MB/sec [32]. If some disk reads can be substituted by memory reads, one can speed up the disk recovery process and reduce the communication load of the storage system.
The aim of this work is to design an optimal recovery scheme for single disk failure in a storage system that uses double fault tolerant array codes. We consider the single failure recovery problem for storage systems which use the RDP codes. We show that one can exploit both parity disks to reduce the number of disk reads for the recovery of a single disk failure, and we propose a hybrid recovery scheme, RDOR. Using RDOR, the disk reads for recovery can be reduced by approximately 25% compared with the conventional recovery strategies. The main contributions of this paper are:
1. We propose an optimal recovery scheme for single disk failure which can reduce disk reads and therefore improve system recovery performance.
2. We derive the lower bounds on the total amount of disk reads needed for any single failure recovery of RDP code storage systems.
3. RDOR has the load-balancing property that all surviving disks will experience the same amount of workload to recover the failed disk, and the recovery scheme matches the lower bound on the number of disk reads.
This paper is organized as follows. We briefly introduce the concept of a storage system and the construction mechanism of RDP code in Section 2. Section 3 presents the main idea of our recovery scheme using a simple example. In Section 4, we theoretically analyze the lower bounds of the disk reads and propose the RDOR scheme, which is read-optimal and load balanced. In Section 5, we present experimental results to show the advantage of RDOR. Section 6 introduces related work on disk recovery. Conclusions and a discussion of future work are presented in Section 7.
2. BACKGROUND ON RDP STORAGE SYSTEMS

An RDP code storage system can be realized with multiple inexpensive disks. Disks of the same size arranged in an array constitute a disk array storage system. Among these disks, some are information disks and the rest are parity disks.
Erasure codes are defined by parity symbol construction mechanisms to protect information from disk failure. When a disk fails in the system, to maintain the system reliability level, the data in the failed disk should be reconstructed by reading corresponding information and parity data from the surviving disks, and the reconstructed data should be stored in a spare disk as soon as possible. To deal with disk failures, each erasure code has its specific recovery algorithm.
RDP code is one of the most important double fault tolerant erasure codes, achieving optimality both in computation efficiency and I/O efficiency. It is also known as RAID-DP in NetApp's commercial products [19]. The RDP encoding takes a (p − 1) × (p + 1) two-dimensional array (or what we call a stripe), where p is a prime number greater than 2. The first p − 1 columns in the array are information columns and the last two are parity columns. In actual implementations, the identities of information and parity columns are rotated between stripes [26], so all the disks get the same workload.
The two parity columns in the array, named the row parity column and the diagonal parity column respectively, ensure that all information is recoverable when there are no more than two disk failures. A row parity symbol, just like in RAID-4, is generated by XOR-summing all the information symbols in that row, and a diagonal parity symbol is generated by XOR-summing all symbols along the same diagonal. Figure 1 shows the construction of RDP code for p = 5, where di,j is the i-th symbol in column j. The first four disks, Disk 0 to Disk 3, are information disks, while the last two disks, Disk 4 and Disk 5, are parity disks. Disk 4 contains all the row parity symbols, e.g., d0,4 is the XOR-sum of data blocks {d0,0, d0,1, d0,2, d0,3}, while Disk 5 contains the diagonal parity symbols, e.g., d0,5 is the XOR-sum of data blocks {d0,0, d3,2, d2,3, d1,4}. Diagonal p−1, i.e., {di,j | i + j ≡ p−1 (mod p)}, is called the missing diagonal as it does not have a corresponding diagonal parity. Symbols {d0,4, d1,3, d2,2, d3,1} constitute the missing diagonal. Refer to [5] for more details.
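To make the construction concrete, the following is a minimal Python sketch of RDP encoding for one stripe, assuming symbols are modeled as non-negative integers and parity is their bitwise XOR. The function name rdp_encode and the array layout (rows 0..p−2, columns 0..p, with column p−1 holding row parity and column p holding diagonal parity) are our own illustration, not code from the paper.

```python
def rdp_encode(info, p):
    """info: (p-1) x (p-1) list of information symbols d[i][j] (integers)."""
    stripe = [row[:] + [0, 0] for row in info]   # append the two parity columns
    for i in range(p - 1):                       # row parity goes to column p-1
        row_parity = 0
        for j in range(p - 1):
            row_parity ^= stripe[i][j]
        stripe[i][p - 1] = row_parity
    for i in range(p - 1):                       # diagonal parity goes to column p
        diag_parity = 0
        for j in range(p):                       # includes the row parity column
            r = (i - j) % p                      # d[r][j] lies on diagonal i
            if r <= p - 2:                       # rows p-1 and beyond do not exist
                diag_parity ^= stripe[r][j]
        stripe[i][p] = diag_parity
    return stripe

# e.g. rdp_encode([[1,2,3,4]]*4, 5) builds the p = 5 stripe of Figure 1.
```

For p = 5 this reproduces, for example, d0,5 = d0,0 ⊕ d3,2 ⊕ d2,3 ⊕ d1,4 as described above.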
Figure 1: Storage system with RDP code of p = 5.
2.1 Conventional Recovery Scheme for RDP Storage Systems
In this subsection, we will discuss the "conventional" approach of recovering a failed disk in RDP storage systems.
For RDP, if the failed disk is an information disk, the conventional approach is to use the row parity and the information symbols in that row to recover each erasure symbol. For example, if Disk 0 fails, one can read {d0,1, d0,2, d0,3, d0,4} to recover d0,0.
If the failed disk is a parity disk, recovery is equivalent to the corresponding encoding. For example, if Disk 4 fails, one can XOR-sum {d0,0,d0,1,d0,2,d0,3} to reconstruct d0,4. If Disk 5 fails, one can XOR-sum {d0,3,d1,2,d2,1,d3,0} to reconstruct d3,5.
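For concreteness, the sketch below (our own, reusing the integer-symbol stripe model of the earlier sketch) shows the conventional recovery of an information column: every lost symbol is rebuilt from its row parity group alone.

```python
def conventional_recover_column(stripe, p, k):
    """Rebuild information column k using only row parity sets."""
    recovered = {}
    for i in range(p - 1):
        acc = 0
        for c in range(p):          # row parity set R_i spans columns 0..p-1
            if c != k:
                acc ^= stripe[i][c]
        recovered[i] = acc          # (p-1) reads per symbol, (p-1)^2 per stripe
    return recovered
```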
It is important to note that the conventional strategy for single failure recovery of RDP code only uses a single parity column. However, all data blocks are protected by two different parity groups. In the following, we will present the RDOR scheme which uses both parities. The interesting properties of RDOR are that (1) it reduces the number of disk read operations; (2) it maintains the load balancing property among all surviving disks.
3. HYBRID RECOVERY APPROACH FOR SINGLE FAILURE
In this section, we will first introduce an example to illustrate the idea of RDOR for single failure recovery, and then we formally present the recovery process.

Example 1: Let us consider the storage system in Figure 2, which uses an RDP code with p = 7. Assume that Disk 0 (column 0) fails. The six information symbols di,0 (0 ≤ i ≤ 5) are to be recovered. With the conventional recovery strategy, di,0 (0 ≤ i ≤ 5) are all recovered from row parity (Disk 6). Therefore, 36 symbols need to be read from disks for the recovery. As illustrated in Figure 2, the erasure symbols are labeled with "×", and the symbols used for recovery are labeled with "⃝".
Figure 2: Conventional scheme to recover Disk 0.
Minimizing Disk Reads:¹ Because an information symbol can be recovered from either row parity or diagonal parity, the conventional recovery strategy is only one of many possible schemes. Suppose that in Example 1, d0,0, d1,0, d2,0 are recovered from row parity and d3,0, d4,0, d5,0 are recovered from diagonal parity; then there are 9 overlapping symbols which would otherwise be read twice during the recovery process. Figure 3 illustrates this case, where the symbols labeled with "⃝" are read for the recovery through row parity and the remaining marked symbols are read for the recovery through diagonal parity. If an overlapping symbol is stored in memory after it is read from a disk the first time, it can be read from memory for the recovery of other information symbols. So a total of 36 − 9 = 27 symbols are read from disk for the recovery, while 9 symbols are read from memory. Because memory reads are much faster than disk reads, reading symbols from memory instead of from disk speeds up the recovery process and reduces the communication load of the storage system. The idea of reusing previously read symbols to reduce I/O operations is appealing.
¹ The main idea of using double parities for single failure recovery to reduce the number of disk reads can be generalized to other double fault tolerant array codes such as EVENODD and X-code.
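The saving from overlapping symbols can be checked by directly counting distinct reads. The sketch below is our own illustration: diag_rows is the set of rows of the failed column rebuilt from diagonal parity, and the function returns how many distinct symbols must be fetched from the surviving disks for one stripe.

```python
def disk_reads_for_split(p, diag_rows, failed_col=0):
    """Count distinct symbols read from surviving disks for one stripe."""
    needed = set()
    for i in range(p - 1):                   # erasure symbols d[i][failed_col]
        if i in diag_rows:                   # rebuild from its diagonal parity set
            diag = (i + failed_col) % p
            cols = range(p + 1)              # diagonal sets reach into column p
        else:                                # rebuild from row parity set R_i
            diag = None
            cols = range(p)                  # row sets cover columns 0..p-1
        for c in cols:
            r = i if diag is None else (diag - c) % p
            if c != failed_col and r <= p - 2:
                needed.add((r, c))
    return len(needed)

print(disk_reads_for_split(7, set()))        # conventional scheme: 36 reads
print(disk_reads_for_split(7, {3, 4, 5}))    # split of Figure 3: 27 reads
print(disk_reads_for_split(7, {0, 1, 3}))    # split of Figure 4: 27 reads
```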
Theoretically, we would also like to answer the following questions:

• What is the lower bound on disk reads for single failure recovery?

• How do we design a recovery scheme which matches this lower bound?

Figure 3: Reducing the number of disk reads for recovering Disk 0.

Balancing Disk Reads: During recovery, one would expect the recovery read requests to be uniformly distributed among all surviving disks to speed up the recovery process. For the conventional recovery scheme, this can be achieved by parity rotation between stripes. However, when double parities are used for the recovery, the number of symbols read from each surviving disk is determined by the recovery scheme we choose and cannot be balanced simply by rotating parities between stripes. For example, if the recovery scheme in Figure 3 is carried out in each stripe, nearly all the data on Disk 1 needs to be read for the recovery of the failed Disk 0, while only half of the data on Disk 4 is read.

Consider another hybrid recovery scheme in Example 1, where d2,0, d4,0, d5,0 are recovered from row parity and d0,0, d1,0, d3,0 are recovered from diagonal parity. Figure 4 depicts the symbols read for the recovery with this scheme. In Figure 4, 27 symbols are read from disks for the recovery, the same as in Figure 3. However, in Figure 4, the numbers of symbols we need to read from Disk 1 to Disk 6 for the recovery are all equal to 4. In other words, the additional workload to recover the data in Disk 0 is perfectly balanced among all surviving disks. One interesting question we would like to address is:

• Does there exist a balanced recovery scheme with which the numbers of read operations from each surviving disk are the same?

Figure 4: Load balancing in recovering Disk 0.

In the following sections, we will analyze the minimum disk reads and balanced disk reads for the recovery. We will
also design recovery schemes for single disk failure recovery of RDP code which match the minimum disk reads and balanced disk reads simultaneously.
4. THE RECOVERY OF SINGLE DISK FAILURE

4.1 Lower Bounds of Disk Reads for Single Failure Recovery

In this section, we derive the lower bound of the number of symbols to be read from the disks for any single column erasure recovery under the RDP code. Because an erasure symbol can be recovered either from the row parity or the diagonal parity, we will show that there are many choices for the recovery of an erasure column. Among the different recovery choices, we want to find one which minimizes the number of disk reads. To describe the different recovery choices, we define the following notations.

Definition 1.

1) Define Ri = {di,r | 0 ≤ r ≤ p−1} as the i-th row parity set, 0 ≤ i ≤ p−2;

2) Define Dj = {di,r | (i+r) mod p = j, 0 ≤ i ≤ p−2, 0 ≤ r ≤ p} as the j-th diagonal parity set, 0 ≤ j ≤ p−2.

From Definition 1, |Ri| = p for 0 ≤ i ≤ p−2 and |Dj| = p for 0 ≤ j ≤ p−2 (note that dj,0 ∈ Dj and dj,p ∈ Dj).

Ri is the set of symbols in row i except for the diagonal parity symbol di,p in column p. According to the RDP code, di,p−1 ∈ Ri is the XOR-sum of all the other symbols in Ri, i.e., di,p−1 = di,0 ⊕ di,1 ⊕ ··· ⊕ di,p−2. So given a symbol d ∈ Ri, d can be recovered by XOR-summing all the other symbols in Ri − {d}. Dj is the set of symbols in the diagonal which contains dj,p, the j-th symbol in column p. For the same reason, given a symbol d ∈ Dj, d can be recovered by XOR-summing all the other symbols in Dj − {d}. So we have the following Lemma 1.

Lemma 1. Given a symbol d in an RDP disk array,

1) if d ∈ Ri, d can be recovered by XORing all symbols in Ri − {d} (for ease of presentation, we say that d can be recovered from Ri throughout this paper);

2) if d ∈ Dj, d can be recovered from Dj;

3) d can only be recovered by 1) and/or 2).

Remark: Erasure symbols can only be recovered from their parity sets, but the surviving symbols used for the recovery can either be read directly from disks or be generated from other parity sets. In this paper, we only consider the case that all the surviving symbols are read directly from disks. In the appendix, we will show that the lower bound of disk reads in the following Theorem 1 cannot be further reduced by generating surviving symbols from other parity sets.
To illustrate, consider the example in Figure 1. We have R0 = {d0,0, d0,1, d0,2, d0,3, d0,4} and D0 = {d0,0, d3,2, d2,3, d1,4, d0,5}, with d0,0 ∈ R0 and d0,0 ∈ D0, so d0,0 can be recovered from R0 or from D0. On the other hand, d0,4 ∈ R0 but d0,4 ∉ Dj for 0 ≤ j ≤ p−2, so d0,4 can only be recovered from R0.
Assume that Disk k in the disk array fails. We need to recover all the erasure symbols di,k (0 ≤ i ≤ p−2) in column k (0 ≤ k ≤ p). Based on RDP code, we have the following Lemma 2.
Lemma 2. Given an erasure column k, 0 ≤ k ≤ p:

1) If 0 ≤ k ≤ p−1, i.e., column k is not the diagonal parity column:

a) If i ≠ p−1−k, then di,k ∈ Ri and di,k ∈ D<i+k>p.² Symbol di,k can be recovered from either Ri or D<i+k>p.

b) If i = p−1−k, then di,k belongs only to Rp−1−k. Symbol dp−1−k,k is in the missing diagonal and can only be recovered from Rp−1−k.

2) If k = p, i.e., column k is the diagonal parity column, then di,p ∈ Di and di,p can only be recovered from Di.
Remark: Lemma 2 shows the possible choices for the recovery of an erasure symbol. Case a) indicates that an erasure symbol which is not in the diagonal parity column and not in the missing diagonal can be recovered from a row parity set or from a diagonal parity set. Case b) indicates that an erasure symbol which is not in the diagonal parity column but is in the missing diagonal can be recovered only from a row parity set. Case 2) indicates that all symbols in the diagonal parity column can be recovered only from diagonal parity sets.
Figure 5 shows the possible recovery choices of each erasure symbol with p = 5. Instead of using di,j to denote the data at the i-th row and the j-th column, we use a pair of numbers to identify the parity sets the corresponding symbol belongs to. The first number in the pair is the subscript of the row parity set, and the second number is the subscript of the diagonal parity set. The symbols in column 5 belong only to their diagonal parity sets, and the symbols along the missing diagonal belong only to their row parity sets. For example, (2, 3) at row 2, column 1 means that d2,1 can be recovered from R2 or D3; (1, null) at row 1, column 3 means that d1,3 can only be recovered from R1.
We can choose a combination of parity sets (we call it a recovery combination in the following) to recover the p−1 erasure symbols in a column. Consider Disk 1 in Figure 5: we can choose the combination (R0, D2, R2, R3) to recover Disk 1, which means that R0 is used to recover d0,1, D2 is used to recover d1,1, and so on. Since there are two choices for the recovery of each of d0,1 (from R0 or D1), d1,1 and d2,1, and only one choice for the recovery of d3,1 (from R3 only), there are altogether 2 × 2 × 2 × 1 = 8 recovery combinations to reconstruct the symbols in Disk 1. The conventional recovery scheme, (R0, R1, R2, R3), is just one such combination. Since Disk p (column p) is the diagonal parity column, which can only be recovered by diagonal parity sets, there is only one recovery combination for column p. So we only consider the recovery of Disk k with k ≠ p in this paper.
² For ease of presentation, we denote r mod p as <r>p in this paper.
(Entry r:d means the symbol belongs to row parity set Rr and diagonal parity set Dd; "null" means it belongs to no parity set of that type.)

          Disk 0   Disk 1   Disk 2   Disk 3   Disk 4   Disk 5
  Row 0   0:0      0:1      0:2      0:3      0:null   null:0
  Row 1   1:1      1:2      1:3      1:null   1:0      null:1
  Row 2   2:2      2:3      2:null   2:0      2:1      null:2
  Row 3   3:3      3:null   3:0      3:1      3:2      null:3

Figure 5: RDP coding with Ri, Dj representation.
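The membership pattern of Figure 5 can also be generated programmatically. The following sketch (our own naming) returns, for each symbol di,j, the pair of parity set indices it belongs to, with None standing for the "null" entries in Figure 5.

```python
def parity_choices(p):
    """table[i][j] = (row parity index, diagonal parity index) for symbol d[i][j]."""
    table = []
    for i in range(p - 1):
        row = []
        for j in range(p + 1):
            r_idx = i if j <= p - 1 else None    # column p belongs to no R_i
            d_idx = (i + j) % p
            if d_idx == p - 1:                   # missing diagonal has no parity
                d_idx = None
            row.append((r_idx, d_idx))
        table.append(row)
    return table

# e.g. parity_choices(5)[2][1] == (2, 3): d2,1 can be recovered from R2 or D3.
```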
Our objective is to find, among all possible recovery combinations, the read-optimal one which has the minimum number of disk reads. This is equivalent to finding a recovery combination which has the largest number of overlapping symbols. The following Lemma 3 states the property of overlapping symbols between two parity sets.
Lemma 3.

1) There is exactly one overlapping symbol between each pair of Ri and Dj for 0 ≤ i, j ≤ p−2, and the overlapping symbol is Ri ∩ Dj = {di,<j−i>p}.

2) There is no overlapping symbol between each pair of Ri and Rj, or of Di and Dj, i.e., Ri ∩ Rj = ∅ and Di ∩ Dj = ∅ for 0 ≤ i, j ≤ p−2, i ≠ j.

It is interesting to point out that the conventional recovery strategy for single disk failure only uses row parity sets for the recovery. Lemma 3 shows that there is no overlapping symbol during the recovery process with the conventional recovery strategy; therefore, one cannot reduce the read operations using the conventional recovery scheme. Also according to Lemma 3, if we use both the row and the diagonal parity sets to recover Disk k (k ≠ p), we can reduce disk read operations by keeping overlapping symbols in memory. The following Theorem 1 gives a lower bound on the disk reads for any single erasure column recovery.
Theorem 1.

1) For any erasure column k (k ≠ p), the minimum number of disk reads for single disk failure recovery of RDP code is 3(p−1)²/4.

2) Any recovery combination which consists of (p−1)/2 row parity sets and (p−1)/2 diagonal parity sets matches the minimum number of disk reads. There are C(p−1, (p−1)/2) possible recovery combinations which match the minimum number of disk reads.
Proof. (1) Let the erasure column be column k (0 ≤ k ≤ p−1). We need to recover all the p−1 symbols di,k (0 ≤ i ≤ p−2) in column k. An erasure symbol di,k can only be recovered by the parity sets to which it belongs. From Lemma 3, we know there is one overlapping symbol between each pair of Ri and Dj. If t erasure symbols (other than dp−1−k,k) are recovered from diagonal parity sets and the remaining p−1−t symbols are recovered from row parity sets, the number of overlapping symbols is

t(p−1−t) = (p−1)²/4 − (t − (p−1)/2)².

When t = (p−1)/2, the number of overlapping symbols t(p−1−t) is maximized and equals (p−1)²/4. During the recovery process, if an overlapping symbol is read from a disk to recover an erasure symbol, it is stored in memory for the recovery of another erasure symbol; when recovering the later erasure symbol, the overlapping symbol can be read from memory instead of performing another disk read. So at most (p−1)²/4 disk reads can be saved, and the minimum number of disk reads for the recovery is

(p−1)² − (p−1)²/4 = 3(p−1)²/4.

A read-optimal recovery combination which matches the lower bound should consist of (p−1)/2 row parity sets and (p−1)/2 diagonal parity sets.

(2) The correctness of condition 2) follows directly from the proof of condition 1).

Therefore, Theorem 1 concludes.

From Theorem 1, a read-optimal recovery scheme can be stated as follows: for a single failed Disk k, choose any (p−1)/2 symbols (note: dp−1−k,k must be included), reconstruct them from row parity sets, and reconstruct the remaining (p−1)/2 symbols from diagonal parity sets.

However, the read operations on individual disks are the same (or almost the same) for some choices, while they vary greatly for other choices. In the next subsection, we will study how to balance the disk reads.

4.2 Balancing Number of Disk Reads

Let the erasure column be column k (0 ≤ k ≤ p−1). The minimum number of symbols to be read from disks for the recovery is 3(p−1)²/4. Since any read-optimal recovery combination contains (p−1)/2 diagonal parity sets, (p−1)/2 symbols will always be read from Disk p (the diagonal parity disk). Then the average number of symbols to be read from each of the other surviving disks (except for Disk k and Disk p) is

(3(p−1)²/4 − (p−1)/2) / (p−1) = (3p−5)/4.

As (p−1)/2 symbols will always be read from Disk p, we only consider how to provide a balanced-read recovery scheme on the remaining disks (except for Disk k and Disk p).

Given a prime number p > 2, we have to consider two cases: either p ≡ 3 (mod 4) or p ≡ 1 (mod 4).

Case 1: If p ≡ 3 (mod 4), then 3p−5 is divisible by 4. A balanced read-optimal recovery combination should read (3p−5)/4 symbols from each of the disks (except for Disk k and Disk p).

Case 2: If p ≡ 1 (mod 4), then 3p−5 is not divisible by 4. A balanced and read-optimal recovery combination should read ⌈(3p−5)/4⌉ symbols from some disks and ⌊(3p−5)/4⌋ symbols from the rest.
In the following, we will only present the analysis for the case of p ≡ 3 (mod 4). The analysis for the case of p ≡ 1 (mod 4) is similar, so we will only list the conclusions for p ≡ 1 (mod 4) at the end of this section and omit their analysis for brevity.

To simplify the analysis of balanced disk reads, we introduce a recovery sequence x0, x1, ..., xp−2, where xi = 0 means that di,k is recovered from its row parity set, and xi = 1 means that di,k is recovered from its diagonal parity set. A recovery combination is thus exactly represented by a recovery sequence.

Our analysis of balanced disk reads is based on the finite field Fp. To facilitate the analysis, we add an additional row p−1 with all symbols dp−1,j = 0 (0 ≤ j ≤ p) to the encoded array throughout this section, so the row number set {0, 1, ..., p−1} consists of all the elements of Fp. With the additional row, the disk array is a p × (p+1) array. Since all symbols in row p−1 are 0, dp−1,k can be regarded as being recovered from its row parity set, so xp−1 = 0. Moreover, dp−1−k,k is in the missing diagonal and belongs only to its row parity set, so it can only be recovered by its row parity set and xp−1−k = 0. With the additional row, a recovery sequence is x0, x1, ..., xp−2, xp−1 with xp−1 = 0 and xp−1−k = 0.

A balanced recovery scheme will read the same (or almost the same) number of symbols from each of the surviving disks. Given a recovery sequence {xi}0≤i≤p−1, the following Lemma 4 indicates how many symbols will not be read from a surviving disk with the recovery combination corresponding to {xi}0≤i≤p−1. It helps to understand Theorem 2.
Lemma 4. Given a recovery sequence {xi}0≤i≤p−1, if an erasure column k (0 ≤ k ≤ p−1) is recovered by the recovery combination corresponding to {xi}0≤i≤p−1, the number of symbols which are not read from Disk j (0 ≤ j ≤ p−1, j ≠ k) is

Σ_{i=0}^{p−1} xi − Σ_{i=0}^{p−1} xi · x<i+j−k>p.
Proof. We define ci,j to identify whether di,j is read for the recovery: if di,j is read for the recovery, ci,j = 0; otherwise ci,j = 1 (0 ≤ i, j ≤ p−1).

If xi = 0, symbol di,k is recovered by the row parity set Ri, and di,j is read to recover di,k because di,j ∈ Ri. If x<i+j−k>p = 1, symbol d<i+j−k>p,k is recovered by the diagonal parity set D<i+j>p, and di,j is read to recover d<i+j−k>p,k because di,j ∈ D<i+j>p. These are the only two ways in which di,j can be used, so ci,j = 0 exactly when xi = 0 or x<i+j−k>p = 1. Therefore ci,j = xi(1 − x<i+j−k>p).

The number of symbols which do not need to be read from Disk j (0 ≤ j ≤ p−1, j ≠ k) for the recovery of Disk k is

Σ_{i=0}^{p−1} ci,j = Σ_{i=0}^{p−1} xi − Σ_{i=0}^{p−1} xi · x<i+j−k>p.

Therefore, Lemma 4 concludes.
The following Theorem 2 provides a necessary and sufficient condition for a recovery sequence to be both read-optimal and balanced.

Theorem 2. {xi}0≤i≤p−1 is a read-optimal and balanced recovery sequence for Disk k (0 ≤ k ≤ p−1) if the following three conditions hold:

1) x0 + x1 + ··· + xp−2 + xp−1 = (p−1)/2,

2) xp−1−k = xp−1 = 0,

3) for every t, 1 ≤ t ≤ p−1, Σ_{i=0}^{p−1} xi · x<i+t>p = (p−3)/4.

Proof. Condition 1) means that (p−1)/2 symbols are recovered from row parity sets and (p−1)/2 symbols are recovered from diagonal parity sets, while condition 2) means that dp−1−k,k (which is in the missing diagonal) and dp−1,k (which is an additional symbol) are recovered from their row parity sets. From Theorem 1, conditions 1) and 2) together are equivalent to {xi}0≤i≤p−1 being a read-optimal sequence.

In the remainder of the proof, we concentrate on condition 3).
Since any read-optimal recovery reads on average (3p−5)/4 symbols from each of the surviving disks (except for Disk p), if the disk reads of the recovery are balanced, then

(p − 1) − (3p − 5)/4 = (p + 1)/4

symbols are not read from each of these disks. From Lemma 4, the number of symbols which do not need to be read from Disk j (0 ≤ j ≤ p−1, j ≠ k) is

Σ_{i=0}^{p−1} xi − Σ_{i=0}^{p−1} xi · x<i+j−k>p.   (1)

Therefore

Σ_{i=0}^{p−1} xi − Σ_{i=0}^{p−1} xi · x<i+j−k>p = (p + 1)/4.   (2)

Because the recovery is read-optimal, from Theorem 1, (p−1)/2 symbols in column k are recovered from their diagonal parity sets, so

Σ_{i=0}^{p−1} xi = (p − 1)/2.

From Equations (1) and (2),

Σ_{i=0}^{p−1} xi · x<i+j−k>p = (p − 3)/4.   (3)

Let t = <j − k>p. As j ranges over all surviving disks other than Disk k and Disk p, t takes every value 1, ..., p−1, so condition 3) holds.

Conversely, one can prove in a similar way that if condition 3) holds, then (3p−5)/4 symbols are read from each of the disks (except for Disk k and Disk p), so {xi}0≤i≤p−1 is balanced.
Therefore, Theorem 2 holds.
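For small p, Theorem 2 can be checked by brute force. The sketch below (our own helper names, for p ≡ 3 (mod 4)) enumerates candidate recovery sequences, tests the three conditions, and counts the per-disk reads directly; sequences passing the test should read (3p−5)/4 symbols from every surviving disk other than the diagonal parity disk, which contributes (p−1)/2 reads.

```python
from itertools import combinations

def reads_per_disk(p, k, x):
    """x[i] = 1 means d[i][k] is rebuilt from diagonal parity; count reads per disk."""
    reads = {j: set() for j in range(p + 1) if j != k}
    for i in range(p - 1):
        if x[i]:
            diag = (i + k) % p
            for c in range(p + 1):
                r = (diag - c) % p
                if c != k and r <= p - 2:
                    reads[c].add(r)
        else:
            for c in range(p):
                if c != k:
                    reads[c].add(i)
    return {j: len(s) for j, s in reads.items()}

def satisfies_theorem2(p, k, x):
    cond1 = sum(x) == (p - 1) // 2
    cond2 = x[p - 1] == 0 and x[(p - 1 - k) % p] == 0
    cond3 = all(sum(x[i] * x[(i + t) % p] for i in range(p)) == (p - 3) // 4
                for t in range(1, p))
    return cond1 and cond2 and cond3

p, k = 7, 1
for ones in combinations(range(p), (p - 1) // 2):
    x = [1 if i in ones else 0 for i in range(p)]   # length p, dummy row p-1 included
    if satisfies_theorem2(p, k, x):
        # expect 4 = (3p-5)/4 reads from every surviving disk except disk p, which gets 3
        print(ones, reads_per_disk(p, k, x))
```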
In the remainder of this subsection, we use difference sets [31] to present the read-optimal and balanced recovery scheme. First, we introduce some definitions and properties of multisets and difference sets.
Definition 2. A multiset M is a generalization of a set in which there may be multiple instances of an element. M is denoted as

M = {k1 × a1, k2 × a2, ..., kn × an},

where a1, a2, ..., an are all the distinct elements in M, there are ki instances of ai in M, and ki is called the multiplicity of ai (1 ≤ i ≤ n).

Definition 3. Given a recovery sequence {xi}0≤i≤p−1, define Ax = {i | xi = 1, 0 ≤ i ≤ p−1} and the multiset MAx = {a1 − a2 | a1, a2 ∈ Ax, a1 ≠ a2}, where the differences are taken in Fp (i.e., modulo p).
Lemma 5. Given a recovery sequence {xi}0≤i≤p−1, the condition

Σ_{i=0}^{p−1} xi · x<i+t>p = (p − 3)/4 for every t, 1 ≤ t ≤ p − 1,

is equivalent to Ax = {i | xi = 1, 0 ≤ i ≤ p−1} satisfying MAx = {[(p−3)/4] × 1, [(p−3)/4] × 2, ..., [(p−3)/4] × (p−1)}, i.e., for any t, 1 ≤ t ≤ p−1, the multiplicity of t in MAx is (p−3)/4.

Proof. Given t, 1 ≤ t ≤ p−1. The sum Σ_{i=0}^{p−1} xi · x<i+t>p counts exactly the indices i with xi = x<i+t>p = 1, i.e., the indices i with i ∈ Ax and <i+t>p ∈ Ax. Each such index contributes the difference <i+t>p − i ≡ t (mod p) to the multiset MAx, and every occurrence of t in MAx arises from such a pair. So the sum equals (p−3)/4 for every t, 1 ≤ t ≤ p−1, if and only if the multiplicity of every t in MAx is (p−3)/4. Lemma 5 concludes.

Consider the example in Figure 4: (D0, D1, R2, D3, R4, R5) is a read-optimal and balanced recovery combination for Disk 0. The corresponding recovery sequence is {xi}0≤i≤6 = 1101000, and Ax = {0, 1, 3}. MAx = {a1 − a2 | a1, a2 ∈ Ax, a1 ≠ a2} = {1, 2, 3, 4, 5, 6}, so every t (1 ≤ t ≤ 6) has multiplicity (p−3)/4 = 1 in MAx.

Definition 4. Let G be an additive group of order v and S be a subset of G with k elements. S is called a (v, k, λ)-difference set if each nonzero element of G has multiplicity λ in the multiset MS = {s1 − s2 | s1, s2 ∈ S, s1 ≠ s2}.

Definition 5. Let p be a prime number. Sp = {i² mod p | 1 ≤ i ≤ p−1} is called the nonzero square set of Fp, and Sp′ = Fp \ (Sp ∪ {0}) is called the non-square set of Fp.

Theorem 3. [31] Let p be a prime number with p ≡ 3 (mod 4). Sp and Sp′ are both (p, (p−1)/2, (p−3)/4)-difference sets in the additive group of Fp.

Since (p−i)² ≡ (p² − 2pi + i²) ≡ i² (mod p), the nonzero square set Sp = {i² mod p | 1 ≤ i ≤ p−1} = {i² mod p | 1 ≤ i ≤ (p−1)/2}. The following theorem gives a read-optimal and balanced recovery sequence {xi}0≤i≤p−1 for any failed Disk k (k ≠ p) when p ≡ 3 (mod 4).

Theorem 4. Given a prime number p with p ≡ 3 (mod 4), let Sp and Sp′ be the nonzero square set and the non-square set in Fp. Then

A = Sp′ − (k+1) if k ∈ Sp, and A = Sp − (k+1) if k ∉ Sp,

corresponds to a read-optimal and balanced recovery sequence {xi}0≤i≤p−1 for the recovery of Disk k (k ≠ p), where Sp′ − (k+1) = {s − (k+1) | s ∈ Sp′} and Sp − (k+1) = {s − (k+1) | s ∈ Sp}, with all arithmetic in Fp.
Proof. We only prove the case of k ∈ Sp; the proof for k ∉ Sp is similar and therefore omitted.

Given A = Sp′ − (k+1), define a recovery sequence {xi}0≤i≤p−1 by xi = 1 if i ∈ A and xi = 0 otherwise. We prove that {xi}0≤i≤p−1 is read-optimal and balanced by showing that it satisfies the three conditions of Theorem 2.

(1) Because |A| = |Sp′| = (p−1)/2, we have x0 + x1 + ··· + xp−2 + xp−1 = (p−1)/2. So {xi}0≤i≤p−1 satisfies condition 1) of Theorem 2.

(2) Given k ∈ Sp and A = Sp′ − (k+1):

k ∈ Sp ⇒ k ∉ Sp′ ⇒ k − (k+1) ∉ Sp′ − (k+1) ⇒ p−1 ∉ A ⇒ xp−1 = 0;

0 ∉ Sp′ ⇒ 0 − (k+1) ∉ Sp′ − (k+1) ⇒ p−1−k ∉ A ⇒ xp−1−k = 0.

So {xi}0≤i≤p−1 satisfies condition 2) of Theorem 2.

(3) According to Theorem 3, Sp′ is a (p, (p−1)/2, (p−3)/4)-difference set in the additive group of Fp. So every t (1 ≤ t ≤ p−1) has multiplicity (p−3)/4 in the multiset MSp′ = {s1 − s2 | s1, s2 ∈ Sp′, s1 ≠ s2}.

Because A = Sp′ − (k+1), the multiset MA = {a1 − a2 | a1, a2 ∈ A, a1 ≠ a2} = {s1 − s2 | s1, s2 ∈ Sp′, s1 ≠ s2} = MSp′. So every t (1 ≤ t ≤ p−1) has multiplicity (p−3)/4 in MA, which by Lemma 5 is equivalent to

Σ_{i=0}^{p−1} xi · x<i+t>p = (p−3)/4 for any t, 1 ≤ t ≤ p−1.

Thus, {xi}0≤i≤p−1 satisfies condition 3) of Theorem 2.

From (1) to (3), {xi}0≤i≤p−1 is a read-optimal and balanced recovery sequence when k ∈ Sp.

From Theorem 4, the read-optimal and balanced recovery scheme for a failed Disk k can be summarized as follows:

1) Compute the nonzero square set Sp and the non-square set Sp′ in Fp.

2) If k ∈ Sp, let A be Sp′ − (k+1); if k ∈ Sp′, let A be Sp − (k+1).

3) For the failed Disk k, the read-optimal and balanced recovery scheme corresponds to A: when i ∈ A, di,k is recovered by diagonal parity; otherwise di,k is recovered by row parity.
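The construction of steps 1)–3) is easy to express in code. The sketch below (our own function name) builds the set A from the quadratic residues of Fp; the comment reproduces the p = 7 example worked out next.

```python
def theorem4_set(p, k):
    """Rows of failed column k to rebuild from diagonal parity (Theorem 4 / 7)."""
    squares = {(i * i) % p for i in range(1, p)}       # S_p, the nonzero squares
    non_squares = set(range(1, p)) - squares           # S'_p
    base = non_squares if k in squares else squares
    return {(s - (k + 1)) % p for s in base}           # A = base - (k + 1) (mod p)

# p = 7, k = 1: squares = {1, 2, 4}, so A = {(s - 2) % 7 for s in {3, 5, 6}} = {1, 3, 4};
# rows 1, 3, 4 are rebuilt from diagonal parity, the remaining rows from row parity.
```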
Now we use an example to explain the read-optimal and balanced recovery scheme of Disk k (k ≠ p) when p ≡ 3 (mod 4). Assume that p = 7; then S7 = {i² mod 7 | 1 ≤ i ≤ (7−1)/2} = {1² mod 7, 2² mod 7, 3² mod 7} = {1, 2, 4}, and S7′ = F7 \ (S7 ∪ {0}) = {3, 5, 6}.
When Disk k = 1 fails, to recover the failed Disk 1, we need to find a read-optimal and balanced recovery sequence {xi}0≤i≤6. As k = 1 ∈ S7, from Theorem 4, A = S7′ − (k+1) = S7′ − (1+1) = {(s−2) mod 7 | s ∈ S7′} = {1, 3, 4}, which corresponds to the recovery sequence {xi}0≤i≤6 = 0101100. Thus, in the read-optimal and balanced recovery scheme, symbols d0,1, d2,1, d5,1 are recovered by row parity, and symbols d1,1, d3,1, d4,1 are recovered by diagonal parity.

When Disk k = 3 fails, k = 3 ∉ S7, so from Theorem 4, A = S7 − (k+1) = S7 − (3+1) = {(s−4) mod 7 | s ∈ S7} = {0, 4, 5}, which corresponds to the recovery sequence {xi}0≤i≤6 = 1000110. Thus, in the read-optimal and balanced recovery scheme, symbols d1,3, d2,3, d3,3 are recovered by row parity, and symbols d0,3, d4,3, d5,3 are recovered by diagonal parity.
The following Theorems 5 to 7 give a read-optimal and balanced recovery scheme for any single disk failure when p ≡ 1 (mod 4). They are analogous to Theorems 2 to 4 for the case p ≡ 3 (mod 4), respectively, and can be proved in a similar way; we omit the proofs for brevity.
Theorem 5. Let p be a prime number with p ≡ 1 (mod 4). {xi}0≤i≤p−1 is a read-optimal and balanced recovery sequence for the recovery of Disk k(0 ≤ k ≤ p−1) in a disk array with RDP code, if the following three conditions hold:
1) x0 + x1 + ··· + xp−2 + xp−1 = (p−1)/2,

2) xp−1 = xp−1−k = 0,

3) for every t (1 ≤ t ≤ p−1), either Σ_{i=0}^{p−1} xi · x<i+t>p = (p−1)/4 or Σ_{i=0}^{p−1} xi · x<i+t>p = (p−5)/4.

Definition 6. Let G be an additive group of order v and S be a subset of G with k elements. S is called a (v, k, λ, μ)-partial difference set of G if each nonzero element s ∈ S has multiplicity λ and each nonzero element s′ ∈ G\S has multiplicity μ in the multiset {s1 − s2 | s1, s2 ∈ S, s1 ≠ s2}.

Theorem 6. [21] Let p ≡ 1 (mod 4) be a prime number, Sp be the set of all nonzero squares in Fp, and Sp′ be the non-square set in Fp. Sp and Sp′ are both (p, (p−1)/2, (p−5)/4, (p−1)/4)-partial difference sets in the additive group of Fp.

Theorem 7. Given a prime number p with p ≡ 1 (mod 4), for the nonzero square set Sp and the non-square set Sp′ in Fp,

A = Sp′ − (k+1) if k ∈ Sp, and A = Sp − (k+1) if k ∉ Sp,

corresponds to a read-optimal and balanced recovery sequence {xi}0≤i≤p−1 for the recovery of Disk k (k ≠ p) in an RDP coded disk array.
4.3 The RDOR Recovery Scheme for Single Disk Failure
In the last subsection, we presented a recovery combination which is read-optimal and balanced for any erasure column k (k ≠ p). In the following, we propose the RDOR scheme, which is a read-optimal, balanced recovery scheme that buffers (p − 1)/2 extra symbols in memory.
To recover any single column k, we choose a read-optimal and balanced recovery combination according to the recovery sequence given by Theorem 4 or Theorem 7. For 0 ≤ i ≤ p−1, if i ∈ A, symbol di,k is recovered by the diagonal parity set D<i+k>p; otherwise di,k is recovered by the row parity set Ri. Any overlapping symbol d can be located in the disk array by calculating the intersection of the selected row parity sets and diagonal parity sets.

The recovery process of RDOR is executed in a "row-parity-first" manner, i.e., we first recover the (p−1)/2 erasure symbols dj,k with j ∉ A (0 ≤ j ≤ p−1) from Rj, and then recover the remaining (p−1)/2 symbols di,k with i ∈ A (0 ≤ i ≤ p−1). (p−1)/2 memory units are reserved for recovering the (p−1)/2 symbols di,k with i ∈ A.

In summary, the RDOR scheme works as follows (a code sketch of this procedure is given below):

1. (p−1)/2 memory units Mi (i ∈ A) are allocated, one for each erasure symbol di,k (i ∈ A), and the symbol di stored in Mi is initialized to 0.

2. Recover the (p−1)/2 symbols dj,k (0 ≤ j ≤ p−1, j ∉ A):

(a) Recover symbol dj,k by XOR-summing all available symbols in Rj;

(b) During this process, for each symbol d ∈ Rj, if it is an overlapping symbol which will be used later to recover some di,k (i ∈ A), it is XOR-summed into the symbol di stored in the corresponding memory unit Mi (in other words, di ← di ⊕ d).

3. Recover the remaining (p−1)/2 symbols di,k (i ∈ A):

(a) While recovering di,k (i ∈ A), for each symbol d ∈ D<i+k>p, if it is not an overlapping symbol, read it from disk and XOR-sum it into di;

(b) After 3(a) finishes, the recovered symbol di,k is stored in Mi.

4. Once the p−1 erasure symbols are successfully recovered, the recovery process moves on to the next stripe, and the (p−1)/2 memory units Mi can be reused.

We compare RDOR with the conventional recovery scheme in the following three aspects.

1. Number of Disk Reads: The conventional recovery performs (p−1)² disk reads per stripe while RDOR reads 3(p−1)²/4 symbols per stripe. RDOR reduces disk reads by approximately 25% compared with the conventional scheme, at the expense of a modest amount of extra memory.

2. Read Balance: Both RDOR and the conventional scheme achieve disk read balance during recovery.

3. Computational Complexity: The computational cost of a recovery scheme is determined by the total number of exclusive-or (XOR) operations needed for recovery. Each parity set contains p symbols, so to recover any erasure symbol, no matter which kind of parity is chosen, p−2 XOR operations on p−1 symbols are required. Thus the total number of XOR operations required by RDOR equals that of the conventional scheme.
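A compact sketch of the per-stripe RDOR procedure is given below. It is our own illustration, assuming the stripe is a (p−1) × (p+1) array of integer symbols, k is the failed column, and A is the set of rows assigned to diagonal recovery by Theorem 4 or Theorem 7.

```python
def rdor_recover_column(stripe, p, k, A):
    """Rebuild failed column k using (p-1)/2 row and (p-1)/2 diagonal parity sets."""
    mem = {i: 0 for i in A}                   # step 1: one memory unit per row in A
    recovered = {}
    for j in range(p - 1):                    # step 2: row-parity recoveries first
        if j in A:
            continue
        acc = 0
        for c in range(p):
            if c == k:
                continue
            d = stripe[j][c]                  # disk read
            acc ^= d
            i = (j + c - k) % p               # erasure row whose diagonal reuses d
            if i in A:
                mem[i] ^= d                   # overlapping symbol: fold into memory
        recovered[j] = acc
    for i in A:                               # step 3: diagonal-parity recoveries
        diag = (i + k) % p
        acc = mem[i]                          # overlaps already folded in
        for c in range(p + 1):
            r = (diag - c) % p
            if c == k or r > p - 2:
                continue
            if r in A or c == p:              # symbols not touched in step 2
                acc ^= stripe[r][c]           # disk read
        recovered[i] = acc
    return recovered                          # maps row -> rebuilt symbol of column k

# e.g. rdor_recover_column(stripe, 7, 0, {0, 1, 3}) rebuilds column 0 as in Figure 4.
```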
5. PERFORMANCE EVALUATION
To evaluate the performance of RDOR, we conduct experiments which compare the RDOR scheme with the conventional scheme in terms of total recovery time and individual disk access time.
5.1 Methodology
We conduct the experiments by using a widely accepted disk simulator, DiskSim [3]. The simulated disk array subsystem is composed of p + 1 disks, including two redundant disks. The logical layout is rotated between different parity stripes as in common RAID-6 implementations. Each disk is attached to a controller, and all these disk controllers are under the same interleaved I/O bus. The failed disk is reconstructed in an off-line mode, without any interruption from outside requests during recovery. We take this as a simplified approach to compare the performance of the RDOR scheme with the conventional scheme under the same circumstances.

For the recovery of a single failed disk, RDOR and the conventional scheme are both implemented in the simulation environment. In both schemes, during the recovery of each stripe, as soon as all the failed blocks are regenerated in memory, they are written to the spare disk at once. Two metrics, total recovery time and individual disk access time, are evaluated. Disk access time refers to the total time a disk spends in media access during recovery; it consists of seek time, rotation time, transfer time, etc.
Note that the strip size and the disk array size affect the system recovery performance. Our experiments are conducted with different numbers of disks and different strip sizes. The simulated disk's parameters are extracted from a modern enterprise hard disk [28], and Table 1 lists all the parameters used in this section.

Table 1: Disk array simulation parameters.

  Parameter                        Value
  I/O bus bandwidth                512 MBps
  XOR bandwidth                    1 GBps
  Disk capacity                    146.8 GB
  Sustainable disk transfer rate   99 MBps
  Rotation speed                   15000 RPM
  Single track seek                0.3 ms
  Full seek                        7.2 ms
5.2 Performance with Different Strip Sizes
To evaluate the recovery performance under different strip sizes, we vary the strip size from 32KB to 1024KB with the number of disks fixed at p + 1 = 8, 14, and 20 respectively.

Disk access time. RDOR improves the read access efficiency by reducing the number of disk reads on each surviving disk. Figure 6 shows that the average disk read access time of RDOR is reduced by 15.16% ∼ 22.60% compared with the conventional scheme. However, the ratio of disk access time does not show a stable tendency as the strip size varies from 32KB to 1024KB.

RDOR reduces disk reads but changes the read access pattern from purely sequential accesses to a more random pattern, and therefore incurs some additional disk seeks. The individual disk access time is determined not only by the amount of data read but also by the number of disk seeks and the seek distances while reading. This may be the main reason that the access time improvements shown in Figure 6 are less than the theoretical value of approximately 25%.
Figure 6: Average disk access time ratio of RDOR to conventional scheme with different strip sizes when p=7, 13, and 19.
Figure 7: Disk access time of RDOR and conven- tional scheme of each disk when p = 7 and strip size is 32KB. Disk 0 is the failed and thereafter replaced disk.
Moreover, the sequentiality of reads on the individual surviving disks in RDOR is not the same. As in Figure 4, reading all the needed data on Disk 1 is broken into 3 non-sequential read requests, while that of Disk 4 can be completed in only one read request. Therefore, the access time for reading within a stripe may differ across surviving disks. As parities are rotated between stripes, the access time of each disk for the recovery is balanced, which is shown in Figure 7.

Figure 7 shows the access time of each disk (Disk 0 is the failed and replaced disk) with p = 7 and a strip size of 32KB. From Figure 7, the access times of the different disks (excluding the spare Disk 0) with RDOR are almost the same, and the average disk access time is decreased by 18.25% compared with the conventional scheme.

Total recovery time. Figure 8 shows the total recovery time ratio of RDOR to the conventional scheme, from which we can see that RDOR consistently outperforms the conventional scheme as the strip size varies from 32KB to 1024KB. For example, when p = 7, the total recovery time of RDOR is reduced by 5.72% ∼ 12.60%. As shown in Figure 8, the improvements in recovery time first decrease as the strip size varies from 32KB to 128KB and then increase as the strip size varies from 128KB to 1024KB.
Figure 8: Recovery time ratio of RDOR to conven- tional scheme with different strip sizes when p = 7, 13, and 19.
However, in Figure 8 the total recovery time ratios are mostly larger than 0.9. Although RDOR can reduce disk reads by approximately 25%, the total recovery time is bottlenecked by the slowest surviving disk in recovering each stripe (or stripe set). Also, when there are no foreground requests served, the recovery buffer write to the spare disk is slower than the reads on the surviving disks, which becomes another bottleneck. Thus, the reduction in disk reads cannot be translated into an equivalent saving in recovery time.

In the case of an RDP storage system still serving foreground requests during recovery, the whole system goes into a degraded mode and the recovery process usually runs at a lower priority. Then, the reduction in disk read access time on the surviving disks can be translated into more savings in recovery time in the online scenario. The performance of RDOR is expected to be better in online recovery than in off-line recovery.
5.3 Performance with Different Number of Disks
In this subsection, we analyze the impact of different numbers of disks on recovery performance. We conduct experiments with the number of disks varying from 8 to 20, with fixed strip sizes of 64KB, 256KB, and 1024KB respectively.
Disk access time. Figure 9 shows that the disk access time ratios are almost the same for different numbers of disks when the strip size is fixed at 64KB, 256KB, and 1024KB respectively, which implies that RDOR can be implemented in fairly large storage systems to improve the read access efficiency. With a strip size of 1024KB, the disk access time of RDOR is reduced by 18.13% ∼ 20.17%.

Figure 9: Average disk access time ratio of RDOR to conventional scheme with different number of disks when strip size is 64KB, 256KB, and 1024KB.

Total recovery time. Figure 10 shows that the total recovery time ratio increases as the number of disks increases. With a strip size of 1024KB, the total recovery time ratio increases from 0.929 to 0.971 as the number of disks increases from 8 to 20. Since the access time for reading within a stripe may differ across surviving disks with RDOR, and the write requests to the spare disk must wait until the data has been reconstructed, the recovery process of RDOR may spend more time waiting and synchronizing with a large number of disks. Thus, RDOR provides comparatively smaller improvements in recovery time with a large number of disks.

Figure 10: Recovery time ratio of RDOR to conventional scheme with different number of disks when strip size is 64KB, 256KB, and 1024KB.
The experimental results can be summarized as follows:

1) By using RDOR, the access time of an individual disk for recovery is reduced by 15.16% ∼ 22.60% compared with the conventional scheme. Less disk access time indicates that the disk array system can further accelerate the recovery process in an online recovery scenario.

2) RDOR consistently reduces the total recovery time compared with the conventional scheme as the strip size varies from 32KB to 1024KB, which indicates that RDOR scales well as the strip size increases.

6. RELATED WORK
System designers expect better recovery solutions to minimize the time taken for recovery and the degradation of user performance in RAID storage systems. In recent years, a large number of approaches have been proposed for recovering a failed disk more efficiently.
Several approaches [11, 13, 35] address the problem by trading storage capacity for improved failure recovery performance. Menon and Mattson [22] showed that, instead of using a dedicated spare disk, there are many advantages to distributing the spare space across the disk array; in such an array, the recovery writes are distributed as well as the recovery reads. Clustered RAID [24], in which data plus parity groups are distributed over a larger number of disks, was introduced to reduce the increased load per disk.
To improve recovery parallelism, the Disk-Oriented Reconstruction (DOR) [13] algorithm executes reconstruction processes associated with disks, which keeps every surviving disk busy with reconstruction reads at all times. In another work, Lee and Lui [18] proposed a pipelined recovery algorithm that takes advantage of the sequential property of track retrievals to pipeline the reading and writing processes.
In online recovery mode, foreground requests interfere with the recovery process [23, 29]. To relieve the interference of the foreground workload, Wu et al. [33] proposed the WorkOut scheme, which temporarily redirects all write requests and popular read requests to a surrogate RAID set so that the degraded RAID set can be devoted to the recovery process. A popularity-based multi-threaded reconstruction optimization (PRO) algorithm [30] was proposed to accelerate users' accesses and avoid performance degradation by recovering high-popularity data units prior to other units.

In contrast to the above recovery mechanisms, which focus on reorganizing the data layout or optimizing the recovery workflow, other research has focused on designing recovery algorithms that are specific to particular erasure codes according to their coding characteristics. By using the generator check matrix of erasure codes, Hafner et al. [9] proposed a code-specific hybrid reconstruction algorithm which reduces XOR operations during recovery; due to the reduction of XOR operations, the decoding performance was improved. Later, a revised version of this algorithm was proposed in [4] to further improve space and time efficiency during recovery.
7. CONCLUSION
This paper proposes an optimal recovery scheme, RDOR, for single disk failure in RDP code storage systems. The RDOR scheme uses both row parity and diagonal parity for data reconstruction, and minimizes the number of disk reads from surviving disks to reduce the recovery time. We derive the lower bound of disk reads for the recovery and design recovery schemes which match this lower bound while also having a load balancing property over all surviving disks. Experimental results show that our scheme outperforms the conventional recovery scheme in terms of total recovery time and recovery workload on individual surviving disks. It is important to point out that the main idea of our approach can be generalized to other double fault tolerant array codes such as EVENODD or X-code. Our future work includes evaluating the performance on storage systems which are working in online recovery mode.
8. ACKNOWLEDGMENTS
We would like to thank our shepherd, Bianca Schroeder, and the anonymous reviewers for their valuable comments. We would also like to thank Kaiqian Ou, Haotian Zhao, Yubiao Pan and Runhui Li for the helpful discussions during this work. This work is supported by the National Science Foundation of China under Grant No. 60773036.
9. REFERENCES
[1] M. Baker, M. Shah, D. S. H. Rosenthal,
M. Roussopoulos, P. Maniatis, T. Giuli, and
P. Bungale. A fresh look at the reliability of long-term digital storage. In Proceedings of the 2006 EuroSys Conference, pages 221–234, 2006.
[2] M. Blaum, J. Brady, J. Bruck, and J. Menon. EVENODD: an efficient scheme for tolerating double disk failures in RAID architectures. IEEE Transactions on Computers, 44(2):192–202, 1995.
[3] J. Bucy, J. Schindler, S. Schlosser, and G. Ganger. The DiskSim simulation environment (v4.0). http://www.pdl.cmu.edu/DiskSim/.
[4] B. Cassidy and J. L. Hafner. Space efficient matrix methods for lost data reconstruction in erasure codes. Technical Report RJ10415, IBM Research, 2007.
[5] P. Corbett, B. English, A. Goel, T. Grcanac,
S. Kleiman, J. Leong, and S. Sankar. Row-diagonal parity for double disk failure correction. In FAST ’04: Proceedings of the 3rd USENIX Conference on File and Storage Technologies, pages 1–14, 2004.
[6] S. Ghemawat, H. Gobioff, and S.-T. Leung. The Google file system. In SOSP ’03: Proceedings of the 19th ACM Symposium on Operating systems Principles, pages 29–43, 2003.
[7] J. L. Hafner. WEAVER codes: highly fault tolerant erasure codes for storage systems. In FAST ’05: Proceedings of the 4th conference on USENIX Conference on File and Storage Technologies, pages 211–224, 2005.
[8] J. L. Hafner. HoVer erasure codes for disk arrays. In
DSN ’06: Proceedings of the International Conference on Dependable Systems and Networks, pages 217–226, 2006.
[9] J. L. Hafner, V. Deenadhayalan, K. K. Rao, and J. A. Tomlin. Matrix methods for lost data reconstruction in erasure codes. In FAST ’05: Proceedings of the 4th conference on USENIX Conference on File and Storage Technologies, pages 15–30, 2005.
[10] M. Holland. On-Line Data Reconstruction in Redundant Disk Arrays. PhD thesis, Carnegie Mellon University, 1994.
[11] M. Holland and G. A. Gibson. Parity declustering for continuous operation in redundant disk arrays. In Proceedings of the 5th international conference on Architectural support for programming languages and operating systems, pages 23–35, 1992.
[12] M. Holland, G. A. Gibson, and D. P. Siewiorek. Fast, on-line failure recovery in redundant disk arrays. In Proceedings of the 23rd Annual International Symposium on Fault-Tolerant Computing, pages 422–431, 1993.
[13] M. Holland, G. A. Gibson, and D. P. Siewiorek. Architectures and algorithms for on-line failure recovery in redundant disk arrays. Distributed and Parallel Databases, 2(3):295–335, 1994.
[14] C. Huang, M. Chen, and J. Li. Pyramid codes: flexible schemes to trade space for access efficiency in reliable data storage systems. In NCA ’07: 6th IEEE International Symposium on Network Computing Applications, pages 79–86, 2007.
[15] C. Huang and L. Xu. STAR: an efficient coding scheme for correcting triple storage node failures. In FAST ’05: Proceedings of the 4th conference on USENIX Conference on File and Storage Technologies, 2005.
[16] N. Joukov, A. M. Krishnakumar, C. Patti, A. Rai,
S. Satnur, A. Traeger, and E. Zadok. RAIF: Redundant array of independent filesystems. In MSST ’07: Proceedings of 24th IEEE Conference on Mass Storage Systems and Technologies, pages 199–212, 2007.
[17] J. Kubiatowicz, D. Bindel, Y. Chen, S. Czerwinski,
P. Eaton, D. Geels, R. Gummadi, S. Rhea,
H. Weatherspoon, W. Weimer, C. Wells, and B. Zhao. Oceanstore: an architecture for global-scale persistent storage. In ASPLOS ’00: Proceedings of the 9th International Conference on Architectural Support for Programming Languages and Operating Systems, 2000.
[18] J. Y. B. Lee and J. C. S. Lui. Automatic recovery from disk failure in continuous-media servers. IEEE Transactions on Parallel Distributed Systems, 13(5):499–515, 2002.
[19] C. Lueth. RAID-DP: Network Appliance implementation of RAID double parity for data protection. Technical Report No. 3298, Network Appliance Inc, 2004.
[20] P. Lyman and H. R. Varian. How much information? http://www.sims.berkeley.edu/how-much-info-2003.
[21] S. L. Ma. A survey of partial difference sets. Des. Codes Cryptography, 4(3):221–261, 1994.
[22] J. Menon and D. Mattson. Comparison of sparing alternative for disk arrays. In Proceedings of the International Symposium on Computer Architecture, pages 318–329, 1992.
[23] A. Merchant and P. S. Yu. Analytic modeling of clustered raid with mapping based on nearly random permutation. IEEE Transactions on Computers, 45(3):367–373, 1996.
[24] R. R. Muntz and J. C. S. Lui. Performance analysis of disk arrays under failure. In Proceedings of the sixteenth international conference on Very large databases, pages 162–173, 1990.
[25] E. Pinheiro, W.-D. Weber, and L. A. Barroso. Failure trends in a large disk drive population. In FAST ’07: Proceedings of the 5th conference on USENIX Conference on File and Storage Technologies, pages 17–28, 2007.
[26] J. S. Plank. The RAID-6 liberation codes. In FAST ’08: Proceedings of the 6th USENIX Conference on File and Storage Technologies, pages 1–14, 2008.
[27] B. Schroeder and G. A. Gibson. Disk failures in the real world: What does an MTTF of 1,000,000 hours mean to you? In FAST ’07: Proceedings of the 5th conference on USENIX Conference on File and Storage Technologies, pages 1–16, 2007.
Seagate Inc. Cheetah® 15K.5 Fibre Channel 146-GB Hard Drive ST3146855FC Product Manual.
[29] A. Thomasian and J. Menon. RAID-5 performance with distributed sparing. IEEE Transactions on Parallel Distributed Systems, 8(6):640–657, 1997.
[30] L. Tian, D. Feng, H. Jiang, K. Zhou, L. Zeng,
J. Chen, Z. Wang, and Z. Song. PRO: a popularity-based multi-threaded reconstruction optimization for RAID-structured storage systems. In FAST ’07: Proceedings of the 5th USENIX conference on File and Storage Technologies, 2007.
[31] J. van Lint, R. M. Wilson, and J. K. Hale. A Course
in Combinatorics. Cambridge University Press,
Cambridge, MA, 1993.
[32] Wikipedia. DDR2 SDRAM — Wikipedia, the free
encyclopedia, 2010.
[33] S. Wu, H. Jiang, D. Feng, L. Tian, and B. Mao.
WorkOut: I/O workload outsourcing for boosting RAID reconstruction performance. In FAST '09: Proceedings of the 7th Conference on File and Storage Technologies, pages 239–252, 2009.
[34] Q. Xin, E. L. Miller, T. Schwarz, D. D. E. Long, S. A. Brandt, and W. Litwin. Reliability mechanisms for very large storage systems. In MSST ’03: Proceedings of the 20th IEEE/11th NASA Goddard Conference on Mass Storage Systems and Technologies, pages 146–156, 2003.
[35] Q. Xin, E. L. Miller, and T. J. E. Schwarz. Evaluation of distributed recovery in large-scale storage systems. In HPDC ’04: Proceedings of the 13th IEEE International Symposium on High Performance Distributed Computing, pages 172–181, 2004.
[36] L. Xu and J. Bruck. X-code: MDS array codes with optimal encoding. IEEE Transactions on Information Theory, 45(1):272–276, 1999.
APPENDIX
In this appendix, we prove that the number of disk reads can achieve the lower bound 3(p−1)²/4 only when all the surviving symbols are read directly from disk.

Let the erasure column be column k (0 ≤ k ≤ p−1). We need to recover all the p−1 symbols di,k (0 ≤ i ≤ p−2) in column k. An erasure symbol di,k can only be recovered by the parity sets to which it belongs. Surviving symbols used for the recovery are either read directly from disk or generated from other parity sets.

Assume that in the recovery process, s surviving symbols are generated from other parity sets instead of being read directly from disks. Each generated symbol requires reading p−3 additional symbols. Therefore, p−1+s parity sets are needed in total, consisting of (p−1)² + s(p−3) symbols (note: some symbols are counted twice and will be subtracted below).

Among all the p−1+s parity sets, suppose there are t row parity sets and p−1+s−t diagonal parity sets. According to Lemma 3, there is one overlapping symbol between each pair of row and diagonal parity sets. So there are t(p−1+s−t) − 2s overlapping symbols (excluding the s surviving symbols which are generated from parity sets and the s symbols which overlap at the failed disk).

So the number of disk reads for the recovery is

(p−1)² + s(p−3) − (t(p−1+s−t) − 2s) = (t − (p−1+s)/2)² + 3(p−1)²/4 + s(2p−2−s)/4.

When t = (p−1+s)/2, the number of disk reads is minimized and equals 3(p−1)²/4 + s(2p−2−s)/4. As RDP can only tolerate double disk failures, 0 ≤ s < 2p−2, so only when s = 0 and t = (p−1)/2 does

3(p−1)²/4 + s(2p−2−s)/4 = 3(p−1)²/4,

which achieves the lower bound in Theorem 1. Therefore, no matter how we obtain the surviving symbols for the recovery, the lower bound of disk reads is 3(p−1)²/4.
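The closed form above can be spot-checked numerically for small parameters; the snippet below (our own helper names) verifies the algebraic identity and that the minimum over t never drops below 3(p−1)²/4 for s ≥ 0.

```python
def direct(p, s, t):
    """Disk read count as derived in the appendix, before completing the square."""
    return (p - 1) ** 2 + s * (p - 3) - (t * (p - 1 + s - t) - 2 * s)

def closed(p, s, t):
    """The completed-square form of the same count."""
    return (t - (p - 1 + s) / 2) ** 2 + 3 * (p - 1) ** 2 / 4 + s * (2 * p - 2 - s) / 4

for p in (5, 7, 11):
    for s in range(0, 4):
        ts = range(p - 1 + s + 1)
        assert all(abs(direct(p, s, t) - closed(p, s, t)) < 1e-9 for t in ts)
        assert min(direct(p, s, t) for t in ts) >= 3 * (p - 1) ** 2 / 4
```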