BTS: An Accelerator for Bootstrappable Fully Homomorphic Encryption
Sangpyo Kim† Jongmin Kim† Kim† Wonkyung Jung† ‡ ‡ Ahn†
Seoul National University†, KAIST‡
{vnb987, jongmin.kim, michael604, jungwk, {mrhu,
Copyright By PowCoder代写 加微信 powcoder
Abstract—Homomorphic encryption (HE) enables secure of- floading of computation to the cloud by providing computation on encrypted data (ciphertexts). HE is based on noisy encryption schemes such that noise accumulates as we apply more compu- tation to the data. The limited number of operations applicable to the data prevents practical applications from exploiting HE. Bootstrapping enables an unlimited number of operations or fully HE (FHE) by refreshing the ciphertext. Unfortunately, bootstrap- ping requires a significant amount of additional computation and memory bandwidth. Prior works have proposed hardware accelerators for computation primitives of FHE. However, to the best of our knowledge, this is the first to propose a hardware FHE accelerator tailored to support bootstrapping efficiently.
In particular, we propose BTS — Bootstrappable, Technology- driven, Secure accelerator architecture for FHE. We identify the challenges of supporting bootstrapping in the accelerator and analyze the off-chip memory bandwidth and computation required. In particular, given the limitations of modern memory technology, we identify the HE parameter sets that are efficient for FHE acceleration. Based on the insights from our analysis, we propose BTS that effectively exploits parallelism innate in HE operations by arranging a massive number of processing elements in a grid. We present the design and microarchitecture of BTS, including the network-on-chip that exploits the deterministic communication pattern. BTS shows 5,556× and 1,306× improved execution time on ResNet-20 and logistic regression over CPU, using 373.6mm2 chip area and up to 133.8W of power.
I. INTRODUCTION
Homomorphic encryption (HE) allows computation on en- crypted data or ciphertexts (cts). In the era of machine- learning-as-a-service (MLaaS), HE is being spotlighted as an enabler for privacy-preserving cloud computing, as it allows safe offloading of private data. Because HE schemes are based on the learning-with-errors (LWE) [63] problem, they are noisy in nature. Noise accumulates as we apply a sequence of computations on cts. This limits the number of computations that can be performed and hinders the HE applicability for practical applications. To overcome such limitation, fully HE (FHE) [34] was proposed, featuring an operation (op) called bootstrapping, that “refreshes” the ct and hence permits an unlimited number of computations on the ct. Among multiple HE schemes that support FHE, CKKS [20] is one of the prime candidates as it supports fixed-point real number arithmetic.
One of the main barriers to adopting HE has been its high computational and memory overhead. New schemes [13], [14], [20], [23], [33] and algorithmic optimizations [2], [12], [37], such as using the residue number system (RNS) [6], [18],
have been suggested to reduce this overhead and resulted in over 1,000,000× speedup [12] compared to its first HE implementation [35]. However, even with such efforts, HE ops experience tens of thousands of slowdowns compared to the unencrypted ops [43]. Tackling this, prior works have sought hardware solutions to accelerate HE ops, including CPU extensions [11], [43], GPU [1]–[3], [42], FPGA [47], [48], [64], [65], and ASIC [66].
However, there still remains ample room for improvement left on the table when it comes to accelerating FHE. While more than 1,000 bootstrapping ops are necessary per single ResNet-20 inference [51], each bootstrapping op takes dozens of seconds on the state-of-the-art CPU implementation [32]. GPU performs better, but it still takes hundreds of millisec- onds to perform a single bootstrapping [42]. Furthermore, prior works on custom hardware acceleration [47], [48], [64], [65] do not support HE parameters that allow bootstrapping. F1 [66] demonstrated bootstrapping time of CKKS but had limited throughput because its design was not focused on the FHE situation with frequent bootstrapping. These limitations prevent them from being adopted for complex applications.
We propose BTS, a bootstrapping-oriented FHE accelerator that is Bootstrappable, Technology-driven, and Secure. First, we identify the limitations in designing an HE accelerator im- posed by contemporary fabrication technology, analyzing the implications of various conflicting requirements for the per- formance and security of FHE under such constrained design space. This allows us to pinpoint the appropriate optimization targets and requirements of designing an FHE accelerator. Sec- ond, we establish a design principle of populating a massive number of small processing elements (PEs) in a grid instead of a few large PEs (Section IV). We base our principle on the observation that the amount of the two types of parallelism (residue-polynomial-wise and coefficient-wise) changes when it comes to the FHE optimal parameters. Finally, we optimize BTS microarchitecture by i) increasing the utilization rates of compute units via coarse-grained pipelining among different HE functions and ii) exploiting the computation patterns of HE functions for intelligent data mapping among the PEs with minimized NoC traffic (Section V).
Through these detailed studies, BTS achieves 5,714× speedup in multiplicative throughput against F1, the state-of- the-art ASIC implementation, when bootstrapping is properly considered. Also, BTS significantly reduces the training time
arXiv:2112.15479v1 [cs.CR] 31 Dec 2021
LIST OF SYMBOLS USED TO DESCRIBE CKKS [20].
B. CKKS: an emerging HE scheme
CKKS first encodes a message that is a vector of com- plex numbers, into a plaintext m(X) = N−1 ciXi, which
Q0 , …, Qdnum−1
p0 , …, pk−1
evk(r) rot
l Lboot k dnum λ
Definition
(Prime) moduli product =
(Prime) moduli
Modulus factors
is a polynomial in a cyclotomic polynomial ring RQ =
Special (prime) moduli product = k−1 pi i=0
Special (prime) moduli
Evaluation key (evk) for HMult
evk for HRot with a rotation amount of r The degree of a polynomial
Maximum (multiplicative) level
Current (multiplicative) level of a ciphertext Levels consumed at bootstrapping
The number of special prime moduli Decomposition number
Security parameter of a given CKKS instance
ZQ[X]/(XN+1). The coefficients {ci} are integers modulo Q and the number of coefficients (or degree) is up to N where N is a power-of-two integer, typically ranging from 210 to 218. For a given N, a message with up to N/2 complex numbers can be packed into a single plaintext in CKKS. Each element within a packed message is referred to as a slot. After encoding (or packing), element-wise multiplication (mult) and addition between two messages can be done through polynomial op- erations between plaintexts. CKKS then encrypts a plaintext m(X) ∈ RQ into a ct ∈ R2Q based on the following equation,
ct=(b(X),a(X))=(a(X)·s(X)+m(X)+e(X),a(X))
where s(X) ∈ RQ is a secret key, a(X) ∈ RQ is a random polynomial, and e(X) is a small error polynomial whose coefficients follow a discrete Gaussian distribution. CKKS decrypts ct by computing m′(X)=ct·(1,−s(X))=m(X)+ e(X), which is approximately same as m(X) with errors.
The main bottleneck in HE is the high computational com- plexity of polynomial ops. As each coefficient of a polynomial is a large integer (having up to 1,000s of bits) and the degree is high (even surpassing 100,000), an op between two polyno- mials has high compute and data-transfer cost. To reduce the computational complexity, variants of HE schemes [6], [18] have been proposed to use the residue number system (RNS). For example, Full-RNS CKKS [18] sets Q as the product of word-sized (prime) moduli {qi }0≤i≤L , where Q = Li=0 qi for a given integer L, called maximum (multiplicative) level. Using Chinese Remainder Theorem (Eq. 1), we represent a polynomial in RQ with residue polynomials in {Rqi }0≤i≤L, whose coefficients are residues obtained by performing mod- ulo qi (represented as [·]qi ) on the big coefficients:
[a(X)]Q→([a(X)]q0,…,[a(X)]qL) where Q=qi (1) i
Then, we can convert an op involving two polynomials to the ops between residue polynomials with small coefficients (< 64 bits) corresponding to the same prime modulus qi, avoiding costly big-integer arithmetic with carry propagation. Although Full-RNS CKKS was shown to provide about 8× improve- ment in performance (execution time) over CKKS [18], the performance overhead still remains very high [1], [12], [42], [64]. In this paper, we leverage Full-RNS CKKS as our CKKS implementation which represents a polynomial in RQ as an N×(L+1) matrix of residues, and a ct as a pair of such matrices.
C. Primitive operations (ops) of CKKS
Primitive HE ops of CKKS are introduced here, which can be combined to create more complex HE ops such as linear transformation and convolution. Given two ciphertexts ct0, ct1 where cti = (bi(X), ai(X)) and bi(X) = ai(X) · s(X ) + mi (X ), the HE ops can be summarized as follows.
HAdd performs an element-wise addition of ct0 and ct1:
of logistic regression [36] compared to CPU (by 1,306×) and GPU (by 27×) implementations, and can execute a ResNet-20 inference 5,556× faster than prior CPU implementations [51].
In this paper, we make the following key contributions:
• We provide a detailed analysis of the interplay of HE parameters impacting the performance of FHE accelerators.
• We propose BTS, a novel accelerator architecture equipped with massively parallel compute units and NoCs tailored to
the mathematical traits of FHE ops.
• BTS is the first accelerator targeting practical bootstrapping,
enabling unbounded multiplicative depth, which is essential for complex workloads.
II. BACKGROUND
We provide a brief overview of HE and CKKS [20] in par- ticular. Table I summarizes the key parameters and notations we use in this paper.
A. Homomorphic Encryption (HE)
HE enables direct computation on encrypted data, referred to as ciphertexts (cts), without decryption. There are two types of HE. Somewhat HE (SHE) supports a limited number of operations (ops) on a ct due to the noise accumulated after the ops. In contrast, Fully HE (FHE) allows unlimited number of ops on cts through bootstrapping [34] that “refreshes” a ct and lowers the impact of noise. There are popular FHE schemes [13], [14], [20], [23], [33] that support different types of data that can be encrypted and different kinds of ops. While other schemes support integer [13], [14], [33] or gate- level ops [23], CKKS [20] supports fixed-point complex (real) numbers. Since many real-world applications require arith- metic with real numbers, CKKS has become one of the most important algorithms among cutting-edge HE schemes [13], [14], [20], [33]. In this paper, we focus on accelerating CKKS ops; however, our proposed architecture is applicable to other popular HE schemes (e.g., BGV [13] and BFV [6], [14], [33]) as they share similar core ops.
ctadd = (b0(X) + b1(X), a0(X) + a1(X)) (2)
HMult consists of two steps, tensor product and key- switching. Tensor product first creates a tuple of polynomials (d0(X), d1(X), d2(X)):
d0(X) = b0(X) · b1(X)
d1(X) = a0(X) · b1(X) + a1(X) · b0(X) (3) d2(X) = a0(X) · a1(X)
By computing (d0 (X ), d1 (X ), d2 (X ))·(1, −s(X ), s(X )2 ), we
recover m0 (X )·m1 (X ), albeit with error terms. Key-switching
recombines the tensor-product result to be decryptable with
(1,−s(X)) using a public key, called evaluation key (evk).
An evk is a ct in R2PQ with a larger modulus PQ, where P =
(k−1 pi) ≥ Q for given special (prime) moduli p0, . . . , pk−1. i=0
We express an evk as a pair of N×(k+L+1) matrices. HMult is then computed using Eq. 4, which involves key-switching with an evk for mult, evkmult:
of HMult ops that can be performed without bootstrapping, and current (multiplicative) level l is the number of remaining HMult operations that can be performed on the ct. Thus, a ct with a level l is represented as a pair of N×(l+1) matrices.
FHE features a bootstrapping op that restores the multiplica- tive level (l) of a ct to enable more ops. For practical usage of HE with a complex sequence of HE ops, bootstrapping must be commonly performed. A bootstrapping itself consists of hundreds of primitive HE ops, consuming Lboot levels. Therefore, L should be larger than Lboot, and having a larger L is beneficial since it requires less frequent bootstrapping ops to execute a HE application with a fixed multiplicative depth. Lboot depends on the bootstrapping algorithm and typically ranges from 10 to 20 — larger Lboot permits using more precise and faster bootstrapping algorithms but at the cost of more frequent bootstrapping [12], [16], [37], [52]. The bootstrapping algorithm we use in this paper is based on [37] with updates to meet the latest security and precision requirements [12], [19], [54], and has Lboot of 19. Another CKKS-specific constraint is that the moduli qi’s and the special moduli pi’s must be large enough to tolerate the error accumulated during bootstrapping, whose typical values range from 240 to 260 [22], [32].
E. Modern algorithmic optimizations in CKKS & Tmult,a/slot
The level of security for the HE scheme is determined by the λ parameter as it determines the minimum logarithmic-time complexity for an attack [19] to deduce the message from a ctwithoutthesecretkey.Inthiswork,wetargetλof128bits, similar to recent HE studies [12], [52], [54] and libraries [32], [60]. A prior study, F1 [66], provided a substandard [4] level of security under 80 bits for CKKS bootstrapping and used smaller cts which simplifies the microarchitecture. λ is a strictly increasing function of N/log P Q [29].
Key-switching is an expensive function, taking most of the time in HRot and HMult [42]. We adopt the state-of-the-art generalized key-switching technique [37], which balances L, computational cost, and λ. [37] factorizes the moduli product Q into Q = Q0 ·...·Qdnum−1 (see Eq. 7) for a given integer dnum (decomposition number). It decomposes a ct into dnum slices, each consisting of the residue polynomials corresponding to the prime moduli (qi’s) that together compose the modulus factor Qj . We perform key-switching on each slice in RQj and later accumulate them. The special moduli product P should only satisfy P ≥ Qj for each Qj , thus we can choose smaller P, leading to higher λ. Because P has decreased, we can instead choose the same λ with higher Q, and accordingly, higher L values to apply more HE ops between bootstrapping.
ctmult =(d0(X),d1(X))+P−1(d2(X)·evkmult)
key-switching
HRot circularly shifts a message vector by slots. When a ct encrypts a message vector z = (z0 , ..., zN/2−1 ), after applying HRot by a rotation amount r, the rotated ciphertext ctrot encryptsz(r)=(zr,...,zN/2−1,z0,...,zr−1).HRotcon- sists of automorphism and key-switching. ct=(b(X),a(X)) is mapped to ct′ = (b(X 5r ), a(X 5r )) by automorphism. This moves coefficients of a polynomial through a mapping
i → σr(i), where i is the index of a coefficient ci and σr is: σr :i→i·5r modN (i=0,1,...,N−1) (5)
Similar to HMult, key-switching brings back ct′, which
was only decryptable with (1, −s(X 5r )) by automorphism,
to be decryptable with (1, −s(X )). An HRot with a different
rotation amount each requires a separate evk, evk(r) . HRot rot
is computed as follows:
ct = (b(X5r ), 0) + P −1(a(X5r ) · evk(r) ) (6)
HE applications require other HE ops, such as addition or mult of a ct with a scalar (CAdd, CMult) or a polynomial (PAdd, PMult) of unencrypted, constant values. Additions are performed by adding the scalar or polynomial to b(X), and mults are performed by multiplying the scalar or polynomial to both b(X) and a(X).
D. Multiplicative level and HE bootstrapping
The error included in a ct is amplified during HE ops; in particular, HMult multiplies the error e(X) with other terms (e.g., m0 (X ) and m1 (X )) and can result in an explosion of the error if not treated properly. CKKS performs HRescale to mitigate this explosion and keeps the error tolerable by dividing the ct with the last prime modulus qL [18]. After HRescale, the qL residue polynomial is no longer used, and the ct is reduced in size. The ct continues losing the residues of qL−1 , . . . , q1 with each HRescale during executing an HE application until only one residue polynomial is left when no additional HMult can be performed on the ct. L, or the max- imum multiplicative level, determines the maximum number
Q=q0·...·qL+1−1·qL+1 ·...·q2L+1−1·...·q(dnum−1)L+1 ·...·qL+1 dnum dnum dnum dnum
(7) Q0 Q1 Qdnum−1
A major challenge of the generalized key-switching is that different evks (evk0 , ..., evkdnum−1 ) must be prepared for each factor Qj where each evk is a pair of N×(k+L+1) matrices and k is set to (L+1)/dnum. Thus, the aggregate evk size becomes 2 · N · (L+1) · (dnum+1), linearly increasing with
dnum. The overall computational complexity of a single HE op also increases with dnum. Therefore, choosing an appropriate dnum is crucial for performance.
Changing the HE parameter set has mixed effects on the performance of HE ops. Decreasing N reduces computational complexity and memory usage. However, we should lower L and Q to sustain security, which requires more frequent bootstrapping. Also, since a ct with degree N can encode only up to N/2 message slots by packing, throughput degrades.
Jung et al. [42] introduced a metric called amortized mult time per slot (Tmult,a/slot), which is calculated as follows:
(a) Maximum level L
Fig. 1. (a) L and (b) a single evk size vs. dnum for four different N (polynomial degree) values and a fixed 128b security target. Normalized-dnum of 0 means dnum = 1 and normalized-dnum of 1 means dnum = max (i.e., k = 1). Interpolated results are used for points with non-integer dnum values. The dotted line in (a) represents a minimum required level of 11 for bootstrapping.
Tboot + L− (l) Tmult,a/slot = l=1
where Tboot is the bootstrapping time, Tmult(l) is the time to perform HMult at a level l. This metric first calculates the average cost of mult including the overhead of bootstrapping, then divides it with the number of slots in a ct (N/2). Thus, Tmult,a/slot effectively captures the reciprocal throughput of a CKKS instance (CKKS scheme with a certain parameter set).
III. TECHNOLOGY-DRIVEN PARAMETER SELECTION OF BOOTSTRAPPABLE ACCELERATORS
A. Technology trends regarding memory hierarchy
Domain-specific architectures (e.g., deep-learning accelera- tors [41], [49], [55]) are often based on custom logic and op- timized dataflow to provide high computation. In addition, the memory capacity/bandwidth requirements of the applications are exploited in the design of the memory hierarchy. Recently, on-chip SRAM capacity has scaled significantly [5] such that 100’s of MBs of on-chip SRAM is feasible, providing 10’s of TB/s of SRAM bandwidth [41], [49], [62]. While the band- width of the main-memory has also increased, its aggregate throughput is still more than an order of magnitude lower than on-chip SRAM bandwidth [59], achieving a few TB/s of throughput even with high-bandwidth memory (HBM).
Similar to other domain-specific architectures [17], [41], HE applications also follow deterministic computational graphs [66] and the locality of input and output cts of HE ops can be maximized through software scheduling [30]. Thus, cts can be reused by exploiting a large amount of on-chip SRAM enabled by technology scaling. However, even with the increasing on-chip SRAM capacity, we observe that the size of on-chip SRAM is still insufficient to store evks, rendering the off-chip memory bandwidth to become a crucial bottleneck for modern CKKS that supports bootstrapp
程序代写 CS代考 加微信: powcoder QQ: 1823890830 Email: powcoder@163.com