MgX: Near-Zero Overhead Memory Protection with an Application to Secure DNN Acceleration
School of Electrical and Computer Engineering, Cornell University, Ithaca, NY
In this paper, we propose MgX, a near-zero overhead memory protection scheme for hardware accelerators. MgX minimizes the performance overhead of off-chip memory encryption and integrity verification by exploiting the application-specific aspects of accelerators. Accelerators tend to explicitly manage data movement between on-chip and off-chip memory, typically at an object granularity that is much larger than cache lines. Exploiting these accelerator-specific characteristics, MgX generates the version numbers used in memory encryption and integrity verification using only on-chip state, without storing them in memory, and also customizes the granularity of the memory protection to match the granularity used by the accelerator. To demonstrate the applicability of MgX, we present an in-depth study of MgX for deep neural networks (DNNs) and also describe implementations for H.264 video decoding and genome alignment. Experimental results show that applying MgX incurs less than 1% performance overhead for both DNN inference and training on state-of-the-art DNN architectures.
1. INTRODUCTION
As technology scaling slows down, computing systems are increasingly relying on hardware accelerators to improve performance and energy efficiency. For example, modern ML models such as deep neural networks (DNNs) are often quite compute-intensive and increasingly run on hardware accelerators [8, 29] for both performance and energy efficiency. Similarly, hardware accelerators are widely used for other compute-intensive workloads such as video decoding, signal processing, cryptographic operations, and genome assembly. This paper proposes a novel off-chip memory protection scheme for hardware accelerators, named MgX (Memory guard for Xelerators), using secure DNN acceleration as the primary example application.
In many applications, hardware accelerators may process private or sensitive data, which need strong security protection. For example, ML algorithms often require collecting, storing, and processing a large amount of personal and potentially private data from users to train a model. Moreover, due to their high computational demand, both training and inference are often performed on a remote server rather than on a client device such as a smartphone, implying that the private data and ML models need to be stored on a remote server. Unfortunately, in traditional computing systems, private user data may be easily exposed or misused by the remote server if it is either compromised or malicious.
Figure 1: Secure ML acceleration — A secure accelerator keeps all sensitive information, including inputs, outputs, training data, and ML model parameters (weights), encrypted.

A promising approach to provide strong confidentiality and integrity guarantees even under untrusted software and potential physical tampering is to rely on trusted hardware to create a hardware-protected execution environment. This approach has primarily been studied in the context of general-purpose processors in the past. This paper considers extending this approach to accelerators. Figure 1 illustrates the approach in the context of a DNN accelerator. In order to protect sensitive data, the secure DNN accelerator keeps all confidential information, including inputs, outputs, training data, and network parameters (weights), in an encrypted form outside of a trusted hardware boundary such as a custom ASIC, an FPGA accelerator, or an accelerator IP in an SoC. Each secure accelerator contains a unique private key that can only be used by the accelerator hardware itself. Users can authenticate the accelerator remotely using the corresponding public key and a certificate from the accelerator vendor, and also send their private data and model parameters encrypted, which can only be decrypted and processed by the trusted accelerator. The secure accelerator also ensures that the ML computation cannot be tampered with by protecting the integrity of off-chip data. In this way, the secure DNN accelerator can ensure that private user data and weights cannot be accessed by an adversary even if the adversary controls the entire software stack on the system that contains the accelerator or can even physically access the off-chip DRAM.
The cryptographic protection of off-chip memory, namely memory encryption and integrity verification, is an essential technology for enabling a hardware-protected secure execution environment. The off-chip memory protection also represents the main source of performance overhead in traditional secure processor designs [14, 38, 50, 53]. For a general-purpose processor, the memory protection schemes need to be able to handle any sequence of memory accesses to arbitrary memory locations, and typically protect memory accesses at a cache-block granularity. Each cache block is encrypted before being written back to memory and decrypted on a read. To hide the decryption latency, counter-mode encryption is often used, where a counter value (CTR) is encrypted with a block cipher to generate an encryption pad that is XORed with the data for encryption. In secure processors,
the counter value is typically a concatenation of the memory address and a version number (VN) that increments on each write. The version number for each encrypted block is stored in memory. To protect the integrity of off-chip memory, either a message authentication code (MAC) or a cryptographic hash needs to be attached to each cache block in memory. Moreover, in order to ensure freshness and prevent replay attacks, the integrity verification requires a tree of MACs. Unfortunately, the additional VN and MAC accesses can lead to non-trivial bandwidth and performance overhead for memory-intensive workloads.
In this paper, we show that memory encryption and integrity verification can be performed with almost no performance overhead for an application-specific accelerator by customizing protection to the accelerator-specific memory access pattern. We make the key observations that application-specific accelerators typically move data between on-chip and off-chip memory at a larger granularity than a cache block, and that the off-chip accesses are explicitly performed by the accelerator following a relatively simple control flow. The coarse-granularity data movement implies that the version numbers for memory encryption and the MACs for integrity verification can be maintained at a coarse granularity to reduce the overhead. Moreover, the relatively simple memory access patterns and the smaller number of version numbers suggest that version numbers can often be either stored on-chip or generated from on-chip state without storing them in off-chip memory.
We study the memory access behaviors of DNNs such as convolutional neural networks (CNNs) and recurrent neural networks (RNNs) for both inference and training, and show how the version numbers can be determined even when dynamic pruning is used. By generating version numbers on-chip and performing protection at an application-specific granularity, MgX can eliminate most of the overhead of off-chip memory protection; no version number is stored in the off-chip memory, no integrity tree is needed, and each MAC/hash protects a large amount of data instead of one cache block. We also study the applicability of MgX for H.264 video decoding and genome assembly acceleration using open-source RTL implementations, and find that version numbers can also be calculated from on-chip state.
We evaluate the overhead of MgX in the context of secure DNN accelerators using ChaiDNN [58], an open-source DNN accelerator from Xilinx, as the baseline. The experimental results show that MgX can provide memory encryption and integrity verification with almost no overhead in either performance or off-chip memory traffic. In contrast, applying the existing general-purpose protection schemes leads to 20-30% overhead, and even higher overhead with lower memory bandwidth. MgX also reduces the on-chip area overhead of the traditional memory protection schemes as it does not require any caches for version numbers (VNs) and MACs.
This paper makes the following major contributions:
• We propose MgX, a near-zero overhead memory protection scheme for accelerators. MgX minimizes the performance overhead of memory protection by assigning counter values for data and performing coarse-grained memory protection.
• We demonstrate the applicability of MgX by showing a concrete implementation of MgX for DNN, and detailed analyses of an H.264 video decoder and a genome assembly accelerator.
• We evaluate the secure DNN accelerator with MgX and show that the overhead is less than 1% for both DNN inference and training on state-of-the-art models.

Figure 2: Memory encryption and integrity verification. (a) The traditional memory encryption and integrity verification scheme — the plaintext (U) is encrypted with the CTR, which consists of the address (PA) and a VN. The MACs and VNs associated with the encrypted data (V) are stored in DRAM. A Merkle tree is built over the VNs to guarantee freshness. (b) MgX — the VN is generated on-chip by the getVN function, eliminating the off-chip storage for VNs and the Merkle tree. The MAC is calculated over each object to reduce the overhead of integrity verification.
2. MGX: LOW-OVERHEAD MEMORY PROTECTION FOR ACCELERATORS
This section provides background on the state-of-the-art in memory protection, and presents the proposed memory protection scheme for accelerators and its security analysis.
2.1 Memory Protection Basics
Memory protection schemes typically use a symmetric-key block cipher such as the Advanced Encryption Standard (AES) [40] to encrypt off-chip memory for confidentiality, and MACs (or hashes) for integrity.
2.1.1 Memory Encryption
For memory encryption, as depicted in Figure 2(a), existing techniques [18, 23, 50] typically use the counter mode so that the AES operation can be overlapped with memory accesses. Counter-mode encryption requires a non-repeating value to be used for each encryption under the same AES key. In this paper, we call this value the counter. In a secure processor, the counter value often consists of the physical address (PA) of a data block (e.g., a cache block) that will be encrypted and a (per-block) version number that is incremented on each memory write (for the block). When a data block is written, the memory protection unit increments the version number and then encrypts the data. When a data block is read, the memory protection unit retrieves the version number used to encrypt the data block and then decrypts the block. Let k_ENC, U, and V be the AES encryption key, plaintext, and ciphertext, respectively.
The AES encryption can be formulated as follows, where || denotes bit-wise concatenation.
V = U ⊕ AES_{k_ENC}(PA || VN)    (1)
Because a general-purpose processor can have an arbitrary memory access pattern that depends on the program being executed, the version number for each data block, which represents the number of writes to that block, can be any value at a given time. As a result, a general-purpose secure processor typically needs to store the version numbers in memory along with the encrypted data in order to determine the correct version number for a later read. Moreover, to avoid re-using the same counter value, the AES key needs to change once the version number reaches its maximum, which implies that the size of the version number needs to be large enough to avoid frequent re-encryption. For example, the memory encryption engine in Intel SGX [18] uses a 56-bit version number for each 64-byte data block, which introduces 11% storage and bandwidth overhead. In general, the version numbers cannot fit on-chip and are stored in DRAM.
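To make this construction concrete, below is a minimal software sketch of Equation (1) in Python, using the AES primitive from PyCryptodome. The 64-byte block size and the packing of PA, VN, and a sub-block index into the 128-bit counter are our own illustrative assumptions, not the layout of any particular engine.

```python
# Sketch of counter-mode memory encryption (Equation 1); assumptions noted above.
from Crypto.Cipher import AES  # PyCryptodome

BLOCK_BYTES = 64  # one cache-block-sized unit of data (assumed)

def encryption_pad(k_enc: bytes, pa: int, vn: int) -> bytes:
    """Encrypt CTR = (PA, VN) under AES to produce a 64-byte one-time pad."""
    ecb = AES.new(k_enc, AES.MODE_ECB)
    pad = b""
    for i in range(BLOCK_BYTES // 16):
        # 56-bit PA || 56-bit VN || 16-bit sub-block index = 128-bit counter.
        ctr = pa.to_bytes(7, "big") + vn.to_bytes(7, "big") + i.to_bytes(2, "big")
        pad += ecb.encrypt(ctr)
    return pad

def encrypt_block(k_enc: bytes, pa: int, vn: int, data: bytes) -> bytes:
    """V = U xor AES_kENC(PA || VN); XORing again with the same pad decrypts."""
    pad = encryption_pad(k_enc, pa, vn)
    return bytes(d ^ p for d, p in zip(data, pad))
```

Because the pad depends only on (PA, VN) and not on the data, it can be computed while the memory access is still in flight, which is how counter mode hides the AES latency on the read path.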
2.1.2 Integrity Verification
To prevent off-chip data from being altered by an attacker, integrity verification cryptographically checks whether the value read from off-chip memory is the most recent value written to that address by the processor. For this purpose, a MAC of the data value, the memory address, and the version number can be computed and stored for each data block on a write, and checked on a read from DRAM. However, checking the MAC alone cannot guarantee the freshness of the data; a replay attack can replace the data and the corresponding VN and MAC in DRAM with stale values without being detected. In order to defeat the replay attack, a Merkle tree (i.e., hash tree) [16] needs to be used to verify the MACs hierarchically in such a way that the root of the tree is stored on-chip. As shown in Figure 2(a), a state-of-the-art method [43] uses a Merkle tree to protect the integrity of the version numbers in memory, and includes a VN in each MAC to ensure the freshness of data. Previous works propose HMAC-SHA-1 [43], Carter-Wegman MAC [18], and AES-GCM [11] as the hash function. Let us denote the key for the hash function, plaintext, and ciphertext as k_IV, U, and V, respectively. The MAC of a data block can be calculated as:
MAC = H_{k_IV}(V, PA || VN)    (2)
The overhead of integrity verification is nontrivial as it requires traversing the tree stored in the off-chip memory. To mitigate this overhead, integrity verification engines typically use a cache to store recently verified MACs.
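As a concrete illustration of Equation (2), the sketch below computes and checks a per-block MAC using HMAC-SHA-256 from the Python standard library; the cited designs use HMAC-SHA-1, Carter-Wegman MACs, or AES-GCM instead, and the byte-level encoding of the inputs is our own assumption.

```python
import hmac
import hashlib

def block_mac(k_iv: bytes, ciphertext: bytes, pa: int, vn: int) -> bytes:
    """MAC = H_kIV(V, PA || VN): binds the ciphertext to its address and VN."""
    msg = ciphertext + pa.to_bytes(7, "big") + vn.to_bytes(7, "big")
    return hmac.new(k_iv, msg, hashlib.sha256).digest()

def verify_block(k_iv: bytes, ciphertext: bytes, pa: int, vn: int,
                 stored_mac: bytes) -> bool:
    """Recompute the MAC on a read and compare against the stored one."""
    return hmac.compare_digest(block_mac(k_iv, ciphertext, pa, vn), stored_mac)
```

Note that this check is only as fresh as the VN it uses: if the VN itself is fetched from DRAM, an attacker can replay a stale (V, VN, MAC) triple, which is exactly why the VNs must be protected by the on-chip-rooted Merkle tree.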
2.2 Intuition
The main overhead of traditional memory encryption and integrity verification comes from storing and accessing the VNs and MACs in off-chip memory. Hardware accelerators, especially memory-intensive ones such as video encoding/decoding, neural network, and DNA sequencing accelerators, often require accessing a large amount of data in memory. Naïvely applying a traditional general-purpose memory protection scheme to those accelerators can lead to non-trivial performance overhead.
Figure 3: The secure accelerator architecture with MgX.
For a specialized accelerator, the memory access pattern is also customized for a particular application. Each accelerator has a list of application-specific data structures, such as arrays, that it keeps in memory. For performance, instead of relying on caches, accelerators often explicitly move data between on-chip memory and DRAM at an object granularity. In most cases, the size of an object is much larger than a cache line, and the number of objects is relatively small compared to the number of cache-line-sized blocks in memory. The overhead of memory protection can be reduced significantly if a version number is allocated per object instead of per cache block.
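As a rough illustration with our own numbers (not taken from the paper): a 4 MB feature map consists of 65,536 cache-block-sized units of 64 bytes, so per-block protection with 56-bit (7-byte) VNs requires about 448 KB of VN storage, plus the Merkle-tree levels above it; a per-object VN for the same feature map is a single 7-byte value that easily fits on-chip.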
In addition to the coarser memory access granularity, accelerators also tend to have simpler memory access patterns than typical programs on CPUs. Control-intensive applications are often not a great fit for hardware acceleration, and the on-chip control unit of an accelerator needs to manage data movements between on-chip and off-chip memory. In that sense, an accelerator's memory access pattern can often be encoded in a small amount of memory, and the on-chip state of the accelerator contains most of the information needed to determine off-chip access patterns. This implies that an accelerator itself can often determine version numbers without accessing off-chip memory.
We propose to leverage these observations to optimize off-chip memory protection by increasing the granularity of protection to match the data movement granularity and generating version numbers from on-chip state instead of storing them in memory. In other words, an accelerator or its designer needs to choose the protection granularity and provide version numbers to a memory protection unit. We call this memory protection scheme MgX. If version numbers can be efficiently determined using on-chip state at run-time, they no longer need to be stored in DRAM, which also makes the Merkle tree unnecessary. The performance overhead of memory encryption and integrity verification in MgX is largely removed, as no off-chip memory accesses are needed for the VNs or for the MACs that would protect the VNs. The only extra memory accesses come from reading and writing the MACs for verifying the integrity of data blocks. We can further lower the MAC overhead by applying MACs at an object granularity, where a MAC is calculated for each memory object that the accelerator reads/writes at a time. In this way, memory encryption and integrity verification can be performed with almost zero overhead.
2.3 MgX Scheme
MgX provides application-specific memory protection by matching the access granularity of an accelerator and generating VNs using on-chip state. In MgX, the accelerator itself is modified to choose the protection granularity and generate a VN when it issues a memory request. Instead of storing the VNs in the off-chip memory, the version number generator, depicted in Figure 3, holds the MgX state in an on-chip memory and produces the VNs of objects based on the MgX state, the on-chip accelerator state, and the identifier of each object. The size of the on-chip state depends on the memory access pattern of the accelerator and the number of objects in the application. The VN generator consists of two main functions — the version generation function (getVN) and the state update function (updateS). For memory read and write operations, getVN calculates the version number of an object based on the object identifier (ID_obj), the on-chip accelerator state (S_Xcel), and the on-chip MgX state (S_MgX).
VN_{ID_obj} = getVN(S_MgX, S_Xcel, ID_obj)    (3)
The state update function is called when the on-chip state needs to be updated. The on-chip state is updated based on the current MgX and accelerator states.
updateS(S_MgX, S_Xcel)    (4)
As shown in Figure 2(b), once the VN for reading or writing an object is generated, the Enc and IV engines can encrypt, decrypt, and verify that object using the same Equations (1) and (2). The Enc and IV engines in MgX use a standard AES counter mode and keyed hash. As the VNs are generated on-chip and do not need to be verified, the MgX scheme does not need a Merkle tree in the off-chip memory. For security and correctness, the version number generation must satisfy the following requirements.
• security: The generated version number must be differ- ent for each write to a particular memory address.
• correctness: The generated version number for a read must match the value used for the most recent write to the same address, a requirement for correct decryption.
Note that sharing a version number among multiple memory locations does not sacrifice security, as the counter input to the block cipher in counter mode already includes the memory address in addition to the version number. Also, note that generating version numbers in MgX does not require static memory access patterns. Reads do not affect the version number no matter how irregular they are. Writes can also happen in an arbitrary order under one version number as long as they occur at most once per address. Skipping writes and using only a portion of an object can also be handled with one version number per object as long as the skipped locations do not need to be read later. Finally, the version numbers can simply be stored on-chip as long as they fit.
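To illustrate how these requirements can be met in practice, the following sketch shows a hypothetical VN generator for an accelerator that rewrites whole objects (e.g., one activation buffer per layer). The class name, interface, and the choice of a per-object write counter are our own illustrative assumptions, not MgX's actual hardware design.

```python
class VNGenerator:
    """Hypothetical on-chip VN generator (our sketch, not MgX's RTL).
    S_MgX here is a small table with one write counter per object ID."""

    def __init__(self, num_objects: int):
        self.write_count = [0] * num_objects  # S_MgX: small enough to stay on-chip

    def begin_object_write(self, obj_id: int) -> None:
        """updateS (Eq. 4): bump the counter once per rewrite of the object,
        so every write to a given address gets a fresh VN (security)."""
        self.write_count[obj_id] += 1

    def get_vn(self, obj_id: int) -> int:
        """getVN (Eq. 3): all blocks of the object share this VN, and a read
        returns the VN used by the most recent write (correctness)."""
        return self.write_count[obj_id]
```

Under this scheme, a weight buffer written once keeps VN = 1 across arbitrarily many reads, while an activation buffer rewritten every layer or training iteration gets a fresh VN each time; for write-once data the counter degenerates to a constant and need not be stored at all.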
2.4 Security Analysis
Encryption – MgX uses the same AES counter-mode encryption that is used by traditional memory encryption schemes.