Yunkun Liao, Jingya Wu, Wenyan Lu, Xiaowei Li, Guihai Yan

SKLP, Institute of Computing Technology, Chinese Academy of Sciences, Beijing, China

2024/12/3

#### Contents

1. **Background and Motivation**
2. **Our Design: DPU-driven RCSCA Detection**
3. **Evaluation**
4. **Conclusion**

# Background and Motivation

## Datacenter Networking Demands

Distributed applications: TensorFlow, NCCL, Spark, Memcached, distributed transactions, disaggregated memory, Redis, RPC, microservices.

- Distributed applications need low-latency, high-throughput network communication.
- The mismatch between increasing network speed and stalling CPU speedups keeps growing.<sup>1</sup>
- Hardware-based kernel-bypass networking is needed for $\mu$s-scale latency.

## RDMA Fits the Datacenter Networking Demands

RDMA: Remote Direct Memory Access

- Bypass kernel
- Hardware offload
- Zero copy

Result: lower latency, higher throughput, and reduced CPU consumption.

#### RDMA Becomes Essential in the Cloud

Figure 1: Traffic statistics of all Azure public regions between January 18 and February 16, 2023. Traffic was measured by collecting switch counters of server-facing ports on all Top of Rack (ToR) switches. Around 70% of traffic was RDMA.

Cloud service providers deploy RDMA heavily in their datacenters. However, RDMA was originally designed for HPC.

#### RDMA Has Security Issues

Excerpt from an article by Torsten Hoefler (ETH Zürich); Duncan Roweth, Keith Underwood, and Robert Alverson (Hewlett Packard Enterprise); Mark Griswold, Vahid Tabatabaee, Mohan Kalkunte, and Surendra Anubolu (Broadcom); Siyuan Shen (ETH Zürich); Moray McLaren (Google); Abdul Kabbani and Steve Scott (Microsoft):

**Issue 7: Security**

"RoCE is known to have several security issues,<sup>16,17</sup> especially in multitenant contexts. Many of those issues stem from the fact that protocol security, authentication, and encryption have played a minor role at the design time. Yet, today, such properties are much more important."

RDMA exposes a **new attack surface** to malicious tenants.
#### An Attack Surface: PTE Caches in RDMA NIC

- The RDMA NIC (RNIC) maintains page table entries (PTEs) for zero-copy data transfer.
- Frequently accessed PTEs are cached on the chip.
- PTE cache hit: one on-chip memory read latency.
- PTE cache miss: at least one PCIe round-trip latency.

This latency gap can be exploited for a timing-based cache side-channel attack.

#### RNIC Cache Side-channel Attack Threat Model

- The server hosts data in memory exported via RDMA.
- The attacker and the victim can be on different machines.
- The attacker wants to learn the access pattern of the victim.
- The attacker cannot observe the network traffic.

[Figure: the victim and the attacker, both key-value clients, each issue `GET Key-50` to the key-value store server.]

Attacker: Did the victim access Key-50?

#### RCSCA Procedure

1. Reverse-engineer the RNIC PTE cache structure.
2. Run an Evict+Reload side-channel attack.

[Figure: PTE cache structure of the Mellanox CX-4 RNIC.]

If read latency < Threshold, the victim accessed *B*.

## Existing Switch-centric RCSCA Detection

- The end-host CPU cannot detect RCSCA because RDMA bypasses the CPU.
- The end-host RNIC provides no programmable compute to detect RCSCA.
- The network-core programmable switch relies on a backend server to detect RCSCA<sup>[1]</sup>.

Is the switch-centric design good enough?

## Profiling Switch-centric RCSCA Detection

- Switch-centric RCSCA detection (Bedrock<sup>[1]</sup>) is slower than the attacker (Pythia<sup>[2]</sup>).
  - Sensitive information may be leaked before the attack is detected.
- Bedrock becomes much slower when there are many attacker-victim pairs.
- SCADET execution time accounts for a large share of the detection latency.

How can we minimize the detection latency, and with it the information leakage?

- [1] Xing, Jiarong, et al. "Bedrock: Programmable Network Support for Secure RDMA Systems." 31st USENIX Security Symposium (USENIX Security 22). 2022.
- [2] Tsai, Shin-Yeh, Mathias Payer, and Yiying Zhang. "Pythia: Remote Oracles for the Masses." 28th USENIX Security Symposium (USENIX Security 19). 2019.
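As an illustration of the Evict+Reload step and the threshold rule in the RCSCA procedure above, the sketch below models one set of a PTE cache. It is a toy model, not the RNIC's actual behavior: the LRU replacement policy, the latency constants `HIT_NS`, `MISS_NS`, `THRESHOLD_NS`, and the names `PTECacheSet` and `evict_reload` are all assumptions made for illustration.

```python
from collections import OrderedDict

# Hypothetical latencies: a hit costs one on-chip read,
# a miss costs at least one PCIe round trip.
HIT_NS = 100
MISS_NS = 1000
THRESHOLD_NS = 500  # assumed decision threshold between hit and miss


class PTECacheSet:
    """One set of a set-associative PTE cache (LRU replacement is assumed)."""

    def __init__(self, ways):
        self.ways = ways
        self.lines = OrderedDict()  # page -> True, ordered oldest to newest

    def access(self, page):
        """Access a page and return the modeled read latency in ns."""
        if page in self.lines:
            self.lines.move_to_end(page)  # refresh LRU position
            return HIT_NS
        if len(self.lines) >= self.ways:
            self.lines.popitem(last=False)  # evict the least recently used PTE
        self.lines[page] = True
        return MISS_NS


def evict_reload(cache_set, target, eviction_pages, victim_accesses_target):
    """Return True if the attacker infers that the victim accessed `target`."""
    # Evict: the attacker touches enough pages to push `target` out of the set.
    for page in eviction_pages:
        cache_set.access(page)
    # Victim phase: the victim may or may not access the target page.
    if victim_accesses_target:
        cache_set.access(target)
    # Reload: a fast read means the PTE was cached again, i.e. the victim used it.
    return cache_set.access(target) < THRESHOLD_NS
```

With a 4-way set and four eviction pages, `evict_reload(...)` returns `True` when the victim touches the target and `False` otherwise. A detector has to flag this access pattern before the attacker completes enough rounds to learn anything useful.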
# Our Design: DPU-driven RCSCA Detection

#### Insight: Host-centric RCSCA Detection Is Better

However, the traditional RNIC cannot be architected to detect RCSCA.

# Key Opportunity: Data Processing Unit

[Figures: NVIDIA DPU architecture<sup>[3]</sup>; YUSUR K2Pro DPU architecture<sup>2</sup>.]

- Data Processing Unit (DPU): combines the functionality of a traditional RNIC with programmable compute and storage.
- We choose an FPGA-based DPU to demonstrate the DPU-driven RCSCA detector.

#### Design of DPU-driven RCSCA Detector

The attacker blocker is located in the programmable packet processor.

- Commodity programmable packet processors have a flow table.
- Commodity programmable packet processors support "Match+Action".

The trace analyzer runs the SCADET detector.

- We position the trace analyzer off the critical data path of the RDMA core to minimize RDMA performance overhead.
- The SCADET detector is highly optimized to keep up with the incoming request rate.

# SCADET Algorithm

Accelerating the SCADET algorithm<sup>[4]</sup> is required.

[4] Sabbagh, Majid, et al. "SCADET: A Side-Channel Attack Detection Tool for Tracking Prime-Probe." 2018 IEEE/ACM International Conference on Computer-Aided Design (ICCAD). IEEE, 2018.

# SCADET Implementation: Storage Optimizations

#### Requirements:

- Hardware-friendly data structure: fixed-length lists in a contiguous memory region.
- Large capacity for thousands of sets: DRAM.
- Fast access: BRAM-based flow cache.

# SCADET Implementation: Computation Optimizations

```
If (Time_Diff < Intra_Group_Threshold) {
    Search_Tag_List:
        For (...) { For (...) }
    Search_Counter_List:
        For (...) { For (...) }
    If (Insert_New_Tag && Tag_List is Full) {
        Search_QP_with_Minimum_Counter {
            For (...) { For (...) }
        }
    }
}
```

Critical path of the SCADET logic.

We use the FPGA's logic flexibility to parallelize the loops in the critical path.
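To make the storage and computation points above concrete, the sketch below shows a fixed-length tag list with minimum-counter replacement, where the minimum search is written as a pairwise tree reduction (the shape a parallel hardware comparator tree would take). This is a simplified illustration, not the authors' HLS code or the full SCADET algorithm: it merges the tag and counter lists into one entry list, ignores the per-QP dimension and the time-difference check, and the names `Entry`, `FixedTagList`, and the default capacity are hypothetical.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Entry:
    tag: int      # tag observed in this cache set (hypothetical field)
    counter: int  # number of accesses attributed to this tag


class FixedTagList:
    """Fixed-length list for one cache set; the capacity is an illustrative choice."""

    def __init__(self, capacity: int = 8):
        self.capacity = capacity
        self.entries: List[Optional[Entry]] = [None] * capacity

    def find(self, tag: int) -> int:
        """Return the slot holding `tag`, or -1. Hardware could compare all slots at once."""
        for i, e in enumerate(self.entries):
            if e is not None and e.tag == tag:
                return i
        return -1

    def min_counter_index(self) -> int:
        """Find the slot with the smallest counter via a pairwise tree reduction.

        Each pass halves the candidate list, so N slots need about log2(N) levels
        instead of N sequential comparisons. Ties go to the lower slot index.
        Only called when the list is full.
        """
        level = [(e.counter, i) for i, e in enumerate(self.entries) if e is not None]
        while len(level) > 1:
            level = [min(level[j:j + 2]) for j in range(0, len(level), 2)]
        return level[0][1]

    def record(self, tag: int) -> int:
        """Record one access to `tag` and return the slot that now holds it."""
        idx = self.find(tag)
        if idx >= 0:
            self.entries[idx].counter += 1
            return idx
        for i, e in enumerate(self.entries):
            if e is None:  # free slot available
                self.entries[i] = Entry(tag, 1)
                return i
        victim = self.min_counter_index()  # list is full: replace the coldest entry
        self.entries[victim] = Entry(tag, 1)
        return victim
```

In software each reduction level still runs sequentially; the point of the sketch is only that the levels are independent comparisons, which is what lets an FPGA evaluate them in parallel.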
# Evaluation

#### **Experimental Setup**

- RNIC model: We assume the RDMA PTE cache of the NVIDIA Mellanox ConnectX-4131A has 1024 sets and 128 ways per set, according to Pythia's reverse-engineering results.
- Switch-centric baseline: We set up a real system to emulate and evaluate Bedrock.
- Our DPU-driven RCSCA detector: We develop the trace analyzer in Xilinx HLS and measure its cycle-accurate performance with Vitis 2022.1.
- Traces:
  - Trace-1: Evict+Reload memory traces.
  - Trace-2: prefills the tag list before injecting the Evict+Reload memory traces (Trace-1); this is the worst-case workload for the DPU-driven detector.
  - Trace-3: memory access trace following a uniform distribution, emulating a regular user.

#### Low RCSCA Detection Latency

- The DPU-driven design reduces the detection latency by more than 84.1%.
- The DPU-driven design leaks zero sensitive information.
- The three proposed architectural optimizations for FPGA-based SCADET are essential.

# Zero Performance Overhead and Small FPGA Resource Consumption

- Performance overhead: 0. The detector is off the RDMA datapath.
- FPGA resource overhead: small, affordable for a datacenter-grade FPGA.

| Resource type | Utilization (% of VU9P) |
|---------------|-------------------------|
| Flip-flop     | 1%                      |
| Lookup table  | 6%                      |
| BRAM          | ~0                      |
| DSP           | ~0                      |

# Conclusion

#### Conclusion and Outlook

1. RDMA security is a critical concern, especially given the widespread deployment of RDMA in the public cloud.
2. We have identified the weaknesses inherent in current switch-centric designs for RDMA security.
3. We advocate a host-centric approach, driven by the capabilities of emerging Data Processing Units (DPUs).
4. Our DPU-driven detector for RNIC cache side-channel attacks demonstrates a substantial reduction in detection latency, which underscores the potential of the host-centric approach.

Takeaway: Smart DPU-driven Edge, Dumb Core.