Utilizing the IOMMU Scalably

Omer Peleg
Utilizing the IOMMU Scalably

Research Thesis

Submitted in partial fulfillment of the requirements
for the degree of Master of Science in Computer Science

Omer Peleg

Submitted to the Senate
of the Technion — Israel Institute of Technology
Av 5775 Haifa August 2015
This research was carried out under the supervision of Prof. Dan Tsafrir, in the Faculty of Computer Science.

Some results in this thesis have been published as a paper by the author and research collaborators in a conference during the course of the author’s research period:


**Acknowledgements**

I would like to thank my parents, Nitzan and Ofer, as well as my brother Amit for being there for me all along. I would like to thank the amazing people I met in the Technion, for helping out when I needed it the most and for making the time I spent in this amazing place ever more enjoyable. I also want to thank Ben Serebrin, for giving me this challenge.

Last but not least, I would like to thank Adam Morrison. Without your help, I doubt any of this would have seen the light of day.

The generous financial help of the Israeli Ministry of Science & Technology is gratefully acknowledged.
# Contents

List of Figures and Tables

Abstract 1

Abbreviations and Notations 4

1 Introduction 5

2 Background and State of the Art 9
   2.1 Background: IOMMUs 9
      2.1.1 IOMMU protection 10
   2.2 Performance Analysis of Present Designs 10
      2.2.1 Linux IOVA Allocation 11
      2.2.2 Linux IOTLB Invalidation 13
      2.2.3 Linux Page Table Management 14
      2.2.4 Linux Kernel Synchronization 15
      2.2.5 IOMMU Management in Other OSes 15

3 Our Designs 19
   3.1 Scalable IOVA Allocation 19
      3.1.1 Dynamic Identity Mapping 19
      3.1.2 IOVA-kmalloc 21
      3.1.3 Scalable IOVA Allocation with Magazines 23
   3.2 Scalable IOTLB Invalidation 24

4 Evaluation 25
   4.1 Performance 25
   4.2 Implementation Complexity 33

5 Related Work 35

6 Conclusion 37

Hebrew Abstract i
List of Figures and Tables

Figure 1.1 Parallel netperf throughput (Linux): Performance meltdown due to dynamic IOMMU mapping updates. .......................... 6

Figure 2.1 Throughput and cycle breakdown of time spent in IOMMU management on a 16-core highly parallel netperf RR workload. ............. 11
Figure 2.2 EiovaR highly parallel RR throughput with and without scalable invalidation .......................................................... 12

Figure 4.1 Highly parallel RR throughput ........................................ 27
Figure 4.2 1 netperf TCP RR instance per core latency test ............... 27
Figure 4.3 memcached throughput .................................................. 28
Figure 4.4 Apache throughput, 1KB file ......................................... 29
Figure 4.5 netperf stream throughput, 16KB messages .................... 30
Figure 4.6 netperf stream throughput, 64 byte messages ................. 30
Figure 4.7 netperf stream throughput and average payload sizes, 64 byte messages with TCP_NODELAY ........................................ 31
Figure 4.8 Page table memory over time ....................................... 32
Figure 4.9 Highly parallel netperf txns/sec with strict IOTLB invalidation ... 33

Table 2.1 Comparison of IOMMU management designs in current OSes . . . 16
Table 4.1 Implementation complexity of our designs .......................... 34
Abstract

I/O memory management units (IOMMUs) provided by modern hardware allow the operating system to enforce memory protection controls on the direct memory access (DMA) operations of its I/O devices. An IOMMU translation management design must scalably handle frequent concurrent updates of IOMMU translations made by multiple cores, which occur in high throughput I/O workloads such as multi-Gb/s networking. Today, however, operating systems experience performance meltdowns when using the IOMMU in such workloads.

This work explores scalable IOMMU management designs and addresses the two main bottlenecks we find in current operating systems: (1) assignment of I/O virtual addresses (IOVAs), and (2) management of the IOMMU’s translation lookaside buffer (TLB).

We propose three approaches for scalable IOVA assignment: (1) dynamic identity mappings, which eschew IOVA allocation altogether, (2) allocating IOVAs using the kernel’s kmalloc, and (3) per-core caching of IOVAs allocated by a globally-locked IOVA allocator. We further describe a scalable IOMMU TLB management scheme that is compatible with all these approaches.

Evaluation of our designs under Linux shows that (1) they achieve 88.5%–100% of the performance obtained without an IOMMU, (2) they achieve similar latency to that obtained without an IOMMU, (3) scalable IOVA allocation methods perform comparably to dynamic identity mappings because of their more efficient IOMMU page table management, and (4) kmalloc provides a simple solution with high performance, but can suffer from unbounded page table blowup if empty page tables are not freed.
Abbreviations and Notations

I/O : Input/Output
MMU : Memory Management Unit
IOMMU : I/O Memory Management Unit
TLB : Translation Lookaside Buffer
IOTLB : I/O Translation Lookaside Buffer
VA : Virtual Address
IOVA : I/O Virtual Address
OS : Operating System
DMA : Direct Memory Access
NIC : Network Interface Card
TX : Transmit
RX : Receive
PT : Page Table
PTE : Page Table Entry
PDE : Page Directory Entry
PFN : Page Frame Number
RR : Request-Response
IP : The Internet Protocol
TCP : The Transport Control Protocol
HTTP : The Hypertext Transfer Protocol
VM : Virtual Machine
API : Application Programming Interface
Trans : Transaction
Sec : Second
TPS : Transactions Per Second
Mem : Memory
KB : Kilo Byte
MB : Mega Byte
GB : Giga Byte
Gb : Giga bit
Gb/sec : Giga bit per-second
Mhz : Mega Hertz
Ghz : Giga Hertz
Chapter 1

Introduction

Modern hardware provides an I/O memory management unit (IOMMU) [AMD11, ARM13, IBM, Int14] that mediates direct memory accesses (DMAs) by I/O devices in the same way that a processor’s MMU mediates memory accesses by instructions. The IOMMU interprets the target address of a DMA as an I/O virtual address (IOVA) [Lin] and attempts to translate it to a physical address, blocking the DMA if no translation (installed by the OS) exists.

IOMMU thus enable the OS to restrict a device’s DMAs to specific physical memory locations, and thereby protect the system from errant devices [CGA09, KRS09], malicious devices [BDK05, CG14, Woj08], and buggy drivers [BBC+06, CYC+01, HBG+07, LUSG04, SBL05, WRW+08], which may misconfigure a device to overwrite system memory. This intra-OS protection [WRC08] is recommended by hardware vendors [Hil09, KRS09] and implemented in existing OSes [App13, Bot, IBM13, Mam07]. OSes can employ intra-OS protection both in non-virtual setups, having direct access to the physical IOMMU, and in virtual setups, by exposing the IOMMU to the VM through hardware-supported nested IOMMU translation [AMD11, Int14], by paravirtualization [BYXO+07, LUSG04, SLQP07, WRC08], or by full emulation of the IOMMU interface [ABYTS11].

Intra-OS protection requires each DMA operation to be translated with a transient IOMMU mapping [Bot] dedicated to the DMA, which is destroyed once it completes so that the device cannot access the memory further [KRS09, MHJ]. For example, a network interface card (NIC) driver maps the buffers it inserts into the NIC’s receive (RX) rings to receive packets. Once a packet arrives (via DMA), the driver unmaps the packet’s buffer.

These transient dynamic IOMMU mappings pose a performance challenge for driving high-throughput I/O workloads. Such workloads require dynamic mappings to be created and destroyed millions of times a second by multiple cores concurrently, since a single core often cannot sustain high enough throughput [Lei09]. This work specifically targets multi-Gb/s networking—NICs providing 10–40 Gb/s (and soon 100 Gb/s)—as a representative demanding case.
Current OSes melt down under load when the IOMMU is enabled in such a workload. Figure 1.1 demonstrates the problem on Linux. It shows the combined throughput of 270 netperf instances running a request-response workload (described in chapter 4) on a 16-core x86 server with a 10 Gb/s NIC, which we use as a running example throughout this work. Mediating DMAs by the IOMMU imposes only negligible overhead by itself, as evidenced by the throughput obtained when the IOMMU is configured with static identity mappings that pass DMA requests through the IOMMU unchanged. In contrast, when IOMMU management is enabled, throughput drops significantly and ceases to increase with the number of cores. This occurs with both Linux’s stock IOMMU subsystem or the recent EiovaR [MAT15] optimization, which targets the sequential performance of Linux’s IOMMU subsystem (see §2.2.1). Other OSes suffer from similar scalability problems (§2.2.5).

This work thus considers the IOMMU management problem of designing a subsystem supporting high-throughput concurrent updates of IOMMU mappings. We analyze the bottlenecks in current IOMMU management designs (§2.2) and explore the trade-offs in the design space of their solutions (chapter 3). Our designs address the two main bottlenecks we find in current OSes: IOVA assignment and IOTLB invalidation.

**IOVA assignment** Creating an IOMMU mapping for a physical memory location requires the OS to designate a range of IOVAs that will map to the physical location. The driver will later configure the device to DMA to/from the IOVAs. OSes presently use a dedicated, centralized (lock-protected) IOVA allocator that becomes a bottleneck when accessed concurrently.

We propose three designs for scalable IOVA assignment (§3.1), listed in decreasingly radical order: First, dynamic identity mapping eliminates IOVA allocation altogether by using a buffer’s physical address for its IOVA. Consequently, however, maintaining the
IOMMU page tables requires more synchronization than in the other designs. Second, *IOVA-kmalloc* eliminates only the specialized IOVA allocator by allocating IOVAs using the kernel’s optimized *kmalloc* subsystem. This simple and efficient design is based on the observation that we can treat the addresses that *kmalloc* returns as IOVA page numbers. Finally, *per-core IOVA caching* keeps the IOVA allocator, but prevents it from becoming a bottleneck by using *magazines* [BA01] to implement per-core caches of free IOVAs, thereby satisfying allocations without accessing the IOVA allocator.

**IOTLB invalidation**  Destroying an IOMMU mapping requires invalidating relevant entries in the IOTLB, a TLB that caches IOMMU mappings. The Linux IOMMU subsystem amortizes the invalidation cost by batching multiple invalidation requests and then performing a single global invalidation of the IOTLB instead. The batching data structure is lock-protected and quickly becomes a bottleneck. We design a compatible scalable batching data structure as a replacement (§ 3.2).

**Design space exploration**  We evaluate the performance, page table memory consumption and implementation complexity of our designs (chapter 4). We find that (1) our designs achieve 88.5%–100% of the throughput obtained without an IOMMU, (2) our designs achieve similar latency to that obtained without an IOMMU, (3) the savings dynamic identity mapping obtains from not allocating IOVAs are negated by its more expensive IOMMU page table management, making it perform comparably to scalable IOVA allocation, and (4) IOVA-kmalloc provides a simple solution with high performance, but it can suffer from unbounded page table blowup if empty page tables are not reclaimed (as in Linux).

**Contributions**  This work makes four contributions:

- Identifying IOVA allocation and IOTLB invalidation as the bottlenecks in the IOMMU management subsystems of current OSes.

- Three designs for scalable IOVA allocation: (1) dynamic identity mappings, (2) IOVA-kmalloc, and (3) per-core caching of IOVAs, as well as a scalable IOTLB invalidation scheme.

- Evaluation of the new and existing designs on several high throughput I/O workloads.

- Design space exploration: we compare the performance, page table memory consumption and implementation complexity of the proposed designs.
Chapter 2

Background and State of the Art

2.1 Background: IOMMUs

The IOMMU mediates accesses to main memory by I/O devices, much like the MMU mediates the memory accesses performed by instructions. IOMMUs impose a translation process on each device DMA. The IOMMU interprets the target address of the DMA as an I/O virtual address (IOVA) [Lin], and attempts to translate it to a physical address using per-device address translations (or mappings) previously installed by the OS. If a translation exists, the DMA is routed to the correct physical address; otherwise, it is blocked.

In the following, we provide a high-level description of Intel’s x86 IOMMU operation [Int14]. Other architectures are conceptually similar [AMD11, ARM13, IBM].

IOMMU translations  The OS maintains a page table hierarchy for each device, implemented as a 4-level radix tree (as with MMUs). Each radix tree node is a 4 KB page. An inner node (page directory) contains 512 pointers (page directory entries, or PDEs) to child radix-tree nodes. A leaf node (page table) contains 512 pointers (page table entries, or PTEs) to physical addresses. PTEs also encode the type of access rights provided through this translation, i.e., read, write or both.

The virtual I/O address space is 48-bit addressable. The 36 most significant bits of an IOVA are its page frame number (PFN), which the IOMMU uses (when it receives a DMA request) to walk the radix tree (9 bits per level) and look up the physical address and access rights associated with the IOVA. If no translation exists, the IOMMU blocks the DMA and interrupts the processor. Unlike the analogous case in virtual memory, this is not a page fault that lets the OS install a new mapping and transparently resume operation of the faulting access. Instead, the DMA is simply dropped. The I/O device observes this and may not be able to recover. ¹

¹I/O page fault standardization exists, but since it requires support from the device, it is not widely implemented or compatible with legacy devices [AMD11, PCI09, Roe].
IOTLB translation caching  The IOMMU maintains an IOTLB that caches IOVA translations. If the OS modifies a translation, it must *invalidate* (or *flush*) any TLB entries associated with the translation. The IOMMU supports individual invalidations as well as *global* ones, which flush all cached translations. The OS requests IOTLB invalidations using the IOMMU’s *invalidation queue*, a cyclic buffer in memory into which the OS adds invalidation requests and the IOMMU processes them asynchronously. The OS can request to be notified when an invalidation has been processed.

2.1.1 IOMMU protection

IOMMUs can be used to provide *inter- and intra-OS* protection [ABYTS11, WR12, WRC08, YBYW10]. IOMMUs are used for inter-OS protection in virtualized setups, when the host assigns a device for the exclusive use of some guest. The host creates a *static* IOMMU translation [WRC08] that maps guest physical pages to the host physical pages backing them, allowing the guest VM to directly program device DMAs. This mode of operation does not stress the IOMMU management code and is not the focus of this work.

We focus on intra-OS protection, in which the OS uses the IOMMU to restrict a device’s DMAs to specific physical memory locations. This protects the system from errant devices [CGA09, KRS09], malicious devices [BDK05, CG14, Woj08], and buggy drivers [BBC+06, CYC+01, HBG+07, LUSG04, SBL05, WRW+08].

Intra-OS protection is implemented via the DMA API [Bot, Lin, MHJ] that a device driver uses when programming the DMAs. To program a device DMA to a physical buffer, the driver must pass the buffer to the DMA API’s *map* operation. The map operation responds with a *DMA address*, and it is the DMA address that the driver must program the device to access.

Internally, the map operation (1) allocates an IOVA range the same size as the buffer, (2) maps the IOVA range to the buffer in the IOMMU, and (3) returns the IOVA to the driver. Once the DMA completes, the driver must *unmap* the DMA address, at which point the mapping is destroyed and the IOVA range deallocated.

High throughput I/O workloads can create and destroy such *dynamic* IOMMU mappings [Bot] millions of times a second, on multiple cores concurrently, and thereby put severe pressure on the IOMMU management subsystem implementing the DMA API.

2.2 Performance Analysis of Present Designs

Here we analyze the performance of current IOMMU management designs under I/O workloads with high throughput and concurrency. We use Linux/Intel-x86 as a representative study vehicle; other OSes have similar designs (§ 2.2.5). Our test workload is a highly parallel RR benchmark, in which a netperf [Jon95] server is handling 270
To analyze the overhead created by IOMMU management (shown in Figure 1.1), we break down the execution time of the parallel RR workload on 16 cores (maximum concurrency on our system) into the times spent on the subtasks required to create and destroy IOMMU mappings. Figure 2.1 shows this breakdown. For comparison, the last 3 bars show the breakdown of our scalable designs (§§ 3.1–3.2).

We note that this parallel workload provokes pathological behavior of the stock Linux IOVA allocator. This behavior, which does not exist in other OSes, causes IOVA allocation to hog $\approx 60\%$ of the execution time. The recent EiovaR optimization [MAT15] addresses this issue, and we therefore use Linux with EiovaR as our baseline. We discuss this further in § 2.2.1.

In the following, we analyze each IOMMU management subtask and its associated overhead in the Linux design: IOVA allocation (§ 2.2.1), IOTLB invalidations (§ 2.2.2), and IOMMU page table management (§ 2.2.3).

### 2.2.1 Linux IOVA Allocation

In Linux, each device is associated with an IOVA allocator that is protected by a coarse-grained lock. Each IOVA allocation and deallocation for the device acquires its allocator’s lock, which thus becomes a sequential bottleneck for frequent concurrent IOVA allocate/deallocate operations. Figure 2.1 shows that IOVA allocation lock acquisition time in the baseline accounts for 31.9% of the cycles. In fact, this is only because IOVA allocations are throttled by a different bottleneck, IOTLB invalidations (§ 2.2.2). Once
we address the invalidations bottleneck, the Iova allocation bottleneck becomes much more severe, accounting for nearly 70% of the cycles. Notably, even when eliminating the IOTLB bottleneck, EiovaR is not able to achieve a substantially higher throughput, as evident in Figure 2.2.

![Figure 2.2. EiovaR highly parallel RR throughput with and without scalable invalidation](image)

This kind of design—acquiring a global lock for each operation—would turn Iova allocation into a sequential bottleneck no matter which allocation algorithm is used once the lock is acquired. We nevertheless discuss the Linux Iova allocation algorithm itself, since it has implications for IOMMU page table management.

The Iova allocator packs allocated Iovas as tightly as possible towards the end of the virtual I/O address space. This minimizes the number of page tables required to map allocated Iovas—an important feature, because Linux rarely reclaims a physical page that gets used as an IOMMU page table (§2.2.3).

To achieve this, the Iova allocator uses a red-black tree that holds pairwise-disjoint ranges of allocated virtual I/O page numbers. This allows a new Iova range to be allocated by scanning the virtual I/O address space from highest range to lowest range (with a right-to-left traversal of the tree) until finding an unallocated gap that can hold the desired range. Linux attempts to minimize such costly linear traversals through a heuristic in which the scan starts from some previously cached tree node. This often finds a desired gap in constant time [MAT15].

Unfortunately, the Iova allocation patterns occurring with modern NICs can cause the heuristic to fail, resulting in frequent long linear traversals during Iova allocations [MAT15]. The EiovaR optimization avoids this problem by adding a cache

---

2We refer the reader to [MAT15] for the exact details of the heuristic, which are irrelevant for our purpose.
of recently freed IOVA ranges that can satisfy most allocations without accessing the tree [MAT15]. IOVA allocation time in Linux/EiovaR is thus comparable, if not superior, to other OSes. However, IOVA allocation remains a sequential bottleneck with EiovaR as well (Figure 2.1), since the EiovaR cache is accessed under the IOVA allocator lock.

2.2.2 Linux IOTLB Invalidation

Destroying an IOMMU mapping requires invalidating the IOTLB entry caching the mapping, both for correctness and for security. For correctness, if the unmapped IOVA gets subsequently remapped to a different physical address, the IOMMU will keep using the old translation and misdirect any DMAs to this IOVA. Security-wise, destroying a mapping indicates the device should no longer have access to the associated physical memory. If the translation remains present in the IOTLB, the device can still access the memory.

Unfortunately, waiting for the invalidation to complete prior to returning control from IOVA unmapping code is prohibitively expensive (chapter 4). In addition to the added latency of waiting for the invalidation to complete, issuing the invalidation command requires writing to the IOMMU invalidation queue—an operation that must be serialized and thus quickly becomes a bottleneck.

As a result, Linux does not implement this strict invalidation policy by default. Instead, it implements a deferred invalidation policy that amortizes the cost of IOTLB invalidation across multiple unmappings. Here, an unmap operation buffers an invalidation request for its IOVA in a global flush queue data structure and returns without waiting for the invalidation to complete. Periodically (every 10 ms) or after batching 250 invalidation requests, Linux performs a single global IOTLB invalidation that empties the entire IOTLB (possibly flushing valid entries as well). Once the global invalidation completes, the IOVA allocator is invoked to deallocate each IOVA range buffered in the flush queue.

Thus, deferred invalidation maintains correctness, but trades off some security—creating a window of time in which a device can access unmapped IOVAs—in exchange for performance. In practice, unmapped physical pages rarely get reused immediately upon returning from the unmap function. For example, a driver may unmap multiple pages—possibly triggering a global invalidation—before returning control to the system. Thus, deferred invalidation appears to be a pragmatic trade-off, and other OSes use similar mechanisms (§ 2.2.5).

While deferred invalidation amortizes the latency of waiting for invalidations to complete, the flush queue is protected by a single (per IOMMU) invalidation lock. As with the IOVA allocation lock, this is a non-scalable design that creates a bottleneck—21.7% of the cycles are spent waiting for the invalidation lock in our experiment.
Masking the IOVA allocator bottleneck Interestingly, the IOTLB invalidation bottleneck throttles the rate of IOVA allocation/deallocation operations, and thereby masks the severity of the IOVA allocator bottleneck. Deallocations are throttled because they occur while processing the flush queue—i.e., under the invalidation lock—and are therefore serialized. Allocations (mappings) are throttled because they are interleaved with unmappings. Since unmapping is slow because of the IOTLB bottleneck, the interval between mappings increases and their frequency decreases.

Once we eliminate the IOTLB invalidation bottleneck, however, pressure on the IOVA allocation lock increases and with it the severity of its performance impact. Indeed, as Figure 2.1 shows, adding scalable deferred IOTLB invalidation (§ 3.2) to Linux/EiovaR increases IOVA lock waiting time by 2.1×.

2.2.3 Linux Page Table Management

IOMMU page table management involves two tasks: updating the page tables when creating/destroying mappings, and reclaiming physical pages that are used as page tables when they become empty.

Updating Page Tables

We distinguish between updates to last-level page tables (leafs in the page table tree) and page directories (inner nodes).

For last-level page tables, the Linux IOVA allocator enables synchronization-free updates. Because each mapping is associated with a unique IOVA page range, updates of distinct mappings involve distinct page table entries. Further, the OS does not allow concurrent mapping/unmapping of the same IOVA range. Consequently, it is safe to update entries in last-level page tables without locking or atomic operations.

Updating page directories is more complex, since each page directory entry (PDE) may map multiple IOVA ranges (any range potentially mapped by a child page table), and multiple cores may be concurrently mapping/unmapping these ranges. A new child page table is allocated by updating an empty PDE to point to a new page table. To synchronize this action in a parallel scenario, we allocate a physical page $P$ and then attempt to point the PDE to $P$ using an atomic operation. This attempt may fail if another core has pointed the PDE to its own page table in the mean time, but in this case we simply free $P$ and use the page table installed by the other core. Deallocation of page tables is more complex, and is discussed in the following section.

The bottom line, however, is that updating page tables is relatively cheap. Page directories are updated infrequently and these updates rarely experience conflicts. Similarly, last-level page table updates are cheap and conflict free. As a result, page table updates account for about 2.8% of the cycles in the system (Figure 2.1).
Page Table Reclamation

To reclaim a page table, we must first be able to remove any reference (PDE) to it. This requires some kind of synchronization to atomically (1) determine that the page table is empty, (2) remove the pointer from the parent PDE to it, (3) prevent other cores from creating new entries in the page table in the mean time. In addition, because the IOTLB could cache PDEs [Int14], we can only reclaim the physical page that served as the now-empty page table after invalidating the IOTLB. Before that, it is not safe to reclaim this memory or to map any other IOVA in the range controlled by it.

Due to this complexity, the Linux design sidesteps this issue and does not reclaim page table memory, unless the entire region covered by a PDE is freed in one unmap action. Thus, once a PDE is set to point to some page $P$, it is unlikely to ever change, which in turn reduces the number of updates that need to be performed for PDEs. This simple implementation choice is largely enabled by the IOVA allocation policy of packing IOVA ranges close to the top of the address space. This policy results in requiring a minimal number of page tables to map the allocated IOVA ranges, which makes memory consumption by IOMMU page tables tolerable.

2.2.4 Linux Kernel Synchronization

Mapping and unmapping operations may be invoked by code servicing system interrupts, which is not allowed to sleep. For that reason, a very simple synchronization mechanism called a spinlock is used whenever locking is required, both in existing code and in any of the designs we describe later on.

A spinlock is a synchronization mechanism allowing only a single holder at any given time, which can be represented by a state of one bit in memory indicating whether the spinlock is currently free or taken. Whenever a thread wishes to acquire the spinlock, it attempts to atomically check the spinlock’s state and if it was free, change its status to taken. If the thread succeeds in changing the state from free to taken, it now holds the spinlock. If it fails, some other thread is holding the spinlock and the current thread would just execute this atomic operation in a loop (or spin) over and over again until the thread currently holding the spinlock marks it as free again.

2.2.5 IOMMU Management in Other OSes

This section compares the Linux/Intel-x86 IOMMU management design to the designs used in the FreeBSD, Solaris\(^3\), and Mac OS X systems. Table 2.1 summarizes our findings, which are detailed below. In a nutshell, we find that (1) all OSes have scalability bottlenecks similar to—or worse than—Linux, (2) none of the OSes other than FreeBSD reclaim IOMMU page tables, (3) FreeBSD is the only OS to implement strict IOTLB invalidation. The other OSes loosen their intra-OS protection guarantees for increased

\(^3\)Our source code references are to illumos, a fork of OpenSolaris. However, the code in question dates back to OpenSolaris.
IOVA allocation

All systems use a central globally-locked allocator, which is invoked by each IOVA allocation/deallocation operation, and is thus a bottleneck. The underlying allocator in FreeBSD is a red-black tree of allocated ranges, similarly to Linux. However, FreeBSD uses a different traversal policy, which usually finds a range in logarithmic time [MAT15]. Solaris uses the Vmem resource allocator [BA01], which allocates in constant time. Mac OS X uses two allocators, both logarithmic—a buddy allocator for small (≤ 512 MB) ranges and a red/black tree allocator for larger ranges.

IOTLB invalidation

FreeBSD is the only OS that implements strict IOTLB invalidations, i.e., waits until the IOTLB is invalidated before completing an IOVA unmap operation. The other OSes defer invalidations, although differently than Linux: Solaris does not invalidate the IOTLB when unmapping. Instead, it invalidates the IOTLB when mapping an IOVA range, to flush any previous stale mapping. This creates an unbounded window of time in which a device can still access unmapped memory. An unmap on Mac OS X buffers an IOTLB invalidation request in the cyclic IOMMU invalidation queue and returns without waiting for the invalidation to complete. All these designs acquire the lock protecting the IOMMU invalidation queue for each operation, and thus do not scale.

Page table management

Linux has the most scalable IOMMU page table management scheme—exploiting IOVA range disjointness to update last-level PTEs without locks and inner PDEs with atomic operations. In contrast, FreeBSD performs page table manipulations under a global lock. Solaris uses a more fine-grained technique, protecting each page table with a read/write lock. However, the root page table lock is acquired by every operation and thus becomes a bottleneck, since even acquisitions of a read/write lock in read mode create contention on the lock’s shared cache line [LZC14]. Finally,
Mac OS X updates page tables under a global lock when using the buddy allocator, but outside of the lock—similarly to Linux—when allocating from the red-black tree.

**Page table reclamation**  Similarly to Linux, Solaris does not detect when a page table becomes empty and thus does not attempt to reclaim physical pages that serve as page tables. Mac OS X bounds the number of IOMMU page tables (and therefore the size of I/O virtual memory supported) and allocates them on first use while holding the IOVA allocator lock. Mac OS X does not reclaim page tables as well. Only FreeBSD actively manages IOMMU page table memory; it maintains a count of used PTEs in each page table, and frees the page table when it becomes empty.
Chapter 3

Our Designs

3.1 Scalable IOVA Allocation

Here we describe three designs for scalable IOVA assignment, exploring several points in this design space: (1) dynamic identity mapping (§ 3.1.1), which does away with IOVA allocation altogether, (2) IOVA-kmalloc (§ 3.1.2) which implements IOVA allocation but does away with its dedicated allocator, and (3) scalable IOVA allocation (§ 3.1.3), which uses magazines [BA01] to alleviate contention on the IOVA allocator using per-core caching of freed IOVA ranges.

3.1.1 Dynamic Identity Mapping

The fastest code is code that never runs. Our dynamic identity mapping design applies this principle to IOVA allocation. We observe that ordinarily, the buffers that a driver wishes to map are already physically contiguous. We thus propose to use such a physically contiguous buffer’s physical address as its IOVA when mapping it, resulting in an identity (1-to-1) mapping in the IOMMU. Due to intra-OS protection reasons, when the driver unmaps a buffer we must destroy its identity mapping. We therefore refer to this scheme as dynamic identity mappings—while the same buffer will always use the same mapping, this mapping is dynamically created and destroyed to enforce protection of the buffer’s memory.

Dynamic identity mapping eliminates the work and locking involved in managing a distinct space of IOVAs, replacing it with the work of translating a buffer’s kernel virtual address to a physical address. In most OSes, this is an efficient and scalable operation—e.g., adding a constant to the kernel virtual address. However, while dynamic identity mapping completely eliminates the IOVA allocation bottleneck, it turns out to have broader implications for other parts of the IOMMU management subsystem, which we now discuss.

Page table entry reuse Standard IOVA allocation associates each mapping with a distinct IOVA. As a result, multiple mappings of the same page (e.g., for different
buffers on the same page) involve different page table entries (§ 2.2.3). Dynamic identity mapping breaks this property: a driver—or concurrent cores—mapping a previously mapped page will need to update exactly the same page table entries (PTEs).

While repeated mapping operations will always write the same value (i.e., the physical address of the mapped page), unmapping operations pose a challenge. We need to detect when the last unmapping occurs, so we can clear the PTE. This requires maintaining a reference count for each PTE. We use 10 of the OS-reserved bits in the last-level PTE structure [Int14] to maintain this reference count. When the reference count hits zero, we clear the PTE and request an IOTLB invalidation.

Because multiple cores may concurrently update the same PTE, all PTE updates—including reference count maintenance—require atomic operations. In other words, we need to (1) read the PTE, (2) compute the new PTE value, (3) update it with an atomic operation that verifies the PTE has not changed in the mean time, and (4) repeat this process if the atomic operation fails.

**Conflicting access permissions** A second problem posed by not having unique PTEs for each mapping is how to handle mappings of the same physical page with conflicting access permission (read, write, or both). For example, two buffers may get allocated by the network stack on the same page—one for RX use, which the NIC should write to, and one for TX, which the NIC should only read. To maintain intra-OS protection, we must separately track mappings with different permissions, e.g., so that once all writable mappings of a page are destroyed, no PTE with write permissions remains. Furthermore, even when a mapping with write permissions exists, we want to avoid promoting PTEs that should only be used to read (and vice-versa), as this kind of access should not happen during the normal course operation for a properly functioning device and should be detected and blocked.

To solve this problem, we exploit the fact that current x86 processors support 48-bit I/O virtual memory addresses, but only 46-bits of physical memory addresses. The two most significant bits in each IOVA are thus “spare” bits, which we can use to differentiate mappings with conflicting access permissions. Effectively, we partition the identity mapping space into regions: three regions, for read, write and read/write mappings, and a fourth fallback region that contains addresses that cannot be mapped with identity mappings—as discussed next.

**Non-contiguous buffers** Some mapping operations are for physically non-contiguous buffers. For example, Linux’s scatter/gather mapping functions map non-consecutive physical memory into contiguous virtual I/O memory. Since identity mappings do not address this use case, we fall back to using the IOVA allocator in such cases. To avoid conflicts with the identity mappings, IOVAs returned by the IOVA allocator are used in the fourth fallback region.
**PTE reference count overflow** We fall back to standard IOVA allocation for mappings in which the 10-bit reference count overflows. That is, if we encounter a PTE whose reference count cannot be incremented while creating a dynamic identity mapping, we abort (decrementing the references of any PTEs previously incremented in the mapping process) and create the mapping using an IOVA obtained from the IOVA allocator.

**Page table memory** Because physical addresses mapped by drivers are basically arbitrary, we lose the property that IOVAs are densely packed. Consequently, more memory may be required to hold page tables. For example, if the first 512 map operations are each for a single page, IOVA allocation will map them all through the same last-level page table. With physical addresses, we may need at least a page table per map operation. Unfortunately, the physical pages used to hold these page tables will not get reclaimed (§ 2.2.3). In the worst case, page table memory consumption may keep increasing as the system remains active.

### 3.1.2 IOVA-kmalloc

Our IOVA-kmalloc design explores the implications of treating IOVA allocation as a general allocation problem. Essentially, to allocate an IOVA range of size $R$, we can allocate a block of $R$ bytes in physical memory using the system’s `kmalloc` allocator, and use the block’s address as an IOVA (our actual design is much less wasteful, as discussed below). The addresses `kmalloc` returns are unique per allocation, and thus IOVA-kmalloc maintains the efficient conflict-free updates of IOMMU page tables enabled by the original IOVA allocator.

One crucial difference between `kmalloc` and IOVA allocation is that IOVAs are abstract—essentially, opaque integers—whereas `kmalloc` allocates physically contiguous memory. It might therefore seem that the IOVA-kmalloc approach unacceptably wastes memory, since allocating $R$ bytes to obtain an $R$-byte IOVA range doubles the system’s memory consumption. Fortunately, we observe that the IOVA allocator need only allocate virtual I/O page frame numbers (PFNs), and not arbitrary ranges. That is, given a physical buffer to map, we need to find a range of pages that contains this buffer. This makes the problem tractable: we can treat the physical addresses that `kmalloc` returns as PFNs. That is, to map a range of $R$ bytes, we `kmalloc` $\lceil R/4096 \rceil$ bytes and interpret the address of the returned block $B$ as a range of PFNs (i.e., covering the IOVA range of $[4096B, 4096(B + \lceil R/4096 \rceil)]$).

While this still wastes physical memory, the overhead is only 1 byte per virtual I/O page, rounded up to 8 bytes (the smallest unit `kmalloc` allocates internally). By comparison, the last-level PTE required to map a page is itself 8 bytes, so the memory blowup caused by IOVA-kmalloc allocating actual physical memory is never greater than the memory overhead incurred by stock Linux for page table management.
In fact, being a full-blown memory allocator that manages actual memory (not abstract integers) might actually turn out to be advantageous for kmalloc, as this property enables it to use the many techniques and optimizations in the memory allocation literature. In particular, kmalloc implements a version of slab allocation [Bon94], a fast and space-efficient allocation scheme. One aspect of this technique is that kmalloc stores the slab that contains metadata associated with an allocated address in the page structure of the address. This allows kmalloc to look up metadata in constant time when an address gets freed. In contrast, the Linux IOVA allocator has to maintain metadata externally, in the red-black tree, because there is no physical memory backing an IOVA. It must thus access the globally-locked tree on each deallocation.

**Page table memory blowup** The main downside of IOVA-kmalloc is that we have no control over the allocated addresses. Since kmalloc makes no effort to pack allocations densely, the number of page tables required to hold all the mappings will be larger than with the Linux IOVA allocator. Moreover, if the same physical address is mapped, unmapped, and then mapped again, IOVA-kmalloc may use a different IOVA when remapping. Because Linux does not reclaim page table memory, the amount of memory dedicated to page tables can grow without bound. In contrast, dynamic identity mapping always allocates the same IOVA to a given physical buffer. However, in an OS that manages page table memory, unbounded blowup cannot occur.

To summarize, IOVA-kmalloc represents a classic time/space trade-off—we relinquish memory in exchange for the efficiency and scalability of kmalloc, which is highly optimized due to its pervasive use throughout the kernel. Notably, these advantages come for free, in terms of implementation complexity, debugging and maintenance, since kmalloc is already there, performs well, and is trivial to use.

**Handling IOVA collisions** IOVAs are presently 48 bits wide [Int14]. x86 hardware presently supports 46-bits of physical address space [Int13]. Thus, because we treat kmalloc addresses as virtual I/O PFNs, IOVA-kmalloc may allocate two addresses that collide when interpreted as PFNs in the 48-bit I/O virtual address space. That is, we have $2^{36}$ possible IOVA PFNs since pages are 4 KB, but kmalloc allocates from a pool of up to $2^{46}$ bytes of physical memory.

Such collisions are mathematically impossible on systems with at most 64 GB of physical memory (whose physical addresses are 36-bit). To avoid these collisions on larger systems, IOVA allocations can invoke kmalloc with the GFP_DMA flag. This flag instructs kmalloc to satisfy the allocation from a low memory zone whose size we can configure to be at most 64 GB.\(^1\)

As a side note, we can mention that as a middle of the road solution, a physical

---

\(^1\)In theory, the GFP_DMA memory zone must be accessible to legacy devices for DMA and thus addressable with 24 bits. But as we have an IOMMU, we only need to ensure that the IOVA ranges mapped to legacy devices fall into the 24-bit zone.
memory allocator implemented specifically for this purpose could address these problems. For once, it could be specifically allocated addresses in the 36-bit range, or alternatively, manage a higher level mapping of its superblocks (or equivalent large ranges) to a specific region of virtual memory. It could than also accommodate large pages with each allocated byte representing a large page of 2MB/1GB. Regarding the unused memory, with a dedicated allocator, this memory can be used to store the last level page tables themselves, so long as any metadata stored within these pages is kept to appear as invalid for the IOMMU.

**Why allocate addresses?** We could simply use a mapped buffer’s kernel virtual address as its IOVA PFN. However, we then lose the guarantee that different mappings obtain different IOVA PFNs (e.g., as the same buffer can be mapped twice). This is exactly the problem dynamic identity mapping tackles, only here we do not have “spare” bits to distinguish mappings with different access rights as the virtual address space is 48 bits.

### 3.1.3 Scalable IOVA Allocation with Magazines

Our final design addresses the IOVA allocator bottleneck head-on, turning it into a scalable allocator. The basic idea is to add a per-core cache of previously deallocated IOVA ranges. If most allocations can be satisfied from the per-core cache, the actual allocator—with its lock—will be invoked rarely.

Per-core caching poses several requirements. First, the per-core caches should be bounded. Second, they should efficiently handle producer/consumer scenarios observed in practice, in which one core continuously allocates IOVAs, which are later freed by another core. In a trivial design, only the core freeing IOVAs will build up a cache, while the allocating core will always invoke the underlying non-scalable allocator. We require that both cores avoid interacting with the IOVA allocator.

Magazines [BA01] provide a suitable solution. A magazine is an $M$-element per-core cache of objects—IOVA ranges, in our case—maintained as a stack of objects. Conceptually, a core trying to allocate from an empty magazine (or deallocate into a full magazine) can return its magazine to a globally-locked depot in exchange for a full (or empty) magazine. Magazines actually employ a more sophisticated replenishment policy which guarantees that a core can satisfy at least $M$ allocations and at least $M$ deallocations from its per-core cache before it must access the depot. Thus, a core’s miss rate is bounded by $1/M$.

We implement magazines on top of the original Linux IOVA allocator, maintaining separate magazines and depots for different allocation sizes. Thus, this design still controls page table blowup, as the underlying allocator densely packs allocated IOVAs.
### 3.2 Scalable IOTLB Invalidation

This section describes a scalable implementation of deferred IOTLB invalidations. Recall that Linux maintains a flush queue containing a globally ordered list of all IOVA ranges pending invalidation. We observe, however, that a global flush queue—with its associated lock contention—is overkill. There is no real dependency between the invalidation process of distinct IOVA ranges. Our only requirements are that until an entire IOVA range mapping is invalidated in the IOTLB:

1. The IOVA range will not be released back to the IOVA allocator, to avoid having it allocated again before the old mapping is invalidated.
2. The page tables that were mapping the IOVA range will not be reclaimed. Since page directory entries are also cached in the IOTLB, such a reclaimed page table might get reused and data that appears as a valid page table entry be written to it, and this data might then be used by the IOMMU.

**Our approach** We satisfy these requirements in a much more scalable manner by batching invalidation requests *locally* on each core instead of globally. Implementing this approach simply requires replicating the current flush queue algorithm on a per-core basis. With this design, the lock on a core’s flush queue is almost always acquired by the owning core. The only exception is when the queue’s global invalidation timeout expires—the resulting callback, which acquires the lock, may be scheduled on a different core. However, not only does such an event occur rarely (at most once every 10 ms), but it suggests that the IOMMU management subsystem is not heavily used in the first place.

The remaining source of contention in this design is the serialization of writes to the cyclic IOMMU invalidation queue—which is protected by a lock—in order to buffer global IOTLB invalidation requests. Fortunately, accesses to the invalidation queue are infrequent, with at most one invalidation every 250 invalidations or 10 ms, and short, as an invalidation request is added to the buffer and the lock is released; the core waits for the IOMMU to process its queued invalidation without holding the lock.

Security-wise, while we now might batch 250 invalidation requests per core, a destroyed IOVA range mapping will usually be invalidated in the IOTLB just as quickly as before. The reason is that *some* core performs a global IOTLB invalidation, on average, every 250 global invalidation requests. Thus, we do not substantially increase the window of time in which a device can access an unmapped physical page. Unmapped IOVA ranges may, however, remain in a core’s flush queue and will not be returned to the IOVA allocator for a longer time than the baseline design—they will be freed only when the core itself processes its local flush queue. We did not observe this to be a problem in practice, especially when the system experiences high IOMMU management load.
Chapter 4

Evaluation

Here we evaluate our designs (§§ 3.1–3.2) and explore the trade-offs they involve. We study two metrics: the performance obtained (§ 4.1), and the complexity of implementing the designs (§ 4.2).

4.1 Performance

Experimental setup We implement the designs in Linux 3.17.2. Our test setup consists of a client and a server, both Dell PowerEdge R430 rack machines. Each machine has dual 2.4 GHz Intel Xeon E5-2630 v3 8-core processors, for a total of 16 cores per machine (hyper-threading is disabled). Both machines are equipped with 32 GB 2133 MHz memory. The server is equipped with a Broadcom NetXtreme II BCM57810 10 Gb/s NIC. The client is equipped with an Intel 82599 10 Gb/s NIC and runs an unmodified Ubuntu 3.13.0-45 Linux kernel. The client’s IOMMU is disabled for all evaluations. The client and the server NICs are cross-connected to avoid network interference.

Methodology To obtain the best scalability, we (1) configure the NIC on the server to use 15 rings, which is the maximum number supported by the Broadcom NIC, allowing cores to interact with the NIC with minimal lock contention (no ring contention with <16 cores), and (2) distribute delivery of interrupts associated with different rings across different cores.\footnote{Default delivery is to core #0, which overloads this core.} Benchmarks are executed in a round-robin fashion, with each benchmark running once for 10 seconds for each possible number of cores, followed by all benchmarks repeating. The cycle is run five times and the reported results are the medians of the five runs.

Benchmarks In our benchmarks, we aim to use concurrent workloads that stress the IOMMU management subsystem.
Our first few benchmarks are based on netperf [Jon95], a prominent network analysis tool. In our highly parallel RR (request-response) benchmark, we run 270 instances of the netperf TCP RR test on the client, and limit the server side of netperf to the number of cores we wish to test on. Each TCP RR instance sends a TCP packet with 1 byte of payload to the server and waits for a 1 byte response before sending the next one. In all, our benchmark has 270 ongoing requests to the netperf server at any given time, which bring the server close to 100% CPU utilization even with IOMMU disabled. We report the total number of such request-response transactions the server responds to during the testing period.

The second benchmark we use netperf for is a parallel latency test. To achieve that, we run the netperf TCP RR test with as many instances as we allow server cores. This allows each netperf request-response connection to run on its own dedicated core on both the client and the server.

Our third benchmark is a netperf stream test. To test scalability, for each possible number of server cores — we limit the server to this number of cores and initiate this number of streams from the client. We demonstrate both the default stream configuration, as well as 64-byte send size and 64-byte send with the TCP_NODELAY option enabled.

Our first macro-benchmark is memcached [Fit04], a key-value store service used by web applications to speed up read operations on slow resources such as databases and remote API calls. To avoid lock contention within memcached, we run multiple instances, each pinned to run on a specific core. We measure memcached throughput using memslap [Ake], which distributes requests among the memcached instances. We configure memslap to run on the client with 16 threads and 256 concurrent requests (16 per thread). We use memslap’s default configuration of a 64-byte key, 1024-byte value and a ratio of 10%/90% SET/GET operations.

For our second macro-benchmark we use Apache [Fie, FK97], a prominent HTTP server implementation. Our benchmark repeatedly requests a specific 1KB file from the web server. Testing is carried out on the client using ApacheBench [apa], which we run 6 instances of, in parallel, each issuing 10 concurrent requests. Server side Apache instances are configured to run on a subset of the server’s cores. Our Apache server is configured to use the worker multi-processing module and with logs, mmap and HTTP accept filter disabled.

Results Figure 4.1 depicts the throughput achieved by our highly parallel RR benchmark. Without an IOMMU, Linux scales nicely and obtains 14.4× TPS with 16 cores than with a single core. Because of the IOMMU management bottlenecks, EiovaR-Linux does not scale as well, obtaining only a 3.8× speedup. In particular, while EiovaR obtains 86% of the No-IOMMU throughput with 1 core (due to the single-threaded overheads of IOMMU management), it only achieves 23% of the No-IOMMU throughput at 16 cores.

Our designs all perform at 93%–95% of No-IOMMU on a single core and scale
far better than EiovaR, with 16-core throughput being 93% (magazines and dynamic identity mapping) and 94% (IOVA-kmalloc) of the No-IOMMU results. Because of the overhead dynamic identity mapping adds to page table updates, it does not outperform the IOVA allocating designs—despite not performing IOVA allocation at all.

To summarize, in our designs IOMMU overhead is essentially constant and does not get worse with more concurrency. Most of this overhead consists of page tables updates, which are essential when managing the IOMMU in a transient manner.

Figure 4.1. Highly parallel RR throughput

Figure 4.2. netperf TCP RR instance per core latency test
In our parallel latency test, shown in Figure 4.2, we see that Linux’s latency increases with multiple cores, even without IOMMU, from 29µs with one instance to 41µs with 16 instances, each with its own core. While EiovaR starts within 1.2µs of Linux’s latency on one core, the contention caused by its locks causes a 10µs gap at 16 cores. For the most part, our designs are on par with Linux’s performance, actually achieving slightly shorter latency than No-IOMMU Linux on 16 cores.

The evaluation of memcached in Figure 4.3 paints a similar picture to Figure 4.1 with up to 12 cores. The IOMMU subsystem is indifferent to the different packet size in this workload (1052 bytes here, 1 byte in TCP RR) as the mappings are done per page and no data copying takes place. Starting at 12 cores, all of our designs achieve line rate and therefore close the gap with No-IOMMU performance. Here, too, IOVA-kmalloc demonstrates the best performance out of our designs by a small margin (in all but 2 cores and >12 cores, where they all achieve line rate), as it is the most highly optimized of the three. With a single core, all of our designs are between 89% (dynamic identity) and 91% (IOVA-kmalloc) of No-IOMMU performance. At 9 cores, after which No-IOMMU Linux throughput begins to near line rate, our designs’ relative throughput is between 89.5% (dynamic identity) and 91.5% (IOVA-kmalloc). EiovaR also stops scaling well before 16 cores, but as opposed to our designs, it does not do that due to achieving line rate, plateauing at 40% of it, starting with 9 cores.

Figure 4.4 demonstrates the case of an Apache web server handling 1KB requests and is also similar to our highly parallel netperf RR benchmark, with two notable exceptions. First, a user mode mutex in Apache prevents it from scaling indefinitely, regardless of IOMMU. Second, due to user code being more dominant in this benchmark,
the relative overhead caused by IOMMU management is smaller, with our designs performing between 96% (dynamic identity) and 97.5% (IOVA-kmalloc) of No-IOMMU performance with 16 cores. EiovaR’s relative overhead is also less pronounced due to the extra work being done for each packet in user mode, but even with an application with scalability issues of its own, EiovaR is not able to catch up and achieves 73% of No-IOMMU performance at 16 cores.

**Stream**  As can be seen in figure 4.5, a *netperf* TCP stream benchmark, operating under its default settings, performs quite well even on EiovaR. This is due to the fact that EiovaR is able to achieve line rate using just one *netperf* stream instance running on a one core utilized at 22%, leading to very limited contention in the parallel case. Even the stock Linux IOMMU implementation is able to support line rate with one core (albeit at 95% CPU utilization), and it is its “long-term ring interference” pathology [MAT15] that causes its performance to deteriorate when more and more rings are being put to use.

EiovaR’s ability to support a TCP stream workload efficiently is, to a great degree, due to the fact that the received TCP packets are very large, allowing line rate to be achieved with a relatively low number of TCP packets and therefore a relatively low number of IOMMU mappings. It is therefore interesting to see how EiovaR, as well as our designs, handle a stream workload which continually sends small chunks of data.

Figure 4.6 shows such a workload—a *netperf* TCP stream test which instances send 64 byte messages at a time. Throughput is indeed decreased with a low number of streams (and server cores handling them). This decrease, however, also occurs without the involvement of an IOMMU, and is indeed indifferent to IOMMU management.
overhead of all but stock Linux. This behaviour is due to the Nagle’s algorithm [Nag] employed by Linux’s TCP stack in the client, which intentionally delays the transmission of TCP segments with the intention of batching them and thereby reducing the relative overhead imposed by TCP/IP headers. Applying Nagle’s algorithm leads to most of packets actually sent by the client to be around 1500 bytes, allowing the server to process them in a relatively efficiently in regard to IOMMU mappings. On the other hand, our client is rather congested, as it is required to perform many system calls and

Figure 4.5. netperf stream throughput, 16KB messages

Figure 4.6. netperf stream throughput, 64 byte messages
related processing to send 64 byte chunks to its kernel. This overhead leads to the client not being able to sustain line rate until it uses 9 streams (on 9 different cores).

Nagle’s algorithm increases throughput by delaying ready-to-send TCP segments, but it therefore increases latency, which is why the Linux kernel enables the user to turn it off by setting the TCP_NODELAY socket option. Figure 4.7 shows the throughput achieved by our various designs using the same 64 byte send size benchmark as before, but with TCP_NODELAY set. As we can see, for the most part, until 7 cores, stock Linux outperforms Eiovar in throughput, which outperforms our designs and Linux with no IOMMU enabled. This is due to TCP_NODELAY being a latency optimization, where we are trying to send chunks as soon as possible at the expense of batching and resulting throughput. However, the unintentional delay caused by stock Linux and Eiovar ultimately results in the client’s network stack batching more user chunks into each packet, which can also be seen in Figure 4.7. And so, as latency increases, batching takes place, reducing per-payload-byte overhead and allowing higher throughput.

Page table memory consumption  Linux rarely reclaims a page used as an IOMMU page table (§ 2.2.3). Figure 4.8 illustrates the memory consumption dynamics this policy creates. We run 100 iterations of our parallel RR benchmark and plot the number of IOMMU page tables (in all levels) that exist in the system before the tests start (but after system and NIC initialization) and after each iteration.

Both Eiovar and our magazine-based design, which are based on Linux’s current IOVA allocator, minimize page table memory consumption by packing allocated IOVAs at the top of the address space. As Figure 4.8 shows, both of them, as well as stock
Linux, allocate most page tables at NIC initialization, when all RX rings are allocated and mapped, as well as all TX/RX descriptors and other metadata. Subsequently, some additional page tables get allocated to accommodate TX packets, with the bulk majority of these being allocated during the first iteration.

With our magazines design, some cores would cache freed IOVAs in their per-core caches, which might require other cores to allocate new IOVAs from the allocator and increase the number of page tables in the system, with the number of additional IOVAs needed bounded by the size of per-core caches. Empirically though, we do not see a substantial difference between magazines and Eiovar’s global caches, even though the magazines setup enqueues 4× more packets in the TX rings per-second. Overall, even with per-core caches and a more TX activity, our magazine design only requires 1.2× more page table memory than stock Linux.

While dynamic identity mapping does not guarantee an upper bound on the number of IOVAs it utilizes over time, it turns out that the APIs used by the driver to allocate buffers use caching themselves. Consequently, mapped addresses repeat across the iterations and the number of page tables does not explode. Still, 3.4× more page tables than stock Linux are allocated, since the addresses are not packed. Furthermore, even with this kind of caching in place, there is danger that as the system stays online, some addresses would be released and others allocated, resulting in different and arbitrary physical addresses and therefore more abandoned page tables.

In contrast to the other schemes, IOVA-kmalloc exhibits a blowup of page table memory use. Since the addresses IOVA-kmalloc receives from kmalloc are influenced by system-wide allocation activity, IOVA-kmalloc uses a much wider and constantly
changing set of IOVAs. This causes its page table memory consumption to grow in an almost linear fashion. This trend carries on as the system stays up. While running our entire benchmark suite, IOVA-kmalloc allocates $\approx 11$ GB worth of memory, in just a few hours of operation.

We conclude that while IOVA-kmalloc is the best performing design, by a slight margin, its use must be accompanied by memory reclamation of unused page tables. This underscores the need for managing the IOMMU page table memory in a scalable manner—§ 2.2.3 describes the challenges involved.

**Strict invalidation** Figure 4.9 shows the parallel RR benchmark throughput with strict IOTLB invalidation, i.e., full intra-OS protection. As invalidations are not deferred, our scalable IOTLB invalidation optimization is irrelevant for this case. While we observe very minor throughput difference between the IOVA allocation designs when using a small number of cores, even this difference is no longer evident when more than 6 cores are used, with all designs obtaining about $1/6$ the TPS of No-IOMMU Linux by 16 cores. In all designs, the contention over the invalidation queue lock becomes the dominating factor (§ 2.2.2). Thus, strict IOTLB invalidation appears incompatible with a scalable implementation.

### 4.2 Implementation Complexity

Table 4.1 reports the source code changes required to implement our designs in Linux 3.17.2. The table also estimates the time it would take us to re-implement each approach
<table>
<thead>
<tr>
<th>Design</th>
<th>Lines added</th>
<th>Lines deleted</th>
<th>Files</th>
<th>Impl. time</th>
</tr>
</thead>
<tbody>
<tr>
<td>Dynamic Identity Mapping (§ 3.1.1)</td>
<td>+397</td>
<td>-92</td>
<td>1</td>
<td>A few weeks</td>
</tr>
<tr>
<td>IOVA-kmalloc (§ 3.1.2)</td>
<td>+56</td>
<td>-44</td>
<td>1</td>
<td>A few hours</td>
</tr>
<tr>
<td>IOVA Allocation with Magazines (§ 3.1.3)</td>
<td>+362</td>
<td>-79</td>
<td>3</td>
<td>A few days</td>
</tr>
<tr>
<td>Scalable IOTLB Invalidation (§ 3.2)</td>
<td>+97</td>
<td>-37</td>
<td>1</td>
<td>A few hours</td>
</tr>
</tbody>
</table>

Table 4.1. Implementation complexity of our designs

from scratch. IOVA-kmalloc is the simplest design to implement, essentially replacing calls to the IOVA allocator with a `kmalloc()` call. Most of the line changes reported for IOVA-kmalloc are due to technical changes, replacing the `struct` representing an IOVA with an integer type. Implementation of the magazines design is a bit more complex, requiring an implementation of a magazine-based caching layer on top of the existing IOVA allocator, as well as optimizing its locking strategy. Dynamic identity mapping is the most complicated and intrusive of our IOVA assignment designs, as it calls for a surgical modification of page table management itself, maintaining mapping book-keeping within the table itself and synchronizing updates to the same mapping from multiple cores in parallel.
Chapter 5

Related Work

Most work addressing the poor performance associated with using IOMMUs tackles only sequential performance [ABYTS11, BYXO+07, Cas13, MABYT15, Tom08, WRC08, YBYW10, MAT15], for example by reducing the number of mapping operations [WRC08], deferring or offloading IOTLB invalidation work [ABYTS11], and improving the IOVA allocator algorithm [Cas13, MAT15, Tom08]. Cascardo [Cas13] does consider multicore workloads, but his proposal improves only the sequential part of the IOVA allocator. In contrast, our work identifies and attacks the scalability bottlenecks in current IOMMU management designs. In addition, our scalable IOVA allocation is orthogonal to sequential improvements in the IOVA allocator, since it treats it as a black box.

Clements et al. propose designs for scalable management of process virtual address spaces [CKZ12, CKZ13]. I/O virtual memory has simpler semantics than standard virtual memory, which allows us to explore simpler designs. In particular, (1) IOVA ranges cannot be partially unmapped, unlike standard mmap()ed ranges, and (2) IOVA mappings can exploit the preexisting address of the mapped buffers (as in dynamic identity mapping), while creation of a virtual memory region can occur before the allocation of the physical memory backing it.

Bonwick introduced magazines to improve scalability in the Vmem resource allocator [BA01]. Since Vmem is a general allocator, however, it does not minimize page table memory consumption of allocated IOVAs, in contrast to the specialized Linux allocator. Our IOVA-kmalloc design additionally shows that a dedicated resource allocator may not be required in the first place. Finally, part of our contribution is the re-evaluation of magazines on modern machines and workloads.
Chapter 6

Conclusion

Today, IOMMU-based intra-OS protection faces a performance challenge in high throughput and highly concurrent I/O workloads. In current OSes, the IOMMU management subsystem is not scalable and creates a prohibitive bottleneck.

Towards addressing this problem, we have explored the design space of scalable IOMMU management approaches. We proposed three designs for scalable IOVA assignment—dynamic identity mapping, IOVA-kmalloc, and per-core IOVA caching—as well as a scalable IOTLB invalidation scheme. Our designs achieve 88.5%–100% of the performance obtained without an IOMMU.

Our evaluation demonstrates the trade-offs of the different designs. Namely, (1) the savings dynamic identity mapping obtains from not allocating IOVAs are negated by its more expensive IOMMU page table management, making it perform comparably to scalable IOVA allocation, and (2) IOVA-kmalloc provides a simple solution with high performance, but suffers from unbounded page table blowup. This emphasizes the importance of managing the IOMMU page table memory in a scalable manner as well, which is interesting future work.
Bibliography


[CYC+01] Andy Chou, Junfeng Yang, Benjamin Chelf, Seth Hallem, and Dawson Engler. An empirical study of operating systems errors. In ACM


תרומות

לעבידת נושר תרומות:

• יסחית הקצבת חותנה יוזרואליות פסילת חותנה מוסמך החנהמונה צוואר יבק מהמכון

הנהלא تكون חותמות ק/פ במשרוכות חתונות הקימום.

• הנצחת שלושה עציובים שלימים לבחירת חותנה יוזרואליות עם עם יוץ נסים לפסילת

חותמות.

• הערכה של עציובים קיימים של yazıיות החתימה על יﺅרואליות רוזמטית עבורה מכפילים שלונים.

• סקירת מרחבי העציובים על ידי שוחרי ביצועים, הצהרת יוזרואליות על לבלאת חותמה ות RemoteExceptionים

ה.Minuteウェブサイトית

פסילת כתובות במטמון התרגום

לע מתן ליעל את החלק ההרגולר, היוספ עוזיהו גופזה שטפוף את הפיסול של כתובות התרגולויות בטבנמוא, ואומץ קים הקר.apply. לע מתן ליעל את החלק ההרגולר, היוספ עוזיהו גופזה שטפוף את הפיסול של כתובות התרגולויות בטבנמוא,アプリケーション צלול להסיפוס המפורט לע细菌(at) תרגום, ו- 3 בולServiceProvider לתwarf(fpablins) עימה את התהליך על ידי איסוף הפרטים את הפרטים את הפרטים את הפרטים את הפרטים את הפרטים את הפרטים את הפרטים את הפרטים את הפרטים את הפרטים את הפרטים את הפרטים את הפרטים את הפרטים את הפרטים את הפרטים את הפרטים את הפרטים את הפרטים את הפרטים את הפרטום את הפרטים את הפרטים את הפרטים את הפרטים את הפרטים את הפרטים את הפרטים את הפרטים את הפרטים את הפרטים את הפרטים את הפרטים את הפרטים את הפרטים את הפרטים את הפרטים את הפרטים את הפרטים את הפרטים את הפרטים את הפרטים את הפרטים את הפרטים את הפרטים את הפרטים את הפרטים את הפרטים את הפרטים את הפרטים את הפרטים את הפרטים את הפרטים את הפרטים את הפרטים את הפרטים את הפרטים את הפרטים את הפרטים את הפרטים את הפרטים את הפרטים את הפרטים את הפרטים את הפרטים את הפרטים את הפרטים את הפרטים את הפרטים את הפרטים את הפרטים את הפרטים את הפרטים את הפרטים את הפרטים את הפרטים את הפרטים את הפרטים את הפרטים את הפרטים את הפרטים את הפרטים את הפרטים את הפרטים את הפרטים את הפרטים את הפרטים את הפרטים את הפרטים את הפרטים את הפרטים את הפרטים את הפרטים את הפרטים את הפרטים את הפרטים את הפרטים את הפרטים את הפרטים את הפרטים את הפרטים את הפרטים את הפרטים את הפרטים את הפרטים את הפרטים את הפרטים את הפרטים את הפרטים את הפרטים את הפרטים את הפרטים את הפרטים את הפרטים את הפרטים את הפרטים את הפרטים את הפרטים את הפרטים את הפרטים את הפרטים את הפרטים את הפרטים את הפרטים את הפרטים את הפרטים את הפרטים את הפרטים את הפרטים את הפרטים את הפרטים את הפרטים את הפרטים את הפרטים את הפרטים את הפרטים את הפרטים את הפרטים את הפרטים את הפרטים את הפרטים את הפרטים את הפרטים את הפרטים את הפרטים את הפרטים את הפרטים את הפרטים את הפרטים את הפרטים את הפרטים את הפרטים את הפרטים את הפרטים את הפרטי...
המערך העצוב, למפיס את הכותבים במרחבי עיצוב באמצעות הפניות להיזק"פ. עלמנת לפתור את הבעיה, אנו מומרים למימוש של הפרטים הבאים במערך ביצועים במערך ביצועים במערך ביצועים במערך ביצועים במערך ביצועים במערך ביצועים במערך ביצועים במערך ביצועים במערך ביצועים במערך ביצועים במערך ביצועים במערך ביצועים במערך ביצועים במערך ביצועים במערך ביצועים במערך ביצועים במערך ביצועים במערך ביצועים במערך ביצועים במערך ביצועים במערך ביצועים במערך ביצועים במערך ביצועים במערך ביצועים במערך ביצועים במערך ביצועים במערך ביצועים במערך ביצועים במערך ביצועים במערך ביצועים במערך ביצועים במערך ביצועים במערך ביצועים במערך ביצועים במערך ביצועים במערך ביצועים במערך ביצועים במערך ביצועים במערך ביצועים במערך ביצועים במערך ביצועים במערך ביצועים במערך ביצועים במערך ביצועים במערך ביצועים במערך ביצועים במערך ביצועים במערך ביצועים במערך ביצועים במערך ביצועים במערך ביצועים במערכ
In order to enable a technique (light/flash) of techniques (being faster, e.g., cards, network), various techniques are provided. These techniques are provided in a way that allows direct access to the memory, which is good for reading and writing, without the need for an intermediate system. Unlike the process of accessing memory, which is done under the control of the memory manager, access to memory is done by the technique (light/flash), without the possibility of controlling it.

Modern computers, with that, provide a memory manager for the technique (light/flash), which is connected to access memory in a consistent way. This memory manager is provided by the technique (light/flash) and can be translated into the technique (light/flash) text. If the translation is not correct, it will create a situation where the access is not possible for reading or writing. This feature allows the system to control access to a part of the technique (light/flash) memory and protect itself from incorrect and malicious techniques, such as bugs that may appear in it.

Protection is done by the system (Linux, Solaris, etc.) and is recommended to use the system.

The goal of this work is to be included in the technique (light/flash) and to provide a consistent solution for the memory mapping that is used in modern systems. The motivation for this work is to be included in modern systems, which support direct access to memory by the technique (light/flash), thus achieving a high level of performance. In the case of parallel work, this is done by mapping tables that map the memory in a consistent way.

At the Technion Computer Science Department, M.Sc. Thesis 2015-18, the following mapping is used:

- Apple OS X
- FreeBSD

This mapping includes the following features:

- Efficient and consistent memory mapping
- High performance
- Protection against malicious techniques
- High level of performance
- Parallel work

This work is done by the team at the Technion Computer Science Department, M.Sc. Thesis 2015-18.
שימוש סולמי בחידת גיוולט זוכרים של התכנית קלות/פלט

hibor al mahker

לשם مليולי חלקי של הדרישות לקבלת התואר

מניסים למ디ים במודעי המחשב

עומר פלג

הונג להנง הטכנולוגיה - מכון טכנולוגי לישראל

אב התחיש"ה חיפה אוגוסט 2015
℮ימום סולומיו ביהידה גיהון היזכרון של
התקני קלת/פלט

עומר פלג