|Title:||Rethinking the I/O Memory Management Unit (IOMMU)
|Currently accessible only within the Technion network|
|Abstract:||Processes are encapsulated within virtual memory spaces and access memory via virtual addresses (VAs) to ensure, among other things, that they access only those parts of memory to which they have been explicitly granted access. These memory spaces are created and maintained by the OS, and the translation from virtual to physical addresses (PAs) is done by the MMU. Analogously, I/O devices can be encapsulated within I/O virtual memory spaces and access memory using I/O virtual addresses (IOVAs), which are translated to physical addresses by the I/O memory management unit (IOMMU). This encapsulation increases system availability and reliability, since it prevents devices from overwriting arbitrary parts of memory, including memory that might be used by other entities. It also prevents rogue devices from performing errant or malicious accesses to memory and ensures that buggy devices will not cause important data to be lost. Chip makers understood the importance of this protection and added IOMMUs to the chipsets of all servers and some PCs. However, this protection comes at the cost of performance degradation, which depends on the IOMMU design, the way it is programmed, and the workload. We found that Intel’s IOMMU degrades the throughput of I/O-intensive workloads by up to an order of magnitude. We investigate all the possible causes of the overhead of the IOMMU and of its driver, and suggest a solution for each. First, we identify that the complexity of the kernel subsystem in charge of IOVA allocation is linear in the number of allocated IOVAs and is thus a major source of overhead. We optimize the allocation in a manner that ensures that the complexity is typically constant and never worse than logarithmic, and we improve the performance of the Netperf, Apache, and Memcached benchmarks by up to 4.6x. Observing that the IOTLB miss rate can be as high as 50%, we then suggest hiding the IOTLB misses with a prefetcher. We extend several state-of-the-art prefetchers to the IOTLB and compare them. 
In our experiments, we achieve a hit rate of up to 99% on some configurations and workloads. Finally, we observe that many devices, such as network and disk controllers, typically interact with the OS via circular ring buffers that induce a sequential, completely predictable workload. We design a ring IOMMU (rIOMMU) that leverages this characteristic by replacing the virtual memory page table hierarchy with a circular, flat table. Using standard networking benchmarks, we show that rIOMMU provides up to 7.56x higher throughput relative to the baseline IOMMU, and that it achieves 0.77-1.00x the throughput of a system without an IOMMU.|
|Copyright||The above paper is copyrighted by the Technion, the author(s), or others. Please contact the author(s) for more information.|
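The abstract's idea of replacing the page table hierarchy with a circular, flat table can be illustrated with a minimal sketch. This is not the rIOMMU design from the report: the class name `FlatRingTable`, the 4 KiB page size, the encoding of the ring index in the upper bits of the IOVA, and the strictly in-order unmapping are all assumptions made for illustration. It only shows why a sequential ring workload allows translation with a single indexed lookup instead of a multi-level table walk.

```python
from typing import List, Optional

PAGE_SHIFT = 12                # assumed 4 KiB pages
PAGE_SIZE = 1 << PAGE_SHIFT


class FlatRingTable:
    """Hypothetical flat, circular IOVA translation table (illustration only)."""

    def __init__(self, size: int) -> None:
        self.size = size
        # One slot per ring entry; None means "not mapped".
        self.frames: List[Optional[int]] = [None] * size
        self.head = 0          # oldest mapped slot (next to unmap)
        self.tail = 0          # next free slot (next to map)
        self.count = 0

    def map(self, phys_frame: int) -> int:
        """Map a physical frame into the next ring slot and return its IOVA."""
        if self.count == self.size:
            raise MemoryError("ring table is full")
        idx = self.tail
        self.frames[idx] = phys_frame
        self.tail = (self.tail + 1) % self.size
        self.count += 1
        return idx << PAGE_SHIFT   # assumed encoding: slot index in upper bits

    def unmap_oldest(self) -> None:
        """Unmap in ring order, mirroring how a ring buffer is consumed."""
        if self.count == 0:
            raise IndexError("ring table is empty")
        self.frames[self.head] = None
        self.head = (self.head + 1) % self.size
        self.count -= 1

    def translate(self, iova: int) -> int:
        """Translate an IOVA with one array lookup (no hierarchy walk)."""
        idx = iova >> PAGE_SHIFT
        offset = iova & (PAGE_SIZE - 1)
        if idx >= self.size or self.frames[idx] is None:
            raise PermissionError("IOVA is not mapped")
        return (self.frames[idx] << PAGE_SHIFT) | offset
```

In this sketch, mapping, unmapping, and translation each take constant time, and the slot that the device will touch next is always the following index, which is what makes the access pattern predictable.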
Remark: Any link to this technical report should be to this page (http://www.cs.technion.ac.il/users/wwwb/cgi-bin/tr-info.cgi/2015/MSC/MSC-2015-10), rather than to the URL of the PDF files directly. The latter URLs may change without notice.
Computer science department, Technion