Dan Tsafrir
Dept. of Computer Science
Technion - Israel Institute of Technology

Publications

2017

  • Flash Drive Lifespan *is* a Problem
    Tao Zhang, Aviad Zuck, Donald E. Porter, Dan Tsafrir
    HotOS '17: ACM Workshop on Hot Topics in Operating Systems
    May, 2017, Whistler, Canada, to appear
    BibTeX
    @InProceedings{zhang17-fbrick,
    author = {Tao Zhang and Aviad Zuck and Donald E. Porter and Dan Tsafrir},
    title = {Flash {Drive} {Lifespan} *is* a {Problem}},
    booktitle = {ACM Workshop on Hot Topics in Operating Systems (HotOS)},
    year = 2017,
    month = {May},
    address = {Whistler, Canada},
    note = {(to appear)}
    }

  • Preserving Hidden Data with an Ever-Changing Disk
    Aviad Zuck, Udi Shriki, Donald E. Porter, Dan Tsafrir
    HotOS '17: ACM Workshop on Hot Topics in Operating Systems
    May, 2017, Whistler, Canada, to appear
    BibTeX
    @InProceedings{zuck17-everchng,
    author = {Aviad Zuck and Udi Shriki and Donald E. Porter and Dan Tsafrir},
    title = {Preserving {Hidden} {Data} with an {Ever-Changing} {Disk}},
    booktitle = {ACM Workshop on Hot Topics in Operating Systems (HotOS)},
    year = 2017,
    month = {May},
    address = {Whistler, Canada},
    note = {(to appear)}
    }

  • Page fault support for network controllers
    Ilya Lesokhin, Haggai Eran, Shachar Raindel, Guy Shapiro, Sagi Grimberg, Liran Liss, Muli Ben-Yehuda, Nadav Amit, Dan Tsafrir
    ASPLOS '17: ACM International Conference on Architectural Support for Programming Languages and Operating Systems
    April, 2017, Xi'an, China, to appear
    Abstract , BibTeX , PDF
    Direct network I/O allows network controllers (NICs) to expose multiple instances of themselves, to be used by untrusted software without a trusted intermediary. Direct I/O thus frees researchers from legacy software, fueling studies that innovate in multitenant setups. Such studies, however, overwhelmingly ignore one serious problem: direct memory accesses (DMAs) of NICs disallow page faults, forcing systems to either pin entire address spaces to physical memory and thereby hinder memory utilization, or resort to APIs that pin/unpin memory buffers before/after they are DMAed, which complicates the programming model and hampers performance.

    We solve this problem by designing and implementing page fault support for InfiniBand and Ethernet NICs. A main challenge we tackle—unique to NICs—is handling receive DMAs that trigger page faults, leaving the NIC without memory to store the incoming data. We demonstrate that our solution provides all the benefits associated with “regular” virtual memory, notably (1) a simpler programming model that rids users from the need to pin, and (2) the ability to employ all the canonical memory optimizations, such as memory overcommitment and demand-paging based on actual use. We show that, as a result, benchmark performance improves by up to 1.9x.
    @InProceedings{lesokhin17-npf,
    author = {Ilya Lesokhin and Haggai Eran and Shachar Raindel and Guy Shapiro and Sagi Grimberg and Liran Liss and Muli Ben-Yehuda and Nadav Amit and Dan Tsafrir},
    title = {Page fault support for network controllers},
    booktitle = {ACM International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS)},
    year = 2017,
    month = {April},
    address = {Xi'an, China},
    note = {(to appear)}
    }
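
    The pin/unpin burden described in the abstract can be made concrete with a short user-space sketch. This is illustrative only: fake_dma_receive is a hypothetical stand-in for a device DMA, and mlock/munlock stand in for driver-level page pinning; none of this is the paper's code.

      // Minimal sketch of the pin/unpin programming model that NICs without
      // page-fault support impose (hypothetical names; mlock approximates
      // driver-level page pinning).
      #include <sys/mman.h>
      #include <cstdio>
      #include <cstdlib>
      #include <cstring>

      // Hypothetical stand-in for a device DMA into the buffer.
      static void fake_dma_receive(void* buf, size_t len) { memset(buf, 0, len); }

      int main() {
          const size_t len = 4 * 4096;
          void* buf = malloc(len);
          if (buf == nullptr) return 1;
          // The buffer must be pinned before the DMA is posted...
          if (mlock(buf, len) != 0) { perror("mlock"); return 1; }
          fake_dma_receive(buf, len);  // DMA may target buf only while pinned
          munlock(buf, len);           // ...and unpinned right after, which is
          free(buf);                   // what complicates the programming model.
          return 0;
      }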

  • Hardware and software support for virtualization   [book]
    Edouard Bugnion, Jason Nieh, Dan Tsafrir
    Morgan & Claypool Publishers
    Synthesis Lectures on Computer Architecture, February, 2017, pages 1–206, volume 12, number 1
    Abstract , BibTeX , Definitive
    This textbook focuses on the core question of the necessary architectural support provided by hardware to efficiently run virtual machines, and of the corresponding design of the hypervisors that run them. Virtualization is still possible when the instruction set architecture lacks such support, but the hypervisor remains more complex and must rely on additional techniques.

    Despite the focus on architectural support in current architectures, some historical perspective is necessary to appropriately frame the problem. The first half of the text provides the historical perspective of the theoretical framework developed four decades ago by Popek and Goldberg. It also describes earlier systems that enabled virtualization despite the lack of architectural support in hardware.

    As is often the case, theory defines a necessary—but not sufficient—set of features, and modern architectures are the result of the combination of the theoretical framework with insights derived from practical systems. The second half of the text describes state-of-the-art support for virtualization in both x86-64 and ARM processors. This textbook includes an in-depth description of the CPU, memory and I/O virtualization of these two processor architectures, as well as case studies on the Linux/KVM, VMware, and Xen hypervisors. It concludes with a performance comparison of virtualization on current-generation x86- and ARM-based systems across multiple hypervisors.
    @Book{bugnion17-virtbook,
    author = {Edouard Bugnion and Jason Nieh and Dan Tsafrir},
    title = {Hardware and software support for virtualization},
    publisher = {Morgan \& Claypool Publishers},
    series = {Synthesis Lectures on Computer Architecture},
    volume = 12,
    number = 1,
    year = 2017,
    month = {February},
    pages = {1--206}
    }

2016

  • Hash, don't cache (the page table)
    Idan Yaniv, Dan Tsafrir
    SIGMETRICS '16: ACM SIGMETRICS International Conference on Measurement and Modeling
    June, 2016, Antibes Juan-les-Pins, France, pages 337–350
    Abstract , BibTeX , PDF , Definitive
    Radix page tables as implemented in the x86-64 architecture incur a penalty of four memory references for address translation upon each TLB miss. These 4 references become 24 in virtualized setups, accounting for 5%–90% of the runtime and thus motivating chip vendors to incorporate page walk caches (PWCs). Counterintuitively, an ISCA'10 paper found that radix page tables with PWCs are superior to hashed page tables, yielding up to 5x fewer DRAM accesses per page walk. We challenge this finding and show that it is the result of comparing against a suboptimal hashed implementation—that of the Itanium architecture. We show that, when carefully optimized, hashed page tables in fact outperform existing PWC-aided x86-64 hardware, shortening benchmark runtimes by 1%–27% and 6%–32% in bare-metal and virtualized setups, without resorting to PWCs. We further show that hashed page tables are inherently more scalable than radix designs and are better suited to accommodate the ever increasing memory size; their downside is that they make it more challenging to support such features as superpages.
    @InProceedings{yaniv16-hvsr,
    author = {Idan Yaniv and Dan Tsafrir},
    title = {Hash, don't cache (the page table)},
    booktitle = {ACM SIGMETRICS International Conference on Measurement and Modeling (SIGMETRICS)},
    year = 2016,
    month = {June},
    pages = {337--350},
    address = {Antibes Juan-les-Pins, France}
    }
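
    To illustrate why a hashed design needs fewer memory references, here is a toy VPN-to-PFN hash table in C++ (our sketch, not the paper's design): the common-case translation costs one probe, versus four dependent loads in an x86-64 radix walk.

      #include <cstdint>
      #include <cstdio>
      #include <vector>

      // Toy hashed page table with open addressing; kept sparse so probe
      // chains stay short. Real designs do this walk in hardware.
      struct Entry { uint64_t vpn, pfn; bool valid; };

      struct HashedPageTable {
          std::vector<Entry> slots;
          explicit HashedPageTable(size_t n) : slots(n, Entry{0, 0, false}) {}
          size_t hash(uint64_t vpn) const {
              return (vpn * 0x9E3779B97F4A7C15ULL) % slots.size();
          }
          void map(uint64_t vpn, uint64_t pfn) {
              size_t i = hash(vpn);
              while (slots[i].valid) i = (i + 1) % slots.size();  // linear probe
              slots[i] = Entry{vpn, pfn, true};
          }
          bool translate(uint64_t vpn, uint64_t* pfn) const {
              // One memory probe in the common case, vs. four radix levels.
              for (size_t i = hash(vpn); slots[i].valid; i = (i + 1) % slots.size())
                  if (slots[i].vpn == vpn) { *pfn = slots[i].pfn; return true; }
              return false;  // would trigger a page fault
          }
      };

      int main() {
          HashedPageTable pt(1024);
          pt.map(0x1234, 0x42);
          uint64_t pfn = 0;
          if (pt.translate(0x1234, &pfn)) printf("PFN = 0x%llx\n", (unsigned long long)pfn);
      }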

  • Synopsis of the ASPLOS '16 wild and crazy ideas (WACI) invited-speakers session
    Dan Tsafrir
    ASPLOS '16: ACM International Conference on Architectural Support for Programming Languages and Operating Systems
    April, 2016, Atlanta, GA, pages 291–294
    Abstract , BibTeX , PDF , Definitive , Webpage
    The Wild and Crazy Ideas (WACI) session is a longstanding tradition at ASPLOS, soliciting talks that consist of forward-looking, visionary, inspiring, creative, far out or just plain amazing ideas presented in an exciting way. (Amusing elements in the presentations are tolerated ;-) but are in fact optional.)

    The first WACI session took place in 1998. Back then, the call for talks included a problem statement, which contended that “papers usually do not get admitted to [such conferences as] ISCA or ASPLOS unless the systems that they describe are mature enough to run [some standard benchmark suites, which] has a chilling effect on the idea generation process—encouraging incremental research” [1]. The 1998 WACI session turned out to be a great success. Its webpage states that “there were 42 submissions [competing over] only eight time slots, [which resulted in] this session [having] a lower acceptance rate than the conference itself” [2].

    But the times they are a-changin' [3], and the WACI session no longer enjoys that many submissions (Figure 1), perhaps because nowadays there exist many forums for researchers to describe/discuss their preliminary ideas, including: the “hot topics in” workshops [4–7]; a journal like CAL, dedicated to early results [8]; main conferences soliciting short submissions describing “original or unconventional ideas at a preliminary stage” in addition to regular papers [9]; and the many workshops co-located with main conferences, like ISCA '15, which hosted thirteen such workshops [10].

    Regardless of the reason for the declining number of submissions, this time we've decided to organize the WACI session differently to ensure its continued high quality. Instead of soliciting talks via an open call and hoping for the best, we proactively invited speakers who we believe are capable of delivering excellent WACI presentations. That is, this year's WACI session consists exclusively of invited speakers. Filling up the available slots turned out to be fairly easy, as most of the researchers we invited promptly accepted our invitation. The duration of each talk was set to be eight minutes (exactly as in the first WACI session from 1998) plus two minutes for questions.

    The talks are outlined below. We believe they are interesting and exciting, and we hope the attendees of the session will find them stimulating and insightful.
    @InProceedings{tsafrir16-waci,
    author = {Dan Tsafrir},
    title = {Synopsis of the {ASPLOS} '16 wild and crazy ideas {(WACI)} invited-speakers session},
    booktitle = {ACM International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS)},
    year = 2016,
    month = {April},
    pages = {291--294},
    address = {Atlanta, GA}
    }

  • True IOMMU protection from DMA attacks: when copy is faster than zero copy
    Alex Markuze, Adam Morrison, Dan Tsafrir
    ASPLOS '16: ACM International Conference on Architectural Support for Programming Languages and Operating Systems
    April, 2016, Atlanta, GA, pages 249–262
    Abstract , BibTeX , PDF , Definitive
    Malicious I/O devices might compromise the OS using DMAs. The OS therefore utilizes the IOMMU to map and unmap every target buffer right before and after its DMA is processed, thereby restricting DMAs to their designated locations. This usage model, however, is not truly secure for two reasons: (1) it provides protection at page granularity only, whereas DMA buffers can reside on the same page as other data; and (2) it delays DMA buffer unmaps due to performance considerations, creating a vulnerability window in which devices can access in-use memory.

    We propose that OSes utilize the IOMMU differently, in a manner that eliminates these two flaws. Our new usage model restricts device access to a set of shadow DMA buffers that are never unmapped, and it copies DMAed data to/from these buffers, thus providing sub-page protection while eliminating the aforementioned vulnerability window. Our key insight is that the cost of interacting with, and synchronizing access to, the slow IOMMU hardware—required for zero-copy protection against devices—makes copying preferable to zero-copying.

    We implement our model in Linux and evaluate it with standard networking benchmarks utilizing a 40 Gb/s NIC. We demonstrate that despite being more secure than the safest preexisting usage model, our approach provides up to 5x higher throughput. Additionally, whereas it is inherently less scalable than an IOMMU-less (unprotected) system, our approach incurs only 0%–25% performance degradation in comparison.
    @InProceedings{markuze16-cim,
    author = {Alex Markuze and Adam Morrison and Dan Tsafrir},
    title = {True {IOMMU} protection from {DMA} attacks: when copy is faster than zero copy},
    booktitle = {ACM International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS)},
    year = 2016,
    month = {April},
    pages = {249--262},
    address = {Atlanta, GA}
    }
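
    The copy-based usage model is easy to sketch. The following is our simplification, not the paper's implementation: a permanently mapped shadow buffer receives all device DMAs, and the OS copies the data out, so no map/unmap (and no vulnerability window) appears on the hot path.

      #include <cstddef>
      #include <cstring>

      constexpr size_t kShadowSize = 4096;
      // In a real system this would be the only receive memory mapped in the
      // IOMMU; since it is never unmapped, there is no deferred-unmap window.
      alignas(4096) static unsigned char shadow_rx[kShadowSize];

      // Hypothetical receive path: the device has DMAed 'len' bytes into
      // shadow_rx; copying to the device-invisible target replaces unmap/remap.
      void receive_copy(void* target, size_t len) {
          if (len > kShadowSize) len = kShadowSize;
          memcpy(target, shadow_rx, len);
      }

      int main() {
          unsigned char dst[64];
          receive_copy(dst, sizeof dst);
          return dst[0];  // shadow_rx is zero-initialized, so this returns 0
      }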

  • vRIO: Paravirtual remote I/O
    Yossi Kuperman, Eyal Moscovici, Joel Nider, Razya Ladelsky, Abel Gordon, Dan Tsafrir
    ASPLOS '16: ACM International Conference on Architectural Support for Programming Languages and Operating Systems
    April, 2016, Atlanta, GA, pages 49–65
    Abstract , BibTeX , PDF , Definitive
    The traditional “trap and emulate” I/O paravirtualization model conveniently allows for I/O interposition, yet it inherently incurs costly guest-host context switches. The newer “sidecore” model eliminates this overhead by dedicating host (side)cores to poll the relevant guest memory regions and react accordingly without context switching. But the dedication of sidecores on each host might be wasteful when I/O activity is low, or it might not provide enough computational power when I/O activity is high. We propose to alleviate this problem at rack scale by consolidating the dedicated sidecores spread across several hosts onto one server. The hypervisor is then effectively split into two parts: the local hypervisor that hosts the VMs, and the remote hypervisor that processes their paravirtual I/O. We call this model vRIO—paraVirtual Remote I/O. We find that by increasing the latency somewhat, it provides comparable throughput with fewer sidecores and superior throughput with the same number of sidecores as compared to the state of the art. vRIO additionally constitutes a new, cost-effective way to consolidate I/O devices (on the remote hypervisor) while supporting efficient programmable I/O interposition.
    @InProceedings{kuperman16-vrio,
    author = {Yossi Kuperman and Eyal Moscovici and Joel Nider and Razya Ladelsky and Abel Gordon and Dan Tsafrir},
    title = {vRIO: {Paravirtual} remote {I/O}},
    booktitle = {ACM International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS)},
    year = 2016,
    month = {April},
    pages = {49--65},
    address = {Atlanta, GA}
    }
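
    The sidecore model that vRIO builds on can be miniaturized as two threads and a shared memory word: the “guest” posts a request, and a dedicated polling thread handles it without any trap. This is a toy illustration under our own naming; vRIO additionally moves the polling core to a remote host.

      #include <atomic>
      #include <cstdio>
      #include <thread>

      std::atomic<int> request{0};    // stands in for a slot in a guest I/O ring
      std::atomic<bool> done{false};

      void sidecore_poll() {          // would be pinned to a dedicated core
          while (!done.load(std::memory_order_acquire)) {
              int r = request.exchange(0, std::memory_order_acq_rel);
              if (r) printf("sidecore handled request %d without an exit\n", r);
          }
      }

      int main() {
          std::thread side(sidecore_poll);
          request.store(7, std::memory_order_release);            // "guest" posts I/O
          while (request.load(std::memory_order_acquire) != 0) {} // wait for drain
          done.store(true, std::memory_order_release);
          side.join();
      }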

  • The nom profit-maximizing operating system
    Muli Ben-Yehuda, Orna Agmon Ben-Yehuda, Dan Tsafrir
    VEE '16: ACM SIGPLAN/SIGOPS International Conference on Virtual Execution Environments
    April, 2016, Atlanta, GA, pages 145–160
    Abstract , BibTeX , PDF , Definitive
    In the near future, cloud providers will sell their users virtual machines with CPU, memory, network, and storage resources whose prices constantly change according to market-driven supply and demand conditions. Running traditional operating systems in these virtual machines is a poor fit: traditional operating systems are not aware of changing resource prices and their sole aim is to maximize performance with no consideration of costs. Consequently, they yield low profits.

    We present nom, a profit-maximizing operating system designed for cloud computing platforms with dynamic resource prices. Applications running on nom aim to maximize profits by optimizing simultaneously for performance and resource costs. The nom kernel provides them with direct access to the underlying hardware and full control over their private software stacks. Since nom applications know there is no single “best” software stack, they adapt their stacks' behavior on the fly according to the current price of available resources and their private utility from them, which differs between applications. We show that in addition to achieving up to 3.9x better throughput and up to 9.1x better latency, nom applications yield up to 11.1x higher profits when compared with the same applications running on Linux and OSv.
    @InProceedings{ben-yehuda16-nom,
    author = {Muli Ben-Yehuda and Orna {Agmon Ben-Yehuda} and Dan Tsafrir},
    title = {The nom profit-maximizing operating system},
    booktitle = {ACM SIGPLAN/SIGOPS International Conference on Virtual Execution Environments (VEE)},
    year = 2016,
    month = {April},
    pages = {145--160},
    address = {Atlanta, GA}
    }

  • Bare-metal performance for virtual machines with exitless interrupts
    Nadav Amit, Abel Gordon, Nadav Har'El, Muli Ben-Yehuda, Alex Landau, Assaf Schuster, Dan Tsafrir
    CACM: Communications of the ACM
    January, 2016, pages 108–116, volume 59, number 1, CACM Research Highlight
    (See technical perspective by Steven Hand.)
    Abstract , BibTeX , PDF , Definitive
    Direct device assignment enhances the performance of guest virtual machines by allowing them to communicate with I/O devices without host involvement. But even with device assignment, guests are still unable to approach bare-metal performance, because the host intercepts all interrupts, including those generated by assigned devices to signal to guests the completion of their I/O requests. The host involvement induces multiple unwarranted guest/host context switches, which significantly hamper the performance of I/O intensive workloads. To solve this problem, we present ExitLess Interrupts (ELI), a software-only approach for handling interrupts within guest virtual machines directly and securely. By removing the host from the interrupt handling path, ELI improves the throughput and latency of unmodified, untrusted guests by 1.3x–1.6x, allowing them to reach 97%–100% of bare-metal performance even for the most demanding I/O-intensive workloads.
    @Article{amit16-elij,
    author = {Nadav Amit and Abel Gordon and Nadav Har'El and Muli Ben-Yehuda and Alex Landau and Assaf Schuster and Dan Tsafrir},
    title = {Bare-metal performance for virtual machines with exitless interrupts},
    journal = {Communications of the ACM (CACM)},
    volume = 59,
    number = 1,
    year = 2016,
    month = {January},
    pages = {108--116}
    }

2015

  • Virtual CPU validation
    Nadav Amit, Dan Tsafrir, Assaf Schuster, Ahmad Ayoub, Eran Shlomo
    SOSP '15: ACM Symposium on Operating Systems Principles
    October, 2015, Monterey, CA, pages 311–327
    Abstract , BibTeX , PDF , Definitive
    Testing the hypervisor is important for ensuring the correct operation and security of systems, but it is a hard and challenging task. We observe, however, that the challenge is similar in many respects to that of testing real CPUs. We thus propose to apply the testing environment of CPU vendors to hypervisors. We demonstrate the advantages of our proposal by adapting Intel's testing facility to the Linux KVM hypervisor. We uncover and fix 117 bugs, six of which are security vulnerabilities. We further find four flaws in Intel virtualization technology, causing a disparity between the observable behavior of code running on virtual and bare-metal servers.
    @InProceedings{amit15-hvfuz,
    author = {Nadav Amit and Dan Tsafrir and Assaf Schuster and Ahmad Ayoub and Eran Shlomo},
    title = {Virtual {CPU} validation},
    booktitle = {ACM Symposium on Operating Systems Principles (SOSP)},
    year = 2015,
    month = {October},
    pages = {311--327},
    address = {Monterey, CA}
    }

  • Securing self-virtualizing Ethernet devices
    Igor Smolyar, Muli Ben-Yehuda, Dan Tsafrir
    USENIX Security Symposium
    August, 2015, Washington, D.C., pages 335–350
    Abstract , BibTeX , PDF , Definitive
    Single root I/O virtualization (SRIOV) is a hardware/software interface that allows devices to “self virtualize” and thereby remove the host from the critical I/O path. SRIOV thus brings near bare-metal performance to untrusted guest virtual machines (VMs) in public clouds, enterprise data centers, and high-performance computing setups. We identify a design flaw in current Ethernet SRIOV NIC deployments that enables untrusted VMs to completely control the throughput and latency of other, unrelated VMs. The attack exploits Ethernet “pause” frames, which enable network flow control functionality. We experimentally launch the attack across several NIC models and find that it is effective and highly accurate, with substantial consequences if left unmitigated: (1) to be safe, NIC vendors will have to modify their NICs so as to filter pause frames originating from SRIOV instances; (2) in the meantime, administrators will have to either trust their VMs, or configure their switches to ignore pause frames, thus relinquishing flow control, which might severely degrade networking performance. We present the Virtualization-Aware Network Flow Controller (VANFC), a software-based SRIOV NIC prototype that overcomes the attack. VANFC filters pause frames from malicious virtual machines without any loss of performance, while keeping SRIOV and Ethernet flow control hardware/software interfaces intact.
    @InProceedings{smolyar15-sriovsec,
    author = {Igor Smolyar and Muli Ben-Yehuda and Dan Tsafrir},
    title = {Securing self-virtualizing {Ethernet} devices},
    booktitle = {USENIX Security Symposium},
    year = 2015,
    month = {August},
    pages = {335--350},
    address = {Washington, D.C.}
    }
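
    The pause frames at the heart of the attack are plain MAC control frames, so the filtering VANFC performs can be sketched in a few lines (our simplification of the idea, not the prototype's code): EtherType 0x8808 marks an IEEE 802.3x MAC control frame, and opcode 0x0001 means “pause”.

      #include <cstddef>
      #include <cstdint>

      // Identify IEEE 802.3x pause frames: EtherType 0x8808 (bytes 12-13 of
      // the untagged frame) and MAC-control opcode 0x0001 (bytes 14-15).
      bool is_pause_frame(const uint8_t* frame, size_t len) {
          if (len < 16) return false;
          uint16_t ethertype = uint16_t(frame[12]) << 8 | frame[13];
          uint16_t opcode    = uint16_t(frame[14]) << 8 | frame[15];
          return ethertype == 0x8808 && opcode == 0x0001;
      }

      // Hypothetical ingress hook: forward everything an untrusted SRIOV
      // instance sends, except pause frames.
      bool allow_from_untrusted_vf(const uint8_t* frame, size_t len) {
          return !is_pause_frame(frame, len);
      }

      int main() {
          uint8_t pause[60] = {0};
          pause[12] = 0x88; pause[13] = 0x08; pause[14] = 0x00; pause[15] = 0x01;
          return allow_from_untrusted_vf(pause, sizeof pause) ? 1 : 0;  // 0: filtered
      }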

  • Utilizing the IOMMU scalably
    Omer Peleg, Adam Morrison, Benjamin Serebrin, Dan Tsafrir
    ATC '15: USENIX Annual Technical Conference
    July, 2015, Santa Clara, CA, pages 549–562
    Abstract , BibTeX , PDF , Definitive
    IOMMUs provided by modern hardware allow the OS to enforce memory protection controls on the DMA operations of its I/O devices. An IOMMU translation management design must scalably handle frequent concurrent updates of IOMMU translations made by multiple cores, which occur in high throughput I/O workloads such as multi-Gb/s networking. Today, however, OSes experience performance meltdowns when using the IOMMU in such workloads.

    This paper explores scalable IOMMU management designs and addresses the two main bottlenecks we find in current OSes: (1) assignment of I/O virtual addresses (IOVAs), and (2) management of the IOMMU's TLB.

    We propose three approaches for scalable IOVA assignment: (1) dynamic identity mappings, which eschew IOVA allocation altogether, (2) allocating IOVAs using the kernel's kmalloc, and (3) per-core caching of IOVAs allocated by a globally-locked IOVA allocator. We further describe a scalable IOMMU TLB management scheme that is compatible with all these approaches.

    Evaluation of our designs under Linux shows that (1) they achieve 88.5%–100% of the performance obtained without an IOMMU, (2) they achieve similar latency to that obtained without an IOMMU, (3) scalable IOVA allocation and dynamic identity mappings perform comparably, and (4) kmalloc provides a simple solution with high performance, but can suffer from unbounded page table blowup.
    @InProceedings{peleg15-eskimo,
    author = {Omer Peleg and Adam Morrison and Benjamin Serebrin and Dan Tsafrir},
    title = {Utilizing the {IOMMU} scalably},
    booktitle = {USENIX Annual Technical Conference (ATC)},
    year = 2015,
    month = {July},
    pages = {549--562},
    address = {Santa Clara, CA}
    }
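
    Of the three designs, per-core IOVA caching is the simplest to sketch. Below is our toy rendering (the names and the bump allocator are our assumptions): each core allocates from a private cache and takes the global lock only when the cache runs dry.

      #include <cstdint>
      #include <mutex>
      #include <vector>

      class GlobalIovaAllocator {       // the contended, globally locked path
          std::mutex lock_;
          uint64_t next_ = 0x1000;      // trivial bump allocator stand-in
      public:
          uint64_t alloc(uint64_t pages) {
              std::lock_guard<std::mutex> g(lock_);
              uint64_t iova = next_;
              next_ += pages * 4096;
              return iova;
          }
      };

      GlobalIovaAllocator global_allocator;
      thread_local std::vector<uint64_t> iova_cache;  // per-core magazine

      uint64_t iova_alloc() {
          if (!iova_cache.empty()) {    // common case: no lock taken
              uint64_t iova = iova_cache.back();
              iova_cache.pop_back();
              return iova;
          }
          return global_allocator.alloc(1);
      }

      void iova_free(uint64_t iova) { iova_cache.push_back(iova); }

      int main() {
          uint64_t a = iova_alloc();
          iova_free(a);
          return iova_alloc() == a ? 0 : 1;  // freed IOVA is reused locally
      }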

  • rIOMMU: Efficient IOMMU for I/O devices that employ ring buffers
    Moshe Malka, Nadav Amit, Muli Ben-Yehuda, Dan Tsafrir
    ASPLOS '15: ACM International Conference on Architectural Support for Programming Languages and Operating Systems
    March, 2015, Istanbul, Turkey, pages 355–368
    Abstract , BibTeX , PDF , Definitive
    The IOMMU allows the OS to encapsulate I/O devices in their own virtual memory spaces, thus restricting their DMAs to specific memory pages. The OS uses the IOMMU to protect itself against buggy drivers and malicious/errant devices. But the added protection comes at a cost, degrading the throughput of I/O-intensive workloads by up to an order of magnitude. This cost has motivated system designers to trade off some safety for performance, e.g., by leaving stale information in the IOTLB for a while so as to amortize costly invalidations.

    We observe that high-bandwidth devices—like network and PCIe SSD controllers—interact with the OS via circular ring buffers that induce a sequential, predictable workload. We design a ring IOMMU (rIOMMU) that leverages this characteristic by replacing the virtual memory page table hierarchy with a circular, flat table. A flat table is adequately supported by exactly one IOTLB entry, making every new translation an implicit invalidation of the former and thus requiring explicit invalidations only at the end of I/O bursts. Using standard networking benchmarks, we show that rIOMMU provides up to 7.56x higher throughput relative to the baseline IOMMU, and that it is within 0.77–1.00x the throughput of a system without IOMMU protection.
    @InProceedings{malka15-riommu,
    author = {Moshe Malka and Nadav Amit and Muli Ben-Yehuda and Dan Tsafrir},
    title = {rIOMMU: {Efficient} {IOMMU} for {I/O} devices that employ ring buffers},
    booktitle = {ACM International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS)},
    year = 2015,
    month = {March},
    pages = {355--368},
    address = {Istanbul, Turkey}
    }
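
    The flat, circular table is simple enough to sketch directly (a toy model under our own naming, with the single-entry IOTLB elided): translations are produced and consumed in ring order, so each new entry implicitly retires its predecessor, and explicit invalidations are needed only when a burst ends.

      #include <array>
      #include <cstdint>

      struct RingTranslation { uint64_t iova, phys; };

      template <size_t N>
      struct RingIommu {
          std::array<RingTranslation, N> table{};
          size_t head = 0;  // next slot the device will consume

          // OS side: publish the translation for the next DMA in ring order.
          void map_next(uint64_t iova, uint64_t phys) {
              table[head] = RingTranslation{iova, phys};
              head = (head + 1) % N;
          }
          // "Hardware" side: a flat lookup, with no page-walk hierarchy.
          uint64_t translate(size_t slot, uint64_t iova) const {
              return table[slot].iova == iova ? table[slot].phys : ~0ULL;  // fault
          }
      };

      int main() {
          RingIommu<256> ri;
          ri.map_next(0x2000, 0x90000);
          return ri.translate(0, 0x2000) == 0x90000 ? 0 : 1;
      }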

  • Efficient intra-operating system protection against harmful DMAs
    Moshe Malka, Nadav Amit, Dan Tsafrir
    FAST '15: USENIX Conference on File and Storage Technologies
    February, 2015, Santa Clara, CA, pages 29–44
    Abstract , BibTeX , PDF , Definitive
    Operating systems can defend themselves against misbehaving I/O devices and drivers by employing intra-OS protection. With “strict” intra-OS protection, the OS uses the IOMMU to map each DMA buffer immediately before the DMA occurs and to unmap it immediately after. Strict protection is costly due to IOMMU-related hardware overheads, motivating “deferred” intra-OS protection, which trades off some safety for performance.

    We investigate the Linux intra-OS protection mapping layer and discover that hardware overheads are not exclusively to blame for its high cost. Rather, the cost is amplified by the I/O virtual address (IOVA) allocator, which regularly induces linear complexity. We find that the nature of IOVA allocation requests is inherently simple and constrained due to the manner by which I/O devices are used, allowing us to deliver constant time complexity with a compact, easy-to-implement optimization. Our optimization improves the throughput of standard benchmarks by up to 5.5x. It delivers strict protection with performance comparable to that of the baseline deferred protection.

    To generalize our case that OSes drive the IOMMU with suboptimal software, we additionally investigate the FreeBSD mapping layer and obtain similar findings.
    @InProceedings{malka15-eiovar,
    author = {Moshe Malka and Nadav Amit and Dan Tsafrir},
    title = {Efficient intra-operating system protection against harmful {DMAs}},
    booktitle = {USENIX Conference on File and Storage Technologies (FAST)},
    year = 2015,
    month = {February},
    pages = {29--44},
    address = {Santa Clara, CA}
    }
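
    The constant-time fast path exploits the observation that drivers free and reallocate same-sized IOVA ranges over and over. Our toy rendering below (the names and the bump/tree stand-ins are assumptions) caches freed IOVAs on a list so the common reallocation never touches the slow allocator.

      #include <cstdint>
      #include <map>
      #include <vector>

      std::map<uint64_t, uint64_t> slow_tree;   // stand-in for the tree-based allocator
      std::vector<uint64_t> freed_cache;        // recently released IOVAs
      uint64_t next_iova = 0x1000;

      uint64_t iova_alloc_page() {
          if (!freed_cache.empty()) {           // O(1) fast path
              uint64_t iova = freed_cache.back();
              freed_cache.pop_back();
              return iova;
          }
          uint64_t iova = next_iova;            // slow path: tree insertion
          next_iova += 4096;
          slow_tree[iova] = 4096;
          return iova;
      }

      void iova_free_page(uint64_t iova) { freed_cache.push_back(iova); }

      int main() {
          uint64_t a = iova_alloc_page();
          iova_free_page(a);
          return iova_alloc_page() == a ? 0 : 1;  // reuse without a tree search
      }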

2014

  • Experience with using the Parallel Workloads Archive
    Dror G. Feitelson, Dan Tsafrir, David Krakov
    JPDC: Journal of Parallel and Distributed Computing
    October, 2014, pages 2967–2982, volume 75, number 10
    Abstract , BibTeX , PDF , Definitive
    Workload data is an invaluable resource for the design and evaluation of computer systems. However, data quality is always an issue. Therefore the experience gained when using data is also a precious resource, and both the data and the experience should be recorded and made available to other researchers. We demonstrate this for the Parallel Workloads Archive, which records job-level usage data from many large-scale parallel supercomputers, clusters, and grids. Data quality problems encountered include missing data, inconsistent data, erroneous data, system configuration changes during the logging period, and nonrepresentative user behavior. Some of these may be countered by filtering out the problematic data items. In other cases, being cognizant of the problems may affect the decision of which datasets to use.
    @Article{feitelson14-pwaexp,
    author = {Dror G. Feitelson and Dan Tsafrir and David Krakov},
    title = {Experience with using the {Parallel} {Workloads} {Archive}},
    journal = {Journal of Parallel and Distributed Computing (JPDC)},
    volume = 75,
    number = 10,
    year = 2014,
    month = {October},
    pages = {2967--2982}
    }

  • The rise of RaaS: resource-as-a-service cloud
    Orna Agmon Ben-Yehuda, Muli Ben-Yehuda, Assaf Schuster, Dan Tsafrir
    CACM: Communications of the ACM
    July, 2014, volume 57, number 7
    Abstract , BibTeX
    Over the next few years, a new model of buying and selling cloud computing resources will evolve. Instead of providers exclusively selling server-equivalent virtual machines for relatively long periods of time (as done in today's IaaS clouds), they will increasingly sell to clients individual resources (such as CPU, memory, and I/O resources), reprice them and adjust their quantity every few seconds. Instead of allocating resources on a First-come First-served basis, they will do so in accordance with market-driven supply-and-demand conditions. We term this nascent economic model of cloud computing the Resource-as-a-Service (RaaS) cloud, and we argue that its rise is the likely culmination of recent trends in the construction of Infrastructure-as-a-Service (IaaS) clouds and of the economic forces operating on both providers and clients.
    @Article{ben-yehuda14-raasj,
    author = {Orna {Agmon Ben-Yehuda} and Muli Ben-Yehuda and Assaf Schuster and Dan Tsafrir},
    title = {The rise of {RaaS:} resource-as-a-service cloud},
    journal = {Communications of the ACM (CACM)},
    volume = 57,
    number = 7,
    year = 2014,
    month = {July}
    }

  • VSwapper: A memory swapper for virtualized environments
    Nadav Amit, Dan Tsafrir, Assaf Schuster
    ASPLOS '14: ACM International Conference on Architectural Support for Programming Languages and Operating Systems
    March, 2014, Salt Lake City, UT, pages 349–366, HiPEAC Paper Award
    Abstract , BibTeX , PDF , Definitive
    The number of guest virtual machines that can be consolidated on one physical host is typically limited by the memory size, motivating memory overcommitment. Guests are given a choice to either install a “balloon” driver to coordinate the overcommitment activity, or to experience degraded performance due to uncooperative swapping. Ballooning, however, is not a complete solution, as hosts must still fall back on uncooperative swapping in various circumstances. Additionally, ballooning takes time to accommodate change, and so guests might experience degraded performance under changing conditions.

    Our goal is to improve the performance of hosts when they fall back on uncooperative swapping and/or operate under changing load conditions. We carefully isolate and characterize the causes for the associated poor performance, which include various types of superfluous swap operations, decayed swap file sequentiality, and ineffective prefetch decisions upon page faults. We address these problems by implementing VSwapper, a guest-agnostic memory swapper for virtual environments that allows efficient, uncooperative overcommitment. With inactive ballooning, VSwapper yields up to an order of magnitude performance improvement. Combined with ballooning, VSwapper can achieve up to double the performance under changing load conditions.
    @InProceedings{amit14-vswap,
    author = {Nadav Amit and Dan Tsafrir and Assaf Schuster},
    title = {VSwapper: {A} memory swapper for virtualized environments},
    booktitle = {ACM International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS)},
    year = 2014,
    month = {March},
    pages = {349--366},
    address = {Salt Lake City, UT}
    }

  • Memory swapper for virtualized environments
    Nadav Amit, Dan Tsafrir, Assaf Schuster
    Pending Patent Application Number: PCT/IL2015/050172
    February, 2014
    BibTeX , Html
    @Misc{amit14-vswappat,
    author = {Nadav Amit and Dan Tsafrir and Assaf Schuster},
    title = {Memory swapper for virtualized environments},
    howpublished = {Pending Patent Application Number: PCT/IL2015/050172},
    year = 2014,
    month = {February}
    }

2013

  • Deconstructing Amazon EC2 spot instance pricing
    Orna Agmon Ben-Yehuda, Muli Ben-Yehuda, Assaf Schuster, Dan Tsafrir
    TEAC: ACM Transactions on Economics and Computation
    September, 2013, pages 16:1–16:20, volume 1, number 3
    Abstract , BibTeX , Definitive
    Cloud providers possessing large quantities of spare capacity must either incentivize clients to purchase it or suffer losses. Amazon is the first cloud provider to address this challenge, by allowing clients to bid on spare capacity and by granting resources to bidders while their bids exceed a periodically changing spot price. Amazon publicizes the spot price but does not disclose how it is determined. By analyzing the spot price histories of Amazon's EC2 cloud, we reverse engineer how prices are set and construct a model that generates prices consistent with existing price traces. Our findings suggest that usually prices are not market-driven, as sometimes previously assumed. Rather, they are likely to be generated most of the time at random from within a tight price range via a dynamic hidden reserve price mechanism. Our model could help clients make informed bids, cloud providers design profitable systems, and researchers design pricing algorithms.
    @Article{ben-yehuda13-spotj,
    author = {Orna {Agmon Ben-Yehuda} and Muli Ben-Yehuda and Assaf Schuster and Dan Tsafrir},
    title = {Deconstructing {Amazon} {EC2} spot instance pricing},
    journal = {ACM Transactions on Economics and Computation (TEAC)},
    volume = 1,
    number = 3,
    year = 2013,
    month = {September},
    pages = {16:1--16:20}
    }
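
    The paper's conclusion lends itself to a tiny simulation. The generator below is purely illustrative, with the band limits, seed, and update policy being our assumptions; it merely shows what “random prices from within a tight band via a hidden reserve price” means.

      #include <cstdio>
      #include <random>

      int main() {
          std::mt19937 rng(42);
          const double floor_price = 0.030, ceiling_price = 0.050;  // assumed band
          std::uniform_real_distribution<double> band(floor_price, ceiling_price);
          for (int epoch = 0; epoch < 5; epoch++) {
              // A dynamic hidden reserve price drawn from the band; when demand
              // is low, the published spot price simply tracks the reserve.
              double reserve = band(rng);
              printf("epoch %d: spot price = $%.4f\n", epoch, reserve);
          }
      }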

  • The nonkernel: a kernel designed for the cloud
    Muli Ben-Yehuda, Omer Peleg, Orna Agmon Ben-Yehuda, Igor Smolyar, Dan Tsafrir
    APSYS '13: ACM Asia-Pacific Workshop on Systems
    July, 2013, Singapore, pages 4:1–4:7
    Abstract , BibTeX , PDF , Definitive
    Infrastructure-as-a-Service (IaaS) cloud computing is causing a fundamental shift in the way computing resources are bought, sold, and used. We foresee a future whereby every CPU cycle, every memory word, and every byte of network bandwidth in the cloud would have a constantly changing market-driven price. We argue that, in such an environment, the underlying resources should be exposed directly to applications without kernel or hypervisor involvement. We propose the nonkernel, an architecture for operating system kernel construction designed for such cloud computing platforms. A nonkernel uses modern architectural support for machine virtualization to securely provide unprivileged user programs with pervasive access to the underlying resources. We motivate the need for the nonkernel, we contrast it against its predecessor the exokernel, and we outline how one could go about building a nonkernel operating system.
    @InProceedings{ben-yehuda13-nomish,
    author = {Muli Ben-Yehuda and Omer Peleg and Orna {Agmon Ben-Yehuda} and Igor Smolyar and Dan Tsafrir},
    title = {The nonkernel: a kernel designed for the cloud},
    booktitle = {ACM Asia-Pacific Workshop on Systems (APSYS)},
    year = 2013,
    month = {July},
    pages = {4:1--4:7},
    address = {Singapore}
    }

  • Using disk add-ons to withstand simultaneous disk failures with fewer replicas
    Eitan Rosenfeld, Nadav Amit, Dan Tsafrir
    WIVOSCA '13: Workshop on the Interaction amongst Virtualization, Operating Systems and Computer Architecture
    June, 2013, Tel Aviv, Israel
    Abstract , BibTeX , PDF , Definitive
    Contemporary storage systems that utilize replication often maintain more than two replicas of each data item, reducing the risk of permanent data loss due to simultaneous disk failures. The price of the additional copies is smaller usable storage space, increased network traffic, and higher power consumption. We propose to alleviate this problem with SimFail, a storage system that maintains only two replicas and utilizes per-disk “add-ons”, which are simple hardware devices equipped with relatively small memory that proxy disk I/O traffic. SimFail can significantly reduce the risk of data loss due to temporally adjacent disk failures by quickly copying at-risk data from disks to their add-ons. SimFail can further eliminate the risk entirely by maintaining local parity information of disks on their add-ons (such that each add-on holds the parity of its own disk's data chunks). We postulate that SimFail may open the door for cloud providers to reduce the number of data replicas they use from three to two.
    @InProceedings{rosenfeld13-simfail,
    author = {Eitan Rosenfeld and Nadav Amit and Dan Tsafrir},
    title = {Using disk add-ons to withstand simultaneous disk failures with fewer replicas},
    booktitle = {Workshop on the Interaction amongst Virtualization, Operating Systems and Computer Architecture (WIVOSCA)},
    year = 2013,
    month = {June},
    address = {Tel Aviv, Israel}
    }
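
    The add-on's local parity is ordinary XOR parity. The toy below (our illustration) shows the invariant that makes two replicas suffice: a lost chunk equals the parity XORed with the surviving chunks.

      #include <array>
      #include <cstdint>

      constexpr size_t kChunk = 8;
      using Chunk = std::array<uint8_t, kChunk>;

      void xor_into(Chunk& parity, const Chunk& data) {
          for (size_t i = 0; i < kChunk; i++) parity[i] ^= data[i];
      }

      int main() {
          Chunk c0{{1, 2, 3, 4, 5, 6, 7, 8}}, c1{{9, 8, 7, 6, 5, 4, 3, 2}}, parity{};
          xor_into(parity, c0);                // add-on updates parity per chunk
          xor_into(parity, c1);
          Chunk rebuilt = parity;              // suppose c0 is lost with its disk:
          xor_into(rebuilt, c1);               // parity XOR c1 recovers it
          return rebuilt == c0 ? 0 : 1;
      }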

  • Using disk add-ons to withstand simultaneous disk failures with fewer replicas
    Dan Tsafrir, Eitan Rosenfeld
    Pending Patent Application Number: PCT/IL2014/050103
    January, 2013
    BibTeX , Html
    @Misc{tsafrir13-raidppat,
    author = {Dan Tsafrir and Eitan Rosenfeld},
    title = {Using disk add-ons to withstand simultaneous disk failures with fewer replicas},
    howpublished = {Pending Patent Application Number: PCT/IL2014/050103},
    year = 2013,
    month = {January}
    }

2012

  • The resource-as-a-service (RaaS) cloud
    Orna Agmon Ben-Yehuda, Muli Ben-Yehuda, Assaf Schuster, Dan Tsafrir
    HotCloud '12: USENIX Workshop on Hot Topics in Cloud Computing
    June, 2012, Boston, MA
    Abstract , BibTeX , PDF , Definitive
    Over the next few years, a new model of buying and selling cloud computing resources will evolve. Instead of providers exclusively selling server equivalent virtual machines for relatively long periods of time (as done in today's IaaS clouds), providers will increasingly sell individual resources (such as CPU, memory, and I/O resources) for a few seconds at a time. We term this nascent economic model of cloud computing the Resource-as-a-Service (RaaS) cloud, and we argue that its rise is the likely culmination of recent trends in the construction of IaaS clouds and of the economic forces operating on both providers and clients.
    @InProceedings{ben-yehuda12-raas,
    author = {Orna {Agmon Ben-Yehuda} and Muli Ben-Yehuda and Assaf Schuster and Dan Tsafrir},
    title = {The resource-as-a-service {(RaaS)} cloud},
    booktitle = {USENIX Workshop on Hot Topics in Cloud Computing (HotCloud)},
    year = 2012,
    month = {June},
    address = {Boston, MA}
    }

  • Feasibility of mutable replay for automated regression testing of security updates
    Ilia Kravets, Dan Tsafrir
    RESoLVE '12: Workshop on Runtime Environments, Systems, Layering and Virtualized Environments
    March, 2012, London, UK
    Abstract , BibTeX , PDF , Definitive
    Applying software patches runs the risk of somehow breaking the system, so organizations often prefer to avoid updates or to defer them until extensive testing is performed. But unpatched software might be compromised, and extensive testing might be tedious and time- and resource-consuming. We consider alleviating the problem by utilizing a new type of “mutable” deterministic execution replay, which would allow us to (1) run the software before and after the patch and to (2) compare the behavior of the two versions in a manner that tolerates legitimate differences that arise from security patches. The potential advantages of this approach are that: it is automatic; it requires no programming skills; it is language-agnostic; it works without source code; and it can be applied while the system is running and serving real production workloads. In this work, we do not implement mutable replay; rather, we assess its feasibility. We do so by systematically studying about 200 security patches applied to the most popular Debian/Linux packages and by considering how mutable replay should be implemented so as to tolerate execution differences that such patches generate.
    @InProceedings{kravets12-mreplay,
    author = {Ilia Kravets and Dan Tsafrir},
    title = {Feasibility of mutable replay for automated regression testing of security updates},
    booktitle = {Workshop on Runtime Environments, Systems, Layering and Virtualized Environments (RESoLVE)},
    year = 2012,
    month = {March},
    address = {London, UK}
    }

  • ELI: bare-metal performance for I/O virtualization
    Abel Gordon, Nadav Amit, Nadav Har'El, Muli Ben-Yehuda, Alex Landau, Assaf Schuster, Dan Tsafrir
    ASPLOS '12: ACM International Conference on Architectural Support for Programming Languages and Operating Systems
    March, 2012, London, UK, pages 411–422, Pat Goldberg Memorial Best Paper Award, HiPEAC Paper Award
    Abstract , BibTeX , PDF , Definitive
    Direct device assignment enhances the performance of guest virtual machines by allowing them to communicate with I/O devices without host involvement. But even with device assignment, guests are still unable to approach bare-metal performance, because the host intercepts all interrupts, including those interrupts generated by assigned devices to signal to guests the completion of their I/O requests. The host involvement induces multiple unwarranted guest/host context switches, which significantly hamper the performance of I/O intensive workloads. To solve this problem, we present ELI (ExitLess Interrupts), a software-only approach for handling interrupts within guest virtual machines directly and securely. By removing the host from the interrupt handling path, ELI manages to improve the throughput and latency of unmodified, untrusted guests by 1.3x–1.6x, allowing them to reach 97%–100% of bare-metal performance even for the most demanding I/O-intensive workloads.
    @InProceedings{gordon12-eli,
    author = {Abel Gordon and Nadav Amit and Nadav Har'El and Muli Ben-Yehuda and Alex Landau and Assaf Schuster and Dan Tsafrir},
    title = {ELI: bare-metal performance for {I/O} virtualization},
    booktitle = {ACM International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS)},
    year = 2012,
    month = {March},
    pages = {411--422},
    address = {London, UK}
    }

2011

  • Deconstructing Amazon EC2 spot instance pricing
    Orna Agmon Ben-Yehuda, Muli Ben-Yehuda, Assaf Schuster, Dan Tsafrir
    CloudCom '11: IEEE International Conference on Cloud Computing Technology and Science
    November, 2011, Athens, Greece
    Abstract , BibTeX , PDF
    Cloud providers possessing large quantities of spare capacity must either incentivize clients to purchase it or suffer losses. Amazon is the first cloud provider to address this challenge, by allowing clients to bid on spare capacity and by granting resources to bidders while their bids exceed a periodically changing spot price. Amazon publicizes the spot price but does not disclose how it is determined. By analyzing the spot price histories of Amazon's EC2 cloud, we reverse engineer how prices are set and construct a model that generates prices consistent with existing price traces. We find that prices are usually not market-driven as sometimes previously assumed. Rather, they are typically generated at random from within a tight price interval via a dynamic hidden reserve price. Our model could help clients make informed bids, cloud providers design profitable systems, and researchers design pricing algorithms.
    @InProceedings{ben-yehuda11-spotprice,
    author = {Orna {Agmon Ben-Yehuda} and Muli Ben-Yehuda and Assaf Schuster and Dan Tsafrir},
    title = {Deconstructing {Amazon} {EC2} spot instance pricing},
    booktitle = {IEEE International Conference on Cloud Computing Technology and Science (CloudCom)},
    year = 2011,
    month = {November},
    address = {Athens, Greece}
    }

  • vIOMMU: efficient IOMMU emulation
    Nadav Amit, Muli Ben-Yehuda, Dan Tsafrir, Assaf Schuster
    ATC '11: USENIX Annual Technical Conference
    June, 2011, Portland, Oregon, pages 73–86
    Abstract , BibTeX , PDF , Definitive
    Direct device assignment, where a guest virtual machine directly interacts with an I/O device without host intervention, is appealing, because it allows an unmodified (non-hypervisor-aware) guest to achieve near-native performance. But device assignment for unmodified guests suffers from two serious deficiencies: (1) it requires pinning of all the guest's pages, thereby disallowing memory overcommitment, and (2) it exposes the guest's memory to buggy device drivers.

    We solve these problems by designing, implementing, and exposing an emulated IOMMU (vIOMMU) to the unmodified guest. We employ two novel optimizations to make vIOMMU perform well: (1) waiting a few milliseconds before tearing down an IOMMU mapping in the hope it will be immediately reused (“optimistic teardown”), and (2) running the vIOMMU on a sidecore, and thereby enabling for the first time the use of a sidecore by unmodified guests. Both optimizations are highly effective in isolation; the former allows bare-metal to achieve 100% of the line rate, and the combination of the two allows an unmodified guest to do the same.
    @InProceedings{amit11-viommu,
    author = {Nadav Amit and Muli Ben-Yehuda and Dan Tsafrir and Assaf Schuster},
    title = {vIOMMU: efficient {IOMMU} emulation},
    booktitle = {USENIX Annual Technical Conference (ATC)},
    year = 2011,
    month = {June},
    pages = {73--86},
    address = {Portland, Oregon}
    }
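
    “Optimistic teardown” amounts to deferring unmaps for a short grace period so an immediate remap of the same buffer is free. A minimal sketch, with the grace period, key, and data structure being our assumptions:

      #include <chrono>
      #include <cstdint>
      #include <unordered_map>

      using Clock = std::chrono::steady_clock;
      struct Mapping { uint64_t iova; Clock::time_point last_use; };
      std::unordered_map<uint64_t, Mapping> live;   // keyed by buffer address
      constexpr auto kGrace = std::chrono::milliseconds(10);  // assumed value

      uint64_t map_for_dma(uint64_t buf, uint64_t fresh_iova) {
          auto it = live.find(buf);
          if (it != live.end()) return it->second.iova;  // reuse: no IOMMU work
          live[buf] = Mapping{fresh_iova, Clock::now()};
          return fresh_iova;
      }

      void unmap_after_dma(uint64_t buf) {          // defer, do not tear down
          auto it = live.find(buf);
          if (it != live.end()) it->second.last_use = Clock::now();
      }

      void reap_expired() {                         // periodic real teardown
          auto now = Clock::now();
          for (auto it = live.begin(); it != live.end();)
              it = (now - it->second.last_use > kGrace) ? live.erase(it) : ++it;
      }

      int main() {
          map_for_dma(0x1000, 0xA000);
          unmap_after_dma(0x1000);
          return map_for_dma(0x1000, 0xB000) == 0xA000 ? 0 : 1;  // cheap remap
      }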

  • A method for guaranteeing program correctness using fine-grained hardware speculative execution
    Dan Tsafrir, Robert W. Wisniewski
    Patent Number: US 9195550 B2 (granted: Nov 2015)
    February, 2011
    BibTeX , Html
    @Misc{tsafrir11-specupat,
    author = {Dan Tsafrir and Robert W. Wisniewski},
    title = {{A} method for guaranteeing program correctness using fine-grained hardware speculative execution},
    howpublished = {Patent Number: US 9195550 B2 (granted: Nov 2015)},
    year = 2011,
    month = {February}
    }

2010

  • Using inaccurate estimates accurately
    Dan Tsafrir
    JSSPP '10: Workshop on Job Scheduling Strategies for Parallel Processing
    April, 2010, Atlanta, Georgia, pages 208–221, LNCS Volume 6253, keynote talk
    Abstract , BibTeX , PDF , Definitive
    Job schedulers improve the system utilization by requiring users to estimate how long their jobs will run and by using this information to better pack (or “backfill”) the jobs. But, surprisingly, many studies find that deliberately making estimates less accurate boosts (or does not affect) the performance, which helps explain why production systems still exclusively rely on notoriously inaccurate estimates.

    We prove these studies wrong by showing their methodology is erroneous. The studies model an estimate e as being correlated with r·F (where r is the runtime of the associated job, F is some "badness" factor, and bigger Fs imply increased inaccuracy). We show this model is invalid, because: (1) it conveys too much information to the scheduler; (2) it induces favoritism of short jobs; and (3) it is inherently different than real user inaccuracy, which associates 90% of the jobs with merely 20 estimate values, hindering the scheduler's ability to backfill.

    We conclude that researchers must stop using multiples of runtimes as estimates, or else their results would likely be invalid. We develop (and propose to use) a realistic model that preserves the estimates' modality and allows one to soundly simulate increased inaccuracy by, e.g., associating more jobs with the maximal runtime allowed (an always-popular estimate, which prevents backfilling).
    @InCollection{tsafrir10-keynote,
    author = {Dan Tsafrir},
    title = {Using inaccurate estimates accurately},
    booktitle = {Workshop on Job Scheduling Strategies for Parallel Processing (JSSPP)},
    year = 2010,
    month = {April},
    pages = {208--221},
    address = {Atlanta, Georgia},
    note = {LNCS Volume 6253}
    }
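
    The difference between the flawed and the realistic model fits in a few lines. In the sketch below (the canonical values and the factor F = 3 are our assumptions), the flawed e = r·F estimate still leaks the runtime, whereas the modal estimate collapses many jobs onto a handful of values:

      #include <algorithm>
      #include <cstdio>
      #include <random>
      #include <vector>

      int main() {
          std::mt19937 rng(1);
          // A handful of canonical estimates, e.g. 15 min, 1 h, 4 h, 18 h (seconds):
          std::vector<double> canonical = {900, 3600, 14400, 64800};
          std::uniform_real_distribution<double> runtime(1.0, 14400.0);
          for (int j = 0; j < 5; j++) {
              double r = runtime(rng);
              double flawed = r * 3.0;  // e = r*F; the scheduler can recover r = e/F
              // Realistic users round up to a canonical value, losing information:
              double modal = *std::lower_bound(canonical.begin(), canonical.end(), r);
              printf("runtime %6.0fs  flawed %7.0fs  modal %6.0fs\n", r, flawed, modal);
          }
      }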

2009

  • SCARY iterator assignment and initialization revision 1
    Robert Klarer, Bjarne Stroustrup, Dan Tsafrir, Michael Wong
    ISO/IEC C++ Standards Committee Paper WG21/N2980=09-0170
    October, 2009, Santa Cruz, California
    Abstract , BibTeX , PDF , Definitive
    We propose a requirement that a standard container's iterator types have no dependency on any type argument apart from the container's value_type, difference_type, pointer type, and const_pointer type. In particular, the types of a standard container's iterators should not depend on the container's key_compare, hasher, key_equal, or allocator types.
    @InProceedings{klarer09-scary-rev1,
    author = {Robert Klarer and Bjarne Stroustrup and Dan Tsafrir and Michael Wong},
    title = {{SCARY} iterator assignment and initialization revision 1},
    booktitle = {ISO/IEC C++ Standards Committee Paper WG21/N2980=09-0170},
    year = 2009,
    month = {October},
    address = {Santa Cruz, California}
    }
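
    The requirement is easy to probe in code. The snippet below (ours) merely tests whether a given standard library already has the proposed property for std::set; the proposal is what would make the outcome guaranteed rather than implementation-dependent.

      #include <cstdio>
      #include <set>
      #include <type_traits>

      struct Greater { bool operator()(int a, int b) const { return a > b; } };

      int main() {
          // Under the proposal these two types must be identical, because the
          // iterator may depend on value_type but not on the comparator.
          using It1 = std::set<int>::iterator;           // default comparator
          using It2 = std::set<int, Greater>::iterator;  // custom comparator
          printf("iterator %s on the comparator\n",
                 std::is_same<It1, It2>::value ? "does not depend" : "depends");
      }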

  • Minimizing dependencies within generic classes for faster and smaller programs
    Dan Tsafrir, Robert W. Wisniewski, David F. Bacon, Bjarne Stroustrup
    OOPSLA '09: ACM SIGPLAN Conference on Object-Oriented Programming Systems, Languages, and Applications
    October, 2009, Orlando, Florida, pages 425–444
    Abstract , BibTeX , PDF , Definitive
    Generic classes can be used to improve performance by allowing compile-time polymorphism. But the applicability of compile-time polymorphism is narrower than that of runtime polymorphism, and it might bloat the object code. We advocate a programming principle whereby a generic class should be implemented in a way that minimizes the dependencies between its members (nested types, methods) and its generic type parameters. Conforming to this principle (1) reduces the bloat and (2) gives rise to a previously unconceived manner of using the language that expands the applicability of compile-time polymorphism to a wider range of problems. Our contribution is thus a programming technique that generates faster and smaller programs. We apply our ideas to GCC's STL containers and iterators, and we demonstrate notable speedups and reduction in object code size (real application runs 1.2x to 2.1x faster and STL code is 1x to 25x smaller). We conclude that standard generic APIs (like STL) should be amended to reflect the proposed principle in the interest of efficiency and compactness. Such modifications will not break old code, simply increase flexibility. Our findings apply to languages like C++, C#, and D, which realize generic programming through multiple instantiations.
    @InProceedings{tsafrir09-inner,
    author = {Dan Tsafrir and Robert W. Wisniewski and David F. Bacon and Bjarne Stroustrup},
    title = {Minimizing dependencies within generic classes for faster and smaller programs},
    booktitle = {ACM SIGPLAN Conference on Object-Oriented Programming Systems, Languages, and Applications (OOPSLA)},
    year = 2009,
    month = {October},
    pages = {425--444},
    address = {Orlando, Florida}
    }
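
    The principle can be shown with a toy container (our example, not the paper's code): members that depend only on T are hoisted into a base, so instantiations that differ only in a policy parameter share the hoisted code, shrinking object size and making the iterators interchangeable.

      #include <cstdio>

      template <typename T>
      struct stack_base {               // depends on T only: instantiated once per T
          struct iterator {
              T* p;
              T& operator*() const { return *p; }
          };
          T data[16];
          int top = 0;
          iterator begin() { return iterator{data}; }
      };

      template <typename T, typename GrowthPolicy>  // policy affects growth only
      struct stack : stack_base<T> {
          void push(const T& v) { this->data[this->top++] = v; /* GrowthPolicy... */ }
      };

      struct Doubling {};
      struct Linear {};

      int main() {
          stack<int, Doubling> a; a.push(7);
          stack<int, Linear>   b; b.push(9);
          // One iterator type serves both stacks, despite the different policies:
          stack_base<int>::iterator i = a.begin(), j = b.begin();
          printf("%d %d\n", *i, *j);
      }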

  • SCARY iterator assignment and initialization
    Robert Klarer, Bjarne Stroustrup, Dan Tsafrir, Michael Wong
    ISO/IEC C++ Standards Committee Paper WG21/N2913=09-0103
    July, 2009, Frankfurt, Germany
    Abstract , BibTeX , PDF , Definitive
    We propose a requirement that a standard container's iterator types have no dependency on any type argument apart from the container's value_type, difference_type, pointer type, and const_pointer type. In particular, the types of a standard container's iterators should not depend on the container's key_compare, hasher, key_equal, or allocator types.
    @InProceedings{klarer09-scary,
    author = {Robert Klarer and Bjarne Stroustrup and Dan Tsafrir and Michael Wong},
    title = {{SCARY} iterator assignment and initialization},
    booktitle = {ISO/IEC C++ Standards Committee Paper WG21/N2913=09-0103},
    year = 2009,
    month = {July},
    address = {Frankfurt, Germany}
    }

2008

  • Portably solving file races with hardness amplification
    Dan Tsafrir, Tomer Hertz, Dilma Da Silva, David Wagner
    TOS: ACM Transactions on Storage
    November, 2008, article 9, volume 4, number 3
    Superseded by , Abstract , BibTeX , PDF , Definitive
    The file-system API of contemporary systems makes programs vulnerable to TOCTTOU (time-of-check-to-time-of-use) race conditions. Existing solutions either help users to detect these problems (by pinpointing their locations in the code), or prevent the problem altogether (by modifying the kernel or its API). But the latter alternative is not prevalent, and the former is just the first step: programmers must still address TOCTTOU flaws within the limits of the existing API with which several important tasks cannot be accomplished in a portable straightforward manner. Recently, Dean and Hu [2004] addressed this problem and suggested a probabilistic hardness amplification approach that alleviated the matter. Alas, shortly after, Borisov et al. [2005] responded with an attack termed “filesystem maze” that defeated the new approach.

    We begin by noting that mazes constitute a generic way to deterministically win many TOCTTOU races (gone are the days when the probability was small). In the face of this threat, we (1) develop a new user-level defense that can withstand mazes, and (2) show that our method is undefeated even by much stronger hypothetical attacks that provide the adversary program with ideal conditions to win the race (enjoying complete and instantaneous knowledge about the defending program's actions and being able to perfectly synchronize accordingly). The fact that our approach is immune to these unrealistic attacks suggests it can be used as a simple and portable solution to a large class of TOCTTOU vulnerabilities, without requiring modifications to the underlying operating system.
    @Article{tsafrir08-tocttou:tos,
    author = {Dan Tsafrir and Tomer Hertz and Dilma Da Silva and David Wagner},
    title = {Portably solving file races with hardness amplification},
    journal = {ACM Transactions on Storage (TOS)},
    volume = 4,
    number = 3,
    year = 2008,
    month = {November},
    pages = 9
    }
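
    The flavor of hardness amplification can be conveyed with a simplified K-round check (this sketch is our distillation, not the paper's exact algorithm, and omits the maze countermeasures): the path must resolve to the same inode in every round, so an attacker must win K races in a row rather than one.

      #include <cstdio>
      #include <fcntl.h>
      #include <sys/stat.h>
      #include <unistd.h>

      int open_checked(const char* path, int rounds) {
          struct stat first{}, cur{};
          if (lstat(path, &first) != 0 || !S_ISREG(first.st_mode)) return -1;
          int fd = -1;
          for (int i = 0; i < rounds; i++) {
              if (fd >= 0) close(fd);
              fd = open(path, O_RDONLY | O_NOFOLLOW);
              if (fd < 0) return -1;
              if (fstat(fd, &cur) != 0 ||
                  cur.st_dev != first.st_dev || cur.st_ino != first.st_ino) {
                  close(fd);
                  return -1;             // a race was detected; give up safely
              }
          }
          return fd;                     // the same inode survived every round
      }

      int main() {
          int fd = open_checked("/etc/hostname", 7);  // arbitrary demo path and K
          if (fd >= 0) { puts("opened race-checked"); close(fd); }
      }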

  • Portably preventing file race attacks with user-mode path resolution
    Dan Tsafrir, Tomer Hertz, David Wagner, Dilma Da Silva
    Technical Report RC24572, IBM T. J. Watson Research Center
    June, 2008, Yorktown Heights, New York, submitted
    Abstract , BibTeX , PDF , Definitive
    The filesystem API of contemporary systems exposes programs to TOCTTOU (time of check to time of use) race-condition vulnerabilities, which occur between pairs of check/use system calls that involve a name of a file. Existing solutions either help programmers to detect such races (by pinpointing their location) or prevent them altogether (by altering the operating system). But the latter alternative is not prevalent, and the former is just the first step: programmers must still address TOCTTOU flaws within the limits of the existing API with which several important tasks cannot be safely accomplished in a portable straightforward manner. The recent “filesystem maze” attack further worsens the problem by allowing adversaries to deterministically win races and thus refuting the common perception that the risk is small. In the face of this threat, we develop a new algorithm that allows programmers to effectively aggregate a vulnerable pair of distinct system calls into a single operation that is executed “atomically”. This is achieved by emulating one kernel functionality in user mode: the file-path resolution. The surprisingly simple resulting algorithm constitutes a portable solution to a large class of TOCTTOU vulnerabilities, without requiring modifications to the underlying operating system. In contrast to our previous work, the new algorithm is deterministic and incurs significantly less overhead.
    @TechReport{tsafrir08-path-res,
    author = {Dan Tsafrir and Tomer Hertz and David Wagner and Dilma Da Silva},
    title = {Portably preventing file race attacks with user-mode path resolution},
    institution = {IBM T. J. Watson Research Center},
    number = {RC24572},
    year = 2008,
    month = {June},
    address = {Yorktown Heights, New York},
    note = {(submitted for publication)}
    }
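
    The core idea of resolving the path in user mode can be sketched as follows. This is our much-simplified reading of the approach (a real implementation must also handle symlink policy, “..”, relative paths, and errors, which the paper's algorithm addresses):

        /* Component-at-a-time resolution: each path component is opened
         * relative to an fd for the previous one, with O_NOFOLLOW, so no
         * step can be silently redirected through a symlink planted
         * mid-resolution.  Function name is ours; path must be absolute. */
        #include <fcntl.h>
        #include <string.h>
        #include <unistd.h>

        int resolve_nofollow(const char *path)   /* e.g. "/a/b/c" */
        {
            char buf[4096];
            strncpy(buf, path, sizeof buf - 1);
            buf[sizeof buf - 1] = '\0';

            int dirfd = open("/", O_RDONLY | O_DIRECTORY);
            if (dirfd < 0)
                return -1;

            char *save;
            for (char *comp = strtok_r(buf + 1, "/", &save); comp != NULL;
                 comp = strtok_r(NULL, "/", &save)) {
                int next = openat(dirfd, comp, O_RDONLY | O_NOFOLLOW);
                close(dirfd);
                if (next < 0)
                    return -1;   /* a symlink component fails with ELOOP */
                dirfd = next;
            }
            return dirfd;        /* fd of the final component, with no name re-resolved */
        }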

  • The murky issue of changing process identity: revising “setuid demystified”
    Dan Tsafrir, Dilma Da Silva, David Wagner
    USENIX ;login
    June, 2008, pages 55–66, volume 33, number 3
    Abstract , BibTeX , PDF , Definitive , Software
    Dropping unneeded process privileges promotes security, but is notoriously error-prone due to confusing set∗id system calls with unclear semantics and subtle portability issues. To make things worse, existing recipes to accomplish the task are lacking, related manuals can be misleading, and the associated kernel subsystem might contain bugs. We therefore proclaim the system as untrustworthy when it comes to the subject matter and suggest a defensive easy-to-use solution that addresses all concerns.
    @Article{tsafrir08-priv,
    author = {Dan Tsafrir and Dilma Da Silva and David Wagner},
    title = {The murky issue of changing process identity: revising ``setuid demystified''},
    journal = {USENIX ;login},
    volume = 33,
    number = 3,
    year = 2008,
    month = {June},
    pages = {55--66}
    }
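
    In the defensive, verify-everything spirit the article advocates, a permanent drop of root privileges might look like the sketch below. The article's actual recipe is more thorough; the helper here is ours and assumes the target uid and gid are non-root:

        /* Drop supplementary groups, then gid, then uid, in that order,
         * verifying each step instead of trusting documented semantics,
         * and finally checking that root cannot be regained. */
        #include <grp.h>
        #include <stdio.h>
        #include <stdlib.h>
        #include <unistd.h>

        void drop_privileges(uid_t uid, gid_t gid)  /* both non-root */
        {
            if (setgroups(0, NULL) != 0)              /* clear supplementary groups */
                goto fail;
            if (setgid(gid) != 0 || getgid() != gid)  /* drop gid, then verify */
                goto fail;
            if (setuid(uid) != 0 || getuid() != uid)  /* drop uid, then verify */
                goto fail;
            if (setuid(0) == 0)                       /* regaining root must now fail */
                goto fail;
            return;
        fail:
            perror("privilege drop");
            abort();                                  /* fail closed */
        }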

  • Portably solving file TOCTTOU races with hardness amplification
    Dan Tsafrir, Tomer Hertz, David Wagner, Dilma Da Silva
    FAST '08: USENIX Conference on File and Storage Technologies
    February, 2008, San Jose, California, pages 189–206, awarded Best Paper, Pat Goldberg Memorial Best Paper Award
    Superseded by , Abstract , BibTeX , PDF , Definitive
    The file-system API of contemporary systems makes programs vulnerable to TOCTTOU (time of check to time of use) race conditions. Existing solutions either help users to detect these problems (by pinpointing their locations in the code), or prevent the problem altogether (by modifying the kernel or its API). The latter alternative is not prevalent, and the former is just the first step: programmers must still address TOCTTOU flaws within the limits of the existing API, with which several important tasks cannot be accomplished in a portable straightforward manner. Recently, Dean and Hu addressed this problem and suggested a probabilistic hardness amplification approach that alleviated the matter. Alas, shortly after, Borisov et al. responded with an attack termed “filesystem maze” that defeated the new approach.

    We begin by noting that mazes constitute a generic way to deterministically win many TOCTTOU races (gone are the days when the probability was small). In the face of this threat, we (1) develop a new user-level defense that can withstand mazes, and (2) show that our method is undefeated even by much stronger hypothetical attacks that provide the adversary program with ideal conditions to win the race (enjoying complete and instantaneous knowledge about the defending program's actions and being able to perfectly synchronize accordingly). The fact that our approach is immune to these unrealistic attacks suggests it can be used as a simple and portable solution to a large class of TOCTTOU vulnerabilities, without requiring modifications to the underlying operating system.
    @InProceedings{tsafrir08-tocttou,
    author = {Dan Tsafrir and Tomer Hertz and David Wagner and Dilma Da Silva},
    title = {Portably solving file {TOCTTOU} races with hardness amplification},
    booktitle = {USENIX Conference on File and Storage Technologies (FAST)},
    year = 2008,
    month = {February},
    pages = {189--206},
    address = {San Jose, California}
    }

  • Specialized execution environments
    Maria Butrico, Dilma Da Silva, Orran Krieger, Michal Ostrowski, Bryan Rosenburg, Dan Tsafrir, Eric Van Hensbergen, Robert W. Wisniewski, Jimi Xenidis
    OSR: ACM Operating Systems Review
    January, 2008, pages 106–107, volume 42, number 1
    BibTeX , PDF , Definitive
    @Article{butrico08-specialized,
    author = {Maria Butrico and Dilma Da Silva and Orran Krieger and Michal Ostrowski and Bryan Rosenburg and Dan Tsafrir and Eric Van Hensbergen and Robert W. Wisniewski and Jimi Xenidis},
    title = {Specialized execution environments},
    journal = {ACM Operating Systems Review (OSR)},
    volume = 42,
    number = 1,
    year = 2008,
    month = {January},
    pages = {106--107}
    }

2007

  • Reducing performance evaluation sensitivity and variability by input shaking
    Dan Tsafrir, Keren Ouaknine, Dror G. Feitelson
    MASCOTS '07: IEEE International Symposium on Modeling, Analysis, and Simulation of Computer and Telecommunication Systems
    October, 2007, Istanbul, Turkey, pages 231–237
    Abstract , BibTeX , PDF , Definitive
    Simulations sometimes lead to observed sensitivity to configuration parameters as well as inconsistent performance results. The question is then which observations reflect a true effect and which are coincidental artifacts of the evaluation. The shaking methodology answers this by executing multiple simulations under small perturbations to the input workload, and calculating the average performance result; if the effect persists we can be more confident that it is real, whereas if it disappears it was an artifact. We present several examples where the sensitivity that appears in results based on a single evaluation is eliminated or considerably reduced by the shaking methodology. While our examples come from evaluations of scheduling algorithms for supercomputers, we believe the method has wider applicability.
    @InProceedings{tsafrir07-shake,
    author = {Dan Tsafrir and Keren Ouaknine and Dror G. Feitelson},
    title = {Reducing performance evaluation sensitivity and variability by input shaking},
    booktitle = {IEEE International Symposium on Modeling, Analysis, and Simulation of Computer and Telecommunication Systems (MASCOTS)},
    year = 2007,
    month = {October},
    pages = {231--237},
    address = {Istanbul, Turkey}
    }
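
    In outline, a shaking harness is quite small. In the sketch below, simulate(), the job structure, and the jitter bound are placeholders for whatever evaluation is being shaken:

        #include <stdlib.h>
        #include <string.h>

        struct job { double submit_time; /* ...runtime, estimate, size... */ };

        double simulate(const struct job *jobs, int n);  /* assumed hook: one
                                                            run, returns the metric */

        double shaken_metric(const struct job *jobs, int n, int trials, double jitter)
        {
            struct job *copy = malloc(n * sizeof *copy);
            double sum = 0;
            for (int t = 0; t < trials; t++) {
                memcpy(copy, jobs, n * sizeof *copy);        /* fresh copy per trial */
                for (int i = 0; i < n; i++) {
                    double u = (double)rand() / RAND_MAX;    /* uniform in [0,1] */
                    copy[i].submit_time += (2 * u - 1) * jitter;  /* +-jitter */
                }
                sum += simulate(copy, n);
            }
            free(copy);
            return sum / trials;   /* a persistent effect survives the averaging */
        }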

  • Secretly monopolizing the CPU without superuser privileges
    Dan Tsafrir, Yoav Etsion, Dror G. Feitelson
    USENIX Security Symposium
    August, 2007, Boston, Massachusetts, pages 239–256
    Abstract , BibTeX , PDF , Definitive , Coverage: Slashdot , Ars Technica , ACM TechNews
    We describe a “cheat” attack, allowing an ordinary process to hijack any desirable percentage of the CPU cycles without requiring superuser/administrator privileges. Moreover, the nature of the attack is such that, at least in some systems, listing the active processes will erroneously show the cheating process as not using any CPU resources: the “missing” cycles would either be attributed to some other process or not be reported at all (if the machine is otherwise idle). Thus, certain malicious operations, generally believed to have required overcoming the hardships of obtaining root access and installing a rootkit, can actually be launched by non-privileged users in a straightforward manner, thereby making the job of a malicious adversary that much easier. We show that most major general-purpose operating systems are vulnerable to the cheat attack, due to a combination of how they account for CPU usage and how they use this information to prioritize competing processes. Furthermore, recent scheduler changes attempting to better support interactive workloads increase the vulnerability to the attack, and naive steps taken by certain systems to reduce the danger are easily circumvented. We show that the attack can nevertheless be defeated, and we demonstrate this by implementing a patch for Linux that eliminates the problem with negligible overhead.
    @InProceedings{tsafrir07-cheat,
    author = {Dan Tsafrir and Yoav Etsion and Dror G. Feitelson},
    title = {Secretly monopolizing the {CPU} without superuser privileges},
    booktitle = {USENIX Security Symposium},
    year = 2007,
    month = {August},
    pages = {239--256},
    address = {Boston, Massachusetts}
    }
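
    The essence of the attack, as the abstract describes it, fits in a few lines. The sketch below assumes, for illustration only, a 1000 Hz tick aligned to whole milliseconds of CLOCK_MONOTONIC; the paper's attack measures the tick rather than assuming it:

        /* Compute between ticks, but always be asleep when the tick fires:
         * an OS that samples CPU usage on ticks then bills this process
         * for (almost) no CPU time. */
        #include <time.h>

        #define TICK_NS 1000000LL            /* assumed 1000 Hz tick */

        static long long now_ns(void)
        {
            struct timespec ts;
            clock_gettime(CLOCK_MONOTONIC, &ts);
            return ts.tv_sec * 1000000000LL + ts.tv_nsec;
        }

        int main(void)
        {
            for (;;) {
                long long next_tick = (now_ns() / TICK_NS + 1) * TICK_NS;
                while (now_ns() < next_tick - TICK_NS / 10)
                    ;                             /* compute for ~90% of the tick */
                struct timespec nap = { 0, 200000 };  /* 0.2 ms */
                nanosleep(&nap, NULL);            /* asleep when the tick samples */
            }
        }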

  • The context-switch overhead inflicted by hardware interrupts (and the enigma of do-nothing loops)
    Dan Tsafrir
    ExpCS '07: ACM Workshop on Experimental Computer Science
    June, 2007, San Diego, California, page 4
    Abstract , BibTeX , PDF , Definitive
    The overhead of a context switch is typically associated with multitasking, where several applications share a processor. But even if only one runnable application is present in the system and supposedly runs alone, it is still repeatedly preempted in favor of a different thread of execution, namely, the operating system that services periodic clock interrupts. We employ two complementary methodologies to measure the overhead incurred by such events and obtain contradictory results.

    The first methodology systematically changes the interrupt frequency and measures by how much this prolongs the duration of a program that sorts an array. The overall overhead is found to be 0.5-1.5% at 1000 Hz, linearly proportional to the tick rate, and steadily declining as the speed of processors increases. If the kernel is configured such that each tick is slowed down by an access to an external time source, then the direct overhead dominates. Otherwise, the relative weight of the indirect portion is steadily growing with processors' speed, accounting for up to 85% of the total.

    The second methodology repeatedly executes a simplistic loop (calibrated to take 1 ms), measures the actual execution time, and analyzes the perturbations. Some loop implementations yield results similar to the above, but others indicate that the overhead is actually an order of magnitude bigger, or worse. The phenomenon was observed on IA32, IA64, and Power processors, the latter being part of the ASC Purple supercomputer. Importantly, the effect is greatly amplified for parallel jobs, where one late thread holds up all its peers, causing a slowdown that is dominated by the per-node latency (numerator) and the job granularity (denominator). We trace the bizarre effect to an unexplained interrupt/loop interaction; the question of whether this hardware misfeature is experienced by real applications remains open.
    @InProceedings{tsafrir07-context,
    author = {Dan Tsafrir},
    title = {The context-switch overhead inflicted by hardware interrupts (and the enigma of do-nothing loops)},
    booktitle = {ACM Workshop on Experimental Computer Science (ExpCS)},
    year = 2007,
    month = {June},
    pages = 4,
    address = {San Diego, California}
    }
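
    A bare-bones version of the second methodology: time many runs of a calibrated do-nothing loop and inspect the distribution of measurements. ITERS and RUNS are placeholders to be calibrated per machine:

        #include <stdio.h>
        #include <time.h>

        #define ITERS 1000000L   /* calibrate so one run takes ~1 ms */
        #define RUNS  1000

        int main(void)
        {
            for (int r = 0; r < RUNS; r++) {
                struct timespec a, b;
                clock_gettime(CLOCK_MONOTONIC, &a);
                for (volatile long i = 0; i < ITERS; i++)
                    ;                                  /* the do-nothing loop */
                clock_gettime(CLOCK_MONOTONIC, &b);
                long long ns = (b.tv_sec - a.tv_sec) * 1000000000LL
                             + (b.tv_nsec - a.tv_nsec);
                printf("%lld\n", ns);                  /* outliers expose the
                                                          interrupt overhead */
            }
            return 0;
        }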

  • Backfilling using system-generated predictions rather than user runtime estimates
    Dan Tsafrir, Yoav Etsion, Dror G. Feitelson
    TPDS: IEEE Transactions on Parallel and Distributed Systems
    June, 2007, pages 789–803, volume 18, number 6
    Abstract , BibTeX , PDF , Definitive
    The most commonly used scheduling algorithm for parallel supercomputers is FCFS with backfilling, as originally introduced in the EASY scheduler. Backfilling means that short jobs are allowed to run ahead of their time provided they do not delay previously queued jobs (or at least the first queued job). To make such determinations possible, users are required to provide estimates of how long jobs will run, and jobs that violate these estimates are killed. Empirical studies have repeatedly shown that user estimates are inaccurate, and that system-generated predictions based on history may be significantly better. However, predictions have not been incorporated into production schedulers, partially due to a misconception (that we resolve) claiming inaccuracy actually improves performance, but mainly because underprediction is technically unacceptable: users will not tolerate jobs being killed just because system predictions were too short. We solve this problem by divorcing kill-time from the runtime prediction, and correcting predictions adaptively as needed if they are proved wrong. The end result is a surprisingly simple scheduler, which requires minimal deviations from current practices (e.g. using FCFS as the basis), and behaves exactly like EASY as far as users are concerned; nevertheless, it achieves significant improvements in performance, predictability, and accuracy. Notably, this is based on a very simple runtime predictor that just averages the runtimes of the last two jobs by the same user; counterintuitively, our results indicate that using recent data is more important than mining the history for similar jobs. All the techniques suggested in this paper can be used to enhance any backfilling algorithm, and are not limited to EASY.
    @Article{tsafrir07-pred,
    author = {Dan Tsafrir and Yoav Etsion and Dror G. Feitelson},
    title = {Backfilling using system-generated predictions rather than user runtime estimates},
    journal = {IEEE Transactions on Parallel and Distributed Systems (TPDS)},
    volume = 18,
    number = 6,
    year = 2007,
    month = {June},
    pages = {789--803}
    }
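
    The predictor itself is simple enough to state as code; a literal sketch (the function name is ours), with the scheduler machinery the paper builds around it omitted:

        /* Predict a job's runtime as the average of the same user's last
         * two runtimes; with less history, do the best we can. */
        double predict_runtime(const double *user_history, int n)
        {
            if (n >= 2)
                return (user_history[n - 1] + user_history[n - 2]) / 2.0;
            if (n == 1)
                return user_history[0];
            return -1.0;   /* no history: caller falls back, e.g. to the
                              user's own estimate (our choice of fallback) */
        }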

  • Fine grained kernel logging with KLogger: experience and insights
    Yoav Etsion, Dan Tsafrir, Scott Kirkpatrick, Dror G. Feitelson
    EuroSys '07: ACM European Conference on Computer Systems
    March, 2007, Lisbon, Portugal, pages 259–274
    Abstract , BibTeX , PDF , Definitive , Software
    Understanding the detailed behavior of an operating system is crucial for making informed design decisions. But such an understanding is very hard to achieve, due to the increasing complexity of such systems and the fact that they are implemented and maintained by large and diverse groups of developers. Tools like KLogger — presented in this paper — can help by enabling fine-grained logging of system events and the sharing of a logging infrastructure between multiple developers and researchers, facilitating a methodology where design evaluation can be an integral part of kernel development. We demonstrate the need for such methodology by a host of case studies, using KLogger to better understand various kernel subsystems in Linux, and pinpointing overheads and problems therein.
    @InProceedings{etsion07-klogger,
    author = {Yoav Etsion and Dan Tsafrir and Scott Kirkpatrick and Dror G. Feitelson},
    title = {Fine grained kernel logging with {KLogger:} experience and insights},
    booktitle = {ACM European Conference on Computer Systems (EuroSys)},
    year = 2007,
    month = {March},
    pages = {259--274},
    address = {Lisbon, Portugal}
    }

2006

  • Process prioritization using output production: scheduling for multimedia
    Yoav Etsion, Dan Tsafrir, Dror G. Feitelson
    TOMM: ACM Transactions on Multimedia Computing, Communications and Applications
    November, 2006, pages 318–342, volume 2, number 4
    Abstract , BibTeX , PDF , Definitive
    Desktop operating systems such as Windows and Linux base scheduling decisions on CPU consumption: processes that consume fewer CPU cycles are prioritized, assuming that interactive processes gain from this since they spend most of their time waiting for user input. However, this doesn't work for modern multimedia applications which require significant CPU resources. We therefore suggest a new metric to identify interactive processes by explicitly measuring interactions with the user, and we use it to design and implement a process scheduler. Measurements using a variety of applications indicate that this scheduler is very effective in distinguishing between competing interactive and noninteractive processes.
    @Article{etsion06-huc,
    author = {Yoav Etsion and Dan Tsafrir and Dror G. Feitelson},
    title = {Process prioritization using output production: scheduling for multimedia},
    journal = {ACM Transactions on Multimedia Computing, Communications and Applications (TOMM)},
    volume = 2,
    number = 4,
    year = 2006,
    month = {November},
    pages = {318--342}
    }

  • Stop polling! The case against OS ticks
    Dan Tsafrir, Yoav Etsion, Dror G. Feitelson
    OSDI '06: USENIX Symposium on Operating Systems Design and Implementation
    November, 2006, Seattle, Washington, Poster Session
    BibTeX , PDF , Definitive , Poster
    @InProceedings{tsafrir06-ticks,
    author = {Dan Tsafrir and Yoav Etsion and Dror G. Feitelson},
    title = {Stop polling! {The} case against {OS} ticks},
    booktitle = {USENIX Symposium on Operating Systems Design and Implementation (OSDI)},
    year = 2006,
    month = {November},
    address = {Seattle, Washington},
    note = {Poster Session}
    }

  • The dynamics of backfilling: solving the mystery of why increased inaccuracy may help
    Dan Tsafrir, Dror G. Feitelson
    IISWC '06: IEEE International Symposium on Workload Characterization
    October, 2006, San Jose, California, pages 131–141
    Abstract , BibTeX , PDF , Definitive
    Parallel job scheduling with backfilling requires users to provide runtime estimates, used by the scheduler to better pack the jobs. Studies of the impact of such estimates on performance have modeled them using a “badness factor” f ≥ 0 in an attempt to capture their inaccuracy (given a runtime r, the estimate is uniformly distributed in [r, (f+1) · r]). Surprisingly, inaccurate estimates (f>0) yielded better performance than accurate ones (f=0). We explain this by a “heel and toe” dynamic that, with f>0, causes backfilling to approximate shortest-job-first scheduling. We further find that the effect of systematically increasing f is V-shaped: average wait time and slowdown initially drop, only to rise later on. This happens because higher values of f create bigger “holes” in the schedule (longer jobs can backfill) and increase the randomness (more long jobs appear as short), thus overshadowing the initial heel-and-toe preference for shorter jobs.

    The bottom line is that artificial inaccuracy generated by multiplying (real or perfect) estimates by a factor is (1) just a scheduling technique that trades off fairness for performance, and is (2) ill-suited for studying the effect of real inaccuracy. Real estimates are modal (90% of the jobs use the same 20 estimates) and bounded by a maximum (usually the most popular estimate). Therefore, when performing an evaluation, “increased inaccuracy” should translate to increased modality. Unlike multiplying, this indeed worsens performance as one would intuitively expect.
    @InProceedings{tsafrir06-bfdyn,
    author = {Dan Tsafrir and Dror G. Feitelson},
    title = {The dynamics of backfilling: solving the mystery of why increased inaccuracy may help},
    booktitle = {IEEE International Symposium on Workload Characterization (IISWC)},
    year = 2006,
    month = {October},
    pages = {131--141},
    address = {San Jose, California}
    }
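
    For reference, the criticized model in code form: given a runtime r and badness factor f, draw the estimate uniformly from [r, (f+1)·r] (rand() is used only for brevity):

        #include <stdlib.h>

        double badness_estimate(double r, double f)
        {
            double u = (double)rand() / RAND_MAX;   /* uniform in [0,1] */
            return r + u * f * r;                   /* uniform in [r, (f+1)*r] */
        }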

  • Modeling, evaluating, and improving the performance of supercomputer scheduling
    Dan Tsafrir
    PhD: Technical Report 2006-78, School of Computer Science and Engineering, the Hebrew University
    September, 2006, Jerusalem, Israel
    Abstract , BibTeX , PDF , Definitive
    The most popular scheduling policy for parallel systems is FCFS with backfilling (a.k.a. “EASY” scheduling), where short jobs are allowed to run ahead of their time provided they do not delay previously queued jobs (or at least the first queued job). This mandates that users provide estimates of how long jobs will run, and jobs that violate these estimates are killed so as not to violate subsequent commitments. The de-facto standard of evaluating the impact of inaccurate estimates on performance has been to use a “badness factor” f ≥ 0, such that given a runtime r, the associated estimate is uniformly distributed in [r, r · (f+1)], or is simply r·(f+1). The underlying assumption was that bigger values of f imply worse information.

    Surprisingly, inaccurate estimates (f>0) yield better performance than accurate ones (f=0), a fact that has repeatedly produced statements like “inaccurate estimates actually improve performance” or “what the scheduler doesn't know won't hurt it”, in many independent studies. This has promoted the perception that estimates are “unimportant”. At the same time, other studies noted that real user estimates are inaccurate, and that system-generated predictions based on history can do better. But predictions were never incorporated into production schedulers, partially due to the aforementioned perception that inaccuracy actually helps, partially because suggested predictors were too complex, and partially because underprediction is technically unacceptable, as users will not tolerate jobs being killed just because system predictions were too short. All attempts to solve the latter technicality yielded algorithms that are inappropriate for many supercomputing settings (e.g. using preemption, assuming all jobs are restartable, etcetera).

    This work has four major contributions. First, we show that the “inaccuracy helps” common wisdom is merely an unwarranted artifact of the erroneous manner in which inaccurate estimates have been modeled, and that increased accuracy actually improves performance. Specifically, previously observed improvements turn out to be due to a “heel and toe” dynamic that, with f>0, causes backfilling to approximate shortest-job-first scheduling. We show that multiplying estimates by a factor translates to trading off fairness for performance, and that this reasoning works regardless of whether the values being multiplied are actual runtimes (“perfect estimates”) or the flawed estimates that are supplied by users. We further show that the more accurate the values we multiply, the better the resulting performance. Thus, better estimates actually improve performance, and multiplying is in fact a scheduling policy that exercises the fairness/performance tradeoff. Regardless, multiplying is anything but representative of real inaccuracy, as outlined next.

    Our second contribution is developing a more representative model of estimates that, from now on, will allow for a valid evaluation of the effect of inaccurate estimates. It is largely based on noting that human users repeatedly use the same “round” values (ten minutes, one hour etc.) and on the invariant that 90% of the jobs use the same 20 estimates. Importantly, the most popular estimate is typically the maximal allowed. As a result, the jobs associated with this estimate cannot be backfilled, and indeed, the more this value is used, the more EASY resembles plain FCFS. Thus, to artificially increase the inaccuracy one should e.g. associate more jobs with the maximum (a realistic manipulation), not multiply by a greater factor (a bogus boost of performance).

    Our third contribution exploits the above understandings to devise a new scheduler that is able to automatically improve the quality of estimates and put this into productive use in the context of EASY, while preserving its attractive simple batch essence and refraining from any unacceptable assumptions. Specifically, the problem of underprediction is solved by divorcing kill-time from the runtime prediction, and correcting predictions adaptively at runtime as needed, if they are proved wrong. The result is a surprisingly simple scheduler, which requires minimal deviations from current practices, and behaves exactly like EASY as far as users are concerned. Nevertheless, it achieves significant improvements in performance, predictability, and accuracy. Notably, this is based on a very simple runtime predictor that just averages the runtimes of the last two jobs by the same user; counterintuitively, our results indicate that using recent data is more important than saving and mining the history for similar jobs, as was done by previous work. For further performance enhancements, we propose to exploit the “heel and toe” understanding: explicitly using a shortest job backfilled first (SJBF) backfilling order. This directly leads to performance improvements similar to those previously attributed to stunts like multiplying estimates. By still preserving FCFS as the basis, we maintain EASY's appeal and enjoy both worlds: a fair scheduler that nevertheless backfills effectively.

    Finally, our fourth contribution has broader applicability that transcends the supercomputing domain. All of the above results are based on the standard methodology of modeling and simulating real activity logs of production systems, which is routinely practiced in system-related research. The overwhelmingly accepted assumption underlying this methodology is that such real workloads are representative and reliable. We show, however, that real workloads may also contain anomalies that make them non-representative and unreliable. This is a special case of multi-class workloads, where one class is the “real” workload which we wish to use in the evaluation, and the other class contaminates the log with “bogus” data. We provide several examples of this situation, including an anomaly we call “workload flurries”: surges of activity with a repetitive nature, caused by a single user, that dominate the workload for a relatively short period. Using a workload with such anomalies in effect emphasizes rare and unique events (e.g. occurring for a few days out of two years of logged data), and risks optimizing the design decision for the anomalous workload at the expense of the normal workload. Thus, we claim that such anomalies should be removed from the workload before it is used in evaluations, and that ignoring them is actually an unjustifiable approach.
    @PhdThesis{tsafrir06-phd,
    author = {Dan Tsafrir},
    title = {Modeling, evaluating, and improving the performance of supercomputer scheduling},
    school = {School of Computer Science and Engineering, the Hebrew University (PhD Thesis)},
    year = 2006,
    month = {September},
    address = {Jerusalem, Israel},
    note = {Technical Report 2006-78}
    }

  • Session-based, estimation-less, and information-less runtime prediction algorithms for parallel and grid job scheduling
    David Talby, Dan Tsafrir, Zviki Goldberg, Dror G. Feitelson
    Technical Report 2006-77, School of Computer Science and Engineering, the Hebrew University
    August, 2006, Jerusalem, Israel
    Abstract , BibTeX , PDF , Definitive
    The default setting of most production parallel job schedulers is FCFS with backfilling. Under this setting, users must supply job runtime estimates, which are known to be highly inaccurate and inferior to system generated predictions. Recent research revealed how to utilize system predictions for backfilling, and this potential performance gain motivates searching for better prediction techniques. We present three prediction techniques using decreasing levels of information as is suitable for the situation at hand. The first is based on “user sessions”: continuous temporal periods of per-user work. This algorithm exploits the entire long-term historical data of the workload, along with user runtime estimates. The second is “estimation-less”, that is, it uses historical data only, relieving users from the annoying need to supply estimates. The third is completely “information-less” and is suitable for cases in which neither historical information nor estimates are available, as happens in some grid environments. We evaluate the algorithms by simulating real data from production systems. We find all of them to be successful in terms of both accuracy and performance.
    @TechReport{talby06-predalg,
    author = {David Talby and Dan Tsafrir and Zviki Goldberg and Dror G. Feitelson},
    title = {Session-based, estimation-less, and information-less runtime prediction algorithms for parallel and grid job scheduling},
    institution = {School of Computer Science and Engineering, the Hebrew University},
    number = {2006-77},
    year = 2006,
    month = {August},
    address = {Jerusalem, Israel}
    }

  • Instability in parallel job scheduling simulation: the role of workload flurries
    Dan Tsafrir, Dror G. Feitelson
    IPDPS '06: IEEE International Parallel and Distributed Processing Symposium
    April, 2006, Rhodes Island, Greece, page 10
    Abstract , BibTeX , PDF , Definitive
    The performance of computer systems depends, among other things, on the workload. This motivates the use of real workloads (as recorded in activity logs) to drive simulations of new designs. Unfortunately, real workloads may contain various anomalies that contaminate the data. A previously unrecognized type of anomaly is workload flurries: rare surges of activity with a repetitive nature, caused by a single user, that dominate the workload for a relatively short period. We find that long workloads often include at least one such event. We show that in the context of parallel job scheduling these events can have a significant effect on performance evaluation results, e.g. a very small perturbation of the simulation conditions might lead to a large and disproportional change in the outcome. This instability is due to jobs in the flurry being affected in unison, a consequence of the flurry's repetitive nature. We therefore advocate that flurries be filtered out before the workload is used, in order to achieve stable and more reliable evaluation results (analogously to the removal of outliers in statistical analysis). At the same time, we note that more research is needed on the possible effects of flurries.
    @InProceedings{tsafrir06-bfly,
    author = {Dan Tsafrir and Dror G. Feitelson},
    title = {Instability in parallel job scheduling simulation: the role of workload flurries},
    booktitle = {IEEE International Parallel and Distributed Processing Symposium (IPDPS)},
    year = 2006,
    month = {April},
    pages = 10,
    address = {Rhodes Island, Greece}
    }
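
    The advocated sanitation step reduces to a simple filter once a flurry has been identified; the struct and names below are ours:

        /* Keep a job unless it belongs to an identified flurry: a single
         * user (flurry_user) dominating a short window [t0, t1].  Applied
         * to the log before it drives any simulation. */
        struct job { int user; double submit_time; /* ... */ };

        int keep_job(const struct job *j, int flurry_user, double t0, double t1)
        {
            return !(j->user == flurry_user &&
                     j->submit_time >= t0 && j->submit_time <= t1);
        }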

  • Workload sanitation for performance evaluation
    Dror G. Feitelson, Dan Tsafrir
    ISPASS '06: IEEE International Symposium on Performance Analysis of Systems and Software
    March, 2006, Austin, Texas, pages 221–230
    Abstract , BibTeX , PDF , Definitive
    The performance of computer systems depends, among other things, on the workload. Performance evaluations are therefore often done using logs of workloads on current production systems, under the assumption that such real workloads are representative and reliable; likewise, workload modeling is typically based on real workloads. We show, however, that real workloads may also contain anomalies that make them non-representative and unreliable. This is a special case of multi-class workloads, where one class is the “real” workload which we wish to use in the evaluation, and the other class contaminates the log with “bogus” data. We provide several examples of this situation, including a previously unrecognized type of anomaly we call “workload flurries”: surges of activity with a repetitive nature, caused by a single user, that dominate the workload for a relatively short period. Using a workload with such anomalies in effect emphasizes rare and unique events (e.g. occurring for a few days out of two years of logged data), and risks optimizing the design decision for the anomalous workload at the expense of the normal workload. Thus we claim that such anomalies should be removed from the workload before it is used in evaluations, and that ignoring them is actually an unjustifiable approach.
    @InProceedings{feitelson06-clean,
    author = {Dror G. Feitelson and Dan Tsafrir},
    title = {Workload sanitation for performance evaluation},
    booktitle = {IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS)},
    year = 2006,
    month = {March},
    pages = {221--230},
    address = {Austin, Texas}
    }

  • System and method for backfilling with system-generated predictions rather than user runtime estimates
    Dan Tsafrir, Yoav Etsion, David Talby, Dror G. Feitelson
    Patent Number: US 8261283 B2 (granted: Sep 2012)
    February, 2006
    BibTeX , Html
    @Misc{tsafrir06-predpat,
    author = {Dan Tsafrir and Yoav Etsion and David Talby and Dror G. Feitelson},
    title = {System and method for backfilling with system-generated predictions rather than user runtime estimates},
    howpublished = {Patent Number: US 8261283 B2 (granted: Sep 2012)},
    year = 2006,
    month = {February}
    }

2005

  • System noise, OS clock ticks, and fine-grained parallel applications
    Dan Tsafrir, Yoav Etsion, Dror G. Feitelson, Scott Kirkpatrick
    ICS '05: ACM International Conference on Supercomputing
    June, 2005, Cambridge, Massachusetts, pages 303–312
    Abstract , BibTeX , PDF , Definitive
    As parallel jobs get bigger in size and finer in granularity, “system noise” is increasingly becoming a problem. In fact, fine-grained jobs on clusters with thousands of SMP nodes run faster if a processor is intentionally left idle (per node), thus enabling a separation of “system noise” from the computation. Paying a cost in average processing speed at a node for the sake of eliminating occasional process delays is (unfortunately) beneficial, as such delays are enormously magnified when one late process holds up thousands of peers with which it synchronizes.

    We provide a probabilistic argument showing that, under certain conditions, the effect of such noise is linearly proportional to the size of the cluster (as is often empirically observed). We then identify a major source of noise to be indirect overhead of periodic OS clock interrupts (“ticks”), which are used by all general-purpose OSs as a means of maintaining control. This is shown for various grain sizes, platforms, tick frequencies, and OSs. To eliminate such noise, we suggest replacing ticks with an alternative mechanism we call “smart timers”. This turns out to also be in line with needs of desktop and mobile computing, increasing the chances that the suggested change will be accepted.
    @InProceedings{tsafrir05-noise,
    author = {Dan Tsafrir and Yoav Etsion and Dror G. Feitelson and Scott Kirkpatrick},
    title = {System noise, {OS} clock ticks, and fine-grained parallel applications},
    booktitle = {ACM International Conference on Supercomputing (ICS)},
    year = 2005,
    month = {June},
    pages = {303--312},
    address = {Cambridge, Massachusetts}
    }
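
    The claimed linear dependence on cluster size follows from a one-line argument (our notation): if each node independently suffers a delaying noise event during a given compute phase with small probability p, then a job that synchronizes across N nodes is held up with probability

        \[ \Pr[\text{some node is late}] \;=\; 1 - (1 - p)^{N} \;\approx\; N p \qquad \text{for } N p \ll 1, \]

    so the expected fraction of delayed phases, and hence the noise-induced slowdown, grows linearly with N.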

  • Modeling user runtime estimates
    Dan Tsafrir, Yoav Etsion, Dror G. Feitelson
    JSSPP '05: Workshop on Job Scheduling Strategies for Parallel Processing
    June, 2005, Cambridge, Massachusetts, pages 1–35, Lecture Notes in Computer Science, Volume 3834
    Abstract , BibTeX , PDF , Definitive , Software
    User estimates of job runtimes have emerged as an important component of the workload on parallel machines, and can have a significant impact on how a scheduler treats different jobs, and thus on overall performance. It is therefore highly desirable to have a good model of the relationship between parallel jobs and their associated estimates. We construct such a model based on a detailed analysis of several workload traces. The model incorporates those features that are consistent in all of the logs, most notably the inherently modal nature of estimates (e.g. only 20 different values are used as estimates for about 90% of the jobs). We find that the behavior of users, as manifested through the estimate distributions, is remarkably similar across the different workload traces. Indeed, providing our model with only the maximal allowed estimate value, along with the percentage of jobs that have used it, yields results that are very similar to the original. The remaining difference (if any) is largely eliminated by providing information on one or two additional popular estimates. Consequently, in comparison to previous models, simulations that utilize our model are better in reproducing scheduling behavior similar to that observed when using real estimates.
    @InCollection{tsafrir05-est,
    author = {Dan Tsafrir and Yoav Etsion and Dror G. Feitelson},
    title = {Modeling user runtime estimates},
    booktitle = {Workshop on Job Scheduling Strategies for Parallel Processing (JSSPP)},
    year = 2005,
    month = {June},
    pages = {1--35},
    address = {Cambridge, Massachusetts},
    note = {Lecture Notes in Computer Science, Volume 3834}
    }
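
    A toy sampler conveys the model's modal shape. The probabilities and values below are invented for illustration; the model itself fits them from the workload, the maximal allowed estimate and its popularity being the crucial inputs:

        #include <stdlib.h>

        /* Return an estimate in seconds, drawn from a handful of "round"
         * modes, with the maximal allowed estimate the most popular. */
        double sample_estimate(void)
        {
            double u = (double)rand() / RAND_MAX;
            if (u < 0.24) return 4.0 * 3600;   /* max allowed (say 4h): 24% of jobs */
            if (u < 0.40) return 1.0 * 3600;   /* another popular round value */
            if (u < 0.55) return 15.0 * 60;
            return 5.0 * 60;                   /* remaining mass on smaller modes */
        }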

  • A short survey of commercial cluster batch schedulers
    Yoav Etsion, Dan Tsafrir
    Technical Report 2005-13, School of Computer Science and Engineering, the Hebrew University
    May, 2005, Jerusalem, Israel
    Abstract , BibTeX , PDF , Definitive
    As high performance computing clusters are getting cheaper, they become more accessible. The various clusters are running a host of workload management software suites, which are getting more complex and offer cluster administrators numerous features, scheduling policies, job prioritization schemes, etc. In this paper we survey some of the common commercial workload managers on the market, covering their main features, specifically the scheduling policies and algorithms they support and their priority and queueing mechanisms, with a focus on their default settings.
    @TechReport{etsion05-survey,
    author = {Yoav Etsion and Dan Tsafrir},
    title = {{A} short survey of commercial cluster batch schedulers},
    institution = {School of Computer Science and Engineering, the Hebrew University},
    number = {2005-13},
    year = 2005,
    month = {May},
    address = {Jerusalem, Israel}
    }

  • General purpose timing: the failure of periodic timers
    Dan Tsafrir, Yoav Etsion, Dror G. Feitelson
    Technical Report 2005-6, School of Computer Science and Engineering, the Hebrew University
    February, 2005, Jerusalem, Israel
    Abstract , BibTeX , PDF , Definitive
    All general-purpose commodity operating systems use periodic clock interrupts to regain control and measure the passage of time. This is ill-suited for desktop settings, as the fine-grained timing requirements of modern multimedia applications require a high clock rate, which may suffer from significant overhead. It is ill-suited for HPC environments, as asynchronous interrupts ruin the coordination among cluster nodes. And it is ill-suited for mobile platforms, as it wastes significant energy, especially when the system is otherwise idle. To be truly general-purpose, systems should therefore switch to a mechanism that is closer to one-shot timers (set only for specific needs) while avoiding the potentially huge overhead they entail. With a careful design it is possible to achieve both high accuracy and low overhead, thus significantly extending the applicability of general-purpose operating systems.
    @TechReport{tsafrir05-timing,
    author = {Dan Tsafrir and Yoav Etsion and Dror G. Feitelson},
    title = {General purpose timing: the failure of periodic timers},
    institution = {School of Computer Science and Engineering, the Hebrew University},
    number = {2005-6},
    year = 2005,
    month = {February},
    address = {Jerusalem, Israel}
    }
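
    On present-day Linux, the advocated one-shot style can be sketched with the timerfd interface (which postdates this report); the sketch illustrates the API shape, not the report's in-kernel design:

        #include <stdint.h>
        #include <sys/timerfd.h>
        #include <time.h>
        #include <unistd.h>

        int main(void)
        {
            int tfd = timerfd_create(CLOCK_MONOTONIC, 0);

            struct itimerspec its = { 0 };   /* it_interval stays zero: one-shot,
                                                no periodic rearming */
            its.it_value.tv_nsec = 500000;   /* fire once, 0.5 ms from now */
            timerfd_settime(tfd, 0, &its, NULL);

            uint64_t expirations;
            read(tfd, &expirations, sizeof expirations);  /* blocks until it fires */
            close(tfd);
            return 0;
        }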

2004

  • Desktop scheduling: how can we know what the user wants?
    Yoav Etsion, Dan Tsafrir, Dror G. Feitelson
    NOSSDAV '04: ACM International Workshop on Network and Operating Systems Support for Digital Audio and Video
    June, 2004, Kinsale, Ireland, pages 110–115
    Abstract , BibTeX , PDF , Definitive
    Current desktop operating systems use CPU utilization (or lack thereof) to prioritize processes for scheduling. This was thought to be beneficial for interactive processes, under the assumption that they spend much of their time waiting for user input. This reasoning fails for modern multimedia applications. For example, playing a movie in parallel with a heavy background job usually leads to poor graphical results, as these jobs are indistinguishable in terms of CPU usage. Suggested solutions involve shifting the burden to the user or programmer, which we claim is unsatisfactory; instead, we seek an automatic solution. Our attempts using new metrics based on CPU usage failed. We therefore propose and implement a novel scheme of identifying interactive and multimedia applications by directly quantifying the I/O between an application and the user (keyboard, mouse, and screen activity). Preliminary results indicate that prioritizing processes according to this metric indeed solves the aforementioned problem, demonstrating that operating systems can indeed provide better support for multimedia and interactive applications. Additionally, once user I/O data is available, it opens intriguing new possibilities to system designers.
    @InProceedings{etsion04-huc-pri,
    author = {Yoav Etsion and Dan Tsafrir and Dror G. Feitelson},
    title = {Desktop scheduling: how can we know what the user wants?},
    booktitle = {ACM International Workshop on Network and Operating Systems Support for Digital Audio and Video (NOSSDAV)},
    year = 2004,
    month = {June},
    pages = {110--115},
    address = {Kinsale, Ireland}
    }

2003

  • Workload flurries
    Dan Tsafrir, Dror G. Feitelson
    Technical Report 2003-85, School of Computer Science and Engineering, the Hebrew University
    November, 2003, Jerusalem, Israel
    Abstract , BibTeX , PDF
    The performance of computer systems depends, among other things, on the workload. Performance evaluations are therefore often done using logs of workloads on current production systems, under the assumption that such real workloads are representative and reliable; likewise, workload modeling is typically based on real workloads. However, real workloads may also contain anomalies that make them non-representative and unreliable. A previously unrecognized type of anomaly is workload flurries: surges of activity with a repetitive nature, caused by a single user, that dominate the workload for a relatively short period. Under suitable conditions, such flurries can have a decisive effect on performance evaluation results. The problem is that workloads containing such a flurry are not representative of normal usage. Moreover, creating a statistical model based on such a workload or using it directly is also not representative of flurries in general. This motivates the approach of identifying and removing the flurries, so as to allow for an evaluation under normal conditions. We demonstrate this for several evaluations of parallel systems, showing that the anomalies in the workload as embodied by flurries carry over to anomalies in the evaluation results, which disappear when the flurries are removed. Such an evaluation can then be augmented by a separate evaluation of the deviation caused by the flurry.
    @TechReport{tsafrir03-flurry,
    author = {Dan Tsafrir and Dror G. Feitelson},
    title = {Workload flurries},
    institution = {School of Computer Science and Engineering, the Hebrew University},
    number = {2003-85},
    year = 2003,
    month = {November},
    address = {Jerusalem, Israel}
    }

  • Effects of clock resolution on the scheduling of interactive and soft real-time processes
    Yoav Etsion, Dan Tsafrir, Dror G. Feitelson
    SIGMETRICS '03: ACM SIGMETRICS International Conference on Measurement and Modeling
    June, 2003, San Diego, California, pages 172–183
    Abstract , BibTeX , PDF , Definitive
    It is commonly agreed that scheduling mechanisms in general purpose operating systems do not provide adequate support for modern interactive applications, notably multimedia applications. The common solution to this problem is to devise specialized scheduling mechanisms that take the specific needs of such applications into account. A much simpler alternative is to better tune existing systems. In particular, we show that conventional scheduling algorithms typically only have little and possibly misleading information regarding the CPU usage of processes, because increasing CPU rates have caused the common 100 Hz clock interrupt rate to be coarser than most application time quanta. We therefore conduct an experimental analysis of what happens if this rate is significantly increased. Results indicate that much higher clock interrupt rates are possible with acceptable overheads, and lead to much better information. In addition we show that increasing the clock rate can provide a measure of support for soft real time requirements, even when using a general-purpose operating system. For example, we achieve a sub-millisecond latency under heavily loaded conditions.
    @InProceedings{etsion03-clock,
    author = {Yoav Etsion and Dan Tsafrir and Dror G. Feitelson},
    title = {Effects of clock resolution on the scheduling of interactive and soft real-time processes},
    booktitle = {ACM SIGMETRICS International Conference on Measurement and Modeling (SIGMETRICS)},
    year = 2003,
    month = {June},
    pages = {172--183},
    address = {San Diego, California}
    }

2002

  • Barrier synchronization on a loaded SMP using two-phase waiting algorithms
    Dan Tsafrir, Dror G. Feitelson
    IPDPS '02: IEEE International Parallel and Distributed Processing Symposium
    April, 2002, Fort Lauderdale, Florida, page 80
    Abstract , BibTeX , PDF , Definitive
    Little work has been done on the performance of barrier synchronization using two-phase blocking, as the common wisdom is that it is useless to spin if the total number of threads in the system exceeds the number of processors. We challenge this and show that it may be beneficial to spin-wait even if the number of threads is up to double the number of processors, especially if the waiting time is at least twice the context switch overhead (rather than being equal to it). We also characterize the alternating synchronization pattern that applications based on barriers tend to fall into, which is quite different from the patterns typically assumed in theoretical analyses.
    @InProceedings{tsafrir02-sync,
    author = {Dan Tsafrir and Dror G. Feitelson},
    title = {Barrier synchronization on a loaded {SMP} using two-phase waiting algorithms},
    booktitle = {IEEE International Parallel and Distributed Processing Symposium (IPDPS)},
    year = 2002,
    month = {April},
    pages = 80,
    address = {Fort Lauderdale, Florida}
    }
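
    A pthreads sketch of two-phase waiting, tuned per the paper's finding that spinning should last roughly twice the context-switch overhead before blocking. SPIN_NS is a per-machine calibration, and the flag/lock plumbing is ours (the last arriver sets the flag under the mutex and broadcasts):

        #include <pthread.h>
        #include <time.h>

        #define SPIN_NS (2 * 5000LL)   /* ~2x an assumed 5 us context switch */

        static long long now_ns(void)
        {
            struct timespec ts;
            clock_gettime(CLOCK_MONOTONIC, &ts);
            return ts.tv_sec * 1000000000LL + ts.tv_nsec;
        }

        void two_phase_wait(volatile int *flag, pthread_mutex_t *mu,
                            pthread_cond_t *cv)
        {
            long long deadline = now_ns() + SPIN_NS;
            while (!*flag)                  /* phase 1: spin */
                if (now_ns() > deadline)
                    goto block;
            return;
        block:
            pthread_mutex_lock(mu);         /* phase 2: block */
            while (!*flag)
                pthread_cond_wait(cv, mu);
            pthread_mutex_unlock(mu);
        }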

2001

  • Barrier synchronization on a loaded SMP using two-phase waiting algorithms
    Dan Tsafrir
    MSc: Technical Report 2001-82, School of Computer Science and Engineering, the Hebrew University
    September, 2001, Jerusalem, Israel
    Abstract , BibTeX , PDF , Definitive
    Little work has been done on the performance of barrier synchronization using two-phase blocking, as the common wisdom is that it is useless to spin if the total number of threads in the system exceeds the number of processors. We challenge this view and show that it may be beneficial to spin-wait if the spinning period is set to be a bit more than twice the context switch overhead (rather than being equal to it). We show that the success of our approach is due to an inherent property of general-purpose schedulers, which tend to select threads that become unblocked for immediate execution. We find that this property causes applications based on barriers to fall into a previously unnoticed pattern, denoted “alternating synchronization”, which is quite different from the patterns typically assumed in theoretical analyses. By merely choosing an appropriate spinning period, we leverage alternating synchronization to implicitly nudge the system into simultaneously co-scheduling the application's threads, thereby dramatically reducing the overhead of synchronization and significantly improving the performance.
    @MastersThesis{tsafrir01-msc,
    author = {Dan Tsafrir},
    title = {Barrier synchronization on a loaded {SMP} using two-phase waiting algorithms},
    school = {School of Computer Science and Engineering, the Hebrew University (MSc Thesis)},
    year = 2001,
    month = {September},
    address = {Jerusalem, Israel},
    note = {Technical Report 2001-82}
    }