Technical Report MSC-2018-28

Title: Fault-Tolerant Operating System for Many-core Processors
Authors: Amit Fuchs
Supervisors: Avi Mendelson
PDF: Currently accessible only within the Technion network
Abstract: Creating operating systems for many-core processors is a critical challenge. Contemporary multi-core systems inhibit scaling by relying on hardware-based cache coherency and atomic primitives to guarantee consistency and synchronization when accessing shared memory. Moreover, projections of transistor scaling trends predict that hardware fault rates will increase by orders of magnitude, and that the microarchitecture alone will not be able to provide adequate robustness in the exascale era. Resilience must be considered at all levels; operating systems cannot continue to assume that the processors are error-free.

A fault-tolerant distributed operating system is presented, designed to harness the massive parallelism in many-core distributed shared memory processors. It targets scale-out architectures with 1,000-10,000+ fault-prone cores on-chip and waives traditional hardware-based consistency over the shared memory. The operating system allows applications to remain oblivious to hardware faults and efficiently utilize all cores of exascale systems-on-chip without performing explicit synchronization.

To scale efficiently and reliably as the number of cores rapidly increases while their reliability decreases, the new operating system provides fault-tolerant task-level parallelism to applications through a coarse-grained data-flow programming model. A decentralized wait-free execution engine was created to maximize task parallelism, scalability, and resiliency over unreliable processing cores. It combines message-passing and shared memory without strong consistency guarantees. Fine-grained checkpoints are intrinsic at all levels, enabling on-the-fly recovery of application-level tasks in the case of hardware faults, automatically resuming their execution with minimal costs.
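The recovery idea described above can be illustrated with a small sketch. This is not the report's implementation: it is a single-threaded Python simulation, neither decentralized nor wait-free, and every name in it (run_dataflow, CoreFault, fault_plan, max_retries) is a hypothetical chosen for illustration. It assumes an acyclic task graph and shows only the core principle: a task's result is committed once the task completes, so a task interrupted by a fault can be re-executed from its already committed inputs without restarting the application.

```python
from collections import deque


class CoreFault(Exception):
    """Hypothetical signal that the core running a task failed."""


def run_dataflow(tasks, deps, fault_plan=None, max_retries=3):
    """Run an acyclic task graph, re-executing a task after an injected fault.

    tasks: name -> function taking a dict {input_name: committed_result}
    deps: name -> list of task names whose results the task consumes
    fault_plan: name -> number of faults to inject before the task succeeds
    """
    fault_plan = dict(fault_plan or {})
    committed = {}  # checkpoint store: results of completed tasks only
    attempts = {name: 0 for name in tasks}
    pending = {name: set(deps.get(name, [])) for name in tasks}
    consumers = {name: [] for name in tasks}
    for name, inputs in deps.items():
        for src in inputs:
            consumers[src].append(name)
    # A task is ready once all of its inputs are committed.
    ready = deque(sorted(n for n, p in pending.items() if not p))

    while ready:
        name = ready.popleft()
        inputs = {src: committed[src] for src in deps.get(name, [])}
        attempts[name] += 1
        try:
            if fault_plan.get(name, 0) > 0:
                fault_plan[name] -= 1
                raise CoreFault(name)  # simulated hardware fault
            result = tasks[name](inputs)
        except CoreFault:
            if attempts[name] > max_retries:
                raise
            ready.append(name)  # inputs are still committed, so just retry
            continue
        committed[name] = result  # commit only after successful completion
        for consumer in consumers[name]:
            pending[consumer].discard(name)
            if not pending[consumer]:
                ready.append(consumer)
    return committed, attempts


tasks = {
    "load": lambda _: [1, 2, 3, 4],
    "square": lambda i: [x * x for x in i["load"]],
    "total": lambda i: sum(i["square"]),
}
deps = {"square": ["load"], "total": ["square"]}
results, attempts = run_dataflow(tasks, deps, fault_plan={"square": 2})
```

In this run the "square" task faults twice and succeeds on its third attempt, while "load" runs only once because its committed result is reused. The application-level functions contain no fault handling or synchronization of their own, which is the property the abstract attributes to the operating system.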

A prototype implementation of the new operating system was experimentally evaluated on a many-core full-system simulator; the presented results exemplify the characteristics and benefits of the new approach.

Copyright: The above paper is copyrighted by the Technion, the author(s), or others. Please contact the author(s) for more information.

Remark: Any link to this technical report should be to this page, rather than to the URL of the PDF files directly. The latter URLs may change without notice.

Computer science department, Technion