Fault-Tolerant Operating System for Many-Core Processors

Amit Fuchs, M.Sc. Thesis Seminar
Wednesday, 6.12.2017, 10:30
Taub 601
Prof. A. Mendelson

This seminar presents a fault-tolerant distributed operating system designed to harness the massive parallelism of many-core distributed shared-memory processors with 1,000 to 10,000 or more cores. To scale efficiently and reliably as core counts rapidly increase while per-core reliability decreases, the new operating system provides fault-tolerant task-level parallelism based on coarse-grained data-flow principles. Its wait-free, decentralized execution engine combines message passing and shared memory, allowing applications to implicitly utilize all cores of future exascale systems-on-chip. The system lets programs remain oblivious to faults without requiring explicit synchronization or strong consistency guarantees over the shared memory. A prototype implementation of the new operating system was experimentally evaluated on a many-core full-system simulator; the presented results illustrate the characteristics and benefits of the new design.
