Amit Fuchs, M.Sc. Thesis Seminar
Wednesday, 6.12.2017, 10:30
This seminar presents a fault-tolerant distributed operating system designed to harness the massive parallelism of many-core distributed shared-memory processors with 1,000 to 10,000 or more cores.
To scale efficiently and reliably as core counts rapidly increase while per-core reliability decreases, the new operating system provides fault-tolerant task-level parallelism based on coarse-grained data-flow principles.
Combining message passing and shared memory, a wait-free decentralized execution engine was created that allows applications to implicitly utilize all cores of future exascale systems-on-chip. The system allows programs to remain oblivious to faults without requiring explicit synchronization or strong consistency guarantees over the shared memory.
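As a rough illustration of the coarse-grained data-flow idea, the sketch below fires each task once its inputs are available and simply re-executes a task when the core running it faults, so the application code never handles the fault itself. It is a single-threaded toy under stated assumptions, not the engine described in the thesis: it is neither wait-free nor decentralized, and the names Task, CoreFault and run_dataflow are hypothetical.

```python
from collections import deque


class CoreFault(Exception):
    """Simulated failure of the core running a task (hypothetical)."""


class Task:
    """A coarse-grained task: a pure function plus the names of its inputs."""

    def __init__(self, name, fn, deps=()):
        self.name = name
        self.fn = fn
        self.deps = tuple(deps)


def run_dataflow(tasks, max_retries=3):
    """Run tasks in data-flow order, re-executing a task if its core faults."""
    results = {}
    pending = deque(tasks)
    stalled = 0  # consecutive tasks skipped because inputs were missing
    while pending:
        task = pending.popleft()
        if not all(dep in results for dep in task.deps):
            # Inputs not ready yet: put the task back and try another one.
            pending.append(task)
            stalled += 1
            if stalled > len(pending):
                raise ValueError("unsatisfiable dependencies")
            continue
        stalled = 0
        inputs = [results[dep] for dep in task.deps]
        for attempt in range(max_retries + 1):
            try:
                # Tasks are assumed side-effect free, so re-running is safe.
                results[task.name] = task.fn(*inputs)
                break
            except CoreFault:
                if attempt == max_retries:
                    raise
    return results


# Usage: "sq" faults on its first attempt and is transparently re-executed.
attempts = {"count": 0}


def flaky_square(x):
    attempts["count"] += 1
    if attempts["count"] == 1:
        raise CoreFault("core lost")
    return x * x


tasks = [
    Task("sum", lambda a, b: a + b, deps=("sq", "load")),
    Task("load", lambda: 3),
    Task("sq", flaky_square, deps=("load",)),
]
results = run_dataflow(tasks)
print(results["sum"])
```

The tasks are deliberately listed out of order to show that execution order follows data availability rather than program order. Re-execution is only safe here because each task is assumed to be a pure function of its inputs.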
A prototype implementation of the new operating system was experimentally evaluated on a many-core full-system simulator. The presented results illustrate the characteristics and benefits of the new design.