Allen Clement (MMax Planck Institute for Software Systems in Kaiserslautern, Germany)
Wednesday, 14.5.2014, 11:30
The choice between Byzantine and crash fault tolerance is viewed as a fundamental design decision when building fault tolerant systems. We show that this dichotomy is not fundamental, and present a unified model of fault tolerance in which the number of tolerated faults of each type is a configuration choice. Additionally, we observe that a single fault is capable of devastating the performance of existing Byzantine fault tolerant replication systems. We argue that fault tolerant systems should, and can, be designed to perform well even when failures occur. In this talk I will expand on these two insights and describe our experience leveraging them to build a generic fault tolerant replication library that provides flexible fault tolerance and robust performance. We use the library to build a (Byzantine) fault tolerant version of the Hadoop Distributed File System.
Allen Clement is a Tenure-track faculty member at the Max Planck Institute for Software Systems. He received a Ph.D. from the University of Texas at Austin in 2010 and an A.B. in Computer Science from Princeton University. His research focuses on the challenges of building robust and reliable distributed systems. In particular, he has investigated practical Byzantine fault tolerant replication, systems robust to both Byzantine and selfish behaviors, consistency in geo-replicated environments, and how to leverage the structure of social networks to build Sybil-tolerant systems. When not in the lab he plays way too much ultimate and is coach for the German Mixed National Team.