Abstract: Reliability is a first-order concern in modern distributed systems. Even large, well-provisioned systems such as Gmail and Amazon Web Services can be brought down by failures, incurring millions of dollars of cost and hurting company reputation. Such service outages are typically caused by either hardware failures or software bugs. We have developed various techniques for dealing with both kinds of failures (e.g. replication, software testing), but those techniques come at a significant cost. For example, our replication techniques for handling hardware failures are incompatible with multithreaded execution, forcing a stark choice between reliability and performance. As for guarding against software failures, our only real option today is to test our system as best we can and hope we have not missed any subtle bugs. In principle there exists another option, formal verification, that fully addresses this problem, but its overhead in both raw performance and programming effort is considered way too impractical to adopt in real developments.
In this talk, I make the case for Sustainable Reliability, i.e. reliability techniques that provide strong guarantees without imposing unnecessary overhead that limits their practicality. My talk covers the challenges faced by both hardware and software failures and proposes novel techniques in each area. In particular, I will describe how we can reconcile replication and multithreaded execution by rethinking the architecture of replicated systems. The resulting system, Eve, offers an unprecedented combination of strong guarantees and high performance. I will also describe IronFleet, a new methodology that brings formal verification of distributed systems within the realm of practicality. Despite its strong guarantees, IronFleet incurs a modest overhead in both performance and programming effort.
Brief bio: Manos Kapritsos is a Postdoctoral Researcher at Microsoft Research in Redmond, WA. He received his Ph.D. from the University of Texas at Austin in 2014. His research focuses on designing reliable distributed systems, by applying fault-tolerant replication to combat machine failures and using formal verification to ensure software correctness.