In 2020-2022, Exascale systems will be put in service in multiple locations around the world. Studies and projections agree that these systems will suffer more frequent failures and data corruptions than current systems. The challenge is clear: how can we ensure that Exascale application executions complete and produce correct results? Finding solutions to this problem is not trivial. In particular, simply scaling existing solutions will not work. In this talk, we present a novel general approach: exploring applications and systems to discover properties that can be leveraged to develop Exascale fault tolerance solutions. We will present the results of this approach in four domains: checkpoint/restart, failure prediction, fault-tolerant protocols, and silent data corruption detection. We will also discuss the limits of these solutions.
Franck Cappello is a Senior Computer Scientist at Argonne National Laboratory, where he leads the research on fault tolerance/resilience for extreme scale systems. He is the director of the Inria-Illinois-ANL-BSC-JSC joint laboratory on extreme scale computing (http://publish.illinois.edu/jointlab-esc/), which explores and develops new software addressing key challenges of extreme scale numerical simulations and data analytics. He led the resilience roadmapping efforts for the IESP (International Exascale Software Project) and EESI (European Exascale Software Initiative). He also initiated and directed several international collaborations, such as the G8 "Enabling Climate Simulation at Exascale" project. Cappello received his Ph.D. from the University of Paris XI in 1994 and joined CNRS, the French National Center for Scientific Research. In 2003, he joined Inria, where he held the position of permanent senior researcher until 2013. In 2009, Cappello became a visiting research professor at the University of Illinois at Urbana-Champaign. He joined Argonne National Laboratory in April 2014.