When systems fail in production, logs are often the only source of diagnostic information. In this talk, I will describe two tools we recently developed to ease the pain of postmortem debugging. The first tool, Stitch, answers the "how" in the title. It reconstructs the execution flows across an entire distributed software stack using only the unstructured logs output by heterogeneous software components. Stitch differs substantially from all prior related tools in that it can construct a system model of an entire software stack without any built-in domain knowledge. The second tool, Log20, answers the "where". Without any human involvement, it determines a near-optimal placement of log printing statements under the constraint of adding less than a specified amount of performance overhead.
Bio: Ding Yuan is an assistant professor in the Electrical and Computer Engineering department at the University of Toronto. His research focuses on systems software. His work on failure diagnosis has been used and licensed by Huawei and Microsoft. His research in software reliability has been widely disseminated in the software industry: it has appeared in tens of industry conference presentations, multiple times on Hacker News, and in tens of blogs, and it triggered thorough code reviews by HBase developers that led to hundreds of patches.