Abstract. Knowing which part of a program processes which parts of an input can reveal the structure of the input as well as the structure of the program. In a URL "http://www.example.com/path/", for instance, the protocol "http", the host "www.example.com", and the path "path" would be handled by different functions and stored in different variables. Given a set of sample inputs, we use _dynamic tainting_ to trace the data flow of each input character, and aggregate those input fragments that would be handled by the same function into lexical and syntactical entities. The result is a _context-free grammar_ that accurately reflects valid input structure; as it draws on function and variable names, it can be as readable as textbook examples:
URL ::= PROTOCOL "://" HOST "/" PATH
PROTOCOL ::= "http" | "https" | …
HOST ::= /[a-zA-Z0-9.]+/
We expect inferred grammars to considerably ease the understanding of file and input formats. Their most important use, however, will be in automatic fuzz testing, where grammars can easily be turned into producers that help to quickly cover program features. Our grammar-based LANGFUZZ fuzzer is in daily use at Mozilla and has uncovered more than 4,000 defects so far; mining grammars automatically will bring such techniques to a wide range of programs. For details on our work on grammar mining, see https://www.st.cs.uni-saarland.de/models/autogram/
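To illustrate how a grammar can be turned into a producer, here is a minimal Python sketch based on the URL grammar above. It is not the LANGFUZZ or AUTOGRAM implementation. The dictionary layout, the `produce` function, the two sample hosts standing in for the HOST regular expression, and the PATH alternatives are all assumptions made for this example; the abstract leaves PATH undefined.

```python
import random
import re

# Grammar adapted from the abstract. Each nonterminal maps to a list of
# alternatives; each alternative is a sequence of symbols.
# ASSUMPTIONS: the HOST samples replace the regex /[a-zA-Z0-9.]+/, and the
# PATH alternatives are invented, since the abstract does not define PATH.
GRAMMAR = {
    "URL": [["PROTOCOL", "://", "HOST", "/", "PATH"]],
    "PROTOCOL": [["http"], ["https"]],
    "HOST": [["www.example.com"], ["example.org"]],
    "PATH": [["path"], ["index.html"], [""]],
}


def produce(symbol, rng):
    """Expand `symbol` into a string by picking random alternatives."""
    if symbol not in GRAMMAR:
        # Anything that is not a nonterminal is emitted literally.
        return symbol
    expansion = rng.choice(GRAMMAR[symbol])
    return "".join(produce(part, rng) for part in expansion)


if __name__ == "__main__":
    rng = random.Random(0)  # fixed seed so runs are reproducible
    for _ in range(5):
        print(produce("URL", rng))
```

Every string produced this way is valid with respect to the grammar, which is what lets a grammar-based fuzzer get past input validation and reach deeper program features.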
Bio. Andreas Zeller has been a full professor of Software Engineering at Saarland University in Saarbrücken, Germany, since 2001. His research concerns the analysis of large software systems and their development process. In 2010, Zeller was inducted as Fellow of the ACM for his contributions to automated debugging and mining software archives, for which he also received 10-year impact awards from ACM SIGSOFT and ICSE. In 2011, he received an ERC Advanced Grant, Europe's highest and most prestigious individual research grant, for work on specification mining and test case generation. In 2013, Zeller co-founded Testfabrik AG, a start-up for automatic testing of Web applications, where he chairs the supervisory board.