UW-Madison CS team wins CodRep!

A team consisting of Jordan Henkel (CS graduate student), Shuvendu Lahiri (from Microsoft Research), Ben Liblit, and Tom Reps (both CS faculty) has just won CodRep. CodRep is an international competition on machine learning on source code organized by the KTH Royal Institute of Technology in Stockholm, Sweden. Henkel will travel to KTH in Stockholm this fall for the award ceremony. The following is a short interview with Henkel discussing the team’s solution:

What’s the main idea behind your solution?  We treat finding the correct repair as a “learning to rank” problem. We use RankLib and the LambdaMart model to learn a ranking over feature vectors extracted for each line of a target file. We ended up submitting a model learned over 80% of all four datasets (and validated on 20%). We did four-fold cross validation (by dropping out a whole dataset at a time, and training on the other three) to convince ourselves that the learning-to-rank approach was generalizing and not overfitting too much. The feature vectors aren’t anything super special. Some features approximate parsing, such as whether the target line and the repair have the same balance of parentheses/curly braces. Other features are mostly based on different kinds of text distance.

What’s the crucial aspect of the algorithm in your opinion?  We thought a learning-based approach would be interesting. But we quickly ran into the issue that finding the right location to insert a repair is a needle-in-the-haystack prediction task. This makes training a binary classifier difficult. In the end, learning-to-rank algorithms seemed to do much better. Also, adding a post-processing step to try to parse the file with the top repair and reject it if it did not parse (but only if the file did originally parse) helped as one might expect. However, this add-on is costly in terms of running time. There are still many open questions. I’m sure we could do even better with better features. We never tried exploiting the fact that guessing a line “close” to the target would be valuable. I think those are interesting things to look at going forward.

Official Announcement: https://groups.google.com/forum/#!msg/codrep/xuIv05-b5xQ/wiz85NOZAgAJ