Most current data cleaning solutions address only certain steps of the cleaning pipeline rather than the entire pipeline. For example, many cleaning pipelines consist of two parts: a machine part that runs a cleaning algorithm, and a human part in which a user manually cleans up the output of the algorithm. Current work has addressed only the machine part. This raises three serious problems. First, without guidance, users often end up doing something ad hoc, suboptimal, and incorrect. Second, the machine part may optimize for the wrong goal, causing more work in the human part. Finally, focusing only on the machine part makes it very difficult to address the brittleness of the cleaning algorithm.
To address these problems, we argue for more attention to end-to-end data cleaning, focusing not just on the machine part but also on the human part. We propose an RDBMS-style solution strategy that defines basic human operations, combines them with cleaning algorithms to form hybrid plans, estimates plan costs (e.g., in terms of human effort), and then selects the best plan. As a case study, we examine the important problem of value normalization and develop an effective solution. Our work thus advocates for end-to-end data cleaning and shows that it is possible to apply an RDBMS-style approach to build complex machine-human solutions to this problem.
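The plan-selection idea can be pictured with a minimal sketch. Everything in it is an illustrative assumption rather than the method developed in this work: the names `Step`, `Plan`, `estimate_human_cost`, and `select_best_plan` are hypothetical, the cost model (human seconds per item, with each step passing a fixed fraction of items to the next) is deliberately simplistic, and the numbers are made up.

```python
from dataclasses import dataclass
from typing import Sequence, Tuple


@dataclass(frozen=True)
class Step:
    """One operation in a plan (hypothetical model, not the paper's)."""
    name: str
    human_seconds_per_item: float  # assumed human effort; 0.0 for a machine step
    pass_through: float            # assumed fraction of items left for later steps


@dataclass(frozen=True)
class Plan:
    name: str
    steps: Tuple[Step, ...]


def estimate_human_cost(plan: Plan, n_items: int) -> float:
    """Sum the estimated human effort (in seconds) over all steps of a plan."""
    remaining = float(n_items)
    total = 0.0
    for step in plan.steps:
        total += remaining * step.human_seconds_per_item
        remaining *= step.pass_through
    return total


def select_best_plan(plans: Sequence[Plan], n_items: int) -> Plan:
    """Pick the plan with the lowest estimated human effort."""
    return min(plans, key=lambda p: estimate_human_cost(p, n_items))


# Made-up example: normalizing 1000 values.
manual_review = Step("manual_review", human_seconds_per_item=4.0, pass_through=0.0)
machine_cluster = Step("machine_cluster", human_seconds_per_item=0.0, pass_through=0.25)

manual_only = Plan("manual_only", (manual_review,))
hybrid = Plan("hybrid", (machine_cluster, manual_review))

best = select_best_plan([manual_only, hybrid], n_items=1000)
print(best.name, estimate_human_cost(best, 1000))
```

Under these assumed numbers the hybrid plan wins because the machine step shrinks the set of values a user must inspect. A realistic cost model would also need to account for errors the machine step introduces and the human effort needed to repair them.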