Mirek Riedewald: Scolopax: Supporting Exploratory Analysis of Scientific Data
Abstract: As the amount and complexity of data in many fields are rapidly increasing, new approaches are needed for exploratory analysis and scientific discovery. Our Scolopax system aims to address these challenges with novel techniques for large-scale parallel data management. In this talk, we will present an overview of Scolopax and then focus on parallel processing of joins. Joins combine information across data sets, e.g., to discover correlations. Our proposed join model simplifies reasoning about how to assign computation tasks to processors in MapReduce and other parallel environments. Using this model, we derive a surprisingly simple randomized algorithm, called 1-Bucket-Random, for implementing arbitrary theta-joins in a single MapReduce job. This algorithm requires only minimal statistics (input cardinality), and we provide proofs and strong evidence that, for a variety of join problems, its latency is either close to optimal or the best realizable option. For some popular joins, we show how to improve over 1-Bucket-Random by exploiting additional input statistics. Various aspects of Scolopax were published at premier data management and data mining venues like SIGMOD, VLDB, ICDE, ICML, and ICDM.
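To make the idea of a randomized, cardinality-only theta-join more concrete, the following is a minimal Python sketch written for illustration. It is not the authors' implementation. It assumes the reducers are arranged as a rows x cols grid over the cross product of the two inputs, that each tuple of the first input picks a random grid row and each tuple of the second input picks a random grid column, and that the grid shape is chosen from the input cardinalities alone. The names choose_grid and one_bucket_join, and the in-memory simulation of the map and reduce phases, are hypothetical.

```python
import random
from collections import defaultdict


def choose_grid(n_s, n_t, reducers):
    """Hypothetical helper: pick a rows x cols grid (rows * cols <= reducers)
    that minimizes the expected input per reducer, n_s/rows + n_t/cols.
    Only the input cardinalities are needed."""
    best = None
    for rows in range(1, reducers + 1):
        cols = reducers // rows  # always >= 1 because rows <= reducers
        cost = n_s / rows + n_t / cols
        if best is None or cost < best[0]:
            best = (cost, rows, cols)
    return best[1], best[2]


def one_bucket_join(S, T, theta, reducers=4, seed=0):
    """Simulate a single map/reduce round for an arbitrary join predicate theta."""
    rng = random.Random(seed)
    rows, cols = choose_grid(len(S), len(T), reducers)
    # buckets[(r, c)] holds the S-tuples and T-tuples sent to reducer (r, c).
    buckets = defaultdict(lambda: ([], []))

    # Map phase: an S-tuple goes to every reducer in one random grid row,
    # a T-tuple goes to every reducer in one random grid column.
    for s in S:
        r = rng.randrange(rows)
        for c in range(cols):
            buckets[(r, c)][0].append(s)
    for t in T:
        c = rng.randrange(cols)
        for r in range(rows):
            buckets[(r, c)][1].append(t)

    # Reduce phase: each reducer evaluates theta on its local pairs.
    # A pair (s, t) meets only at reducer (row of s, column of t),
    # so every pair is checked exactly once.
    out = []
    for s_part, t_part in buckets.values():
        out.extend((s, t) for s in s_part for t in t_part if theta(s, t))
    return out


if __name__ == "__main__":
    pairs = one_bucket_join(range(5), range(5), lambda s, t: s < t)
    print(sorted(pairs))
```

Because each pair of tuples meets at exactly one simulated reducer, the sketch returns the same pairs as a nested-loop join for any predicate, in a possibly different order. The random assignment needs no knowledge of the join attribute's distribution, which is the property the abstract highlights.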
Bio: Mirek Riedewald received a Ph.D. in computer science from the University of California at Santa Barbara in 2002. After spending some time as a researcher at Cornell University and as a visiting researcher at Microsoft Research, he is now an Associate Professor at Northeastern University. Dr. Riedewald's research interests are in databases and data mining, with an emphasis on designing scalable techniques for data-driven science. Currently, Dr. Riedewald is developing novel approaches for parallel data processing and for mining observational data. He has a track record of successful collaborations with scientists from different domains, including ornithology, physics, mechanical and aerospace engineering, and astronomy. His work has been published in premier peer-reviewed data management research venues like ACM SIGMOD, VLDB, IEEE ICDE, and IEEE TKDE, as well as in domain science journals.
