Aarti Singh: Leveraging Information Structure to Overcome Data Deficiencies in Large Complex Systems
Despite great advances in high-throughput technologies, the scale and complexity of modern systems make it impossible to monitor them perfectly. As a result, modern datasets are severely under-sampled, corrupted by noise and outliers, high-dimensional, and unordered. However, the information of interest is often structured (in the form of clusters, bi-clusters, graphs and other topological properties). Leveraging this information structure is key to overcoming the data deficiencies and enabling robust and resource-efficient inference in large complex systems.
In this talk, I will demonstrate how information structure in the form of clusters can be extracted from large-scale, noisy and incomplete data. First, I will characterize the robustness of a popular spectral clustering algorithm, and establish its near-optimality using information-theoretic lower bounds. For large-scale datasets, it might be prohibitive to obtain or compute all the similarities. To address this, I will present a novel framework for "active" hierarchical clustering that uses few, selectively sampled similarities. Coupled with the robustness analysis, this yields a new efficient hierarchical spectral clustering method that requires only O(N log^2 N) selectively sampled similarities, instead of all O(N^2) pairwise similarities, to cluster N objects and runs in linear time. Finally, I will briefly mention robustness results for high-dimensional clustering settings, where the clusters are small and characterized by only a few relevant features.
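To give a rough sense of the savings quoted above, the following sketch compares the number of all pairwise similarities with N log^2 N for a few values of N. It is only an illustration and not part of the talk: the function names are hypothetical, the logarithm is taken base 2, and the constant hidden by the O(·) notation is assumed to be 1, so the actual counts for the method may differ by a constant factor.

```python
import math


def pairwise_count(n: int) -> int:
    """Number of distinct unordered pairs among n objects: n(n-1)/2, which is O(n^2)."""
    return n * (n - 1) // 2


def selective_count(n: int) -> float:
    """n * (log2 n)^2, with the hidden constant assumed to be 1 (illustrative only)."""
    return n * math.log2(n) ** 2


# Compare the two growth rates for increasingly large numbers of objects.
for n in (10**3, 10**4, 10**5, 10**6):
    full = pairwise_count(n)
    few = selective_count(n)
    print(f"N={n:>9,}  all pairs={full:.3e}  N log^2 N={few:.3e}  ratio={full / few:.1f}")
```

The ratio between the two counts grows with N, which is why sampling similarities selectively matters most for the largest datasets.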
Bio:
Aarti Singh is an Assistant Professor in the Machine Learning Department at Carnegie Mellon University. She received a Ph.D. degree in Electrical Engineering from the University of Wisconsin-Madison in 2008 and was a Postdoctoral Research Associate at the Program in Applied and Computational Mathematics at Princeton University from 2008 to 2009. Her research brings together tools from machine learning, statistics and signal processing to develop theoretically sound and practically feasible methods for inference in large complex systems, with applications to sensor networks, epidemiology, drug-protein interaction discovery, and brain networks.
