UW-Madison Computer Sciences Department Makes Impact at NeurIPS 2023

NeurIPS, the Conference on Neural Information Processing Systems, is one of the most prestigious conferences in artificial intelligence (AI), held each year in December. This year, Computer Sciences department researchers at the University of Wisconsin-Madison had 30 papers accepted by the conference. Spread across numerous faculty, postdoctoral researchers, students, and external collaborators, these works show our department shaping research across the field.

Ilias Diakonikolas headshot

Prof. Ilias Diakonikolas, who co-authored numerous contributions, remarked, “These works significantly advance our understanding of both theoretical and empirical aspects of machine learning, including algorithmic aspects of robustness, information-computational tradeoffs, foundations of deep learning, convex and non-convex optimization, out-of-distribution detection, reinforcement learning, and learning for image segmentation.”

A paper by Prof. Diakonikolas with D. Kane, J. Lee, A. Pensia, and T. Pittas, chosen for a spotlight presentation at NeurIPS, gives the first efficient algorithm for learning arbitrary mixtures of Gaussian distribution, even when the data contains a majority of outliers – data points that are not part of the distributions.

Jelena Diakonikolas headshot

Another paper by Prof. Jelena Diakonikolas, with coauthors D. Chakrabarti and C. Kroer, describes a class of algorithms for extensive-form games in which changes are made in only a subset of the variables at each step – so-called block coordinate methods. Although such methods are used widely in practice, their worst-case behavior is usually worse than methods that use gradient information to take steps in all the variables at once. The methods of this paper do not suffer from this defect. Moreover, the authors show that faster practical behavior can be obtained by restarting the algorithm periodically. 

Frederic Sala

Prof. Fred Sala, in collaboration with T.-Z. Huang and H. Vishwakarma, considers economic aspects of training of machine learning models of very large scale, which is often performed across different organizations, each with their own “silo” of data. The paper shows that faster training may result, to the benefit of all parties, when these entities trade subsets of model parameters during the training process.

A paper by Prof. Yong Jae Lee with H. Liu, C. Li, and Q. Wu, also spotlighted at the meeting, introduces LLaVA (Large Language and Vision Assistant), a large multimodal model for general visual and language understanding. Shown a cartoon image, LLaVa is able to discern the image to be a parody of the Mona Lisa. 

Sharon Li headshotProf. Sharon Li and her coauthors Q. Wang, Z. Fang, Y. Zhang, F. Liu, and B. Han tackle the issue of detecting out-of-distribution (OOD) data, a critical issue in the training and deployment of machine learning models. It considers that the hypothesized OOD data may differ from the OOD data actually seen in practice, and does a theoretical analysis of how this discrepancy affects the performance of the model in flagging OOD data. 

These are just a sample of the accepted papers. We congratulate all members of department with papers in NeurIPS and their collaborators for doing this important research in one of the most impactful fields of modern science!

All accepted papers in alphabetical order by first author (*=UW-Madison researcher):

  • Block-Coordinate Methods and Restarting for Solving Extensive-Form Games
    D. Chakrabarti, J. Diakonikolas*, C. Kroer

    The paper provides a cyclic block coordinate method for solving extensive-form games. The method is the first block coordinate-type method for such problems and the results are surprising from at least two aspects: (1) that it is possible at all to get convergence results for a block coordinate-type method, as the feasible set is not block separable, and (2) it is the first cyclic method that provably does not converge slower than full vector-update methods in the worst case (all other cyclic methods are slower by a factor scaling polynomially with the number of blocks in the worst case). The paper also provides a restarting strategy that empirically speeds up the algorithm.
  • Skill-it! A Data-driven Skills Framework for Understanding and Training Language Models [spotlight]
    Mayee F Chen, Nicholas Roberts, Kush Bhatia, Jue WANG, Ce Zhang, Frederic Sala*, Christopher Re
    Data selection and quality is a major driver of performance in large language models (LLMs). This paper develops a framework for organizing data as a set of skills that should be “taught” to the model in a particular order. It proposes an online data sampling approach for more quickly achieving skill development in LLMs and shows substantial improvements in both continual pretraining and fine-tuning.
  • Mechanism Design for Collaborative Normal Mean Estimation [spotlight]
    Yiding Chen*, Xiaojin Zhu*, Kirthevasan Kandasamy*
  • Efficient Testable Learning of Halfspaces with Adversarial Label Noise
    I. Diakonikolas*, D. Kane, V. Kontonis*, S. Liu, N. Zarifis*
    This paper gives the first efficient algorithm for learning linear separation with adversarial labe noise in the testable framework, in which we can trust the accuracy of our hypothesis independent of the assumptions. https://arxiv.org/abs/2303.05485
  • First Order Stochastic Optimization with Oblivious Noise
    I. Diakonikolas*, S. Karmalkar*, J. Park, C. Tzamos
    This paper gives the first efficient method for stochastic first order optimization with noise, broadly generalizing the heavy-tailed setting.
  • Near-Optimal Algorithms for Gaussians with Huber Contamination: Mean Estimation and Linear Regression
    I. Diakonikolas*, D. Kane, A. Pensia, T. Pittas*

    This paper revisits the classical problems of mean estimation and linear regression of a multivariate Gaussian in Huber’s contamination model. It gives optimal algorithms for these basic tasks with respect to sample size, runtime, and accuracy guarantees.
  • Near-Optimal Bounds for Learning Gaussian Halfspaces with Random Classification Noise
    I. Diakonikolas*, J. Diakonikolas*, D.M. Kane, P. Wang*, N. Zarifis*
    This paper provides matching upper and lower bounds for “efficient” algorithms addressing one of the most fundamental learning problems: linear classification where the data points follow standard Gaussian distribution and the labels are determined by a biased halfspace + random classification noise. The paper provides a statistical query (SQ) lower bound, which applies to all standard approaches for addressing this problem, that shows that there is an inherent information-computation gap for this problem: all polynomial-time SQ algorithms in the worst case need more samples than suggested by the information-theoretic lower bound. A surprising aspect of the upper bound is that we obtain an optimization algorithm (a variant of Riemannian gradient descent) that obtains linear convergence (with geometrically reducing error) for a nonconvex piecewise-linear optimization problem on a sphere.
  • A Spectral Algorithm for List-Decodable Covariance Estimation in Relative Frobenius Norm [spotlight]
    I. Diakonikolas*, D. Kane, J. Lee, A. Pensia, T. Pittas*
    This paper gives a fast and provable algorithm for list-decodable covariance estimation in the presence of a majority of outliers. It implies the first efficient algorithm for robustly learning arbitrary mixtures of Gaussians that does not require solving a large convex program.
  • SQ Lower Bounds for Learning Mixtures of Linear Classifiers
    I. Diakonikolas*, D. Kane, Y. Sun*

    This paper proves an information-computation tradeoff showing that known algorithms for learning mixtures of linear classifiers are essentially best possible.
  • SQ Lower Bounds for Non-Gaussian Component Analysis with Weaker Assumptions
    I. Diakonikolas*, D. Kane, L. Ren*, Y. Sun*
    This paper greatly strengthens a general methodology (developed by the first two named authors and Alistair Stewart in FOCS 2017) to establish information-computation tradeoffs for high-dimensional statistical tasks. As a corollary, it gives new such tradeoffs for a range of ML problems.
  • Dream the Impossible: Outlier Imagination with Diffusion Models
    Xuefeng Du*, Yiyou Sun, Xiaojin (Jerry) Zhu*, Yixuan (Sharon) Li*

    Generating photo-realistic outliers in the high dimensional pixel space has been an open challenge for the field.  To tackle the problem, this paper proposes a new framework DREAM-OOD, which enables imagining photo-realistic outliers by way of diffusion models, provided with only the in-distribution (ID) data and classes. https://arxiv.org/abs/2309.13415
  • Conditional Mutual Information for Disentangled Representations in Reinforcement Learning [spotlight]
    Mhairi Dunion, Trevor McInroe, Kevin Luck, Josiah Hanna*, Stefano Albrecht
  • DPOK: Reinforcement Learning for Fine-tuning Text-to-Image Diffusion Models
    Ying Fan, Olivia Watkins, Yuqing Du, Hao Liu, Moonkyung Ryu, Craig Boutilier, Pieter Abbeel, Mohammad Ghavamzadeh, Kangwook Lee*, Kimin Lee

    Online fine-tuning with KL regularization is generally superior to supervised fine-tuning with respect to both image-text alignment and image quality.
  • Doubly Robust Peer-To-Peer Learning Protocol
    Nicholas Franzese, Adam Dziedzic, Christopher A. Choquette-Choo, Mark R. Thomas, Muhammad Ahmad Kaleem, Stephan Rabanser,
    Congyu Fang, Somesh Jha*, Nicolas Papernot, Xiao Wang
    This paper uses techniques from secure multiparty computation (SMC) to make peer-to-peer learning (such as various federated learning algorithms) more robust to malicious participants.
  • Grounding Neural Inference with Satisfiability Modulo Theories [spotlight]
    Matt Fredrikson, Kaiji Lu, Somesh Jha*, Saranya Vijayakumar, Vijay Ganesh, Zifan Wang 

    This paper enhances grounding in ML models by incorporating satisfiability modulo solvers (SMT), such as Z3.
  • Weitzman’s Rule for Pandora’s Box with Correlations
    Evangelia Gergatsouli*, Christos Tzamos*
    Pandora’s Box is a central problem in decision making under uncertainty that can model various real life scenarios, which in this work we revisit under correlated value distributions. We give a learnable algorithm that directly generalizes the original algorithm for independent distributions, while having almost tight approximation guarantees and simplifying the algorithms given in previous work. 
  • Embroid: Unsupervised Prediction Smoothing Can Improve Few-Shot Classification
    Neel Guha, Mayee F Chen, Kush Bhatia, Azalia Mirhoseini , Frederic Sala* , Christopher Re
    This paper shows how to improve prompt-based learning in large language models without any additional labeled data. The technique relies on smoothing: it identifies and fixes output errors by examining the consistency between neighboring predictions under the embedding spaces of smaller language models.
  • Restless Bandits with Average Reward: Breaking the Uniform Global Attractor Assumption [spotlight]
    Yige Hong, Qiaomin Xie (UW-Madison), Yudong Chen (UW-Madison), Weina Wang
  • Train ‘n Trade: Foundations of Parameter Markets
    Tzu-Heng Huang*, Harit Vishwakarma*, Frederic Sala* 
    Large-scale models, such as foundation models, are typically trained in silos by different organizations. Recent work on merging models suggests that entities training different models could trade subsets of parameters and integrate these into their own pipelines to speed up training. This paper introduces and studies a framework that makes such economic approaches to training possible.
  • The Gain from Ordering in Online Learning
    Vasilis Kontonis*, Mingchen Ma*, Christos Tzamos*

    We study fixed-design online linear regression, where the learner is allowed to choose the order of the examples to minimize the total regret of prediction. Our goal is to understand if one can efficiently order the examples approximately optimally. On the one hand, we show if there is no structural assumption on the dataset, then under exponential time hypothesis, no polynomial time algorithm can approximately find the optimal order within a factor of d^(1/polyloglog d). On the other hand, we show if the dataset has some natural structure, for example, drawn uniformly from the unit sphere, then a simple greedy algorithm is able to find the nearly optimal order efficiently.
  • Robust Second-Order Nonconvex Optimization and Its Application to Low Rank Matrix Sensing
    S.Li*, Y. Cheng, I. Diakonikolas*, J. Diakonikolas*, R. Ge, S. Wright*
    This paper provides the first algorithm for finding second-order stationary points in  nonconvex optimization in the outlier-robust setting, and with dimension-independent error guarantees. It further demonstrates how these general results can be applied to low rank matrix sensing problems, where outliers are likely to arise and second order stationarity suffices as a criterion of convergence.
  • Dissecting Chain-of-Thought: Compositionality through In-Context Filtering and Learning
    Yingcong Li, Kartik Sreenivasan*, Angeliki Giannou*, Dimitris Papailiopoulos*, Samet Oymak
    Chain-of-thought (CoT) is a method that enables language models to handle complex reasoning tasks by decomposing them into simpler steps. Our study investigates the impact of CoT on the ability of transformers to in-context learn a simple to study, yet general family of compositional functions: multi-layer perceptrons (MLPs). We find that the success of CoT can be attributed to breaking down in-context learning of a compositional function into two distinct phases: focusing on and filtering data related to each step of the composition and in-context learning the single-step composition function. Through both experimental and theoretical evidence, we demonstrate how CoT significantly reduces the sample complexity of in-context learning (ICL) and facilitates the learning of complex functions that non-CoT methods struggle with.
  • Visual Instruction Tuning [oral]
    Haotian Liu* (equal contribution), Chunyuan Li (equal contribution), Qingyang Wu, Yong Jae Lee* 

    This paper introduces LLaVA (Large Language and Vision Assistant), a large multimodal model that connects a vision encoder and LLM for general-purpose visual and language understanding.

    A dog is pictured as Mona Lisa.
    LLaVA is able to explain that the artwork is a parody of the Mona Lisa. Credit: Oliver Gal Mona Lisa Pet
  • Multi-task Representation Learning for Pure Exploration in Bilinear Bandits
    Subhojyoti Mukherjee*, Qiaomin Xie*, Josiah Hanna*, Robert Nowak*
  • Visual Instruction Inversion: Image Editing via Image Prompting
    Thao Nguyen, Yuheng Li*, Utkarsh Ojha*, Yong Jae Lee*

    This paper presents a framework for inverting visual prompts (i.e., before-and-after reference images) into editing instructions for controllable text-to-image generation.
  • Dissecting Knowledge Distillation: An Exploration of its Inner Workings and Applications
    Utkarsh Ojha* (equal contribution), Yuheng Li* (equal contribution), Anirudh Sundara Rajan* (equal contribution), Yingyu Liang*, Yong Jae Lee* 

    This paper presents a comprehensive study to try to answer the following question: what is the knowledge that gets distilled in knowledge distillation? That is, in what ways does the student become similar to the teacher? Does it start to localize objects in the same way? Does it get fooled by the same adversarial samples? Does its data invariance properties become similar?
  • State-Action Similarity-Based Representations for Off-Policy Evaluation
    Brahma Pavse*, Josiah Hanna*
  • Geometry-Aware Adaptation for Pretrained Models
    Nicholas Roberts*, Xintong Li, Dyah Adila*, Sonia Cromp*, Tzu-Heng Huang*, Jitian Zhao*, Frederic Sala*

    This paper studies ways to modify pretrained models by exploiting the geometry of the output space when a relational structure is provided (or obtained from the model itself). It replaces standard argmax prediction with a Frechet mean-based estimator that has favorable theoretical properties and improves zero-shot prediction in pretrained models like CLIP.
  • Provable Guarantees for Neural Networks via Gradient Feature Learning
    Zhenmei Shi*, Jenny Wei*, Yingyu Liang*

    This paper proposes a unified analysis framework for two-layer networks trained by gradient descent, including several existing analyses as special cases. It also sheds light on several interesting network learning phenomena.
  • Mitigating Source Bias for Fairer Weak Supervision
    Changho Shin*, Sonia Cromp*, Dyah Adila*, Frederic Sala*

    Weak supervision is an approach to crafting high-quality pseudolabels for training datasets by learning properties of and combining estimated labels from noisy, low-quality sources. While popular and effective, weak supervision-based labeling can result in biased or unfair pseudolabels, leading to biased downstream models. This is the first paper to study and mitigate unfairness in weak supervision, doing so via an optimal transport approach that also leads to state-of-the-art results on several weak supervision benchmarks.
  • A Graph-Theoretic Framework for Understanding Open-World Representation Learning [spotlight]
    Yiyou Sun*, Zhenmei Shi*, Yixuan (Sharon) Li*

    Open-world representation learning aims at inferring both known and novel classes in unlabeled data, by harnessing prior knowledge from a labeled set with known classes.  Despite its importance, there is a lack of theoretical foundations for this problem. This paper bridges the gap by formalizing a graph-theoretic framework tailored for the open-world setting, where the clustering can be theoretically characterized by graph factorization. Based on our graph formulation, we introduce a new algorithm called Spectral Open-world Representation Learning (SORL), and show that minimizing our loss is equivalent to performing spectral decomposition on the graph. Such equivalence allows us to derive a provable error bound on the clustering performance for both known and novel classes, and analyze rigorously when labeled data helps.
  • Promises and Pitfalls of Threshold-based Auto-labeling [spotlight]
    Harit Vishwakarma*, Heguang Lin*, Frederic Sala*, Ramya Vinayak*

    Auto-labeling, a form of self-training, is used to minimize manual labeling when creating datasets and has been adopted in industry, including in Amazon’s SageMaker Ground Truth. This work is the first to study the theoretical foundations of threshold-based auto-labeling, yielding two insights: on the positive side, weak models can be used with auto-labeling while still providing large and reliable output datasets, while on the negative side, auto-labeling often requires substantial amounts of labeled validation data.
  • Learning to Augment Distributions for Out-of-distribution Detection
    Qizhou Wang, Zhen Fang, Yonggang Zhang, Feng Liu, Yixuan (Sharon) Li*, Bo Han

    This paper derives learning theory analyzing the distribution discrepancy between the auxiliary and the unseen real OOD data, and how it affects the open-world OOD detection performance.
  • Segment Everything Everywhere All at Once
    Xueyan Zou*, Jianwei Yang, Hao Zhang, Feng Li, Linjie Li, Jianfeng Wang, Lijuan Wang, Jianfeng Gao, Yong Jae Lee*

    This paper presents an interactive model for segmenting images.  The model can be prompted with visual prompts (points, marks, boxes, scribbles and image segments) and language prompts (text and audio).
  • Algorithm Selection for Deep Active Learning with Imbalanced Datasets
    Jifan Zhang*, Shuai Shao, Saurabh Verma, Robert Nowak*

    Active Learning studies how AI best directs humans to provide feedback and annotation. Yet these algorithms have wildly different performances on new datasets, making it impossible to pick the best algo to use in practice. We now have a solution.