Congratulations, all, on earning your PhDs! Dissertations are listed by date of completion.

If you’ve completed your PhD in Computer Sciences at UW-Madison, please contact communications@cs.wisc.edu and we will add you to this list. Thank you!

## 2021

This is an accordion element with a series of buttons that open and close related content panels.

## Alnammi, Moayad: Iterative Batched Screening and Active Learning in Drug Discovery

Abstract:

Finding safe and effective drugs in the form of compounds is a long and arduous campaign that can stretch to more than a decade. This is mainly exacerbated by the exponential amount of available compounds. The effectiveness of these compounds can only be confirmed via physical testing or screening. Instead of physically screening all available compounds, virtual screening is the use of computational methods to help in carefully selecting compounds for physical screens. In this thesis we explore three virtual screening options for hit finding via four projects. When historical screening data is available for a target of interest, one can leverage virtual screening methods that model the compound-target interaction data. Given a compound budget and a prospective pool, a question of interest is the effectiveness of such methods in prioritizing hit selection in a followup screen. In such a setting, which we name one round screening, we showcase in chapters 2 and 3 that such methods can succeed in finding hits that far exceed the expectation at random, even when the prospective pool exceeds a billion compounds. An alternative is iterative batched screening, whereby the compound budget is split across multiple rounds of screening. The focus in this setting is to develop a compound selection strategy with the goal of maximizing hits and the diversity of these hits at the end of the iterations. In chapter 5, we formalize this setting and propose a cluster based selection strategy that incorporates concepts from active and reinforcement learning. Following a sequence of rigorous retrospective experiments, a final strategy is used in an iterative screen on a prospective target that mimics a realistic setting. The results of this prospective iterative screen were satisfying, finding 20.28% of the hits in just 4.25% of the prospective pool. When no screening data is available for a target of interest, we focus on a particular setting where there exists compound interaction data on related targets. The goal is to leverage the related targets’ data to prioritize hits for the target of interest. This problem, dubbed the informer set problem, was introduced by Zhang et al. In chapter 4, we extend their work by providing solutions that were adapted from related fields like matrix completion and machine learning. As a further extension, we apply three scalable methods to a large dataset with missing values ranging from 9,000 to 74,000 compounds and 128 heterogeneous targets. The results expose the difficulties that can arise with such large datasets, however, the insights gained are valuable for directing future works. In each of these virtual screening options, we formalize the computational problems and objectives. We showcase the success of proposed solutions in realistic and prospective evaluations. In a one round screen, the objective of maximizing hits and the diversity of these hits in a followup screen is straightforward. However, in iterative batched screening, formalism is needed to give rise to a clear framework for developing strategies. This formalism enables us to compare previous iterative screening studies which motivated the development of our proposed novel strategy. Finally, with the continuous growth of online screening data repositories, there is interest in judging the effectiveness of informer set methods on large and incomplete datasets from diverse targets.

## Mahajan, Kshiteej: Fair and Efficient GPU Cluster Scheduling

Abstract:

Modern distributed machine learning (ML) training workloads benefit significantly from leveraging GPUs. However, significant contention ensues when multiple such workloads are run atop a shared cluster of GPUs. A key question is how to fairly apportion GPUs across workloads. We find that established cluster scheduling disciplines are a poor fit because of ML workloads’ unique attributes: ML jobs have long-running tasks that need to be gang-scheduled, and their performance is sensitive to tasks’ relative placement. We propose THEMIS, a new scheduling framework for ML training workloads. It’s GPU allocation policy enforces that ML workloads complete in a finish-time fair manner, a new notion we introduce. To capture placement sensitivity and ensure efficiency, THEMIS uses a two-level scheduling architecture where ML workloads bid on available resources that are offered in an auction run by a central arbiter. Our auction design allocates GPUs to winning bids by trading off fairness for efficiency in the short term, but ensuring finish-time fairness in the long term. Our evaluation on a production trace shows that THEMIS can improve fairness by more than 2.25X and is ~5% to 250% more cluster efficient in comparison to state-of-the-art schedulers.

## 2020

This is an accordion element with a series of buttons that open and close related content panels.

## Abashkumar, Anubhavnidhi: Towards Automated Network Policy Management

Abstract:

With the rapid and widespread growth of distributed computing, modern networks have become increasingly complex. To manage these networks, operators deploy a diverse set of traffic forwarding policies. Traditionally to deploy these policies, operators would manually configure each individual device. Today’s networks support a large number of devices. Some networks are highly complex as they support a variety of distributed routing protocols (e.g. BGP, OSPF, etc) and filtering mechanisms (e.g. ACLs, route filter, etc). Detecting policy violations by manually analyzing large number of devices and their complex configurations is not trivial. Manually updating or synthesizing the network to be policy-compliant is an even harder problem. In this dissertation, we present an array of automated verification and synthesis tools that will take us closer to the goal of automated network policy management. As our first contribution, we present a synthesis tool for Software Defined Networks, called Janus. Janus first extends a policy graph abstraction to support performance/QoS (e.g. bandwidth) and dynamic (stateful and temporal) policies. Next, it develops a set of Integer Linear Programs (ILP) to configure these policies. Our second contribution is a verification tool for legacy networks, called Tiramisu. Verification tools often make a tradeoff between performance and generality. By using rich, multi-layered graph abstraction with algorithmic choices that are customized per policy, Tiramisu can achieve both high performance and generality. Finally, our third contribution is a synthesis tool for legacy networks, called AED. AED is the first synthesis tool that ensures its policy-compliant updates also support multiple operator-defined network management objectives (e.g. minimize devices changes, avoid updating certain devices or features, etc). AED uses a novel Satifiability Modulo Theory (SMT) formulation that encodes both the structure and semantics of the network configurations.

## Breck, Jason: Enhancing Algebraic Program Analysis

Abstract:

Many programs have important functional-correctness properties that are most naturally expressed as sophisticated mathematical relationships between numerical variables. In addition to functional-correctness properties, many programs have important numerical properties that characterize their worst-case usage of resources such as time or memory. Proving either of those kinds of properties is a challenge for automated reasoning systems, because of both theoretical difficulties (e.g., the undecidability of non-linear integer arithmetic) and practical difficulties. Classical approaches to these problems include template-based methods, as well as iterative, fixed-point-finding analyzers based on abstract interpretation and abstraction refinement. This dissertation applies an alternative framework, called algebraic program analysis, to the problem of proving numerical properties of programs. In this framework, the main steps of the analysis of a program are encapsulated by an algebraic structure (i.e., a carrier set and a collection of operations), and results are obtained by evaluating expressions that are constructed from the operations of that algebraic structure. More specifically, this dissertation builds upon an instance of that framework called Compositional Recurrence Analysis (CRA), in which the loops of a program are analyzed by finding and solving recurrence relations. This dissertation explores several lines of research that enhance CRA. The first line of research produced a new analysis technique that we call Interprocedural Compositional Recurrence Analysis (ICRA). ICRA applies recurrence-solving in a uniform way to both (i) loops, and (ii) linearly recursive procedures, i.e., procedures that make at most one recursive call along any path through the procedure body. Furthermore, ICRA analyzes non-linearly recursive procedures using a technique that borrows ideas from Newton’s method and Gauss-Jordan elimination. Experiments show that ICRA, when applied to a collection of assertion-checking tasks, has broad overall strength compared with several state-of-the-art software model checkers. The second line of research improves upon ICRA’s analysis of non-linear mathematical relationships, allowing it to find invariants that include polynomial, exponential, and logarithmic relationships between program variables. The enhanced version of ICRA handles a wider class of recurrence relations (i.e., C-finite recurrences). One component of this enhancement is a new abstract domain called the wedge abstract domain, xi which is a modified version of the abstract domain of polyhedra in which the polyhedra have additional dimensions that correspond to non-linear terms over program variables. The operations of the wedge domain use Gröbner-basis methods for handling polynomial equations, and a collection of inference rules for handling non-linear inequalities. Another component of the enhancement is a new recurrence solver based on the operational calculus. Experiments show that the enhanced version of ICRA is able to prove non-linear assertions, and bounds on a program’s usage of resources, such as time or memory. The third line of research improves upon ICRA’s ability to analyze non-linearly recursive procedures, such as divide-and-conquer algorithms. This enhancement combines two streams of ideas from the literature on generating numerical invariants, namely: (1) template-based methods, and (2) recurrence-based methods. The new analysis technique uses a new kind of template, which we call a hypothetical summary, in which the unknowns are functions, and the analyzer finds and solves recurrence constraints on those unknowns. Experiments show that this analysis technique is effective at proving assertions and finding resource-usage bounds for non-linearly recursive procedures. For instance, it is able to show that (i) the time taken by merge-sort is O(n log(n)), and (ii) the time taken by Strassen’s algorithm is O(n log2 (7) ).

## Brown, David Bingham: Mutation Testing: Algorithms and Applications

Abstract: Software continues to be vital to the modern world, and as its ubiquity increases, its correctness becomes ever more valuable. Unfortunately, fundamental mathematical constraints on static analysis preclude the possibility of easy answers to the problem of correctness. Modern software engineering relies upon software testing

## Chowdhury, Amrita Roy: Crypto-Assisted Differential Privacy on Untrusted Servers

Abstract:

Differential privacy (DP) is currently the de-facto standard for achieving privacy in data analysis, which is typically implemented either in the “central” or “local” model. The local model has been more popular for commercial deployments as it does not require a trusted data collector. This increased privacy, however, comes at the cost of utility and algorithmic expressibility as compared to the central model. In this work, we propose, Cryptϵ, a system and programming framework that (1) achieves the accuracy guarantees and algorithmic expressibility of the central model (2) without any trusted data collector like in the local model. Cryptϵ achieves the “best of both worlds” by employing two noncolluding untrusted servers that run DP programs on encrypted data from the data owners. In theory, straightforward implementations of DP programs using off-the-shelf secure multi-party computation tools can achieve the above goal. However, in practice, they are beset with many challenges like poor performance and tricky security proofs. To this end, Cryptϵ allows data analysts to author logical DP programs that are automatically translated to secure protocols that work on encrypted data. These protocols ensure that the untrusted servers learn nothing more than the noisy outputs, thereby guaranteeing DP (for computationally bounded adversaries) for all Cryptϵ programs. Cryptϵ supports a rich class of DP programs that can be expressed via a small set of transformation and measurement operators followed by arbitrary post-processing. Further, we propose performance optimizations leveraging the fact that the output is noisy. We demonstrate Cryptϵ’s practical feasibility with extensive empirical evaluations on real world datasets.

## Drews, Samuel: Fairness, Correctness, and Automation

Abstract:

As software permeates our world in every imaginable way, from healthcare to policing to autonomous vehicles, algorithms play an expanding role in shaping our everyday lives. Indeed, automated decision-making, particularly in the form of machine learning, has profound transformative potential on our society; this power is accompanied by increasing concerns about the safety and fairness of the machine-learned models involved. We argue that such impactful or safety-critical settings necessitate exact guarantees about the quality of the systems involved. Accordingly, this dissertation tackles these problems using classic ideas from formal methods, since the artifacts involved—the learning algorithms and the models—are ultimately still programs. First, we attend to the interface between humans and machine-learned models: The means by which practitioners create models is to provide curated training examples to a learning algorithm; however, it is often unclear how subtleties in the training data translate into program behavior. As a cursory step towards formally, automatically reasoning about the relation between training data input and model output, we focus specifically on data-poisoning attacks, a problem in adversarial machine learning. We present Antidote, a tool that verifies if the composition of decision-tree learning and classification is robust to small perturbations in the training data. Antidote uses abstract interpretation to symbolically train all possible trees for an intractably large space of possible datasets, and in turn, to determine all possible classifications for a given input. Next, we turn our attention to the notion of algorithmic fairness. The research community has debated a wide variety of fairness definitions; we propose to express such definitions formally as probabilistic program properties. With the goal of enabling rigorous reasoning about fairness, we design a novel technique for verifying probabilistic properties that ix admits a wide class of decision-making programs, implemented in a tool we call FairSquare. FairSquare is the first verification tool for automatically certifying that a program meets a given fairness property. Our evaluation demonstrates FairSquare’s ability to verify fairness for a range of different machine-learned programs. Finally, in the event that a program is provably unfair (or more generally, that a program does not meet some probabilistic correctness property), we turn our attention to the problem of program synthesis for probabilistic programs: automatically constructing a program to meet a probabilistic specification of its desired behavior. We propose distribution-guided inductive synthesis (digits), a novel technique that (i) iteratively calls a synthesizer on finite sets of samples from a given distribution and (ii) verifies that the resulting program is both correct and minimally different from a target program (e.g. an unfair model to be repaired). digits has strong convergence guarantees rooted in computational learning theory; our evaluation shows that digits is indeed able to synthesize a range of programs, including repairs to those that FairSquare verified were unfair.

## Ganesan, Aishwarya: Consistency-Aware Durability

Abstract:

Modern distributed storage systems are emerging as the primary choice for storing massive amounts of critical data that we generate today. A central goal of these systems is to ensure data durability, i.e., these systems must keep user data safe under all scenarios. To achieve high levels of durability, most modern systems store redundant copies of data on many machines. When a client wishes to update the data, the distributed system takes a set of actions to update these redundant copies, which we refer to as the system’s durability model. At one end of the durability model spectrum, data is immediately replicated and persisted on many or all servers. While this immediate durability model offers strong guarantees, it suffers from poor performance. At the other end, data is only lazily replicated and persisted, eventually making it durable; this approach provides excellent performance but poor durability guarantees. The choice of durability model also influences what consistency models can be realized by the system. While immediate durability enables strong consistency, only weaker models can be realized upon eventual durability. Thus, in this dissertation, we seek to answer the following question: is it possible for a durability model to enable strong consistency guarantees, yet also deliver high performance? In the first part of this dissertation, we study the behavior of eight xxi popular modern distributed systems and analyze whether they ensure data durability when the storage devices on the replicas fail partially, i.e., sometimes return corrupted data or errors. Our study reveals that redundancy does not provide fault tolerance; a single storage fault can result in catastrophic outcomes such as user-visible data loss, unavailability, and spread of corruption. In the second part, to address the fundamental tradeoff between consistency and performance, we propose consistency-aware durability or Cad, a new way to achieving durability in distributed systems. The key idea behind Cad is to shift the point of durability from writes to reads. By delaying durability upon writes, Cad provides high performance; however, by ensuring the durability of data before serving reads, Cad enables the construction of strong consistency models. Finally, we introduce cross-client monotonic reads, a novel and strong consistency property that provides monotonic reads across failures and sessions. We show that this property can be efficiently realized upon Cad, while other durability models cannot enable this property with high performance. We also demonstrate the benefits of this new consistency model.

## Govind, Yash: CloudMatcher: Toward a Cloud Service for Entity Matching

Abstract:

As data science applications proliferate, more and more lay users must perform data integration (DI) tasks, which used to be done by sophisticated CS developers. Thus, it is increasingly critical that we develop hands-off DI services, which lay users can use to perform such tasks without asking for help from developers. We propose to demonstrate such a service. Specifically, we will demonstrate CloudMatcher, a hands-off cloud/crowd service for entity matching (EM). To use CloudMatcher to match two tables, a lay user only needs to upload them to the CloudMatcher’s Web page then iteratively label a set of tuple pairs as match/no-match. Alternatively, the user can enlist a crowd of workers to label the pairs. In either case, the lay user can easily perform EM end-to-end without having to involve any developers. CloudMatcher has been used in several domain science projects at UW-Madison and at several organizations, and is scheduled to be deployed in a large company in Summer 2018. In the demonstration we will show how easy it is for lay users to perform EM (either via interactive labeling or crowdsourcing), how users can easily create and experiment with a range of EM workflows, and how CloudMatcher can scale to many concurrent users and large datasets.

## Jang, Uyeong: On the Classification Problem against Adversarial Examples

Abstract:

This dissertation explores the problem of robust classification against adversarial examples. While modern classification algorithms achieve high accuracy on innocuous samples, they show significantly lower performances after adding imperceptibly small perturbations, so-called adversarial perturbations, to those innocuous samples. Designing a robust classification against such adversarial examples is a crucial topic in machine learning community to make machine learning algorithms applicable in real-world situations. To deeply understand adversarial examples, this dissertation delves into two different aspects about this topic – attack side and defense side. On the attack side, we focus on attack algorithms generating adversarial examples for a given classifier. To understand how adversarial perturbations are generated, we design a new gradient descent algorithm – an algorithm that uses the gradient of a given classifier with respect to the input. Based on Newton’s method, our algorithm achieves quicker convergences to adversarial points. Also, to reflect human imperceptibility, we propose a new set of metrics that are based on contemporary computer vision techniques. By measuring the performance of various attacks with these metrics, we empirically demonstrate that our new algorithm is comparably effective in fooling a classification algorithm. On the first part of the defense side, our main goal is to improve the robustness of classification algorithms by reinforcing existing defense strategies. We first start from a well-known formulation of adversarial training by Madry et al. (2017). From a thorough analysis, we figure out that, under a various instantiations the formulation, the confidence of a classifier on its prediction can be exploited as a discriminator between correct classifications and wrong classifications. Based on this result, we propose a framework that reinforces the adversarial robustness of a given base classifier and experimentally estimate the performance of the framework to support our analysis. On the second part of the defense side, we change our focus to another defense strategy that makes use of the manifold assumption. In this manifold-based defense, classification is made after a given sample is “pulled back” into the data manifold. The data manifold is usually approximated by generative models, however, most of those generative models are ignorant of the geometry/topology of the data manifold. We propose a topology-aware training method, for generative models, such that the distribution of generative models can reflect the prior knowledge on topological information of data manifold. We empirically verified that, after applying the new training, the distribution of generative model outputs reflects the topology of data manifold. Also, experiments show that the performance of manifold-based defense can be improved when generative models are trained with topology-aware training.

## Liu, Tianyu: Approximate Complexity in Statistical Mechanics: Counting and Sampling in the Six- and Eight-Vertex Models

Abstract:

The six- and eight-vertex models originate in statistical mechanics for crystal lattices with hydrogen bonds. The rst such model was introduced by Linus Pauling in 1935 to account for the residual entropy of water ice. The family of models not only are among the most extensively studied topics in physics, but also have fascinated chemists, mathematicians, theoretical computer scientists, and others, with thousands of papers studying their properties and connections to other elds. In this dissertation, we study the computational complexity of approximately counting and sampling in the six- and eight-vertex models on various classes of underlying graphs. First, we study the approximability of the partition function on general 4-regular graphs, classied according to the parameters of the models. Our complexity results conform to the phase transition phenomenon from physics due to the change in temperature. We introduce a quantum decomposition of the six- and eight-vertex models and prove a set of closure properties in various regions of the parameter space. These regions of the parameter space are concordant with the phase transition threshold. Using these closure properties, we derive polynomial time approximation algorithms via Markov chain Monte Carlo in some parameter space in the high temperature regime. In some other parameter space in the high temperature regime, we prove that the problem is (at least) as hard as approximately counting perfect matchings, a central open problem in this eld. We also show that the six- and eight-vertex models are NP-hard to approximate in the whole low temperature regime on general 4-regular graphs. We then study the six- and eight-vertex models on more restricted classes of 4-regular graphs, including planar graphs and bipartite graphs. We give the rst polynomial time approximation algorithm for the partition function in the low temperature regime on planar and on bipartite graphs. Our results show that the six- and eight-vertex models are the rst problems with the provable property that while NP-hard to approximate on general graphs (even #P-hard for planar graphs in exact complexity), they possess ecient approximation schemes on both bipartite graphs and planar graphs in substantial regions of the parameter space. Finally, we study the square lattice six- and eight-vertex models. We prove that natural Markov chains for these models are mixing torpidly in the low temperature regime. Moreover, we give the rst ecient approximate counting and sampling algorithms for the six- and the eight-vertex models on the square lattice at suciently low temperatures.

## Le, Yanfang: Improving Datacenter Performance with Network Offloading

Abstract:

There has been a recent emergence of distributed systems in datacenters, such as MapReduce and Spark for data analytics and TensorFlow and PyTorch for machine learning. These frameworks are not only computation and memory intensive, they also place high demands on the network for distributing data. The fast-growing Ethernet speed mitigates the high demand a bit. However, as Ethernet speed outgrows the CPU processing power, it not only requires us to rethink the existing algorithms for different network layers, but also provides opportunities to innovate with new application designs, such as datacenter resource disaggregation and in-network computation applications.

The fast network devices come with a programmability feature, which enables offloading computation tasks from CPU to NICs or switches. *Network offloading* to programmable hardware is a promising approach to help relieve processing pressure on the CPU for computation-intensive applications, e.g., Spark, or reduce the network traffic for network-intensive applications, e.g., TensorFlow. However, leveraging programmable hardware effectively is challenging due to the limited memory capacity and restricted programming model. In order to understand how to leverage the advantage of network offloading in developing new network stacks, network protocols, and applications, the following question needs to be answered: *how to do judicious division between the programmable hardware and software for network offload given limited resources and restricted programming models?*

Driven by the real application demand while exploring the answer to this question, we first propose RoGUE, a new congestion control and recovery mechanism for RDMA over Converged Ethernet that does not rely on PFC while preserving the benefits of running RDMA, i.e., low CPU and low latency. To preserve the low CPU benefit, RoGUE offloads packet pacing to the NIC.

Though RoGUE achieves better performance in extensive testbed evaluations, the architecture for optimal congestion control should be a centralized packet scheduler, which has global visibility into packet reservation requests from all the servers. Given all the hosts are connected through switches and the emerging programmable switch hardware can have stateful objects, we designed a centralized packet scheduler at the switch, called PL2, to provide stable and near-zero-queuing in the network by proactively reserving switch buffers for packet bursts in the appropriate time-slots.

Congestion control is an essential component in the networking stack because application demand for the network is higher than link speed. To eliminate the net- work congestion control, the fundamental solution is reducing the network traffic such that the application demand for the network is no more than link speed. We observed that we are able to reduce the network traffic for distributed training sys- tems by offloading a critical function, gradients aggregation, to the programmable switch. Each worker in the distributed training system sends gradients over the network to special components, parameter servers, to do aggregation, which is a simple *add *operator. Thus, we propose ATP, a network service for in-network aggregation aimed at modern multi-rack, multi-job DT settings. ATP performs *decentralized, dynamic, best-effort* aggregation, enables efficient and equitable sharing of limited switch resources across simultaneously running DT jobs, and gracefully accommodates heavy contention for switch resources.

## O'Neill, Michael: Algorithms for Smooth Nonconvex Optimization with Worst-Case Guarantees

Abstract:

Abstract: The nature of global convergence guarantees for nonconvex optimization algorithms has changed significantly in recent years. New results characterize the maximum computational cost required for algorithms to satisfy approximate optimality conditions, instead of focusing on the limiting behavior of the iterates. In many contexts, such as those arising from machine learning, convergence to approximate second order points is desired. Algorithms designed for these problems must avoid saddle points efficiently to achieve optimal worst-case guarantees.

In this dissertation, we develop and analyze a number of nonconvex optimization algorithms. First, we focus on accelerated gradient algorithms and provide results related to the avoidance of “strict saddle points”. In addition, the rate of divergence these accelerated gradient algorithms exhibit when in a neighborhood of strict saddle points is proven. Subsequently, we propose three new algorithms for smooth, nonconvex optimization with worst-case complexity guarantees. The first algorithm is developed for unconstrained optimization and is based on the classical Newton Conjugate Gradient method. This approach is then extended to bound constrained optimization by modifying the primal-log barrier method. Finally, we present a method for a special class of “strict saddle functions” which does not require knowledge of the parameters defining the optimization landscape. These algorithms converge to approximate second-order points in the best known computational complexity for their respective problem classes.

## Sen, Ayon: Learning with the Help of a Teacher

Abstract: Machine learning has been playing an important role in our lives for quite a while now. It has been used for a multitude of applications ranging from voice recognition to disease detection. In traditional machine learning, a learning algorithm is presented with a large set of data and it is the task of the algorithm to find a model that explains the data best. This process, even though useful, can be very time consuming. With the help of a teacher the time required to train a learner can be reduced drastically. Training a machine learner with the help of a teacher is known as optimal control for machine learning (also known as machine teaching), and is applicable in a multitude of domains including education and security. In this thesis, we will explore some such usefulness of optimal control. We first provide evidence that optimal control can be helpful to teach humans, both for learning perceptual fluency in chemistry and the pronunciation of written words. Our research also verifies that an artificial neural network can be useful as a cognitive model for humans. Then we present a new type of attack on machine learners called training set camouflage. In this attack a malicious agent can train a publicly available learner on a sensitive task with the help of a benign looking training set. The malicious agent does not raise any suspicion while passing this training set. Finally, we show a comparative analysis of the different metrics used in adversarial machine learning. In particular, through human experiments we show that even though none of the metrics currently used in practice is suitable, pixel 3-norm provides the best approximation. The research works presented in this thesis show how a teacher can play an important and useful role to train a machine learning algorithm.

## Shao, Shuai: From Holant to Quantum Entanglement and Back

Abstract:

Holant problems are intimately connected with quantum theory as tensor networks. We first use techniques from Holant theory to derive new and improved results for quantum entanglement theory. We discover two particular entangled states |Ψ6i of 6 qubits and |Ψ8i of 8 qubits respectively, that have extraordinary and unique closure properties in terms of the Bell property. Then we use entanglement properties of constraint functions to derive a new complexity dichotomy for all real-valued Holant problems containing an odd-arity signature. The signatures need not be symmetric, and no auxiliary signatures are assumed.

## Smith, Calvin: Program Synthesis for Data Analysis: Scalability and Privacy

Abstract:

This dissertation expands the applicability of program synthesis to data analysis, with a focus on scalability and privacy. Data is abundant and empowering, but barriers-to-access such as specialized technical knowledge or privacy requirements limit access. Program synthesis—the automatic generation of programs satisfying user intent—enables non-technical users to circumvent these barriers.

To provide access to large-scale data analysis techniques, we introduce BigΛ, a synthesis tool that utilizes *higher-order sketches* to generate scalable MapReduce programs from input-output examples. Higher-order sketches divide the work between mappers and reducers, and we introduce a verification technique that proves if a reducer is associative and commutative: a sufficient condition for the pipeline to be robust in the face of network-induced non-determinism. We demonstrate the efficacy of BigΛ by synthesizing a host of data analysis benchmarks on real-word data sets from a small number of examples.

We then study the problem of automatically synthesizing privacy-respecting programs using Zinc, a tool that automatically synthesizes probabilistic and provably-private programs. We base our technique on an effective inversion of the *linear dependent type system* DFuzz that tracks the resources consumed by a program, and hence its privacy cost. Zinc directs the synthesis towards programs satisfying a given privacy budget by using *symbolic context constraints* and *subtyping constraint abduction* to reason about the privacy-preserving behavior of partial programs. We then show Zinc’s ability to automatically synthesize privacy-preserving data analysis queries, as well as recursive differential privacy mechanisms from the literature.

To reason about the properties of probabilistic programs, such as those produced by Zinc, we introduce *trace abstraction modulo probability*, a proof technique for verifying high-probability guarantees. Our proofs over-approximate the set of program traces using *failure automata*, finite-state automata that upper-bound the probability of failing to satisfy a specification. We automate proof construction by a synthesis-enabled reduction of probabilistic reasoning to logical reasoning, which allows us to apply classic interpolation-based proof techniques. We evaluate our proof technique by proving properties of probabilistic programs drawn from the differential privacy literature. Our evaluation is the first instance of automatically-established accuracy properties—which contain symbolic inputs, parameterized distributions, and infinite state spaces—for these algorithms.

Finally, we discuss program synthesis with *equivalence reduction*, a methodology that utilizes equational specifications over a given synthesis domain to reduce the search space and improve performance. By leveraging classic and modern techniques from term rewriting, we use equations to induce a canonical representative per equivalence class of programs. We show how to design synthesis procedures that only consider canonical representatives, thus pruning the search space. We conclude by illustrating how to implement equivalence reduction using efficient data structures, and demonstrating the significant reductions it can achieve in overall synthesis time.

## Welton, Benjamin: The Feed Forward Measurement Model: A Performance Measurement Model for CPU/GPU Architectures

Abstract:

From mobile computing to the largest leadership-class supercomputers, many-core accelerators are relied upon to provide the computational ca- pabilities necessary to make today’s applications possible. Machine learn- ing, image processing, n-body simulations, and a host of other applica- tions are increasingly reliant on the computational capability provided by many-core accelerators to achieve the performance necessary to target real-world problems. For modern applications, especially those running on leadership-class supercomputers where the number of GPUs can out- number traditional CPUs three to one, effective exploitation of many-core compute resources is a must to achieve high efficiency. Effectively exploit- ing the additional parallelism provided by many-core accelerators requires developers identify where accelerator parallelization would provide bene- fit and ensure efficient interaction between the CPU and accelerator. While these issues appear straightforward and well understood, we have found that significant untapped performance opportunities still exist even in well-studied, heavily optimized, real world applications created by experi- enced developers. By addressing hidden performance opportunities, we were able to reduce execution time by up to 87% for the applications we have tested.

In this dissertation, we develop a new performance measurement and modeling technique called the feed-forward measurement model (FFM) that exposes previously hidden performance opportunities and delivers actionable feedback to developers on the potential benefit if the problem were corrected. We first explore a set of real-world applications to iden- tify hidden performance opportunities common among them, developing techniques that tools can implement to detect their presence. FFM refines and expands these detection techniques by including a model that can estimate the performance benefit of fixing these problems. We have created a performance tool called Diogenes that implements FFM to test its effectiveness on real world applications. The result was the discovery of problems that, when fixed, reduced execution time by up 17% in appli- cations we tested. Last, we expanded FFM to give guidance on how to remedy the problems it identified and the ability to automatically apply remedies to the program. By automatically correcting problems, FFM was able to reduce execution time of tested applications by up to 43%.

## Zhang, Haojun: Toward Building Entity Matching Management Systems

Abstract:

Entity matching (EM) has been a long-standing challenge in data management. Most current EM works focus only on developing matching algorithms. We argue that far more efforts should be devoted to building EM systems. We discuss the limitations of current EM systems, then present as a solution Magellan, a new kind of EM systems. Magellan is novel in four important aspects. (1) It provides how-to guides that tell users what to do in each EM scenario, step by step. (2) It provides tools to help users do these steps; the tools seek to cover the entire EM pipeline, not just matching and blocking as current EM systems do. (3) Tools are built on top of the data analysis and Big Data stacks in Python, allowing Magellan to borrow a rich set of capabilities in data cleaning, IE, visualization, learning, etc. (4) Magellan provides a powerful scripting environment to facilitate interactive experimentation and quick “patching” of the system. We describe research challenges raised by Magellan, then present extensive experiments with 44 students and users at several organizations that show the promise of the Magellan approach.