The workshop will increase the visibility of “The Mathematics of Machine Learning” in the general mathematical community. Several leading researchers in the field will give talks describing the main mathematical tools used in machine learning. The talks will be tutorial-style and accessible to a general audience of researchers working in machine learning or other fields of mathematics.
For mathematicians interested in machine learning, the workshop offers first-hand knowledge of the field from a familiar perspective and an introduction to its main mathematical tools. For researchers already working in machine learning, it offers a chance to interact with colleagues who share their interests and to learn about the latest results on specific topics.
The workshop will feature a poster session, giving selected participants an opportunity to present their work.
Programme
9:00 - 10:00 | A Theory of Universal Learning | Steve Hanneke – Purdue University |
Abstract: How quickly can functions in a given function class be learned from data? It is common to measure the performance of a supervised machine learning algorithm by plotting its "learning curve", that is, the decay of the classification risk (or "error rate") as a function of the number of training samples. However, the classical theoretical framework for understanding optimal rates of convergence of the risk in statistical learning, rooted in the works of Vapnik-Chervonenkis and Valiant (known as the PAC model), does not explain the behavior of learning curves: rather, it focuses on minimax analysis, which can only provide an upper envelope of the learning curves over a family of data distributions, and the "optimal" rate is the smallest such upper envelope achievable. This does not match the practice of machine learning, where in any given scenario, the data source is typically fixed, while the number of training samples may be chosen as needed. In this talk, I will describe an alternative framework that better captures such practical aspects of machine learning, but still gives rise to a complete theory of optimal learning rates in the spirit of the PAC model. Namely, we consider the problem of universal learning, which aims to understand the convergence rates achievable on every data distribution, without requiring uniformity of the guarantee over distributions for each sample size. In regard to supervised learning, the main result of this work is a remarkable trichotomy: there are only three possible optimal rates of universal learning. More precisely, we show that the learning curves of any given function class decay either at exponential, linear, or arbitrarily slow rates, under the realizability assumption. Moreover, each of these cases is completely characterized by appropriate combinatorial dimensions, and we exhibit optimal learning algorithms that achieve the best possible rate in each case. 
Allowing for non-realizable (so-called "agnostic") distributions, essentially the same trichotomy remains, with the linear rate replaced by sub-square-root rates. In recent extensions, we have also characterized the optimal universal rates for multiclass learning, general interactive learning, active learning with label queries, semi-supervised learning, learning with statistical margin conditions, and several other variations. In the course of these works, some general principles have emerged regarding the design of optimal learning algorithms based on winning strategies for certain infinite sequential games (Gale-Stewart games), which are used to define data-dependent partial function classes whose minimax rates match the optimal universal rate for the original function class. The corresponding combinatorial dimensions determine the existence of such winning strategies, and reflect a fascinating blending of familiar dimensions from the classical theories of statistical learning and adversarial online learning. Based on joint work with Olivier Bousquet, Shay Moran, Ramon van Handel, and Amir Yehudayoff, which appeared at STOC 2021, and numerous follow-up works (some published, some in preparation) with the aforementioned authors, as well as Idan Attias, Ariel Avital, Klim Efremenko, Alkis Kalavasis, Amin Karbasi, Amirreza Shaeiri, Jonathan Shafer, Ilya Tolstikhin, Grigoris Velegkas, and Qian Zhang. Bio: Steve Hanneke is an Assistant Professor in the Computer Science Department at Purdue University. His research explores the theory of machine learning, with a focus on reducing the number of training examples sufficient for learning. His work develops new approaches to supervised, semi-supervised, active, and transfer learning, and also revisits the basic probabilistic assumptions at the foundation of learning theory. Steve earned a Bachelor of Science degree in Computer Science from UIUC in 2005 and a Ph.D. in Machine Learning from Carnegie Mellon University in 2009 with a dissertation on the theoretical foundations of active learning. Steve's website can be found at http://www.stevehanneke.com |
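The distinction between a fixed-distribution learning curve and a minimax envelope is easy to see in even the simplest setting. The sketch below (illustrative only, not a construction from the talk) learns a one-dimensional threshold class by empirical risk minimization on a fixed data distribution and watches the risk decay as the sample size grows:

```python
import numpy as np

rng = np.random.default_rng(0)

def erm_threshold_error(n, theta=0.3, n_test=100_000):
    """Learn the threshold concept f(x) = 1[x >= theta] by empirical
    risk minimization on n uniform samples; return the test risk."""
    x = rng.uniform(0, 1, n)
    y = x >= theta
    # Any threshold between the largest negative and the smallest
    # positive sample is consistent with the data; take the midpoint.
    lo = x[~y].max(initial=0.0)
    hi = x[y].min(initial=1.0)
    theta_hat = (lo + hi) / 2
    x_test = rng.uniform(0, 1, n_test)
    return np.mean((x_test >= theta_hat) != (x_test >= theta))

# Empirical learning curve for this fixed distribution.
curve = {n: np.mean([erm_threshold_error(n) for _ in range(20)])
         for n in (10, 100, 1000)}
print(curve)
```

For this fixed distribution the risk decays at a linear (1/n) rate; the talk's point is that classifying every function class into exponential, linear, or arbitrarily slow decay requires leaving the minimax framework.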
10:00 - 11:00 | A Theory of Interpretable Approximations | Nicolò Cesa-Bianchi – University of Milan |
Abstract: Can a deep neural network be approximated by a small decision tree based on simple features? This question and its variants are behind the growing demand for machine learning models that are interpretable by humans. In this talk we describe interpretable approximations, a notion that captures the idea of approximating a target classifier by a small aggregation of classifiers in some base class. In particular, we consider the approximation of a binary classifier by decision trees based on a simple class (e.g., of bounded VC dimension), and use the tree depth as a measure of complexity. Our primary contribution is the following trichotomy. For any given pair of base class and target classifier, exactly one of these cases holds: (i) the target cannot be approximated by the base class with arbitrary accuracy; (ii) the target can be approximated by the base class with arbitrary accuracy, but there exists no distribution-free upper bound on the complexity of the approximations; or (iii) for any data distribution and any desired accuracy level, the target can be approximated by the base class with a constant complexity (where the constant depends only on the target and the base class). Joint work with M. Bressan, E. Esposito, Y. Mansour, S. Moran, and M. Thiessen. Bio: Nicolò Cesa-Bianchi is professor of Computer Science at Università degli Studi di Milano and holds a joint appointment at Politecnico di Milano. His main research interests are the design and analysis of machine learning algorithms for online learning, sequential decision-making, and graph analytics. He is co-author of the monographs "Prediction, Learning, and Games" and "Regret Analysis of Stochastic and Nonstochastic Multi-armed Bandit Problems". He served as President of the Association for Computational Learning and co-chaired the program committees of some of the most important machine learning conferences, including NeurIPS, COLT, and ALT. 
He is the recipient of a Google Research Award, a Xerox Foundation Award, a Criteo Faculty Award, a Google Focused Award, and an IBM Research Award. He is an ELLIS fellow, a member of the ELLIS board, and co-director of the ELLIS program on Interactive Learning and Interventional Representations. He is a corresponding member of the Italian Academy of Sciences. |
11:00 - 11:30 | Coffee break | |
11:30 - 12:30 | Inference for the Wasserstein Distance Between Mixing Measures in Topic Models | Florentina Bunea – Cornell University |
Abstract: The Wasserstein distance between mixing measures has come to occupy a central place in the statistical analysis of mixture models. We give the first axiomatic justification of its usage as a canonical measure of discrepancy between any mixture distributions. Inference for the Wasserstein distance between mixing measures is generally difficult in high dimensions. Specializing to discrete mixtures arising from topic models, we offer the first minimax lower bound on estimating the distance between pairs of mixing measures in this model class. This reveals regimes under which fast estimation of the distance between mixing measures can be expected, even if the ambient dimension of the mixture distributions is large. In such regimes, we develop fully data-driven inferential tools that allow us to obtain the first asymptotically valid confidence intervals for the Wasserstein distance between mixing measures in topic models. Our results apply to potentially sparse mixtures of potentially sparse high-dimensional discrete probability distributions, and are illustrated via an example on movie reviews from the IMDb data set. Bio: Florentina Bunea is a professor of statistics in the Department of Statistics and Data Science at Cornell University. She is an IMS fellow, elected for her contributions to the foundations of model selection. Her recent research on interpretable machine learning (factor models) and on topic models won her a 2025 IMS Medallion Award. Applications of this work to a large class of medical problems have appeared very recently in Nature Methods (https://news.cornell.edu/stories/2024/03/statistical-machine-learning-can-find-unknown-factors-behind-disease). Her current work involves applications of optimal transport to statistics and the study of softmax mixtures, with applications to large language models. Bunea’s research has been continuously funded, in part, by the National Science Foundation.
She has served, or is serving, as an Associate Editor of the Annals of Statistics, Bernoulli, JASA, and JRSS-B, among others. She is an active promoter of equity and diversity in Data Science and Statistical Machine Learning. |
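As a toy illustration of the quantity in the abstract (not the paper's estimator), the Wasserstein distance between two discrete mixing measures can be computed as a small optimal-transport linear program; the topic-word matrix, mixing weights, and total-variation ground cost below are all made up:

```python
import numpy as np
from scipy.optimize import linprog

# Two "mixing measures": weights over K topics, each topic being a
# distribution over a vocabulary of size V (all values illustrative).
rng = np.random.default_rng(1)
K, V = 3, 5
topics = rng.dirichlet(np.ones(V), size=K)      # K x V topic-word matrix
a = np.array([0.5, 0.3, 0.2])                   # mixing weights, corpus 1
b = np.array([0.2, 0.3, 0.5])                   # mixing weights, corpus 2

# Ground cost: total-variation distance between topic distributions.
C = 0.5 * np.abs(topics[:, None, :] - topics[None, :, :]).sum(-1)

# Wasserstein distance = optimal-transport LP:
#   min <C, P>   s.t.   P 1 = a,   P^T 1 = b,   P >= 0
A_eq = np.zeros((2 * K, K * K))
for i in range(K):
    A_eq[i, i * K:(i + 1) * K] = 1          # row sums equal a
    A_eq[K + i, i::K] = 1                   # column sums equal b
res = linprog(C.ravel(), A_eq=A_eq, b_eq=np.concatenate([a, b]),
              bounds=(0, None))
print("W(a, b) =", res.fun)
```

The statistical difficulty the talk addresses is not this computation but estimating such distances, with valid confidence intervals, when the topics themselves must be learned from high-dimensional count data.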
12:30 - 14:00 | Lunch time | |
14:00 - 15:00 | Differentially Private M-Estimation via Noisy Optimization | Po-Ling Loh – University of Cambridge |
Abstract: We present a noisy composite gradient descent algorithm for differentially private statistical estimation in high dimensions. We begin by providing general rates of convergence for the parameter error of successive iterates under assumptions of local restricted strong convexity and local restricted smoothness. Our analysis is local, in that it ensures a linear rate of convergence when the initial iterate lies within a constant-radius region of the true parameter. At each iterate, multivariate Gaussian noise is added to the gradient in order to guarantee that the output satisfies Gaussian differential privacy. We then derive consequences of our theory for linear regression and mean estimation. Motivated by M-estimators used in robust statistics, we study loss functions which downweight the contribution of individual data points in such a way that the sensitivity of function gradients is guaranteed to be bounded, even without the usual assumption that our data lie in a bounded domain. We prove that the objective functions thus obtained indeed satisfy the restricted convexity and restricted smoothness conditions required for our general theory. We then show how the private estimators obtained by noisy composite gradient descent may be used to obtain differentially private confidence intervals for regression coefficients, by leveraging work in Lasso debiasing proposed in high-dimensional statistics. We complement our theoretical results with simulations that illustrate the favorable finite-sample performance of our methods. Bio: Po-Ling Loh received her PhD in Statistics from UC Berkeley in 2014. From 2014-2016, she was an Assistant Professor of Statistics at the University of Pennsylvania. From 2016-2018, she was an Assistant Professor of Electrical & Computer Engineering at UW-Madison, and from 2019-2020, she was an Associate Professor of Statistics at UW-Madison and a Visiting Associate Professor of Statistics at Columbia University. 
She began her appointment in the Department of Pure Mathematics and Mathematical Statistics at the University of Cambridge in 2021, where she is currently a Professor of Statistics and a Fellow of St. John’s College. Po-Ling's current research interests include high-dimensional statistics, robustness, and differential privacy. She is a recipient of a Philip Leverhulme Prize, an NSF CAREER Award, an ARO Young Investigator Award, an IMS Tweedie New Researcher Award, a Bernoulli Society New Researcher Award, and a Hertz Fellowship. |
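A minimal sketch of the noisy-gradient idea for one-dimensional mean estimation (not the paper's composite algorithm; the calibration of the noise scale to a target privacy level is omitted, and all parameters are illustrative):

```python
import numpy as np

def dp_mean_gd(x, steps=200, lr=0.1, clip=1.0, sigma=0.8, seed=0):
    """Estimate the mean of 1-D data by gradient descent on the squared
    loss, clipping each per-example gradient (to bound sensitivity) and
    adding Gaussian noise to the averaged gradient at every step."""
    rng = np.random.default_rng(seed)
    theta = 0.0
    n = len(x)
    for _ in range(steps):
        g = theta - x                    # per-example gradients of (theta - x_i)^2 / 2
        g = np.clip(g, -clip, clip)      # bounded sensitivity without a bounded domain
        noise = rng.normal(0, sigma * clip / n)
        theta -= lr * (g.mean() + noise)
    return theta

rng = np.random.default_rng(42)
data = rng.normal(2.0, 1.0, 5000)
est = dp_mean_gd(data)
print(est)   # close to the true mean 2.0
```

The clipping step mirrors the talk's use of robust, bounded-sensitivity loss functions: privacy noise can be calibrated to the clip level rather than to an assumed bounded data domain.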
15:00 - 16:00 | Differentially Private Algorithms for Statistical Estimation Problems | Rachel Cummings – Columbia University |
Abstract: Differential privacy (DP) is widely regarded as a gold standard for privacy-preserving computation over users’ data. It is a parameterized notion of database privacy that gives a rigorous worst-case bound on the information that can be learned about any one individual from the result of a data analysis task. Algorithmically, it is achieved by injecting carefully calibrated randomness into the analysis to balance privacy protections with accuracy of the results. In this talk, we will survey recent developments in DP algorithms for three important statistical problems, namely online learning with bandit feedback, causal inference, and learning from imbalanced data. For the first problem, we will show that Thompson sampling -- a standard bandit algorithm developed in the 1930s -- already satisfies DP due to the inherent randomness of the algorithm. For the second problem of causal inference and counterfactual estimation, we develop the first DP algorithms for synthetic control, which has been used non-privately for this task for decades. Finally, for the problem of imbalanced learning, where one class is severely underrepresented in the training data, we show that existing techniques such as minority oversampling perform very poorly when applied as pre-processing before a DP learning algorithm; instead, we propose novel approaches for privately generating synthetic minority points. Based on joint works with Marco Avella Medina, Vishal Misra, Yuliia Lut, Tingting Ou, Saeyoung Rho, Lucas Rosenblatt, and Ethan Turok. Bio: Dr. Rachel Cummings is an Associate Professor of Industrial Engineering and Operations Research and (by courtesy) Computer Science at Columbia University, where she is also a member of the Data Science Institute and co-chairs the Cybersecurity Research Center. She is also a Fellow at the Center for Democracy & Technology. 
Before joining Columbia, she was an Assistant Professor of Industrial and Systems Engineering and (by courtesy) Computer Science at the Georgia Institute of Technology. Her research interests lie primarily in data privacy, with connections to machine learning, algorithmic economics, optimization, statistics, and public policy. Dr. Cummings is the recipient of numerous awards including an NSF CAREER award, a DARPA Young Faculty Award, a DARPA Director's Fellowship, an Early Career Impact Award, multiple industry research awards, a Provost’s Teaching Award, two doctoral dissertation awards, and Best Paper Awards at DISC 2014, CCS 2021, and SaTML 2023. Dr. Cummings also serves on the ACM U.S. Technology Policy Committee, the IEEE Standards Association, and the Future of Privacy Forum's Advisory Board. |
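The Thompson-sampling observation is easy to demo: the algorithm's only randomness is posterior sampling, which the talk shows already confers privacy. Below is a standard Beta-Bernoulli Thompson sampler (an illustrative sketch; the privacy analysis itself is not reproduced here):

```python
import numpy as np

def thompson_bernoulli(means, horizon=5000, seed=0):
    """Beta-Bernoulli Thompson sampling: pull the arm whose sampled
    posterior mean is largest, then update that arm's Beta posterior."""
    rng = np.random.default_rng(seed)
    k = len(means)
    wins = np.ones(k)                # Beta(1, 1) priors
    losses = np.ones(k)
    pulls = np.zeros(k, dtype=int)
    for _ in range(horizon):
        arm = np.argmax(rng.beta(wins, losses))   # posterior sampling
        reward = rng.random() < means[arm]        # Bernoulli reward
        wins[arm] += reward
        losses[arm] += 1 - reward
        pulls[arm] += 1
    return pulls

pulls = thompson_bernoulli([0.3, 0.5, 0.7])
print(pulls)   # the 0.7 arm receives the bulk of the pulls
```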
9:00 - 10:00 | Optimization, Robustness and Attention in Deep Learning: Insights from Random and NTK Feature Models | Marco Mondelli – IST Austria |
Abstract: A recent line of work has analyzed the properties of deep learning models through the lens of Random Features (RF) and the Neural Tangent Kernel (NTK). In this talk, I will show how concentration bounds on RF and NTK maps provide insights on (i) the optimization of the network via gradient descent, (ii) its adversarial robustness, and (iii) the success of attention-based architectures, such as transformers. I will start by proving tight bounds on the smallest eigenvalue of the NTK for deep neural networks with minimum over-parameterization. This implies that the network optimized by gradient descent interpolates the training dataset (i.e., reaches 0 training loss), as soon as the number of parameters is information-theoretically optimal. Next, I will focus on the robustness of the interpolating solution. A thought-provoking paper by Bubeck and Sellke has proposed a “universal law of robustness”: interpolating smoothly the data necessarily requires many more parameters than simple memorization. By providing sharp bounds on RF and NTK models, I will show that, while some random feature models cannot be robust (regardless of the over-parameterization), NTK features are able to saturate the universal law of robustness, thus addressing a conjecture by Bubeck, Li and Nagaraj. Finally, I will consider attention-based architectures, showing that random attention features are sensitive to a change of a single word in the context, as expected from a model suitable for NLP tasks. In contrast, the sensitivity of random features decays with the length of the context. This property translates into generalization bounds: due to their low word sensitivity, random features provably cannot learn to distinguish between two sentences that differ only in a single word. In contrast, due to their high word sensitivity, random attention features have higher generalization capabilities. |
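For background, a random features map in the sense used in the talk can be sketched via the standard random Fourier features construction of Rahimi and Recht, which approximates a Gaussian kernel (a textbook construction, not code from the talk):

```python
import numpy as np

rng = np.random.default_rng(0)
d, D = 5, 20_000     # input dimension, number of random features

# Random Fourier features approximating the Gaussian kernel
# k(x, y) = exp(-||x - y||^2 / 2): draw frequencies w ~ N(0, I) and
# phases b ~ Uniform[0, 2*pi], then phi(x)_j = sqrt(2/D) cos(w_j.x + b_j).
W = rng.normal(size=(D, d))
b = rng.uniform(0, 2 * np.pi, D)
phi = lambda x: np.sqrt(2.0 / D) * np.cos(W @ x + b)

x, y = rng.normal(size=d), rng.normal(size=d)
exact = np.exp(-np.sum((x - y) ** 2) / 2)
approx = phi(x) @ phi(y)
print(exact, approx)   # inner product of features approximates the kernel
```

The talk's results concern what such finite random-feature models (and their NTK counterparts) can and cannot do: interpolate, be robust, or be sensitive to a single word in a context.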
10:00 - 11:00 | Operator Learning for Dynamical Systems and Spectral Bounds | Massimiliano Pontil – Istituto Italiano di Tecnologia |
Abstract: We investigate learning the eigenfunctions and eigenvalues of transfer operators associated with stochastic dynamical systems, notably Langevin dynamics. These operators describe the average evolution of functions of the system's state over a future time interval. For continuous systems, we also investigate learning the closely related infinitesimal generator, essential for characterizing solutions to stochastic differential equations. We discuss operator estimators, whose domain is either a reproducing kernel Hilbert space or the span of basis functions learned via a neural network. We present spectral learning guarantees for these methods, highlighting the important role of controlling the estimator's rank. Finally, we discuss how physics-informed algorithms to learn the infinitesimal generators can be naturally adapted to learning from biased data trajectories, with significant applications in computational chemistry. Bio: Massimiliano Pontil is Senior Researcher at the Italian Institute of Technology, where he leads the CSML research unit, and co-director of ELLIS unit Genoa. He is also Professor at University College London and member of the UCL Centre for Artificial Intelligence. He has been active in machine learning for over twenty-five years, working on theory and algorithms, including the areas of kernel methods, dynamical systems, meta-learning, multitask and transfer learning, sparse estimation, and statistical learning theory. |
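A minimal sketch of the operator-learning idea on a linear toy chain (illustrative only: it uses extended dynamic mode decomposition with a fixed polynomial dictionary, rather than the kernel or neural estimators discussed in the talk):

```python
import numpy as np

rng = np.random.default_rng(0)
a = 0.9
x = np.zeros(20_001)
for t in range(20_000):               # AR(1) chain, a discretized OU process
    x[t + 1] = a * x[t] + rng.normal(0, 0.5)

# EDMD: regress the dictionary evaluated at x_{t+1} on the dictionary at
# x_t; the fitted matrix's eigenvalues estimate transfer-operator eigenvalues.
psi = lambda s: np.stack([np.ones_like(s), s, s ** 2], axis=1)
X, Y = psi(x[:-1]), psi(x[1:])
K = np.linalg.lstsq(X, Y, rcond=None)[0]
eigs = np.sort(np.abs(np.linalg.eigvals(K)))[::-1]
print(eigs)   # approximately [1, a, a^2] = [1, 0.9, 0.81]
```

For this chain the dictionary {1, x, x^2} is closed under the conditional expectation, so the finite matrix recovers the leading spectrum exactly; the talk's spectral learning guarantees concern the much harder setting where the basis must be chosen or learned, and the estimator's rank controlled.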
11:00 - 11:30 | Coffee break | |
11:30 - 12:30 | TDS Learning | Adam Klivans – The University of Texas at Austin |
Abstract: We revisit the fundamental problem of learning with distribution shift, in which a learner is given labeled samples from training distribution D, unlabeled samples from test distribution D′ and is asked to output a classifier with low test error. The standard approach in this setting is to bound the loss of a classifier in terms of some notion of distance between D and D′. These distances, however, seem difficult to compute and do not lead to efficient algorithms. We depart from this paradigm and define a new model called testable learning with distribution shift (TDS learning), where we can obtain provably efficient algorithms for certifying the performance of a classifier on a test distribution. In this model, a learner outputs a classifier with low test error whenever samples from D and D′ pass an associated test; moreover, the test must accept if the marginal of D equals the marginal of D′. We give several positive results for learning well-studied concept classes such as halfspaces, intersections of halfspaces, and decision trees. Prior to our work, no efficient algorithms for these basic cases were known without strong assumptions on D′. Bio: Adam Klivans is a professor of computer science and director of the NSF AI Institute for Foundations of Machine Learning (IFML). He also leads the UT-Austin Machine Learning Lab. His research interests include algorithms for supervised learning and AI for protein design. |
12:30 - 14:00 | Lunch time | |
14:00 - 15:00 | Looking beyond the Worst-Case Adversaries in Machine Learning | Nika Haghtalab – UC Berkeley |
Abstract: Robustness to changes in data is one of the main challenges faced by sequential machine learning and decision-making algorithms. Yet, most efficient and highly optimized deployed algorithms today were designed to work well on fixed data sets and ultimately fail when data evolves in unpredictable or adversarial ways. It is even more concerning that, for most fundamental problems in machine learning and optimization, providing any performance guarantees that do not completely diminish in the presence of all-powerful adversaries is impossible. In this talk, we will explore the smoothed analysis perspective on adaptive adversaries in machine learning and optimization, which goes beyond the worst-case scenario. We will examine both information theoretical and computational perspectives and present general-purpose techniques that provide strong robustness guarantees in practical domains for a wide range of applications, such as online learning, differential privacy, discrepancy theory, sequential probability assignment, equilibrium finding, and learning-augmented algorithm design. Our conclusion is that even small perturbations to worst-case adaptive adversaries can make learning in their presence as easy as learning over a fixed data set. Bio: Nika Haghtalab is an Assistant Professor in the Department of Electrical Engineering and Computer Sciences at UC Berkeley. She works broadly on the theoretical aspects of machine learning and algorithmic economics. Prof. Haghtalab’s work builds theoretical foundations for ensuring both the performance of learning algorithms in presence of everyday economic forces and the integrity of social and economic forces that are born out of the use of machine learning systems. Among her honors are Sloan fellowship, NSF CAREER award, Schmidt Sciences AI2050 award, the CMU School of Computer Science Dissertation Award, SIGecom Dissertation Honorable Mention, and NeurIPS and ICAPS outstanding paper awards. |
16:00 - 19:00 | Poster session | Location: ETH AI Center |
9:00 - 10:00 | Causal Inference with Unstructured Data | Yixin Wang – University of Michigan |
Abstract: Causal inference traditionally involves analyzing tabular data where variables like treatment, outcome, covariates, and colliders are manually labeled by humans. However, many complex causal inference problems rely on unstructured data sources such as images, text and videos that depict overall situations. These causal problems require a crucial first step - extracting the high-level latent causal factors from the low-level unstructured data inputs, a task known as "causal representation learning." In this talk, we explore how to identify latent causal factors from unstructured data, whether from passive observations, interventional experiments, or multi-domain datasets. While latent factors are classically uncovered by leveraging their statistical independence, causal representation learning grapples with a thornier challenge: the latent causal factors are often correlated, causally connected, or arbitrarily dependent. Our key observation is that, despite correlations, the causal connections (or the lack of) among factors leave geometric signatures in the latent factors' support - the ranges of values each can take. Leveraging these signatures, we show that observational data alone can identify the latent factors up to coordinate transformations if they bear no causal links. When causal connections do exist, interventional data can provide geometric clues sufficient for identification. In the most general case of arbitrary dependencies, multi-domain data can separate stable factors from unstable ones. Taken together, these results showcase the unique power of geometric signatures in causal representation learning. This is joint work with Kartik Ahuja, Yoshua Bengio, Michael Jordan, Divyat Mahajan, and Amin Mansouri. [1] https://arxiv.org/abs/2109.03795 Bio: Yixin Wang is an assistant professor of statistics at the University of Michigan. She works in the fields of Bayesian statistics, machine learning, and causal inference. 
Previously, she was a postdoctoral researcher with Professor Michael Jordan at the University of California, Berkeley. She completed her PhD in statistics at Columbia, advised by Professor David Blei, and her undergraduate studies in mathematics and computer science at the Hong Kong University of Science and Technology. Her research has been recognized by the j-ISBA Blackwell-Rosenbluth Award, ICSA Conference Young Researcher Award, ISBA Savage Award Honorable Mention, ACIC Tom Ten Have Award Honorable Mention, and INFORMS data mining and COPA best paper awards. |
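The "geometric signature" intuition can be illustrated numerically (a cartoon of the idea, not the paper's identification procedure): when two latent factors are causally unlinked, their joint support is a product of the marginal supports, which even a coarse occupancy grid can detect, whereas a causal link typically carves the support into a thinner set:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000

def support_is_product(z1, z2, bins=10):
    """Crude proxy for 'the joint support is a product set': bin the
    samples and check that every cell of the bounding grid is occupied."""
    h, _, _ = np.histogram2d(z1, z2, bins=bins)
    return (h > 0).all()

# Causally unlinked latents: the joint support fills a full rectangle.
x1, x2 = rng.uniform(0, 1, n), rng.uniform(0, 1, n)
print(support_is_product(x1, x2))   # True

# Causally linked latents (w2 depends on w1): the support is a thin band.
w1 = rng.uniform(0, 1, n)
w2 = 0.5 * w1 + 0.1 * rng.uniform(0, 1, n)
print(support_is_product(w1, w2))   # False
```

Note that the linked pair is also statistically dependent, so independence-based tests and support-based signatures agree here; the abstract's point is that support signatures remain informative even when unlinked factors are correlated.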
10:00 - 11:00 | The Physics of Machine Learning | Lenka Zdeborova – EPFL |
Abstract: Machine learning provides an invaluable toolbox for natural sciences, but it also comes with many open questions that theoretical branches of natural sciences can investigate and tackle. This talk will describe some recent trends and progress in the second direction. We will discuss how diffusion or flow-based generative models sample or fail to sample challenging probability distributions. We will present a toy model of dot-product attention that presents a phase transition between positional and semantic learning. We will also revisit some classical methods for estimating uncertainty and their status in the context of modern over-parametrized neural networks. Bio: Lenka Zdeborová is a professor of physics and of computer science at École Polytechnique Fédérale de Lausanne in Switzerland, where she leads the Statistical Physics of Computation Laboratory. Her expertise is in applications of concepts from statistical physics, such as advanced mean field methods, replica method and related message-passing algorithms, to problems in machine learning, signal processing, inference and optimization. She has said that she “enjoys erasing the boundaries between theoretical physics, mathematics and computer science.” Zdeborová received a Ph.D. in physics from University Paris-Sud and from Charles University in Prague in 2008. She spent two years in the Los Alamos National Laboratory as the Director's Postdoctoral Fellow; she then spent ten years as a researcher at Centre national de la recherche scientifique (CNRS) working in the Institute of Theoretical Physics in CEA Saclay, France. She served as an editorial board member for Journal of Physics A, Physical Review E, Physical Review X, SIMODS, Machine Learning: Science and Technology, and Information and Inference. 
Her honors include the CNRS bronze medal in 2014; the Philippe Meyer prize in theoretical physics in 2016; the Irène Joliot-Curie prize in 2018; and the Gibbs lectureship of AMS and the Neuron Fund award in 2021. |
11:00 - 11:30 | Coffee break | |
11:30 - 12:30 | The Impacts of Overparameterization under Distribution Shift | Tong Zhang – HKUST |
Abstract: Modern machine learning often relies on overparameterized models, typically trained to overfit the data. Recent studies have suggested that such overfitting can be "benign," leading to estimators with strong generalization performance. In this talk, we further examine the consequences of overparameterization under distribution shifts and present the following findings: 1. For adversarially perturbed data, benign overfitting can result in models that lack adversarial robustness, even when the true underlying model is robust. 2. On the other hand, when test data are perturbed non-adversarially but exhibit a distribution shift from the training data, overparameterization proves advantageous. As the model size increases, the out-of-distribution performance also improves. We support our theoretical results with experimental evidence. |
12:30 - 14:00 | Lunch time | |
14:00 - 15:00 | Pure Exploration Problems | Wouter Koolen – University of Twente |
Abstract: In this tutorial we will consider the design of an active learning agent that decides which experiments to perform sequentially. We formulate the learning task in the multi-armed bandit setting and highlight the distinction between reward maximisation and statistical testing. We then develop mathematical tools and solve three classic learning tasks: regret minimization, best arm identification with fixed confidence, and best arm identification with fixed budget. We finally show how to scale up to instance-optimal learning for more advanced tasks, including multi-objective optimization. Bio: Wouter Koolen is a professor of mathematical machine learning at the University of Twente and a senior researcher at Centrum Wiskunde & Informatica in Amsterdam. His specialties include full-information online learning and bandits, statistical testing, and saddle point computation. |
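A fixed-confidence best-arm identification routine in the classic successive-elimination style such tutorials cover (an illustrative sketch; the confidence radius is a generic Hoeffding/union bound, not an instance-optimal one):

```python
import numpy as np

def successive_elimination(means, delta=0.05, seed=0):
    """Pull every active arm once per round, then eliminate arms whose
    upper confidence bound falls below the best lower confidence bound;
    with probability >= 1 - delta the surviving arm is the best one."""
    rng = np.random.default_rng(seed)
    k = len(means)
    active = list(range(k))
    sums = np.zeros(k)
    t = 0
    while len(active) > 1:
        t += 1
        for i in active:
            sums[i] += rng.random() < means[i]   # Bernoulli reward
        mu = sums[active] / t
        radius = np.sqrt(np.log(4 * k * t * t / delta) / (2 * t))
        best = mu.max()
        active = [i for i, m in zip(active, mu) if m + radius > best - radius]
    return active[0]

winner = successive_elimination([0.2, 0.45, 0.6])   # arm 2 has the largest mean
print(winner)
```

The contrast with regret minimization is visible in the sampling pattern: here every surviving arm is pulled equally often, since the goal is a confident identification rather than high cumulative reward.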
16:00 - | Speakers Event |
9:00 - 10:00 | Adaptive estimation from indirect observations | Anatoli Juditsky – Université Grenoble Alpes |
Abstract: We discuss an approach to estimate aggregation and adaptive estimation based upon (nearly optimal) testing of convex hypotheses. The proposed adaptive routines generalize the now-classical Goldenshluger-Lepski adaptation schemes and, in the situation where the observations stem from simple observation schemes (i.e., have Gaussian, discrete, or Poisson distribution) and where the set of unknown signals is a finite union of convex and compact sets, attain nearly optimal performance. As an illustration, we consider application of the proposed estimates to the problem of recovery of an unknown signal known to belong to a union of ellitopes in the Gaussian observation scheme. The corresponding numerical routines can be implemented efficiently when the number of sets in the union is “not very large.” We illustrate the “practical performance” of the method in an example of estimation in the single-index model. Bio: Anatoli Juditsky received the M.S. in Electrical Engineering in 1985 from the Moscow Institute of Physics and Technology and the Ph.D. in 1989 from the Institute of Control Sciences, Moscow, USSR. Since 1999 he has been Professor at the Department of Mathematics and Informatics (IM2AG) of the University of Grenoble Alps, France; he held research positions at INRIA-IRISA, Rennes, from 1991 to 1996, and then at INRIA, Grenoble, from 1996 to 1999. His current research interests include stochastic and large-scale convex optimization, statistical theory, and their applications. |
10:00 - 11:00 | Theoretical Foundation of Mean Field Langevin Dynamics and Its Application to Neural Network Optimization | Taiji Suzuki – The University of Tokyo |
Abstract: The mean-field Langevin dynamics (MFLD) is a nonlinear generalization of the gradient Langevin dynamics (GLD) that minimizes an entropy regularized convex function defined on the space of probability distributions, and it naturally arises from the optimization of two-layer neural networks via (noisy) gradient descent. In this talk, I will present the convergence analysis of MFLD and explain how the convergence of MFLD is characterized by the log-Sobolev inequality of the so-called proximal Gibbs measure corresponding to the current solution. Moreover, I will provide a general framework to prove a uniform-in-time propagation of chaos for MFLD that takes into account the errors due to finite-particle approximation, time-discretization, and stochastic gradient approximation. In the latter half, I will discuss the generalization error analysis of neural networks trained by MFLD. Thanks to its feature learning ability of neural networks, we can show that neural networks achieve better statistical properties than kernel methods in several problem settings. Finally, we also present recent developments of optimizing mean field neural networks for non-convex objectives appearing in reinforcement learning and in-context learning. |
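The entropy-regularized objective is easy to visualize in the linear special case, the gradient Langevin dynamics (GLD) that MFLD generalizes: particles following noisy gradient steps converge to the Gibbs measure minimizing E_mu[V] + beta^{-1} Ent(mu). A toy sketch with V(x) = (x - 1)^2 (all parameters illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)

# Minimize F(mu) = E_mu[V] + (1/beta) * Ent(mu) with V(x) = (x - 1)^2.
# The minimizer is the Gibbs measure mu*(x) ∝ exp(-beta * V(x)):
# a Gaussian with mean 1 and variance 1 / (2 * beta).
beta = 4.0
N, lr, steps = 2000, 0.01, 2000
x = rng.normal(0, 1, N)                 # particle approximation of mu

for _ in range(steps):
    grad = 2 * (x - 1)                  # V'(x)
    # Langevin step: gradient descent plus noise of scale sqrt(2*lr/beta)
    x += -lr * grad + np.sqrt(2 * lr / beta) * rng.normal(size=N)

print(x.mean(), x.var())   # ≈ 1 and ≈ 1/(2*beta) = 0.125
```

In MFLD the drift depends on the current particle distribution itself (as happens for a mean-field two-layer network), which is what makes the convergence analysis via the proximal Gibbs measure and propagation of chaos nontrivial.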
||
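As a reference point for the abstract above, the entropy-regularized objective and the noisy-gradient particle update behind MFLD can be sketched as follows (notation mine):

```latex
% Entropy-regularized convex objective over probability measures
F(\mu) = L(\mu) + \lambda \, \mathrm{Ent}(\mu)
% Mean-field Langevin dynamics, with \mu_t = \mathrm{Law}(X_t)
\mathrm{d}X_t = -\nabla_x \frac{\delta L}{\delta \mu}(\mu_t, X_t)\,\mathrm{d}t
              + \sqrt{2\lambda}\,\mathrm{d}W_t
% Finite-particle, discrete-time version: noisy gradient descent on
% N neurons X_1, \dots, X_N with empirical measure \hat{\mu}^k
X_i^{k+1} = X_i^k - \eta\,\nabla_x \frac{\delta L}{\delta \mu}(\hat{\mu}^k, X_i^k)
          + \sqrt{2\lambda\eta}\,\xi_i^k, \qquad \xi_i^k \sim \mathcal{N}(0, I)
```

The proximal Gibbs measure mentioned in the abstract is $p_\mu(x) \propto \exp\!\big(-\lambda^{-1}\,\tfrac{\delta L}{\delta \mu}(\mu, x)\big)$; its log-Sobolev constant is what governs the convergence rate of MFLD.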
11:00 - 11:30 | Coffee break | |
11:30 - 12:30 | Learning Representations and Associations with Gradient Descent | Jason D. Lee – Princeton University |
Abstract: TBD Bio: Jason Lee is an associate professor in Electrical Engineering and Computer Science (secondary) at Princeton University. Prior to that, he was in the Data Science and Operations department at the University of Southern California and a postdoctoral researcher at UC Berkeley working with Michael I. Jordan. Jason received his PhD at Stanford University advised by Trevor Hastie and Jonathan Taylor. His research interests are in the theory of machine learning, optimization, and statistics. Lately, he has worked on the foundations of deep learning, representation learning, and reinforcement learning. He has received the Samsung AI Researcher of the Year Award, NSF CAREER Award, ONR Young Investigator Award in Mathematical Data Science, Sloan Research Fellowship, NeurIPS Best Student Paper Award, and was a finalist for the Best Paper Prize for Young Researchers in Continuous Optimization. |
||
12:30 - 14:00 | Lunch time | |
14:00 - 15:00 | Causal Effect Estimation with Context and Confounders | Arthur Gretton – Gatsby, UCL, DeepMind |
Abstract: A fundamental causal modelling task is to predict the effect of an intervention (or treatment) on an outcome, given context/covariates. Examples include predicting the effect of a medical treatment on patient health given patient symptoms and demographic information, or predicting the effect of ticket pricing on airline sales given seasonal fluctuations in demand. The problem becomes especially challenging when the treatment and context are complex (for instance, "treatment" might be a web ad design or a radiotherapy plan), and when only observational data is available (i.e., we have access to historical data, but cannot intervene ourselves). The challenge is greater still when the covariates are not observed, and constitute a hidden source of confounding. I will give an overview of some practical tools and methods for estimating causal effects of complex, high-dimensional treatments from observational data. The approach is based on conditional feature means, which represent conditional expectations of relevant model features. These features can be deep neural nets (adaptive, finite dimensional, learned from data), or kernel features (fixed, infinite dimensional, enforcing smoothness). When hidden confounding is present, a neural net implementation of instrumental variable regression can be used to correct for this confounding. The methods will be applied to modelling employment outcomes for the US Job Corps program for Disadvantaged Youth, and in policy evaluation for reinforcement learning. Bio: Arthur Gretton is a Professor with the Gatsby Computational Neuroscience Unit, and director of the Centre for Computational Statistics and Machine Learning (CSML) at UCL; and a research scientist at Google DeepMind. His recent research interests include causal inference and representation learning, design and training of generative models, and nonparametric hypothesis testing.
Arthur has been an associate editor at IEEE Transactions on Pattern Analysis and Machine Intelligence, an Action Editor for JMLR, a Senior Area Chair for NeurIPS (2018, 2021) and ICML (2022), a member of the COLT Program Committee in 2013, and a member of the Royal Statistical Society Research Section Committee since January 2020. Arthur was program co-chair for AISTATS in 2016, tutorials co-chair for ICML 2018, workshops co-chair for ICML 2019, program co-chair for the DALI workshop in 2019, and co-organiser of the Machine Learning Summer School 2019 in London. |
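The hidden-confounding correction mentioned in the abstract can be illustrated with a toy linear instrumental-variable regression. All data-generating coefficients below are hypothetical, and plain two-stage least squares stands in for the neural-network implementation discussed in the talk:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 50_000

# Hypothetical simulation: a hidden confounder u drives both the
# treatment t and the outcome y; z is a valid instrument (it moves t
# but affects y only through t).
u = rng.normal(size=n)                             # unobserved confounder
z = rng.normal(size=n)                             # instrument
t = z + u + 0.1 * rng.normal(size=n)               # treatment
y = 2.0 * t - 3.0 * u + 0.1 * rng.normal(size=n)   # true causal effect: 2.0

# Naive OLS of y on t absorbs the confounded component and is biased.
ols = (t @ y) / (t @ t)

# Two-stage least squares: stage 1 regresses t on z; stage 2 regresses
# y on the predicted (confounder-free) part of t.
t_hat = z * ((z @ t) / (z @ z))
tsls = (t_hat @ y) / (t_hat @ t_hat)

print(f"OLS estimate:  {ols:.3f}")    # badly biased away from 2.0
print(f"2SLS estimate: {tsls:.3f}")   # close to the true effect 2.0
```

Because the stage-1 fit keeps only the variation in the treatment explained by the instrument, the confounded component is stripped out and stage 2 recovers the causal coefficient; OLS, by contrast, remains biased no matter how large the sample.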
Author | Poster |
---|---|
Yifan Hu | Contextual Bilevel Stochastic Optimization |
Ilyas Fatkhullin | Taming Nonconvex Stochastic Mirror Descent with General Bregman Divergence |
Florian Hübler | Tuning-Free Optimization under Relaxed Smoothness |
Zebang Shen | Integral Relaxation for Minima-selection in Bi-Level Optimization: A CVaR-based Approach |
Piersilvio De Bartolomeis | Detecting critical treatment effect bias in small subgroups |
Julia Kostin | Achievable distributional robustness when the robust risk is only partially identified |
Freya Behrens | A phase transition between positional and semantic learning in a solvable model of dot-product attention |
Freya Behrens | Counting on Algorithmic Capacity: The Interplay between Mixing and Memorization in Toy Models of Transformers |
Linara Adilova | Layer-wise Linear Mode Connectivity |
Amir Joudaki | Batch Normalization Without Gradient Explosion |
Luca Arnaboldi | Online Learning and Information Exponents: The Importance of Batch Size and Time/Complexity Tradeoffs |
Yatin Dandi | The Benefits of Reusing Batches for Gradient Descent in Two-Layer Networks: Breaking the Curse of Information and Leap Exponents |
Odilon Duranthon | Asymptotic generalization error of a single-layer graph convolutional network |
Ashok Vardhan Makkuva | Local to Global: Learning Dynamics and Effect of Initialization for Transformers |
Ashok Vardhan Makkuva | Attention with Markov: A Curious case of Single-layer Transformers |
Simone Bombari | How Spurious Features Are Memorized: Precise Analysis for Random and NTK Features |
Anas Barakat | TBD |
Jiawei Huang | Learning to Steer Markovian Agents under Model Uncertainty |
Adrian Müller | Truly No-Regret Learning in Constrained MDPs |
Yaxi Hu | Provable Privacy with Non-Private Pre-Processing |
Xabier de Juan | Optimality of the Median-of-Means Under Adversarial Contamination |
Hongjie Chen | Private Estimation for Random Graphs via Sum-of-Squares |
Daniil Dmitriev | Robust Mixture Learning when Outliers Overwhelm Small Groups |
Liang Zhang | DPZero: Private Fine-Tuning of Language Models without Backpropagation |
Ludovic Stephan | Repetita Iuvant: Data Repetition Allows SGD to Learn High-Dimensional Multi-Index Functions |
Emanuele Troiani | Fundamental limits of weak learnability in high-dimensional multi-index models |
Kevin Kögler | Compression of Structured Data with Autoencoders: Provable Benefit of Nonlinearities and Depth |
Konstantinos Stavropoulos | Testable Learning with Distribution Shift |
Alessio Mazzetto | An Adaptive Algorithm for Learning with Unknown Distribution Drift |
Nataly Brukhim | Multiclass Boosting: simple and intuitive weak learning criteria |
Verónica Álvarez | Programmatic Weak Supervision with Confidence Intervals for Probabilities |
Giacomo Meanti | Estimating Koopman operators with sketching to provably learn large scale dynamical systems |
Jonas Hübotter | Active Few-Shot Fine-Tuning |
Giovanni Piccioli | Gibbs Sampling the Posterior of Neural Networks |
Davide Ghio | Sampling with flows, diffusion and autoregressive neural networks: A spin-glass perspective |
Katya Mirylenka | Leveraging Large Language Models for Natural Language to SQL Conversion |
Ya-Ping Hsieh | Sinkhorn Flow as Mirror Flow |
Lucas Clarte | Analysis of Bootstrap and Subsampling in High-dimensional Regularized Regression |
Prieto Zerbetto | TBD |
Marco Rando | TBD |
Ieva Petrulyonite | TBD |
Alexander Shevchenko | TBD |
Organizing Committee
Location
Location of the invited talks:
ETH Zurich Zentrum HG E 1.1
Rämistrasse 101, 8006 Zurich
Zurich, Switzerland
Location of the poster sessions:
ETH AI Center OAT X11
Andreasstrasse 5, 8092 Zurich
Zurich, Switzerland
For more information, please refer to information_sheet.pdf
Registration
Registration for the workshop is now closed as we have reached full capacity. However, remote participation is available via a Zoom link. If you are interested, please email us with your position and affiliation details to obtain access. Additionally, recordings of the talks will be available online after the workshop.