S&DS 659: Mathematics of Deep Learning

Description

The goal of this course is to provide an introduction to selected topics in deep learning theory. I will present a number of mathematical models and theoretical concepts that have emerged in recent years to understand neural networks.

Lectures: Wednesdays 4:00pm–5:50pm
Office Hours: Thursdays 4:00pm–5:00pm, Kline Tower 1049

Prerequisites: I will not assume a specific background in machine learning, let alone neural networks. On the other hand, I will assume a degree of mathematical maturity, in particular in linear algebra, analysis, and probability theory (at the level of S&DS 241/541).

Assignments: You will scribe one lecture during the semester, write a report on a research topic related to deep learning theory, and give a presentation at the end of the semester.

Course syllabus

  • Week 1: General Introduction

    • Empirical risk minimization and the classical paradigm of statistical learning.

    • Tractability via overparametrization, implicit bias.

    • Universal approximation, Barron's theorem, uniform convergence.

  • Week 2: Generalization and Uniform Convergence

    • Basics of uniform convergence theory.

    • Norm-based uniform convergence for multilayer NNs.

  • Week 3: Implicit Bias

    • Implicit bias of learning algorithms.

    • Examples of mirror descent and steepest descent.

  • Week 4: Benign Overfitting/Double Descent

    • Overfitting and double descent phenomena.

    • Benign overfitting in linear regression, self-induced regularization.

    • Inner-product kernels on the sphere.

  • Week 5: Lazy Regime and NTK

    • Lazy training regime in optimization.

    • Global convergence of two-layer NNs.

    • Neural Tangent Kernel.

  • Week 6: Kernel Methods

    • Background on kernel methods.

    • Deterministic equivalents for ridge regression.

    • Curse of dimensionality; lower bounds for learning with linear methods.

  • Week 7: Mean-Field Description

    • Infinite-width limits and the μP (maximal update) parametrization.

    • Mean-field theory for two-layer NNs: McKean–Vlasov PDE and optimal transport formulations.

    • Global convergence guarantees.

  • Week 8: Linear Methods vs Feature Selection vs Feature Learning

    • Convex neural networks.

    • Case study: multi-index functions.

    • Staircase mechanism.

  • Week 9: Power and Limitations of Differentiable Learning

    • Computational hardness of deep learning.

    • Poly-time universality of SGD on NNs.

  • Week 10: High-Dimensional Landscapes and Dynamics

    • Landscape concentration.

    • Dynamics on non-convex problems in high dimensions.

  • Week 11: Transformers, Attention, and In-Context Learning

    • Transformer architecture and the attention layer.

    • In-context linear regression.

  • Week 12: Edge-of-Stability, Neural Scaling Laws, Emergence, and Beyond

    • Review of several empirical phenomena, including edge of stability, neural scaling laws, and emergence.

    • Open-ended discussion.

  • Week 13: In-class presentations + pizza