Rethinking the Role of Bayesianism in the Age of Modern AI (Dagstuhl Seminar 24461)

Fortuin, Vincent; Khan, Mohammad Emtiyaz; van der Wilk, Mark; Ghahramani, Zoubin; Fisher, Katharine

doi:10.4230/DagRep.14.11.40

Report from Dagstuhl Seminar 24461

Vincent Fortuin (Helmholtz AI – Neuherberg, DE), Editor / Organizer
Mohammad Emtiyaz Khan (RIKEN – Tokyo, JP), Editor / Organizer
Mark van der Wilk (University of Oxford, GB), Editor / Organizer
Zoubin Ghahramani (Google Deepmind – Mountainview, US), Editor / Organizer
Katharine Fisher (MIT – Cambridge, US), Editorial Assistant / Collector

Abstract

Despite the recent success of large-scale deep learning, these systems still fall short in terms of their reliability and trustworthiness. They often lack the ability to estimate their own uncertainty in a calibrated way, encode meaningful prior knowledge, avoid catastrophic failures, and also reason about their environments to avoid such failures. Since its inception, Bayesian deep learning (BDL) has harbored the promise of achieving these desiderata by combining the solid statistical foundations of Bayesian inference with the practically successful engineering solutions of deep learning methods. This was intended to provide a principled mechanism to add the benefits of Bayesian learning to the framework of deep neural networks. However, BDL methods often fall short of this promise and underdeliver in terms of real-world impact. This is due to many fundamental challenges, including the computation of approximate posteriors, the unavailability of flexible priors, and the lack of appropriate testbeds and benchmarks. To make things worse, there are also numerous misconceptions about the scope of Bayesian methods, and researchers often end up expecting more than what they can get out of Bayes. By bringing together researchers from diverse communities, such as machine learning, statistics, and deep learning practice, in a personal and interactive seminar environment featuring debates, round tables, and brainstorming sessions, our Dagstuhl Seminar “Rethinking the Role of Bayesianism in the Age of Modern AI” (24461) discussed these questions from a variety of angles and charted a path for future research to innovate, enhance, and strengthen meaningful real-world impact of Bayesian deep learning.

Keywords and phrases:

Bayesian machine learning, deep learning, foundation models, model selection, uncertainty estimation

Seminar:

November 10–15, 2024 – https://www.dagstuhl.de/24461

2012 ACM Subject Classification:

Computing methodologies → Artificial intelligence

Copyright and License:

Except where otherwise noted, content of this report is licensed under a Creative Commons BY 4.0 International license

DOI:

10.4230/DagRep.14.11.40

1 Executive Summary

Vincent Fortuin (Helmholtz AI – Neuherberg, DE)

License: Creative Commons BY 4.0 International license © Vincent Fortuin

The Dagstuhl Seminar “Rethinking the Role of Bayesianism in the Age of Modern AI” (24461) was convened to explore the contemporary role of Bayesian methods in artificial intelligence, particularly in light of the remarkable advancements in large-scale deep learning. While Bayesian Deep Learning (BDL) holds the promise of addressing key limitations of traditional deep learning, such as uncertainty estimation, encoding prior knowledge, and preventing catastrophic failures, it frequently falls short of its potential in practical applications. This discrepancy arises from several fundamental challenges. These challenges include the difficulty of computing accurate posterior approximations, the scarcity of flexible prior distributions, and the lack of suitable benchmarks for evaluating Bayesian models. Furthermore, misconceptions regarding the scope of Bayesian methods often lead researchers to harbor unrealistic expectations and overlook simpler, non-Bayesian alternatives like bootstrap methods, post-hoc uncertainty scaling, and conformal prediction. Such over-expectations, followed by under-delivery, may cause researchers to lose faith in Bayesian approaches. The central question addressed by the seminar was: In this era of AI where scaling seems to solve many problems, what is the unique role of Bayesian methods? The goal was to redefine the promises and challenges of Bayesian approaches, identify areas where they can outperform non-Bayesian methods, and highlight key application domains where their strengths can be best leveraged. By bringing together researchers from diverse backgrounds, the seminar aimed to chart a path for future research to innovate, enhance, and strengthen the real-world impact of BDL. The seminar recognized that while non-Bayesian methods seem to be solving problems that Bayesians once hoped to solve with Bayesian methods, it was important to re-examine the value and potential of the Bayesian approach.

Structure of the Seminar

The seminar was designed to foster an interactive and collaborative environment, incorporating three distinct types of events: workgroup sessions, guided discussions, and final plenary discussions.

Workgroup Sessions.

These sessions revolved around overarching questions pertaining to the main advantages of Bayesian methods, the challenges hindering their adoption, and the practical areas where they can make the most difference. The workgroups always featured between one and three short input talks from different participants, which then informed the subsequent discussions. The workgroups were structured around three main questions:

■ What are the main benefits of Bayes that are hard to achieve otherwise?
■ What are the most pressing challenges in its adoption?
■ What are the most impactful ways for Bayes to make a difference in practice?

Guided Discussions.

The guided discussions were designed to examine contentious issues and encourage debate, focusing on three key motions:

■ “Bayes’ theorem is broken for making predictions with large models”
■ “We can build subjective Bayesian priors for NNs that we actually believe in”
■ “Bayes is useless if we cannot scale to LLMs”

Final plenary discussions.

The final discussions focused on the bigger picture and the next steps for researchers in the field. They centered around three main themes:

■ What can we do to encourage researchers to join the BDL community and how can we support and uplift each other within the community?
■ How can we measure progress in the field and find promising application areas that would convince practitioners to use Bayesian methods?
■ What are some grand long-term challenges for which we could hope Bayesian methods to make a difference and potentially outperform standard deep learning?

Insights from the Working Groups

Talks.

The seminar featured a series of presentations covering a wide range of topics related to Bayesian methods. Participants contributed these talks in response to a pre-seminar poll of the group’s interests, and the talks in turn informed the working groups’ discussions. Topics included:

■ The distinction between aleatoric and epistemic uncertainty. This included a detailed look at how these terms are often used inconsistently, leading to issues in the literature. The discussion also covered how to estimate these uncertainties in practice and how to best decompose total uncertainty.
■ The difference between predictive and parameter uncertainty. The discussion here considered how to search the space of predictions and how to judge explanations without relying on predictions.
■ Developing benchmarks for Bayesian methods. This included a discussion on whether current uncertainty measures are useful for model comparison and selection, and whether new benchmarks are needed.
■ The roles of prediction and explanation in science. The discussion focused on how machine learning has changed the landscape of prediction and explanation, and the role of Bayesian approaches in these areas.
■ Bayesian foundation models. This discussion considered how probabilistic thinking can help us understand foundation models and whether deep learning technologies can help advance probabilistic methods.
■ Bayesian Neural Network (BNN) architectures. This included a look at model selection using the marginal likelihood, and whether uncertainty helps to avoid overfitting.
■ Pseudo-posteriors. This session explored methods like likelihood tempering and robust loss functions to address model misspecification.
■ Bayesian methods for sequential learning. This included discussions of new algorithms for deep learning and how to apply them in dynamic settings.
■ The geometry of BNN posteriors. The discussion focused on the challenges for Bayesian inference in deep learning, such as the intractability of posterior distributions and the existence of multiple minima.
■ Partial stochasticity in BNNs. This talk explored scalable variational approximations based on subnetworks and whether a fully Bayesian treatment of NNs is necessary.
■ Teaching Bayesian ML. This session covered the decisions academics make when teaching Bayesian ML, what to include and what to omit, and the value of diversity in teaching approaches.
■ The relationship between Bayesian theory and practice. This presentation explored non-Bayesian justifications for Bayesian updating, the challenges in modeling complex data, and the value of trying out models to see which ones work best.

Together these workgroup sessions yielded important insights into the potential and challenges associated with Bayesian methods along the three main themes of the seminar:

Benefits of Bayes.

Participants highlighted several core benefits, including the ability to quantify uncertainty, update models, perform model selection, and obtain improved point estimates. The quantification of uncertainty was noted as a key advantage, although it was admitted that it can sometimes be achieved by other means. In contrast, model updating was seen as a critical unique benefit, allowing models to adapt to new data without complete retraining.

Challenges in Adoption.

Significant challenges were identified, particularly in the areas of scalability, and prior and model misspecification. These challenges pose barriers to the wider adoption of Bayesian methods. Scalability was a significant concern, as many Bayesian methods are computationally expensive. Prior misspecification was also identified as a major issue, as it can bias the results negatively and hamstring many of the benefits of the Bayesian approach. Finally, model misspecification also presents problems, as many models do not perfectly fit real-world data.

Impactful Applications.

Sequential learning was emphasized as an area where Bayesian methods have the potential to make a substantial impact. The ability of Bayesian methods to update beliefs over time and adapt to new data makes them well-suited to sequential learning tasks.
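The updating property described here can be illustrated with a minimal sketch. It assumes a Beta-Bernoulli model, chosen only because its posterior is available in closed form; the data stream and the prior are hypothetical and not taken from the seminar. The point is that feeding the posterior back in as the next prior gives the same belief as retraining on all data at once.

```python
from fractions import Fraction

def update(prior, batch):
    """Conjugate Beta-Bernoulli update: returns the posterior (alpha, beta)."""
    alpha, beta = prior
    successes = sum(batch)
    return (alpha + successes, beta + len(batch) - successes)

def posterior_mean(params):
    """Mean of a Beta(alpha, beta) distribution, kept exact with fractions."""
    alpha, beta = params
    return Fraction(alpha, alpha + beta)

prior = (1, 1)  # uniform Beta(1, 1) prior, an illustrative choice
stream = [[1, 0, 1], [1, 1], [0, 1, 1, 1]]  # hypothetical binary data arriving in batches

# Sequential learning: the posterior after one batch is the prior for the next.
belief = prior
for batch in stream:
    belief = update(belief, batch)

# Batch learning: start again from the prior and use all data at once.
all_data = [x for batch in stream for x in batch]
batch_belief = update(prior, all_data)

print(belief, batch_belief, posterior_mean(belief))
```

In this conjugate case the two routes agree exactly. For neural networks the posterior is only approximated, so the agreement is at best approximate, which is one reason sequential learning is treated as a research challenge rather than a solved problem.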

Insights from the Guided Discussions

The guided discussions brought to light differing opinions and perspectives on critical issues within the Bayesian community.

Bayes’ Theorem and Large Models.

A central debate revolved around the applicability of Bayes’ theorem to large models. The “pro” side contended that while mathematically sound, the epistemological assumptions of Bayes’ theorem do not translate well to complex neural networks (NNs). They argued that NNs lack clearly defined priors. Furthermore, they noted that simpler, more direct methods like point estimates or conformal predictions are often more cost-effective and practical. The “con” side, however, argued that any limitations are due to implementation issues and not the theorem itself. They also noted the value of Bayesian methods when fine-tuning models with small datasets, emphasizing that Bayes provides a flexible framework. The core of the debate was whether the practical constraints of large models should limit the application of Bayesian methods, or whether the flexibility of Bayesian approaches could be adapted to these large models. This discussion highlighted the need for a nuanced understanding of the strengths and limitations of Bayesian methods in different contexts.

Subjective Bayesian Priors.

The discussion on subjective priors for NNs explored the significance of priors, particularly for out-of-distribution data, and the difficulties in defining them effectively. Some participants emphasized that priors should be based on domain expertise, while others questioned the mathematical basis for using subjective priors on neural networks. The discussion highlighted the challenge of balancing subjective knowledge with the need for mathematical rigor. It was also noted that priors on function spaces might be easier to specify than priors on model parameters, and that designing priors to bias solutions toward the data was an area worth exploring.

Scaling to LLMs.

A significant point of contention was whether Bayesian methods are still relevant if they cannot scale to LLMs. The “pro” side argued that the need for scalability to LLMs is paramount for Bayes to stay relevant in the field. The “con” side countered that Bayes should not be limited to large models; it also plays a crucial role in small-data problems and scientific experiments. It was suggested that LLMs themselves could be used as priors and diffusion models as inference algorithms, highlighting the possibility of using modern AI tools within a Bayesian framework. This debate emphasized the need to re-evaluate the role of Bayesian methods in the context of rapidly advancing AI technologies, and whether the Bayesian approach can be adapted to new tools.

Insights from the Final Discussions

The concluding discussions synthesized the key findings from the seminar and outlined future directions for the community.

Community Building.

There was a strong consensus on the need to foster inclusivity within the Bayesian community, encompassing all levels of seniority, as well as industry and academia. It was stressed that a positive outlook on the Bayesian toolkit in reviews was also crucial. The community should view Bayesian methods as a set of useful tools, rather than as a rigid ideology. The importance of mentorship and support for junior researchers was also noted, as well as the value of bringing in people who may be implicitly Bayesian without realizing it.

Benchmarks and Applications.

Participants emphasized the importance of moving beyond traditional vision-based benchmarks to include decision-making and sequential learning tasks. The community needs to focus on identifying applications that highlight the unique advantages of Bayesian methods and create tools that can be used in impactful applications. The use of scoring rules for decision-making was also suggested, as it allows for a clear understanding of the value of improvements and highlights utility in downstream decisions as the key metric for the success of predictive systems. The discussion also highlighted the need to consider applications relevant to the current state of AI and other sciences, rather than relying on past applications.

Grand Challenges.

Discussions on grand challenges included developing a Bayesian equivalent of AlphaFold, addressing the ARC challenge, and incorporating LLMs as priors. Data efficiency was highlighted as a key strength of Bayesian methods, with the potential to significantly reduce the amount of data required for training. The group also raised important questions about the nature of reasoning and compositionality in models, as well as the challenge of building robust and trustworthy AI systems. The need for causal inference was also noted as critical for many real-world applications. The discussion also covered the possibility of using LLMs to learn structured models and programs.

Next Steps

The seminar concluded with the identification of several concrete steps to advance the field of Bayesian deep learning.

Benchmarks.

The community should develop benchmarks that are challenging for deep learning but can be addressed using Bayesian methods, with a focus on sequential and active learning. Data efficiency should also be a focus when creating benchmarks. Furthermore, existing benchmarks should be evaluated for adaptation, especially those that move beyond vision-based tasks.

Research.

Future research should move beyond traditional likelihood metrics and instead prioritize posterior predictive checks to ensure that models are making good predictions. Researchers should also seek to communicate the importance of decision outcomes and ensure that the metrics align with practical goals. There was also a call to focus on the principles behind Bayesian methods rather than just scaling, and to allow for alternative Bayesian inference frameworks (such as the martingale posterior).

Organization.

There is a need to establish a benchmark track at the yearly Symposium on Advances in Approximate Bayesian Inference (AABI), as well as to continue fostering connections within the Bayesian deep learning community through communication tools (e.g., Slack, Notion). The community should also explore the possibility of creating a Bayesian summer school and a virtual seminar, and seek integration with the International Society for Bayesian Analysis (ISBA), possibly through the foundation of a Bayesian deep learning chapter. There is also a desire to create a yearly Bayesian AI event to foster community. Furthermore, there was a call to share teaching resources to help standardize and improve international higher education in Bayesian machine learning.

2 Table of Contents

Executive Summary

Vincent Fortuin

Overview of Talks

Bayesian foundation models: Do we need to be Bayesian in pre-training?

Laurence Aitchison

Prediction versus Explanation

Alexander A. Alemi

Laplace vs. variational approximations: a biased point of view

Pierre Alquier

Evolution of the Bayesian paradigm in the age of modern machine learning

Julyan Arbel

Bayesian Foundation Models: Exploring the Intersection of Probabilistic Thinking and Modern AI

Thang Bui

Geometry of BNN posteriors

Gintare Karolina Dziugaite

Benchmarking models through uncertainties

Maurizio Filippone

Teaching Bayesian ML

Philipp Hennig

My Thoughts on “Fixing Deep Learning with Bayes”

Mohammad Emtiyaz Khan

Revisiting Bayesian Foundations in the Age of Modern AI

Jeremias Knoblauch

On Scaling Up Bayesian Neural Networks in LLM era

Yingzhen Li

Bong: Bayesian Online Natural Gradient Descent

Kevin Murphy

Bayes Plays the Lottery

Eric Nalisnick

Rethinking Predictive Uncertainty Decomposition

Tom Rainforth

Bayesian theory vs. practice

Daniel Roy

Bayesian Model Selection for Neural Architectures: A Path to Better Generalization

Mark van der Wilk

Working groups

Benchmarks and Applications

Vincent Fortuin

Community building and career

Vincent Fortuin

Grand challenges

Vincent Fortuin

Panel discussions

We can build subjective Bayesian priors for neural networks that we actually believe in

Maurizio Filippone, Vincent Fortuin, Daniel Roy, and Sinead Williamson

Bayes theorem is broken for making predictions with large models

Vincent Fortuin, Alexander A. Alemi, Jeremias Knoblauch, Eric Nalisnick, and Mark van der Wilk

Bayes is useless if we cannot scale to LLMs

Mariia Vladimirova, Yingzhen Li, and Tom Rainforth

Participants

3 Overview of Talks

3.1 Bayesian foundation models: Do we need to be Bayesian in pre-training?

Laurence Aitchison (University of Bristol, GB)

License: Creative Commons BY 4.0 International license © Laurence Aitchison

This talk challenged the need for Bayesian approaches in foundation model pre-training, arguing that the standard maximum likelihood objective is equivalent to Bayesian model averaging as long as each training sample is seen only once. The intuition behind this claim is that overfitting is not a concern during pre-training, and the sampling process should not depend on the trained network. The talk also highlighted the limitations of this approach, particularly when samples are seen multiple times across epochs. In contrast, it was suggested that Bayesian methods can be more easily applied in post-training settings, such as when fine-tuning models using low-rank adapters (LoRA). The discussion touched on connections to online learning, prequential loss, and the role of priors in limited data settings, as well as the relationship between optimization and Bayesian inference. The second part of the talk introduced Bayesian low-rank adaptation for large language models, exploring the loss functions and connections to KL penalties, and comparing the performance of different methods, including Laplace approximation and ensembles.
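The connection between single-pass training and Bayesian model averaging can be sketched with the chain rule of probability. The notation below is assumed for illustration ($x_{1:N}$ for the training sequence, $\theta$ for the weights) and is not a reproduction of the talk's derivation.

```latex
\log p(x_{1:N}) \;=\; \sum_{n=1}^{N} \log p(x_n \mid x_{<n}),
\qquad
p(x_n \mid x_{<n}) \;=\; \int p(x_n \mid \theta)\, p(\theta \mid x_{<n})\, \mathrm{d}\theta .
```

Each term on the left-hand side is the log-loss of the Bayesian model average on a sample it has not yet seen, which is the prequential loss mentioned above. In a single pass over the data, the training loss on each fresh sample is likewise computed before that sample has influenced the weights. Once samples are repeated across epochs, the loss is evaluated on data the model has already adapted to, and this reading no longer applies.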

3.2 Prediction versus Explanation

Alexander A. Alemi (Kissimmee, US)

License: Creative Commons BY 4.0 International license © Alexander A. Alemi

Joint work of: Alexander A. Alemi, Ben Poole, Warren R. Morningstar, Joshua V. Dillon

Main reference: Alexander A. Alemi, Ben Poole: “Variational Prediction”, CoRR, Vol. abs/2307.07568, 2023.

URL: https://doi.org/10.48550/ARXIV.2307.07568

Main reference: Warren R. Morningstar, Alex Alemi, Joshua V. Dillon: “PACm-Bayes: Narrowing the Empirical Risk Gap in the Misspecified Bayesian Regime”, in Proc. of the International Conference on Artificial Intelligence and Statistics, AISTATS 2022, 28-30 March 2022, Virtual Event, Proceedings of Machine Learning Research, Vol. 151, pp. 8270–8298, PMLR, 2022.

URL: https://proceedings.mlr.press/v151/morningstar22a.html

The world has changed. The alchemy of machine learning has unlocked powerful, guidable search in the space of functions. This highlights the distinction between prediction and explanation, two distinct goals in science that historically have been tightly coupled. Historically, Bayesian approaches promised both, and are provably optimal at making predictions, but only in expectation and if well-specified. Moving forward, we should ask ourselves two important questions. First, how ought we search in the space of predictions? What objectives can we discover that can generate predictions that contain some of the uncertainty quantification aspects we’ve come to love from Bayesian inference but that don’t require the full and costly machinery? Second, how ought we judge explanations? If we are interested in continuing the broad program of science, how do we decide between competing theories if each might contain submodules that constitute powerful prediction machines?

3.3 Laplace vs. variational approximations: a biased point of view

Pierre Alquier (ESSEC Business School – Singapore, SG)

License: Creative Commons BY 4.0 International license © Pierre Alquier

Main reference: Pierre Alquier: “User-friendly Introduction to PAC-Bayes Bounds”, Found. Trends Mach. Learn., Vol. 17(2), pp. 174–303, 2024.

URL: https://doi.org/10.1561/2200000100

The objective of this talk is to initiate a discussion on the respective pros and cons of Laplace approximations and variational approximations in Bayesian statistics. The discussion will be based on a set of theoretical tools known as PAC-Bayes bounds, introduced by [4] (see [1] for a recent survey).

PAC-Bayes bounds are tools used to analyze the theoretical performance of Bayesian and generalized Bayesian algorithms. They were recently used to prove the consistency of variational approximations [5, 2]. In the talk, I will show that: 1) if a PAC-Bayes bound can prove the consistency of the Laplace approximation in a model, it proves the consistency of the Gaussian variational approximation as an immediate corollary. 2) the converse is not true: I will provide an explicit example of a model where the PAC-Bayes bound proves the consistency of the Gaussian variational approximation, while the Laplace approximation is not consistent. The example is taken from [2].

More generally, PAC-Bayes bounds lead to the intuition that variational approximations will generalize better when we find flat minima of the likelihood or the empirical risk; see a very nice discussion with applications to deep learning in [3].
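The flat-minima intuition can be read off a standard PAC-Bayes bound. As a sketch, the bound below is stated in Catoni's form as presented in the survey [1], with notation assumed for this illustration: $R(\theta)$ is the population risk, $r(\theta)$ the empirical risk on $n$ i.i.d. observations, the loss takes values in $[0, C]$, $\pi$ is a prior fixed before seeing the data, and $\lambda > 0$ is a free parameter. With probability at least $1 - \delta$, simultaneously for all distributions $\rho$ over the parameters:

```latex
\mathbb{E}_{\theta \sim \rho}\big[ R(\theta) \big]
\;\le\;
\mathbb{E}_{\theta \sim \rho}\big[ r(\theta) \big]
+ \frac{\lambda C^{2}}{8 n}
+ \frac{\mathrm{KL}(\rho \,\|\, \pi) + \log(1/\delta)}{\lambda} .
```

The right-hand side trades off the average empirical risk under $\rho$ against the divergence from the prior. If $\rho$ is a Gaussian centred at a minimizer of $r$ and the minimum is flat, $\rho$ can be made wide without increasing $\mathbb{E}_{\rho}[r]$ much, and a wider $\rho$ will often sit closer to a diffuse prior in KL. A Gaussian variational approximation optimizes this kind of trade-off directly, whereas the Laplace approximation fixes its covariance from the local curvature alone.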

This is obviously a partial point of view. When we try to decide which approximation to use in a specific context, statistical properties are not the only criterion we should consider. Algorithmic considerations, for example, are also fundamental. However, I hope that this will help to understand that there are situations where Laplace should be avoided. More importantly, I hope this will initiate a fruitful discussion between the participants of the seminar.

References

[1] Alquier, P. User-friendly introduction to PAC-Bayes bounds. Foundations and Trends in Machine Learning, 2024, vol. 17, no. 2, pp. 174-303.
[2] Alquier, P. and Ridgway, J. Concentration of Tempered Posteriors and of their Variational Approximations. The Annals of Statistics, 2020, vol. 48, no. 3, pp. 1475-1497.
[3] Dziugaite, G. K. and Roy, D. Computing Nonvacuous Generalization Bounds for Deep (Stochastic) Neural Networks with Many More Parameters than Training Data. Proceedings of UAI, 2017.
[4] McAllester, D. A. Some PAC-Bayesian theorems. Proceedings of the eleventh annual conference on Computational learning theory (COLT), 1998.
[5] Yang, Y., Pati, D. and Bhattacharya, A. $\alpha$-variational inference with statistical guarantees. The Annals of Statistics, 2020, vol. 48, no. 2, pp. 886-905.

3.4 Evolution of the Bayesian paradigm in the age of modern machine learning

Julyan Arbel (INRIA – Grenoble, FR)

License: Creative Commons BY 4.0 International license © Julyan Arbel

In the era of large-scale AI, foundation models, and deep learning, the Bayesian paradigm is evolving beyond traditional posteriors. In this short talk, I introduced three distinct yet interconnected approaches to Bayesian inference: Generalized Bayes, Variational Inference, and the Laplace approximation. These methodologies offer a rich continuum of ways to be Bayesian in modern machine learning, balancing theoretical rigor and computational scalability. Through this lens, we reflect on the role of Bayesianism in contemporary AI, addressing both its theoretical developments and its place within the broader statistical community.
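For orientation, the three approaches can be written side by side. This is a sketch in commonly used notation, assumed here rather than taken from the talk: $\pi(\theta)$ is the prior, $\ell$ a loss function, $\lambda > 0$ a learning rate, $\mathcal{Q}$ a family of tractable distributions, and $\hat\theta$ the posterior mode.

```latex
\begin{align*}
\text{Generalized Bayes:}\quad
  & \pi_\lambda(\theta \mid x_{1:n}) \;\propto\; \pi(\theta)\,
    \exp\Big(-\lambda \sum_{i=1}^{n} \ell(\theta, x_i)\Big), \\
\text{Variational inference:}\quad
  & \hat{q} \;=\; \arg\min_{q \in \mathcal{Q}}
    \mathrm{KL}\big(q \,\|\, \pi(\cdot \mid x_{1:n})\big), \\
\text{Laplace approximation:}\quad
  & \pi(\theta \mid x_{1:n}) \;\approx\; \mathcal{N}\big(\hat\theta,\, H^{-1}\big),
    \qquad
    H \;=\; -\nabla_\theta^{2} \log \pi(\theta \mid x_{1:n}) \,\Big|_{\theta = \hat\theta} .
\end{align*}
```

Choosing $\ell(\theta, x) = -\log p(x \mid \theta)$ and $\lambda = 1$ in the first line recovers the standard posterior, which is one way to see the approaches as points on a continuum rather than as competitors.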

3.5 Bayesian Foundation Models: Exploring the Intersection of Probabilistic Thinking and Modern AI

Thang Bui (Australian National University – Acton, AU)

License: Creative Commons BY 4.0 International license © Thang Bui

This talk revisited the questions posed by Yee Whye Teh in his 2017 NeurIPS Breiman lecture, substituting foundation models for deep learning. Two empirical observations were presented, demonstrating the value of probabilistic thinking in understanding foundation models. Firstly, the marginal likelihood was shown to provide insight into the phenomenon of “grokking” or delayed generalization in neural networks. Secondly, a simple approximation scheme for Bernoulli and softmax likelihoods was introduced, which can outperform more principled approaches in certain cases. The talk also explored the potential benefits of Bayesian foundation models, highlighting their adaptability to various tasks that require uncertainty quantification, such as active learning. The discussion concluded with a call to action for the community to define tasks that would benefit from Bayesian foundation models, and to investigate the role of uncertainty in these applications.

3.6 Geometry of BNN posteriors

Gintare Karolina Dziugaite (Google DeepMind – Toronto, CA)

License: Creative Commons BY 4.0 International license © Gintare Karolina Dziugaite

I started by discussing some obstacles for Bayesian inference in DL: namely, the intractability of posterior distributions over the weights, as well as the existence of multiple minima. A recent conjecture by Entezari et al. suggests that, modulo permutation symmetries, there may be only one minimum when training with SGD. What are the implications for Bayesian inference in deep learning if the conjecture were true? Would the challenges we currently face when working in the weight space geometry be resolved?

I then discussed alternative geometries. I brought up the example of sped-up training when imposing the “right” structure on the feature space (Malinar et al., “Emergence in non-neural models”). This example demonstrates the potential of working in the feature/function space geometry, though realizing this potential is tricky, and right now figuring out the structure of the features can at best be done post-training.

Finally, I discussed the role of the “right” initialization that, historically, enabled deep neural network training (He initialization) and scaling ($\mu$-P initialization). I raised the question of what an analogue of signal propagation for Bayesian analysis would be. Perhaps what we are missing is the “right” initialization for priors that would enable scaling.

3.7 Benchmarking models through uncertainties

Maurizio Filippone (EURECOM – Biot, FR)

License: Creative Commons BY 4.0 International license © Maurizio Filippone

In the literature of Bayesian neural networks, there has been considerable interest in studying uncertainty. There have been numerous attempts to define uncertainties, and a converging point has been to use information theoretic measures to do so. There have been proposals to decompose the total predictive uncertainty, measured through the entropy of the predictive distribution obtained by marginalizing parameters, into (mainly) aleatoric and epistemic uncertainties. While these measures make intuitive sense, in my presentation I tried to make two points on the usefulness of these measures. First, these measures of uncertainty have been criticized by some recent works (e.g., Wimmer et al., UAI 2023 and Schweighofer et al., arXiv:2410.10786), which point out some pathological behaviors in some specific cases. Second, I presented a simple polynomial regression example with a dataset generated from a fixed order polynomial; in this example, I reported total, epistemic and aleatoric uncertainties with respect to modeling choices with increasing model order and with respect to increasing number of observations. These results show some sensible behavior, but suggest some difficulties in using them in an actionable way to compare, select, or improve models. This motivates studying alternative ways to study uncertainties with the aim of benchmarking models and inference methods, and I welcomed discussions on this front.
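The information-theoretic decomposition referred to above can be sketched in a few lines. This is an illustration under stated assumptions, not the code behind the talk's experiments: the posterior is represented by a finite set of samples, each giving a class distribution for one input, and the two toy inputs are hypothetical. Total uncertainty is the entropy of the averaged prediction, aleatoric uncertainty is the average entropy of the individual predictions, and epistemic uncertainty is their difference (a mutual information).

```python
import math

def entropy(p):
    """Shannon entropy in nats; zero-probability classes contribute nothing."""
    return -sum(q * math.log(q) for q in p if q > 0)

def decompose(member_probs):
    """Split predictive entropy, given one class distribution per posterior sample."""
    n_samples = len(member_probs)
    n_classes = len(member_probs[0])
    # Posterior predictive: average the per-sample distributions (marginalize parameters).
    predictive = [sum(m[k] for m in member_probs) / n_samples for k in range(n_classes)]
    total = entropy(predictive)
    aleatoric = sum(entropy(m) for m in member_probs) / n_samples
    epistemic = total - aleatoric
    return total, aleatoric, epistemic

# Hypothetical posterior samples for a binary problem.
agree = [[0.5, 0.5], [0.5, 0.5]]     # samples agree that the label is noisy
disagree = [[1.0, 0.0], [0.0, 1.0]]  # samples are confident but contradict each other

print(decompose(agree))
print(decompose(disagree))
```

Both inputs have the same total uncertainty (log 2), yet the first is attributed entirely to the aleatoric term and the second entirely to the epistemic term. The sketch shows only the definitions; it does not address the criticisms cited above or the difficulty of acting on these numbers when comparing models.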

3.8 Teaching Bayesian ML

Philipp Hennig (Universität Tübingen, DE)

License: Creative Commons BY 4.0 International license © Philipp Hennig

Main reference: Philipp Hennig: “Probabilistic Machine Learning”, University of Tübingen, 2022.

URL: https://youtube.com/playlist?list=PL05umP7R6ij2YE8rRJSb-olDNbntAQ_Bx

Teaching, not research, is the principal way for academics to create value. While industry and academia compete with each other in research, teaching is unique to academia. However, we rarely talk about teaching this way, and most academics are primarily evaluated based on their research contribution, rather than their teaching impact. When we teach, we actively decide what to teach and what not. By deciding what not to teach, we decide what to let die. Academics who want to ensure their research work outlasts them should ultimately actively aim to get their work immortalized in the teaching canon. Writing papers is not the best way to achieve this. Teaching, then, is an active act of forming the body of knowledge for entire generations of students, of formalizing the field into a canon. Because no course is ever complete, it would be an overall loss of knowledge if every student only learned from one and the same course. If each university teaches ML differently, this allows for some diversity in teaching ML, leading to more knowledge retained in distribution. There are many valuable concepts (e.g. Bayesian Nonparametrics, Sum Product, and more generally message passing in graphical models, Support and Relevance Vector Machines) that have now dropped from the widely taught curriculum, and are thus slated to vanish from memory. In my talk, I used my own “Probabilistic ML” course in Tübingen as a trigger for discussion of the above points.

3.9 My Thoughts on “Fixing Deep Learning with Bayes”

Mohammad Emtiyaz Khan (RIKEN – Tokyo, JP)

The biggest challenge with modern AI is that it requires a huge amount of resources, which favors monopoly and is harmful to society in the long run. We need to fix this and develop better systems that are more sustainable, transparent, and trustworthy. To do so, we must retrain less and reuse previously trained models more. I argue that lifelong learning is the ultimate goal and that designing methods for incremental, continual, federated, and active learning is extremely important. These are the problems where Bayesian principles shine. Our group has focused on such problems for many years, and I will briefly present this work.

3.10 Revisiting Bayesian Foundations in the Age of Modern AI

Jeremias Knoblauch (University College London, GB)

This talk examined the assumptions underlying traditional Bayesianism, specifically: (A1) the model is well-specified, (A2) beliefs are captured in the prior, and (A3) computation is tractable. The limitations of these assumptions in the context of modern AI were discussed, and several alternative approaches were presented to address model misspecification, including likelihood tempering, robust loss functions yielding a pseudo-posterior, and martingale posteriors. These methods represent a departure from traditional Bayesianism and can be seen as part of a broader class of “post-Bayesian” approaches, which aim to adapt Bayesian foundations to the complexities of modern AI applications.
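As a small illustration of likelihood tempering, one of the approaches named above, the sketch below raises a Gaussian likelihood to a power `beta` before combining it with a Gaussian prior on the mean. The conjugate setting, the function name `tempered_gaussian_posterior`, and all numerical values are assumptions chosen for illustration; they are not taken from the talk.

```python
import numpy as np

def tempered_gaussian_posterior(y, noise_var, prior_mean, prior_var, beta):
    # Posterior proportional to prior(theta) * likelihood(y | theta)^beta for a
    # Gaussian mean with known noise variance; beta = 1 is standard Bayes.
    n = len(y)
    post_prec = 1.0 / prior_var + beta * n / noise_var
    post_mean = (prior_mean / prior_var + beta * np.sum(y) / noise_var) / post_prec
    return post_mean, 1.0 / post_prec

rng = np.random.default_rng(1)
y = rng.normal(loc=2.0, scale=1.0, size=30)

standard = tempered_gaussian_posterior(y, 1.0, 0.0, 10.0, beta=1.0)
tempered = tempered_gaussian_posterior(y, 1.0, 0.0, 10.0, beta=0.3)
print("beta=1.0 (mean, variance):", standard)
print("beta=0.3 (mean, variance):", tempered)
```

In this conjugate case, tempering with `beta < 1` acts like a reduced effective sample size: the posterior stays wider and closer to the prior, which is one way to avoid overconfidence when the likelihood is suspected to be misspecified.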

3.11 On Scaling Up Bayesian Neural Networks in LLM era

Yingzhen Li (Imperial College London, GB)

Joint work of: Hippolyt Ritter, Martin Kukla, Cheng Zhang, Yingzhen Li

Main reference: Hippolyt Ritter, Martin Kukla, Cheng Zhang, Yingzhen Li: “Sparse Uncertainty Representation in Deep Learning with Inducing Weights”, in Proc. of the Advances in Neural Information Processing Systems 34: Annual Conference on Neural Information Processing Systems 2021, NeurIPS 2021, December 6-14, 2021, virtual, pp. 6515–6528, 2021.

URL: https://proceedings.neurips.cc/paper/2021/hash/334467d41d5cf21e234465a1530ba647-Abstract.html

A paradigm shift is ongoing in the deep learning field, where researchers are racing to train multimodal foundation models bigger than ever. Currently, non-probabilistic approaches dominate the playground of big foundation models, where most Bayesian neural network approaches are too expensive to compete. But I claim the practitioner’s impression that “Bayesian neural networks with approximate inference are always more expensive than simple deterministic neural networks” will no longer be true in the near future. To partly prove my point, I discuss an example from my previous and ongoing work that yields Bayesian neural networks with lower time and memory complexity than deterministic neural networks. The final piece of the puzzle is to translate the time complexity gains into wall-clock savings, and here the key is to leverage GPU compute more efficiently. I argue that algorithm-hardware co-design has been largely ignored by the Bayesian deep learning community, and I hope my encouraging initial result will inspire future Bayesian ML/DL developments that better leverage contemporary computing hardware.

3.12 Bong: Bayesian Online Natural Gradient Descent

Kevin Murphy (Google DeepMind – Mountain View, US)

Joint work of: Matt Jones, Peter Chang, Kevin Murphy

Main reference: Matt Jones, Peter G. Chang, Kevin Murphy: “Bayesian Online Natural Gradient (BONG)”, CoRR, Vol. abs/2405.19681, 2024.

URL: https://doi.org/10.48550/ARXIV.2405.19681

We propose a novel approach to sequential Bayesian inference based on variational Bayes (VB). The key insight is that, in the online setting, we do not need to add the KL term to regularize to the prior (which comes from the posterior at the previous timestep); instead we can optimize just the expected log-likelihood, performing a single step of natural gradient descent starting at the prior predictive. We prove this method recovers exact Bayesian inference if the model is conjugate. We also show how to compute an efficient deterministic approximation to the VB objective, as well as our simplified objective, when the variational distribution is Gaussian or a sub-family, including the case of a diagonal plus low-rank precision matrix. We show empirically that our method outperforms other online VB methods in the non-conjugate setting, such as online learning for neural networks, especially when controlling for computational costs.
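The conjugate-case claim can be checked numerically in a simple setting. The sketch below is not the authors' implementation: it assumes a linear-Gaussian regression model with known noise variance and a full-covariance Gaussian variational family, and the function names are hypothetical. Each observation triggers one unit-step natural gradient update of the expected log-likelihood, starting from the previous posterior, with no KL term; the result is compared with the batch Bayesian posterior.

```python
import numpy as np

def online_natural_gradient_step(mu, prec, x, y, noise_var):
    # Gaussian q(theta) = N(mu, prec^{-1}) with natural parameters
    # eta1 = prec @ mu and eta2 = -prec / 2.
    eta1 = prec @ mu
    # The natural gradient w.r.t. natural parameters equals the ordinary gradient
    # of E_q[log p(y | x, theta)] w.r.t. the expectation parameters
    # (E[theta], E[theta theta^T]); for a linear-Gaussian likelihood these are:
    grad_m1 = y * x / noise_var
    grad_m2 = -np.outer(x, x) / (2.0 * noise_var)
    eta1_new = eta1 + grad_m1        # unit step size
    prec_new = prec - 2.0 * grad_m2  # because eta2 = -prec / 2
    mu_new = np.linalg.solve(prec_new, eta1_new)
    return mu_new, prec_new

def batch_posterior(X, Y, prior_mu, prior_prec, noise_var):
    # Exact Bayesian linear regression posterior, used as the reference.
    prec = prior_prec + X.T @ X / noise_var
    mu = np.linalg.solve(prec, prior_prec @ prior_mu + X.T @ Y / noise_var)
    return mu, prec

rng = np.random.default_rng(0)
d, n, noise_var = 3, 50, 0.25
theta_true = rng.normal(size=d)
X = rng.normal(size=(n, d))
Y = X @ theta_true + np.sqrt(noise_var) * rng.normal(size=n)

mu, prec = np.zeros(d), np.eye(d)  # prior N(0, I)
for x_t, y_t in zip(X, Y):
    mu, prec = online_natural_gradient_step(mu, prec, x_t, y_t, noise_var)

mu_batch, prec_batch = batch_posterior(X, Y, np.zeros(d), np.eye(d), noise_var)
print("max |mean gap|:", float(np.max(np.abs(mu - mu_batch))))
```

In the non-conjugate settings the abstract targets, such as neural networks, the expected log-likelihood gradients are no longer available in closed form and must be approximated, which this sketch does not attempt.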

3.13 Bayes Plays the Lottery

Eric Nalisnick (Johns Hopkins University – Baltimore, US)

Bayesian approaches have the potential to mitigate problems with neural networks (NNs) such as overconfidence and lack of robustness. However, computation is a major obstacle to performing high-fidelity posterior inference. In this talk, I will first present our research on scalable variational approximations based on subnetworks. Only a subset of the NN is given a Bayesian treatment, and we find this is enough to perform competitive uncertainty estimation. I will then go on to further justify subnetwork inference, not simply for its computational benefits, but from the theoretical insight that these NNs have as rich a posterior predictive distribution as fully-stochastic NNs. Moreover, across various inference schemes, we observe no empirical benefit to using fully stochastic NNs. I will close by questioning whether a fully-Bayesian treatment of NNs can ever have a benefit.

3.14 Rethinking Predictive Uncertainty Decomposition

Tom Rainforth (University of Oxford, GB)

Joint work of: Freddie Bickford Smith, Jannik Kossen, Eleanor Trollope, Mark van der Wilk, Adam Foster, Tom Rainforth

Main reference: Freddie Bickford Smith, Jannik Kossen, Eleanor Trollope, Mark van der Wilk, Adam Foster, Tom Rainforth: “Rethinking Aleatoric and Epistemic Uncertainty”, CoRR, Vol. abs/2412.20892, 2024.

URL: https://doi.org/10.48550/ARXIV.2412.20892

The ideas of aleatoric and epistemic uncertainty are widely used to reason about the probabilistic predictions of machine-learning models. This talk highlights that the common understanding of these ideas is not self-consistent, with the terms overloaded to refer to some quite distinct precise mathematical quantities. This has caused a number of issues with the narrative in the literature, including a number of spurious associations, such as between the data-generating process and a model’s predictive uncertainty, between parameter stochasticity and the reducibility of predictive uncertainty, and between subjective uncertainty estimates and objective measures of predictive performance. We suggested a new ontology for how to talk about uncertainty decomposition, and discussed what quantities we actually wish to estimate in practice, and how to go about this.

3.15 Bayesian theory vs. practice

Daniel Roy (University of Toronto, CA)

I discussed the challenges in modeling complex data, where one necessarily fails to meet the high bar set by the theory motivating subjective Bayesianism. I briefly posited that perhaps trying out models and sticking with those that work is a deep part of the success of Bayesianism. I then proposed another approach, rooted in regret theory, where one aims to control the difference between the losses incurred by one's own actions and the losses that would have been incurred by any policy in a set of (comparator) policies. I introduced exponential weights, a classical algorithm that delivers nontrivial regret guarantees when competing against a finite comparator set, and showed that exponential weights is, de facto, updating a (generalized Bayesian) posterior over the comparator set, which it uses to guide actions. Unlike Bayesian guarantees for posterior updating and prediction, which rest on the validity of the model and prior, regret guarantees for these very same algorithms hold without assumption, offering a new perspective with much wider purview.
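The connection between exponential weights and posterior updating can be sketched in a few lines. The code below is an illustrative assumption rather than material from the talk: it uses a uniform prior over a finite comparator set, random losses in [0, 1], and the standard tuning of the learning rate `eta`. It checks the classical bound, regret at most ln(N)/eta + eta*T/8, and then shows that with `eta = 1` and negative log-likelihood losses the weights coincide with a Bayesian posterior.

```python
import numpy as np

def normalize_log_weights(log_w):
    w = np.exp(log_w - log_w.max())
    return w / w.sum()

def exponential_weights(losses, eta):
    # losses has shape (T, N): loss of each of N comparators in each of T rounds.
    T, N = losses.shape
    log_w = np.full(N, -np.log(N))        # uniform prior over comparators
    learner_loss = 0.0
    for t in range(T):
        w = normalize_log_weights(log_w)  # current generalized posterior
        learner_loss += w @ losses[t]     # expected loss of the weighted mixture
        log_w = log_w - eta * losses[t]   # multiplicative (Bayes-like) update
    return learner_loss, normalize_log_weights(log_w)

rng = np.random.default_rng(0)
T, N = 500, 10
losses = rng.uniform(size=(T, N))         # arbitrary losses in [0, 1]
eta = np.sqrt(8.0 * np.log(N) / T)
learner_loss, _ = exponential_weights(losses, eta)
regret = learner_loss - losses.sum(axis=0).min()
bound = np.log(N) / eta + eta * T / 8.0
print("regret:", regret, "bound:", bound)

# With eta = 1 and negative log-likelihood losses, the update is Bayes' rule.
thetas = np.array([0.2, 0.5, 0.8])        # candidate Bernoulli parameters
bits = rng.integers(0, 2, size=40)
nll = -(bits[:, None] * np.log(thetas) + (1 - bits[:, None]) * np.log(1 - thetas))
_, ew_weights = exponential_weights(nll, eta=1.0)
bayes_posterior = normalize_log_weights(-nll.sum(axis=0))
```

The regret bound holds for any loss sequence in [0, 1], with no statistical assumption on how the losses are generated, which is the contrast with model-based Bayesian guarantees drawn above.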

3.16 Bayesian Model Selection for Neural Architectures: A Path to Better Generalization

Mark van der Wilk (University of Oxford, GB)

This talk explored the role of Bayesian methods in selecting neural network architectures, with the goal of improving generalization performance. It was argued that machine learning excels at optimizing function spaces, but regularization is inherently tied to the chosen function space. It was then proposed to use Bayesian model selection to choose between architectures, rather than just parameters, by evaluating the marginal likelihood of the data given the architecture. This approach allows for the selection of models that generalize better outside the training data regime. The talk highlighted the challenges of computing the marginal likelihood, which is often approximated using point estimates. It also discussed the potential benefits of using Bayesian model selection to understand causality and causal direction, where the prior defines a notion of simplicity that informs model choice. By parametrizing equivariances, Bayesian model selection can also help choose appropriate transformations, leading to more robust and generalizable models.
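The idea of choosing between model structures by marginal likelihood can be illustrated on a toy problem where the evidence is available in closed form. The sketch below uses polynomial order in Bayesian linear regression as a stand-in for architecture choice; this substitution, the synthetic cubic data, the hyperparameter values, and the name `log_evidence` are all assumptions made for illustration, not the method presented in the talk.

```python
import numpy as np

def log_evidence(x, y, order, noise_var, prior_var):
    # Marginal likelihood of Bayesian linear regression with polynomial features:
    # y ~ N(0, prior_var * Phi Phi^T + noise_var * I) after integrating out weights.
    Phi = np.vander(x, order + 1, increasing=True)
    K = prior_var * Phi @ Phi.T + noise_var * np.eye(len(x))
    _, logdet = np.linalg.slogdet(K)
    quad = y @ np.linalg.solve(K, y)
    return -0.5 * (quad + logdet + len(x) * np.log(2.0 * np.pi))

rng = np.random.default_rng(0)
x = rng.uniform(-1.0, 1.0, size=40)
y = 0.5 - x + 2.0 * x**3 + 0.1 * rng.normal(size=40)  # synthetic cubic data

evidences = {order: log_evidence(x, y, order, noise_var=0.01, prior_var=1.0)
             for order in range(0, 10)}
for order, value in evidences.items():
    print(f"order={order} log evidence={value:.2f}")
```

Models that are too simple are penalized through the data-fit term, while overly flexible ones pay through the log-determinant term, so the evidence trades off fit against complexity without a held-out set. For neural networks the integral over weights is intractable, which is why the talk stresses approximations to the marginal likelihood.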

4 Working groups

4.1 Benchmarks and Applications

Vincent Fortuin (Helmholtz AI – Neuherberg, DE)

This working group discussed the need for robust and long-lived benchmarks to measure progress in Bayesian research, similar to those that have driven advances in deep learning. The current over-reliance on vision-based benchmarks was identified as a limitation, and the group explored alternative applications and domains where Bayesian methods can add value. The importance of decision-making and scoring rules was emphasized, with a focus on deriving scoring rules from decision utilities in real-world applications rather than solely relying on likelihood-based metrics. The group also highlighted opportunities for Bayesian methodology in areas such as fine-tuning large language models with small datasets, synthetic benchmarks for sequential decision making, and active learning. To move forward, the group proposed developing benchmarks that showcase the strengths of Bayesian approaches, such as data efficiency and robustness, and creating tools that can be used as building blocks for more impactful applications. Additionally, the group discussed the need to look beyond likelihood as a metric and to better communicate the value of Bayesian methods in terms of decision outcomes, using techniques such as posterior predictive checks. Finally, it was suggested to establish a Benchmark Track at large Bayesian scientific venues, such as the Symposium on Advances in Approximate Bayesian Inference (AABI).

4.2 Community building and career

Vincent Fortuin (Helmholtz AI – Neuherberg, DE)

This working group discussed strategies for fostering a welcoming and inclusive community of Bayesian researchers, encompassing both current and prospective members. The group emphasized the importance of embracing Bayesian methods as a set of tools and ideas, rather than a rigid ideology, and promoting a positive narrative that highlights successes and contributions. To achieve this, the group suggested shifting the focus from criticizing flaws to evaluating the potential for contribution, particularly when reviewing papers and grant applications. The discussion also highlighted the need to engage junior researchers, through initiatives such as nominating them as reviewers, promoting networking opportunities, and establishing virtual seminar series and summer schools. Additionally, the group touched on the importance of branding, inclusivity across seniority levels and sectors, and striking a balance between short-term and long-term research goals. Finally, it was suggested to form a chapter for Bayesian deep learning at the International Society for Bayesian Analysis (ISBA). By adopting a more inclusive and supportive approach, the Bayesian community can become a more vibrant and attractive hub for researchers from diverse backgrounds and career stages.

4.3 Grand challenges

Vincent Fortuin (Helmholtz AI – Neuherberg, DE)

This working group identified and discussed a range of grand challenges for Bayesian research, spanning applications, algorithms, and fundamental understanding. The group explored potential “AlphaFold” moments for Bayesian methods, such as tackling the ARC challenge, planning as inference, and reasoning by analogy. The discussion emphasized the need for hierarchical goals, including ARC, Craftax, and scientific applications, and questioned whether being as good as a large language model (LLM) is a sufficient goal, suggesting that aiming for 1000x better performance could be more ambitious. The group also touched on the importance of causal inference, using LLMs as priors in Bayesian inference, and the potential benefits of Bayesian approaches, such as data efficiency and marginalization. Other challenges and opportunities discussed included combinatorial optimization, compositional modeling, local and distributed learning, and personalized AI. The group also explored the potential for Bayesian methods to address challenges in science, such as understanding rare diseases, and the need for explainable AI, uncertainty quantification, and robust systems. Finally, the discussion highlighted the importance of understanding how deep learning and LLMs learn effectively, and the potential for Bayesian optimization, simulation, and modeling of complex systems, such as nervous systems and physical systems.

5 Panel discussions

5.1 We can build subjective Bayesian priors for neural networks that we actually believe in

Maurizio Filippone (EURECOM – Biot, FR), Vincent Fortuin (Helmholtz AI – Neuherberg, DE), Daniel Roy (University of Toronto, CA), and Sinead Williamson (Apple – Seattle, US)

License: Creative Commons BY 4.0 International license © Maurizio Filippone, Vincent Fortuin, Daniel Roy, and Sinead Williamson

This discussion addressed the feasibility of constructing subjective Bayesian priors for neural networks that reflect genuine beliefs. Arguments in favor emphasized the importance of priors in Bayesian methods, particularly for out-of-distribution data where uncertainty is crucial. It was suggested that priors on function spaces, such as those using Gaussian processes or martingale posteriors, may be easier to specify than priors on model parameters. Additionally, domain knowledge from experts can inform the design of priors. On the other hand, counterarguments highlighted the challenges of defining priors, especially for complex models like neural networks. It was pointed out that the parameters of neural networks are not necessarily meaningful, and that the focus should be on the functions they represent. Furthermore, the lack of a universal Bayesian theory that explains why Bayesian methods work in all cases raises questions about the reliability of these approaches. The discussion also touched on the role of prior predictive checks, the importance of uncertainty quantification, and the trade-offs between universal models and domain-specific collaborations. Ultimately, the debate underscored the complexity of building subjective Bayesian priors for neural networks and the need for further research in this area.

5.2 Bayes theorem is broken for making predictions with large models

Vincent Fortuin (Helmholtz AI – Neuherberg, DE), Alexander A. Alemi (Kissimmee, US), Jeremias Knoblauch (University College London, GB), Eric Nalisnick (Johns Hopkins University – Baltimore, US), and Mark van der Wilk (University of Oxford, GB)

License: Creative Commons BY 4.0 International license © Vincent Fortuin, Alexander A. Alemi, Jeremias Knoblauch, Eric Nalisnick, and Mark van der Wilk

This guided discussion was held to debate the effectiveness of Bayes’ theorem in making predictions with large models. Those in favor of the motion argued that Bayes’ theorem is meant for beliefs about real-world entities, not abstract model parameters, and that it may not be epistemologically sound in the context of large models. They also pointed out that exact inference is often infeasible, and that priors are frequently misspecified. On the other hand, those against the motion argued that Bayes’ theorem can be applied to any model, regardless of its size or complexity, and that it provides a framework for reasoning about uncertainty. They also suggested that approximate Bayesian approaches can be effective, even if they are not perfect, and that the benefits of Bayes’ theorem may shrink with increasing dataset size, but are still relevant for updating pre-trained models on small datasets. The discussion highlighted the complexity and nuance of the issue, with no clear consensus emerging.

5.3 Bayes is useless if we cannot scale to LLMs

Mariia Vladimirova (Criteo – Paris, FR), Yingzhen Li (Imperial College London, GB), and Tom Rainforth (University of Oxford, GB)

This discussion addressed the relevance of Bayesian methods in the context of large language models (LLMs). It centered on the idea that Bayesian methods need to adapt to the current landscape of machine learning, where LLMs are becoming increasingly prominent. Some argued that Bayesian approaches are essential for small-data regimes, but struggle to scale to larger models. Others countered that Bayes is not just about inference, but also provides a decision-theoretic framework that can be useful in various applications, including scientific experiments and clinical trials. The conversation highlighted the need for the Bayesian community to engage with current developments in machine learning, to focus on promising directions, and to develop practical tools that can showcase the value of Bayesian methods. The discussion also touched on the importance of approximate inference, the potential of using LLMs as priors, and the need to rethink the foundations of epistemology in the context of modern AI systems. Ultimately, the debate emphasized the importance of adapting Bayesian methods to the modern era of machine learning, while also acknowledging the challenges and opportunities that come with scaling to larger models.

6 Participants

Laurence Aitchison – University of Bristol, GB

Alexander A. Alemi – Kissimmee, US

Pierre Alquier – ESSEC Business School – Singapore, SG

Julyan Arbel – INRIA – Grenoble, FR

Thang Bui – Australian National University – Acton, AU

Kamélia Daudel – ESSEC Business School – Cergy Pontoise, FR

Gintare Karolina Dziugaite – Google DeepMind – Toronto, CA

Carl Henrik Ek – University of Cambridge, GB

Maurizio Filippone – EURECOM – Biot, FR

Katharine Fisher – MIT – Cambridge, US

Vincent Fortuin – Helmholtz AI – Neuherberg, DE

Pablo García Arce – Institute of Mathematical Sciences – Madrid, ES

Erin Grant – University College London, GB

Philipp Hennig – Universität Tübingen, DE

Alexander Immer – Bioptimus – Zürich, CH

Desi Ivanova – University of Oxford, GB

Theofanis Karaletsos – Paramid, US

Mohammad Emtiyaz Khan – RIKEN – Tokyo, JP

Jeremias Knoblauch – University College London, GB

Yingzhen Li – Imperial College London, GB

Thomas Möllenhoff – RIKEN – Tokyo, JP

Kevin Murphy – Google DeepMind – Mountain View, US

Eric Nalisnick – Johns Hopkins University – Baltimore, US

Roi Naveiro Flores – CUNEF University – Madrid, ES

Theodore Papamarkou – Zhejiang Normal University – Jinhua, CN

Guiomar Pescador Barrios – Imperial College London, GB

Tom Rainforth – University of Oxford, GB

Daniel Roy – University of Toronto, CA

Tim Rudner – New York University, US

Maja Rudolph – University of Wisconsin – Madison, US

David Rügamer – LMU München, DE

Jan-Willem van de Meent – University of Amsterdam, NL

Tycho van der Ouderaa – University of Oxford, GB

Mark van der Wilk – University of Oxford, GB

Mariia Vladimirova – Criteo – Paris, FR

Florian Wenzel – Mirelo AI – Tübingen, DE

Sinead Williamson – Apple – Seattle, US

Andrew G. Wilson – New York University, US
