One-sentence summary:
Extending Bayesian optimization from
estimating global optima to
estimating other function properties defined
by the output of algorithms.

Abstract

In many real world problems, we want to infer some property of an expensive
black-box function \(f\), given a budget of \(T\) function evaluations. One
example is budget constrained global optimization of \(f\), for which Bayesian
optimization is a popular method. Other properties of interest include local
optima, level sets, integrals, or graph-structured information induced by
\(f\). Often, we can find an algorithm \(\mathcal{A}\) to compute the desired
property, but it may require far more than \(T\) queries to execute. Given
such an \(\mathcal{A}\), and a prior distribution over \(f\), we refer to the
problem of inferring the output of \(\mathcal{A}\) using \(T\) evaluations as
Bayesian Algorithm Execution (BAX). To tackle this problem, we present a
procedure, InfoBAX, that sequentially chooses queries that maximize mutual
information with respect to the algorithm's output. Applying this to
Dijkstra's algorithm, for instance, we infer shortest paths in synthetic and
real-world graphs with black-box edge costs. Using evolution strategies, we
yield variants of Bayesian optimization that target local, rather than global,
optima. On these problems, InfoBAX uses up to 500 times fewer queries to \(f\)
than required by the original algorithm. Our method is closely connected to
other Bayesian optimal experimental design procedures such as entropy search
methods and optimal sensor placement using Gaussian processes.

Example

This video shows InfoBAX with Dijkstra's algorithm to estimate
shortest paths in graphs with expensive edge-weight queries. The goal is
to estimate the shortest path between the purple and gold squares (where
dashed line is the true shortest path), using a limited budget of
expensive edge queries (black dots). See details below.

Project Overview

Suppose we have an expensive and possibly-noisy black-box function \(f\).

We can query (evaluate) \(f\) at an input \(x\), but we don't know the function. In
many fields, a common task is to find a global optima of \(f\) via a sequence of
queries—and a popular approach is known as Bayesian optimization.

However, suppose we want to estimate some other property of \(f\). Here are some
example function properties of interest (see also the image at the top of this page):

Now suppose this function property of interest is computable.
This means there exists some algorithm \(\mathcal{A}\) that can be run on \(f\) and
returns the property as output (regardless of how many queries \(\mathcal{A}\) requires).

For example, if our property is the local minimum of \(f\), there exist a
variety of local optimization algorithms (e.g. gradient descent, evolutionary
algorithms, Nelder-Mead method) that return a local minimum of \(f\).

There are many computable function properties, such as optima variants (global,
local, top-\(k\)), Pareto frontiers, sub/super-level sets, phase boundaries, roots,
integrals, expectations, graph properties, and more.

Bayesian algorithm execution (BAX) is the task
of efficiently estimating a computable function property, via a probabilistic model
(i.e. extending Bayesian optimization to function properties beyond global optima).
Equivalently, we can view BAX as the task of estimating the output of the algorithm
\(\mathcal{A}\) using as few function queries as possible (regardless of how many
queries \(\mathcal{A}\) requires to execute).

We've developed an information-based method for BAX called InfoBAX,
which sequentially chooses queries that maximize mutual information with respect to the
algorithm's output:

To optimize the acquisition function (line 2), there are two stages:

Run the algorithm \(\mathcal{A}\) on posterior function samples to get a set of
execution path samples.

Using the cached set of execution path samples, approximate the expected
information gain (EIG) for any input \(x\).
See the paper for details.

Here is an illustration of these two stages:

This is similar to information-based methods for Bayesian optimization, such as
predictive entropy search, and it allows
the acquisition function to be optimized efficiently!

After running InfoBAX, we have a posterior estimate of the computable function
property (or, equivalently, a posterior estimate of the output of algorithm
\(\mathcal{A}\)).

Here's the full story in one figure:

Some Experimental Results

Estimating shortest paths in graphs

Suppose we have a graph with edge weights defined by a black-box function. We can
view the shortest path between two nodes in this graph as a property of this function.
To compute the shortest path, we could use a classical graph traversal algorithm such as
Dijkstra's algorithm,
which iteratively accesses edge weights during its execution.

Suppose, however, that accessing edge weights is expensive. For example, in
real-world transportation networks—such as road, air, or shipping networks—accessing
each edge weight might involve a costly measurement (e.g. via satellite) or price
negotiation. In these cases, a naive deployment of Dijkstra's algorithm may use many
more edge queries than our budget allows. Instead, we aim to use InfoBAX to
estimate the output of Dijkstra's algorithm using the fewest possible number of edge
queries.

As a concrete example, suppose we have a grid-shaped graph, where a black-box
function over edge weights is shown in green. We can plot the set of edges accessed by
Dijkstra's algorithm (black dots) and the shortest path output of the algorithm (dashed
black line) between the start and goal nodes (purple and gold boxes).

Here, Dijkstra's algorithm requires querying over 300 edge weights. To reduce this
number, we could instead try the following: query a smaller set of edge weights and
infer the shortest path using a probabilistic model. In this setting, InfoBAX is
designed to choose edge queries to most efficiently infer the shortest path.

We compare different sampling strategies for estimating the output of Dijkstra's
algorithm, such as random search (left), uncertainty sampling (middle), or
InfoBAX (right). For each method, the distribution over inferred shortest paths
is shown in blue.

Using InfoBAX, we can infer the shortest path with high accuracy using a much
smaller set of edge queries.

Here is an animation showing InfoBAX with Dijkstra's algorithm on a larger grid graph:

Bayesian local optimization

While Bayesian optimization is a popular method for global optimization,
there also exist many local optimization algorithms such as
evolution strategies,
the Nelder-Mead method, and
finite-difference gradient descent,
which may show strong performance in certain settings—such as in high-dimensions—when function
evaluations are cheap and many queries can be made.

We instead propose to run InfoBAX with a local optimization algorithm. This yields a
variant of Bayesian optimization for expensive black-box functions that targets local,
rather than global, optima. The main idea is that we approach Bayesian optimization as
the task of inferring the output of a local optimization algorithm run on \(f\), rather
than inferring a global optima of \(f\).

As a concrete example, suppose we have a two-dimensional black-box function \(f\),
shown in green, with three optima (yellow squares). We can plot the set of queries made
by an evolution strategy algorithm (black dots), and the algorithm output (pink star).

Here, this evolution strategy algorithm makes over 200 function queries. To reduce
this number, we can apply InfoBAX, and make queries to estimate the output of the
evolution strategy.

We show the sampling pattern after 18 queries chosen by InfoBAX (right), random
search (left), and max-value entropy search (middle), a Bayesian (global) optimization
algorithm.

Here is an animation showing InfoBAX with an evolution strategy algorithm. Posterior
samples of the algorithm execution path and output are respectively shown as blue and
purple dots, and the next query (i.e. acquisition function optimizer) is shown as a pink
dot.

Top-\(k\) estimation

Given a finite set of points \(X\), we may wish to estimate the subset of \(k\)
points that have the highest value under a black-box function \(f\). This generalizes
standard Bayesian optimization (i.e. top-1 estimation) to top-\(k\) estimation. For
example, this type of task appears in materials discovery
and interactive learning problems.

If we had an unlimited budget, we could simply deploy a top-\(k\) algorithm: scan
through each point \(x \in X\), query its value \(f(x)\), and afterwards sort the points
in decreasing order by their values and return the first \(k\) points. However, this
requires us to query each point in \(X\).

Instead we will run InfoBAX with this top-\(k\) algorithm. In the following
animation, each point in the set \(X\) is shown as a grey dot, and the (ground truth)
top-\(k\) points in \(X\) are shown as gold stars. At each iteration, posterior samples
of the top-\(k\) points are shown by blue squares, and the next query (i.e. acquisition
function optimizer) is shown as a pink dot. Intuitively, we want the blue squares to
concentrate their mass on the gold stars in as few queries as possible.

BibTeX

@inproceedings{neiswanger2021bayesian,
title = {Bayesian Algorithm Execution: Estimating Computable Properties of Black-box Functions Using Mutual Information},
author = {Neiswanger, Willie and Wang, Ke Alexander and Ermon, Stefano},
booktitle = {International Conference on Machine Learning},
year = {2021},
organization = {PMLR}
}