Bayesian Optimization#

Intuition#

Bayesian Optimization is used to find the best value of a function that is:

  • Expensive to evaluate

  • Black-box (unknown formula)

  • Noisy

Instead of trying many points blindly, it builds a model (usually a Gaussian Process) that estimates:

  • What the function might look like (prediction)

  • Where we are uncertain (confidence)

Using this uncertainty, it chooses the next point that is most promising, balancing:

  • Exploration (try uncertain areas)

  • Exploitation (try areas likely to give better results)

This approach finds the optimum in far fewer evaluations than grid search or random search.


Workflow (Simple Step-by-Step)#

Step 1: Initialize

Pick a small number of random points and evaluate the objective function. Example: try 3–5 random hyperparameter values.


Step 2: Fit a Surrogate Model

Train a Gaussian Process (GP) or other surrogate model on observed data.

For any point, the GP gives:

  • predicted value \(\mu(x)\)

  • uncertainty \(\sigma(x)\)


Step 3: Use Acquisition Function

Acquisition function (EI, PI, UCB) decides where to evaluate next.

It answers:

Which point is worth evaluating next based on predicted improvement + uncertainty?


Step 4: Evaluate the True Function

Evaluate the chosen point on the real expensive function (e.g., train model with that hyperparameter).

Add the new (x, f(x)) pair to data.


Step 5: Update Model

Retrain the GP with the updated dataset.


Step 6: Iterate

Repeat Steps 2–5 until:

  • budget is exhausted

  • improvement becomes small

  • convergence criteria met
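The six steps can be sketched end to end in plain NumPy. This is a toy illustration, not a real implementation: the surrogate here is a deliberately crude nearest-neighbor stand-in for a GP (predicted value = value of the nearest observed point, "uncertainty" = distance to the nearest observation), `expensive_f` is a made-up placeholder objective, and a UCB-style acquisition drives the choice:

```python
import numpy as np

def expensive_f(x):
    # Placeholder for an expensive black-box objective (maximization);
    # its true optimum is at x = 2.
    return -(x - 2.0) ** 2

rng = np.random.default_rng(0)

# Step 1: initialize with a few random evaluations
X = list(rng.uniform(0, 5, size=4))
y = [expensive_f(x) for x in X]

candidates = np.linspace(0, 5, 201)
for _ in range(15):
    Xa = np.array(X)
    # Step 2: crude surrogate -- predict with the nearest observed value,
    # and treat distance to the nearest observation as uncertainty
    dist = np.abs(candidates[:, None] - Xa[None, :])
    mu = np.array(y)[dist.argmin(axis=1)]
    sigma = dist.min(axis=1)
    # Step 3: UCB-style acquisition balances exploitation and exploration
    acq = mu + 2.0 * sigma
    x_next = candidates[acq.argmax()]
    # Steps 4-5: evaluate the true function and update the data
    X.append(x_next)
    y.append(expensive_f(x_next))

best_x = X[int(np.argmax(y))]
print(round(float(best_x), 2))  # close to the true optimum x = 2
```

Even with this crude surrogate, the loop concentrates its evaluations near the optimum after a handful of iterations, which is the core promise of Bayesian Optimization.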


One-Line Summary: Bayesian Optimization = smart search that builds a model of the objective function and chooses the next evaluation point using uncertainty + predicted performance, requiring very few total evaluations.

Gaussian Process (GP)#

Gaussian Processes are a probabilistic, non-parametric model used for regression and Bayesian Optimization.

They are powerful because they provide:

  • a prediction

  • a measure of uncertainty

at every point in the input space.


Intuition#

A Gaussian Process is a distribution over functions.

Instead of assuming the function has a fixed form (like linear regression or a neural network), a GP assumes:

Any set of function values follows a joint multivariate Gaussian distribution.

This means:

  • At every point \(x\), the GP predicts a mean \(\mu(x)\)

  • And an uncertainty / variance \(\sigma^2(x)\)

The model gets more confident near observed points and less confident far away.
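The phrase "distribution over functions" can be made concrete by sampling from a zero-mean GP prior. A minimal sketch (the RBF kernel and length scale 1 are assumptions for illustration):

```python
import numpy as np

# Evaluate a zero-mean GP prior with an RBF kernel on a grid,
# then draw a few random functions from it.
x = np.linspace(-3, 3, 50)
K = np.exp(-((x[:, None] - x[None, :]) ** 2) / 2) + 1e-8 * np.eye(50)  # jitter for stability
rng = np.random.default_rng(1)
samples = rng.multivariate_normal(np.zeros(50), K, size=3)
print(samples.shape)  # (3, 50): three smooth random functions on the grid
```

Each row is one "function" drawn from the prior; conditioning on observed data (below) turns this prior into a posterior that pins the functions down near the data.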


Core Idea#

A GP is defined by:

\[ f(x) \sim \mathcal{GP}(m(x), k(x, x')) \]

Where:

  • \(m(x)\): mean function (often 0)

  • \(k(x, x')\): kernel (covariance) function

The kernel encodes similarity between points.

If two points \(x\) and \(x'\) are similar, the GP expects their outputs to be correlated.


Kernel (Covariance Function)#

The kernel is the most important modeling choice.

Common kernels:

Squared Exponential (RBF)#

\[ k(x,x')=\exp\left(-\frac{(x-x')^2}{2l^2}\right) \]
  • Smooth

  • Infinitely differentiable

  • The most widely used
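As a sketch, the RBF kernel is a one-liner in NumPy; it shows how correlation decays with distance (the length scale of 1 is an assumption):

```python
import numpy as np

def rbf_kernel(x1, x2, length_scale=1.0):
    """Squared Exponential (RBF) kernel for scalar inputs."""
    return np.exp(-((x1 - x2) ** 2) / (2 * length_scale ** 2))

print(rbf_kernel(0.0, 0.0))  # 1.0: identical points are perfectly correlated
print(rbf_kernel(0.0, 3.0))  # ~0.011: distant points are nearly uncorrelated
```

Shrinking `length_scale` makes correlation fall off faster, which lets the GP model wigglier functions.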

Matern kernel#

Rougher functions; common in real-world optimization.

Linear kernel#

For linear relationships.


How GP Regression Works#

Given data:

\[ X = [x_1, x_2, ..., x_n] \]
\[ y = [y_1, y_2, ..., y_n] \]

We want to predict \(f(x_*)\) at a new point \(x_*\).

Step 1: GP assumes joint Gaussian:#

\[ \begin{bmatrix} y \\ f(x_*) \end{bmatrix} \sim \mathcal{N}\left(0, \begin{bmatrix} K(X,X) + \sigma^2 I & K(X,x_*) \\ K(x_*,X) & K(x_*,x_*) \end{bmatrix} \right) \]

Step 2: Compute posterior mean and variance#

Mean:

\[ \mu(x_*) = K(x_*, X)[K(X,X) + \sigma^2 I]^{-1} y \]

Variance:

\[ \sigma^2(x_*) = K(x_*,x_*) - K(x_*, X)[K(X,X) + \sigma^2 I]^{-1}K(X,x_*) \]

Interpretation:

  • If \(x_*\) is near training data → low variance

  • Far from training data → high variance
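The two posterior formulas translate directly into NumPy. A minimal sketch on toy data (in practice, prefer `np.linalg.solve` or a Cholesky factorization over an explicit inverse):

```python
import numpy as np

def rbf(a, b, l=1.0):
    # Squared Exponential kernel matrix between 1-D arrays a and b
    return np.exp(-((a[:, None] - b[None, :]) ** 2) / (2 * l ** 2))

# Toy observations of an unknown function
X = np.array([-2.0, -1.0, 0.5, 2.0])
y = np.sin(X)
noise = 1e-6  # the sigma^2 term in the formulas above

Xs = np.array([0.0, 5.0])          # test points: one near the data, one far away
K = rbf(X, X) + noise * np.eye(len(X))   # K(X,X) + sigma^2 I
Ks = rbf(Xs, X)                          # K(x_*, X)
Kss = rbf(Xs, Xs)                        # K(x_*, x_*)

K_inv = np.linalg.inv(K)
mu = Ks @ K_inv @ y                      # posterior mean
var = np.diag(Kss - Ks @ K_inv @ Ks.T)   # posterior variance

print(np.round(mu, 3))   # prediction near the data tracks sin(x); far away it reverts to 0
print(np.round(var, 3))  # small variance near the data, close to 1 far away
```

This reproduces the interpretation above: the test point at 0 sits between observations and gets a confident prediction, while the point at 5 reverts to the prior with variance near 1.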


Why Gaussian Processes Are Useful#

They handle small datasets well#

GPs shine with small datasets (10–100 data points).

Provide predictive uncertainty#

This uncertainty lets Bayesian Optimization know where to explore next.

Flexible#

Can model very complex functions via kernel choice.

Closed-form Bayesian inference#

GP posterior has exact formulas (unlike neural networks).


Simple Visual Intuition#

  • Grey region: uncertainty

  • Blue curve: GP mean prediction

  • Points: observed data

Notice:

  • Certainty is high near data points

  • Uncertainty grows between them


GP in Bayesian Optimization#

GP is the surrogate model.

For every point \(x\):

  • \(\mu(x)\): how good this hyperparameter probably is

  • \(\sigma(x)\): how uncertain we are

Acquisition functions (EI, UCB, PI) use both.


When GPs Fail#

  • High dimensionality (> 30 dimensions)

  • Huge datasets (> 10,000 samples)

  • Very noisy data

  • Categorical-heavy spaces

Then alternatives like Random Forests, TPE, Gradient-Boosted Surrogates, Neural Surrogates, or Bayesian Neural Networks are used.


Short Summary

A Gaussian Process is a probability distribution over functions that gives both prediction and uncertainty, making it ideal for problems where every evaluation is expensive, especially Bayesian Optimization.

Acquisition Function#

In Bayesian Optimization, the acquisition function decides where to sample next based on:

  • The surrogate model’s prediction \(\mu(x)\)

  • The surrogate model’s uncertainty \(\sigma(x)\)

It balances:

  • Exploitation: try points that look good

  • Exploration: try points where the model is uncertain

It directly controls the search efficiency.


Purpose of the Acquisition Function#

Given the Gaussian Process (GP), for every point \(x\), we know:

  • Mean prediction: \(\mu(x)\)

  • Uncertainty: \(\sigma(x)\)

The acquisition function \(a(x)\) selects the next point:

\[ x_{\text{next}} = \arg\max_x a(x) \]

We evaluate the real, expensive function only at the point with the maximum acquisition value.


Why We Need It#

  • GP is only a model (a guess).

  • Acquisition function looks at this model and chooses the most informative next point.

  • Ensures sample efficiency — fewer costly evaluations.


Common Acquisition Functions#

Expected Improvement (EI)#

Most widely used.

\[ EI(x) = (\mu(x) - f_{\text{best}} - \xi)\Phi(Z) + \sigma(x)\phi(Z) \]

Where:

\[ Z = \frac{\mu(x) - f_{\text{best}} - \xi}{\sigma(x)} \]

Meaning:

  • Large if prediction is good (exploitation)

  • Large if uncertainty is high (exploration)

EI balances both naturally.
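A minimal sketch of EI using `scipy.stats.norm` (the candidate values below are made up to show the effect):

```python
import numpy as np
from scipy.stats import norm

def expected_improvement(mu, sigma, f_best, xi=0.01):
    """EI for maximization, following the formula above."""
    sigma = np.maximum(sigma, 1e-12)   # guard against division by zero
    z = (mu - f_best - xi) / sigma
    return (mu - f_best - xi) * norm.cdf(z) + sigma * norm.pdf(z)

# Three candidates: confidently bad, confidently slightly better, very uncertain
mu = np.array([0.9, 1.1, 1.0])
sigma = np.array([0.01, 0.01, 0.5])
f_best = 1.0
print(expected_improvement(mu, sigma, f_best))
```

Note that the third candidate, whose mean only matches the current best, still scores highest because of its large uncertainty: EI rewards both good predictions and unexplored regions.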


Probability of Improvement (PI)#

\[ PI(x) = \Phi\left(\frac{\mu(x) - f_{\text{best}} - \xi}{\sigma(x)}\right) \]

Interpretation:

  • Probability this point will give a better result than the current best.

Problem:

  • Ignores how much improvement we expect.

  • Tends to over-exploit.
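A sketch showing PI's over-exploitation tendency (values made up): a near-certain tiny improvement outranks a risky candidate with far larger potential gain, because PI ignores the size of the improvement:

```python
import numpy as np
from scipy.stats import norm

def probability_of_improvement(mu, sigma, f_best, xi=0.01):
    """PI, following the formula above."""
    return norm.cdf((mu - f_best - xi) / np.maximum(sigma, 1e-12))

# Candidate 0: almost guaranteed +0.02; candidate 1: uncertain but could be much better
mu = np.array([1.02, 1.0])
sigma = np.array([0.01, 0.5])
pi = probability_of_improvement(mu, sigma, f_best=1.0)
print(pi)  # the near-certain tiny improvement scores higher
```

EI, by contrast, would prefer the uncertain candidate here, which is why it is usually the default.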


Upper Confidence Bound (UCB)#

For maximization:

\[ UCB(x) = \mu(x) + \kappa \sigma(x) \]

Where \(\kappa\) controls the exploration-exploitation trade-off.

Meaning:

  • Large \(\sigma(x)\) → more exploration

  • Large \(\mu(x)\) → more exploitation

For minimization, the sign flips and one minimizes the Lower Confidence Bound (next section).

Simple and effective.


Lower Confidence Bound (LCB)#

Used for minimization:

\[ LCB(x) = \mu(x) - \kappa\sigma(x) \]

Minimize the LCB value.
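Both bounds are trivial to compute; a sketch with made-up values:

```python
import numpy as np

def ucb(mu, sigma, kappa=2.0):
    # Upper Confidence Bound: maximize this for maximization problems
    return mu + kappa * sigma

def lcb(mu, sigma, kappa=2.0):
    # Lower Confidence Bound: minimize this for minimization problems
    return mu - kappa * sigma

mu = np.array([1.0, 0.8])
sigma = np.array([0.05, 0.4])
print(ucb(mu, sigma))  # [1.1, 1.6]: the uncertain point wins when maximizing
print(lcb(mu, sigma))  # [0.9, 0.0]: and also when minimizing
```

Raising `kappa` widens the confidence band, pushing the search toward unexplored regions.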


Visual Intuition#

In the plot:

  • Blue line = GP mean prediction

  • Shaded region = uncertainty

  • Black dots = observed data

  • Green curve = acquisition score

The next point is where the green curve is highest.


How Acquisition Looks in the Optimization Loop#

Step-by-step:#

  1. Fit GP on current data

  2. For every candidate x:

    • compute \(\mu(x)\), \(\sigma(x)\)

    • compute acquisition \(a(x)\)

  3. Choose the x with maximum \(a(x)\)

  4. Evaluate real function at x

  5. Update data

  6. Repeat


One-line Definitions

  • Acquisition function = a strategy function that picks the next best point based on predicted performance + uncertainty.

  • It balances exploitation (good predictions) and exploration (high uncertainty).

  • Examples: EI, PI, UCB.

```python
from math import sqrt

from sklearn.datasets import fetch_california_housing
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from skopt import gp_minimize
from skopt.space import Integer

data = fetch_california_housing()
X, y = data.data, data.target
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

def objective(params):
    """Train a random forest with the given hyperparameters; return test RMSE."""
    n_estimators, max_depth = params
    model = RandomForestRegressor(
        n_estimators=int(n_estimators),
        max_depth=int(max_depth),
        random_state=42,
        n_jobs=-1,
    )
    model.fit(X_train, y_train)
    pred = model.predict(X_test)
    return sqrt(mean_squared_error(y_test, pred))

result = gp_minimize(
    func=objective,                      # function to minimize
    dimensions=[
        Integer(50, 300),                # n_estimators
        Integer(3, 20),                  # max_depth
    ],
    n_calls=20,                          # total iterations
    n_initial_points=5,                  # random warmup
    acq_func="EI",                       # Expected Improvement
)
```
```
c:\Users\sangouda\Python3.10\lib\site-packages\skopt\optimizer\optimizer.py:517: UserWarning: The objective has been evaluated at point [np.int64(300), np.int64(20)] before, using random point [np.int64(223), np.int64(16)]
  warnings.warn(
```
```python
print("Best hyperparameters:")
print("n_estimators =", result.x[0])
print("max_depth =", result.x[1])
print("Best RMSE =", result.fun)
```
```
Best hyperparameters:
n_estimators = 242
max_depth = 18
Best RMSE = 0.5039831396983969
```
