Bayesian Optimization#

Intuition#

Bayesian Optimization is used to find the best value of a function that is:

  • Expensive to evaluate

  • Black-box (unknown formula)

  • Noisy

Instead of trying many points blindly, it builds a model (usually a Gaussian Process) that estimates:

  • What the function might look like (prediction)

  • Where we are uncertain (confidence)

Using this uncertainty, it chooses the next point that is most promising, balancing:

  • Exploration (try uncertain areas)

  • Exploitation (try areas likely to give better results)

This approach finds the optimum in far fewer evaluations than grid search or random search.


Workflow (Simple Step-by-Step)#

Step 1: Initialize

Pick a small number of random points and evaluate the objective function. Example: try 3–5 random hyperparameter values.


Step 2: Fit a Surrogate Model

Train a Gaussian Process (GP) or other surrogate model on observed data.

For any point, the GP gives:

  • predicted value \(\mu(x)\)

  • uncertainty \(\sigma(x)\)


Step 3: Use Acquisition Function

Acquisition function (EI, PI, UCB) decides where to evaluate next.

It answers:

Which point is worth evaluating next based on predicted improvement + uncertainty?


Step 4: Evaluate the True Function

Evaluate the chosen point on the real expensive function (e.g., train model with that hyperparameter).

Add the new (x, f(x)) pair to data.


Step 5: Update Model

Retrain the GP with the updated dataset.


Step 6: Iterate

Repeat Steps 2–5 until:

  • budget is exhausted

  • improvement becomes small

  • convergence criteria met
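The six steps can be sketched end to end in plain NumPy. This is a toy illustration, not a real implementation: the surrogate here is a deliberately crude nearest-neighbor stand-in for a GP (predicted value = value of the nearest observed point, "uncertainty" = distance to the nearest observation), `expensive_f` is a made-up placeholder objective, and a UCB-style acquisition drives the choice:

```python
import numpy as np

def expensive_f(x):
    # Placeholder for an expensive black-box objective (maximization);
    # its true optimum is at x = 2.
    return -(x - 2.0) ** 2

rng = np.random.default_rng(0)

# Step 1: initialize with a few random evaluations
X = list(rng.uniform(0, 5, size=4))
y = [expensive_f(x) for x in X]

candidates = np.linspace(0, 5, 201)
for _ in range(15):
    Xa = np.array(X)
    # Step 2: crude surrogate -- predict with the nearest observed value,
    # and treat distance to the nearest observation as uncertainty
    dist = np.abs(candidates[:, None] - Xa[None, :])
    mu = np.array(y)[dist.argmin(axis=1)]
    sigma = dist.min(axis=1)
    # Step 3: UCB-style acquisition balances exploitation and exploration
    acq = mu + 2.0 * sigma
    x_next = candidates[acq.argmax()]
    # Steps 4-5: evaluate the true function and update the data
    X.append(x_next)
    y.append(expensive_f(x_next))

best_x = X[int(np.argmax(y))]
print(round(float(best_x), 2))  # close to the true optimum x = 2
```

Even with this crude surrogate, the loop concentrates its evaluations near the optimum after a handful of iterations, which is the core promise of Bayesian Optimization.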


One-Line Summary: Bayesian Optimization = smart search that builds a model of the objective function and chooses the next evaluation point using uncertainty + predicted performance, requiring very few total evaluations.

Gaussian Process (GP)#

Gaussian Processes are a probabilistic, non-parametric model used for regression and Bayesian Optimization.

They are powerful because they provide:

  • a prediction

  • a measure of uncertainty

at every point in the input space.


Intuition#

A Gaussian Process is a distribution over functions.

Instead of assuming the function has a fixed form (like linear regression or a neural network), a GP assumes:

Any set of function values follows a joint multivariate Gaussian distribution.

This means:

  • At every point \(x\), the GP predicts a mean \(\mu(x)\)

  • And an uncertainty / variance \(\sigma^2(x)\)

The model gets more confident near observed points and less confident far away.
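The phrase "distribution over functions" can be made concrete by sampling from a zero-mean GP prior. A minimal sketch (the RBF kernel and length scale 1 are assumptions for illustration):

```python
import numpy as np

# Evaluate a zero-mean GP prior with an RBF kernel on a grid,
# then draw a few random functions from it.
x = np.linspace(-3, 3, 50)
K = np.exp(-((x[:, None] - x[None, :]) ** 2) / 2) + 1e-8 * np.eye(50)  # jitter for stability
rng = np.random.default_rng(1)
samples = rng.multivariate_normal(np.zeros(50), K, size=3)
print(samples.shape)  # (3, 50): three smooth random functions on the grid
```

Each row is one "function" drawn from the prior; conditioning on observed data (below) turns this prior into a posterior that pins the functions down near the data.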


Core Idea#

A GP is defined by:

\[ f(x) \sim \mathcal{GP}(m(x), k(x, x')) \]

Where:

  • \(m(x)\): mean function (often 0)

  • \(k(x, x')\): kernel (covariance) function

The kernel encodes similarity between points.

If two points \(x\) and \(x'\) are similar, the GP expects their outputs to be correlated.


Kernel (Covariance Function)#

The kernel is the most important modeling choice.

Common kernels:

Squared Exponential (RBF)#

\[ k(x,x')=\exp\left(-\frac{(x-x')^2}{2l^2}\right) \]
  • Smooth

  • Infinitely differentiable

  • The most widely used
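As a sketch, the RBF kernel is a one-liner in NumPy; it shows how correlation decays with distance (the length scale of 1 is an assumption):

```python
import numpy as np

def rbf_kernel(x1, x2, length_scale=1.0):
    """Squared Exponential (RBF) kernel for scalar inputs."""
    return np.exp(-((x1 - x2) ** 2) / (2 * length_scale ** 2))

print(rbf_kernel(0.0, 0.0))  # 1.0: identical points are perfectly correlated
print(rbf_kernel(0.0, 3.0))  # ~0.011: distant points are nearly uncorrelated
```

Shrinking `length_scale` makes correlation fall off faster, which lets the GP model wigglier functions.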

Matern kernel#

Rougher functions; common in real-world optimization.

Linear kernel#

For linear relationships.


How GP Regression Works#

Given data:

\[ X = [x_1, x_2, ..., x_n] \]
\[ y = [y_1, y_2, ..., y_n] \]

We want to predict \(f(x_*)\) at a new point \(x_*\).

Step 1: GP assumes joint Gaussian:#

\[ \begin{bmatrix} y \\ f(x_*) \end{bmatrix} \sim \mathcal{N}\left(0, \begin{bmatrix} K(X,X) + \sigma^2 I & K(X,x_*) \\ K(x_*,X) & K(x_*,x_*) \end{bmatrix} \right) \]

Step 2: Compute posterior mean and variance#

Mean:

\[ \mu(x_*) = K(x_*, X)[K(X,X) + \sigma^2 I]^{-1} y \]

Variance:

\[ \sigma^2(x_*) = K(x_*,x_*) - K(x_*, X)[K(X,X) + \sigma^2 I]^{-1}K(X,x_*) \]

Interpretation:

  • If \(x_*\) is near training data → low variance

  • Far from training data → high variance
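The two posterior formulas translate directly into NumPy. A minimal sketch on toy data (in practice, prefer `np.linalg.solve` or a Cholesky factorization over an explicit inverse):

```python
import numpy as np

def rbf(a, b, l=1.0):
    # Squared Exponential kernel matrix between 1-D arrays a and b
    return np.exp(-((a[:, None] - b[None, :]) ** 2) / (2 * l ** 2))

# Toy observations of an unknown function
X = np.array([-2.0, -1.0, 0.5, 2.0])
y = np.sin(X)
noise = 1e-6  # the sigma^2 term in the formulas above

Xs = np.array([0.0, 5.0])          # test points: one near the data, one far away
K = rbf(X, X) + noise * np.eye(len(X))   # K(X,X) + sigma^2 I
Ks = rbf(Xs, X)                          # K(x_*, X)
Kss = rbf(Xs, Xs)                        # K(x_*, x_*)

K_inv = np.linalg.inv(K)
mu = Ks @ K_inv @ y                      # posterior mean
var = np.diag(Kss - Ks @ K_inv @ Ks.T)   # posterior variance

print(np.round(mu, 3))   # prediction near the data tracks sin(x); far away it reverts to 0
print(np.round(var, 3))  # small variance near the data, close to 1 far away
```

This reproduces the interpretation above: the test point at 0 sits between observations and gets a confident prediction, while the point at 5 reverts to the prior with variance near 1.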


Why Gaussian Processes Are Useful#

They handle small datasets well#

GPs shine with small datasets (10–100 data points).

Provide predictive uncertainty#

This uncertainty lets Bayesian Optimization know where to explore next.

Flexible#

Can model very complex functions via kernel choice.

Closed-form Bayesian inference#

GP posterior has exact formulas (unlike neural networks).


Simple Visual Intuition#

  • Grey region: uncertainty

  • Blue curve: GP mean prediction

  • Points: observed data

Notice:

  • Certainty is high near data points

  • Uncertainty grows between them


GP in Bayesian Optimization#

GP is the surrogate model.

For every point \(x\):

  • \(\mu(x)\): how good this hyperparameter probably is

  • \(\sigma(x)\): how uncertain we are

Acquisition functions (EI, UCB, PI) use both.


When GPs Fail#

  • High dimensionality (> 30 dimensions)

  • Huge datasets (> 10,000 samples)

  • Very noisy data

  • Categorical-heavy spaces

Then alternatives like Random Forests, TPE, Gradient-Boosted Surrogates, Neural Surrogates, or Bayesian Neural Networks are used.


Short Summary

A Gaussian Process is a probability distribution over functions that gives both prediction and uncertainty, making it ideal for problems where every evaluation is expensive, especially Bayesian Optimization.

Acquisition Function#

In Bayesian Optimization, the acquisition function decides where to sample next based on:

  • The surrogate model’s prediction \(\mu(x)\)

  • The surrogate model’s uncertainty \(\sigma(x)\)

It balances:

  • Exploitation: try points that look good

  • Exploration: try points where the model is uncertain

It directly controls the search efficiency.


Purpose of the Acquisition Function#

Given the Gaussian Process (GP), for every point \(x\), we know:

  • Mean prediction: \(\mu(x)\)

  • Uncertainty: \(\sigma(x)\)

The acquisition function \(a(x)\) selects the next point:

\[ x_{\text{next}} = \arg\max_x a(x) \]

We evaluate the real, expensive function only at the point with the maximum acquisition value.


Why We Need It#

  • GP is only a model (a guess).

  • Acquisition function looks at this model and chooses the most informative next point.

  • Ensures sample efficiency — fewer costly evaluations.


Common Acquisition Functions#

Expected Improvement (EI)#

Most widely used.

\[ EI(x) = (\mu(x) - f_{\text{best}} - \xi)\Phi(Z) + \sigma(x)\phi(Z) \]

Where:

\[ Z = \frac{\mu(x) - f_{\text{best}} - \xi}{\sigma(x)} \]

Meaning:

  • Large if prediction is good (exploitation)

  • Large if uncertainty is high (exploration)

EI balances both naturally.
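A minimal sketch of EI using `scipy.stats.norm` (the candidate values below are made up to show the effect):

```python
import numpy as np
from scipy.stats import norm

def expected_improvement(mu, sigma, f_best, xi=0.01):
    """EI for maximization, following the formula above."""
    sigma = np.maximum(sigma, 1e-12)   # guard against division by zero
    z = (mu - f_best - xi) / sigma
    return (mu - f_best - xi) * norm.cdf(z) + sigma * norm.pdf(z)

# Three candidates: confidently bad, confidently slightly better, very uncertain
mu = np.array([0.9, 1.1, 1.0])
sigma = np.array([0.01, 0.01, 0.5])
f_best = 1.0
print(expected_improvement(mu, sigma, f_best))
```

Note that the third candidate, whose mean only matches the current best, still scores highest because of its large uncertainty: EI rewards both good predictions and unexplored regions.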


Probability of Improvement (PI)#

\[ PI(x) = \Phi\left(\frac{\mu(x) - f_{\text{best}} - \xi}{\sigma(x)}\right) \]

Interpretation:

  • Probability this point will give a better result than the current best.

Problem:

  • Ignores how much improvement we expect.

  • Tends to over-exploit.
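A sketch showing PI's over-exploitation tendency (values made up): a near-certain tiny improvement outranks a risky candidate with far larger potential gain, because PI ignores the size of the improvement:

```python
import numpy as np
from scipy.stats import norm

def probability_of_improvement(mu, sigma, f_best, xi=0.01):
    """PI, following the formula above."""
    return norm.cdf((mu - f_best - xi) / np.maximum(sigma, 1e-12))

# Candidate 0: almost guaranteed +0.02; candidate 1: uncertain but could be much better
mu = np.array([1.02, 1.0])
sigma = np.array([0.01, 0.5])
pi = probability_of_improvement(mu, sigma, f_best=1.0)
print(pi)  # the near-certain tiny improvement scores higher
```

EI, by contrast, would prefer the uncertain candidate here, which is why it is usually the default.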


Upper Confidence Bound (UCB)#

For maximization:

\[ UCB(x) = \mu(x) + \kappa \sigma(x) \]

Where \(\kappa\) controls the exploration-exploitation trade-off.

Meaning:

  • Large \(\sigma(x)\) → more exploration

  • Large \(\mu(x)\) → more exploitation

For minimization, the sign flips and one minimizes the Lower Confidence Bound (next section).

Simple and effective.


Lower Confidence Bound (LCB)#

Used for minimization:

\[ LCB(x) = \mu(x) - \kappa\sigma(x) \]

Minimize the LCB value.
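Both bounds are trivial to compute; a sketch with made-up values:

```python
import numpy as np

def ucb(mu, sigma, kappa=2.0):
    # Upper Confidence Bound: maximize this for maximization problems
    return mu + kappa * sigma

def lcb(mu, sigma, kappa=2.0):
    # Lower Confidence Bound: minimize this for minimization problems
    return mu - kappa * sigma

mu = np.array([1.0, 0.8])
sigma = np.array([0.05, 0.4])
print(ucb(mu, sigma))  # [1.1, 1.6]: the uncertain point wins when maximizing
print(lcb(mu, sigma))  # [0.9, 0.0]: and also when minimizing
```

Raising `kappa` widens the confidence band, pushing the search toward unexplored regions.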


Visual Intuition#

In the plot:

  • Blue line = GP mean prediction

  • Shaded region = uncertainty

  • Black dots = observed data

  • Green curve = acquisition score

The next point is where the green curve is highest.


How Acquisition Looks in the Optimization Loop#

Step-by-step:#

  1. Fit GP on current data

  2. For every candidate x:

    • compute \(\mu(x)\), \(\sigma(x)\)

    • compute acquisition \(a(x)\)

  3. Choose the x with maximum \(a(x)\)

  4. Evaluate real function at x

  5. Update data

  6. Repeat


One-line Definitions

  • Acquisition function = a strategy function that picks the next best point based on predicted performance + uncertainty.

  • It balances exploitation (good predictions) and exploration (high uncertainty).

  • Examples: EI, PI, UCB.

```python
from math import sqrt

from sklearn.datasets import fetch_california_housing
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from skopt import gp_minimize
from skopt.space import Integer

data = fetch_california_housing()
X, y = data.data, data.target
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

def objective(params):
    """Train a random forest with the given hyperparameters; return test RMSE."""
    n_estimators, max_depth = params
    model = RandomForestRegressor(
        n_estimators=int(n_estimators),
        max_depth=int(max_depth),
        random_state=42,
        n_jobs=-1,
    )
    model.fit(X_train, y_train)
    pred = model.predict(X_test)
    return sqrt(mean_squared_error(y_test, pred))

result = gp_minimize(
    func=objective,                      # function to minimize
    dimensions=[
        Integer(50, 300),                # n_estimators
        Integer(3, 20),                  # max_depth
    ],
    n_calls=20,                          # total iterations
    n_initial_points=5,                  # random warmup
    acq_func="EI",                       # Expected Improvement
)
```
```
c:\Users\sangouda\Python3.10\lib\site-packages\skopt\optimizer\optimizer.py:517: UserWarning: The objective has been evaluated at point [np.int64(300), np.int64(20)] before, using random point [np.int64(223), np.int64(16)]
  warnings.warn(
```
```python
print("Best hyperparameters:")
print("n_estimators =", result.x[0])
print("max_depth =", result.x[1])
print("Best RMSE =", result.fun)
```
```
Best hyperparameters:
n_estimators = 242
max_depth = 18
Best RMSE = 0.5039831396983969
```
