Bayesian Optimization#
Intuition#
Bayesian Optimization is used to find the best value of a function that is:
Expensive to evaluate
Black-box (unknown formula)
Noisy
Instead of trying many points blindly, it builds a model (usually a Gaussian Process) that estimates:
What the function might look like (prediction)
Where we are uncertain (confidence)
Using this uncertainty, it chooses the next point that is most promising, balancing:
Exploration (try uncertain areas)
Exploitation (try areas likely to give better results)
This approach finds the optimum in far fewer evaluations than grid search or random search.
Workflow (Simple Step-by-Step)#
Step 1: Initialize
Pick a small number of random points and evaluate the objective function. Example: try 3–5 random hyperparameter values.
Step 2: Fit a Surrogate Model
Train a Gaussian Process (GP) or other surrogate model on observed data.
GP gives for any point:
predicted value \(\mu(x)\)
uncertainty \(\sigma(x)\)
Step 3: Use Acquisition Function
Acquisition function (EI, PI, UCB) decides where to evaluate next.
It answers:
Which point is worth evaluating next based on predicted improvement + uncertainty?
Step 4: Evaluate the True Function
Evaluate the chosen point on the real expensive function (e.g., train model with that hyperparameter).
Add the new (x, f(x)) pair to data.
Step 5: Update Model
Retrain the GP with the updated dataset.
Step 6: Iterate
Repeat Steps 2–5 until:
budget is exhausted
improvement becomes small
convergence criteria met
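The whole loop can be sketched with scikit-learn's `GaussianProcessRegressor`. The objective, candidate grid, and the LCB acquisition with \(\kappa = 2\) are illustrative stand-ins, not a production implementation:

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF

def expensive_objective(x):
    # Stand-in for the costly black-box function (illustrative only).
    return np.sin(3 * x) + 0.5 * x

rng = np.random.default_rng(0)
X = rng.uniform(0, 2, size=(4, 1))                 # Step 1: a few random points
y = expensive_objective(X).ravel()

candidates = np.linspace(0, 2, 200).reshape(-1, 1)
for _ in range(10):                                # Step 6: iterate
    gp = GaussianProcessRegressor(kernel=RBF(), normalize_y=True).fit(X, y)  # Step 2
    mu, sigma = gp.predict(candidates, return_std=True)
    lcb = mu - 2.0 * sigma                         # Step 3: LCB acquisition (minimizing)
    x_next = candidates[np.argmin(lcb)].reshape(1, -1)
    y_next = expensive_objective(x_next).ravel()   # Step 4: evaluate the true function
    X = np.vstack([X, x_next])                     # Step 5: update the dataset
    y = np.concatenate([y, y_next])

print("best x:", X[np.argmin(y)], "best f(x):", y.min())
```

Only 14 evaluations of `expensive_objective` are made in total; grid search at the same resolution would need 200.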
One-Line Summary

Bayesian Optimization = a smart search that builds a model of the objective function and chooses the next evaluation point using uncertainty + predicted performance, requiring very few total evaluations.
Gaussian Process (GP)#
Gaussian Processes are a probabilistic, non-parametric model used for regression and Bayesian Optimization.
They are powerful because they provide:
a prediction
a measure of uncertainty
at every point in the input space.
Intuition#
A Gaussian Process is a distribution over functions.
Instead of assuming the function has a fixed parametric form (like linear regression or a neural network), a GP assumes:
Any set of function values follows a joint multivariate Gaussian distribution.
This means:
At every point \(x\), GP predicts a mean \(\mu(x)\)
And an uncertainty / variance \(\sigma^2(x)\)
The model gets more confident near observed points and less confident far away.
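The phrase "distribution over functions" can be made concrete by sampling: fix a grid of inputs, build the covariance matrix the kernel defines, and each draw from the resulting multivariate Gaussian is one plausible function. A small NumPy sketch (the RBF kernel and grid are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(0, 5, 50)                        # evaluation grid
K = np.exp(-0.5 * np.subtract.outer(x, x) ** 2)  # RBF kernel: covariance of f-values
# Each row is one sampled "function" from the zero-mean GP prior on this grid.
samples = rng.multivariate_normal(np.zeros(len(x)), K + 1e-8 * np.eye(len(x)), size=3)
print(samples.shape)  # (3, 50)
```

Plotting the three rows against `x` would show three smooth curves, each a plausible function under this prior.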
Core Idea#
A GP is defined by:

\[ f(x) \sim \mathcal{GP}\big(m(x),\; k(x, x')\big) \]

Where:
\(m(x)\): mean function (often 0)
\(k(x, x')\): kernel (covariance) function
The kernel encodes similarity between points.
If two points \(x\) and \(x'\) are similar, the GP expects their outputs to be correlated.
Kernel (Covariance Function)#
The kernel is the most important modeling choice: it determines what kinds of functions the GP considers likely.
Common kernels:
Squared Exponential (RBF)#
Smooth
Infinitely differentiable
The most widely used
Matern kernel#
Rougher functions; common in real-world optimization.
Linear kernel#
For linear relationships.
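As a sketch, the RBF kernel can be computed directly in NumPy (the helper name and test points are illustrative):

```python
import numpy as np

def rbf_kernel(x1, x2, length_scale=1.0):
    """Squared Exponential (RBF) kernel: k(x, x') = exp(-||x - x'||^2 / (2 l^2))."""
    sq_dist = np.sum((x1[:, None, :] - x2[None, :, :]) ** 2, axis=-1)
    return np.exp(-sq_dist / (2 * length_scale ** 2))

X = np.array([[0.0], [0.1], [3.0]])
K = rbf_kernel(X, X)
print(np.round(K, 3))
# The nearby points (0.0 and 0.1) get a covariance near 1 (strongly correlated);
# the distant point (3.0) gets a covariance near 0.
```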
How GP Regression Works#
Given data \(\mathcal{D} = \{(x_i, y_i)\}_{i=1}^{n}\), we want to predict \(f(x_*)\) at a new point \(x_*\).
Step 1: GP assumes a joint Gaussian#

\[ \begin{bmatrix} \mathbf{y} \\ f_* \end{bmatrix} \sim \mathcal{N}\!\left(\mathbf{0},\; \begin{bmatrix} K & \mathbf{k}_* \\ \mathbf{k}_*^{\top} & k(x_*, x_*) \end{bmatrix}\right) \]

Here \(K\) is the kernel matrix over the training points and \(\mathbf{k}_*\) is the vector of kernel values between the training points and \(x_*\).

Step 2: Compute posterior mean and variance#

Mean:

\[ \mu(x_*) = \mathbf{k}_*^{\top} K^{-1} \mathbf{y} \]

Variance:

\[ \sigma^2(x_*) = k(x_*, x_*) - \mathbf{k}_*^{\top} K^{-1} \mathbf{k}_* \]
Interpretation:
If \(x_*\) is near training data → low variance
Far from training data → high variance
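These two formulas translate almost line-for-line into NumPy. The sketch below assumes a zero-mean GP with an RBF kernel and near-noise-free observations (the data and helper names are illustrative):

```python
import numpy as np

def rbf(a, b, l=1.0):
    # RBF kernel between two sets of 1-D points.
    return np.exp(-np.subtract.outer(a, b) ** 2 / (2 * l ** 2))

X_train = np.array([0.0, 1.0, 2.0])
y_train = np.sin(X_train)

def gp_posterior(x_star, X, y, noise=1e-8):
    K = rbf(X, X) + noise * np.eye(len(X))     # kernel matrix K (with jitter)
    k_star = rbf(X, np.atleast_1d(x_star))     # k_* : train-vs-test covariances
    alpha = np.linalg.solve(K, y)
    mu = k_star.T @ alpha                      # mu(x_*) = k_*^T K^-1 y
    v = np.linalg.solve(K, k_star)
    var = rbf(x_star, x_star) - k_star.T @ v   # sigma^2(x_*) = k(x_*,x_*) - k_*^T K^-1 k_*
    return mu.item(), var.item()

mu_near, var_near = gp_posterior(1.0, X_train, y_train)  # on a training point
mu_far, var_far = gp_posterior(5.0, X_train, y_train)    # far from the data
print(var_near, var_far)  # variance is ~0 near data, near the prior variance far away
```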
Why Gaussian Processes Are Useful#
They handle small datasets well#
GPs shine with small datasets (10–100 data points).
Provide predictive uncertainty#
This uncertainty lets Bayesian Optimization know where to explore next.
Flexible#
Can model very complex functions via kernel choice.
Closed-form Bayesian inference#
GP posterior has exact formulas (unlike neural networks).
Simple Visual Intuition#
Grey region: uncertainty
Blue curve: GP mean prediction
Points: observed data
Notice:
Certainty is high near data points
Uncertainty grows between them
GP in Bayesian Optimization#
GP is the surrogate model.
For every point \(x\):
\(\mu(x)\): how good this hyperparameter probably is
\(\sigma(x)\): how uncertain we are
Acquisition functions (EI, UCB, PI) use both.
When GPs Fail#
High dimensionality (> 30 dimensions)
Huge datasets (> 10,000 samples)
Very noisy data
Categorical-heavy spaces
Then alternatives such as random forests, Tree-structured Parzen Estimators (TPE), gradient-boosted surrogates, neural surrogates, or Bayesian neural networks are used.
Short Summary
A Gaussian Process is a probability distribution over functions that gives both prediction and uncertainty, making it ideal for problems where every evaluation is expensive, especially Bayesian Optimization.
Acquisition Function#
In Bayesian Optimization, the acquisition function decides where to sample next based on:
The surrogate model’s prediction \(\mu(x)\)
The surrogate model’s uncertainty \(\sigma(x)\)
It balances:
Exploitation: try points that look good
Exploration: try points where the model is uncertain
It directly controls the search efficiency.
Purpose of the Acquisition Function#
Given the Gaussian Process (GP), for every point \(x\), we know:
Mean prediction: \(\mu(x)\)
Uncertainty: \(\sigma(x)\)
The acquisition function \(a(x)\) turns these into a single score, and the next sample is the point that maximizes it:

\[ x_{\text{next}} = \arg\max_{x}\, a(x) \]

We evaluate the real, expensive function only at the point with the maximum acquisition value.
Why We Need It#
GP is only a model (a guess).
Acquisition function looks at this model and chooses the most informative next point.
Ensures sample efficiency — fewer costly evaluations.
Common Acquisition Functions#
Expected Improvement (EI)#
Most widely used. For maximization:

\[ \mathrm{EI}(x) = \big(\mu(x) - f(x^+) - \xi\big)\,\Phi(Z) + \sigma(x)\,\phi(Z), \qquad Z = \frac{\mu(x) - f(x^+) - \xi}{\sigma(x)} \]

Where:

\(f(x^+)\): best value observed so far
\(\Phi\), \(\phi\): standard normal CDF and PDF
\(\xi\): small parameter encouraging exploration

Meaning:

Large if the prediction is good (exploitation)
Large if the uncertainty is high (exploration)
EI balances both naturally.
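Under the closed form above (maximization convention), EI is a few lines with `scipy.stats.norm`; the candidate values below are made up to show the exploration term at work:

```python
import numpy as np
from scipy.stats import norm

def expected_improvement(mu, sigma, f_best, xi=0.01):
    """EI for maximization: mu, sigma are GP predictions, f_best the best observed value."""
    sigma = np.maximum(sigma, 1e-12)          # guard against division by zero
    z = (mu - f_best - xi) / sigma
    return (mu - f_best - xi) * norm.cdf(z) + sigma * norm.pdf(z)

mu = np.array([1.00, 1.20, 0.80])      # predicted means at three candidates
sigma = np.array([0.01, 0.30, 1.00])   # predictive uncertainties
ei = expected_improvement(mu, sigma, f_best=1.0)
print(ei)
```

The third candidate has the worst mean but wins on EI because its uncertainty is large — exploration emerging from the formula rather than from an explicit rule.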
Probability of Improvement (PI)#
\[ \mathrm{PI}(x) = \Phi\!\left(\frac{\mu(x) - f(x^+) - \xi}{\sigma(x)}\right) \]

Interpretation:
Probability this point will give a better result than the current best.
Problem:
Ignores how much improvement we expect.
Tends to over-exploit.
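A sketch of PI in the same maximization notation. The made-up example shows the over-exploitation problem: a tiny but near-certain gain outranks a potentially large but uncertain one:

```python
import numpy as np
from scipy.stats import norm

def probability_of_improvement(mu, sigma, f_best, xi=0.01):
    """PI for maximization: P(f(x) > f_best + xi) under the GP posterior."""
    return norm.cdf((mu - f_best - xi) / np.maximum(sigma, 1e-12))

mu = np.array([1.05, 1.50])      # small certain gain vs. large uncertain gain
sigma = np.array([0.01, 2.00])
pi = probability_of_improvement(mu, sigma, f_best=1.0)
print(pi)  # the first candidate wins even though its possible gain is tiny
```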
Upper Confidence Bound (UCB)#
For minimization, it is usually:

\[ a_{\mathrm{UCB}}(x) = -\mu(x) + \kappa\,\sigma(x) \]

For maximization:

\[ a_{\mathrm{UCB}}(x) = \mu(x) + \kappa\,\sigma(x) \]

Where \(\kappa\) controls exploration.
Meaning:
Large \(\sigma(x)\) → more exploration
Low \(\mu(x)\) → more exploitation (when minimizing)
Simple and effective.
Lower Confidence Bound (LCB)#
Used for minimization:

\[ \mathrm{LCB}(x) = \mu(x) - \kappa\,\sigma(x) \]

The next point is the one with the minimum LCB value.
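UCB and LCB are simple enough to write in one line each; the toy values below show how \(\kappa\) shifts the exploration–exploitation balance (helper names and numbers are illustrative):

```python
import numpy as np

def ucb(mu, sigma, kappa=2.0):
    """Upper Confidence Bound (maximization): optimistic score mu + kappa * sigma."""
    return mu + kappa * sigma

def lcb(mu, sigma, kappa=2.0):
    """Lower Confidence Bound (minimization): optimistic score mu - kappa * sigma."""
    return mu - kappa * sigma

mu = np.array([0.5, 0.4])      # predicted means at two candidates
sigma = np.array([0.05, 0.3])  # predictive uncertainties
print(np.argmax(ucb(mu, sigma, kappa=2.0)))  # 1 — large kappa: the uncertain point wins
print(np.argmax(ucb(mu, sigma, kappa=0.1)))  # 0 — small kappa: the best mean wins
```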
Visual Intuition#
In the plot:
Blue line = GP mean prediction
Shaded region = uncertainty
Black dots = observed data
Green curve = acquisition score
The next point is where the green curve is highest.
How Acquisition Looks in the Optimization Loop#
Step-by-step:#
Fit GP on current data
For every candidate x:
compute \(\mu(x)\), \(\sigma(x)\)
compute the acquisition \(a(x)\)
Choose the x with the maximum \(a(x)\)
Evaluate real function at x
Update data
Repeat
One-line Definitions
Acquisition function = a strategy function that picks the next best point based on predicted performance + uncertainty.
It balances exploitation (good predictions) and exploration (high uncertainty).
Examples: EI, PI, UCB.
```python
from sklearn.datasets import fetch_california_housing
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error
from sklearn.ensemble import RandomForestRegressor
from math import sqrt
import numpy as np

data = fetch_california_housing()
X, y = data.data, data.target
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

def objective(params):
    n_estimators, max_depth = params
    model = RandomForestRegressor(
        n_estimators=int(n_estimators),
        max_depth=int(max_depth),
        random_state=42,
        n_jobs=-1,
    )
    model.fit(X_train, y_train)
    pred = model.predict(X_test)
    rmse = sqrt(mean_squared_error(y_test, pred))
    return rmse
```

```python
from skopt import gp_minimize
from skopt.space import Integer

result = gp_minimize(
    func=objective,          # function to minimize
    dimensions=[
        Integer(50, 300),    # n_estimators
        Integer(3, 20),      # max_depth
    ],
    n_calls=20,              # total iterations
    n_initial_points=5,      # random warmup
    acq_func="EI",           # Expected Improvement
)
```
c:\Users\sangouda\Python3.10\lib\site-packages\skopt\optimizer\optimizer.py:517: UserWarning: The objective has been evaluated at point [np.int64(300), np.int64(20)] before, using random point [np.int64(223), np.int64(16)]
  warnings.warn(
```python
print("Best hyperparameters:")
print("n_estimators =", result.x[0])
print("max_depth =", result.x[1])
print("Best RMSE =", result.fun)
```
Best hyperparameters:
n_estimators = 242
max_depth = 18
Best RMSE = 0.5039831396983969
References#

https://distill.pub/2020/bayesian-optimization
https://ekamperi.github.io/machine%20learning/2021/06/11/acquisition-functions.html