by Marco Taboga, PhD

In machine learning, the best predictive performance is often obtained by averaging the forecasts from different models, a process called ensembling or ensemble learning.

Warning: ensembling is a vast topic and we are only going to scratch the surface here.

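Suppose that we have trained $M$ different predictive models $f_1(x), \ldots, f_M(x)$, each of which produces a prediction of the output $y$ from the inputs $x$.

The set of $M$ models is called an ensemble.

The ensemble average is
$$\bar{f}(x) = \frac{1}{M}\sum_{m=1}^{M} f_m(x).$$

It is possible to prove (e.g., Sollich and Krogh 1996) that
$$E\left[\left(y-\bar{f}(x)\right)^2\right] = \frac{1}{M}\sum_{m=1}^{M} E\left[\left(y-f_m(x)\right)^2\right] - \frac{1}{M}\sum_{m=1}^{M} E\left[\left(f_m(x)-\bar{f}(x)\right)^2\right].$$

Note that $E\left[\left(y-\bar{f}(x)\right)^2\right]$ is the mean squared error (MSE) of the ensemble average and $E\left[\left(y-f_m(x)\right)^2\right]$ is the MSE of a single predictive model.

The last term, that is, $\frac{1}{M}\sum_{m=1}^{M} E\left[\left(f_m(x)-\bar{f}(x)\right)^2\right]$, is a measure of the diversity (or disagreement) among the models.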

Since the diversity term is non-negative, the MSE of the ensemble average is never greater than the average MSE of the models in the ensemble, and it is strictly smaller whenever the models disagree. How much smaller? It depends on the diversity of the ensemble: the more diverse the ensemble, the greater the reduction in MSE.
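
As a quick numerical check, the sketch below builds a toy ensemble with NumPy and verifies that the MSE of the ensemble average equals the average MSE of the models minus the average squared deviation of the models from the ensemble average. The simulated data, the noise level and all variable names are assumptions made for illustration; they are not part of the data set used later in this lecture.

```python
import numpy as np

rng = np.random.default_rng(0)

n_obs, n_models = 1000, 5

# Hypothetical target and predictions: each model equals the target
# plus its own independent noise
y = rng.normal(size=n_obs)
preds = y + rng.normal(scale=0.5, size=(n_models, n_obs))

# Ensemble average of the predictions
ensemble_avg = preds.mean(axis=0)

# MSE of the ensemble average
mse_ensemble = np.mean((y - ensemble_avg) ** 2)

# Average MSE of the single models
avg_mse_models = np.mean((y - preds) ** 2)

# Diversity: average squared deviation of the models from the ensemble average
diversity = np.mean((preds - ensemble_avg) ** 2)

# The two printed numbers should coincide up to rounding error
print(mse_ensemble, avg_mse_models - diversity)
```

In this toy setting the identity holds exactly in the sample, not only in expectation, because it is an algebraic property of averages.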



How to exploit the above theoretical result in practice is more of an art than a science.

There is a vast literature on ensembling which we cannot cover in this introductory course.

Here, we provide a simple recipe that can be applied in most scenarios.

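Suppose that we have decided to use a certain algorithm (e.g., boosted trees, as implemented in LightGBM). Then, it is basically a free lunch to use the same algorithm to train different models by randomizing along the following dimensions: the partition of the data into training and validation samples, the values of the hyperparameters, the subset of inputs made available to the algorithm, and the seed of the random number generator.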
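
One way to organize this kind of randomization is to draw a small configuration for each model before training it. The sketch below does so with Python's standard random module. The function name draw_model_config, the dictionary keys and the candidate values are hypothetical choices made for illustration, and the training step itself is left out.

```python
import random

def draw_model_config(j, rng):
    """Draw the randomized settings for the j-th model of the ensemble."""
    return {
        'split_seed': j,  # a different train/validation partition per model
        'seed': j,        # a different seed for the training algorithm
        'learning_rate': rng.choice([0.05, 0.075, 0.10, 0.125, 0.15]),
        'max_depth': rng.choice([2, 3]),
        'feature_fraction': 0.8,  # share of inputs sampled for each tree
    }

rng = random.Random(0)  # dedicated generator, so that the draws are reproducible
configs = [draw_model_config(j, rng) for j in range(5)]
for config in configs:
    print(config)
```

Keeping the draws in a dedicated random.Random instance means that the same ensemble can be regenerated later without interfering with other uses of the global random state.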

In the next lecture, we will also see how to create ensembles by using a smart form of cross-validation called K-fold cross-validation.

Python example

For this example, we use the same artificially-generated data set used in the lecture on boosted trees:

Import the data and use scikit-learn to split into train_val-and-test (80-20)

We import the data and split it into train_val and test.

Subsequently, train_val will be split randomly into train and val in a different manner for each model in the ensemble.

Note that the split is done in such a way that the test set is identical to that used in previous lectures.
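
The idea of re-splitting train_val for every model, while never touching the test set, can be sketched with plain NumPy index arrays. The helper name split_train_val, the sample size of 400 and the 25 per cent validation share are illustrative assumptions; the actual example uses scikit-learn's train_test_split instead.

```python
import numpy as np

def split_train_val(n_obs, val_share, seed):
    """Return index arrays for a random train/validation partition."""
    rng = np.random.default_rng(seed)
    shuffled = rng.permutation(n_obs)
    n_val = int(round(n_obs * val_share))
    return shuffled[n_val:], shuffled[:n_val]

n_train_val = 400  # illustrative size of the train_val sample

# One partition per model: the seed changes, so the partition changes
for j in range(3):
    train_idx, val_idx = split_train_val(n_train_val, 0.25, seed=j)
    print(j, len(train_idx), len(val_idx), val_idx[:5])
```

Because only the indices of train_val are shuffled, the test observations play no role in training or early stopping for any model in the ensemble.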

# Import the packages used to load and manipulate the data
import numpy as np # Numpy is a Matlab-like package for array manipulation and linear algebra
import pandas as pd # Pandas is a data-analysis and table-manipulation tool
import urllib.request # urllib will be used to download the dataset

# Import the function that performs sample splits from scikit-learn
from sklearn.model_selection import train_test_split

# Load the output variable with pandas (download with urllib if not downloaded previously)
remoteAddress = ''
localAddress = './y_artificial.csv'
try:
    y = pd.read_csv(localAddress, header=None)
except FileNotFoundError:
    urllib.request.urlretrieve(remoteAddress, localAddress)
    y = pd.read_csv(localAddress, header=None)
y = y.values # Transform y into a numpy array

# Print some information about the output variable
print('Class and dimension of output variable:')
print(type(y))
print(y.shape)

# Load the input variables with pandas 
remoteAddress = ''
localAddress = './x_artificial.csv'
try:
    x = pd.read_csv(localAddress, header=None)
except FileNotFoundError:
    urllib.request.urlretrieve(remoteAddress, localAddress)
    x = pd.read_csv(localAddress, header=None)
x = x.values

# Print some information about the input variables
print('Class and dimension of input variables:')
print(type(x))
print(x.shape)

# The code below is ugly! Done to have same test set as in previous lectures
x_train, x_val_test, y_train, y_val_test = train_test_split(
    x, y, test_size=0.4, random_state=0)

x_val, x_test, y_val, y_test = train_test_split(
    x_val_test, y_val_test, test_size=0.5, random_state=0)
y_test = np.squeeze(y_test)

x_train_val = np.vstack((x_train, x_val))
y_train_val = np.vstack((y_train, y_val))

# Print the numerosities of the three samples
print('Numerosities of training, validation and test samples:')
print(x_train.shape[0], x_val.shape[0], x_test.shape[0])

The output is:

Class and dimension of output variable:
<class 'numpy.ndarray'>
(500, 1)
Class and dimension of input variables:
<class 'numpy.ndarray'>
(500, 300)
Numerosities of training, validation and test samples:
300 100 100

Create an ensemble of 100 models with LightGBM

Our ensemble comprises 100 different models.

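Differences among models are generated by randomly partitioning train_val into train and val, by randomly drawing the hyperparameters (learning rate, maximum depth, minimum number of observations per leaf, early-stopping rounds), by changing the seed of the algorithm, and by randomly sub-sampling the inputs (through the feature_fraction parameter).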

# Import the LightGBM package
import lightgbm as lgb

# Import model-evaluation metrics
from sklearn.metrics import mean_squared_error, r2_score

# Import random number generator and set seed
import random
random.seed(0) # The value 0 is an assumption; any fixed seed makes the draws reproducible

# Set number of models in the ensemble and model list
n_models = 100
ensemble = []
for j in range(n_models):    
    # Randomly partition the train_val set
    x_train, x_val, y_train, y_val = train_test_split(
        x_train_val, y_train_val, test_size=0.25, random_state=j)
    # Prepare dataset in LightGBM format
    y_train = np.squeeze(y_train)
    y_val = np.squeeze(y_val)
    train_set = lgb.Dataset(x_train, y_train, silent=True)
    valid_set = lgb.Dataset(x_val, y_val, silent=True)
    # Randomly choose hyperparameter values
    learning_rate = random.choice([0.05, 0.075, 0.10, 0.125, 0.15])
    max_depth = random.choice([2, 3])
    min_data_in_leaf = random.choice([5, 10, 15])
    early_stopping_rounds = random.choice([15, 20, 25])
    # Set algorithm parameters
    params = {
        'objective': 'regression',
        'learning_rate': learning_rate,
        'metric': 'mse',
        'nthread': 8,
        'min_data_in_leaf': min_data_in_leaf,
        'max_depth': max_depth,
        'seed': j,
        'feature_fraction': 0.8,
        'verbose': -1
    } # The feature_fraction parameter allows us to randomize over inputs
    # Train the model 
    boosted_tree = lgb.train(
        params = params,
        train_set = train_set,
        valid_sets = valid_set,
        num_boost_round = 10000,
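        # early_stopping_rounds and verbose_eval are accepted by LightGBM versions before 4.0;
        # later versions use the callbacks lgb.early_stopping and lgb.log_evaluation instead
        early_stopping_rounds = early_stopping_rounds,
        verbose_eval = False,
    )
    # Save the model in the ensemble list
    ensemble.append(boosted_tree)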

# Compute ensemble average and MSEs of single models 
mses_single_models = []
y_test_pred_ensemble_avg = 0
for j in range(n_models):
    y_test_pred = ensemble[j].predict(x_test)
    y_test_pred_ensemble_avg += y_test_pred / n_models
    mse = mean_squared_error(y_test, y_test_pred)
    mses_single_models.append(mse)

# Compute average MSE of models and MSE of ensemble average on test set 
print('Average test MSE of models in the ensemble:')
print(np.mean(mses_single_models))
print('Test MSE of ensemble average:')
print(mean_squared_error(y_test, y_test_pred_ensemble_avg))

# Print R squared on test set
print('R squared of ensemble average on test set:')
print(r2_score(y_test, y_test_pred_ensemble_avg))

The output is:

Average test MSE of models in the ensemble:
Test MSE of ensemble average:

R squared of ensemble average on test set:

The test MSE of the ensemble average is substantially lower than the average test MSE of the models in the ensemble. By generating an ensemble, with little effort we achieve a reduction in test MSE larger than 10 per cent.
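
The percentage reduction can be computed directly from the two MSEs. The following sketch shows the calculation on made-up numbers; the function name mse_reduction_pct and the input values are hypothetical and are not the results of this example.

```python
def mse_reduction_pct(avg_mse_models, mse_ensemble_avg):
    """Percentage by which the ensemble-average MSE falls below the average MSE of the models."""
    return 100 * (avg_mse_models - mse_ensemble_avg) / avg_mse_models

# Made-up values, for illustration only
print(mse_reduction_pct(0.50, 0.44))
```

With the variables of the example above, the two arguments would be the mean of mses_single_models and the MSE computed from y_test and y_test_pred_ensemble_avg.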


References

Sollich, P. and Krogh, A. (1996) "Learning with ensembles: How overfitting can be useful", Advances in Neural Information Processing Systems, volume 8, pp. 190-196.

How to cite

Please cite as:

Taboga, Marco (2021). "Ensembling", Lectures on machine learning.
