Warning

Pygbm’s API and default values are likely to change in future versions, without any deprecation cycle.

Losses

This module contains the loss classes.

Specific losses are used for regression, binary classification or multiclass classification.

class pygbm.loss.BinaryCrossEntropy[source]

Binary cross-entropy loss, for binary classification.

For a given sample x_i, the binary cross-entropy loss is defined as the negative log-likelihood of the model, which can be expressed as:

loss(x_i) = log(1 + exp(raw_pred_i)) - y_true_i * raw_pred_i

See The Elements of Statistical Learning, by Hastie, Tibshirani, Friedman.
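
As an illustration, the loss and its first two derivatives with respect to raw_pred can be sketched in plain numpy (a minimal sketch under the definition above, not pygbm’s actual implementation):

    import numpy as np

    def binary_cross_entropy(y_true, raw_pred):
        # log(1 + exp(raw_pred)) - y_true * raw_pred, computed stably
        return np.logaddexp(0, raw_pred) - y_true * raw_pred

    def sigmoid(raw_pred):
        return 1.0 / (1.0 + np.exp(-raw_pred))

    # Derivatives w.r.t. raw_pred, which play the role of the
    # gradients and hessians below:
    #   gradient = sigmoid(raw_pred) - y_true
    #   hessian  = sigmoid(raw_pred) * (1 - sigmoid(raw_pred))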

get_baseline_prediction(y_train, prediction_dim)[source]

Return initial predictions (before the first iteration).

Parameters:
  • y_train (array-like, shape=(n_samples,)) – The target training values.
  • prediction_dim (int) – The dimension of one prediction: 1 for binary classification and regression, n_classes for multiclass classification.
Returns:

baseline_prediction – The baseline prediction.

Return type:

float or array of shape (1, prediction_dim)
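
For binary cross-entropy, the natural baseline is the prior log-odds of the positive class; a hedged sketch (pygbm’s exact formula and dtype handling may differ):

    import numpy as np

    def log_odds_baseline(y_train, eps=1e-12):
        # Prior probability of the positive class, clipped away from 0 and 1.
        p = np.clip(np.mean(y_train), eps, 1 - eps)
        return np.log(p / (1 - p))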

init_gradients_and_hessians(n_samples, prediction_dim)

Return initial gradients and hessians.

Unless hessians are constant, arrays are initialized with undefined values.

Parameters:
  • n_samples (int) – The number of samples passed to fit()
  • prediction_dim (int) – The dimension of a raw prediction, i.e. the number of trees built at each iteration. Equals 1 for regression and binary classification, or K where K is the number of classes for multiclass classification.
Returns:

  • gradients (array-like, shape=(n_samples * prediction_dim))
  • hessians (array-like, shape=(n_samples * prediction_dim)) – If hessians are constant (e.g. for the LeastSquares loss), shape is (1,) and the array is initialized to 1.

update_gradients_and_hessians(gradients, hessians, y_true, raw_predictions)[source]

Update gradients and hessians arrays, inplace.

The gradients (resp. hessians) are the first (resp. second) order derivatives of the loss for each sample with respect to the predictions of the model, evaluated at iteration i - 1.

Parameters:
  • gradients (array-like, shape=(n_samples * prediction_dim)) – The gradients (treated as OUT array).
  • hessians (array-like, shape=(n_samples * prediction_dim) or (1,)) – The hessians (treated as OUT array).
  • y_true (array-like, shape=(n_samples,)) – The true target values of each training sample.
  • raw_predictions (array-like, shape=(n_samples, prediction_dim)) – The raw_predictions (i.e. values from the trees) of the tree ensemble at iteration i - 1.
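
Putting the three methods together, the call sequence during boosting looks roughly like this (a hypothetical sketch; the real training loop lives in pygbm’s gradient boosting estimators, and exact dtypes may differ):

    import numpy as np
    from pygbm.loss import BinaryCrossEntropy

    loss = BinaryCrossEntropy()
    y_train = np.random.randint(0, 2, size=1000).astype(np.float32)
    n_samples, prediction_dim = y_train.shape[0], 1

    # Start from the baseline prediction, then refine it at each iteration.
    baseline = loss.get_baseline_prediction(y_train, prediction_dim)
    raw_predictions = np.full((n_samples, prediction_dim), baseline,
                              dtype=np.float32)

    gradients, hessians = loss.init_gradients_and_hessians(n_samples,
                                                           prediction_dim)
    loss.update_gradients_and_hessians(gradients, hessians, y_train,
                                       raw_predictions)
    # gradients and hessians are now ready to be handed to the tree grower.
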
class pygbm.loss.CategoricalCrossEntropy[source]

Categorical cross-entropy loss, for multiclass classification.

For a given sample x_i, the categorical cross-entropy loss is defined as the negative log-likelihood of the model and generalizes the binary cross-entropy to more than 2 classes.
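
Concretely, with raw predictions of shape (n_samples, n_classes) and integer class targets, the loss is the negative log of the softmax probability of the true class (an illustrative numpy sketch, not pygbm’s implementation):

    import numpy as np

    def categorical_cross_entropy(y_true, raw_predictions):
        # loss(x_i) = -log softmax(raw_predictions[i])[y_true[i]]
        shifted = raw_predictions - raw_predictions.max(axis=1, keepdims=True)
        log_softmax = shifted - np.log(np.exp(shifted).sum(axis=1,
                                                           keepdims=True))
        return -log_softmax[np.arange(y_true.shape[0]), y_true.astype(int)]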

get_baseline_prediction(y_train, prediction_dim)[source]

Return initial predictions (before the first iteration).

Parameters:
  • y_train (array-like, shape=(n_samples,)) – The target training values.
  • prediction_dim (int) – The dimension of one prediction: 1 for binary classification and regression, n_classes for multiclass classification.
Returns:

baseline_prediction – The baseline prediction.

Return type:

float or array of shape (1, prediction_dim)

init_gradients_and_hessians(n_samples, prediction_dim)

Return initial gradients and hessians.

Unless hessians are constant, arrays are initialized with undefined values.

Parameters:
  • n_samples (int) – The number of samples passed to fit()
  • prediction_dim (int) – The dimension of a raw prediction, i.e. the number of trees built at each iteration. Equals 1 for regression and binary classification, or K where K is the number of classes for multiclass classification.
Returns:

  • gradients (array-like, shape=(n_samples * prediction_dim))
  • hessians (array-like, shape=(n_samples * prediction_dim)) – If hessians are constant (e.g. for the LeastSquares loss), shape is (1,) and the array is initialized to 1.

update_gradients_and_hessians(gradients, hessians, y_true, raw_predictions)[source]

Update gradients and hessians arrays, inplace.

The gradients (resp. hessians) are the first (resp. second) order derivatives of the loss for each sample with respect to the predictions of the model, evaluated at iteration i - 1.

Parameters:
  • gradients (array-like, shape=(n_samples * prediction_dim)) – The gradients (treated as OUT array).
  • hessians (array-like, shape=(n_samples * prediction_dim) or (1,)) – The hessians (treated as OUT array).
  • y_true (array-like, shape=(n_samples,)) – The true target values of each training sample.
  • raw_predictions (array-like, shape=(n_samples, prediction_dim)) – The raw_predictions (i.e. values from the trees) of the tree ensemble at iteration i - 1.
class pygbm.loss.LeastSquares[source]

Least squares loss, for regression.

For a given sample x_i, least squares loss is defined as:

loss(x_i) = (y_true_i - raw_pred_i)**2
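
Its derivatives with respect to raw_pred are straightforward (shown for reference; implementations often fold away the constant factor 2, and pygbm may do the same):

    import numpy as np

    def least_squares_loss(y_true, raw_pred):
        return (y_true - raw_pred) ** 2

    # Derivatives w.r.t. raw_pred:
    #   gradient = 2 * (raw_pred - y_true)
    #   hessian  = 2 (constant, which is why the hessians array
    #                 can have shape (1,))
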
get_baseline_prediction(y_train, prediction_dim)[source]

Return initial predictions (before the first iteration).

Parameters:
  • y_train (array-like, shape=(n_samples,)) – The target training values.
  • prediction_dim (int) – The dimension of one prediction: 1 for binary classification and regression, n_classes for multiclass classification.
Returns:

baseline_prediction – The baseline prediction.

Return type:

float or array of shape (1, prediction_dim)

init_gradients_and_hessians(n_samples, prediction_dim)

Return initial gradients and hessians.

Unless hessians are constant, arrays are initialized with undefined values.

Parameters:
  • n_samples (int) – The number of samples passed to fit()
  • prediction_dim (int) – The dimension of a raw prediction, i.e. the number of trees built at each iteration. Equals 1 for regression and binary classification, or K where K is the number of classes for multiclass classification.
Returns:

  • gradients (array-like, shape=(n_samples * prediction_dim))
  • hessians (array-like, shape=(n_samples * prediction_dim)) – If hessians are constant (e.g. for the LeastSquares loss), shape is (1,) and the array is initialized to 1.

update_gradients_and_hessians(gradients, hessians, y_true, raw_predictions)[source]

Update gradients and hessians arrays, inplace.

The gradients (resp. hessians) are the first (resp. second) order derivatives of the loss for each sample with respect to the predictions of the model, evaluated at iteration i - 1.

Parameters:
  • gradients (array-like, shape=(n_samples * prediction_dim)) – The gradients (treated as OUT array).
  • hessians (array-like, shape=(n_samples * prediction_dim) or (1,)) – The hessians (treated as OUT array).
  • y_true (array-like, shape=(n_samples,)) – The true target values of each training sample.
  • raw_predictions (array-like, shape=(n_samples, prediction_dim)) – The raw_predictions (i.e. values from the trees) of the tree ensemble at iteration i - 1.

Binning

This module contains the BinMapper class.

BinMapper is used for mapping a real-valued dataset into integer-valued bins with equally-spaced thresholds.

class pygbm.binning.BinMapper(max_bins=256, subsample=100000, random_state=None)[source]

Transformer that maps a dataset into integer-valued bins.

The bins are created in a feature-wise fashion, with equally-spaced quantiles.

Large datasets are subsampled, but the feature-wise quantiles should remain stable.

If the number of unique values for a given feature is less than max_bins, then the unique values of this feature are used instead of the quantiles.
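
The strategy can be sketched as follows for a single feature column (a hypothetical illustration of the rule described above, not pygbm’s actual code):

    import numpy as np

    def feature_bin_thresholds(col, max_bins=256):
        uniques = np.unique(col)
        if len(uniques) <= max_bins:
            # Few distinct values: split between consecutive unique values.
            return (uniques[:-1] + uniques[1:]) / 2
        # Otherwise, use thresholds at equally-spaced quantiles.
        percentiles = np.linspace(0, 100, num=max_bins + 1)[1:-1]
        return np.percentile(col, percentiles)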

Parameters:
  • max_bins (int, optional (default=256)) – The maximum number of bins to use. If for a given feature the number of unique values is less than max_bins, then those unique values will be used to compute the bin thresholds, instead of the quantiles.
  • subsample (int or None, optional (default=1e5)) – If n_samples > subsample, then subsample samples will be randomly chosen to compute the quantiles. If None, the whole data is used.
  • random_state (int or numpy.random.RandomState or None, optional (default=None)) – Pseudo-random number generator to control the random sub-sampling. See scikit-learn glossary.
fit(X, y=None)[source]

Fit data X by computing the binning thresholds.

Parameters:X (array-like) – The data to bin
Returns:self
Return type:object
transform(X)[source]

Bin data X.

Parameters:X (array-like) – The data to bin
Returns:X_binned – The binned data
Return type:array-like
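
Typical usage (fit() returns self, so the two calls can be chained):

    import numpy as np
    from pygbm.binning import BinMapper

    rng = np.random.RandomState(0)
    X = rng.randn(10000, 5)

    mapper = BinMapper(max_bins=256, random_state=0)
    X_binned = mapper.fit(X).transform(X)
    # Each column of X_binned now holds integer bin codes for that feature.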

Grower

This module contains the TreeGrower class.

TreeGrower builds a regression tree fitting a Newton-Raphson step, based on the gradients and hessians of the training data.

class pygbm.grower.TreeGrower(X_binned, gradients, hessians, max_leaf_nodes=None, max_depth=None, min_samples_leaf=20, min_gain_to_split=0.0, max_bins=256, n_bins_per_feature=None, l2_regularization=0.0, min_hessian_to_split=0.001, shrinkage=1.0)[source]

Tree grower class used to build a tree.

The tree is fitted to predict the values of a Newton-Raphson step. The splits are considered in a best-first fashion, and the quality of a split is defined in splitting._split_gain.

Parameters:
  • X_binned (array-like of int, shape=(n_samples, n_features)) – The binned input samples. Must be Fortran-aligned.
  • gradients (array-like, shape=(n_samples,)) – The gradients of each training sample. Those are the gradients of the loss w.r.t the predictions, evaluated at iteration i - 1.
  • hessians (array-like, shape=(n_samples,)) – The hessians of each training sample. Those are the hessians of the loss w.r.t the predictions, evaluated at iteration i - 1.
  • max_leaf_nodes (int or None, optional(default=None)) – The maximum number of leaves for each tree. If None, there is no maximum limit.
  • max_depth (int or None, optional(default=None)) – The maximum depth of each tree. The depth of a tree is the number of nodes to go from the root to the deepest leaf.
  • min_samples_leaf (int, optional(default=20)) – The minimum number of samples per leaf.
  • min_gain_to_split (float, optional(default=0.)) – The minimum gain needed to split a node. Splits with lower gain will be ignored.
  • max_bins (int, optional(default=256)) – The maximum number of bins. Used to define the shape of the histograms.
  • n_bins_per_feature (array-like of int or int, optional(default=None)) – The actual number of bins needed for each feature, which is lower or equal to max_bins. If it’s an int, all features are considered to have the same number of bins. If None, all features are considered to have max_bins bins.
  • l2_regularization (float, optional(default=0)) – The L2 regularization parameter.
  • min_hessian_to_split (float, optional(default=1e-3)) – The minimum sum of hessians needed in each node. Splits that result in at least one child having a sum of hessians less than min_hessian_to_split are discarded.
  • shrinkage (float, optional(default=1)) – The shrinkage parameter to apply to the leaves values, also known as learning rate.
can_split_further()[source]

Return True if there are still nodes to split.

grow()[source]

Grow the tree, from root to leaves.

make_predictor(numerical_thresholds=None)[source]

Make a TreePredictor object out of the current tree.

Parameters:numerical_thresholds (array-like of floats, optional (default=None)) – The actual threshold values of each bin, expected to be in sorted increasing order. None if the training data was pre-binned.
Returns:
Return type:A TreePredictor object.
split_next()[source]

Split the node with highest potential gain.

Returns:
  • left (TreeNode) – The resulting left child.
  • right (TreeNode) – The resulting right child.
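
An end-to-end sketch of driving the grower by hand (hedged: the real training loop lives in pygbm’s gradient boosting estimators, and the gradient and hessian values below assume a least-squares objective with constant factors dropped):

    import numpy as np
    from pygbm.binning import BinMapper
    from pygbm.grower import TreeGrower

    rng = np.random.RandomState(0)
    X = rng.randn(1000, 3)
    y = rng.randn(1000).astype(np.float32)

    # TreeGrower expects Fortran-aligned binned data.
    X_binned = np.asfortranarray(BinMapper().fit(X).transform(X))

    # Least-squares gradients at raw_pred = 0; hessians are constant.
    gradients = -y
    hessians = np.ones_like(y)

    grower = TreeGrower(X_binned, gradients, hessians, max_leaf_nodes=31)
    grower.grow()
    predictor = grower.make_predictor()  # a TreePredictor over binned data
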
class pygbm.grower.TreeNode(depth, sample_indices, sum_gradients, sum_hessians, parent=None)[source]

Tree Node class used in TreeGrower.

This isn’t used for prediction purposes, only for training (see TreePredictor).

Parameters:
  • depth (int) – The depth of the node, i.e. its distance from the root
  • sample_indices (array of int) – The indices of the samples at the node
  • sum_gradients (float) – The sum of the gradients of the samples at the node
  • sum_hessians (float) – The sum of the hessians of the samples at the node
  • parent (TreeNode or None, optional(default=None)) – The parent of the node. None for root.
depth

int – The depth of the node, i.e. its distance from the root

sample_indices

array of int – The indices of the samples at the node

sum_gradients

float – The sum of the gradients of the samples at the node

sum_hessians

float – The sum of the hessians of the samples at the node

parent

TreeNode or None, optional(default=None) – The parent of the node. None for root.

split_info

SplitInfo or None – The result of the split evaluation

left_child

TreeNode or None – The left child of the node. None for leaves.

right_child

TreeNode or None – The right child of the node. None for leaves.

value

float or None – The value of the leaf, as computed in finalize_leaf(). None for non-leaf nodes

find_split_time

float – The total time spent computing the histogram and finding the best split at the node.

construction_speed

float – The number of samples at the node divided by find_split_time.

apply_split_time

float – The total time spent actually splitting the node, e.g. splitting samples_indices into left and right child.

hist_subtraction

bool – Whether the subtraction method was used for computing the histograms.

Splitting

This module contains njitted routines and data structures to:

  • Find the best possible split of a node. For a given node, a split is characterized by a feature and a bin.
  • Apply a split to a node, i.e. split the indices of the samples at the node into the newly created left and right children.
pygbm.splitting.find_node_split[source]

For each feature, find the best bin to split on at a given node.

Returns the best split info among all features, and the histograms of all the features. The histograms are computed by scanning the whole data.

Parameters:
  • context (SplittingContext) – The splitting context
  • sample_indices (array of int) – The indices of the samples at the node to split.
Returns:

  • best_split_info (SplitInfo) – The info about the best possible split among all features.
  • histograms (array of HISTOGRAM_DTYPE, shape=(n_features, max_bins)) – The histograms of each feature. A histogram is an array of HISTOGRAM_DTYPE of size max_bins (only n_bins_per_feature[feature] entries are relevant).

pygbm.splitting.find_node_split_subtraction[source]

For each feature, find the best bin to split on at a given node.

Returns the best split info among all features, and the histograms of all the features.

This does the same job as find_node_split() but uses the histograms of the parent and sibling of the node to split. This makes it possible to use the identity histogram(node) = histogram(parent) - histogram(sibling), which is significantly faster than computing the histograms from the data (see the sketch after this entry).

Returns the best SplitInfo among all features, along with all the feature histograms that can later be used to compute the sibling or children histograms by subtraction.

Parameters:
  • context (SplittingContext) – The splitting context
  • sample_indices (array of int) – The indices of the samples at the node to split.
  • parent_histograms (array of HISTOGRAM_DTYPE of shape(n_features, max_bins)) – The histograms of the parent
  • sibling_histograms (array of HISTOGRAM_DTYPE of shape(n_features, max_bins)) – The histograms of the sibling
Returns:

  • best_split_info (SplitInfo) – The info about the best possible split among all features.
  • histograms (array of HISTOGRAM_DTYPE, shape=(n_features, max_bins)) – The histograms of each feature. A histogram is an array of HISTOGRAM_DTYPE of size max_bins (only n_bins_per_feature[feature] entries are relevant).
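
The identity itself is easy to verify with plain numpy histograms (a toy sketch, independent of HISTOGRAM_DTYPE):

    import numpy as np

    rng = np.random.RandomState(0)
    binned_feature = rng.randint(0, 256, size=1000)
    goes_left = rng.rand(1000) < 0.5  # an arbitrary split of the parent

    hist = lambda mask: np.bincount(binned_feature[mask], minlength=256)
    parent = np.bincount(binned_feature, minlength=256)
    left, right = hist(goes_left), hist(~goes_left)

    # histogram(node) = histogram(parent) - histogram(sibling)
    assert np.array_equal(right, parent - left)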

pygbm.splitting.split_indices[source]

Split samples into left and right arrays.

Parameters:
  • context (SplittingContext) – The splitting context
  • split_info (SplitInfo) – The SplitInfo of the node to split
  • sample_indices (array of int) – The indices of the samples at the node to split. This is a view on context.partition, and it is modified inplace by placing the indices of the left child at the beginning, and the indices of the right child at the end.
Returns:

  • left_indices (array of int) – The indices of the samples in the left child. This is a view on context.partition.
  • right_indices (array of int) – The indices of the samples in the right child. This is a view on context.partition.
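
The in-place partition semantics can be illustrated with a toy example (hypothetical: a boolean goes-left rule stands in for a real SplitInfo, and partition stands in for context.partition):

    import numpy as np

    partition = np.arange(10)          # stands in for context.partition
    goes_left = partition % 3 == 0     # hypothetical per-sample split decision

    # Stable partition: left-child indices first, right-child indices last.
    order = np.argsort(~goes_left, kind='stable')
    partition[:] = partition[order]    # modified in place, like sample_indices
    n_left = int(goes_left.sum())
    left_indices, right_indices = partition[:n_left], partition[n_left:]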