Warning

Pygbm’s API and default values are likely to change in future versions, without any deprecation cycle.

Gradient Boosting Estimators

Gradient Boosting decision trees for classification and regression.

class pygbm.gradient_boosting.GradientBoostingClassifier(loss='auto', learning_rate=0.1, max_iter=100, max_leaf_nodes=31, max_depth=None, min_samples_leaf=20, l2_regularization=0.0, max_bins=256, scoring='neg_log_loss', validation_split=0.1, n_iter_no_change=5, tol=1e-07, verbose=0, random_state=None)

Scikit-learn compatible Gradient Boosting Tree for classification.

Parameters:

loss ({'auto', 'binary_crossentropy', 'categorical_crossentropy'}, optional(default='auto')) – The loss function to use in the boosting process. ‘binary_crossentropy’ (also known as logistic loss) is used for binary classification and generalizes to ‘categorical_crossentropy’ for multiclass classification. ‘auto’ will automatically choose either loss depending on the nature of the problem.
learning_rate (float, optional(default=0.1)) – The learning rate, also known as shrinkage. This is used as a multiplicative factor for the leaf values. Use 1 for no shrinkage.
max_iter (int, optional(default=100)) – The maximum number of iterations of the boosting process, i.e. the maximum number of trees for binary classification. For multiclass classification, n_classes trees per iteration are built.
max_leaf_nodes (int or None, optional(default=31)) – The maximum number of leaves for each tree. If None, there is no maximum limit.
max_depth (int or None, optional(default=None)) – The maximum depth of each tree. The depth of a tree is the number of nodes to go from the root to the deepest leaf.
min_samples_leaf (int, optional(default=20)) – The minimum number of samples per leaf.
l2_regularization (float, optional(default=0)) – The L2 regularization parameter. Use 0 for no regularization.
max_bins (int, optional(default=256)) – The maximum number of bins to use. Before training, each feature of the input array X is binned into at most max_bins bins, which allows for a much faster training stage. Features with a small number of unique values may use fewer than max_bins bins. Must be no larger than 256.
scoring (str or callable or None, optional(default='neg_log_loss')) – Scoring parameter to use for early stopping (see sklearn.metrics for available options). If None, no early stopping is done.
validation_split (int or float or None, optional(default=0.1)) – Proportion (or absolute size) of training data to set aside as validation data for early stopping. If None, early stopping is done on the whole training data.
n_iter_no_change (int, optional(default=5)) – Used to determine when to “early stop”. The fitting process is stopped when none of the last n_iter_no_change scores are better than the (n_iter_no_change - 1)th-to-last one, up to some tolerance.
tol (float or None, optional(default=1e-7)) – The absolute tolerance to use when comparing scores. The higher the tolerance, the more likely we are to early stop: higher tolerance means that it will be harder for subsequent iterations to be considered an improvement upon the reference score.
verbose (int, optional(default=0)) – The verbosity level. If not zero, print some information about the fitting process.
random_state (int, np.random.RandomState instance or None, optional(default=None)) – Pseudo-random number generator to control the subsampling in the binning process, and the train/validation data split if early stopping is enabled. See scikit-learn glossary.
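
The early-stopping rule described for n_iter_no_change and tol can be sketched in plain Python. This is an illustration under stated assumptions, not pygbm's actual code: the helper name should_stop is hypothetical, scores are assumed to be "higher is better", and the reference score is assumed to be the one immediately preceding the last n_iter_no_change scores.

```python
def should_stop(scores, n_iter_no_change=5, tol=1e-7):
    """Return True if none of the recent scores beat the reference score.

    Hypothetical helper for illustration only; scores are "higher is better".
    """
    # Assumption: the reference is the score just before the last
    # n_iter_no_change scores.
    reference_position = n_iter_no_change + 1
    if len(scores) < reference_position:
        return False  # not enough history to decide yet
    tol = 0.0 if tol is None else tol
    # A recent score only counts as an improvement if it exceeds
    # the reference by more than the tolerance.
    reference_score = scores[-reference_position] + tol
    recent_scores = scores[-n_iter_no_change:]
    return not any(score > reference_score for score in recent_scores)


plateau = [0.50, 0.60, 0.60, 0.60, 0.60, 0.60, 0.60]
improving = [0.50, 0.60, 0.70, 0.80, 0.90, 0.95, 0.96]
print(should_stop(plateau))    # True
print(should_stop(improving))  # False
```

In this sketch, raising tol raises the bar that recent scores must clear, which is why a higher tolerance makes early stopping more likely.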

Examples

>>> from sklearn.datasets import load_iris
>>> from pygbm import GradientBoostingClassifier
>>> X, y = load_iris(return_X_y=True)
>>> clf = GradientBoostingClassifier().fit(X, y)
>>> clf.score(X, y)
0.97...

fit(X, y)

Fit the gradient boosting model.

Parameters:
X (array-like, shape=(n_samples, n_features)) – The input samples.
y (array-like, shape=(n_samples,)) – Target values.
Returns:	self
Return type:	object

get_params(deep=True)

Get parameters for this estimator.

Parameters:	deep (boolean, optional) – If True, will return the parameters for this estimator and contained subobjects that are estimators.
Returns:	params – Parameter names mapped to their values.
Return type:	mapping of string to any

predict(X)

Predict classes for X.

Parameters:	X (array-like, shape=(n_samples, n_features)) – The input samples. If `X.dtype == np.uint8`, the data is assumed to be pre-binned.
Returns:	y – The predicted classes.
Return type:	array, shape (n_samples,)

predict_proba(X)

Predict class probabilities for X.

Parameters:	X (array-like, shape=(n_samples, n_features)) – The input samples. If `X.dtype == np.uint8`, the data is assumed to be pre-binned.
Returns:	p – The class probabilities of the input samples.
Return type:	array, shape (n_samples, n_classes)
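
To illustrate how the outputs of predict_proba and predict relate, the sketch below picks, for each sample, the class with the highest probability. It assumes that predicted classes correspond to the most probable class, which this page does not state explicitly; the helper classes_from_proba and the sample values are hypothetical.

```python
def classes_from_proba(proba, classes):
    """Map each row of class probabilities to its most probable class.

    Hypothetical helper: `proba` has shape (n_samples, n_classes) and
    `classes` lists the class labels in column order.
    """
    predicted = []
    for row in proba:
        # Index of the largest probability in this row.
        best = max(range(len(row)), key=row.__getitem__)
        predicted.append(classes[best])
    return predicted


proba = [[0.1, 0.7, 0.2],
         [0.8, 0.1, 0.1]]
print(classes_from_proba(proba, ["a", "b", "c"]))  # ['b', 'a']
```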

score(X, y, sample_weight=None)

Returns the mean accuracy on the given test data and labels.

In multi-label classification, this is the subset accuracy which is a harsh metric since you require for each sample that each label set be correctly predicted.

Parameters:
X (array-like, shape = (n_samples, n_features)) – Test samples.
y (array-like, shape = (n_samples,) or (n_samples, n_outputs)) – True labels for X.
sample_weight (array-like, shape = (n_samples,), optional) – Sample weights.
Returns:	score – Mean accuracy of self.predict(X) with respect to y.
Return type:	float
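
As a rough illustration of the metric, mean accuracy is the (optionally weighted) fraction of samples whose predicted label equals the true label. The function mean_accuracy below is a hypothetical stand-alone sketch for single-label targets, not the estimator's own implementation.

```python
def mean_accuracy(y_true, y_pred, sample_weight=None):
    """Weighted fraction of predictions that match the true labels."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    if sample_weight is None:
        # Unweighted case: every sample counts equally.
        sample_weight = [1.0] * len(y_true)
    correct = sum(
        w for t, p, w in zip(y_true, y_pred, sample_weight) if t == p
    )
    return correct / sum(sample_weight)


y_true = [0, 1, 2, 2]
y_pred = [0, 1, 1, 2]
print(mean_accuracy(y_true, y_pred))                              # 0.75
print(mean_accuracy(y_true, y_pred, sample_weight=[1, 1, 2, 0]))  # 0.5
```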

set_params(**params)

Set the parameters of this estimator.

The method works on simple estimators as well as on nested objects (such as pipelines). The latter have parameters of the form <component>__<parameter> so that it’s possible to update each component of a nested object.

Returns:	self
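
The <component>__<parameter> naming convention can be illustrated with a toy class. ToyEstimator is a hypothetical stand-in written for this sketch; it is not part of pygbm or scikit-learn and only mimics how a double-underscore key could be routed to a nested object.

```python
class ToyEstimator:
    """Hypothetical estimator-like object that stores its parameters in a dict."""

    def __init__(self, **params):
        self.params = dict(params)

    def set_params(self, **params):
        for key, value in params.items():
            # Split "component__parameter" at the first double underscore.
            name, delim, sub_key = key.partition("__")
            if delim:
                # Forward the remaining key to the nested component.
                self.params[name].set_params(**{sub_key: value})
            else:
                self.params[name] = value
        return self  # returning self allows call chaining


inner = ToyEstimator(learning_rate=0.1)
outer = ToyEstimator(clf=inner, verbose=0)
outer.set_params(clf__learning_rate=0.05, verbose=1)
print(inner.params["learning_rate"])  # 0.05
print(outer.params["verbose"])        # 1
```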

class pygbm.gradient_boosting.GradientBoostingRegressor(loss='least_squares', learning_rate=0.1, max_iter=100, max_leaf_nodes=31, max_depth=None, min_samples_leaf=20, l2_regularization=0.0, max_bins=256, scoring='neg_mean_squared_error', validation_split=0.1, n_iter_no_change=5, tol=1e-07, verbose=0, random_state=None)

Scikit-learn compatible Gradient Boosting Tree for regression.