Gradient Boosting Estimators

Gradient Boosting decision trees for classification and regression.

class pygbm.gradient_boosting.GradientBoostingClassifier(loss='auto', learning_rate=0.1, max_iter=100, max_leaf_nodes=31, max_depth=None, min_samples_leaf=20, l2_regularization=0.0, max_bins=256, scoring=None, validation_split=0.1, n_iter_no_change=5, tol=1e-07, verbose=0, random_state=None)

Scikit-learn compatible Gradient Boosting Tree for classification.

Parameters:

loss ({'auto', 'binary_crossentropy', 'categorical_crossentropy'}, optional(default='auto')) – The loss function to use in the boosting process. ‘binary_crossentropy’ (also known as logistic loss) is used for binary classification and generalizes to ‘categorical_crossentropy’ for multiclass classification. ‘auto’ will automatically choose either loss depending on the nature of the problem.
learning_rate (float, optional(default=0.1)) – The learning rate, also known as shrinkage. This is used as a multiplicative factor for the leaf values. Use 1 for no shrinkage.
max_iter (int, optional(default=100)) – The maximum number of iterations of the boosting process, i.e. the maximum number of trees for binary classification. For multiclass classification, n_classes trees per iteration are built.
max_leaf_nodes (int or None, optional(default=31)) – The maximum number of leaves for each tree. If None, there is no maximum limit.
max_depth (int or None, optional(default=None)) – The maximum depth of each tree. The depth of a tree is the number of edges on the path from the root to the deepest leaf. If None, the depth is not limited.
min_samples_leaf (int, optional(default=20)) – The minimum number of samples per leaf.
l2_regularization (float, optional(default=0)) – The L2 regularization parameter. Use 0 for no regularization.
max_bins (int, optional(default=256)) – The maximum number of bins to use. Before training, each feature of the input array X is binned into at most max_bins bins, which allows for a much faster training stage. Features with a small number of unique values may use fewer than max_bins bins. Must be no larger than 256.
scoring (str or callable or None, optional (default=None)) – Scoring parameter to use for early stopping (see sklearn.metrics for available options). If None, early stopping is checked with respect to the loss value.
validation_split (int or float or None, optional(default=0.1)) – Proportion (or absolute size) of training data to set aside as validation data for early stopping. If None, early stopping is done on the training data.
n_iter_no_change (int or None, optional (default=5)) – Used to determine when to stop early. The fitting process is stopped when none of the last n_iter_no_change scores is better than the (n_iter_no_change - 1)-th-to-last one, up to some tolerance. If None or 0, no early stopping is done.
tol (float or None, optional (default=1e-7)) – The absolute tolerance to use when comparing scores. The higher the tolerance, the more likely training is to stop early: a higher tolerance makes it harder for subsequent iterations to count as an improvement upon the reference score.
verbose (int, optional(default=0)) – The verbosity level. If not zero, print some information about the fitting process.
random_state (int, np.random.RandomState instance or None, optional(default=None)) – Pseudo-random number generator to control the subsampling in the binning process, and the train/validation data split if early stopping is enabled. See scikit-learn glossary.
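
The interaction between n_iter_no_change and tol can be sketched in plain Python. This is only an illustration of the rule as described above, not the library's code: the helper name should_stop is hypothetical, scores are assumed to be "higher is better", and the reference is taken to be the score recorded just before the last n_iter_no_change ones, which is an assumption about the exact index.

```python
def should_stop(scores, n_iter_no_change=5, tol=1e-7):
    """Hypothetical helper: True when none of the last `n_iter_no_change`
    scores beats the score recorded just before them by more than `tol`."""
    if not n_iter_no_change:
        return False  # None or 0 disables early stopping
    if len(scores) < n_iter_no_change + 1:
        return False  # not enough history to compare against yet
    tol = 0.0 if tol is None else tol
    # Assumed reference: the score just before the recent window, plus tol.
    reference = scores[-(n_iter_no_change + 1)] + tol
    recent = scores[-n_iter_no_change:]
    return not any(score > reference for score in recent)


improving = [0.50, 0.60, 0.70, 0.80, 0.90, 0.95]
stalled = [0.50, 0.90, 0.90, 0.90, 0.90, 0.90, 0.90]

print(should_stop(improving))           # False: recent scores beat the reference
print(should_stop(stalled))             # True: no recent score beats 0.90 + tol
print(should_stop(improving, tol=1.0))  # True: a large tol hides every improvement
```

The last call shows why a higher tol makes stopping more likely: the reference score is raised by tol, so fewer iterations qualify as improvements.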

Examples

>>> from sklearn.datasets import load_iris
>>> from pygbm import GradientBoostingClassifier
>>> X, y = load_iris(return_X_y=True)
>>> clf = GradientBoostingClassifier().fit(X, y)
>>> clf.score(X, y)
0.97...

fit(X, y)

Fit the gradient boosting model.

Parameters:

X (array-like, shape=(n_samples, n_features)) – The input samples. If `X.dtype == np.uint8`, the data is assumed to be pre-binned and the prediction methods (`predict`, `predict_proba`) will only accept pre-binned data as well.
y (array-like, shape=(n_samples,)) – Target values.

Returns:	self
Return type:	object
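
To make the notion of pre-binned data more concrete, here is a rough sketch of binning a single feature into at most max_bins integer codes. The helper names find_thresholds and bin_feature are hypothetical, and the rank-based choice of cut points is an assumption for illustration; the library's actual binning strategy may differ.

```python
from bisect import bisect_right


def find_thresholds(values, max_bins=256):
    """Hypothetical helper: pick at most max_bins - 1 cut points for one feature."""
    distinct = sorted(set(values))
    if len(distinct) <= max_bins:
        # Few unique values: one bin per value, cutting at the midpoints.
        return [(a + b) / 2 for a, b in zip(distinct, distinct[1:])]
    # Otherwise cut at evenly spaced ranks of the sorted data (rough quantiles).
    ordered = sorted(values)
    n = len(ordered)
    cuts = {ordered[(i * n) // max_bins] for i in range(1, max_bins)}
    return sorted(cuts)


def bin_feature(values, thresholds):
    """Map each value to its bin index, in the range 0 .. len(thresholds)."""
    return [bisect_right(thresholds, v) for v in values]


feature = [0.1, 0.1, 0.5, 0.9, 0.5]
codes = bin_feature(feature, find_thresholds(feature))
print(codes)  # [0, 0, 1, 2, 1]: three unique values use only three bins

many = list(range(1000))
many_codes = bin_feature(many, find_thresholds(many, max_bins=4))
print(sorted(set(many_codes)))  # [0, 1, 2, 3]
```

Because max_bins is capped at 256, every code fits in an unsigned 8-bit integer, which is consistent with uint8 input being treated as already binned.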

get_params(deep=True)

Get parameters for this estimator.

Parameters:	deep (boolean, optional) – If True, will return the parameters for this estimator and contained subobjects that are estimators.
Returns:	params – Parameter names mapped to their values.
Return type:	mapping of string to any

predict(X)

Predict classes for X.

Parameters:	X (array-like, shape=(n_samples, n_features)) – The input samples. If `X.dtype == np.uint8`, the data is assumed to be pre-binned and the estimator must have been fitted with pre-binned data.
Returns:	y – The predicted classes.
Return type:	array, shape (n_samples,)

predict_proba(X)

Predict class probabilities for X.

Parameters:	X (array-like, shape=(n_samples, n_features)) – The input samples. If `X.dtype == np.uint8`, the data is assumed to be pre-binned and the estimator must have been fitted with pre-binned data.
Returns:	p – The class probabilities of the input samples.
Return type:	array, shape (n_samples, n_classes)
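
As a rough illustration of how predict_proba and predict relate, the sketch below turns raw additive scores into probabilities and then into a class label. It assumes the usual link functions for the two losses named above (the logistic sigmoid for 'binary_crossentropy' and softmax for 'categorical_crossentropy'); the function names are hypothetical and this is not taken from the library's source.

```python
import math


def sigmoid_proba(raw):
    """Binary case: map one raw score to [P(class 0), P(class 1)]."""
    p = 1.0 / (1.0 + math.exp(-raw))
    return [1.0 - p, p]


def softmax_proba(raw_scores):
    """Multiclass case: map one raw score per class to probabilities."""
    m = max(raw_scores)  # subtract the max for numerical stability
    exps = [math.exp(r - m) for r in raw_scores]
    total = sum(exps)
    return [e / total for e in exps]


def predict_class(proba, classes):
    """Return the label whose probability is the largest."""
    best = max(range(len(proba)), key=proba.__getitem__)
    return classes[best]


proba = softmax_proba([1.0, 2.0, 3.0])
print(predict_class(proba, ["setosa", "versicolor", "virginica"]))
print(sigmoid_proba(0.0))  # [0.5, 0.5]
```

Under this assumption, each row returned by predict_proba sums to 1 and predict returns the class with the highest probability in that row.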

score(X, y, sample_weight=None)

Returns the mean accuracy on the given test data and labels.

In multi-label classification, this is the subset accuracy which is a harsh metric since you require for each sample that each label set be correctly predicted.

Parameters:

X (array-like, shape = (n_samples, n_features)) – Test samples.
y (array-like, shape = (n_samples,) or (n_samples, n_outputs)) – True labels for X.
sample_weight (array-like, shape = (n_samples,), optional) – Sample weights.

Returns:	score – Mean accuracy of self.predict(X) with respect to y.
Return type:	float
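
The metric itself is simple enough to sketch directly. The function below is a hypothetical stand-in, not the estimator's method: it takes predictions that have already been computed and returns the (optionally weighted) fraction of exact matches. Representing multi-label targets as tuples is an assumption made so that equality means every label in the set matches.

```python
def mean_accuracy(y_true, y_pred, sample_weight=None):
    """Weighted fraction of samples whose prediction equals the true label."""
    if sample_weight is None:
        sample_weight = [1.0] * len(y_true)
    correct = [float(t == p) for t, p in zip(y_true, y_pred)]
    weighted_hits = sum(w * c for w, c in zip(sample_weight, correct))
    return weighted_hits / sum(sample_weight)


# Single-label case: three of four predictions are right.
print(mean_accuracy([0, 1, 1, 0], [0, 1, 0, 0]))  # 0.75

# Weights change how much each sample counts.
print(mean_accuracy([0, 1, 1, 0], [0, 1, 0, 0], sample_weight=[1, 1, 2, 0]))  # 0.5

# Multi-label case: a sample counts only if its whole label set matches.
print(mean_accuracy([(1, 0), (0, 1)], [(1, 0), (1, 1)]))  # 0.5
```

The last call shows why subset accuracy is harsh: the second sample gets one of its two labels right but still counts as a miss.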

set_params(**params)

Set the parameters of this estimator.

The method works on simple estimators as well as on nested objects (such as pipelines). The latter have parameters of the form <component>__<parameter> so that it’s possible to update each component of a nested object.

Returns:	self
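
The <component>__<parameter> convention can be illustrated with a toy class. Component below is hypothetical and far simpler than scikit-learn's base estimator; it only shows how a double-underscore key might be split and forwarded to a nested object, and why returning self is convenient.

```python
class Component:
    """Toy stand-in for an estimator that stores its parameters in a dict."""

    def __init__(self, **params):
        self.params = dict(params)

    def get_params(self):
        return dict(self.params)

    def set_params(self, **params):
        for key, value in params.items():
            name, sep, sub_key = key.partition("__")
            if sep:
                # "<component>__<parameter>": forward to the nested object.
                self.params[name].set_params(**{sub_key: value})
            else:
                self.params[name] = value
        return self  # returning self allows calls to be chained


clf = Component(learning_rate=0.1, max_iter=100)
pipe = Component(clf=clf)

# Update the nested component through its parent, chaining two calls.
pipe.set_params(clf__learning_rate=0.05).set_params(clf__max_iter=200)
print(clf.get_params())  # {'learning_rate': 0.05, 'max_iter': 200}
```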

class pygbm.gradient_boosting.GradientBoostingRegressor(loss='least_squares', learning_rate=0.1, max_iter=100, max_leaf_nodes=31, max_depth=None, min_samples_leaf=20, l2_regularization=0.0, max_bins=256, scoring=None, validation_split=0.1, n_iter_no_change=5, tol=1e-07, verbose=0, random_state=None)

Scikit-learn compatible Gradient Boosting Tree for regression.