Warning
Pygbm’s API and default values are likely to change in future versions, without any deprecation cycle.
Gradient Boosting Estimators
Gradient Boosting decision trees for classification and regression.
class pygbm.gradient_boosting.GradientBoostingClassifier(loss='auto', learning_rate=0.1, max_iter=100, max_leaf_nodes=31, max_depth=None, min_samples_leaf=20, l2_regularization=0.0, max_bins=256, scoring=None, validation_split=0.1, n_iter_no_change=5, tol=1e-07, verbose=0, random_state=None)
Scikit-learn compatible Gradient Boosting Tree for classification.
Parameters: - loss ({'auto', 'binary_crossentropy', 'categorical_crossentropy'}, optional(default='auto')) – The loss function to use in the boosting process. ‘binary_crossentropy’ (also known as logistic loss) is used for binary classification and generalizes to ‘categorical_crossentropy’ for multiclass classification. ‘auto’ will automatically choose either loss depending on the nature of the problem.
- learning_rate (float, optional(default=0.1)) – The learning rate, also known as shrinkage. This is used as a multiplicative factor for the leaf values. Use 1 for no shrinkage.
- max_iter (int, optional(default=100)) – The maximum number of iterations of the boosting process, i.e. the maximum number of trees for binary classification. For multiclass classification, n_classes trees per iteration are built.
- max_leaf_nodes (int or None, optional(default=31)) – The maximum number of leaves for each tree. If None, there is no maximum limit.
- max_depth (int or None, optional(default=None)) – The maximum depth of each tree. The depth of a tree is the number of nodes to go from the root to the deepest leaf.
- min_samples_leaf (int, optional(default=20)) – The minimum number of samples per leaf.
- l2_regularization (float, optional(default=0)) – The L2 regularization parameter. Use 0 for no regularization.
- max_bins (int, optional(default=256)) – The maximum number of bins to use. Before training, each feature of the input array X is binned into at most max_bins bins, which allows for a much faster training stage. Features with a small number of unique values may use fewer than max_bins bins. Must be no larger than 256.
- scoring (str or callable or None, optional (default=None)) – Scoring parameter to use for early stopping (see sklearn.metrics for available options). If None, early stopping is checked with respect to the loss value.
- validation_split (int or float or None, optional(default=0.1)) – Proportion (or absolute size) of training data to set aside as validation data for early stopping. If None, early stopping is done on the training data.
- n_iter_no_change (int or None, optional (default=5)) – Used to determine when to “early stop”. The fitting process is stopped when none of the last n_iter_no_change scores are better than the (n_iter_no_change - 1)th-to-last one, up to some tolerance. If None or 0, no early stopping is done.
- tol (float or None, optional (default=1e-7)) – The absolute tolerance to use when comparing scores. The higher the tolerance, the more likely we are to early stop: higher tolerance means that it will be harder for subsequent iterations to be considered an improvement upon the reference score.
- verbose (int, optional(default=0)) – The verbosity level. If not zero, print some information about the fitting process.
- random_state (int, np.random.RandomState instance or None, optional(default=None)) – Pseudo-random number generator to control the subsampling in the binning process, and the train/validation data split if early stopping is enabled. See scikit-learn glossary.
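The binning step described under max_bins can be pictured with the minimal sketch below. It is not pygbm's implementation: the helper names find_bin_thresholds and bin_feature are hypothetical, and placing thresholds at midpoints between distinct values (or at evenly spaced quantiles when there are too many distinct values) is an assumption made for illustration only.

```python
from bisect import bisect_left


def find_bin_thresholds(values, max_bins=256):
    """Return sorted thresholds that split `values` into at most max_bins bins.

    Hypothetical helper, not part of pygbm.
    """
    if not 2 <= max_bins <= 256:
        raise ValueError("max_bins must be between 2 and 256")
    distinct = sorted(set(values))
    if len(distinct) <= max_bins:
        # Few unique values: one bin per value, thresholds at the midpoints,
        # so the feature uses fewer than max_bins bins.
        return [(a + b) / 2 for a, b in zip(distinct, distinct[1:])]
    # Otherwise use up to max_bins - 1 evenly spaced quantiles as thresholds.
    ordered = sorted(values)
    n = len(ordered)
    return sorted({ordered[(i * n) // max_bins] for i in range(1, max_bins)})


def bin_feature(values, thresholds):
    """Map each value to its bin index, an int in [0, len(thresholds)]."""
    return [bisect_left(thresholds, v) for v in values]


feature = [0.1, 0.5, 0.5, 2.0, 7.3]
thresholds = find_bin_thresholds(feature)
binned = bin_feature(feature, thresholds)
```

Because there are at most max_bins - 1 thresholds and max_bins cannot exceed 256, every bin index fits in an unsigned 8-bit integer, which is consistent with pre-binned data being identified by the np.uint8 dtype.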
Examples
>>> from sklearn.datasets import load_iris
>>> from pygbm import GradientBoostingClassifier
>>> X, y = load_iris(return_X_y=True)
>>> clf = GradientBoostingClassifier().fit(X, y)
>>> clf.score(X, y)
0.97...
fit(X, y)
Fit the gradient boosting model.
Parameters: - X (array-like, shape=(n_samples, n_features)) – The input samples. If X.dtype == np.uint8, the data is assumed to be pre-binned and the prediction methods (predict, predict_proba) will only accept pre-binned data as well.
- y (array-like, shape=(n_samples,)) – Target values.
Returns: self
Return type: object
get_params(deep=True)
Get parameters for this estimator.
Parameters: deep (boolean, optional) – If True, will return the parameters for this estimator and contained subobjects that are estimators.
Returns: params – Parameter names mapped to their values.
Return type: mapping of string to any
predict(X)
Predict classes for X.
Parameters: X (array-like, shape=(n_samples, n_features)) – The input samples. If X.dtype == np.uint8, the data is assumed to be pre-binned and the estimator must have been fitted with pre-binned data.
Returns: y – The predicted classes.
Return type: array, shape (n_samples,)
predict_proba(X)
Predict class probabilities for X.
Parameters: X (array-like, shape=(n_samples, n_features)) – The input samples. If X.dtype == np.uint8, the data is assumed to be pre-binned and the estimator must have been fitted with pre-binned data.
Returns: p – The class probabilities of the input samples.
Return type: array, shape (n_samples, n_classes)
score(X, y, sample_weight=None)
Returns the mean accuracy on the given test data and labels.
In multi-label classification, this is the subset accuracy which is a harsh metric since you require for each sample that each label set be correctly predicted.
Parameters: - X (array-like, shape = (n_samples, n_features)) – Test samples.
- y (array-like, shape = (n_samples) or (n_samples, n_outputs)) – True labels for X.
- sample_weight (array-like, shape = [n_samples], optional) – Sample weights.
Returns: score – Mean accuracy of self.predict(X) wrt. y.
Return type: float
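As a rough illustration of what the mean accuracy returned by score measures, the sketch below computes it in plain Python. The function name mean_accuracy is hypothetical, and the way sample_weight is applied (a weighted fraction of correct predictions) is an assumption about the usual convention rather than a description of the library's code.

```python
def mean_accuracy(y_true, y_pred, sample_weight=None):
    """Weighted fraction of predictions that match the true labels.

    Hypothetical helper for illustration, not part of pygbm.
    """
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    if sample_weight is None:
        # Unweighted case: every sample counts equally.
        sample_weight = [1.0] * len(y_true)
    correct = sum(w for t, p, w in zip(y_true, y_pred, sample_weight) if t == p)
    return correct / sum(sample_weight)


accuracy = mean_accuracy([0, 1, 2, 2], [0, 1, 1, 2])
```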
set_params(**params)
Set the parameters of this estimator.
The method works on simple estimators as well as on nested objects (such as pipelines). The latter have parameters of the form <component>__<parameter> so that it’s possible to update each component of a nested object.
Returns: self
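The <component>__<parameter> naming convention can be illustrated with a small sketch that separates an estimator's own parameters from those addressed to a nested component. The function split_nested_params and the component name "clf" are hypothetical and only show how such keys can be read; this is not how set_params is implemented.

```python
def split_nested_params(params):
    """Separate plain parameters from '<component>__<parameter>' ones.

    Hypothetical helper for illustration only.
    """
    own, nested = {}, {}
    for key, value in params.items():
        # Split on the first double underscore only, so deeper nesting
        # ("a__b__c") is passed on to component "a" as "b__c".
        component, sep, sub_key = key.partition("__")
        if sep:
            nested.setdefault(component, {})[sub_key] = value
        else:
            own[key] = value
    return own, nested


own, nested = split_nested_params(
    {"verbose": 1, "clf__max_iter": 50, "clf__learning_rate": 0.05}
)
```

Here own holds the parameters set on the outer object, while nested groups the remaining ones by the component they should be forwarded to.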
class pygbm.gradient_boosting.GradientBoostingRegressor(loss='least_squares', learning_rate=0.1, max_iter=100, max_leaf_nodes=31, max_depth=None, min_samples_leaf=20, l2_regularization=0.0, max_bins=256, scoring=None, validation_split=0.1, n_iter_no_change=5, tol=1e-07, verbose=0, random_state=None)
Scikit-learn compatible Gradient Boosting Tree for regression.
Parameters: - loss ({'least_squares'}, optional(default='least_squares')) – The loss function to use in the boosting process.
- learning_rate (float, optional(default=0.1)) – The learning rate, also known as shrinkage. This is used as a multiplicative factor for the leaf values. Use 1 for no shrinkage.
- max_iter (int, optional(default=100)) – The maximum number of iterations of the boosting process, i.e. the maximum number of trees.
- max_leaf_nodes (int or None, optional(default=31)) – The maximum number of leaves for each tree. If None, there is no maximum limit.
- max_depth (int or None, optional(default=None)) – The maximum depth of each tree. The depth of a tree is the number of nodes to go from the root to the deepest leaf.
- min_samples_leaf (int, optional(default=20)) – The minimum number of samples per leaf.
- l2_regularization (float, optional(default=0)) – The L2 regularization parameter. Use 0 for no regularization.
- max_bins (int, optional(default=256)) – The maximum number of bins to use. Before training, each feature of the input array X is binned into at most max_bins bins, which allows for a much faster training stage. Features with a small number of unique values may use fewer than max_bins bins. Must be no larger than 256.
- scoring (str or callable or None, optional (default=None)) – Scoring parameter to use for early stopping (see sklearn.metrics for available options). If None, early stopping is checked with respect to the loss value.
- validation_split (int or float or None, optional(default=0.1)) – Proportion (or absolute size) of training data to set aside as validation data for early stopping. If None, early stopping is done on the training data.
- n_iter_no_change (int or None, optional (default=5)) – Used to determine when to “early stop”. The fitting process is stopped when none of the last n_iter_no_change scores are better than the (n_iter_no_change - 1)th-to-last one, up to some tolerance. If None or 0, no early stopping is done.
- tol (float or None, optional (default=1e-7)) – The absolute tolerance to use when comparing scores. The higher the tolerance, the more likely we are to early stop: higher tolerance means that it will be harder for subsequent iterations to be considered an improvement upon the reference score.
- verbose (int, optional (default=0)) – The verbosity level. If not zero, print some information about the fitting process.
- random_state (int, np.random.RandomState instance or None, optional (default=None)) – Pseudo-random number generator to control the subsampling in the binning process, and the train/validation data split if early stopping is enabled. See scikit-learn glossary.
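The interaction between n_iter_no_change and tol can be sketched as follows. This is a simplified illustration, not the library's code: the function name should_stop is hypothetical, scores are assumed to be "higher is better", and the reference score is assumed to be the one recorded immediately before the last n_iter_no_change scores.

```python
def should_stop(scores, n_iter_no_change=5, tol=1e-7):
    """Return True when none of the recent scores beats the reference score.

    `scores` holds one score per boosting iteration, higher meaning better.
    Hypothetical helper for illustration only.
    """
    if not n_iter_no_change:
        return False  # None or 0 disables early stopping
    reference_position = n_iter_no_change + 1
    if len(scores) < reference_position:
        return False  # not enough history yet
    tol = 0.0 if tol is None else tol
    # A recent score only counts as an improvement if it exceeds the
    # reference by more than the tolerance.
    reference = scores[-reference_position] + tol
    recent = scores[-n_iter_no_change:]
    return not any(score > reference for score in recent)


stalled = should_stop([0.6, 0.6, 0.6, 0.6], n_iter_no_change=3)
improving = should_stop([0.5, 0.6, 0.6, 0.6], n_iter_no_change=3)
```

Raising tol raises the bar that recent scores must clear, which is why a higher tolerance makes early stopping more likely.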
Examples
>>> from sklearn.datasets import load_boston
>>> from pygbm import GradientBoostingRegressor
>>> X, y = load_boston(return_X_y=True)
>>> est = GradientBoostingRegressor().fit(X, y)
>>> est.score(X, y)
0.92...
fit(X, y)
Fit the gradient boosting model.
Parameters: - X (array-like, shape=(n_samples, n_features)) – The input samples. If X.dtype == np.uint8, the data is assumed to be pre-binned and the prediction method (predict) will only accept pre-binned data as well.
- y (array-like, shape=(n_samples,)) – Target values.
Returns: self
Return type: object
get_params(deep=True)
Get parameters for this estimator.
Parameters: deep (boolean, optional) – If True, will return the parameters for this estimator and contained subobjects that are estimators.
Returns: params – Parameter names mapped to their values.
Return type: mapping of string to any
predict(X)
Predict values for X.
Parameters: X (array-like, shape=(n_samples, n_features)) – The input samples. If X.dtype == np.uint8, the data is assumed to be pre-binned and the estimator must have been fitted with pre-binned data.
Returns: y – The predicted values.
Return type: array, shape (n_samples,)
score(X, y, sample_weight=None)
Returns the coefficient of determination R^2 of the prediction.
The coefficient R^2 is defined as (1 - u/v), where u is the residual sum of squares ((y_true - y_pred) ** 2).sum() and v is the total sum of squares ((y_true - y_true.mean()) ** 2).sum(). The best possible score is 1.0 and it can be negative (because the model can be arbitrarily worse). A constant model that always predicts the expected value of y, disregarding the input features, would get an R^2 score of 0.0.
Parameters: - X (array-like, shape = (n_samples, n_features)) – Test samples. For some estimators this may be a precomputed kernel matrix instead, shape = (n_samples, n_samples_fitted), where n_samples_fitted is the number of samples used in the fitting for the estimator.
- y (array-like, shape = (n_samples) or (n_samples, n_outputs)) – True values for X.
- sample_weight (array-like, shape = [n_samples], optional) – Sample weights.
Returns: score – R^2 of self.predict(X) wrt. y.
Return type: float
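The definition of R^2 given above can be checked with a short worked example in plain Python. The function name r2 is hypothetical, sample weights are ignored, and the sketch assumes y_true is not constant (otherwise v would be zero).

```python
def r2(y_true, y_pred):
    """Coefficient of determination, computed as 1 - u / v.

    Hypothetical helper for illustration; assumes y_true is not constant.
    """
    mean = sum(y_true) / len(y_true)
    u = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))  # residual sum of squares
    v = sum((t - mean) ** 2 for t in y_true)               # total sum of squares
    return 1 - u / v


y_true = [1.0, 2.0, 3.0, 4.0]
perfect = r2(y_true, [1.0, 2.0, 3.0, 4.0])    # u = 0
constant = r2(y_true, [2.5, 2.5, 2.5, 2.5])   # always predicts the mean, u = v
reversed_fit = r2(y_true, [4.0, 3.0, 2.0, 1.0])  # u = 20, v = 5
```

The three cases correspond to the statements in the description: a perfect fit gives 1.0, a constant prediction of the mean gives 0.0, and a model worse than that constant gives a negative score.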
set_params(**params)
Set the parameters of this estimator.
The method works on simple estimators as well as on nested objects (such as pipelines). The latter have parameters of the form <component>__<parameter> so that it’s possible to update each component of a nested object.
Returns: self