ML之Xgboost: Binary classification on the Pima Indians diabetes dataset with an XGBoost model (7f-CrVa + grid-search parameter tuning)
Contents

Output
Design
Core code
Output
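A run of the core code below prints the best cross-validated negative log-loss together with the learning_rate that achieved it, followed by the mean and standard deviation of the score for each candidate value.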
Design
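In outline, the pipeline (read off the core code below) is:

1. Load the Pima Indians diabetes data: 768 samples, 8 numeric features, and a binary outcome (onset of diabetes within five years).
2. Define an XGBClassifier and a grid of candidate learning_rate values.
3. Build a stratified 10-fold splitter, StratifiedKFold(n_splits=10, shuffle=True, random_state=7), so each fold preserves the positive/negative class ratio.
4. Run GridSearchCV over the grid with scoring="neg_log_loss", fitting and scoring the model on every fold for every candidate.
5. Refit the best configuration on the full dataset and report the best score, the winning learning_rate, and the per-candidate mean and standard deviation.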
Core code
# model, learning_rate, X and Y come from the earlier steps
# (data loading and model definition; see the full sketch below).
param_grid = dict(learning_rate=learning_rate)
# Stratified 10-fold splitter: each fold keeps the 0/1 class ratio.
kfold = StratifiedKFold(n_splits=10, shuffle=True, random_state=7)
# Exhaustive search over the grid, scored by negative log-loss.
grid_search = GridSearchCV(model, param_grid, scoring="neg_log_loss", n_jobs=-1, cv=kfold)
grid_result = grid_search.fit(X, Y)
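On its own, the snippet above omits the data loading and model definition it depends on (X, Y, model, learning_rate). The following is a minimal end-to-end sketch of the same pipeline; the CSV file name, its column layout (eight feature columns followed by the 0/1 outcome), and the candidate learning_rate values are assumptions, not recovered from the original post:

import numpy as np
from xgboost import XGBClassifier
from sklearn.model_selection import GridSearchCV, StratifiedKFold

# Assumed file name and layout: 8 numeric feature columns, then the 0/1 outcome.
dataset = np.loadtxt("pima-indians-diabetes.csv", delimiter=",")
X, Y = dataset[:, 0:8], dataset[:, 8]

model = XGBClassifier()
# Assumed candidate range; substitute the grid you actually want to search.
learning_rate = [0.0001, 0.001, 0.01, 0.1, 0.2, 0.3]
param_grid = dict(learning_rate=learning_rate)

# Stratified 10-fold CV: every fold preserves the diabetic/non-diabetic ratio.
kfold = StratifiedKFold(n_splits=10, shuffle=True, random_state=7)
grid_search = GridSearchCV(model, param_grid, scoring="neg_log_loss",
                           n_jobs=-1, cv=kfold)
grid_result = grid_search.fit(X, Y)

# Report the winner, then the mean/std of neg_log_loss for each candidate.
print("Best: %f using %s" % (grid_result.best_score_, grid_result.best_params_))
for mean, stdev, params in zip(grid_result.cv_results_['mean_test_score'],
                               grid_result.cv_results_['std_test_score'],
                               grid_result.cv_results_['params']):
    print("%f (%f) with: %r" % (mean, stdev, params))

For reference, the scikit-learn (circa 0.19) definitions of the two classes used here follow.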
class GridSearchCV(BaseSearchCV):
    """Exhaustive search over specified parameter values for an estimator.

    Important members are fit, predict.

    GridSearchCV implements a "fit" and a "score" method.
    It also implements "predict", "predict_proba", "decision_function",
    "transform" and "inverse_transform" if they are implemented in the
    estimator used.

    The parameters of the estimator used to apply these methods are optimized
    by cross-validated grid-search over a parameter grid.

    Read more in the :ref:`User Guide <grid_search>`.

    Parameters
    ----------
    estimator : estimator object.
        This is assumed to implement the scikit-learn estimator interface.
        Either estimator needs to provide a ``score`` function,
        or ``scoring`` must be passed.

    param_grid : dict or list of dictionaries
        Dictionary with parameters names (string) as keys and lists of
        parameter settings to try as values, or a list of such
        dictionaries, in which case the grids spanned by each dictionary
        in the list are explored. This enables searching over any sequence
        of parameter settings.

    scoring : string, callable, list/tuple, dict or None, default: None
        A single string (see :ref:`scoring_parameter`) or a callable
        (see :ref:`scoring`) to evaluate the predictions on the test set.

        For evaluating multiple metrics, either give a list of (unique)
        strings or a dict with names as keys and callables as values.

        NOTE that when using custom scorers, each scorer should return a
        single value. Metric functions returning a list/array of values can
        be wrapped into multiple scorers that return one value each.

        See :ref:`multimetric_grid_search` for an example.

        If None, the estimator's default scorer (if available) is used.

    fit_params : dict, optional
        Parameters to pass to the fit method.

        .. deprecated:: 0.19
           ``fit_params`` as a constructor argument was deprecated in version
           0.19 and will be removed in version 0.21. Pass fit parameters to
           the ``fit`` method instead.

    n_jobs : int, default=1
        Number of jobs to run in parallel.

    pre_dispatch : int, or string, optional
        Controls the number of jobs that get dispatched during parallel
        execution. Reducing this number can be useful to avoid an
        explosion of memory consumption when more jobs get dispatched
        than CPUs can process. This parameter can be:

            - None, in which case all the jobs are immediately
              created and spawned. Use this for lightweight and
              fast-running jobs, to avoid delays due to on-demand
              spawning of the jobs

            - An int, giving the exact number of total jobs that are
              spawned

            - A string, giving an expression as a function of n_jobs,
              as in '2*n_jobs'

    iid : boolean, default=True
        If True, the data is assumed to be identically distributed across
        the folds, and the loss minimized is the total loss per sample,
        and not the mean loss across the folds.

    cv : int, cross-validation generator or an iterable, optional
        Determines the cross-validation splitting strategy.
        Possible inputs for cv are:

          - None, to use the default 3-fold cross validation,
          - integer, to specify the number of folds in a `(Stratified)KFold`,
          - An object to be used as a cross-validation generator.
          - An iterable yielding train, test splits.

        For integer/None inputs, if the estimator is a classifier and ``y`` is
        either binary or multiclass, :class:`StratifiedKFold` is used. In all
        other cases, :class:`KFold` is used.

        Refer :ref:`User Guide <cross_validation>` for the various
        cross-validation strategies that can be used here.

    refit : boolean, or string, default=True
        Refit an estimator using the best found parameters on the whole
        dataset.

        For multiple metric evaluation, this needs to be a string denoting
        the scorer that is used to find the best parameters for refitting
        the estimator at the end.

        The refitted estimator is made available at the ``best_estimator_``
        attribute and permits using ``predict`` directly on this
        ``GridSearchCV`` instance.

        Also for multiple metric evaluation, the attributes ``best_index_``,
        ``best_score_`` and ``best_parameters_`` will only be available if
        ``refit`` is set and all of them will be determined w.r.t this
        specific scorer.

        See ``scoring`` parameter to know more about multiple metric
        evaluation.

    verbose : integer
        Controls the verbosity: the higher, the more messages.

    error_score : 'raise' (default) or numeric
        Value to assign to the score if an error occurs in estimator fitting.
        If set to 'raise', the error is raised. If a numeric value is given,
        FitFailedWarning is raised. This parameter does not affect the refit
        step, which will always raise the error.

    return_train_score : boolean, optional
        If ``False``, the ``cv_results_`` attribute will not include training
        scores.

        Current default is ``'warn'``, which behaves as ``True`` in addition
        to raising a warning when a training score is looked up.
        That default will be changed to ``False`` in 0.21.
        Computing training scores is used to get insights on how different
        parameter settings impact the overfitting/underfitting trade-off.
        However computing the scores on the training set can be
        computationally expensive and is not strictly required to select
        the parameters that yield the best generalization performance.

    Examples
    --------
    >>> from sklearn import svm, datasets
    >>> from sklearn.model_selection import GridSearchCV
    >>> iris = datasets.load_iris()
    >>> parameters = {'kernel':('linear', 'rbf'), 'C':[1, 10]}
    >>> svc = svm.SVC()
    >>> clf = GridSearchCV(svc, parameters)
    >>> clf.fit(iris.data, iris.target)
    ...                             # doctest: +NORMALIZE_WHITESPACE +ELLIPSIS
    GridSearchCV(cv=None, error_score=...,
           estimator=SVC(C=1.0, cache_size=..., class_weight=..., coef0=...,
                         decision_function_shape='ovr', degree=..., gamma=...,
                         kernel='rbf', max_iter=-1, probability=False,
                         random_state=None, shrinking=True, tol=...,
                         verbose=False),
           fit_params=None, iid=..., n_jobs=1,
           param_grid=..., pre_dispatch=..., refit=..., return_train_score=...,
           scoring=..., verbose=...)
    >>> sorted(clf.cv_results_.keys())
    ...                             # doctest: +NORMALIZE_WHITESPACE +ELLIPSIS
    ['mean_fit_time', 'mean_score_time', 'mean_test_score',...
     'mean_train_score', 'param_C', 'param_kernel', 'params',...
     'rank_test_score', 'split0_test_score',...
     'split0_train_score', 'split1_test_score', 'split1_train_score',...
     'split2_test_score', 'split2_train_score',...
     'std_fit_time', 'std_score_time', 'std_test_score', 'std_train_score'...]

    Attributes
    ----------
    cv_results_ : dict of numpy (masked) ndarrays
        A dict with keys as column headers and values as columns, that can be
        imported into a pandas ``DataFrame``.

        For instance the below given table

        +------------+-----------+------------+-----------------+---+---------+
        |param_kernel|param_gamma|param_degree|split0_test_score|...|rank_t...|
        +============+===========+============+=================+===+=========+
        |  'poly'    |     --    |      2     |       0.8       |...|    2    |
        +------------+-----------+------------+-----------------+---+---------+
        |  'poly'    |     --    |      3     |       0.7       |...|    4    |
        +------------+-----------+------------+-----------------+---+---------+
        |  'rbf'     |    0.1    |     --     |       0.8       |...|    3    |
        +------------+-----------+------------+-----------------+---+---------+
        |  'rbf'     |    0.2    |     --     |       0.9       |...|    1    |
        +------------+-----------+------------+-----------------+---+---------+

        will be represented by a ``cv_results_`` dict of::

            {
            'param_kernel': masked_array(data = ['poly', 'poly', 'rbf', 'rbf'],
                                         mask = [False False False False]...)
            'param_gamma': masked_array(data = [-- -- 0.1 0.2],
                                        mask = [ True  True False False]...),
            'param_degree': masked_array(data = [2.0 3.0 -- --],
                                         mask = [False False  True  True]...),
            'split0_test_score'  : [0.8, 0.7, 0.8, 0.9],
            'split1_test_score'  : [0.82, 0.5, 0.7, 0.78],
            'mean_test_score'    : [0.81, 0.60, 0.75, 0.82],
            'std_test_score'     : [0.02, 0.01, 0.03, 0.03],
            'rank_test_score'    : [2, 4, 3, 1],
            'split0_train_score' : [0.8, 0.9, 0.7],
            'split1_train_score' : [0.82, 0.5, 0.7],
            'mean_train_score'   : [0.81, 0.7, 0.7],
            'std_train_score'    : [0.03, 0.03, 0.04],
            'mean_fit_time'      : [0.73, 0.63, 0.43, 0.49],
            'std_fit_time'       : [0.01, 0.02, 0.01, 0.01],
            'mean_score_time'    : [0.007, 0.06, 0.04, 0.04],
            'std_score_time'     : [0.001, 0.002, 0.003, 0.005],
            'params'             : [{'kernel': 'poly', 'degree': 2}, ...],
            }

        NOTE

        The key ``'params'`` is used to store a list of parameter
        settings dicts for all the parameter candidates.

        The ``mean_fit_time``, ``std_fit_time``, ``mean_score_time`` and
        ``std_score_time`` are all in seconds.

        For multi-metric evaluation, the scores for all the scorers are
        available in the ``cv_results_`` dict at the keys ending with that
        scorer's name (``'_<scorer_name>'``) instead of ``'_score'`` shown
        above. ('split0_test_precision', 'mean_train_precision' etc.)

    best_estimator_ : estimator or dict
        Estimator that was chosen by the search, i.e. estimator
        which gave highest score (or smallest loss if specified)
        on the left out data. Not available if ``refit=False``.

        See ``refit`` parameter for more information on allowed values.

    best_score_ : float
        Mean cross-validated score of the best_estimator

        For multi-metric evaluation, this is present only if ``refit`` is
        specified.

    best_params_ : dict
        Parameter setting that gave the best results on the hold out data.

        For multi-metric evaluation, this is present only if ``refit`` is
        specified.

    best_index_ : int
        The index (of the ``cv_results_`` arrays) which corresponds to the
        best candidate parameter setting.

        The dict at ``search.cv_results_['params'][search.best_index_]`` gives
        the parameter setting for the best model, that gives the highest
        mean score (``search.best_score_``).

        For multi-metric evaluation, this is present only if ``refit`` is
        specified.

    scorer_ : function or a dict
        Scorer function used on the held out data to choose the best
        parameters for the model.

        For multi-metric evaluation, this attribute holds the validated
        ``scoring`` dict which maps the scorer key to the scorer callable.

    n_splits_ : int
        The number of cross-validation splits (folds/iterations).

    Notes
    -----
    The parameters selected are those that maximize the score of the left out
    data, unless an explicit score is passed in which case it is used instead.

    If `n_jobs` was set to a value higher than one, the data is copied for
    each point in the grid (and not `n_jobs` times). This is done for
    efficiency reasons if individual jobs take very little time, but may
    raise errors if the dataset is large and not enough memory is available.
    A workaround in this case is to set `pre_dispatch`. Then, the memory is
    copied only `pre_dispatch` many times. A reasonable value for
    `pre_dispatch` is `2 * n_jobs`.

    See Also
    --------
    :class:`ParameterGrid`:
        generates all the combinations of a hyperparameter grid.

    :func:`sklearn.model_selection.train_test_split`:
        utility function to split the data into a development set usable
        for fitting a GridSearchCV instance and an evaluation set for
        its final evaluation.

    :func:`sklearn.metrics.make_scorer`:
        Make a scorer from a performance metric or loss function.

    """

    def __init__(self, estimator, param_grid, scoring=None, fit_params=None,
                 n_jobs=1, iid=True, refit=True, cv=None, verbose=0,
                 pre_dispatch='2*n_jobs', error_score='raise',
                 return_train_score="warn"):
        super(GridSearchCV, self).__init__(
            estimator=estimator, scoring=scoring, fit_params=fit_params,
            n_jobs=n_jobs, iid=iid, refit=refit, cv=cv, verbose=verbose,
            pre_dispatch=pre_dispatch, error_score=error_score,
            return_train_score=return_train_score)
        self.param_grid = param_grid
        _check_param_grid(param_grid)

    def _get_param_iterator(self):
        """Return ParameterGrid instance for the given param_grid"""
        return ParameterGrid(self.param_grid)


class StratifiedKFold(_BaseKFold):
    """Stratified K-Folds cross-validator

    Provides train/test indices to split data in train/test sets.

    This cross-validation object is a variation of KFold that returns
    stratified folds. The folds are made by preserving the percentage of
    samples for each class.

    Read more in the :ref:`User Guide <cross_validation>`.

    Parameters
    ----------
    n_splits : int, default=3
        Number of folds. Must be at least 2.

    shuffle : boolean, optional
        Whether to shuffle each stratification of the data before splitting
        into batches.

    random_state : int, RandomState instance or None, optional, default=None
        If int, random_state is the seed used by the random number generator;
        If RandomState instance, random_state is the random number generator;
        If None, the random number generator is the RandomState instance used
        by `np.random`. Used when ``shuffle`` == True.

    Examples
    --------
    >>> from sklearn.model_selection import StratifiedKFold
    >>> X = np.array([[1, 2], [3, 4], [1, 2], [3, 4]])
    >>> y = np.array([0, 0, 1, 1])
    >>> skf = StratifiedKFold(n_splits=2)
    >>> skf.get_n_splits(X, y)
    2
    >>> print(skf)  # doctest: +NORMALIZE_WHITESPACE
    StratifiedKFold(n_splits=2, random_state=None, shuffle=False)
    >>> for train_index, test_index in skf.split(X, y):
    ...    print("TRAIN:", train_index, "TEST:", test_index)
    ...    X_train, X_test = X[train_index], X[test_index]
    ...    y_train, y_test = y[train_index], y[test_index]
    TRAIN: [1 3] TEST: [0 2]
    TRAIN: [0 2] TEST: [1 3]

    Notes
    -----
    All the folds have size ``trunc(n_samples / n_splits)``, the last one has
    the complementary.

    See also
    --------
    RepeatedStratifiedKFold: Repeats Stratified K-Fold n times.
    """

    def __init__(self, n_splits=3, shuffle=False, random_state=None):
        super(StratifiedKFold, self).__init__(n_splits, shuffle, random_state)

    def _make_test_folds(self, X, y=None):
        rng = self.random_state
        y = np.asarray(y)
        type_of_target_y = type_of_target(y)
        allowed_target_types = ('binary', 'multiclass')
        if type_of_target_y not in allowed_target_types:
            raise ValueError(
                'Supported target types are: {}. Got {!r} instead.'.format(
                    allowed_target_types, type_of_target_y))

        y = column_or_1d(y)
        n_samples = y.shape[0]
        unique_y, y_inversed = np.unique(y, return_inverse=True)
        y_counts = np.bincount(y_inversed)
        min_groups = np.min(y_counts)
        if np.all(self.n_splits > y_counts):
            raise ValueError("n_splits=%d cannot be greater than the"
                             " number of members in each class."
                             % (self.n_splits))
        if self.n_splits > min_groups:
            warnings.warn(("The least populated class in y has only %d"
                           " members, which is too few. The minimum"
                           " number of members in any class cannot"
                           " be less than n_splits=%d."
                           % (min_groups, self.n_splits)), Warning)

        # pre-assign each sample to a test fold index using individual KFold
        # splitting strategies for each class so as to respect the balance of
        # classes
        # NOTE: Passing the data corresponding to ith class say X[y==class_i]
        # will break when the data is not 100% stratifiable for all classes.
        # So we pass np.zeroes(max(c, n_splits)) as data to the KFold
        per_cls_cvs = [
            KFold(self.n_splits, shuffle=self.shuffle,
                  random_state=rng).split(np.zeros(max(count, self.n_splits)))
            for count in y_counts]

        test_folds = np.zeros(n_samples, dtype=np.int)
        for test_fold_indices, per_cls_splits in enumerate(zip(*per_cls_cvs)):
            for cls, (_, test_split) in zip(unique_y, per_cls_splits):
                cls_test_folds = test_folds[y == cls]
                # the test split can be too big because we used
                # KFold(...).split(X[:max(c, n_splits)]) when data is not 100%
                # stratifiable for all the classes
                # (we use a warning instead of raising an exception)
                # If this is the case, let's trim it:
                test_split = test_split[test_split < len(cls_test_folds)]
                cls_test_folds[test_split] = test_fold_indices
                test_folds[y == cls] = cls_test_folds

        return test_folds

    def _iter_test_masks(self, X, y=None, groups=None):
        test_folds = self._make_test_folds(X, y)
        for i in range(self.n_splits):
            yield test_folds == i

    def split(self, X, y, groups=None):
        """Generate indices to split data into training and test set.

        Parameters
        ----------
        X : array-like, shape (n_samples, n_features)
            Training data, where n_samples is the number of samples
            and n_features is the number of features.

            Note that providing ``y`` is sufficient to generate the splits
            and hence ``np.zeros(n_samples)`` may be used as a placeholder
            for ``X`` instead of actual training data.

        y : array-like, shape (n_samples,)
            The target variable for supervised learning problems.
            Stratification is done based on the y labels.

        groups : object
            Always ignored, exists for compatibility.

        Returns
        -------
        train : ndarray
            The training set indices for that split.

        test : ndarray
            The testing set indices for that split.

        Notes
        -----
        Randomized CV splitters may return different results for each call of
        split. You can make the results identical by setting ``random_state``
        to an integer.
        """
        y = check_array(y, ensure_2d=False, dtype=None)
        return super(StratifiedKFold, self).split(X, y, groups)
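Tying the reference back to the pipeline: with the default refit=True, the attributes documented above are available directly on the fitted search object. A minimal sketch, assuming grid_result from the core code:

best_model = grid_result.best_estimator_   # XGBClassifier refit on all of X, Y
print(grid_result.best_params_)            # winning setting, e.g. a learning_rate value
print(grid_result.n_splits_)               # 10, from StratifiedKFold(n_splits=10)
print(best_model.predict(X[:5]))           # class predictions from the refit model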