ML with sklearn: The ShuffleSplit and StratifiedShuffleSplit Classes in sklearn Explained
Contents
ShuffleSplit and StratifiedShuffleSplit in sklearn
The ShuffleSplit class
The StratifiedShuffleSplit class
ShuffleSplit and StratifiedShuffleSplit in sklearn
from sklearn.model_selection import ShuffleSplit,StratifiedShuffleSplit
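Both classes shuffle the dataset before splitting it into train/test sets. Shuffling matters because data stored in a meaningful order (for example, sorted by class or by collection time) would otherwise yield unrepresentative train and test sets, and the resulting scores would give a misleading picture of how well the model generalizes. StratifiedShuffleSplit is a merge of StratifiedKFold and ShuffleSplit: it returns stratified randomized folds, made by preserving the percentage of samples for each class.

Each splitter first shuffles the samples randomly and then carves out a train/test pair according to its parameters. n_splits sets how many train/test pairs are generated. Each pair is returned as two arrays of indices and is drawn independently of the others, so the same sample can appear in several pairs. With StratifiedShuffleSplit, every pair keeps the class proportions of the full dataset: if the classes occur in a 2:1 ratio, the train and test parts of every split follow that ratio as closely as the sample counts allow. ShuffleSplit makes no such guarantee.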
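To make the difference concrete, here is a minimal sketch that counts the classes in the training part of each split for both splitters. The toy labels (20 samples of class 0, 10 of class 1) and the helper name train_class_counts are made up for illustration and are not part of sklearn.

```python
import numpy as np
from sklearn.model_selection import ShuffleSplit, StratifiedShuffleSplit

# Hypothetical toy labels: 20 samples of class 0 and 10 of class 1 (a 2:1 ratio).
y = np.array([0] * 20 + [1] * 10)
X = np.zeros((len(y), 1))  # feature values do not influence which indices are drawn


def train_class_counts(splitter):
    """Return [count of class 0, count of class 1] in the train part of each split."""
    return [np.bincount(y[train_idx], minlength=2).tolist()
            for train_idx, _ in splitter.split(X, y)]


plain = ShuffleSplit(n_splits=4, test_size=0.5, random_state=0)
stratified = StratifiedShuffleSplit(n_splits=4, test_size=0.5, random_state=0)

plain_counts = train_class_counts(plain)
stratified_counts = train_class_counts(stratified)

print("ShuffleSplit          :", plain_counts)
print("StratifiedShuffleSplit:", stratified_counts)
```

The stratified counts should be [10, 5] in every split, while the plain ShuffleSplit counts are free to drift away from 2:1 from one split to the next.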
The ShuffleSplit class
cv_split = ShuffleSplit(n_splits=6, train_size=0.7, test_size=0.2)
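As a rough illustration of what the call above produces, the sketch below runs the same configuration on 100 made-up samples and records the size of each part. The data and the fixed random_state=42 are assumptions added for reproducibility. Because train_size and test_size sum to 0.9, about a tenth of the samples is left out of every split.

```python
import numpy as np
from sklearn.model_selection import ShuffleSplit

# 100 hypothetical samples with 2 features each.
X = np.arange(200).reshape(100, 2)

cv_split = ShuffleSplit(n_splits=6, train_size=0.7, test_size=0.2, random_state=42)

sizes = []
for train_idx, test_idx in cv_split.split(X):
    # Within one split, a sample is never in both train and test.
    overlap = np.intersect1d(train_idx, test_idx)
    assert overlap.size == 0
    unused = len(X) - len(train_idx) - len(test_idx)
    sizes.append((len(train_idx), len(test_idx), unused))

# One (n_train, n_test, n_unused) tuple per split.
print(sizes)
```

Every split has the same sizes; only the membership of each part changes between iterations.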
The docstring and source of the class read as follows.

    class ShuffleSplit(BaseShuffleSplit):
        """Random permutation cross-validator

        Yields indices to split data into training and test sets.

        Note: contrary to other cross-validation strategies, random splits
        do not guarantee that all folds will be different, although this is
        still very likely for sizeable datasets.

        Read more in the :ref:`User Guide <cross_validation>`.

        Parameters
        ----------
        n_splits : int, default=10
            Number of re-shuffling & splitting iterations.

        test_size : float or int, default=None
            If float, should be between 0.0 and 1.0 and represent the proportion
            of the dataset to include in the test split. If int, represents the
            absolute number of test samples. If None, the value is set to the
            complement of the train size. If ``train_size`` is also None, it will
            be set to 0.1.

        train_size : float or int, default=None
            If float, should be between 0.0 and 1.0 and represent the
            proportion of the dataset to include in the train split. If
            int, represents the absolute number of train samples. If None,
            the value is automatically set to the complement of the test size.

        random_state : int or RandomState instance, default=None
            Controls the randomness of the training and testing indices produced.
            Pass an int for reproducible output across multiple function calls.
            See :term:`Glossary <random_state>`.

        Examples
        --------
        >>> import numpy as np
        >>> from sklearn.model_selection import ShuffleSplit
        >>> X = np.array([[1, 2], [3, 4], [5, 6], [7, 8], [3, 4], [5, 6]])
        >>> y = np.array([1, 2, 1, 2, 1, 2])
        >>> rs = ShuffleSplit(n_splits=5, test_size=.25, random_state=0)
        >>> rs.get_n_splits(X)
        5
        >>> print(rs)
        ShuffleSplit(n_splits=5, random_state=0, test_size=0.25, train_size=None)
        >>> for train_index, test_index in rs.split(X):
        ...     print("TRAIN:", train_index, "TEST:", test_index)
        TRAIN: [1 3 0 4] TEST: [5 2]
        TRAIN: [4 0 2 5] TEST: [1 3]
        TRAIN: [1 2 4 0] TEST: [3 5]
        TRAIN: [3 4 1 0] TEST: [5 2]
        TRAIN: [3 5 1 0] TEST: [2 4]
        >>> rs = ShuffleSplit(n_splits=5, train_size=0.5, test_size=.25,
        ...                   random_state=0)
        >>> for train_index, test_index in rs.split(X):
        ...     print("TRAIN:", train_index, "TEST:", test_index)
        TRAIN: [1 3 0] TEST: [5 2]
        TRAIN: [4 0 2] TEST: [1 3]
        TRAIN: [1 2 4] TEST: [3 5]
        TRAIN: [3 4 1] TEST: [5 2]
        TRAIN: [3 5 1] TEST: [2 4]
        """
        @_deprecate_positional_args
        def __init__(self, n_splits=10, *, test_size=None, train_size=None,
                     random_state=None):
            super().__init__(n_splits=n_splits, test_size=test_size,
                             train_size=train_size, random_state=random_state)
            self._default_test_size = 0.1

        def _iter_indices(self, X, y=None, groups=None):
            n_samples = _num_samples(X)
            n_train, n_test = _validate_shuffle_split(
                n_samples, self.test_size, self.train_size,
                default_test_size=self._default_test_size)
            rng = check_random_state(self.random_state)
            for i in range(self.n_splits):
                # random partition
                permutation = rng.permutation(n_samples)
                ind_test = permutation[:n_test]
                ind_train = permutation[n_test:n_test + n_train]
                yield ind_train, ind_test

Notes on the parameters:
n_splits: the number of train/test pairs to generate.
test_size: the share (float) or absolute number (int) of samples placed in the test part of each pair.
train_size: the share (float) or absolute number (int) of samples placed in the train part of each pair. If None, it is set to the complement of test_size.
random_state: the seed or state of the pseudo-random number generator that shuffles the samples. Pass an int to get the same splits on every run.
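The core of _iter_indices is small enough to restate in plain NumPy. The sketch below is a simplified stand-in rather than sklearn's code: the function name shuffle_split_indices is hypothetical, sizes are passed as absolute counts, and the size validation is reduced to a single check.

```python
import numpy as np


def shuffle_split_indices(n_samples, n_train, n_test, n_splits, seed=None):
    """Yield (train, test) index arrays by slicing a fresh random permutation."""
    if n_train + n_test > n_samples:
        raise ValueError("n_train + n_test must not exceed n_samples")
    rng = np.random.RandomState(seed)
    for _ in range(n_splits):
        # A new permutation per iteration: splits are independent of each other.
        permutation = rng.permutation(n_samples)
        test = permutation[:n_test]                    # first n_test shuffled indices
        train = permutation[n_test:n_test + n_train]   # the next n_train indices
        yield train, test


splits = list(shuffle_split_indices(n_samples=10, n_train=6, n_test=3,
                                    n_splits=4, seed=0))
for train, test in splits:
    print("TRAIN:", train, "TEST:", test)
```

Because train and test are consecutive slices of one permutation, they cannot overlap within a split, and any indices past n_test + n_train are simply unused in that iteration.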
The StratifiedShuffleSplit class
StratifiedShuffleSplit(n_splits=10, *, test_size=None, train_size=None, random_state=None)
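A splitter like this is usually passed as the cv argument of a model-selection helper. The following sketch assumes the iris dataset and a LogisticRegression model purely for illustration. Neither appears in the original post, and the parameter values are arbitrary.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedShuffleSplit, cross_val_score

X, y = load_iris(return_X_y=True)

# Five independent 80/20 splits, each keeping the class balance of y.
sss = StratifiedShuffleSplit(n_splits=5, test_size=0.2, random_state=0)
model = LogisticRegression(max_iter=1000)

# One accuracy score per split.
scores = cross_val_score(model, X, y, cv=sss)
print("mean accuracy: %.3f (std %.3f)" % (scores.mean(), scores.std()))
```

Compared with a single train/test split, repeating the split several times gives a spread of scores, which helps judge how sensitive the estimate is to the particular partition.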
The docstring and source of the class read as follows.

    class StratifiedShuffleSplit(BaseShuffleSplit):
        """Stratified ShuffleSplit cross-validator

        Provides train/test indices to split data in train/test sets.

        This cross-validation object is a merge of StratifiedKFold and
        ShuffleSplit, which returns stratified randomized folds. The folds
        are made by preserving the percentage of samples for each class.

        Note: like the ShuffleSplit strategy, stratified random splits
        do not guarantee that all folds will be different, although this is
        still very likely for sizeable datasets.

        Read more in the :ref:`User Guide <cross_validation>`.

        Parameters
        ----------
        n_splits : int, default=10
            Number of re-shuffling & splitting iterations.

        test_size : float or int, default=None
            If float, should be between 0.0 and 1.0 and represent the proportion
            of the dataset to include in the test split. If int, represents the
            absolute number of test samples. If None, the value is set to the
            complement of the train size. If ``train_size`` is also None, it will
            be set to 0.1.

        train_size : float or int, default=None
            If float, should be between 0.0 and 1.0 and represent the
            proportion of the dataset to include in the train split. If
            int, represents the absolute number of train samples. If None,
            the value is automatically set to the complement of the test size.

        random_state : int or RandomState instance, default=None
            Controls the randomness of the training and testing indices produced.
            Pass an int for reproducible output across multiple function calls.
            See :term:`Glossary <random_state>`.

        Examples
        --------
        >>> import numpy as np
        >>> from sklearn.model_selection import StratifiedShuffleSplit
        >>> X = np.array([[1, 2], [3, 4], [1, 2], [3, 4], [1, 2], [3, 4]])
        >>> y = np.array([0, 0, 0, 1, 1, 1])
        >>> sss = StratifiedShuffleSplit(n_splits=5, test_size=0.5, random_state=0)
        >>> sss.get_n_splits(X, y)
        5
        >>> print(sss)
        StratifiedShuffleSplit(n_splits=5, random_state=0, ...)
        >>> for train_index, test_index in sss.split(X, y):
        ...     print("TRAIN:", train_index, "TEST:", test_index)
        ...     X_train, X_test = X[train_index], X[test_index]
        ...     y_train, y_test = y[train_index], y[test_index]
        TRAIN: [5 2 3] TEST: [4 1 0]
        TRAIN: [5 1 4] TEST: [0 2 3]
        TRAIN: [5 0 2] TEST: [4 3 1]
        TRAIN: [4 1 0] TEST: [2 3 5]
        TRAIN: [0 5 1] TEST: [3 4 2]
        """
        @_deprecate_positional_args
        def __init__(self, n_splits=10, *, test_size=None, train_size=None,
                     random_state=None):
            super().__init__(n_splits=n_splits, test_size=test_size,
                             train_size=train_size, random_state=random_state)
            self._default_test_size = 0.1

        def _iter_indices(self, X, y, groups=None):
            n_samples = _num_samples(X)
            y = check_array(y, ensure_2d=False, dtype=None)
            n_train, n_test = _validate_shuffle_split(
                n_samples, self.test_size, self.train_size,
                default_test_size=self._default_test_size)

            if y.ndim == 2:
                # for multi-label y, map each distinct row to a string repr
                # using join because str(row) uses an ellipsis if len(row) > 1000
                y = np.array([' '.join(row.astype('str')) for row in y])

            classes, y_indices = np.unique(y, return_inverse=True)
            n_classes = classes.shape[0]

            class_counts = np.bincount(y_indices)
            if np.min(class_counts) < 2:
                raise ValueError("The least populated class in y has only 1"
                                 " member, which is too few. The minimum"
                                 " number of groups for any class cannot"
                                 " be less than 2.")

            if n_train < n_classes:
                raise ValueError(
                    'The train_size = %d should be greater or '
                    'equal to the number of classes = %d' %
                    (n_train, n_classes))
            if n_test < n_classes:
                raise ValueError('The test_size = %d should be greater or '
                                 'equal to the number of classes = %d' %
                                 (n_test, n_classes))

            # Find the sorted list of instances for each class:
            # (np.unique above performs a sort, so code is O(n logn) already)
            class_indices = np.split(np.argsort(y_indices, kind='mergesort'),
                                     np.cumsum(class_counts)[:-1])

            rng = check_random_state(self.random_state)

            for _ in range(self.n_splits):
                # if there are ties in the class-counts, we want
                # to make sure to break them anew in each iteration
                n_i = _approximate_mode(class_counts, n_train, rng)
                class_counts_remaining = class_counts - n_i
                t_i = _approximate_mode(class_counts_remaining, n_test, rng)

                train = []
                test = []

                for i in range(n_classes):
                    permutation = rng.permutation(class_counts[i])
                    perm_indices_class_i = class_indices[i].take(permutation,
                                                                 mode='clip')

                    train.extend(perm_indices_class_i[:n_i[i]])
                    test.extend(perm_indices_class_i[n_i[i]:n_i[i] + t_i[i]])

                train = rng.permutation(train)
                test = rng.permutation(test)

                yield train, test

        def split(self, X, y, groups=None):
            """Generate indices to split data into training and test set.

            Parameters
            ----------
            X : array-like of shape (n_samples, n_features)
                Training data, where n_samples is the number of samples
                and n_features is the number of features.

                Note that providing ``y`` is sufficient to generate the splits and
                hence ``np.zeros(n_samples)`` may be used as a placeholder for
                ``X`` instead of actual training data.

            y : array-like of shape (n_samples,) or (n_samples, n_labels)
                The target variable for supervised learning problems.
                Stratification is done based on the y labels.

            groups : object
                Always ignored, exists for compatibility.

            Yields
            ------
            train : ndarray
                The training set indices for that split.

            test : ndarray
                The testing set indices for that split.

            Notes
            -----
            Randomized CV splitters may return different results for each call of
            split. You can make the results identical by setting `random_state`
            to an integer.
            """
            y = check_array(y, ensure_2d=False, dtype=None)
            return super().split(X, y, groups)
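Two details of the source above are easy to check by hand: only y drives the stratification, so a placeholder X works, and a class with a single member is rejected with a ValueError. The sketch below exercises both on small made-up label arrays. The variable names are illustrative only.

```python
import numpy as np
from sklearn.model_selection import StratifiedShuffleSplit

sss = StratifiedShuffleSplit(n_splits=3, test_size=0.5, random_state=0)

# Only y matters for the indices, so X can be a placeholder of the right length.
y = np.array([0, 0, 0, 0, 1, 1, 1, 1])
placeholder_X = np.zeros(len(y))
splits = list(sss.split(placeholder_X, y))
for train_idx, test_idx in splits:
    print("train classes:", np.bincount(y[train_idx]),
          "test classes:", np.bincount(y[test_idx]))

# A class with one member cannot be represented in both train and test.
y_bad = np.array([0, 0, 0, 0, 0, 1])
try:
    # split() is a generator, so the check only runs once it is consumed.
    list(sss.split(np.zeros(len(y_bad)), y_bad))
    error_message = None
except ValueError as exc:
    error_message = str(exc)
print("error:", error_message)
```

Note that the error surfaces only when the generator returned by split() is iterated, not when split() is called.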