Downsampling code
For binary classification tasks on tabular data.
The code from [1] is as follows:
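import numpy as np

def downsample(df_train):
    # Renamed from df_train(df_train) so the function no longer shares its parameter's name.
    number_records_fraud = len(df_train[df_train.isFraud == 1])  # number of isFraud == 1 rows
    fraud_indices = np.array(df_train[df_train.isFraud == 1].index)  # index labels of isFraud == 1 rows
    normal_indices = df_train[df_train.isFraud == 0].index  # index labels of isFraud == 0 rows
    # replace=True: the same element of a may be drawn more than once.
    # replace=False: each element of a can be drawn at most once.
    random_normal_indices = np.random.choice(normal_indices, number_records_fraud * 14, replace=False)
    random_normal_indices = np.array(random_normal_indices)
    under_sample_indices = np.concatenate([fraud_indices, random_normal_indices])  # labels of all kept rows
    # Keep every isFraud == 1 row plus the sampled isFraud == 0 rows.
    # .loc is used because these are index labels; the original .iloc is only correct with a default RangeIndex.
    under_sample_data = df_train.loc[under_sample_indices, :].reset_index(drop=True)
    return under_sample_data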
Usage:
Change the 14 in the code above.
14 means that the number of isFraud=0 rows kept is 14 times the number of isFraud=1 rows.
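According to [2], based on that author's experience:
do not use upsampling (repeated sampling from the class with fewer rows);
prefer downsampling (sampling without replacement from the class with more rows).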
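To make the two terms concrete, here is a minimal sketch on a small made-up DataFrame. The toy data, the 1:3 ratio and the variable names are assumptions for illustration and do not come from [2]. It uses pandas `DataFrame.sample`, with `replace=True` for upsampling and `replace=False` for downsampling.

```python
import numpy as np
import pandas as pd

# Toy data (hypothetical): 5 positive rows and 95 negative rows.
toy = pd.DataFrame({"isFraud": [1] * 5 + [0] * 95, "amount": np.arange(100)})
minority = toy[toy.isFraud == 1]
majority = toy[toy.isFraud == 0]

# Upsampling: draw minority rows WITH replacement until they match the majority count.
upsampled = pd.concat(
    [majority, minority.sample(n=len(majority), replace=True, random_state=0)]
)

# Downsampling: draw majority rows WITHOUT replacement, here 3 per minority row.
downsampled = pd.concat(
    [minority, majority.sample(n=3 * len(minority), replace=False, random_state=0)]
)

# Upsampling repeats rows, downsampling does not.
print(len(upsampled), upsampled.index.duplicated().any())      # 190 True
print(len(downsampled), downsampled.index.duplicated().any())  # 20 False
```

Upsampling repeats the same few minority rows many times, while downsampling keeps every row distinct and simply discards part of the majority class.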
The code in [3] is:
import numpy as np

def Negativedownsampling(train, ratio):
    # Number of data points in the minority class
    number_records_fraud = len(train[train.isFraud == 1])
    fraud_indices = np.array(train[train.isFraud == 1].index)
    # Picking the indices of the normal classes
    normal_indices = train[train.isFraud == 0].index
    # Out of the indices we picked, randomly select number_records_fraud * ratio of them
    random_normal_indices = np.random.choice(normal_indices, number_records_fraud * ratio, replace=False)
    random_normal_indices = np.array(random_normal_indices)
    # Appending the 2 indices
    under_sample_indices = np.concatenate([fraud_indices, random_normal_indices])
    # Under sample dataset (.loc, because these are index labels, not positions)
    under_sample_data = train.loc[under_sample_indices, :]
    # Showing ratio
    print("Percentage of normal transactions: ", round(len(under_sample_data[under_sample_data.isFraud == 0]) / len(under_sample_data), 2) * 100, "%")
    print("Percentage of fraud transactions: ", round(len(under_sample_data[under_sample_data.isFraud == 1]) / len(under_sample_data), 2) * 100, "%")
    print("Total number of transactions in resampled data: ", len(under_sample_data))
    return under_sample_data

Usage:
df_train_resampling_1 = Negativedownsampling(df_train, 9)
Percentage of normal transactions: 90.0 %
Percentage of fraud transactions: 10.0 %
Total number of transactions in resampled data: 206630

df_train_resampling_2 = Negativedownsampling(df_train, 3)
Percentage of normal transactions: 75.0 %
Percentage of fraud transactions: 25.0 %
Total number of transactions in resampled data: 82652
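The percentages and totals printed above follow directly from the ratio: with n fraud rows and ratio r, the result has n * (1 + r) rows and a fraud share of 1 / (1 + r). Below is a short check. The helper name `expected_shape` is made up for illustration, and the fraud count is inferred from the ratio-9 total (206630 / 10) rather than stated in [3].

```python
def expected_shape(n_fraud, ratio):
    """Return (total rows, fraud share) when keeping `ratio` normal rows per fraud row."""
    total = n_fraud * (1 + ratio)
    return total, n_fraud / total

# Inferred from the ratio-9 output above: total / (1 + 9).
n_fraud = 206630 // 10

for ratio in (9, 3, 14):
    total, share = expected_shape(n_fraud, ratio)
    print(f"ratio={ratio}: total={total}, fraud share={share * 100:.2f}%")
```

Because the sampling uses replace=False, the normal class must contain at least n * r rows; otherwise np.random.choice raises a ValueError.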
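If you use this together with GroupKFold, remember to call reset_index(drop=True) on the downsampled data, so that its index labels match the row positions returned by the splitter.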
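A minimal sketch of that combination, assuming scikit-learn is installed. The toy frame, the `group` column and the number of folds are made up for illustration. `GroupKFold.split` yields row positions, so after `reset_index(drop=True)` they are valid for both `.iloc` and `.loc`.

```python
import pandas as pd
from sklearn.model_selection import GroupKFold

# Toy downsampled frame (hypothetical) whose index still holds the original row labels.
sampled = pd.DataFrame(
    {"isFraud": [1, 0, 0, 1, 0, 0], "group": [0, 0, 1, 1, 2, 2]},
    index=[10, 42, 17, 99, 33, 58],
)

# Without this step, sampled.loc[train_pos] would look up labels 0..5,
# none of which exist in the index above.
sampled = sampled.reset_index(drop=True)

folds = GroupKFold(n_splits=3)
for train_pos, valid_pos in folds.split(sampled, sampled["isFraud"], groups=sampled["group"]):
    train, valid = sampled.loc[train_pos], sampled.loc[valid_pos]
    # No group appears on both sides of a split.
    assert set(train["group"]).isdisjoint(valid["group"])
    print(len(train), len(valid))
```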
Reposted from:
[1] https://www.kaggle.com/c/ieee-fraud-detection/discussion/108616#latest-625841
[2] https://www.kaggle.com/c/ieee-fraud-detection/discussion/100268#latest-581989
[3] https://www.kaggle.com/zero92/negative-downsampling-improve-the-score