
Everyone Do this at the Beginning!! - A Kaggle Data Preprocessing Approach

Published: 2025/3/15



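In data analysis, the way raw data is preprocessed, and what that preprocessing produces, has a major impact on the results. Moreover, today's machine learning algorithms need training data of sufficiently high quality in order to yield a high-quality model.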

Here I introduce a preprocessing approach for a malware feature dataset from Kaggle.

We can remove 17 features from this dataset!

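(M) mostly-missing features: features whose missing values account for more than 99% of the rows.
(S) too-skewed features: features whose most frequent value accounts for more than 99% of all value occurrences.
(C) highly-correlated features: compute the correlation between features, pick the feature pairs with correlation >= 0.99, compare the number of distinct values within each pair, and drop the feature with fewer distinct values.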

Using the three criteria above, we can remove 17 features in total:

(M) PuaMode
(M) Census_ProcessorClass
(S) Census_IsWIMBootEnabled
(S) IsBeta
(S) Census_IsFlightsDisabled
(S) Census_IsFlightingInternal
(S) AutoSampleOptIn
(S) Census_ThresholdOptIn
(S) SMode
(S) Census_IsPortableOperatingSystem
(S) Census_DeviceFamily
(S) UacLuaenable
(S) Census_IsVirtualDevice
(C) Platform
(C) Census_OSSkuName
(C) Census_OSInstallLanguageIdentifier
(C) Processor

Note: this article only shows the processing of train; the result is the same as when train and test are processed together.

I. Load Data

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

Specifying custom dtypes helps speed up loading (the file is over 3 GB). The dtype mapping below is borrowed from another Kaggle notebook.

# referred https://www.kaggle.com/theoviel/load-the-totality-of-the-data
dtypes = {
    'MachineIdentifier': 'category',
    'ProductName': 'category',
    'EngineVersion': 'category',
    'AppVersion': 'category',
    'AvSigVersion': 'category',
    'IsBeta': 'int8',
    'RtpStateBitfield': 'float16',
    'IsSxsPassiveMode': 'int8',
    'DefaultBrowsersIdentifier': 'float32',
    'AVProductStatesIdentifier': 'float32',
    'AVProductsInstalled': 'float16',
    'AVProductsEnabled': 'float16',
    'HasTpm': 'int8',
    'CountryIdentifier': 'int16',
    'CityIdentifier': 'float32',
    'OrganizationIdentifier': 'float16',
    'GeoNameIdentifier': 'float16',
    'LocaleEnglishNameIdentifier': 'int16',
    'Platform': 'category',
    'Processor': 'category',
    'OsVer': 'category',
    'OsBuild': 'int16',
    'OsSuite': 'int16',
    'OsPlatformSubRelease': 'category',
    'OsBuildLab': 'category',
    'SkuEdition': 'category',
    'IsProtected': 'float16',
    'AutoSampleOptIn': 'int8',
    'PuaMode': 'category',
    'SMode': 'float16',
    'IeVerIdentifier': 'float16',
    'SmartScreen': 'category',
    'Firewall': 'float16',
    'UacLuaenable': 'float64',  # was 'float32'
    'Census_MDC2FormFactor': 'category',
    'Census_DeviceFamily': 'category',
    'Census_OEMNameIdentifier': 'float32',  # was 'float16'
    'Census_OEMModelIdentifier': 'float32',
    'Census_ProcessorCoreCount': 'float16',
    'Census_ProcessorManufacturerIdentifier': 'float16',
    'Census_ProcessorModelIdentifier': 'float32',  # was 'float16'
    'Census_ProcessorClass': 'category',
    'Census_PrimaryDiskTotalCapacity': 'float64',  # was 'float32'
    'Census_PrimaryDiskTypeName': 'category',
    'Census_SystemVolumeTotalCapacity': 'float64',  # was 'float32'
    'Census_HasOpticalDiskDrive': 'int8',
    'Census_TotalPhysicalRAM': 'float32',
    'Census_ChassisTypeName': 'category',
    'Census_InternalPrimaryDiagonalDisplaySizeInInches': 'float32',  # was 'float16'
    'Census_InternalPrimaryDisplayResolutionHorizontal': 'float32',  # was 'float16'
    'Census_InternalPrimaryDisplayResolutionVertical': 'float32',  # was 'float16'
    'Census_PowerPlatformRoleName': 'category',
    'Census_InternalBatteryType': 'category',
    'Census_InternalBatteryNumberOfCharges': 'float64',  # was 'float32'
    'Census_OSVersion': 'category',
    'Census_OSArchitecture': 'category',
    'Census_OSBranch': 'category',
    'Census_OSBuildNumber': 'int16',
    'Census_OSBuildRevision': 'int32',
    'Census_OSEdition': 'category',
    'Census_OSSkuName': 'category',
    'Census_OSInstallTypeName': 'category',
    'Census_OSInstallLanguageIdentifier': 'float16',
    'Census_OSUILocaleIdentifier': 'int16',
    'Census_OSWUAutoUpdateOptionsName': 'category',
    'Census_IsPortableOperatingSystem': 'int8',
    'Census_GenuineStateName': 'category',
    'Census_ActivationChannel': 'category',
    'Census_IsFlightingInternal': 'float16',
    'Census_IsFlightsDisabled': 'float16',
    'Census_FlightRing': 'category',
    'Census_ThresholdOptIn': 'float16',
    'Census_FirmwareManufacturerIdentifier': 'float16',
    'Census_FirmwareVersionIdentifier': 'float32',
    'Census_IsSecureBootEnabled': 'int8',
    'Census_IsWIMBootEnabled': 'float16',
    'Census_IsVirtualDevice': 'float16',
    'Census_IsTouchEnabled': 'int8',
    'Census_IsPenCapable': 'int8',
    'Census_IsAlwaysOnAlwaysConnectedCapable': 'float16',
    'Wdft_IsGamer': 'float16',
    'Wdft_RegionIdentifier': 'float16',
    'HasDetections': 'int8'
}

train = pd.read_csv('../input/train.csv', dtype=dtypes)
train.shape

(8921483, 83)
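To illustrate why passing dtype helps, here is a minimal sketch on invented data. The column names are borrowed from the mapping above, but the values and the 1,000-row size are made up for the example. It loads the same CSV text twice and compares the memory footprint.

```python
import io

import pandas as pd

# Tiny invented CSV; the real train.csv has 8,921,483 rows.
csv_text = "ProductName,IsBeta,CountryIdentifier\n" + "\n".join(
    f"win8defender,0,{i % 200}" for i in range(1000)
)

# Default loading: integers become int64 and strings use a generic
# object/string dtype (the exact dtype depends on the pandas version).
default_df = pd.read_csv(io.StringIO(csv_text))

# Loading with explicit, narrower dtypes.
small_dtypes = {
    "ProductName": "category",
    "IsBeta": "int8",
    "CountryIdentifier": "int16",
}
typed_df = pd.read_csv(io.StringIO(csv_text), dtype=small_dtypes)

default_mb = default_df.memory_usage(deep=True).sum() / 1024**2
typed_mb = typed_df.memory_usage(deep=True).sum() / 1024**2
print(f"default dtypes: {default_mb:.4f} MB")
print(f"custom dtypes:  {typed_mb:.4f} MB")
```

The absolute numbers are meaningless at this size; the point is only that the typed frame is smaller.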

II. Feature Engineering

Define an empty list to hold the names of the features to be removed.

droppable_features = []

(1)

Mostly-missing columns
# Fraction of missing values for each feature
(train.isnull().sum() / train.shape[0]).sort_values(ascending=False)

PuaMode    0.999741
Census_ProcessorClass    0.995894
DefaultBrowsersIdentifier    0.951416
Census_IsFlightingInternal    0.830440
Census_InternalBatteryType    0.710468
Census_ThresholdOptIn    0.635245
Census_IsWIMBootEnabled    0.634390
SmartScreen    0.356108
OrganizationIdentifier    0.308415
SMode    0.060277
CityIdentifier    0.036475
Wdft_IsGamer    0.034014
Wdft_RegionIdentifier    0.034014
Census_InternalBatteryNumberOfCharges    0.030124
Census_FirmwareManufacturerIdentifier    0.020541
Census_IsFlightsDisabled    0.017993
Census_FirmwareVersionIdentifier    0.017949
Census_OEMModelIdentifier    0.011459
Census_OEMNameIdentifier    0.010702
Firewall    0.010239
Census_TotalPhysicalRAM    0.009027
Census_IsAlwaysOnAlwaysConnectedCapable    0.007997
Census_OSInstallLanguageIdentifier    0.006735
IeVerIdentifier    0.006601
Census_PrimaryDiskTotalCapacity    0.005943
Census_SystemVolumeTotalCapacity    0.005941
Census_InternalPrimaryDiagonalDisplaySizeInInches    0.005283
Census_InternalPrimaryDisplayResolutionHorizontal    0.005267
Census_InternalPrimaryDisplayResolutionVertical    0.005267
Census_ProcessorModelIdentifier    0.004634
...
ProductName    0.000000
HasTpm    0.000000
OsBuild    0.000000
IsBeta    0.000000
OsSuite    0.000000
IsSxsPassiveMode    0.000000
HasDetections    0.000000
SkuEdition    0.000000
Census_OSInstallTypeName    0.000000
Census_IsPenCapable    0.000000
Census_IsTouchEnabled    0.000000
Census_IsSecureBootEnabled    0.000000
Census_FlightRing    0.000000
Census_ActivationChannel    0.000000
Census_GenuineStateName    0.000000
Census_IsPortableOperatingSystem    0.000000
Census_OSWUAutoUpdateOptionsName    0.000000
Census_OSUILocaleIdentifier    0.000000
Census_OSSkuName    0.000000
AutoSampleOptIn    0.000000
Census_OSEdition    0.000000
Census_OSBuildRevision    0.000000
Census_OSBuildNumber    0.000000
Census_OSBranch    0.000000
Census_OSArchitecture    0.000000
Census_OSVersion    0.000000
Census_HasOpticalDiskDrive    0.000000
Census_DeviceFamily    0.000000
Census_MDC2FormFactor    0.000000
MachineIdentifier    0.000000
Length: 83, dtype: float64

As shown, two features have more than 99% missing values, so we remove them:

# Add them to the empty list defined earlier
droppable_features.append('PuaMode')
droppable_features.append('Census_ProcessorClass')
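The two names above are typed in by hand after reading the table. As a sketch of how the same selection could be made programmatically, the snippet below applies the 99% threshold to an invented toy frame (the frame and its column names are assumptions for illustration, not part of the dataset).

```python
import numpy as np
import pandas as pd

# Invented toy frame: column "a" is almost entirely missing.
toy = pd.DataFrame({
    "a": [np.nan] * 199 + [1.0],
    "b": [np.nan] * 50 + list(range(150)),
    "c": range(200),
})

# Fraction of missing values per column; same as isnull().sum() / len(toy).
missing_ratio = toy.isnull().mean()

# Columns whose missing fraction exceeds 99%.
mostly_missing = missing_ratio[missing_ratio > 0.99].index.tolist()
print(mostly_missing)
```

On the real data, the equivalent expression on train would be expected to select PuaMode and Census_ProcessorClass, given the ratios listed above.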

Too skewed columns

# pd.options.display lets you customize how values are displayed.
# '{:,.4f}' keeps 4 decimal places (the comma adds a thousands separator);
# the digit after the decimal point sets how many decimals are kept.
# train[c].nunique(): number of distinct values in the column.
# value_counts(): number of occurrences of each value.
# value_counts(normalize=True): share of each value, sorted in descending order by default.
# value_counts(normalize=True).values[0]: share of the most frequent value.
pd.options.display.float_format = '{:,.4f}'.format
sk_df = pd.DataFrame([{'column': c,
                       'uniq': train[c].nunique(),
                       'skewness': train[c].value_counts(normalize=True).values[0] * 100}
                      for c in train.columns])
sk_df = sk_df.sort_values('skewness', ascending=False)
sk_df

column	skewness	uniq

Census_IsWIMBootEnabled	100.0000	2
IsBeta	99.9992	2
Census_IsFlightsDisabled	99.9990	2
Census_IsFlightingInternal	99.9986	2
AutoSampleOptIn	99.9971	2
Census_ThresholdOptIn	99.9749	2
SMode	99.9537	2
Census_IsPortableOperatingSystem	99.9455	2
PuaMode	99.9134	2
Census_DeviceFamily	99.8383	3
UacLuaenable	99.3925	11
Census_IsVirtualDevice	99.2961	2
ProductName	98.9356	6
HasTpm	98.7971	2
IsSxsPassiveMode	98.2666	2
Firewall	97.8583	2
AVProductsEnabled	97.3984	6
RtpStateBitfield	97.3262	7
OsVer	96.7613	58
Platform	96.6063	4
Census_IsPenCapable	96.1929	2
IsProtected	94.5624	2
Census_IsAlwaysOnAlwaysConnectedCapable	94.2581	2
Census_FlightRing	93.6580	10
Census_HasOpticalDiskDrive	92.2813	2
Census_OSArchitecture	90.8580	3
Processor	90.8530	3
Census_GenuineStateName	88.2992	5
Census_ProcessorManufacturerIdentifier	88.2789	7
Census_IsTouchEnabled	87.4457	2
...	...	...
Census_OSBuildNumber	44.9351	165
Census_OSWUAutoUpdateOptionsName	44.3256	6
OsPlatformSubRelease	43.8887	9
OsBuild	43.8887	76
IeVerIdentifier	43.8454	303
EngineVersion	43.0990	70
OsBuildLab	41.0045	663
Census_OSEdition	38.8948	33
Census_OSSkuName	38.8934	30
Census_OSInstallLanguageIdentifier	35.8777	39
Census_OSUILocaleIdentifier	35.5414	147
Census_InternalPrimaryDiagonalDisplaySizeInInches	34.3398	785
Census_PrimaryDiskTotalCapacity	32.0408	5735
Census_FirmwareManufacturerIdentifier	30.8882	712
Census_OSInstallTypeName	29.2332	9
LocaleEnglishNameIdentifier	23.4780	276
Wdft_RegionIdentifier	20.8877	15
GeoNameIdentifier	17.1716	292
Census_OSBuildRevision	15.8453	285
Census_OSVersion	15.8452	469
Census_OEMNameIdentifier	14.5850	3832
DefaultBrowsersIdentifier	10.6257	2017
CountryIdentifier	4.4519	222
Census_OEMModelIdentifier	3.4559	175365
Census_ProcessorModelIdentifier	3.2576	3428
AvSigVersion	1.1469	8531
CityIdentifier	1.1030	107366
Census_FirmwareVersionIdentifier	1.0228	50494
Census_SystemVolumeTotalCapacity	0.5863	536848
MachineIdentifier	0.0000	8921483

83 rows × 3 columns
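Note that the "skewness" column here is the share of the most frequent value, not statistical skewness. The following sketch, on an invented column, shows how that share is computed and that value_counts ignores NaN by default, so the share is relative to the non-missing entries only.

```python
import numpy as np
import pandas as pd

# Invented column: 2 real values and 98 missing ones.
col = pd.Series([1.0, 1.0] + [np.nan] * 98)

# Share of the most frequent value among non-missing entries (the measure used in the post).
share_non_missing = col.value_counts(normalize=True).values[0] * 100

# Share of the most frequent value among all rows, counting NaN as a value.
share_all_rows = col.value_counts(normalize=True, dropna=False).values[0] * 100

print(share_non_missing)
print(share_all_rows)
```

This is one reason a column can appear both in the mostly-missing list and in the too-skewed list.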

As shown, there are 12 features whose most frequent value accounts for more than 99% of occurrences, so we remove them:

droppable_features.extend(sk_df[sk_df.skewness > 99].column.tolist())
droppable_features

['PuaMode',
 'Census_ProcessorClass',
 'Census_IsWIMBootEnabled',
 'IsBeta',
 'Census_IsFlightsDisabled',
 'Census_IsFlightingInternal',
 'AutoSampleOptIn',
 'Census_ThresholdOptIn',
 'SMode',
 'Census_IsPortableOperatingSystem',
 'PuaMode',
 'Census_DeviceFamily',
 'UacLuaenable',
 'Census_IsVirtualDevice']

Notice that 'PuaMode' appears twice in this list, so we remove one of the two entries:

# PuaMode is duplicated in the two categories.
droppable_features.remove('PuaMode')

# Drop these columns.
# axis=1: operate on columns
# inplace=True: modify the original data instead of creating a new object
train.drop(droppable_features, axis=1, inplace=True)
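Calling remove() works here because exactly one name is duplicated. As an alternative sketch (not the author's method), a list can be de-duplicated in one step while keeping the order of first occurrence. The list below is a shortened, illustrative version of the real one.

```python
# Shortened, illustrative version of the list built above.
droppable = ["PuaMode", "Census_ProcessorClass", "IsBeta", "PuaMode", "SMode"]

# dict keys keep insertion order, so this keeps the first occurrence of each name.
deduped = list(dict.fromkeys(droppable))
print(deduped)
```

Unlike list.remove(), which deletes the first matching entry, this keeps the first entry and drops the later repeats, so the resulting order can differ from the one shown in the post.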

At this point we have removed 2 + (12 - 1) = 13 features.

(2)

In addition, the remaining features still contain many missing values (NaN), which we need to handle.

# Fraction of missing values for each feature
# .isnull().sum(): number of missing values in each feature
null_counts = train.isnull().sum()
null_counts = null_counts / train.shape[0]
null_counts[null_counts > 0.1]

DefaultBrowsersIdentifier    0.9514
OrganizationIdentifier    0.3084
SmartScreen    0.3561
Census_InternalBatteryType    0.7105
dtype: float64

As shown, four features contain a large share of missing values (NaN).

1. DefaultBrowsersIdentifier

train.DefaultBrowsersIdentifier.value_counts().head(5)

239.0000    46056
3,195.0000    42692
1,632.0000    28751
3,176.0000    24220
146.0000    20756
Name: DefaultBrowsersIdentifier, dtype: int64

# .fillna(0, inplace=True): fill missing values with 0 and modify the original data, so every missing value is replaced by 0
# .fillna(0, inplace=False): fill missing values with 0 in a returned copy, which is handy for printing; the original data is unchanged and the missing values remain
train.DefaultBrowsersIdentifier.fillna(0, inplace=True)
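One caveat, offered as a hedged note: in recent pandas versions, calling fillna(..., inplace=True) on a column selected from a DataFrame may emit a warning or leave the parent frame unchanged, depending on the version and copy-on-write settings. A sketch of the assignment form, which avoids that ambiguity, on an invented frame:

```python
import numpy as np
import pandas as pd

# Invented frame with the same column name as in the post.
toy = pd.DataFrame({"DefaultBrowsersIdentifier": [239.0, np.nan, 3195.0, np.nan]})

# Assign the filled column back instead of relying on inplace=True.
toy["DefaultBrowsersIdentifier"] = toy["DefaultBrowsersIdentifier"].fillna(0)

print(toy["DefaultBrowsersIdentifier"].tolist())
```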

2. SmartScreen

# .value_counts(): number of occurrences of each value in the feature
train.SmartScreen.value_counts()

RequireAdmin    4316183
ExistsNotSet    1046183
Off    186553
Warn    135483
Prompt    34533
Block    22533
off    1350
On    731
(unprintable)    416
(unprintable)    335
on    147
requireadmin    10
OFF    4
0    3
Promt    2
requireAdmin    1
Enabled    1
prompt    1
warn    1
00000000    1
(unprintable)    1
Name: SmartScreen, dtype: int64

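The values of 'SmartScreen' are messy, so we map them to more regular strings:

# NOTE: the three '' keys below stand in for unprintable values whose characters are not visible here;
# as written, the keys collide and only the last one ('' -> '3') takes effect.
trans_dict = {
    'off': 'Off', '': '2', '': '1', 'on': 'On', 'requireadmin': 'RequireAdmin', 'OFF': 'Off',
    'Promt': 'Prompt', 'requireAdmin': 'RequireAdmin', 'prompt': 'Prompt', 'warn': 'Warn',
    '00000000': '0', '': '3', np.nan: 'NoExist'
}
train.replace({'SmartScreen': trans_dict}, inplace=True)  # .replace() substitutes values
train.SmartScreen.isnull().sum()

0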

Why is the result 0? Because every missing value has been replaced with 'NoExist'.
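Most of the messy SmartScreen spellings differ only in letter case. As an alternative sketch (a swapped-in technique, not the mapping used in the post), the values can be lower-cased first and then mapped to canonical labels, which needs fewer dictionary entries. The sample series is invented; the 'NoExist' label follows the post.

```python
import numpy as np
import pandas as pd

# Invented sample containing a few of the spellings listed above.
s = pd.Series(["RequireAdmin", "requireadmin", "OFF", "off", "Promt", np.nan])

# Canonical labels keyed by lower-cased spelling.
canonical = {
    "requireadmin": "RequireAdmin",
    "off": "Off",
    "on": "On",
    "prompt": "Prompt",
    "promt": "Prompt",  # misspelling seen in the data
    "warn": "Warn",
}

lowered = s.str.lower()
# Map known spellings, keep unmapped values as they were, and label missing values.
cleaned = lowered.map(canonical).fillna(s).fillna("NoExist")
print(cleaned.tolist())
```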

3. OrganizationIdentifier

train.OrganizationIdentifier.value_counts()

27.0000    4196457
18.0000    1764175
48.0000    63845
50.0000    45502
11.0000    19436
37.0000    19398
49.0000    13627
46.0000    10974
14.0000    4713
32.0000    4045
36.0000    3909
52.0000    3043
33.0000    2896
2.0000    2595
5.0000    1990
40.0000    1648
28.0000    1591
4.0000    1385
10.0000    1083
51.0000    917
20.0000    915
1.0000    893
8.0000    723
22.0000    418
39.0000    413
6.0000    412
31.0000    398
21.0000    397
47.0000    385
3.0000    331
16.0000    242
19.0000    172
26.0000    160
44.0000    150
29.0000    135
42.0000    132
7.0000    98
41.0000    77
45.0000    73
30.0000    64
43.0000    60
35.0000    32
23.0000    20
15.0000    13
25.0000    12
12.0000    7
34.0000    2
38.0000    1
17.0000    1
Name: OrganizationIdentifier, dtype: int64

This feature holds ID-like data, and 0 does not occur among its values, so we can use 0 to fill the missing values:

train.replace({'OrganizationIdentifier': {np.nan: 0}}, inplace=True)

4. Census_InternalBatteryType

pd.options.display.max_rows = 99
train.Census_InternalBatteryType.value_counts()

lion    2028256
li-i    245617
#    183998
lip    62099
liio    32635
li p    8383
li    6708
nimh    4614
real    2744
bq20    2302
pbac    2274
vbox    1454
unkn    533
lgi0    399
lipo    198
lhp0    182
4cel    170
lipp    83
ithi    79
batt    60
ram    35
bad    33
virt    33
pad0    22
lit    16
ca48    16
a132    10
ots0    9
lai0    8
????    8
lio    5
4lio    4
lio    4
asmb    4
li-p    4
0x0b    3
lgs0    3
icp3    3
3ion    2
a140    2
h00j    2
5nm1    2
lhpo    2
a138    2
lilo    1
li-h    1
lp    1
li?    1
ion    1
pbso    1
3500    1
6ion    1
@i    1
li    1
sams    1
ip    1
8    1
#TAB#    1
l&#TAB#    1
lio    1
˙˙˙    1
l    1
cl53    1
li??    1
pa50    1
í-i    1
÷?ó?    1
li-l    1
h4°s    1
d    1
lgl0    1
4ion    1
0ts0    1
sail    1
p-sn    1
a130    1
2337    1
l???    1
Name: Census_InternalBatteryType, dtype: int64

In this feature, missing values, '˙˙˙', and 'unkn' all mean "unknown", so we assign 'unknown' to all of them:

trans_dict = {
    '˙˙˙': 'unknown', 'unkn': 'unknown', np.nan: 'unknown'
}
train.replace({'Census_InternalBatteryType': trans_dict}, inplace=True)

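(3)

Note: the four features above each have more than 10% missing values. We cannot delete the samples that have missing values in those features, so we fill them with other values instead. Many other features have between 0% and 10% missing values; for those we drop the missing values, which in practice means dropping the samples (rows) that contain them.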

train.shape

(8921483, 70)

# .dropna(inplace=True): drop every row that contains NaN; the remaining rows keep their original index values
train.dropna(inplace=True)
train.shape

(7667789, 70)

In the end, about 14% of the samples were dropped.
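The 14% figure follows directly from the two shapes printed above. A quick check of the arithmetic:

```python
# Row counts taken from the train.shape outputs before and after dropna().
rows_before = 8_921_483
rows_after = 7_667_789

dropped = rows_before - rows_after
dropped_pct = 100 * dropped / rows_before
print(dropped, f"{dropped_pct:.2f}%")
```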

In addition, the feature 'MachineIdentifier' is of no use for malware detection, so we drop it as well:

train.drop('MachineIdentifier', axis=1, inplace=True)

At this point we have dropped 13 + 1 = 14 columns.

(4)

To make the data usable for machine learning, we need to convert some columns to the category dtype and then encode them as integers.

# Convert the values of 'SmartScreen' and 'Census_InternalBatteryType' to the category dtype
train['SmartScreen'] = train.SmartScreen.astype('category')
train['Census_InternalBatteryType'] = train.Census_InternalBatteryType.astype('category')

# cate_cols: names of the features whose dtype is category
cate_cols = train.select_dtypes(include='category').columns.tolist()

from sklearn.preprocessing import LabelEncoder

le = LabelEncoder()
for col in cate_cols:
    train[col] = le.fit_transform(train[col])
# After le.fit_transform(), train['SmartScreen'] and train['Census_InternalBatteryType'] are of type int64

For details on LabelEncoder, see the scikit-learn documentation.
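As a small illustration of what LabelEncoder does, the sketch below encodes a short invented list of SmartScreen-like labels. It assumes scikit-learn is installed.

```python
from sklearn.preprocessing import LabelEncoder

# Invented sample of string labels.
values = ["Off", "RequireAdmin", "Off", "Warn"]

le = LabelEncoder()
codes = le.fit_transform(values)

print(list(le.classes_))                  # distinct labels, in sorted order
print(codes.tolist())                     # integer code of each input value
print(list(le.inverse_transform(codes)))  # codes mapped back to labels
```

One design point worth noting: when a single encoder instance is refit inside a loop, as in the post, its classes_ attribute only remembers the last column. If the codes need to be mapped back to labels later, keeping one encoder per column is a common choice.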

(5)

We use a routine to shrink the memory footprint of train. I call it the "memory reduction algorithm".

The code is as follows:

def reduce_mem_usage(df):
    """ iterate through all the columns of a dataframe and modify the data type
        to reduce memory usage.
    """
    # .memory_usage() returns the memory used by each column, in bytes
    start_mem = df.memory_usage().sum() / 1024**2
    print('Memory usage of dataframe is {:.2f} MB'.format(start_mem))

    for col in df.columns:
        col_type = df[col].dtype

        if col_type != object:
            c_min = df[col].min()
            c_max = df[col].max()
            if str(col_type)[:3] == 'int':
                # np.iinfo() reports the limits of an integer type
                if c_min > np.iinfo(np.int8).min and c_max < np.iinfo(np.int8).max:
                    df[col] = df[col].astype(np.int8)
                elif c_min > np.iinfo(np.int16).min and c_max < np.iinfo(np.int16).max:
                    df[col] = df[col].astype(np.int16)
                elif c_min > np.iinfo(np.int32).min and c_max < np.iinfo(np.int32).max:
                    df[col] = df[col].astype(np.int32)
                elif c_min > np.iinfo(np.int64).min and c_max < np.iinfo(np.int64).max:
                    df[col] = df[col].astype(np.int64)
            else:
                if c_min > np.finfo(np.float16).min and c_max < np.finfo(np.float16).max:
                    df[col] = df[col].astype(np.float16)
                elif c_min > np.finfo(np.float32).min and c_max < np.finfo(np.float32).max:
                    df[col] = df[col].astype(np.float32)
                else:
                    df[col] = df[col].astype(np.float64)
        else:
            # Anything that is neither integer nor float (e.g. strings) is converted to category
            df[col] = df[col].astype('category')

    end_mem = df.memory_usage().sum() / 1024**2
    print('Memory usage after optimization is: {:.2f} MB'.format(end_mem))
    print('Decreased by {:.1f}%'.format(100 * (start_mem - end_mem) / start_mem))

    return df

%time train = reduce_mem_usage(train)

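CPU times: user 0 ns, sys: 0 ns, total: 0 ns
Wall time: 5.48 μs
Memory usage of dataframe is 2464.34 MB
Memory usage after optimization is: 965.26 MB
Decreased by 60.8%

np.iinfo(): returns the limits (minimum and maximum) of an integer type; np.finfo() does the same for floating-point types.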
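The range checks in reduce_mem_usage rely on np.iinfo. The sketch below prints the limits involved and wraps the same strict comparisons in a helper; smallest_int_type is a hypothetical name introduced here for illustration, not something from the post. The last line is a hedged caveat: the function checks range only, and float16 cannot hold every integer exactly, so ID-like float columns may be altered by the downcast.

```python
import numpy as np

# Limits of the integer types that reduce_mem_usage tries, narrowest first.
for int_type in (np.int8, np.int16, np.int32):
    info = np.iinfo(int_type)
    print(int_type.__name__, info.min, info.max)


def smallest_int_type(c_min, c_max):
    """Hypothetical helper: narrowest of int8/int16/int32 whose limits
    strictly bracket [c_min, c_max]; falls back to int64 otherwise."""
    for int_type in (np.int8, np.int16, np.int32):
        info = np.iinfo(int_type)
        if c_min > info.min and c_max < info.max:
            return int_type
    return np.int64


print(smallest_int_type(0, 100).__name__)
print(smallest_int_type(0, 127).__name__)  # 127 is not strictly below the int8 maximum

# Range is not the same as precision: float16 cannot represent every integer above 2048.
print(float(np.float16(3195.0)) == 3195.0)
```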

(6)

Highly correlated features.

Because there are so many features, we build one correlation (coefficient) matrix for every 10 features.

cols = train.columns.tolist()
corr_remove = []  # holds the features to remove

import seaborn as sns

plt.figure(figsize=(10,10))
co_cols = cols[:10]
co_cols.append('HasDetections')
# sns.heatmap(): draw the correlation matrix as a heatmap
# .corr(): pairwise correlation
# cmap: color map of the heatmap
# annot=True: print every correlation coefficient in its cell
# center=0.0: a correlation of 0.0 maps to the middle of the color scale
sns.heatmap(train[co_cols].corr(), cmap='RdBu_r', annot=True, center=0.0)
plt.title('Correlation between 1 ~ 10th columns')
plt.show()
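The same block is repeated below for each slice of 10 columns. As a sketch of how the slices could be generated in a loop, the helper below yields the column groups; column_chunks is a hypothetical name, and the column names in the example are invented.

```python
def column_chunks(cols, size=10, target="HasDetections"):
    """Yield lists of up to `size` columns, each with the target appended if absent."""
    for start in range(0, len(cols), size):
        chunk = list(cols[start:start + size])
        if target not in chunk:
            chunk.append(target)
        yield chunk


# Invented column names for illustration.
cols = [f"f{i}" for i in range(23)] + ["HasDetections"]
chunks = list(column_chunks(cols))
print([len(c) for c in chunks])
```

Each chunk could then be passed to train[chunk].corr() and sns.heatmap() exactly as in the block above.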

No correlation coefficient >= 0.99 appears. Moving on.

co_cols = cols[10:20]
co_cols.append('HasDetections')
plt.figure(figsize=(10,10))
sns.heatmap(train[co_cols].corr(), cmap='RdBu_r', annot=True, center=0.0)
plt.title('Correlation between 11 ~ 20th columns')
plt.show()

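This time one appears. We remove the feature with fewer distinct values:

print(train.Platform.nunique())
print(train.OsVer.nunique())

3
45

Platform vs OsVer: 3 < 45, so remove Platform.

# Remember, corr_remove is the empty list we defined above to hold the names of the features to remove
corr_remove.append('Platform')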

OK, moving on.

co_cols = cols[20:30]
co_cols.append('HasDetections')
plt.figure(figsize=(10,10))
sns.heatmap(train[co_cols].corr(), cmap='RdBu_r', annot=True, center=0.0)
plt.title('Correlation between 21 ~ 30th columns')
plt.show()

Unfortunately, no correlation coefficient >= 0.99 appears here. Don't lose heart; keep going.

co_cols = cols[30:40]
co_cols.append('HasDetections')
plt.figure(figsize=(10,10))
sns.heatmap(train[co_cols].corr(), cmap='RdBu_r', annot=True, center=0.0)
plt.title('Correlation between 31 ~ 40th columns')
plt.show()

Still none. Keep going.

co_cols = cols[40:50]
co_cols.append('HasDetections')
plt.figure(figsize=(10,10))
sns.heatmap(train[co_cols].corr(), cmap='RdBu_r', annot=True, center=0.0)
plt.title('Correlation between 41 ~ 50th columns')
plt.show()

Still none. Once more.

co_cols = cols[50:60]
co_cols.append('HasDetections')
plt.figure(figsize=(10,10))
sns.heatmap(train[co_cols].corr(), cmap='RdBu_r', annot=True, center=0)
plt.title('Correlation between 51 ~ 60th columns')
plt.show()

We handle these the same way as the last time we found a correlation coefficient >= 0.99:

print(train.Census_OSEdition.nunique())
print(train.Census_OSSkuName.nunique(), '\n')
print(train.Census_OSInstallLanguageIdentifier.nunique())
print(train.Census_OSUILocaleIdentifier.nunique())

29
25

39
144

Census_OSEdition vs Census_OSSkuName: 29 > 25, so remove Census_OSSkuName.

Census_OSInstallLanguageIdentifier vs Census_OSUILocaleIdentifier: 39 < 144, so remove Census_OSInstallLanguageIdentifier.

corr_remove.append('Census_OSSkuName')
corr_remove.append('Census_OSInstallLanguageIdentifier')

Let's finish what we started.

co_cols = cols[60:]
#co_cols.append('HasDetections')
plt.figure(figsize=(10,10))
sns.heatmap(train[co_cols].corr(), cmap='RdBu_r', annot=True, center=0)
plt.title('Correlation between from 61th to the last columns')
plt.show()

The correlation analysis of each group is now complete.

Across the groups, we found 3 features to remove in total.

corr_remove

['Platform', 'Census_OSSkuName', 'Census_OSInstallLanguageIdentifier']

The code to remove these 3 features is as follows:

train.drop(corr_remove, axis=1, inplace=True)

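So far we have removed 13 + 3 = 16 features (the MachineIdentifier ID column is not counted here).

Now build a correlation matrix over all the remaining data:

corr = train.corr()
high_corr = (corr >= 0.99).astype('uint8')
plt.figure(figsize=(15,15))
sns.heatmap(high_corr, cmap='RdBu_r', annot=True, center=0.0)
plt.show()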

A pair of features with correlation >= 0.99 appears.
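Reading pairs off a large heatmap is error-prone. As a sketch, the pairs can also be listed directly from the correlation matrix. high_corr_pairs is a hypothetical helper and the data is invented; like the heatmap above, it only looks at positive correlations (corr >= threshold).

```python
import numpy as np
import pandas as pd


def high_corr_pairs(df, threshold=0.99):
    """Return (col_a, col_b, corr) for each column pair with correlation >= threshold."""
    corr = df.corr()
    # Keep only the strict upper triangle so each pair appears once and the diagonal is skipped.
    mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
    upper = corr.where(mask)
    pairs = []
    for col_b in upper.columns:
        for col_a in upper.index:
            value = upper.loc[col_a, col_b]
            if pd.notna(value) and value >= threshold:
                pairs.append((col_a, col_b, float(value)))
    return pairs


# Invented data: y is an exact linear function of x, z is unrelated.
toy = pd.DataFrame({
    "x": [1, 2, 3, 4, 5, 6],
    "y": [2, 4, 6, 8, 10, 12],
    "z": [5, 1, 4, 2, 6, 3],
})
pairs = high_corr_pairs(toy)
print(pairs)
```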

print(train.Census_OSArchitecture.nunique())
print(train.Processor.nunique())

3
3

Census_OSArchitecture vs Processor: 3 = 3, a tie. Let's look at their correlation with 'HasDetections':

train[['Census_OSArchitecture', 'Processor', 'HasDetections']].corr()

Their correlation coefficients with 'HasDetections' are both -0.0758, so either one can be removed. I choose to remove 'Processor':
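The selection rule used in this section (drop the feature with fewer distinct values, and compare correlations with HasDetections on a tie) can be sketched as a small helper. feature_to_drop is a hypothetical name, the toy data is invented, and the final fallback of dropping the second feature is an arbitrary assumption that mirrors the free choice made above.

```python
import numpy as np
import pandas as pd


def feature_to_drop(df, a, b, target="HasDetections"):
    """Pick which of two highly correlated features to drop.

    Drops the one with fewer distinct values. On a tie, drops the one with the
    smaller absolute correlation with the target. On a full tie, drops `b`.
    """
    n_a, n_b = df[a].nunique(), df[b].nunique()
    if n_a != n_b:
        return a if n_a < n_b else b
    corr_a = abs(df[a].corr(df[target]))
    corr_b = abs(df[b].corr(df[target]))
    if not np.isclose(corr_a, corr_b):
        return a if corr_a < corr_b else b
    return b  # full tie: arbitrary choice


# Invented toy data reusing two column names from the post.
toy = pd.DataFrame({
    "Platform": [0, 0, 1, 1, 2, 2],
    "OsVer": [0, 1, 2, 3, 4, 5],
    "HasDetections": [0, 1, 0, 1, 1, 0],
})
print(feature_to_drop(toy, "Platform", "OsVer"))
```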

corr_remove.append('Processor')
# droppable_features is the list we defined at the very beginning
droppable_features.extend(corr_remove)
print(len(droppable_features))
droppable_features

17
['Census_ProcessorClass',
 'Census_IsWIMBootEnabled',
 'IsBeta',
 'Census_IsFlightsDisabled',
 'Census_IsFlightingInternal',
 'AutoSampleOptIn',
 'Census_ThresholdOptIn',
 'SMode',
 'Census_IsPortableOperatingSystem',
 'PuaMode',
 'Census_DeviceFamily',
 'UacLuaenable',
 'Census_IsVirtualDevice',
 'Platform',
 'Census_OSSkuName',
 'Census_OSInstallLanguageIdentifier',
 'Processor']

Done. After analyzing the data, we end up with 17 features that can be removed.
