Everyone Do this at the Beginning!! - A Kaggle Data Preprocessing Scheme
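In data analysis, how the raw data is preprocessed, and what comes out of that preprocessing, has a major effect on the results. Current machine learning algorithms also need to be trained on data of sufficiently high quality to produce a good model.
Here I introduce a preprocessing approach for a malware feature dataset from Kaggle.
We can remove 17 features from this dataset!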
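- (M) mostly-missing features: features whose share of missing values is above 99%.
- (S) too-skewed features: features where the most frequent value accounts for more than 99% of all values.
- (C) highly-correlated features: compute the correlation between features, pick out the feature pairs with correlation >= 0.99, compare the number of distinct values within each pair, and drop the feature with fewer distinct values.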
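Using the three criteria above, we can remove 17 features in total.
Note: this article only shows the processing of train; the result is the same as when train and test are processed together.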
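The three criteria can be sketched as one small function. This is only an illustration on a made-up DataFrame: the name find_droppable, the toy columns and the single shared threshold are assumptions for the example, not part of the original kernel.

```python
import numpy as np
import pandas as pd


def find_droppable(df, threshold=0.99):
    """Return the columns flagged by the (M), (S) and (C) criteria (hypothetical helper)."""
    # (M) share of missing values per column
    missing = [c for c in df.columns if df[c].isnull().mean() > threshold]
    # (S) share of the single most frequent non-missing value
    skewed = [
        c for c in df.columns
        if df[c].notnull().any()
        and df[c].value_counts(normalize=True).iloc[0] > threshold
    ]
    # (C) highly correlated numeric pairs: flag the column with fewer distinct values
    corr = df.select_dtypes(include="number").corr()
    cols = list(corr.columns)
    correlated = []
    for i in range(len(cols)):
        for j in range(i + 1, len(cols)):
            a, b = cols[i], cols[j]
            if corr.loc[a, b] >= threshold:
                correlated.append(a if df[a].nunique() < df[b].nunique() else b)
    # dict.fromkeys removes duplicates while keeping the first-seen order
    return list(dict.fromkeys(missing + skewed + correlated))


x = np.arange(1000)
toy = pd.DataFrame({
    "mostly_missing": [1.0] + [np.nan] * 999,  # 99.9% missing
    "skewed": [0] * 995 + [1] * 5,             # one value covers 99.5% of rows
    "x": x,
    "x_coarse": (x // 2) * 2,                  # almost identical to x, fewer distinct values
    "noise": (x * 37) % 101,                   # unrelated filler column
})
flagged = find_droppable(toy)
print(flagged)
```

A column can be flagged by more than one criterion (here mostly_missing is also too skewed), which is why the helper de-duplicates its result.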
1. Load Data
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

Declaring the column dtypes up front helps speed up loading the file (it is over 3 GB). The dtype mapping is taken from another Kaggle kernel.

# referred https://www.kaggle.com/theoviel/load-the-totality-of-the-data
dtypes = {
    'MachineIdentifier': 'category',
    'ProductName': 'category',
    'EngineVersion': 'category',
    'AppVersion': 'category',
    'AvSigVersion': 'category',
    'IsBeta': 'int8',
    'RtpStateBitfield': 'float16',
    'IsSxsPassiveMode': 'int8',
    'DefaultBrowsersIdentifier': 'float32',
    'AVProductStatesIdentifier': 'float32',
    'AVProductsInstalled': 'float16',
    'AVProductsEnabled': 'float16',
    'HasTpm': 'int8',
    'CountryIdentifier': 'int16',
    'CityIdentifier': 'float32',
    'OrganizationIdentifier': 'float16',
    'GeoNameIdentifier': 'float16',
    'LocaleEnglishNameIdentifier': 'int16',
    'Platform': 'category',
    'Processor': 'category',
    'OsVer': 'category',
    'OsBuild': 'int16',
    'OsSuite': 'int16',
    'OsPlatformSubRelease': 'category',
    'OsBuildLab': 'category',
    'SkuEdition': 'category',
    'IsProtected': 'float16',
    'AutoSampleOptIn': 'int8',
    'PuaMode': 'category',
    'SMode': 'float16',
    'IeVerIdentifier': 'float16',
    'SmartScreen': 'category',
    'Firewall': 'float16',
    'UacLuaenable': 'float64',  # was 'float32'
    'Census_MDC2FormFactor': 'category',
    'Census_DeviceFamily': 'category',
    'Census_OEMNameIdentifier': 'float32',  # was 'float16'
    'Census_OEMModelIdentifier': 'float32',
    'Census_ProcessorCoreCount': 'float16',
    'Census_ProcessorManufacturerIdentifier': 'float16',
    'Census_ProcessorModelIdentifier': 'float32',  # was 'float16'
    'Census_ProcessorClass': 'category',
    'Census_PrimaryDiskTotalCapacity': 'float64',  # was 'float32'
    'Census_PrimaryDiskTypeName': 'category',
    'Census_SystemVolumeTotalCapacity': 'float64',  # was 'float32'
    'Census_HasOpticalDiskDrive': 'int8',
    'Census_TotalPhysicalRAM': 'float32',
    'Census_ChassisTypeName': 'category',
    'Census_InternalPrimaryDiagonalDisplaySizeInInches': 'float32',  # was 'float16'
    'Census_InternalPrimaryDisplayResolutionHorizontal': 'float32',  # was 'float16'
    'Census_InternalPrimaryDisplayResolutionVertical': 'float32',  # was 'float16'
    'Census_PowerPlatformRoleName': 'category',
    'Census_InternalBatteryType': 'category',
    'Census_InternalBatteryNumberOfCharges': 'float64',  # was 'float32'
    'Census_OSVersion': 'category',
    'Census_OSArchitecture': 'category',
    'Census_OSBranch': 'category',
    'Census_OSBuildNumber': 'int16',
    'Census_OSBuildRevision': 'int32',
    'Census_OSEdition': 'category',
    'Census_OSSkuName': 'category',
    'Census_OSInstallTypeName': 'category',
    'Census_OSInstallLanguageIdentifier': 'float16',
    'Census_OSUILocaleIdentifier': 'int16',
    'Census_OSWUAutoUpdateOptionsName': 'category',
    'Census_IsPortableOperatingSystem': 'int8',
    'Census_GenuineStateName': 'category',
    'Census_ActivationChannel': 'category',
    'Census_IsFlightingInternal': 'float16',
    'Census_IsFlightsDisabled': 'float16',
    'Census_FlightRing': 'category',
    'Census_ThresholdOptIn': 'float16',
    'Census_FirmwareManufacturerIdentifier': 'float16',
    'Census_FirmwareVersionIdentifier': 'float32',
    'Census_IsSecureBootEnabled': 'int8',
    'Census_IsWIMBootEnabled': 'float16',
    'Census_IsVirtualDevice': 'float16',
    'Census_IsTouchEnabled': 'int8',
    'Census_IsPenCapable': 'int8',
    'Census_IsAlwaysOnAlwaysConnectedCapable': 'float16',
    'Wdft_IsGamer': 'float16',
    'Wdft_RegionIdentifier': 'float16',
    'HasDetections': 'int8'
}

train = pd.read_csv('../input/train.csv', dtype=dtypes)
train.shape

(8921483, 83)

2. Feature Engineering
Define an empty list to hold the names of the features to remove.
droppable_features = []
(1)
- Mostly-missing columns
# Share of missing values for each feature
(train.isnull().sum()/train.shape[0]).sort_values(ascending=False)

PuaMode                                              0.999741
Census_ProcessorClass                                0.995894
DefaultBrowsersIdentifier                            0.951416
Census_IsFlightingInternal                           0.830440
Census_InternalBatteryType                           0.710468
Census_ThresholdOptIn                                0.635245
Census_IsWIMBootEnabled                              0.634390
SmartScreen                                          0.356108
OrganizationIdentifier                               0.308415
SMode                                                0.060277
CityIdentifier                                       0.036475
Wdft_IsGamer                                         0.034014
Wdft_RegionIdentifier                                0.034014
Census_InternalBatteryNumberOfCharges                0.030124
Census_FirmwareManufacturerIdentifier                0.020541
Census_IsFlightsDisabled                             0.017993
Census_FirmwareVersionIdentifier                     0.017949
Census_OEMModelIdentifier                            0.011459
Census_OEMNameIdentifier                             0.010702
Firewall                                             0.010239
Census_TotalPhysicalRAM                              0.009027
Census_IsAlwaysOnAlwaysConnectedCapable              0.007997
Census_OSInstallLanguageIdentifier                   0.006735
IeVerIdentifier                                      0.006601
Census_PrimaryDiskTotalCapacity                      0.005943
Census_SystemVolumeTotalCapacity                     0.005941
Census_InternalPrimaryDiagonalDisplaySizeInInches    0.005283
Census_InternalPrimaryDisplayResolutionHorizontal    0.005267
Census_InternalPrimaryDisplayResolutionVertical      0.005267
Census_ProcessorModelIdentifier                      0.004634
...
ProductName                                          0.000000
HasTpm                                               0.000000
OsBuild                                              0.000000
IsBeta                                               0.000000
OsSuite                                              0.000000
IsSxsPassiveMode                                     0.000000
HasDetections                                        0.000000
SkuEdition                                           0.000000
Census_OSInstallTypeName                             0.000000
Census_IsPenCapable                                  0.000000
Census_IsTouchEnabled                                0.000000
Census_IsSecureBootEnabled                           0.000000
Census_FlightRing                                    0.000000
Census_ActivationChannel                             0.000000
Census_GenuineStateName                              0.000000
Census_IsPortableOperatingSystem                     0.000000
Census_OSWUAutoUpdateOptionsName                     0.000000
Census_OSUILocaleIdentifier                          0.000000
Census_OSSkuName                                     0.000000
AutoSampleOptIn                                      0.000000
Census_OSEdition                                     0.000000
Census_OSBuildRevision                               0.000000
Census_OSBuildNumber                                 0.000000
Census_OSBranch                                      0.000000
Census_OSArchitecture                                0.000000
Census_OSVersion                                     0.000000
Census_HasOpticalDiskDrive                           0.000000
Census_DeviceFamily                                  0.000000
Census_MDC2FormFactor                                0.000000
MachineIdentifier                                    0.000000
Length: 83, dtype: float64
As we can see, two features have a missing-value share above 99%, so we remove them:
# Put them into the list defined earlier
droppable_features.append('PuaMode')
droppable_features.append('Census_ProcessorClass')
- Too skewed columns
# pd.options.display lets you customise how values are displayed.
# '{:,.4f}' formats floats with a thousands separator and 4 decimal places;
# the number after the decimal point sets how many decimals are kept.
# train[c].nunique(): how many distinct values the feature takes
# value_counts(): how many times each value occurs
# value_counts(normalize=True): the share of each value, sorted in descending order by default
# value_counts(normalize=True).values[0]: the share of the most frequent value
pd.options.display.float_format = '{:,.4f}'.format
sk_df = pd.DataFrame([{'column': c,
                       'uniq': train[c].nunique(),
                       'skewness': train[c].value_counts(normalize=True).values[0] * 100}
                      for c in train.columns])
sk_df = sk_df.sort_values('skewness', ascending=False)
sk_df

    column                                             skewness  uniq
75  Census_IsWIMBootEnabled                            100.0000  2
5   IsBeta                                             99.9992   2
69  Census_IsFlightsDisabled                           99.9990   2
68  Census_IsFlightingInternal                         99.9986   2
27  AutoSampleOptIn                                    99.9971   2
71  Census_ThresholdOptIn                              99.9749   2
29  SMode                                              99.9537   2
65  Census_IsPortableOperatingSystem                   99.9455   2
28  PuaMode                                            99.9134   2
35  Census_DeviceFamily                                99.8383   3
33  UacLuaenable                                       99.3925   11
76  Census_IsVirtualDevice                             99.2961   2
1   ProductName                                        98.9356   6
12  HasTpm                                             98.7971   2
7   IsSxsPassiveMode                                   98.2666   2
32  Firewall                                           97.8583   2
11  AVProductsEnabled                                  97.3984   6
6   RtpStateBitfield                                   97.3262   7
20  OsVer                                              96.7613   58
18  Platform                                           96.6063   4
78  Census_IsPenCapable                                96.1929   2
26  IsProtected                                        94.5624   2
79  Census_IsAlwaysOnAlwaysConnectedCapable            94.2581   2
70  Census_FlightRing                                  93.6580   10
45  Census_HasOpticalDiskDrive                         92.2813   2
55  Census_OSArchitecture                              90.8580   3
19  Processor                                          90.8530   3
66  Census_GenuineStateName                            88.2992   5
39  Census_ProcessorManufacturerIdentifier             88.2789   7
77  Census_IsTouchEnabled                              87.4457   2
... ...                                                ...       ...
57  Census_OSBuildNumber                               44.9351   165
64  Census_OSWUAutoUpdateOptionsName                   44.3256   6
23  OsPlatformSubRelease                               43.8887   9
21  OsBuild                                            43.8887   76
30  IeVerIdentifier                                    43.8454   303
2   EngineVersion                                      43.0990   70
24  OsBuildLab                                         41.0045   663
59  Census_OSEdition                                   38.8948   33
60  Census_OSSkuName                                   38.8934   30
62  Census_OSInstallLanguageIdentifier                 35.8777   39
63  Census_OSUILocaleIdentifier                        35.5414   147
48  Census_InternalPrimaryDiagonalDisplaySizeInInches  34.3398   785
42  Census_PrimaryDiskTotalCapacity                    32.0408   5735
72  Census_FirmwareManufacturerIdentifier              30.8882   712
61  Census_OSInstallTypeName                           29.2332   9
17  LocaleEnglishNameIdentifier                        23.4780   276
81  Wdft_RegionIdentifier                              20.8877   15
16  GeoNameIdentifier                                  17.1716   292
58  Census_OSBuildRevision                             15.8453   285
54  Census_OSVersion                                   15.8452   469
36  Census_OEMNameIdentifier                           14.5850   3832
8   DefaultBrowsersIdentifier                          10.6257   2017
13  CountryIdentifier                                  4.4519    222
37  Census_OEMModelIdentifier                          3.4559    175365
40  Census_ProcessorModelIdentifier                    3.2576    3428
4   AvSigVersion                                       1.1469    8531
14  CityIdentifier                                     1.1030    107366
73  Census_FirmwareVersionIdentifier                   1.0228    50494
44  Census_SystemVolumeTotalCapacity                   0.5863    536848
0   MachineIdentifier                                  0.0000    8921483

83 rows × 3 columns
As we can see, for 12 features the most frequent value accounts for more than 99% of the values, so we remove them:
droppable_features.extend(sk_df[sk_df.skewness > 99].column.tolist())
droppable_features

['PuaMode',
 'Census_ProcessorClass',
 'Census_IsWIMBootEnabled',
 'IsBeta',
 'Census_IsFlightsDisabled',
 'Census_IsFlightingInternal',
 'AutoSampleOptIn',
 'Census_ThresholdOptIn',
 'SMode',
 'Census_IsPortableOperatingSystem',
 'PuaMode',
 'Census_DeviceFamily',
 'UacLuaenable',
 'Census_IsVirtualDevice']

Notice that 'PuaMode' appears twice in the list of features to remove, so we drop one of the two entries:
# PuaMode is duplicated in the two categories.
droppable_features.remove('PuaMode')

# Drop these columns.
# axis=1: operate on columns
# inplace=True: modify the original data instead of creating a new object
train.drop(droppable_features, axis=1, inplace=True)

So far we have removed 2 + (12 - 1) = 13 features.
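As a side note, the duplicate entry can be avoided altogether by building the list in a way that keeps only the first occurrence of each name. The sketch below uses shortened, illustrative lists (mostly_missing and too_skewed are made-up variable names) and relies on the fact that Python dicts preserve insertion order.

```python
# Shortened example lists; the names come from the article, the variables are hypothetical
mostly_missing = ['PuaMode', 'Census_ProcessorClass']
too_skewed = ['Census_IsWIMBootEnabled', 'IsBeta', 'PuaMode']

# dict.fromkeys keeps the first occurrence of each name and preserves order
droppable = list(dict.fromkeys(mostly_missing + too_skewed))
print(droppable)
```

Note that list.remove() deletes the first matching entry, so after the removal above 'PuaMode' stays at its later position in the list. The set of columns is the same either way.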
(2)
In addition, the remaining features still contain many missing values (NaN), which we need to handle.
# Share of missing values for each feature
# .isnull().sum(): number of missing values in each feature
null_counts = train.isnull().sum()
null_counts = null_counts / train.shape[0]
null_counts[null_counts > 0.1]

DefaultBrowsersIdentifier    0.9514
OrganizationIdentifier       0.3084
SmartScreen                  0.3561
Census_InternalBatteryType   0.7105
dtype: float64

As we can see, 4 features contain a large share of missing values (NaN).
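1. DefaultBrowsersIdentifier
train.DefaultBrowsersIdentifier.value_counts().head(5)

239.0000     46056
3,195.0000   42692
1,632.0000   28751
3,176.0000   24220
146.0000     20756
Name: DefaultBrowsersIdentifier, dtype: int64

# .fillna(0) returns a copy in which every missing value is replaced by 0.
# Assigning the result back to the column updates train; without the assignment
# (or inplace=True) the original data is unchanged and the missing values remain.
# The original call train.DefaultBrowsersIdentifier.fillna(0, inplace=True) may not
# update train in recent pandas versions, so the assignment form is used here.
train['DefaultBrowsersIdentifier'] = train['DefaultBrowsersIdentifier'].fillna(0)

2. SmartScreen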
# .value_counts(): how many times each value of the feature occurs
train.SmartScreen.value_counts()

RequireAdmin    4316183
ExistsNotSet    1046183
Off             186553
Warn            135483
Prompt          34533
Block           22533
off             1350
On              731
&#x02;          416
&#x01;          335
on              147
requireadmin    10
OFF             4
0               3
Promt           2
requireAdmin    1
Enabled         1
prompt          1
warn            1
00000000        1
&#x03;          1
Name: SmartScreen, dtype: int64

The values of 'SmartScreen' are messy, so we map them to more regular strings:

trans_dict = {
    'off': 'Off', '&#x02;': '2', '&#x01;': '1', 'on': 'On', 'requireadmin': 'RequireAdmin', 'OFF': 'Off',
    'Promt': 'Prompt', 'requireAdmin': 'RequireAdmin', 'prompt': 'Prompt', 'warn': 'Warn',
    '00000000': '0', '&#x03;': '3', np.nan: 'NoExist'
}
# .replace(): substitutes values according to the mapping
train.replace({'SmartScreen': trans_dict}, inplace=True)
train.SmartScreen.isnull().sum()

0

Why 0? Because every missing value has now been replaced by 'NoExist'.
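A different way to tidy such values, shown here only as a sketch and not as what the kernel does, is to lower-case each value and look it up in a table of canonical spellings. The names canonical and normalize_smartscreen are made up for this example, and values without a canonical form (such as 'Enabled') are left untouched.

```python
import numpy as np
import pandas as pd

# Hypothetical table: lower-cased spelling -> canonical spelling
canonical = {
    'off': 'Off', 'on': 'On', 'requireadmin': 'RequireAdmin',
    'prompt': 'Prompt', 'promt': 'Prompt', 'warn': 'Warn',
    'block': 'Block', 'existsnotset': 'ExistsNotSet',
}


def normalize_smartscreen(s):
    """Map case variants and the 'Promt' typo to one spelling; missing -> 'NoExist'."""
    s = s.astype('object')
    mapped = s.str.lower().map(canonical)  # NaN where no canonical form is known
    out = mapped.where(mapped.notna(), s)  # fall back to the original value
    return out.fillna('NoExist')


toy = pd.Series(['RequireAdmin', 'off', 'OFF', 'Promt', 'requireAdmin', np.nan, 'Enabled'])
cleaned = normalize_smartscreen(toy)
print(cleaned.tolist())
```

The explicit dictionary in the article has the advantage that every substitution is visible; the lower-casing variant is shorter when many case variants exist.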
3. OrganizationIdentifier
train.OrganizationIdentifier.value_counts()

27.0000   4196457
18.0000   1764175
48.0000   63845
50.0000   45502
11.0000   19436
37.0000   19398
49.0000   13627
46.0000   10974
14.0000   4713
32.0000   4045
36.0000   3909
52.0000   3043
33.0000   2896
2.0000    2595
5.0000    1990
40.0000   1648
28.0000   1591
4.0000    1385
10.0000   1083
51.0000   917
20.0000   915
1.0000    893
8.0000    723
22.0000   418
39.0000   413
6.0000    412
31.0000   398
21.0000   397
47.0000   385
3.0000    331
16.0000   242
19.0000   172
26.0000   160
44.0000   150
29.0000   135
42.0000   132
7.0000    98
41.0000   77
45.0000   73
30.0000   64
43.0000   60
35.0000   32
23.0000   20
15.0000   13
25.0000   12
12.0000   7
34.0000   2
38.0000   1
17.0000   1
Name: OrganizationIdentifier, dtype: int64

This feature holds ID-like data, so we can fill its missing values with 0:

train.replace({'OrganizationIdentifier': {np.nan: 0}}, inplace=True)

4. Census_InternalBatteryType
pd.options.display.max_rows = 99
train.Census_InternalBatteryType.value_counts()

lion      2028256
li-i      245617
#         183998
lip       62099
liio      32635
li p      8383
li        6708
nimh      4614
real      2744
bq20      2302
pbac      2274
vbox      1454
unkn      533
lgi0      399
lipo      198
lhp0      182
4cel      170
lipp      83
ithi      79
batt      60
ram       35
bad       33
virt      33
pad0      22
lit       16
ca48      16
a132      10
ots0      9
lai0      8
????      8
lio       5
4lio      4
lio       4
asmb      4
li-p      4
0x0b      3
lgs0      3
icp3      3
3ion      2
a140      2
h00j      2
5nm1      2
lhpo      2
a138      2
lilo      1
li-h      1
lp        1
li?       1
ion       1
pbso      1
3500      1
6ion      1
@i        1
li        1
sams      1
ip        1
8         1
#TAB#     1
l&#TAB#   1
lio       1
˙˙˙       1
l         1
cl53      1
li??      1
pa50      1
í-i       1
÷?ó?      1
li-l      1
h4°s      1
d         1
lgl0      1
4ion      1
0ts0      1
sail      1
p-sn      1
a130      1
2337      1
l???      1
Name: Census_InternalBatteryType, dtype: int64

In this feature, missing values, '˙˙˙' and 'unkn' all mean "unknown", so we map all of them to 'unknown':

trans_dict = {
    '˙˙˙': 'unknown', 'unkn': 'unknown', np.nan: 'unknown'
}
train.replace({'Census_InternalBatteryType': trans_dict}, inplace=True)
(3)
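Note: these 4 features have a missing-value share above 10%. We cannot delete the samples that miss them; even though the values are missing, we fill them with other values. Many other features have a missing-value share between 0% and 10%. For those we remove the missing values, which in practice means removing the samples (rows) that contain them.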
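train.shape

(8921483, 70)

# .dropna(inplace=True): delete every row that contains NaN; the remaining rows keep their original index values
train.dropna(inplace=True)
train.shape

(7667789, 70)

In the end, about 14% of the samples were removed.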
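The fill-then-drop rule and the 14% figure can be checked with a few lines. The toy frame and its column names below are invented for illustration; only the two row counts come from the shapes printed above.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    'many_missing': [np.nan, np.nan, 3.0, np.nan, 5.0],  # filled, rows are kept
    'few_missing': [1.0, 2.0, np.nan, 4.0, 5.0],         # rows with NaN are dropped
})
toy['many_missing'] = toy['many_missing'].fillna(0)

rows_before = len(toy)
toy = toy.dropna()
removed_share = 1 - len(toy) / rows_before
print(f"{removed_share:.0%} of the toy rows removed")

# The same arithmetic applied to the shapes reported in the article
article_share = 1 - 7667789 / 8921483
print(f"{article_share:.1%} of the train rows removed")
```

Only the row with a missing few_missing value disappears from the toy frame, because many_missing was filled before dropna() ran.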
There is also the feature 'MachineIdentifier'. It is of no use for malware detection, so we delete it as well:
train.drop('MachineIdentifier', axis=1, inplace=True)
So far we have deleted 13 + 1 = 14 features.
(4)
To make the data usable for machine learning, some columns first need to be converted to the category dtype and then encoded as integers.
# Convert 'SmartScreen' and 'Census_InternalBatteryType' to the category dtype
train['SmartScreen'] = train.SmartScreen.astype('category')
train['Census_InternalBatteryType'] = train.Census_InternalBatteryType.astype('category')

# cate_cols: names of the features whose dtype is category
cate_cols = train.select_dtypes(include='category').columns.tolist()

from sklearn.preprocessing import LabelEncoder
le = LabelEncoder()
for col in cate_cols:
    train[col] = le.fit_transform(train[col])
# After le.fit_transform(), columns such as train['SmartScreen'] and train['Census_InternalBatteryType'] are of type int64

For details on LabelEncoder, see the scikit-learn documentation.
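To see what LabelEncoder does, here is a small sketch on invented data. It differs from the loop above in one respect, which is an assumption of this example rather than the article's approach: it keeps one encoder per column in a dict (encoders is a made-up name), so the integer codes can be translated back later. A single shared encoder only remembers the classes of the last column it was fitted on.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

toy = pd.DataFrame({
    'SmartScreen': ['Off', 'RequireAdmin', 'Off', 'Warn'],
    'Battery': ['lion', 'unknown', 'lion', 'li-i'],
})

encoders = {}
for col in toy.columns:
    enc = LabelEncoder()
    # Classes are sorted, then each value is replaced by the index of its class
    toy[col] = enc.fit_transform(toy[col])
    encoders[col] = enc

print(toy)
print(list(encoders['SmartScreen'].classes_))
decoded = list(encoders['SmartScreen'].inverse_transform(toy['SmartScreen']))
print(decoded)
```

The integer codes carry no order or distance; they are just labels, which tree-based models usually tolerate better than linear ones.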
(5)
Next we use a routine that reduces the memory footprint of train. I call it the "memory reduction algorithm".
The code is as follows:
def reduce_mem_usage(df):
    """ iterate through all the columns of a dataframe and modify the data type
        to reduce memory usage.
    """
    # .memory_usage() returns the memory used by each column, in bytes
    start_mem = df.memory_usage().sum() / 1024**2
    print('Memory usage of dataframe is {:.2f} MB'.format(start_mem))

    for col in df.columns:
        col_type = df[col].dtype

        if col_type != object:
            c_min = df[col].min()
            c_max = df[col].max()
            if str(col_type)[:3] == 'int':
                # np.iinfo() reports the smallest and largest value an integer type can hold
                if c_min > np.iinfo(np.int8).min and c_max < np.iinfo(np.int8).max:
                    df[col] = df[col].astype(np.int8)
                elif c_min > np.iinfo(np.int16).min and c_max < np.iinfo(np.int16).max:
                    df[col] = df[col].astype(np.int16)
                elif c_min > np.iinfo(np.int32).min and c_max < np.iinfo(np.int32).max:
                    df[col] = df[col].astype(np.int32)
                elif c_min > np.iinfo(np.int64).min and c_max < np.iinfo(np.int64).max:
                    df[col] = df[col].astype(np.int64)
            else:
                if c_min > np.finfo(np.float16).min and c_max < np.finfo(np.float16).max:
                    df[col] = df[col].astype(np.float16)
                elif c_min > np.finfo(np.float32).min and c_max < np.finfo(np.float32).max:
                    df[col] = df[col].astype(np.float32)
                else:
                    df[col] = df[col].astype(np.float64)
        # Columns that are neither integer nor float (e.g. strings) are converted to category
        else:
            df[col] = df[col].astype('category')

    end_mem = df.memory_usage().sum() / 1024**2
    print('Memory usage after optimization is: {:.2f} MB'.format(end_mem))
    print('Decreased by {:.1f}%'.format(100 * (start_mem - end_mem) / start_mem))

    return df

%time train = reduce_mem_usage(train)

CPU times: user 0 ns, sys: 0 ns, total: 0 ns
Wall time: 5.48 μs
Memory usage of dataframe is 2464.34 MB
Memory usage after optimization is: 965.26 MB
Decreased by 60.8%
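The range checks in reduce_mem_usage rest on np.iinfo() and np.finfo(). The sketch below prints a few of those limits and adds a small helper, smallest_int_type, which is a made-up name for this example and uses inclusive bounds (the function above uses strict ones, so a column whose maximum is exactly 127 ends up as int16 there).

```python
import numpy as np

# Representable range of a few integer types
for t in (np.int8, np.int16, np.int32):
    info = np.iinfo(t)
    print(t.__name__, info.min, info.max)

# Largest finite float16 value
print(np.finfo(np.float16).max)


def smallest_int_type(lo, hi):
    """Return the narrowest signed integer type that can hold every value in [lo, hi]."""
    for t in (np.int8, np.int16, np.int32, np.int64):
        info = np.iinfo(t)
        if info.min <= lo and hi <= info.max:
            return t
    raise OverflowError('range does not fit into int64')


print(smallest_int_type(0, 127).__name__)
print(smallest_int_type(-129, 0).__name__)

# Fitting into the range of float16 does not mean the value is stored exactly
print(float(np.float16(2049)))
```

The last line is worth keeping in mind for identifier-like columns: float16 covers values up to 65504, but it cannot represent every integer above 2048, so a range check alone can silently merge neighbouring IDs.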
(6)
Because there are so many features, we build one correlation matrix for every 10 features.
cols = train.columns.tolist()
corr_remove = []  # holds the features to remove

import seaborn as sns
plt.figure(figsize=(10,10))
co_cols = cols[:10]
co_cols.append('HasDetections')
# sns.heatmap(): draw the correlation matrix as a heatmap
# .corr(): pairwise correlation
# cmap: colour map of the heatmap
# annot=True: print every correlation coefficient in its cell
# center=0.0: a coefficient of 0.0 gets the middle colour of the map
sns.heatmap(train[co_cols].corr(), cmap='RdBu_r', annot=True, center=0.0)
plt.title('Correlation between 1 ~ 10th columns')
plt.show()

No correlation coefficient >= 0.99 shows up. Let's keep going.
co_cols = cols[10:20]
co_cols.append('HasDetections')
plt.figure(figsize=(10,10))
sns.heatmap(train[co_cols].corr(), cmap='RdBu_r', annot=True, center=0.0)
plt.title('Correlation between 11 ~ 20th columns')
plt.show()

Here is one. We remove the feature with fewer distinct values:
print(train.Platform.nunique())
print(train.OsVer.nunique())

3
45

Platform vs OsVer: 3 < 45, so remove Platform.

# Remember, corr_remove is the empty list defined above for the names of features to remove
corr_remove.append('Platform')

OK, let's continue.
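The rule "drop the member of the pair with fewer distinct values" is easy to wrap in a helper. The function name less_diverse and the toy values are assumptions for this sketch; on a tie it returns None so that the caller can decide by other means.

```python
import pandas as pd


def less_diverse(df, a, b):
    """Return whichever of columns a and b has fewer distinct values, or None on a tie."""
    n_a, n_b = df[a].nunique(), df[b].nunique()
    if n_a == n_b:
        return None
    return a if n_a < n_b else b


toy = pd.DataFrame({
    'Platform': ['windows10', 'windows10', 'windows8', 'windows10'],
    'OsVer': ['10.0.0.0', '10.0.1.0', '6.3.0.0', '10.0.2.0'],
})
to_drop = less_diverse(toy, 'Platform', 'OsVer')
print(to_drop)
```

The idea behind the rule is that, of two nearly redundant columns, the one with more distinct values is likely to carry finer-grained information.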
co_cols = cols[20:30]
co_cols.append('HasDetections')
plt.figure(figsize=(10,10))
sns.heatmap(train[co_cols].corr(), cmap='RdBu_r', annot=True, center=0.0)
plt.title('Correlation between 21 ~ 30th columns')
plt.show()

Unfortunately there is no correlation coefficient >= 0.99 here. Don't lose heart, keep going.
co_cols = cols[30:40]
co_cols.append('HasDetections')
plt.figure(figsize=(10,10))
sns.heatmap(train[co_cols].corr(), cmap='RdBu_r', annot=True, center=0.0)
plt.title('Correlation between 31 ~ 40th columns')
plt.show()

Still nothing. On to the next group.
co_cols = cols[40:50]
co_cols.append('HasDetections')
plt.figure(figsize=(10,10))
sns.heatmap(train[co_cols].corr(), cmap='RdBu_r', annot=True, center=0.0)
plt.title('Correlation between 41 ~ 50th columns')
plt.show()

Still nothing. Once more.
co_cols = cols[50:60]
co_cols.append('HasDetections')
plt.figure(figsize=(10,10))
sns.heatmap(train[co_cols].corr(), cmap='RdBu_r', annot=True, center=0)
plt.title('Correlation between 51 ~ 60th columns')
plt.show()

We handle this the same way as the last time a correlation coefficient >= 0.99 was found:
print(train.Census_OSEdition.nunique())
print(train.Census_OSSkuName.nunique(), '\n')
print(train.Census_OSInstallLanguageIdentifier.nunique())
print(train.Census_OSUILocaleIdentifier.nunique())

29
25

39
144

Census_OSEdition vs Census_OSSkuName: 29 > 25, so remove Census_OSSkuName.
Census_OSInstallLanguageIdentifier vs Census_OSUILocaleIdentifier: 39 < 144, so remove Census_OSInstallLanguageIdentifier.

corr_remove.append('Census_OSSkuName')
corr_remove.append('Census_OSInstallLanguageIdentifier')

Let's finish what we started. Moving on.
co_cols = cols[60:]
#co_cols.append('HasDetections')
plt.figure(figsize=(10,10))
sns.heatmap(train[co_cols].corr(), cmap='RdBu_r', annot=True, center=0)
plt.title('Correlation between from 61th to the last columns')
plt.show()

The correlation analysis of all groups is complete.
Across the groups we found 3 features to remove.
corr_remove

['Platform', 'Census_OSSkuName', 'Census_OSInstallLanguageIdentifier']

The code to remove these 3 features:

train.drop(corr_remove, axis=1, inplace=True)

So far we have removed 13 + 3 = 16 features (not counting the MachineIdentifier ID column).
Now build the correlation matrix for all of the remaining data:
corr = train.corr()
high_corr = (corr >= 0.99).astype('uint8')
plt.figure(figsize=(15,15))
sns.heatmap(high_corr, cmap='RdBu_r', annot=True, center=0.0)
plt.show()

Two features with a correlation >= 0.99 show up.
print(train.Census_OSArchitecture.nunique())
print(train.Processor.nunique())

3
3

Census_OSArchitecture vs Processor: 3 = 3, so the counts are equal. Let's look at their correlation with 'HasDetections':

train[['Census_OSArchitecture', 'Processor', 'HasDetections']].corr()

Both have a correlation coefficient of -0.0758 with 'HasDetections', so either one can be removed. I choose to remove 'Processor':
corr_remove.append('Processor')
# droppable_features is the list we defined at the very beginning
droppable_features.extend(corr_remove)
print(len(droppable_features))
droppable_features

17
['Census_ProcessorClass',
 'Census_IsWIMBootEnabled',
 'IsBeta',
 'Census_IsFlightsDisabled',
 'Census_IsFlightingInternal',
 'AutoSampleOptIn',
 'Census_ThresholdOptIn',
 'SMode',
 'Census_IsPortableOperatingSystem',
 'PuaMode',
 'Census_DeviceFamily',
 'UacLuaenable',
 'Census_IsVirtualDevice',
 'Platform',
 'Census_OSSkuName',
 'Census_OSInstallLanguageIdentifier',
 'Processor']

All done. After analysing the data, 17 features can be removed.
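Reading pairs off heatmaps works, but the same pairs can also be listed programmatically. The sketch below is one possible way, not the kernel's method: it masks the correlation matrix to its strict upper triangle so that each pair appears once, then keeps the pairs at or above the threshold. The function name high_corr_pairs and the toy columns are invented for the example.

```python
import numpy as np
import pandas as pd


def high_corr_pairs(df, threshold=0.99):
    """List (col_a, col_b, corr) for every column pair with correlation >= threshold."""
    corr = df.corr()
    # True above the diagonal only, so self-correlations and mirrored pairs are skipped
    mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
    pairs = corr.where(mask).stack()
    pairs = pairs[pairs >= threshold]  # masked (NaN) entries never pass this test
    return [(a, b, float(v)) for (a, b), v in pairs.items()]


a = np.arange(100)
toy = pd.DataFrame({
    'a': a,
    'b': 2 * a + 1,  # a linear function of a, so perfectly correlated with it
    'c': a % 7,      # unrelated filler column
})
found = high_corr_pairs(toy)
print(found)
```

Like the heatmaps above, this only looks at positive correlations; comparing the absolute value instead would also catch strongly negative pairs.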