Python Scientific Computing Notes (14): pandas Data Filtering, Cleaning, and Transformation
Main contents of this post:
1) Removing duplicate data (duplicated)
2) Transforming data with a function or mapping (map)
3) Replacing values (replace)
4) Renaming axis indexes
5) Detecting and filtering outliers (boolean indexing)
6) Random sampling and selecting random subsets
7) Computing dummy variables (get_dummies)
1) Removing duplicate data
To check whether a column's values are unique, use .is_unique.
To check whether rows are duplicated, use .duplicated().
```python
import pandas as pd
import numpy as np

data = pd.DataFrame({'k1': ['one'] * 3 + ['two'] * 4,
                     'k2': [1, 1, 2, 3, 3, 4, 4]})
data
```

|   | k1  | k2 |
|---|-----|----|
| 0 | one | 1  |
| 1 | one | 1  |
| 2 | one | 2  |
| 3 | two | 3  |
| 4 | two | 3  |
| 5 | two | 4  |
| 6 | two | 4  |
Check for duplicates in columns and rows:

```python
data.index.is_unique    # check the index -> True
data.k1.is_unique       # check column k1 -> False
data['k2'].is_unique    # check column k2 -> False

data.is_unique
# AttributeError: 'DataFrame' object has no attribute 'is_unique'

data.duplicated()       # check rows
# 0    False
# 1     True
# 2    False
# 3    False
# 4     True
# 5    False
# 6     True
# dtype: bool
```

Remove duplicate rows:
```python
data.drop_duplicates()
```

|   | k1  | k2 |
|---|-----|----|
| 0 | one | 1  |
| 2 | one | 2  |
| 3 | two | 3  |
| 5 | two | 4  |
Notes on removing duplicates:
1) duplicated and drop_duplicates work on both Series and DataFrame; here they were called on the DataFrame to flag and drop duplicate rows.
2) is_unique is an attribute of Series and Index only, not of DataFrame, which is why data.is_unique raised an AttributeError above.
Both methods also take parameters worth knowing; see the sketch below.
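By default, duplicates are judged over all columns and the first occurrence is kept. A minimal sketch of the extra options, reusing the data DataFrame above (keep='last' requires pandas 0.17 or newer):

```python
# consider only column 'k1' when deciding what counts as a duplicate
data.drop_duplicates(['k1'])

# keep the last occurrence instead of the first
data.drop_duplicates(['k1', 'k2'], keep='last')

# duplicated()/drop_duplicates() also work on a single Series
data['k2'].duplicated()
data['k2'].drop_duplicates()
```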
2) Transforming data with a function or mapping
```python
data1 = pd.DataFrame({'food': ['bacon', 'pork', 'bacon', 'Pastrami', 'beef',
                               'Bacon', 'pastrami', 'ham', 'lox'],
                      'ounces': [4, 3, 12, 6, 7.5, 8, 3, 5, 6]})
data1
```

|   | food     | ounces |
|---|----------|--------|
| 0 | bacon    | 4.0    |
| 1 | pork     | 3.0    |
| 2 | bacon    | 12.0   |
| 3 | Pastrami | 6.0    |
| 4 | beef     | 7.5    |
| 5 | Bacon    | 8.0    |
| 6 | pastrami | 3.0    |
| 7 | ham      | 5.0    |
| 8 | lox      | 6.0    |
Add a column indicating which animal each kind of meat comes from.
```python
# step 1: build a mapping from meat to animal
meat_to_animal = {
    'bacon': 'pig',
    'pork': 'pig',
    'pastrami': 'cow',
    'beef': 'cow',
    'ham': 'pig',
    'lox': 'salmon',
}
```

The map method of a Series accepts a function or a dict-like object holding the mapping. Since the keys of meat_to_animal are all lowercase, the food names must be lowercased first so that their case matches.
```python
# step 2: map food -> animal
data1['animal'] = data1['food'].map(str.lower).map(meat_to_animal)
data1
```

|   | food     | ounces | animal |
|---|----------|--------|--------|
| 0 | bacon    | 4.0    | pig    |
| 1 | pork     | 3.0    | pig    |
| 2 | bacon    | 12.0   | pig    |
| 3 | Pastrami | 6.0    | cow    |
| 4 | beef     | 7.5    | cow    |
| 5 | Bacon    | 8.0    | pig    |
| 6 | pastrami | 3.0    | cow    |
| 7 | ham      | 5.0    | pig    |
| 8 | lox      | 6.0    | salmon |
map is a convenient way to perform element-wise transformations and other data-cleaning work.
Note that map itself returns a new Series and does not modify its input; data1 changes here only because the result is assigned back as a new column, data1['animal']. The same transformation can also be done in a single step, as sketched below.
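A minimal one-step alternative, passing map a lambda that both lowercases the food name and looks up the animal (uses the data1 and meat_to_animal objects defined above):

```python
# lowercase and look up the animal in one pass
data1['animal'] = data1['food'].map(lambda x: meat_to_animal[x.lower()])
```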
3) Replacing values
Ways of handling missing or sentinel values:
1) fillna
2) combine_first, which patches missing values from another object with an overlapping index
3) replace
A short sketch of the first two follows; the rest of this section focuses on replace.
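For comparison, a minimal sketch of fillna and combine_first (the Series s and backup are made up for illustration and are not part of the original example):

```python
s = pd.Series([1.0, np.nan, 3.0])
backup = pd.Series([10.0, 20.0, 30.0])

s.fillna(0)              # fill NaN with a constant
s.combine_first(backup)  # take values from backup wherever s is NaN
```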
```python
data2 = pd.Series([1., -999, 2, -999, -1000, 3.])
data2
# 0       1.0
# 1    -999.0
# 2       2.0
# 3    -999.0
# 4   -1000.0
# 5       3.0
# dtype: float64
```

-999 may be a sentinel value marking missing data. To turn it into an NA value that pandas understands, use replace:

```python
data2.replace(-999, np.nan)
# 0       1.0
# 1       NaN
# 2       2.0
# 3       NaN
# 4   -1000.0
# 5       3.0
# dtype: float64

data2                    # unchanged
# 0       1.0
# 1    -999.0
# 2       2.0
# 3    -999.0
# 4   -1000.0
# 5       3.0
# dtype: float64
```

replace does not modify the original data; it returns a new Series.
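If you do want to modify the Series itself, replace also accepts an inplace flag; a minimal sketch done on a copy so that the examples below still see the original data2 (data2_copy is just an illustrative name):

```python
data2_copy = data2.copy()
data2_copy.replace(-999, np.nan, inplace=True)  # modifies data2_copy in place
```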
Replacing several values at once:

```python
data2.replace([-999, -1000], np.nan)        # pass a list of values
# 0    1.0
# 1    NaN
# 2    2.0
# 3    NaN
# 4    NaN
# 5    3.0
# dtype: float64

data2.replace([-999, -1000], [np.nan, 0])   # a different replacement for each value
# 0    1.0
# 1    NaN
# 2    2.0
# 3    NaN
# 4    0.0
# 5    3.0
# dtype: float64

data2.replace({-999: np.nan, -1000: 0})     # the same thing expressed as a dict
# 0    1.0
# 1    NaN
# 2    2.0
# 3    NaN
# 4    0.0
# 5    3.0
# dtype: float64
```

4) Renaming axis indexes
Like the values in a Series, axis labels can also be transformed through a function or a mapping, producing a new object; the axes can likewise be modified in place without creating a new data structure.
```python
data3 = pd.DataFrame(np.arange(12).reshape(3, 4),
                     index=['a', 'b', 'c'],
                     columns=['one', 'two', 'three', 'four'])
data3
```

|   | one | two | three | four |
|---|-----|-----|-------|------|
| a | 0   | 1   | 2     | 3    |
| b | 4   | 5   | 6     | 7    |
| c | 8   | 9   | 10    | 11   |
Part of the axis labels can also be updated with rename combined with a dict-like object.
```python
data3.rename(index={'A': 'aaa'}, columns={'three': 'liu'})
```

|   | one | two | liu | four |
|---|-----|-----|-----|------|
| a | 0   | 1   | 2   | 3    |
| b | 4   | 5   | 6   | 7    |
| c | 8   | 9   | 10  | 11   |

Note that the index mapping {'A': 'aaa'} matches none of the lowercase labels a/b/c, so only the column three is actually renamed; rename silently ignores keys that do not appear on the axis.
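rename also accepts a function per axis, and an inplace flag for modifying the object itself rather than returning a copy; a minimal sketch of both (standard pandas behaviour, not part of the original example):

```python
# transform every label with a function
data3.rename(index=str.upper, columns=str.title)

# update data3 itself instead of returning a new DataFrame
data3.rename(index={'a': 'aaa'}, inplace=True)
```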
5) Detecting and filtering outliers
Here the outlier threshold is known in advance, so filtering or transforming outliers largely boils down to boolean array operations.
```python
data4 = pd.DataFrame(np.random.randn(1000, 4))
data4.info()
data4.describe()
```

|       | 0           | 1           | 2           | 3           |
|-------|-------------|-------------|-------------|-------------|
| count | 1000.000000 | 1000.000000 | 1000.000000 | 1000.000000 |
| mean  | 0.023986    | -0.014049   | 0.032299    | -0.037661   |
| std   | 0.994571    | 1.003522    | 1.009939    | 1.017361    |
| min   | -3.526970   | -3.298974   | -3.429383   | -3.421995   |
| 25%   | -0.632426   | -0.685564   | -0.665548   | -0.756219   |
| 50%   | 0.013326    | 0.006130    | -0.017911   | -0.015297   |
| 75%   | 0.633279    | 0.670261    | 0.673849    | 0.665360    |
| max   | 3.549620    | 3.142503    | 3.991028    | 3.086376    |
Find the values in column 3 whose absolute value exceeds 3:

```python
data4[3][np.abs(data4[3]) > 3]
# 189   -3.421995
# 335    3.086376
# 590   -3.388477
# 778   -3.100379
# Name: 3, dtype: float64
```

Find all rows that contain a value greater than 3 or less than -3:
```python
(np.abs(data4) > 3).any(1).head()
# 0    False
# 1    False
# 2    False
# 3    False
# 4    False
# dtype: bool

data4[(np.abs(data4) > 3).any(1)]
```

| 0         | 1         | 2         | 3         |
|-----------|-----------|-----------|-----------|
| 3.549620  | -0.943976 | -0.058490 | 0.941503  |
| -0.071249 | -1.350361 | 0.385375  | -3.421995 |
| 2.337961  | 3.142503  | -0.208999 | -0.485979 |
| 0.230998  | -1.397259 | 2.734229  | 3.086376  |
| -3.526970 | -0.289467 | 1.099487  | 1.206039  |
| 0.011728  | -0.398739 | 3.104470  | 0.459924  |
| 0.357944  | 0.007063  | 3.991028  | 0.722481  |
| -3.019947 | -0.982651 | -1.727289 | 1.484966  |
| 0.211069  | 0.344059  | 0.656351  | -3.388477 |
| 0.930103  | 3.117643  | -1.372034 | -1.208730 |
| 0.362668  | -3.298974 | -1.033128 | 0.900985  |
| 0.094172  | 0.827937  | 2.617724  | -3.100379 |
| -1.450645 | -1.131513 | -3.429383 | -0.828139 |
| 1.188536  | -3.069987 | -0.746700 | 0.745037  |
| 2.449030  | 0.429959  | 3.025705  | -1.571179 |
Cap the outliers:

```python
data4[np.abs(data4) > 3] = np.sign(data) * 3   # note: `data` here is a typo, see below
data4.isnull().sum()    # NaN appeared
# 0    3
# 1    4
# 2    4
# 3    4
# dtype: int64

data4 = data4.replace(np.nan, 0)
data4.isnull().sum()    # no NaN left
# 0    0
# 1    0
# 2    0
# 3    0
# dtype: int64

data4.describe()
```

|       | 0           | 1           | 2           | 3           |
|-------|-------------|-------------|-------------|-------------|
| count | 1000.000000 | 1000.000000 | 1000.000000 | 1000.000000 |
| mean  | 0.026983    | -0.013941   | 0.025608    | -0.030836   |
| std   | 0.977152    | 0.983421    | 0.986831    | 0.996554    |
| min   | -2.749595   | -2.799638   | -2.943564   | -2.743207   |
| 25%   | -0.630318   | -0.682237   | -0.663014   | -0.739291   |
| 50%   | 0.012445    | 0.000613    | -0.017171   | -0.004484   |
| 75%   | 0.631146    | 0.668023    | 0.660236    | 0.659204    |
| max   | 2.829804    | 2.915031    | 2.907655    | 2.679495    |

Why are min and max not exactly ±3 after capping? Because the assignment references data (the k1/k2 DataFrame from section 1) instead of data4: the two frames share no columns, so np.sign(data) * 3 aligns to all-NaN, the masked cells become NaN, and replacing NaN with 0 pulls the extremes inside (-3, 3). A corrected version is sketched below.
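A sketch of the corrected capping as it would be applied to the original data4, plus the equivalent clip shortcut (both standard pandas; with this version min and max come out as exactly ±3 and no value is zeroed out):

```python
# cap values outside [-3, 3] at +/-3, keeping their sign
data4[np.abs(data4) > 3] = np.sign(data4) * 3

# the same effect in a single call
data4_capped = data4.clip(-3, 3)
```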
6) Permutation and random sampling
1) numpy.random.permutation to permute the rows
2) np.random.randint to draw random integers
```python
df = pd.DataFrame(np.arange(5 * 4).reshape(5, 4))
sampler = np.random.permutation(5)
df
```

|   | 0  | 1  | 2  | 3  |
|---|----|----|----|----|
| 0 | 0  | 1  | 2  | 3  |
| 1 | 4  | 5  | 6  | 7  |
| 2 | 8  | 9  | 10 | 11 |
| 3 | 12 | 13 | 14 | 15 |
| 4 | 16 | 17 | 18 | 19 |

Reorder the rows with take (judging from the output below, sampler came out as [0, 1, 3, 4, 2] in this run):

```python
df.take(sampler)
```

|   | 0  | 1  | 2  | 3  |
|---|----|----|----|----|
| 0 | 0  | 1  | 2  | 3  |
| 1 | 4  | 5  | 6  | 7  |
| 3 | 12 | 13 | 14 | 15 |
| 4 | 16 | 17 | 18 | 19 |
| 2 | 8  | 9  | 10 | 11 |
Draw a set of random integers with np.random.randint and use them to sample rows (with replacement):

```python
sampler1 = np.random.randint(0, len(df), size=4)
sampler1
# array([2, 2, 3, 0])

df1 = df.take(sampler1)
df1
```

|   | 0  | 1  | 2  | 3  |
|---|----|----|----|----|
| 2 | 8  | 9  | 10 | 11 |
| 2 | 8  | 9  | 10 | 11 |
| 3 | 12 | 13 | 14 | 15 |
| 0 | 0  | 1  | 2  | 3  |
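Newer pandas versions wrap both patterns in DataFrame.sample, which can be more convenient (a side note, not part of the original example):

```python
df.sample(n=3)                # 3 rows without replacement (like permutation + take)
df.sample(n=4, replace=True)  # 4 rows with replacement (like randint + take)
```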
7) Computing indicator/dummy variables
Converting a categorical variable into a dummy matrix (indicator matrix) is a common transformation in statistical modelling and machine learning.
That is, if a DataFrame column contains k distinct values, it can be expanded into a k-column matrix or DataFrame whose values are 0 or 1.
pandas' get_dummies function does exactly this.
```python
df2 = pd.DataFrame({'key': ['b', 'a', 'b', 'c', 'a', 'b'],
                    'data1': range(6)})
df2
```

|   | data1 | key |
|---|-------|-----|
| 0 | 0     | b   |
| 1 | 1     | a   |
| 2 | 2     | b   |
| 3 | 3     | c   |
| 4 | 4     | a   |
| 5 | 5     | b   |

Judging from the output below, it was produced by a plain get_dummies on the key column:

```python
pd.get_dummies(df2['key'])
```

|   | a   | b   | c   |
|---|-----|-----|-----|
| 0 | 0.0 | 1.0 | 0.0 |
| 1 | 1.0 | 0.0 | 0.0 |
| 2 | 0.0 | 1.0 | 0.0 |
| 3 | 0.0 | 0.0 | 1.0 |
| 4 | 1.0 | 0.0 | 0.0 |
| 5 | 0.0 | 1.0 | 0.0 |

The next table keeps data1 alongside the dummy columns, which corresponds to joining a prefixed dummy frame back onto the data (again reconstructed from the output shown):

```python
dummies = pd.get_dummies(df2['key'], prefix='key')
df_with_dummy = df2[['data1']].join(dummies)
df_with_dummy
```

|   | data1 | key_a | key_b | key_c |
|---|-------|-------|-------|-------|
| 0 | 0     | 0.0   | 1.0   | 0.0   |
| 1 | 1     | 1.0   | 0.0   | 0.0   |
| 2 | 2     | 0.0   | 1.0   | 0.0   |
| 3 | 3     | 0.0   | 0.0   | 1.0   |
| 4 | 4     | 1.0   | 0.0   | 0.0   |
| 5 | 5     | 0.0   | 1.0   | 0.0   |
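get_dummies also combines nicely with discretization functions such as pd.cut, turning a continuous variable into per-bucket indicators; a minimal sketch (the values and bins here are made up for illustration):

```python
values = np.random.rand(10)
bins = [0, 0.2, 0.4, 0.6, 0.8, 1.0]
pd.get_dummies(pd.cut(values, bins))  # one indicator column per bin
```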