简约而不简单|值得收藏的Pandas基本操作指南
Pandas 是基于NumPy 的一種工具,該工具是為了解決數據分析任務而創建的。Pandas 納入了大量庫和一些標準的數據模型,提供了高效地操作大型數據集所需的工具。pandas提供了大量能使我們快速便捷地處理數據的函數和方法。你很快就會發現,它是使Python成為強大而高效的數據分析環境的重要因素之一。
本文總結了pandas的常用操作,建議讀者用兩天時間看完,本文代碼已經在github公布,建議邊運行邊學習。作者認為,學完這篇文章,pandas的基本操作沒有問題了,以后碰到問題也可以查這篇文章。
本文代碼的github地址:https://github.com/fengdu78/machine_learning_beginner/tree/master/pandas
目錄0.導語1.Series2.DataFrame2.1 DataFrame的簡單運用3.pandas選擇數據3.1 實戰篩選3.2 篩選總結4.Pandas設置值4.1 創建數據4.2 根據位置設置loc和iloc4.3 根據條件設置4.4 按行或列設置4.5 添加Series序列(長度必須對齊)4.6 設定某行某列為特定值4.7 修改一整行數據5.Pandas處理丟失數據5.1 創建含NaN的矩陣5.2 刪除掉有NaN的行或列5.3 替換NaN值為0或者其他5.4 是否有缺失數據NaN6.Pandas導入導出6.1 導入數據6.2 導出數據7.Pandas合并操作7.1 Pandas合并concat7.2.Pandas 合并 merge7.2.1 定義資料集并打印出7.2.2 依據key column合并,并打印7.2.3 兩列合并7.2.4 Indicator設置合并列名稱7.2.5 依據index合并7.2.6 解決overlapping的問題8.Pandas plot出圖9.參考
0.導語
Pandas是基于Numpy構建的,讓Numpy為中心的應用變得更加簡單。
本文為一篇長文,建議收藏,轉發~
1.Series
import?pandas?as?pd import?numpy?as?np#?Series s?=?pd.Series([1,3,6,np.nan,44,1]) print(s) ''' 0?????1.0 1?????3.0 2?????6.0 3?????NaN 4????44.0 5?????1.0 dtype:?float64 ''' #?默認index從0開始,如果想要按照自己的索引設置,則修改index參數,如:index=[3,4,3,7,8,9]2.DataFrame
2.1 DataFrame的簡單運用
#?DataFrame dates?=?pd.date_range('2018-08-19',periods=6) #?dates?=?pd.date_range('2018-08-19','2018-08-24')?#?起始、結束??與上述等價 ''' numpy.random.randn(d0,?d1,?…,?dn)是從標準正態分布中返回一個或多個樣本值。 numpy.random.rand(d0,?d1,?…,?dn)的隨機樣本位于[0,?1)中。 (6,4)表示6行4列數據 ''' df?=?pd.DataFrame(np.random.randn(6,4),index=dates,columns=['a','b','c','d']) print(df) #?DataFrame既有行索引也有列索引,?它可以被看做由Series組成的大字典。 print(df['b']) ''' 2018-08-19???-0.217424 2018-08-20???-1.421058 2018-08-21???-0.424589 2018-08-22????0.534675 2018-08-23???-0.018525 2018-08-24????0.635075 Freq:?D,?Name:?b,?dtype:?float64 ''' #?未指定行標簽和列標簽的數據 df1?=?pd.DataFrame(np.arange(12).reshape(3,4)) print(df1) #?另一種方式 df2?=?pd.DataFrame({'A':?[1,2,3,4],'B':?pd.Timestamp('20180819'),'C':?pd.Series([1,6,9,10],dtype='float32'),'D':?np.array([3]?*?4,dtype='int32'),'E':?pd.Categorical(['test','train','test','train']),'F':?'foo' }) print(df2) '''A??????????B?????C??D??????E????F 0??1?2018-08-19???1.0??3???test??foo 1??2?2018-08-19???6.0??3??train??foo 2??3?2018-08-19???9.0??3???test??foo 3??4?2018-08-19??10.0??3??train??foo ''' print(df2.dtypes) ''' A?????????????int64 B????datetime64[ns] C???????????float32 D?????????????int32 E??????????category F????????????object dtype:?object ''' print(df2.index) #?RangeIndex(start=0,?stop=4,?step=1) print(df2.columns) #?Index(['A',?'B',?'C',?'D',?'E',?'F'],?dtype='object') print(df2.values) ''' [[1?Timestamp('2018-08-19?00:00:00')?1.0?3?'test'?'foo'][2?Timestamp('2018-08-19?00:00:00')?6.0?3?'train'?'foo'][3?Timestamp('2018-08-19?00:00:00')?9.0?3?'test'?'foo'][4?Timestamp('2018-08-19?00:00:00')?10.0?3?'train'?'foo']] ''' #?數據總結 print(df2.describe()) '''A??????????C????D count??4.000000???4.000000??4.0 mean???2.500000???6.500000??3.0 std????1.290994???4.041452??0.0 min????1.000000???1.000000??3.0 25%????1.750000???4.750000??3.0 50%????2.500000???7.500000??3.0 75%????3.250000???9.250000??3.0 max????4.000000??10.000000??3.0 ''' #?翻轉數據 print(df2.T) #?print(np.transpose(df2))等價于上述操作 ''' axis=1表示行 axis=0表示列 默認ascending(升序)為True ascending=True表示升序,ascending=False表示降序 下面兩行分別表示按行升序與按行降序 ''' print(df2.sort_index(axis=1,ascending=True)) print(df2.sort_index(axis=1,ascending=False)) '''A??????????B?????C??D??????E????F 0??1?2018-08-19???1.0??3???test??foo 1??2?2018-08-19???6.0??3??train??foo 2??3?2018-08-19???9.0??3???test??foo 3??4?2018-08-19??10.0??3??train??fooF??????E??D?????C??????????B??A 0??foo???test??3???1.0?2018-08-19??1 1??foo??train??3???6.0?2018-08-19??2 2??foo???test??3???9.0?2018-08-19??3 3??foo??train??3??10.0?2018-08-19??4 ''' #?表示按列降序與按列升序 print(df2.sort_index(axis=0,ascending=False)) print(df2.sort_index(axis=0,ascending=True)) '''A??????????B?????C??D??????E????F 3??4?2018-08-19??10.0??3??train??foo 2??3?2018-08-19???9.0??3???test??foo 1??2?2018-08-19???6.0??3??train??foo 0??1?2018-08-19???1.0??3???test??fooA??????????B?????C??D??????E????F 0??1?2018-08-19???1.0??3???test??foo 1??2?2018-08-19???6.0??3??train??foo 2??3?2018-08-19???9.0??3???test??foo 3??4?2018-08-19??10.0??3??train??foo ''' #?對特定列數值排列 #?表示對C列降序排列 print(df2.sort_values(by='C',ascending=False))3.pandas選擇數據
3.1 實戰篩選
import?pandas?as?pd import?numpy?as?np dates?=?pd.date_range('20180819',?periods=6) df?=?pd.DataFrame(np.arange(24).reshape((6,4)),index=dates,?columns=['A','B','C','D']) print(df) #?檢索A列 print(df['A']) print(df.A) #?選擇跨越多行或多列 #?選取前3行 print(df[0:3]) print(df['2018-08-19':'2018-08-21']) '''A??B???C???D 2018-08-19??0??1???2???3 2018-08-20??4??5???6???7 2018-08-21??8??9??10??11 ''' #?根據標簽選擇數據 #?獲取特定行或列 #?指定行數據 print(df.loc['20180819']) ''' A????0 B????1 C????2 D????3 Name:?2018-08-19?00:00:00,?dtype:?int32 ''' #?指定列 #?兩種方式 print(df.loc[:,'A':'B']) print(df.loc[:,['A','B']]) '''A???B 2018-08-19???0???1 2018-08-20???4???5 2018-08-21???8???9 2018-08-22??12??13 2018-08-23??16??17 2018-08-24??20??21 ''' #?行與列同時檢索 print(df.loc['20180819',['A','B']]) ''' A????0 B????1 Name:?2018-08-19?00:00:00,?dtype:?int32 ''' #?根據序列iloc #?獲取特定位置的值 print(df.iloc[3,1]) print(df.iloc[3:5,1:3])?#?不包含末尾5或3,同列表切片 '''B???C 2018-08-22??13??14 2018-08-23??17??18 ''' #?跨行操作 print(df.iloc[[1,3,5],1:3]) '''B???C 2018-08-20???5???6 2018-08-22??13??14 2018-08-24??21??22 ''' #?混合選擇 print(df.ix[:3,['A','C']]) '''A???C 2018-08-19??0???2 2018-08-20??4???6 2018-08-21??8??10 ''' print(df.iloc[:3,[0,2]])?#?結果同上#?通過判斷的篩選 print(df[df.A>8]) '''A???B???C???D 2018-08-22??12??13??14??15 2018-08-23??16??17??18??19 2018-08-24??20??21??22??23 '''3.2 篩選總結
1.iloc與ix區別總結:相同點:iloc可以取相應的值,操作方便,與ix操作類似。不同點:ix可以混合選擇,可以填入column對應的字符選擇,而iloc只能采用index索引,對于列數較多情況下,ix要方便操作許多。 2.loc與iloc區別總結:相同點:都可以索引處塊數據不同點:iloc可以檢索對應值,兩者操作不同。 3.ix與loc、iloc三者的區別總結:ix是混合loc與iloc操作 如下:對比三者操作 print(df.loc['20180819','A':'B']) print(df.iloc[0,0:2]) print(df.ix[0,'A':'B']) 輸出結果相同,均為: A????0 B????1 Name:?2018-08-19?00:00:00,?dtype:?int324.Pandas設置值
4.1 創建數據
import?pandas?as?pd import?numpy?as?np #?創建數據 dates?=?pd.date_range('20180820',periods=6) df?=?pd.DataFrame(np.arange(24).reshape(6,4),?index=dates,?columns=['A','B','C','D']) print(df) '''A???B???C???D 2018-08-20???0???1???2???3 2018-08-21???4???5???6???7 2018-08-22???8???9??10??11 2018-08-23??12??13??14??15 2018-08-24??16??17??18??19 2018-08-25??20??21??22??23 '''4.2 根據位置設置loc和iloc
#?根據位置設置loc和iloc df.iloc[2,2]?=?111 df.loc['20180820','B']?=?2222 print(df) '''A?????B????C???D 2018-08-20???0??2222????2???3 2018-08-21???4?????5????6???7 2018-08-22???8?????9??111??11 2018-08-23??12????13???14??15 2018-08-24??16????17???18??19 2018-08-25??20????21???22??23 '''4.3 根據條件設置
#?根據條件設置 #?更改B中的數,而更改的位置取決于4的位置,并設相應位置的數為0 df.B[df.A>4]?=?0 print(df) '''A?????B????C???D 2018-08-20???0??2222????2???3 2018-08-21???4?????5????6???7 2018-08-22???8?????0??111??11 2018-08-23??12?????0???14??15 2018-08-24??16?????0???18??19 2018-08-25??20?????0???22??23 '''4.4 按行或列設置
#?按行或列設置 #?列批處理,F列全改為NaN df['F']?=?np.nan print(df)4.5 添加Series序列(長度必須對齊)
df['E']?=?pd.Series([1,2,3,4,5,6],?index=pd.date_range('20180820',periods=6)) print(df)4.6 設定某行某列為特定值
#?設定某行某列為特定值 df.ix['20180820','A']?=?56 df.loc['20180820','A']?=?67 df.iloc[0,0]?=?764.7 修改一整行數據
#?修改一整行數據 df.iloc[1]?=?np.nan?#?df.iloc[1,:]=np.nan df.loc['20180820']?=?np.nan?#?df.loc['20180820,:']=np.nan df.ix[2]?=?np.nan?#?df.ix[2,:]=np.nan df.ix['20180823']?=?np.nan print(df)5.Pandas處理丟失數據
5.1 創建含NaN的矩陣
#?Pandas處理丟失數據 import?pandas?as?pd import?numpy?as?np #?創建含NaN的矩陣 #?如何填充和刪除NaN數據? dates?=?pd.date_range('20180820',periods=6) df?=?pd.DataFrame(np.arange(24).reshape((6,4)),index=dates,columns=['A','B','C','D'])?#?a.reshape(6,4)等價于a.reshape((6,4)) df.iloc[0,1]?=?np.nan df.iloc[1,2]?=?np.nan print(df) '''A?????B?????C???D 2018-08-20???0???NaN???2.0???3 2018-08-21???4???5.0???NaN???7 2018-08-22???8???9.0??10.0??11 2018-08-23??12??13.0??14.0??15 2018-08-24??16??17.0??18.0??19 2018-08-25??20??21.0??22.0??23 '''5.2 刪除掉有NaN的行或列
#?刪除掉有NaN的行或列 print(df.dropna())?#?默認是刪除掉含有NaN的行 print(df.dropna(axis=0,?#?0對行進行操作;1對列進行操作how='any'?#?'any':只要存在NaN就drop掉;'all':必須全部是NaN才drop )) '''A?????B?????C???D 2018-08-22???8???9.0??10.0??11 2018-08-23??12??13.0??14.0??15 2018-08-24??16??17.0??18.0??19 2018-08-25??20??21.0??22.0??23 ''' #?刪除掉所有含有NaN的列 print(df.dropna(axis=1,how='any' )) '''A???D 2018-08-20???0???3 2018-08-21???4???7 2018-08-22???8??11 2018-08-23??12??15 2018-08-24??16??19 2018-08-25??20??23 '''5.3 替換NaN值為0或者其他
#?替換NaN值為0或者其他 print(df.fillna(value=0)) '''A?????B?????C???D 2018-08-20???0???0.0???2.0???3 2018-08-21???4???5.0???0.0???7 2018-08-22???8???9.0??10.0??11 2018-08-23??12??13.0??14.0??15 2018-08-24??16??17.0??18.0??19 2018-08-25??20??21.0??22.0??23 '''5.4 是否有缺失數據NaN
#?是否有缺失數據NaN #?是否為空 print(df.isnull()) '''A??????B??????C??????D 2018-08-20??False???True??False??False 2018-08-21??False??False???True??False 2018-08-22??False??False??False??False 2018-08-23??False??False??False??False 2018-08-24??False??False??False??False 2018-08-25??False??False??False??False ''' #?是否為NaN print(df.isna()) '''A??????B??????C??????D 2018-08-20??False???True??False??False 2018-08-21??False??False???True??False 2018-08-22??False??False??False??False 2018-08-23??False??False??False??False 2018-08-24??False??False??False??False 2018-08-25??False??False??False??False ''' #?檢測某列是否有缺失數據NaN print(df.isnull().any()) ''' A????False B?????True C?????True D????False dtype:?bool ''' #?檢測數據中是否存在NaN,如果存在就返回True print(np.any(df.isnull())==True)6.Pandas導入導出
6.1 導入數據
import?pandas?as?pd?#?加載模塊 #?讀取csv data?=?pd.read_csv('student.csv') #?打印出data print(data) #?前三行 print(data.head(3)) #?后三行 print(data.tail(3))6.2 導出數據
#?將資料存取成pickle data.to_pickle('student.pickle') #?讀取pickle文件并打印 print(pd.read_pickle('student.pickle'))7.Pandas合并操作
7.1 Pandas合并concat
import?pandas?as?pd import?numpy?as?np#?定義資料集 df1?=?pd.DataFrame(np.ones((3,4))*0,?columns=['a','b','c','d']) df2?=?pd.DataFrame(np.ones((3,4))*1,?columns=['a','b','c','d']) df3?=?pd.DataFrame(np.ones((3,4))*2,?columns=['a','b','c','d']) print(df1) '''a????b????c????d 0??0.0??0.0??0.0??0.0 1??0.0??0.0??0.0??0.0 2??0.0??0.0??0.0??0.0 ''' print(df2) '''a????b????c????d 0??1.0??1.0??1.0??1.0 1??1.0??1.0??1.0??1.0 2??1.0??1.0??1.0??1.0 ''' print(df3) '''a????b????c????d 0??2.0??2.0??2.0??2.0 1??2.0??2.0??2.0??2.0 2??2.0??2.0??2.0??2.0 ''' #?concat縱向合并 res?=?pd.concat([df1,df2,df3],axis=0)#?打印結果 print(res) '''a????b????c????d 0??0.0??0.0??0.0??0.0 1??0.0??0.0??0.0??0.0 2??0.0??0.0??0.0??0.0 0??1.0??1.0??1.0??1.0 1??1.0??1.0??1.0??1.0 2??1.0??1.0??1.0??1.0 0??2.0??2.0??2.0??2.0 1??2.0??2.0??2.0??2.0 2??2.0??2.0??2.0??2.0 ''' #?上述合并過程中,index重復,下面給出重置index方法 #?只需要將index_ignore設定為True即可 res?=?pd.concat([df1,df2,df3],axis=0,ignore_index=True)#?打印結果 print(res) '''a????b????c????d 0??0.0??0.0??0.0??0.0 1??0.0??0.0??0.0??0.0 2??0.0??0.0??0.0??0.0 3??1.0??1.0??1.0??1.0 4??1.0??1.0??1.0??1.0 5??1.0??1.0??1.0??1.0 6??2.0??2.0??2.0??2.0 7??2.0??2.0??2.0??2.0 8??2.0??2.0??2.0??2.0 ''' #?join?合并方式 #定義資料集 df1?=?pd.DataFrame(np.ones((3,4))*0,?columns=['a','b','c','d'],?index=[1,2,3]) df2?=?pd.DataFrame(np.ones((3,4))*1,?columns=['b','c','d','e'],?index=[2,3,4]) print(df1) print(df2) ''' join='outer',函數默認為join='outer'。此方法是依照column來做縱向合并,有相同的column上下合并在一起, 其他獨自的column各自成列,原來沒有值的位置皆為NaN填充。 ''' #?縱向"外"合并df1與df2 res?=?pd.concat([df1,df2],axis=0,join='outer')print(res)'''a????b????c????d????e 1??0.0??0.0??0.0??0.0??NaN 2??0.0??0.0??0.0??0.0??NaN 3??0.0??0.0??0.0??0.0??NaN 2??NaN??1.0??1.0??1.0??1.0 3??NaN??1.0??1.0??1.0??1.0 4??NaN??1.0??1.0??1.0??1.0 ''' #?修改index res?=?pd.concat([df1,df2],axis=0,join='outer',ignore_index=True)print(res) '''a????b????c????d????e 0??0.0??0.0??0.0??0.0??NaN 1??0.0??0.0??0.0??0.0??NaN 2??0.0??0.0??0.0??0.0??NaN 3??NaN??1.0??1.0??1.0??1.0 4??NaN??1.0??1.0??1.0??1.0 5??NaN??1.0??1.0??1.0??1.0 ''' #?join='inner'合并相同的字段 #?縱向"內"合并df1與df2 res?=?pd.concat([df1,df2],axis=0,join='inner') #?打印結果 print(res) '''b????c????d 1??0.0??0.0??0.0 2??0.0??0.0??0.0 3??0.0??0.0??0.0 2??1.0??1.0??1.0 3??1.0??1.0??1.0 4??1.0??1.0??1.0 ''' #?join_axes(依照axes合并) #定義資料集 df1?=?pd.DataFrame(np.ones((3,4))*0,?columns=['a','b','c','d'],?index=[1,2,3]) df2?=?pd.DataFrame(np.ones((3,4))*1,?columns=['b','c','d','e'],?index=[2,3,4]) print(df1) '''a????b????c????d 1??0.0??0.0??0.0??0.0 2??0.0??0.0??0.0??0.0 3??0.0??0.0??0.0??0.0 ''' print(df2) '''b????c????d????e 2??1.0??1.0??1.0??1.0 3??1.0??1.0??1.0??1.0 4??1.0??1.0??1.0??1.0 ''' #?依照df1.index進行橫向合并 res?=?pd.concat([df1,df2],axis=1,join_axes=[df1.index]) print(res) '''a????b????c????d????b????c????d????e 1??0.0??0.0??0.0??0.0??NaN??NaN??NaN??NaN 2??0.0??0.0??0.0??0.0??1.0??1.0??1.0??1.0 3??0.0??0.0??0.0??0.0??1.0??1.0??1.0??1.0 ''' #?移除join_axes參數,打印結果 res?=?pd.concat([df1,df2],axis=1) print(res) '''a????b????c????d????b????c????d????e 1??0.0??0.0??0.0??0.0??NaN??NaN??NaN??NaN 2??0.0??0.0??0.0??0.0??1.0??1.0??1.0??1.0 3??0.0??0.0??0.0??0.0??1.0??1.0??1.0??1.0 ''' #?append(添加數據) #?append只有縱向合并,沒有橫向合并 #定義資料集 df1?=?pd.DataFrame(np.ones((3,4))*0,?columns=['a','b','c','d']) df2?=?pd.DataFrame(np.ones((3,4))*1,?columns=['a','b','c','d']) df3?=?pd.DataFrame(np.ones((3,4))*2,?columns=['a','b','c','d']) s1?=?pd.Series([1,2,3,4],?index=['a','b','c','d']) #?將df2合并到df1下面,以及重置index,并打印出結果 res?=?df1.append(df2,ignore_index=True) print(res) '''a????b????c????d 0??0.0??0.0??0.0??0.0 1??0.0??0.0??0.0??0.0 2??0.0??0.0??0.0??0.0 3??1.0??1.0??1.0??1.0 4??1.0??1.0??1.0??1.0 5??1.0??1.0??1.0??1.0 ''' #?合并多個df,將df2與df3合并至df1的下面,以及重置index,并打印出結果 res?=?df1.append([df2,df3],?ignore_index=True) print(res) '''a????b????c????d 0??0.0??0.0??0.0??0.0 1??0.0??0.0??0.0??0.0 2??0.0??0.0??0.0??0.0 3??1.0??1.0??1.0??1.0 4??1.0??1.0??1.0??1.0 5??1.0??1.0??1.0??1.0 6??2.0??2.0??2.0??2.0 7??2.0??2.0??2.0??2.0 8??2.0??2.0??2.0??2.0 ''' #?合并series,將s1合并至df1,以及重置index,并打印結果 res?=?df1.append(s1,ignore_index=True) print(res) '''a????b????c????d 0??0.0??0.0??0.0??0.0 1??0.0??0.0??0.0??0.0 2??0.0??0.0??0.0??0.0 3??1.0??2.0??3.0??4.0 ''' #?總結:兩種常用合并方式 res?=?pd.concat([df1,?df2,?df3],?axis=0,?ignore_index=True) res1?=?df1.append([df2,?df3],?ignore_index=True) print(res) print(res1)7.2.Pandas 合并 merge
7.2.1 定義資料集并打印出
import?pandas?as?pd #?依據一組key合并 #?定義資料集并打印出 left?=?pd.DataFrame({'key'?:?['K0','K1','K2','K3'],'A'?:?['A0','A1','A2','A3'],'B'?:?['B0','B1','B2','B3']})right?=?pd.DataFrame({'key':?['K0',?'K1',?'K2',?'K3'],'C'?:?['C0',?'C1',?'C2',?'C3'],'D'?:?['D0',?'D1',?'D2',?'D3']}) print(left) '''A???B?key 0??A0??B0??K0 1??A1??B1??K1 2??A2??B2??K2 3??A3??B3??K3 ''' print(right) '''C???D?key 0??C0??D0??K0 1??C1??D1??K1 2??C2??D2??K2 3??C3??D3??K3 '''7.2.2 依據key column合并,并打印
#?依據key?column合并,并打印 res?=?pd.merge(left,right,on='key') print(res) '''A???B?key???C???D 0??A0??B0??K0??C0??D0 1??A1??B1??K1??C1??D1 2??A2??B2??K2??C2??D2 3??A3??B3??K3??C3??D3 ''' #?依據兩組key合并 #定義資料集并打印出 left?=?pd.DataFrame({'key1':?['K0',?'K0',?'K1',?'K2'],'key2':?['K0',?'K1',?'K0',?'K1'],'A':?['A0',?'A1',?'A2',?'A3'],'B':?['B0',?'B1',?'B2',?'B3']}) right?=?pd.DataFrame({'key1':?['K0',?'K1',?'K1',?'K2'],'key2':?['K0',?'K0',?'K0',?'K0'],'C':?['C0',?'C1',?'C2',?'C3'],'D':?['D0',?'D1',?'D2',?'D3']}) print(left) '''A???B?key1?key2 0??A0??B0???K0???K0 1??A1??B1???K0???K1 2??A2??B2???K1???K0 3??A3??B3???K2???K1 ''' print(right) '''C???D?key1?key2 0??C0??D0???K0???K0 1??C1??D1???K1???K0 2??C2??D2???K1???K0 3??C3??D3???K2???K0 '''7.2.3 兩列合并
依據key1與key2 columns進行合并
#?依據key1與key2?columns進行合并,并打印出四種結果['left',?'right',?'outer',?'inner'] res?=?pd.merge(left,?right,?on=['key1',?'key2'],?how='inner') print(res) res?=?pd.merge(left,?right,?on=['key1',?'key2'],?how='outer') print(res) res?=?pd.merge(left,?right,?on=['key1',?'key2'],?how='left') print(res) res?=?pd.merge(left,?right,?on=['key1',?'key2'],?how='right') print(res) ''' ---------------inner方式---------------A???B?key1?key2???C???D 0??A0??B0???K0???K0??C0??D0 1??A2??B2???K1???K0??C1??D1 2??A2??B2???K1???K0??C2??D2 ---------------outer方式---------------A????B?key1?key2????C????D 0???A0???B0???K0???K0???C0???D0 1???A1???B1???K0???K1??NaN??NaN 2???A2???B2???K1???K0???C1???D1 3???A2???B2???K1???K0???C2???D2 4???A3???B3???K2???K1??NaN??NaN 5??NaN??NaN???K2???K0???C3???D3 ---------------left方式---------------A???B?key1?key2????C????D 0??A0??B0???K0???K0???C0???D0 1??A1??B1???K0???K1??NaN??NaN 2??A2??B2???K1???K0???C1???D1 3??A2??B2???K1???K0???C2???D2 4??A3??B3???K2???K1??NaN??NaN --------------right方式---------------A????B?key1?key2???C???D 0???A0???B0???K0???K0??C0??D0 1???A2???B2???K1???K0??C1??D1 2???A2???B2???K1???K0??C2??D2 3??NaN??NaN???K2???K0??C3??D3 '''7.2.4 Indicator設置合并列名稱
#?Indicator df1?=?pd.DataFrame({'col1':[0,1],'col_left':['a','b']}) df2?=?pd.DataFrame({'col1':[1,2,2],'col_right':[2,2,2]}) print(df1) '''col1?col_left 0?????0????????a 1?????1????????b ''' print(df2) '''col1??col_right 0?????1??????????2 1?????2??????????2 2?????2??????????2 '''#?依據col1進行合并,并啟用indicator=True,最后打印 res?=?pd.merge(df1,df2,on='col1',how='outer',indicator=True) print(res) '''col1?col_left??col_right??????_merge 0?????0????????a????????NaN???left_only 1?????1????????b????????2.0????????both 2?????2??????NaN????????2.0??right_only 3?????2??????NaN????????2.0??right_only ''' #?自定義indicator?column的名稱,并打印出 res?=?pd.merge(df1,df2,on='col1',how='outer',indicator='indicator_column') print(res) '''col1?col_left??col_right?indicator_column 0?????0????????a????????NaN????????left_only 1?????1????????b????????2.0?????????????both 2?????2??????NaN????????2.0???????right_only 3?????2??????NaN????????2.0???????right_only '''7.2.5 依據index合并
#?依據index合并 #定義資料集并打印出 left?=?pd.DataFrame({'A':?['A0',?'A1',?'A2'],'B':?['B0',?'B1',?'B2']},index=['K0',?'K1',?'K2']) right?=?pd.DataFrame({'C':?['C0',?'C2',?'C3'],'D':?['D0',?'D2',?'D3']},index=['K0',?'K2',?'K3']) print(left) '''A???B K0??A0??B0 K1??A1??B1 K2??A2??B2 ''' print(right) '''C???D K0??C0??D0 K2??C2??D2 K3??C3??D3 ''' #?依據左右資料集的index進行合并,how='outer',并打印 res?=?pd.merge(left,right,left_index=True,right_index=True,how='outer') print(res) '''A????B????C????D K0???A0???B0???C0???D0 K1???A1???B1??NaN??NaN K2???A2???B2???C2???D2 K3??NaN??NaN???C3???D3 ''' #?依據左右資料集的index進行合并,how='inner',并打印 res?=?pd.merge(left,right,left_index=True,right_index=True,how='inner') print(res) '''A???B???C???D K0??A0??B0??C0??D0 K2??A2??B2??C2??D2 '''7.2.6 解決overlapping的問題
#?解決overlapping的問題 #定義資料集 boys?=?pd.DataFrame({'k':?['K0',?'K1',?'K2'],?'age':?[1,?2,?3]}) girls?=?pd.DataFrame({'k':?['K0',?'K0',?'K3'],?'age':?[4,?5,?6]}) print(boys) '''age???k 0????1??K0 1????2??K1 2????3??K2 ''' print(girls) '''age???k 0????4??K0 1????5??K0 2????6??K3 ''' #?使用suffixes解決overlapping的問題 #?比如將上面兩個合并時,age重復了,則可通過suffixes設置,以此保證不重復,不同名 res?=?pd.merge(boys,girls,on='k',suffixes=['_boy','_girl'],how='inner') print(res) '''age_boy???k??age_girl 0????????1??K0?????????4 1????????1??K0?????????5 '''8.Pandas plot出圖
import?pandas?as?pd import?numpy?as?np import?matplotlib.pyplot?as?pltdata?=?pd.Series(np.random.randn(1000),?index=np.arange(1000)) print(data) print(data.cumsum()) #?data本來就是一個數據,所以我們可以直接plot data.plot() plt.show() #?np.random.randn(1000,4)?隨機生成1000行4列數據 #?list("ABCD")會變為['A','B','C','D'] data?=?pd.DataFrame(np.random.randn(1000,4),index=np.arange(1000),columns=list("ABCD") ) data.cumsum() data.plot() plt.show() ax?=?data.plot.scatter(x='A',y='B',color='DarkBlue',label='Class1') #?將之下這個?data?畫在上一個?ax?上面 data.plot.scatter(x='A',y='C',color='LightGreen',label='Class2',ax=ax) plt.show()9.參考
1.https://morvanzhou.github.io/tutorials/data-manipulation/np-pd/
本文代碼的github地址:https://github.com/fengdu78/machine_learning_beginner/tree/master/pandas
往期精彩回顧適合初學者入門人工智能的路線及資料下載機器學習及深度學習筆記等資料打印機器學習在線手冊深度學習筆記專輯《統計學習方法》的代碼復現專輯 AI基礎下載機器學習的數學基礎專輯獲取一折本站知識星球優惠券,復制鏈接直接打開:https://t.zsxq.com/yFQV7am本站qq群1003271085。加入微信群請掃碼進群:總結
以上是生活随笔為你收集整理的简约而不简单|值得收藏的Pandas基本操作指南的全部內容,希望文章能夠幫你解決所遇到的問題。
- 上一篇: 【CV】一文看懂目标检测神器YOLO5
- 下一篇: 30万奖金等你拿,“信也科技杯”第五届数