A Hands-On Pandas Tutorial That Even Beginners Can Follow (Part 1)

Published: 2024/9/15

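Author: 奔雷手, currently a student focusing on machine learning who also works as a machine learning teaching assistant, and so has a fairly good sense of what beginners need and where they get stuck. The hope is that this column helps everyone make friends and improve together.

Editor: 王老濕

Today we will work through Pandas hands-on. For reasons of length the tutorial is split into two parts; this is Part 1.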

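1 Introduction to the data structures

pandas has two very important data structures: the Series and the DataFrame. A Series is similar to a one-dimensional NumPy array: the functions and methods available for one-dimensional arrays can be used on it, its values can additionally be retrieved by index label, and it aligns automatically on its index. A DataFrame is similar to a two-dimensional NumPy array: NumPy array functions and methods can likewise be used on it, and it adds further flexibility such as labelled rows and columns.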
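To make the comparison concrete, here is a small sketch. The names `prices`, `roots` and `table`, and all of the values, are made up for illustration. It shows that a Series accepts NumPy functions and label lookups, and that a DataFrame puts row and column labels on top of a two-dimensional array.

```python
import numpy as np
import pandas as pd

# Hypothetical data, chosen only for illustration.
prices = pd.Series([1.0, 4.0, 9.0], index=["a", "b", "c"])

# NumPy functions work element-wise and keep the index labels.
roots = np.sqrt(prices)
print(roots["b"])  # label-based access, prints 2.0

# A DataFrame wraps a 2-D array and adds row and column labels.
table = pd.DataFrame(
    np.arange(6).reshape(2, 3),
    index=["r1", "r2"],
    columns=["x", "y", "z"],
)
print(table.loc["r2", "y"])    # prints 4
print(table.to_numpy().sum())  # the underlying array is still available, prints 15
```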

1.1 Creating a Series: three methods

Creating a Series from a one-dimensional array

    import pandas as pd
    import numpy as np

    arr1 = np.arange(10)
    print("Array arr1:", arr1)
    print("Type of arr1:", type(arr1))
    s1 = pd.Series(arr1)
    print("Series s1:\n", s1)
    print("Type of s1:", type(s1))

Output:

    Array arr1: [0 1 2 3 4 5 6 7 8 9]
    Type of arr1: <class 'numpy.ndarray'>
    Series s1:
    0    0
    1    1
    2    2
    3    3
    4    4
    5    5
    6    6
    7    7
    8    8
    9    9
    dtype: int32
    Type of s1: <class 'pandas.core.series.Series'>

Creating a Series from a dictionary

    dict1 = {'a': 1, 'b': 2, 'c': 3, 'd': 4, 'e': 5}
    print("Dictionary dict1:", dict1)
    print("Type of dict1:", type(dict1))
    s2 = pd.Series(dict1)   # the dictionary keys become the index labels
    print("Series s2:", s2)
    print("Type of s2:", type(s2))

Output:

    Dictionary dict1: {'a': 1, 'b': 2, 'c': 3, 'd': 4, 'e': 5}
    Type of dict1: <class 'dict'>
    Series s2: a    1
    b    2
    c    3
    d    4
    e    5
    dtype: int64
    Type of s2: <class 'pandas.core.series.Series'>

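Creating a Series from an existing DataFrame

Because this depends on the DataFrame, which has not been introduced yet, creating a Series from an existing DataFrame is covered after the DataFrame section below.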

1.2 Creating a DataFrame: three methods

Creating a DataFrame from a two-dimensional array

    print("Method 1: creating a DataFrame")
    arr2 = np.arange(12).reshape(4, 3)
    print("Array 2:", arr2)
    print("Type of array 2", type(arr2))
    df1 = pd.DataFrame(arr2)
    print("DataFrame 1:\n", df1)
    print("Type of DataFrame 1:", type(df1))

Output:

    Method 1: creating a DataFrame
    Array 2: [[ 0  1  2]
     [ 3  4  5]
     [ 6  7  8]
     [ 9 10 11]]
    Type of array 2 <class 'numpy.ndarray'>
    DataFrame 1:
        0   1   2
    0  0   1   2
    1  3   4   5
    2  6   7   8
    3  9  10  11
    Type of DataFrame 1: <class 'pandas.core.frame.DataFrame'>

Creating a DataFrame from a dictionary of lists

    print("Method 2: creating a DataFrame")
    dict2 = {'a': [1, 2, 3, 4], 'b': [5, 6, 7, 8], 'c': [9, 10, 11, 12], 'd': [13, 14, 15, 16]}
    print("Dictionary 2 (dictionary of lists):", dict2)
    print("Type of dictionary 2", type(dict2))
    df2 = pd.DataFrame(dict2)   # each key becomes a column
    print("DataFrame 2:\n", df2)
    print("Type of DataFrame 2:", type(df2))

Output:

    Method 2: creating a DataFrame
    Dictionary 2 (dictionary of lists): {'a': [1, 2, 3, 4], 'b': [5, 6, 7, 8], 'c': [9, 10, 11, 12], 'd': [13, 14, 15, 16]}
    Type of dictionary 2 <class 'dict'>
    DataFrame 2:
        a  b   c   d
    0  1  5   9  13
    1  2  6  10  14
    2  3  7  11  15
    3  4  8  12  16
    Type of DataFrame 2: <class 'pandas.core.frame.DataFrame'>

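Creating a DataFrame from a nested dictionary

    dict3 = {'one': {'a': 1, 'b': 2, 'c': 3, 'd': 4},
             'two': {'a': 5, 'b': 6, 'c': 7, 'd': 8},
             'three': {'a': 9, 'b': 10, 'c': 11, 'd': 12}}
    print("Dictionary 3 (nested dictionary):", dict3)
    print("Type of dictionary 3", type(dict3))
    df3 = pd.DataFrame(dict3)   # outer keys become columns, inner keys become row labels
    print("DataFrame 3:\n", df3)
    print("Type of DataFrame 3:", type(df3))

Output:

    Dictionary 3 (nested dictionary): {'one': {'a': 1, 'b': 2, 'c': 3, 'd': 4}, 'two': {'a': 5, 'b': 6, 'c': 7, 'd': 8}, 'three': {'a': 9, 'b': 10, 'c': 11, 'd': 12}}
    Type of dictionary 3 <class 'dict'>
    DataFrame 3:
        one  three  two
    a    1      9    5
    b    2     10    6
    c    3     11    7
    d    4     12    8
    Type of DataFrame 3: <class 'pandas.core.frame.DataFrame'>

This output comes from an older pandas version that sorted the column names alphabetically. Recent versions keep the dictionary's insertion order, so the columns appear as one, two, three.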

Now that the DataFrame has been introduced, here is how to create a Series from one.

    s3 = df3['one']   # take the first column of DataFrame 3
    print("Series 3:\n", s3)
    print("Type of Series 3:", type(s3))
    print("------------------------------------------------")
    # Take the first row of DataFrame 3 by position with iloc.
    # df3['a'] would not work here, because 'a' is a row label and [] looks up columns.
    s4 = df3.iloc[0]
    print("Series 4:\n", s4)
    print("Type of Series 4:", type(s4))

Output:

    Series 3:
    a    1
    b    2
    c    3
    d    4
    Name: one, dtype: int64
    Type of Series 3: <class 'pandas.core.series.Series'>
    ------------------------------------------------
    Series 4:
    one      1
    three    9
    two      5
    Name: a, dtype: int64
    Type of Series 4: <class 'pandas.core.series.Series'>

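2 The index

Whether you are looking at a DataFrame or a Series, the leftmost column is always something that is not part of the original data. This is the index, introduced next. The index is used to retrieve the target data and to carry out operations on it.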

2.1 Retrieving data by index position or index label

    s5 = pd.Series(np.array([1, 2, 3, 4, 5, 6]))
    print(s5)   # if no index is given, the Series gets an integer index that starts at 0

Output:

    0    1
    1    2
    2    3
    3    4
    4    5
    5    6
    dtype: int32

Reading the index of a Series through the index attribute

    s5.index
    RangeIndex(start=0, stop=6, step=1)

Assigning a new index

    s5.index = ['a', 'b', 'c', 'd', 'e', 'f']
    s5
    a    1
    b    2
    c    3
    d    4
    e    5
    f    6
    dtype: int32

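Retrieving data through the index

    s5.iloc[3]   # by position; plain s5[3] on a label index is deprecated in recent pandas versions
    4

    s5['e']      # by label
    5

    s5.iloc[[1, 3, 5]]   # several positions at once
    b    2
    d    4
    f    6
    dtype: int32

    s5[:4]       # positional slice, end position excluded
    a    1
    b    2
    c    3
    d    4
    dtype: int32

    s5['c':]     # label slice
    c    3
    d    4
    e    5
    f    6
    dtype: int32

    s5['b':'e']  # label slice: the value at the end label is returned as well
    b    2
    c    3
    d    4
    e    5
    dtype: int32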
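The difference between the two slicing styles is easy to miss, so here is a minimal sketch that puts them side by side. The Series `s` mirrors `s5` from the text; the names `by_label` and `by_position` are only illustrative. It uses the explicit `.loc` and `.iloc` accessors, which make it unambiguous whether labels or positions are meant.

```python
import pandas as pd

s = pd.Series([1, 2, 3, 4, 5, 6], index=list("abcdef"))

# Label slice: both end labels are included.
by_label = s.loc["b":"e"]

# Positional slice: the end position is excluded, as with Python lists.
by_position = s.iloc[1:4]

print(list(by_label.index))     # ['b', 'c', 'd', 'e']
print(list(by_position.index))  # ['b', 'c', 'd']
```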

2.2 Automatic alignment

    # Arithmetic between two Series whose indexes differ
    s6 = pd.Series(np.array([10, 15, 20, 30, 55, 80]), index=['a', 'b', 'c', 'd', 'e', 'f'])
    print("Series 6:", s6)
    s7 = pd.Series(np.array([12, 11, 13, 15, 14, 16]), index=['a', 'c', 'g', 'b', 'd', 'f'])
    print("Series 7:", s7)

    print(s6 + s7)
    # s6 has no label g and s7 has no label e, so the operation produces two missing values (NaN).
    # Note that the arithmetic aligned the two Series on their index labels automatically.
    # For DataFrames, alignment applies to the column index as well as the row index;
    # a DataFrame is effectively the two-dimensional generalisation.
    print(s6 / s7)

Output:

    Series 6: a    10
    b    15
    c    20
    d    30
    e    55
    f    80
    dtype: int32
    Series 7: a    12
    c    11
    g    13
    b    15
    d    14
    f    16
    dtype: int32
    a    22.0
    b    30.0
    c    31.0
    d    44.0
    e     NaN
    f    96.0
    g     NaN
    dtype: float64
    a    0.833333
    b    1.000000
    c    1.818182
    d    2.142857
    e         NaN
    f    5.000000
    g         NaN
    dtype: float64
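The text mentions that DataFrames align on both rows and columns but does not show it. The sketch below illustrates that behaviour with two small made-up frames (`left` and `right` are hypothetical names and values). It also shows `DataFrame.add` with `fill_value=0`, one possible way to avoid NaN where only one side has a value.

```python
import pandas as pd

# Two hypothetical frames that overlap only in row "b" and column "y".
left = pd.DataFrame({"x": [1, 2], "y": [3, 4]}, index=["a", "b"])
right = pd.DataFrame({"y": [10, 20], "z": [30, 40]}, index=["b", "c"])

# Plain addition aligns on row labels AND column labels.
# Only the cell (b, y) exists in both frames: 4 + 10 = 14; every other cell is NaN.
total = left + right
print(total)

# With fill_value=0, a cell present on one side only is treated as 0 on the other.
# Cells missing on both sides, such as (a, z), still come out as NaN.
filled = left.add(right, fill_value=0)
print(filled)
```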

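3 Querying data with pandas

Boolean indexing selects a targeted subset of the original data, such as particular rows or particular columns.

    test_data = pd.read_csv('test_set.csv')
    # test_data.drop(['ID'], inplace=True, axis=1)
    test_data.head()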

Converting non-numeric features to numbers

    # Replace each text category with an integer code, shifted to start at 1 instead of 0
    text_columns = ['job', 'marital', 'education', 'default', 'housing',
                    'loan', 'contact', 'month', 'poutcome']
    for col in text_columns:
        test_data[col], jnum = pd.factorize(test_data[col])
        test_data[col] = test_data[col] + 1
    test_data.head()
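Since `test_set.csv` is not included here, the following sketch shows what `pd.factorize` returns on a tiny made-up column (the job names are illustrative, not taken from the dataset). It returns a pair: integer codes in order of first appearance, and the unique values those codes refer to.

```python
import pandas as pd

# Hypothetical category values, for illustration only.
jobs = pd.Series(["admin", "technician", "admin", "services"])

codes, uniques = pd.factorize(jobs)
print(codes)          # [0 1 0 2]
print(list(uniques))  # ['admin', 'technician', 'services']

# The tutorial shifts the codes by one so that they start at 1.
shifted = codes + 1
print(shifted)        # [1 2 1 3]
```

One thing to keep in mind: by default `factorize` marks missing values with the code -1, so after adding 1 they would show up as 0.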

The first 5 rows

    test_data.head()

The last 5 rows

    test_data.tail()

Specific rows

    test_data.iloc[[0, 2, 4, 5, 7]]

Specific columns

    test_data[['age', 'job', 'marital']].head()

Specific rows and columns

    test_data.loc[[0, 2, 4, 5, 7], ['age', 'job', 'marital']]
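In the queries above, `iloc` and `loc` happen to pick the same rows because the default index labels equal the row positions. A minimal sketch with a made-up frame (`people` and its values are hypothetical) whose labels deliberately differ from the positions makes the distinction visible:

```python
import pandas as pd

# Hypothetical data; the index labels 10, 20, 30 are not the positions 0, 1, 2.
people = pd.DataFrame(
    {"age": [51, 32, 45], "job": [5, 2, 7]},
    index=[10, 20, 30],
)

first_by_position = people.iloc[0]      # the row at position 0 (label 10)
row_by_label = people.loc[20]           # the row whose label is 20 (position 1)
subset = people.loc[[10, 30], ["age"]]  # rows by label, columns by name

print(first_by_position["age"])  # 51
print(row_by_label["job"])       # 2
print(subset)
```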

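Rows where age is 51

    # A subset query implemented with boolean indexing
    test_data[test_data['age'] == 51].head()

Rows where age is 51 and the job code is 5 or above

    test_data[(test_data['age'] == 51) & (test_data['job'] >= 5)].head()

Rows where age is 51 and the job code is 5 or above, keeping only selected columns

    # Keep only education, housing, loan, contact and poutcome
    test_data.loc[(test_data['age'] == 51) & (test_data['job'] >= 5),
                  ['education', 'housing', 'loan', 'contact', 'poutcome']].head()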

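As you can see, when a query combines several conditions, each condition on either side of & or | must be wrapped in parentheses. This is because & and | bind more tightly than comparison operators such as == and >=.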
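The parenthesised form can be tried without the CSV file. The sketch below uses a small made-up frame (`people` and its values are hypothetical) and builds the boolean mask as a separate variable, which some readers find easier to follow.

```python
import pandas as pd

# Hypothetical data, for illustration only.
people = pd.DataFrame({"age": [51, 51, 30, 51], "job": [3, 6, 8, 5]})

# Each comparison yields a boolean Series; & combines them element-wise (AND).
mask = (people["age"] == 51) & (people["job"] >= 5)
matches = people[mask]
print(matches)  # rows 1 and 3

# | is the element-wise OR.
either = people[(people["age"] == 51) | (people["job"] >= 5)]
print(len(either))  # 4, every row satisfies at least one condition
```

Without the parentheses, `people["age"] == 51 & people["job"] >= 5` is parsed as a chained comparison around `51 & people["job"]`, which is not the intended query and typically ends in an error.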

4 Statistical analysis of DataFrames

Pandas provides many functions for descriptive statistics, including the sum, mean, minimum and maximum.

    a = np.random.normal(size=10)
    d1 = pd.Series(2 * a + 3)
    d2 = np.random.f(2, 4, size=10)
    d3 = np.random.randint(1, 100, size=10)
    print(d1)
    print(d2)
    print(d3)

Output:

    0    5.811077
    1    2.963418
    2    2.295078
    3    0.279647
    4    6.564293
    5    1.146455
    6    1.903623
    7    1.157710
    8    2.921304
    9    2.397009
    dtype: float64
    [0.18147396 0.48218962 0.42565903 0.10258942 0.55299842 0.10859328
     0.66923199 1.18542009 0.12053079 4.64172891]
    [33 17 71 45 33 83 68 41 69 23]
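The random draws above are not seeded, so the numbers change on every run. Here is a minimal sketch of one way to make them repeatable, assuming NumPy's Generator API (`np.random.default_rng`, available from NumPy 1.17). The seed value 42 and the names `rng`, `rng_again` and `d1_again` are arbitrary choices for illustration and are not part of the original tutorial.

```python
import numpy as np
import pandas as pd

# A seeded generator produces the same sequence on every run.
rng = np.random.default_rng(42)  # 42 is an arbitrary seed
a = rng.normal(size=10)
d1 = pd.Series(2 * a + 3)

# A second generator with the same seed reproduces the same draws.
rng_again = np.random.default_rng(42)
d1_again = pd.Series(2 * rng_again.normal(size=10) + 3)

print(d1.equals(d1_again))  # True
```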

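Number of non-null elements

    d1.count()
    10

Minimum

    d1.min()
    0.6149265534311872

Maximum

    d1.max()
    6.217953512253818

Position of the minimum

    d1.idxmin()
    8

Position of the maximum

    d1.idxmax()
    1

10% quantile

    d1.quantile(0.1)
    1.4006153623854274

Note: no random seed is set, so d1 differs on every run. The minimum, maximum, idxmin, idxmax and 10% quantile shown here do not match the d1 printed earlier, whose minimum is 0.279647 at index 3 and whose maximum is 6.564293 at index 4 (as the describe() and stats(d1) outputs further down confirm), so they evidently come from a different run.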
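How `quantile` arrives at its number may not be obvious. By default pandas interpolates linearly between the two nearest sorted values. A small worked sketch on made-up data (`values` is hypothetical) where the arithmetic can be checked by hand:

```python
import pandas as pd

# Hypothetical, already sorted data so the arithmetic is easy to follow.
values = pd.Series([10.0, 20.0, 30.0, 40.0, 50.0])

# The 10% quantile sits at fractional position 0.1 * (5 - 1) = 0.4,
# i.e. 40% of the way from the first value (10) to the second (20):
# 10 + 0.4 * (20 - 10) = 14.
q10 = values.quantile(0.1)
print(q10)

# idxmin and idxmax return the index LABEL of the extreme value, not the value itself.
print(values.idxmin(), values.idxmax())
```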

Sum

    d1.sum()
    27.43961378467516

Mean

    d1.mean()
    2.743961378467515

Median

    d1.median()
    2.3460435427041384

Mode

    d1.mode()
    0    0.279647
    1    1.146455
    2    1.157710
    3    1.903623
    4    2.295078
    5    2.397009
    6    2.921304
    7    2.963418
    8    5.811077
    9    6.564293
    dtype: float64

Every value in d1 occurs exactly once, so mode() returns all ten values, in sorted order.

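Variance

    d1.var()
    4.027871738323722

Standard deviation

    d1.std()
    2.0069558386580715

Mean absolute deviation

    # Series.mad() was removed in pandas 2.0; this expression computes the same quantity
    (d1 - d1.mean()).abs().mean()
    1.456849211331346

Skewness

    d1.skew()
    1.0457755613918738

Kurtosis

    d1.kurt()
    0.39322767370407874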

Printing several descriptive statistics at once

    d1.describe()
    count    10.000000
    mean      2.743961
    std       2.006956
    min       0.279647
    25%       1.344189
    50%       2.346044
    75%       2.952890
    max       6.564293
    dtype: float64

    # A custom function that gathers all of these descriptive statistics in one place
    def stats(x):
        # Series.mad() was removed in pandas 2.0, so the mean absolute deviation is computed directly
        mad = (x - x.mean()).abs().mean()
        return pd.Series(
            [x.count(), x.min(), x.idxmin(), x.quantile(.25), x.median(),
             x.quantile(.75), x.mean(), x.max(), x.idxmax(), mad,
             x.var(), x.std(), x.skew(), x.kurt()],
            index=['Count', 'Min', 'Which_Min', 'Q1', 'Median', 'Q3', 'Mean',
                   'Max', 'Which_Max', 'Mad', 'Var', 'Std', 'Skew', 'Kurt'])

    stats(d1)
    Count        10.000000
    Min           0.279647
    Which_Min     3.000000
    Q1            1.344189
    Median        2.346044
    Q3            2.952890
    Mean          2.743961
    Max           6.564293
    Which_Max     4.000000
    Mad           1.456849
    Var           4.027872
    Std           2.006956
    Skew          1.045776
    Kurt          0.393228
    dtype: float64

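For numeric data, these descriptive statistics show the range, magnitude and spread of the values, which helps when deciding what kind of model suits the data later on.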

    # In real work we usually deal with a whole DataFrame of numeric columns;
    # apply can be used to run the stats function on every column.
    df = pd.DataFrame(np.array([d1, d2, d3]).T, columns=['x1', 'x2', 'x3'])  # build a DataFrame from d1, d2, d3 created earlier
    print(df.head())
    df.apply(stats)

Output:

             x1        x2    x3
    0  5.811077  0.181474  33.0
    1  2.963418  0.482190  17.0
    2  2.295078  0.425659  71.0
    3  0.279647  0.102589  45.0
    4  6.564293  0.552998  33.0

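The approach above makes it easy to build descriptive statistics for numeric data, but it cannot be used for discrete (categorical) data. To get the number of observations, the number of unique values, the most frequent level and its count for a discrete variable, the describe method is all that is needed.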

    train_data = pd.read_csv('train_set.csv')
    # test_data.drop(['ID'], inplace=True, axis=1)
    train_data.head()

    train_data['job'].describe()  # description of discrete data
    count           25317
    unique             12
    top       blue-collar
    freq             5456
    Name: job, dtype: object

    test_data['job'].describe()  # description of numeric data
    count    10852.000000
    mean         5.593255
    std          2.727318
    min          1.000000
    25%          3.000000
    50%          6.000000
    75%          8.000000
    max         12.000000
    Name: job, dtype: float64
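The two outputs above differ because `describe` adapts to the data type. This can be reproduced without the CSV files using a short made-up column (the values below are illustrative, not taken from `train_set.csv`):

```python
import pandas as pd

# Hypothetical text column, for illustration only.
jobs = pd.Series(["blue-collar", "admin", "blue-collar", "services"])

# On text data, describe() reports count, unique, top (most frequent value) and freq.
text_summary = jobs.describe()
print(text_summary)

# After factorizing to integer codes, the same call reports numeric statistics instead.
codes = pd.Series(pd.factorize(jobs)[0] + 1)
numeric_summary = codes.describe()
print(numeric_summary)
```

Note that the numeric summary of factorized codes (mean, quartiles and so on) describes arbitrary labels rather than quantities, so it should be read with care.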

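Beyond these simple descriptive statistics, pandas also computes the correlation coefficient (corr) and the covariance (cov) for continuous variables.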

    df
    df.corr()  # the method can be pearson, kendall or spearman; pearson is the default
    df.corr('spearman')
    df.corr('pearson')
    df.corr('kendall')
    # To look only at how one variable correlates with the others, use corrwith;
    # here only the correlations between x1 and the other variables are computed.
    df.corrwith(df['x1'])
    x1    1.000000
    x2   -0.075466
    x3   -0.393609
    dtype: float64

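To see how the choice of method matters, here is a minimal sketch on made-up data (the frame `data` and its columns are hypothetical) in which y grows monotonically but not linearly with x. The Spearman coefficient, which is rank-based, is exactly 1, while the Pearson coefficient, which measures linear association, is lower.

```python
import pandas as pd

# Hypothetical data: y = x cubed, a monotonic but non-linear relationship.
data = pd.DataFrame({"x": [1, 2, 3, 4, 5]})
data["y"] = data["x"] ** 3

pearson = data.corr()  # the default method is pearson
spearman = data.corr(method="spearman")

print(pearson.loc["x", "y"])   # below 1, because the relationship is not linear
print(spearman.loc["x", "y"])  # 1.0, because the ranks agree perfectly

# corrwith compares every column with a single Series.
with_x = data.corrwith(data["x"])
print(with_x)
```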
OK, that is all for today's hands-on pandas walkthrough. The remaining material will be covered in the next part.
