Common Python DataFrame operations (repost: working with DataFrames in Python)

Published: 2024/4/20 · python · 豆豆


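Following on from the two articles shared earlier, "Python Data Analysis: Learning numpy (Part 1)" and "Python Data Analysis: Learning numpy (Part 2)", we continue with data analysis using Python's pandas module. Over the next two pandas installments we will cover the following eight topics:

1. Data structures: DataFrame and Series
2. The data index
3. Querying data with pandas
4. Statistical analysis with pandas DataFrames
5. SQL-style operations with pandas
6. Handling missing values with pandas
7. Excel-style pivot tables with pandas
8. Using multi-level indexes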

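I. Data structures

pandas has two very important data structures: the Series and the DataFrame. A Series resembles a one-dimensional numpy array. The functions and methods that work on one-dimensional arrays generally work on it too, and in addition its values can be retrieved by index label and its index is aligned automatically during operations. A DataFrame resembles a two-dimensional numpy array. It can likewise use numpy array functions and methods, and it has further flexible uses that are covered later.

1. Creating a Series

There are three main ways to create a Series:

1) From a one-dimensional array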

In [1]: import numpy as np, pandas as pd

In [2]: arr1 = np.arange(10)

In [3]: arr1
Out[3]: array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])

In [4]: type(arr1)
Out[4]: numpy.ndarray

This returns an array.

In [5]: s1 = pd.Series(arr1)

In [6]: s1
Out[6]:
0    0
1    1
2    2
3    3
4    4
5    5
6    6
7    7
8    8
9    9
dtype: int32

In [7]: type(s1)
Out[7]: pandas.core.series.Series

This returns a Series.

2) From a dictionary

In [8]: dic1 = {'a':10,'b':20,'c':30,'d':40,'e':50}

In [9]: dic1
Out[9]: {'a': 10, 'b': 20, 'c': 30, 'd': 40, 'e': 50}

In [10]: type(dic1)
Out[10]: dict

This returns a dictionary.

In [11]: s2 = pd.Series(dic1)

In [12]: s2
Out[12]:
a    10
b    20
c    30
d    40
e    50
dtype: int64

In [13]: type(s2)
Out[13]: pandas.core.series.Series

This returns a Series.

3) From a row or a column of a DataFrame

This is covered later. Next, let us look at how to construct a DataFrame.
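The array and dictionary routes can be condensed into a short sketch. The variable names and the explicit `index=` argument are illustrative additions and do not come from the original session.

```python
import numpy as np
import pandas as pd

# From a one-dimensional array: pandas supplies a 0..n-1 index by default.
from_array = pd.Series(np.arange(5))

# Same data, but with explicit labels passed through the index argument.
labelled = pd.Series(np.arange(5), index=['a', 'b', 'c', 'd', 'e'])

# From a dictionary: keys become the index labels, values become the data.
from_dict = pd.Series({'a': 10, 'b': 20, 'c': 30})

print(list(from_array.index))   # [0, 1, 2, 3, 4]
print(labelled['c'])            # 2
print(from_dict['b'])           # 20
```

Passing `index=` at construction time gives the same result as assigning to `.index` afterwards, which is the approach used later in this article.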

2. Creating a DataFrame

There are three main ways to create a DataFrame:

1) From a two-dimensional array

In [14]: arr2 = np.array(np.arange(12)).reshape(4,3)

In [15]: arr2
Out[15]:
array([[ 0,  1,  2],
       [ 3,  4,  5],
       [ 6,  7,  8],
       [ 9, 10, 11]])

In [16]: type(arr2)
Out[16]: numpy.ndarray

This returns an array.

In [17]: df1 = pd.DataFrame(arr2)

In [18]: df1
Out[18]:
   0   1   2
0  0   1   2
1  3   4   5
2  6   7   8
3  9  10  11

In [19]: type(df1)
Out[19]: pandas.core.frame.DataFrame

This returns a DataFrame.

2) From a dictionary

Two kinds of dictionary are used below: a dictionary of lists and a nested dictionary.

In [20]: dic2 = {'a':[1,2,3,4],'b':[5,6,7,8],
    ...:         'c':[9,10,11,12],'d':[13,14,15,16]}

In [21]: dic2
Out[21]:
{'a': [1, 2, 3, 4],
 'b': [5, 6, 7, 8],
 'c': [9, 10, 11, 12],
 'd': [13, 14, 15, 16]}

In [22]: type(dic2)
Out[22]: dict

This returns a dictionary.

In [23]: df2 = pd.DataFrame(dic2)

In [24]: df2
Out[24]:
   a  b   c   d
0  1  5   9  13
1  2  6  10  14
2  3  7  11  15
3  4  8  12  16

In [25]: type(df2)
Out[25]: pandas.core.frame.DataFrame

This returns a DataFrame.

In [26]: dic3 = {'one':{'a':1,'b':2,'c':3,'d':4},
    ...:         'two':{'a':5,'b':6,'c':7,'d':8},
    ...:         'three':{'a':9,'b':10,'c':11,'d':12}}

In [27]: dic3
Out[27]:
{'one': {'a': 1, 'b': 2, 'c': 3, 'd': 4},
 'three': {'a': 9, 'b': 10, 'c': 11, 'd': 12},
 'two': {'a': 5, 'b': 6, 'c': 7, 'd': 8}}

In [28]: type(dic3)
Out[28]: dict

This returns a dictionary.

In [29]: df3 = pd.DataFrame(dic3)

In [30]: df3
Out[30]:
   one  three  two
a    1      9    5
b    2     10    6
c    3     11    7
d    4     12    8

In [31]: type(df3)
Out[31]: pandas.core.frame.DataFrame

This returns a DataFrame. Note that when a DataFrame is built from a nested dictionary, the outermost keys become the columns (variables) of the DataFrame and the inner keys become its row index.
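The outer-key and inner-key rule can be checked on a smaller nested dictionary. This is a minimal sketch; the names `nested`, `by_column`, and `by_row` are made up for illustration.

```python
import pandas as pd

nested = {'one': {'a': 1, 'b': 2},
          'two': {'a': 5, 'b': 6}}

# Outer keys ('one', 'two') become columns; inner keys ('a', 'b') become rows.
by_column = pd.DataFrame(nested)

# Transposing swaps the roles, so the outer keys become the row index instead.
by_row = by_column.T

print(by_column.loc['b', 'two'])   # 6
print(by_row.loc['two', 'b'])      # 6
```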

3) From another DataFrame

In [32]: df4 = df3[['one','three']]

In [33]: df4
Out[33]:
   one  three
a    1      9
b    2     10
c    3     11
d    4     12

In [34]: type(df4)
Out[34]: pandas.core.frame.DataFrame

This returns a DataFrame.

In [35]: s3 = df3['one']

In [36]: s3
Out[36]:
a    1
b    2
c    3
d    4
Name: one, dtype: int64

In [37]: type(s3)
Out[37]: pandas.core.series.Series

Here, selecting a single column of the DataFrame returns a Series object.
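The difference between single and double brackets is easy to miss, so here is a small sketch of it. The frame and its values are invented for illustration. The row selection with `loc` shows the "row of a DataFrame" route to a Series that was mentioned earlier.

```python
import pandas as pd

frame = pd.DataFrame({'one': [1, 2, 3], 'two': [4, 5, 6]},
                     index=['a', 'b', 'c'])

as_series = frame['one']      # single brackets: one column as a Series
as_frame = frame[['one']]     # double brackets: a one-column DataFrame
row_series = frame.loc['b']   # one row, also returned as a Series

print(type(as_series).__name__)   # Series
print(type(as_frame).__name__)    # DataFrame
print(row_series['two'])          # 5
```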

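II. The data index

Attentive readers may have noticed that, for Series and DataFrames alike, the leftmost part of the printed object is always something that is not part of the original data. What is it? It is the index, which we introduce next.

In my view, the index of a Series or DataFrame serves two main purposes. One is retrieving target data by index position or index label. The other is making computations and operations on Series or DataFrames align automatically. Let us look at both in turn.

1. Retrieving data by index position or label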

In [38]: s4 = pd.Series(np.array([1,1,2,3,5,8]))

In [39]: s4
Out[39]:
0    1
1    1
2    2
3    3
4    5
5    8
dtype: int32

If no index is specified, a Series automatically gets an auto-incrementing index that starts at 0. The index can be inspected through the index attribute:

In [40]: s4.index
Out[40]: RangeIndex(start=0, stop=6, step=1)

Now assign a custom index to the Series:

In [41]: s4.index = ['a','b','c','d','e','f']

In [42]: s4
Out[42]:
a    1
b    1
c    2
d    3
e    5
f    8
dtype: int32

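With an index in place, data can be retrieved by position or by label:

In [43]: s4[3]   # by position; newer pandas versions recommend s4.iloc[3]
Out[43]: 3

In [44]: s4['e']
Out[44]: 5

In [45]: s4[[1,3,5]]   # by position; newer pandas versions recommend s4.iloc[[1,3,5]]
Out[45]:
b    1
d    3
f    8
dtype: int32

In [46]: s4[['a','b','d','f']]
Out[46]:
a    1
b    1
d    3
f    8
dtype: int32

In [47]: s4[:4]
Out[47]:
a    1
b    1
c    2
d    3
dtype: int32

In [48]: s4['c':]
Out[48]:
c    2
d    3
e    5
f    8
dtype: int32

In [49]: s4['b':'e']
Out[49]:
b    1
c    2
d    3
e    5
dtype: int32

Take careful note: when slicing by index label, the value at the end label is included in the result! A one-dimensional array cannot be accessed by label at all, which is one way a Series differs from a one-dimensional array.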
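The inclusive end label is easiest to see next to a positional slice of the same Series. The sketch below uses the explicit `loc` (label) and `iloc` (position) indexers so that the intent of each slice is unambiguous. The name `s` is illustrative.

```python
import pandas as pd

s = pd.Series([1, 1, 2, 3, 5, 8], index=list('abcdef'))

by_label = s.loc['b':'e']     # label slice: both 'b' and 'e' are included
by_position = s.iloc[1:4]     # positional slice: stops before position 4

print(list(by_label.index))      # ['b', 'c', 'd', 'e']
print(list(by_position.index))   # ['b', 'c', 'd']
```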

2. Automatic alignment

When two Series take part in an arithmetic operation, the index shows its value: automatic alignment.

In [50]: s5 = pd.Series(np.array([10,15,20,30,55,80]),
    ...:                index = ['a','b','c','d','e','f'])

In [51]: s5
Out[51]:
a    10
b    15
c    20
d    30
e    55
f    80
dtype: int32

In [52]: s6 = pd.Series(np.array([12,11,13,15,14,16]),
    ...:                index = ['a','c','g','b','d','f'])

In [53]: s6
Out[53]:
a    12
c    11
g    13
b    15
d    14
f    16
dtype: int32

In [54]: s5 + s6
Out[54]:
a    22.0
b    30.0
c    31.0
d    44.0
e     NaN
f    96.0
g     NaN
dtype: float64

In [55]: s5/s6
Out[55]:
a    0.833333
b    1.000000
c    1.818182
d    2.142857
e         NaN
f    5.000000
g         NaN
dtype: float64

Because s5 has no label g and s6 has no label e, the operation produces two missing values (NaN). Note that the result is aligned on the index labels of the two Series, rather than simply adding or dividing the values in the order they are stored. For DataFrames, alignment applies not only to the row index but also to the column index (variable names).

DataFrames have indexes too. Because a DataFrame generalizes the two-dimensional array, it has a column index in addition to the row index, and its indexes are far more powerful in use than those of a Series. This is covered in the data querying section below.
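Two follow-ups to the alignment behaviour, sketched under the assumption that missing labels should sometimes count as zero: the `add` method with `fill_value` avoids the NaN results, and a pair of small made-up DataFrames (`left`, `right`) shows alignment on rows and columns at once.

```python
import pandas as pd

s5 = pd.Series([10, 15, 20, 30, 55, 80], index=list('abcdef'))
s6 = pd.Series([12, 11, 13, 15, 14, 16], index=list('acgbdf'))

# Plain + leaves NaN wherever a label exists on only one side.
plain = s5 + s6

# add() with fill_value treats the missing side as 0 instead.
filled = s5.add(s6, fill_value=0)

print(plain.isna().sum())         # 2
print(filled['e'], filled['g'])   # 55.0 13.0

# DataFrames align on the row index and the column index together.
left = pd.DataFrame({'x': [1, 2], 'y': [3, 4]}, index=['r1', 'r2'])
right = pd.DataFrame({'y': [10, 20], 'z': [30, 40]}, index=['r2', 'r3'])
both = left + right

# Only the cell present in both frames (row 'r2', column 'y') has a value.
print(both.loc['r2', 'y'])        # 14.0
```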

III. Querying data with pandas

Querying data here corresponds to the subset function in R: boolean indexing can select a targeted subset of the original data, specific rows, specific columns, and so on. First we load a student data set:

In [56]: student = pd.read_csv('C:\\Users\\admin\\Desktop\\student.csv')

Show the first 5 or the last 5 rows:

In [57]: student.head()
Out[57]:
      Name Sex  Age  Height  Weight
0   Alfred   M   14    69.0   112.5
1    Alice   F   13    56.5    84.0
2  Barbara   F   13    65.3    98.0
3    Carol   F   14    62.8   102.5
4    Henry   M   14    63.5   102.5

In [58]: student.tail()
Out[58]:
       Name Sex  Age  Height  Weight
14   Philip   M   16    72.0   150.0
15   Robert   M   12    64.8   128.0
16   Ronald   M   15    67.0   133.0
17   Thomas   M   11    57.5    85.0
18  William   M   15    66.5   112.0

Select specific rows:

In [59]: student.loc[[0,2,4,5,7]] # the loc indexer must be followed by square brackets []
Out[59]:
      Name Sex  Age  Height  Weight
0   Alfred   M   14    69.0   112.5
2  Barbara   F   13    65.3    98.0
4    Henry   M   14    63.5   102.5
5    James   M   12    57.3    83.0
7    Janet   F   15    62.5   112.5

Select specific columns:

In [60]: student[['Name','Height','Weight']].head()  # multiple columns require double square brackets
Out[60]:
      Name  Height  Weight
0   Alfred    69.0   112.5
1    Alice    56.5    84.0
2  Barbara    65.3    98.0
3    Carol    62.8   102.5
4    Henry    63.5   102.5

Specific columns can also be selected with the loc indexer:

In [61]: student.loc[:,['Name','Height','Weight']].head()
Out[61]:
      Name  Height  Weight
0   Alfred    69.0   112.5
1    Alice    56.5    84.0
2  Barbara    65.3    98.0
3    Carol    62.8   102.5
4    Henry    63.5   102.5

Select specific rows and columns:

In [62]: student.loc[[0,2,4,5,7],['Name','Height','Weight']].head()
Out[62]:
      Name  Height  Weight
0   Alfred    69.0   112.5
2  Barbara    65.3    98.0
4    Henry    63.5   102.5
5    James    57.3    83.0
7    Janet    62.5   112.5

A brief note on using loc: df.loc[row labels, column labels]. (Older pandas versions offered the ix indexer for this; it was removed in pandas 1.0.)

1) loc must be followed by square brackets
2) multiple row labels or column labels must be wrapped in square brackets
3) to select all rows or all columns, use a colon :
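A sketch of the same kind of row and column selection using the two indexers available in current pandas: `loc` selects by label and `iloc` selects by integer position. The five-row `student` frame below is a stand-in built from the first rows shown above, not the full CSV file.

```python
import pandas as pd

# A small stand-in for the student data (first five rows only).
student = pd.DataFrame({
    'Name': ['Alfred', 'Alice', 'Barbara', 'Carol', 'Henry'],
    'Sex': ['M', 'F', 'F', 'F', 'M'],
    'Age': [14, 13, 13, 14, 14],
    'Height': [69.0, 56.5, 65.3, 62.8, 63.5],
    'Weight': [112.5, 84.0, 98.0, 102.5, 102.5],
})

# loc selects by label: row labels 0, 2, 4 and two named columns.
by_label = student.loc[[0, 2, 4], ['Name', 'Height']]

# iloc selects by position: rows 0, 2, 4 and column positions 0 and 3.
by_position = student.iloc[[0, 2, 4], [0, 3]]

print(by_label.equals(by_position))   # True
print(by_label.shape)                 # (3, 2)
```

With the default integer index the row labels and row positions coincide, which is why both selections agree here. After filtering or re-indexing they can differ, and then the choice between `loc` and `iloc` matters.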

The queries above select subsets from the row or column perspective. Now let us see how boolean indexing selects a subset of the data.

All information for the female students:

In [63]: student[student['Sex']=='F']
Out[63]:
       Name Sex  Age  Height  Weight
1     Alice   F   13    56.5    84.0
2   Barbara   F   13    65.3    98.0
3     Carol   F   14    62.8   102.5
6      Jane   F   12    59.8    84.5
7     Janet   F   15    62.5   112.5
10    Joyce   F   11    51.3    50.5
11     Judy   F   14    64.3    90.0
12   Louise   F   12    56.3    77.0
13     Mary   F   15    66.5   112.0

All female students older than 12:

In [64]: student[(student['Sex']=='F') & (student['Age']>12)]
Out[64]:
       Name Sex  Age  Height  Weight
1     Alice   F   13    56.5    84.0
2   Barbara   F   13    65.3    98.0
3     Carol   F   14    62.8   102.5
7     Janet   F   15    62.5   112.5
11     Judy   F   14    64.3    90.0
13     Mary   F   15    66.5   112.0

Name, height, and weight of all female students older than 12:

In [66]: student[(student['Sex']=='F') & (student['Age']>12)][['Name','Height','Weight']]
Out[66]:
       Name  Height  Weight
1     Alice    56.5    84.0
2   Barbara    65.3    98.0
3     Carol    62.8   102.5
7     Janet    62.5   112.5
11     Judy    64.3    90.0
13     Mary    66.5   112.0

The query logic above is very simple. Note that with multiple conditions, each condition on either side of & (and) or | (or) must be wrapped in parentheses.
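The parentheses are needed because `&` and `|` bind more tightly than comparison operators in Python, so `a == 'F' & b > 12` would not group the way it reads. The sketch below repeats the filter on a five-row stand-in for the student data and also shows two equivalent spellings, a single `loc` call and the `query` method. The variable names are illustrative.

```python
import pandas as pd

# Stand-in data: a few rows taken from the student table above.
student = pd.DataFrame({
    'Name': ['Alfred', 'Alice', 'Jane', 'Joyce', 'Janet'],
    'Sex': ['M', 'F', 'F', 'F', 'F'],
    'Age': [14, 13, 12, 11, 15],
    'Height': [69.0, 56.5, 59.8, 51.3, 62.5],
    'Weight': [112.5, 84.0, 84.5, 50.5, 112.5],
})

# Each comparison is parenthesised before being combined with &.
mask = (student['Sex'] == 'F') & (student['Age'] > 12)
girls_over_12 = student[mask]

# Row filter and column selection in one loc call.
subset = student.loc[mask, ['Name', 'Height', 'Weight']]

# The same condition written as a query string.
with_query = student.query("Sex == 'F' and Age > 12")

print(list(girls_over_12['Name']))   # ['Alice', 'Janet']
print(list(subset.columns))          # ['Name', 'Height', 'Weight']
```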

IV. Statistical analysis

pandas provides a large number of descriptive statistics functions, such as sum, mean, minimum, and maximum. Let us look at them in detail.

First, generate three random data sets:

In [67]: np.random.seed(1234)

In [68]: d1 = pd.Series(2*np.random.normal(size = 100)+3)

In [69]: d2 = np.random.f(2,4,size = 100)

In [70]: d3 = np.random.randint(1,100,size = 100)

In [71]: d1.count()  # number of non-null elements
Out[71]: 100

In [72]: d1.min()    # minimum
Out[72]: -4.1270333212494705

In [73]: d1.max()    # maximum
Out[73]: 7.7819210309260658

In [74]: d1.idxmin() # position of the minimum, similar to which.min in R
Out[74]: 81

In [75]: d1.idxmax() # position of the maximum, similar to which.max in R
Out[75]: 39

In [76]: d1.quantile(0.1)    # 10% quantile
Out[76]: 0.68701846440699277

In [77]: d1.sum()    # sum
Out[77]: 307.0224566250874

In [78]: d1.mean()   # mean
Out[78]: 3.070224566250874

In [79]: d1.median() # median
Out[79]: 3.204555266776845

In [80]: d1.mode()   # mode
Out[80]: Series([], dtype: float64)

In [81]: d1.var()    # variance
Out[81]: 4.005609378535085

In [82]: d1.std()    # standard deviation
Out[82]: 2.0014018533355777

In [83]: d1.mad()    # mean absolute deviation (Series.mad was removed in pandas 2.0)
Out[83]: 1.5112880411556109

In [84]: d1.skew()   # skewness
Out[84]: -0.64947807604842933

In [85]: d1.kurt()   # kurtosis
Out[85]: 1.2201094052398012

In [86]: d1.describe()   # several descriptive statistics at once
Out[86]:
count    100.000000
mean       3.070225
std        2.001402
min       -4.127033
25%        2.040101
50%        3.204555
75%        4.434788
max        7.781921
dtype: float64

Note that the describe method exists only on Series and DataFrames; a one-dimensional array has no such method.
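When the data starts out as a plain numpy array, wrapping it in a Series is enough to get `describe`. A minimal sketch with made-up values:

```python
import numpy as np
import pandas as pd

values = np.array([2.0, 4.0, 4.0, 6.0])

# ndarray has no describe method.
print(hasattr(values, 'describe'))   # False

# Wrapping the array in a Series makes describe available.
summary = pd.Series(values).describe()
print(summary['mean'])   # 4.0
print(summary['50%'])    # 4.0
```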

Here we define a custom function that gathers all of these descriptive statistics in one place:

In [87]: def stats(x):
    ...:     return pd.Series([x.count(),x.min(),x.idxmin(),
    ...:                x.quantile(.25),x.median(),
    ...:                x.quantile(.75),x.mean(),
    ...:                x.max(),x.idxmax(),
    ...:                x.mad(),x.var(),   # mad() requires pandas < 2.0
    ...:                x.std(),x.skew(),x.kurt()],
    ...:               index = ['Count','Min','Whicn_Min',
    ...:                        'Q1','Median','Q3','Mean',
    ...:                        'Max','Which_Max','Mad',
    ...:                        'Var','Std','Skew','Kurt'])

In [88]: stats(d1)
Out[88]:
Count        100.000000
Min           -4.127033
Whicn_Min     81.000000
Q1             2.040101
Median         3.204555
Q3             4.434788
Mean           3.070225
Max            7.781921
Which_Max     39.000000
Mad            1.511288
Var            4.005609
Std            2.001402
Skew          -0.649478
Kurt           1.220109
dtype: float64
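The summary function above relies on `Series.mad`, which no longer exists from pandas 2.0 onward. The sketch below is one possible variant that computes the mean absolute deviation by hand. The names `mean_abs_dev`, `stats_compat`, and `sample` are hypothetical and are not part of the original article.

```python
import pandas as pd

def mean_abs_dev(x):
    """Mean absolute deviation around the mean (what Series.mad computed)."""
    return (x - x.mean()).abs().mean()

def stats_compat(x):
    """Same summary as the article's stats(), without calling x.mad()."""
    return pd.Series(
        [x.count(), x.min(), x.idxmin(), x.quantile(.25), x.median(),
         x.quantile(.75), x.mean(), x.max(), x.idxmax(),
         mean_abs_dev(x), x.var(), x.std(), x.skew(), x.kurt()],
        index=['Count', 'Min', 'Which_Min', 'Q1', 'Median', 'Q3', 'Mean',
               'Max', 'Which_Max', 'Mad', 'Var', 'Std', 'Skew', 'Kurt'])

sample = pd.Series([1.0, 2.0, 3.0, 4.0, 10.0])
result = stats_compat(sample)

print(result['Mad'])         # 2.4
print(result['Which_Max'])   # 4.0
```

Because the function only uses Series methods, it can still be passed to `DataFrame.apply` to summarise every numeric column.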

In practice, we may need to process a DataFrame made up of numeric columns. How do we apply this function to every column of the DataFrame? Use the apply method, which works much like apply in R.

Build a DataFrame from the d1, d2, and d3 data created earlier:

In [89]: df = pd.DataFrame(np.array([d1,d2,d3]).T,columns=['x1','x2','x3'])

In [90]: df.head()
Out[90]:
         x1        x2    x3
0  3.942870  1.369531  55.0
1  0.618049  0.943264  68.0
2  5.865414  0.590663  73.0
3  2.374696  0.206548  59.0
4  1.558823  0.223204  60.0

In [91]: df.apply(stats)
Out[91]:
                   x1          x2          x3
Count      100.000000  100.000000  100.000000
Min         -4.127033    0.014330    3.000000
Whicn_Min   81.000000   72.000000   76.000000
Q1           2.040101    0.249580   25.000000
Median       3.204555    1.000613   54.500000
Q3           4.434788    2.101581   73.000000
Mean         3.070225    2.028608   51.490000
Max          7.781921   18.791565   98.000000
Which_Max   39.000000   53.000000   96.000000
Mad          1.511288    1.922669   24.010800
Var          4.005609   10.206447  780.090808
Std          2.001402    3.194753   27.930106
Skew        -0.649478    3.326246   -0.118917
Kurt         1.220109   12.636286   -1.211579

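That is all it takes to produce descriptive statistics for numeric data. What about discrete data? The measures above no longer apply. Instead we want the number of observations, the number of unique values, the most frequent level (the mode), and its frequency. The describe method provides exactly these.

In [92]: student['Sex'].describe()
Out[92]:
count     19
unique     2
top        M
freq      10
Name: Sex, dtype: object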
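For a discrete column, `describe` reports only the most frequent level. If the frequency of every level is needed, `value_counts` is a natural companion. A minimal sketch on a made-up five-element column:

```python
import pandas as pd

sex = pd.Series(['M', 'F', 'F', 'M', 'M'], name='Sex')

summary = sex.describe()      # count, unique, top, freq for object data
counts = sex.value_counts()   # frequency of every level, most frequent first

print(summary['top'], summary['freq'])   # M 3
print(counts['F'])                       # 2
```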

Beyond these simple descriptive statistics, pandas also computes the correlation coefficients (corr) and the covariance matrix (cov) of continuous variables, with usage consistent with R.

In [93]: df.corr()
Out[93]:
          x1        x2        x3
x1  1.000000  0.136085  0.037185
x2  0.136085  1.000000 -0.005688
x3  0.037185 -0.005688  1.000000

The correlation coefficient can be computed with the pearson, kendall, or spearman method; pearson is the default.

In [94]: df.corr('spearman')
Out[94]:
         x1        x2        x3
x1  1.00000  0.178950  0.006590
x2  0.17895  1.000000 -0.033874
x3  0.00659 -0.033874  1.000000

To focus on the correlation of one variable with the others, use corrwith. Below, only the correlations between x1 and the other variables are computed:

In [95]: df.corrwith(df['x1'])
Out[95]:
x1    1.000000
x2    0.136085
x3    0.037185
dtype: float64
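To make the difference between the methods more concrete, the sketch below checks that the spearman coefficient equals the pearson coefficient computed on the ranks of the data. The random data, the seed, and the names `frame`, `spearman`, and `pearson_on_ranks` are assumptions for illustration and are unrelated to the d1, d2, d3 values above.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
frame = pd.DataFrame({'x1': rng.normal(size=50),
                      'x2': rng.normal(size=50)})

spearman = frame.corr(method='spearman')

# Spearman's coefficient is Pearson's coefficient computed on the ranks.
pearson_on_ranks = frame.rank().corr(method='pearson')

print(np.allclose(spearman, pearson_on_ranks))   # True

# corrwith returns one coefficient per column against the given Series.
against_x1 = frame.corrwith(frame['x1'])
print(list(against_x1.index))                    # ['x1', 'x2']
```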

Covariance matrix of the numeric data:

In [96]: df.cov()
Out[96]:
          x1         x2          x3
x1  4.005609   0.870124    2.078596
x2  0.870124  10.206447   -0.507512
x3  2.078596  -0.507512  780.090808
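The covariance and correlation matrices are closely related: each correlation is the covariance divided by the product of the two standard deviations, and the diagonal of the covariance matrix holds the variances (compare the x1 entry 4.005609 with the variance of d1 reported earlier). A sketch of that relationship on made-up random data, where `frame` and `rebuilt_corr` are illustrative names:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
frame = pd.DataFrame(rng.normal(size=(30, 3)), columns=['x1', 'x2', 'x3'])

cov = frame.cov()
std = frame.std()

# Divide each covariance by the product of the two standard deviations.
rebuilt_corr = cov / np.outer(std, std)

print(np.allclose(rebuilt_corr, frame.corr()))   # True

# The diagonal of the covariance matrix holds the variances.
print(np.allclose(np.diag(cov), frame.var()))    # True
```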
