Becoming a Wall Street Financial Titan, Lesson 3: Pandas 2: Learning to Use the pandas DataFrame

Published: 2024/3/24



    import pandas as pd
    import numpy as np

I. Introducing and creating a DataFrame: a two-dimensional data object

A DataFrame can be thought of, loosely, as an Excel spreadsheet.

Method 1: create from a dictionary

    pd.DataFrame({"one":[1,2,3],'two':[4,5,6]})

       one  two
    0    1    4
    1    2    5
    2    3    6

As with a Series, we can specify an index for the rows.

    pd.DataFrame({"one":[1,2,3],'two':[4,5,6]},index=['a','b','c'])

       one  two
    a    1    4
    b    2    5
    c    3    6
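A minimal sketch of the dictionary approach, using the same data as above. The names `data`, `df_default` and `df_labeled` are mine, chosen only for illustration.

```python
import pandas as pd

# Each dictionary key becomes a column; each list holds that column's values.
data = {"one": [1, 2, 3], "two": [4, 5, 6]}

# Without an explicit index, the rows are labeled 0, 1, 2.
df_default = pd.DataFrame(data)

# With index=..., the rows take the given labels instead.
df_labeled = pd.DataFrame(data, index=["a", "b", "c"])

print(list(df_default.index))      # [0, 1, 2]
print(list(df_labeled.index))      # ['a', 'b', 'c']
print(df_labeled.loc["b", "two"])  # 5
```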

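Method 2: create from Series

    pd.DataFrame({'one':pd.Series([1,2,3],index=['a','b','c']),'two':pd.Series([1,2,3,4],index=['d','c','a','b'])})

       one  two
    a  1.0    3
    b  2.0    4
    c  3.0    2
    d  NaN    1

As this shows, a DataFrame aligns the indexes automatically when it is created: values are matched by label rather than by position, and a label missing from one Series is filled with NaN.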
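The alignment can be checked cell by cell. This is a small sketch under the same data as above; `s_one` and `s_two` are illustrative names, not from the original.

```python
import pandas as pd

s_one = pd.Series([1, 2, 3], index=["a", "b", "c"])
s_two = pd.Series([1, 2, 3, 4], index=["d", "c", "a", "b"])

df = pd.DataFrame({"one": s_one, "two": s_two})

# The row index is the union of both Series' labels.
print(sorted(df.index))          # ['a', 'b', 'c', 'd']
# 'one' has no value for 'd', so that cell is NaN and the column is float.
print(df["one"].dtype)           # float64
print(df["one"].isna().sum())    # 1
# 'two' was matched by label, not by position: label 'a' held 3 in s_two.
print(df.loc["a", "two"])        # 3
print(df.loc["d", "two"])        # 1
```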

Method 3: create from a CSV file

    pd.read_csv('test.csv')

       a  b  c
    0  1  2  3
    1  2  4  6
    2  3  6  9

Saving to CSV

    df = pd.DataFrame({'one':pd.Series([1,2,3],index=['a','b','c']),'two':pd.Series([1,2,3,4],index=['d','c','a','b'])})
    df

       one  two
    a  1.0    3
    b  2.0    4
    c  3.0    2
    d  NaN    1

    # Save to CSV
    df.to_csv('test2.csv')
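To see what a save-and-reload round trip does to the row index, here is a sketch that keeps everything in memory instead of touching the disk. Using `io.StringIO` in place of a real file is my own choice for the example; the article itself writes to `test2.csv`.

```python
import io
import pandas as pd

df = pd.DataFrame({"one": [1.0, 2.0, 3.0], "two": [3, 4, 2]}, index=["a", "b", "c"])

# to_csv() with no path returns the CSV text instead of writing a file.
csv_text = df.to_csv()

# Read back naively: the saved row index shows up as an extra, unnamed column.
unindexed = pd.read_csv(io.StringIO(csv_text))
print(list(unindexed.columns))   # ['Unnamed: 0', 'one', 'two']

# index_col=0 turns that first column back into the row index.
restored = pd.read_csv(io.StringIO(csv_text), index_col=0)
print(list(restored.index))      # ['a', 'b', 'c']
print(restored.loc["b", "two"])  # 4
```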

II. Common DataFrame attributes

1. The index, columns and values attributes

Purpose: index returns the row index, columns returns the column labels, and values returns the data as an array.

    df = pd.DataFrame({'one':pd.Series([1,2,3],index=['a','b','c']),'two':pd.Series([1,2,3,4],index=['d','c','a','b'])})
    df

       one  two
    a  1.0    3
    b  2.0    4
    c  3.0    2
    d  NaN    1

    # index attribute
    df.index
    Index(['a', 'b', 'c', 'd'], dtype='object')

    # columns attribute
    df.columns
    Index(['one', 'two'], dtype='object')

    # values attribute
    df.values
    array([[ 1.,  3.],
           [ 2.,  4.],
           [ 3.,  2.],
           [nan,  1.]])
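One detail worth a quick sketch: `values` returns a single NumPy array, so columns of different numeric types are converted to one common type. The small frame below is made up for the illustration.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"one": [1, 2, 3], "two": [4.5, 5.5, 6.5]}, index=["a", "b", "c"])

print(df.index.tolist())    # ['a', 'b', 'c']
print(df.columns.tolist())  # ['one', 'two']

# An int column next to a float column is upcast to float64 in the array.
arr = df.values
print(type(arr).__name__, arr.shape, arr.dtype)  # ndarray (3, 2) float64

# to_numpy() is the method form and returns the same data.
print(np.array_equal(arr, df.to_numpy()))        # True
```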

2. The T attribute

Purpose: transpose

    df.T

           a    b    c    d
    one  1.0  2.0  3.0  NaN
    two  3.0  4.0  2.0  1.0

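3. The describe() method

Purpose: return summary statistics

    df.describe()

           one       two
    count  3.0  4.000000
    mean   2.0  2.500000
    std    1.0  1.290994
    min    1.0  1.000000
    25%    1.5  1.750000
    50%    2.0  2.500000
    75%    2.5  3.250000
    max    3.0  4.000000

count: the number of non-missing values in the column
mean: the mean of the column
std: the standard deviation of the column
min: the smallest value in the column
25%: the value at the 25% position when the column is sorted in ascending order
50%: the median of the column
75%: the value at the 75% position when the column is sorted in ascending order
max: the largest value in the column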
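The rows of describe() can be cross-checked against ordinary methods. The sketch below rebuilds the article's frame directly from lists (an equivalent construction I chose for brevity) and looks at three of the rows.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"one": [1, 2, 3, np.nan], "two": [3, 4, 2, 1]}, index=list("abcd"))
stats = df.describe()

# count ignores NaN, so 'one' has 3 values and 'two' has 4.
print(stats.loc["count"].tolist())        # [3.0, 4.0]
# std is the sample standard deviation (ddof=1).
print(round(stats.loc["std", "two"], 6))  # 1.290994
# Percentiles interpolate linearly between neighbouring sorted values.
print(stats.loc["25%", "two"])            # 1.75
print(df["two"].quantile(0.25))           # 1.75
```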

III. Indexing and slicing a DataFrame

    df = pd.DataFrame({'one':pd.Series([1,2,3],index=['a','b','c']),'two':pd.Series([1,2,3,4],index=['d','c','a','b'])})
    df

       one  two
    a  1.0    3
    b  2.0    4
    c  3.0    2
    d  NaN    1

A value can be read with df[x][y], which means row y of column x. Note the contrast with NumPy: here the first pair of brackets selects the column.

e.g. take the value 1.0 in column one, row a

    df['one']['a']
    1.0

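Although this retrieves the value easily, we generally avoid it, because it runs into the same kind of ambiguity as integer indexing on a Series.

So in general we use loc and iloc to read values.

With loc, the part before the comma is the row and the part after it is the column, as in NumPy.

    df.loc['a','one']
    1.0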

Important: a DataFrame is in effect made up of n Series objects, one per column. A column label in plain brackets therefore returns that column directly, but a row label in plain brackets does not return a row. To get a row, use loc with a slice.

    # Get a column directly by its name
    df['one']
    a    1.0
    b    2.0
    c    3.0
    d    NaN
    Name: one, dtype: float64

    # Never try to get a row with a row label in plain brackets
    df['a']
    ---------------------------------------------------------------------------
    KeyError                                  Traceback (most recent call last)
    ...
    KeyError: 'a'

    # Get a row with loc and a slice
    df.loc['a',:]
    one    1.0
    two    3.0
    Name: a, dtype: float64

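From my own experiments, the row selection can be shortened to df.loc['a',], and the comma can even be dropped, df.loc['a']. Because that looks so much like the plain-bracket form warned about above, it is better to keep the comma.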
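The text mentions iloc without showing it, so here is a short sketch that puts loc, iloc and the failing plain-bracket lookup side by side. The frame is the article's, rebuilt from lists for brevity.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"one": [1, 2, 3, np.nan], "two": [3, 4, 2, 1]}, index=list("abcd"))

# loc is label-based: row label first, then column label.
print(df.loc["a", "one"])   # 1.0
# iloc is position-based: row 0, column 0 is the same cell.
print(df.iloc[0, 0])        # 1.0

# A whole row, by label or by position, comes back as a Series.
row_by_label = df.loc["a", :]
row_by_position = df.iloc[0, :]
print(row_by_label.equals(row_by_position))  # True

# Plain brackets look up columns, so a row label raises KeyError.
try:
    df["a"]
    raised = False
except KeyError as err:
    raised = True
    print("KeyError:", err)  # KeyError: 'a'
```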

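    df = pd.DataFrame({'one':pd.Series([1,2,3],index=['a','b','c']),'two':pd.Series([1,2,3,4],index=['d','c','a','b'])})
    df

       one  two
    a  1.0    3
    b  2.0    4
    c  3.0    2
    d  NaN    1

Besides single labels, the row part and the column part can each be a slice, a boolean mask or a fancy (list) index, in any combination.

e.g.:

    # Fancy indexing combined with a slice
    df.loc[['a','b'],'one':'two']

       one  two
    a  1.0    3
    b  2.0    4

Reminder: as with a Series, a slice by label is closed on both ends, so the end label is included.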
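A sketch of the three selector kinds named above, again on the article's frame. It also contrasts the inclusive label slice with the end-exclusive position slice of iloc.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"one": [1, 2, 3, np.nan], "two": [3, 4, 2, 1]}, index=list("abcd"))

# Label slices include both endpoints: 'a':'c' returns rows a, b and c.
label_slice = df.loc["a":"c", "one"].tolist()
print(label_slice)      # [1.0, 2.0, 3.0]

# Position slices with iloc follow Python rules and exclude the stop.
position_slice = df.iloc[0:2, 0].tolist()
print(position_slice)   # [1.0, 2.0]

# A boolean mask selects rows; here it is combined with a list of columns.
masked = df.loc[df["two"] > 2, ["one", "two"]]
print(masked.index.tolist())  # ['a', 'b']
```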

IV. Data alignment and handling missing values

    df = pd.DataFrame({'one':pd.Series([1,2,3],index=['a','b','c']),'two':pd.Series([1,2,3,4],index=['d','c','a','b'])})
    df

       one  two
    a  1.0    3
    b  2.0    4
    c  3.0    2
    d  NaN    1

    df2 = pd.DataFrame({'two':[1,2,3,4],'one':[5,6,7,8]},index = ['d','c','a','b'])
    df2

       two  one
    d    1    5
    c    2    6
    a    3    7
    b    4    8

    df + df2

        one  two
    a   8.0    6
    b  10.0    8
    c   9.0    4
    d   NaN    2

A DataFrame follows the data alignment principle: in arithmetic, rows and columns are each aligned by label before the operation is applied.
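A smaller sketch of the same alignment rule, on two made-up frames (`left` and `right` are illustrative names). It also shows `add()` with `fill_value`, a standard pandas option the article does not cover, for cases where a label is missing on one side.

```python
import pandas as pd

left = pd.DataFrame({"x": [1, 2]}, index=["a", "b"])
right = pd.DataFrame({"x": [10, 20]}, index=["b", "c"])

# '+' aligns on labels; a label missing on either side yields NaN.
total = left + right
print(total["x"].isna().tolist())  # [True, False, True]
print(total.loc["b", "x"])         # 12.0

# add() with fill_value treats a value missing on one side as 0.
filled = left.add(right, fill_value=0)
print(filled["x"].tolist())        # [1.0, 12.0, 20.0]
```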

    df

       one  two
    a  1.0    3
    b  2.0    4
    c  3.0    2
    d  NaN    1

Handling missing values, method 1: filling

    # fillna(x) fills missing values with x
    df.fillna(0)

       one  two
    a  1.0    3
    b  2.0    4
    c  3.0    2
    d  0.0    1
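fillna also accepts a Series, which fills each column with its own value. The sketch below uses the column means as the fill values; that particular choice is mine, not the article's.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"one": [1, 2, 3, np.nan], "two": [3, 4, 2, 1]}, index=list("abcd"))

# A scalar fills every missing cell with the same value.
zero_filled = df.fillna(0)
print(zero_filled.loc["d", "one"])  # 0.0

# A Series (here the column means) fills each column with its own value.
mean_filled = df.fillna(df.mean())
print(mean_filled.loc["d", "one"])  # 2.0

# fillna returns a new object; the original still has its NaN.
print(df["one"].isna().sum())       # 1
```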

Handling missing values, method 2: dropping

1. The dropna() method

    # dropna() drops every row that contains a missing value
    df.dropna()

       one  two
    a  1.0    3
    b  2.0    4
    c  3.0    2

2. The how parameter of dropna()

    df

       one  two
    a  1.0    3
    b  2.0    4
    c  3.0    2
    d  NaN    1

    df.loc['c','two'] = np.nan
    df.loc['d','two'] = np.nan
    df

       one  two
    a  1.0  3.0
    b  2.0  4.0
    c  3.0  NaN
    d  NaN  NaN

    # how='all' drops a row only when every value in it is missing; the default, how='any', drops a row as soon as it contains one missing value
    df.dropna(how = 'all')

       one  two
    a  1.0  3.0
    b  2.0  4.0
    c  3.0  NaN

3. The axis parameter of dropna()

axis=0 works row by row and axis=1 works column by column. df.dropna(axis=1) drops every column that contains a missing value; the default is axis=0.

    df2

       two  one
    d    1    5
    c    2    6
    a    3    7
    b    4    8

    df2.loc['a','two'] = np.nan
    df2

       two  one
    d  1.0    5
    c  2.0    6
    a  NaN    7
    b  4.0    8

    # df.dropna(axis=1) drops the columns that contain missing values; the default is axis=0
    df2.dropna(axis=1)

       one
    d    5
    c    6
    a    7
    b    8

    # A DataFrame also provides isnull() and notnull(), used exactly as on a Series
    df2.isnull()

         two    one
    d  False  False
    c  False  False
    a   True  False
    b  False  False
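The dropna variants are easiest to compare on one frame. This sketch uses the frame from the how example above (NaN in row c of two and in all of row d) and prints only the surviving labels.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    {"one": [1, 2, 3, np.nan], "two": [3, 4, np.nan, np.nan]},
    index=list("abcd"),
)

# isnull() gives a boolean frame; summing it counts missing values per column.
print(df.isnull().sum().tolist())           # [1, 2]

# how='any' (the default): a single NaN is enough to drop the row.
print(df.dropna().index.tolist())           # ['a', 'b']
# how='all': only rows made up entirely of NaN are dropped.
print(df.dropna(how="all").index.tolist())  # ['a', 'b', 'c']
# axis=1: columns are dropped instead; both columns contain a NaN here.
print(df.dropna(axis=1).columns.tolist())   # []
```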

V. Common DataFrame functions

1. Computing the mean

    df = df2
    df

       two  one
    d  1.0    5
    c  2.0    6
    a  NaN    7
    b  4.0    8

    df.mean()
    two    2.333333
    one    6.500000
    dtype: float64

mean(): skips missing values, computes the mean of each column, and returns a Series.

axis parameter: axis=1 computes the mean of each row; the default, axis=0, computes the mean of each column.

    df.mean(axis=1)
    d    3.0
    c    4.0
    a    7.0
    b    6.0
    dtype: float64

sum(), std() and similar methods work the same way as mean().
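As a sketch of "works the same way", here is sum() along both axes on the same frame, plus the `skipna` parameter, which switches off the NaN-skipping described above.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    {"two": [1, 2, np.nan, 4], "one": [5, 6, 7, 8]},
    index=["d", "c", "a", "b"],
)

# Column totals; the NaN in 'two' is skipped.
col_totals = df.sum()
print(col_totals.tolist())   # [7.0, 26.0]

# Row totals with axis=1; row 'a' only has the value 7 to add up.
row_totals = df.sum(axis=1)
print(row_totals.tolist())   # [6.0, 8.0, 7.0, 12.0]

# skipna=False propagates NaN instead of ignoring it.
strict_mean = df.mean(skipna=False)
print(strict_mean.isna().tolist())  # [True, False]
```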

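2. Sorting by value: sort_values()

    df

       two  one
    d  1.0    5
    c  2.0    6
    a  NaN    7
    b  4.0    8

    # Sort by the values in column two
    df.sort_values(by = 'two')

       two  one
    d  1.0    5
    c  2.0    6
    b  4.0    8
    a  NaN    7

    # Sort by the values in column two, descending
    df.sort_values(by = 'two',ascending=False)

       two  one
    b  4.0    8
    c  2.0    6
    d  1.0    5
    a  NaN    7

    # Sort by the values in row d, descending
    df.sort_values(by = 'd',ascending=False,axis=1)

       one  two
    d    5  1.0
    c    6  2.0
    a    7  NaN
    b    8  4.0

Summary:

1. df.sort_values() sorts by value

2. by parameter: the column (or row) to sort by

3. ascending parameter: True sorts in ascending order and False in descending order; the default is True

4. axis parameter: axis=0 (the default) reorders the rows by the values in a column, and axis=1 reorders the columns by the values in a row. If by is a row label, you must pass axis=1

5. Missing values: NaN does not take part in the comparison and is placed at the end, in both ascending and descending order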
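The summary points can be checked by printing just the resulting row order. The sketch also uses `na_position`, a sort_values parameter not mentioned in the article, which moves the NaN rows to the front when that is preferred.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    {"two": [1, 2, np.nan, 4], "one": [5, 6, 7, 8]},
    index=["d", "c", "a", "b"],
)

ascending = df.sort_values(by="two").index.tolist()
descending = df.sort_values(by="two", ascending=False).index.tolist()
nan_first = df.sort_values(by="two", na_position="first").index.tolist()

print(ascending)   # ['d', 'c', 'b', 'a']  NaN row last
print(descending)  # ['b', 'c', 'd', 'a']  NaN row still last
print(nan_first)   # ['a', 'd', 'c', 'b']  NaN row moved to the top

# sort_values returns a new frame; df keeps its original order.
print(df.index.tolist())  # ['d', 'c', 'a', 'b']
```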

3. Sorting by index: sort_index()

    df

       two  one
    d  1.0    5
    c  2.0    6
    a  NaN    7
    b  4.0    8

    df.sort_index()

       two  one
    a  NaN    7
    b  4.0    8
    c  2.0    6
    d  1.0    5

    # Descending order
    df.sort_index(ascending = False)

       two  one
    d  1.0    5
    c  2.0    6
    b  4.0    8
    a  NaN    7

    # Sort the columns by label, giving the order one, two
    df.sort_index(ascending = True , axis = 1)

       one  two
    d    5  1.0
    c    6  2.0
    a    7  NaN
    b    8  4.0

The examples here use two parameters of sort_index(), ascending and axis; they work the same way as in sort_values().
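A compact sketch of sort_index() on the same frame, printing only the label order, and confirming that each row's values move together with its label.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    {"two": [1, 2, np.nan, 4], "one": [5, 6, 7, 8]},
    index=["d", "c", "a", "b"],
)

# Default: sort the rows by their index labels.
by_label = df.sort_index()
print(by_label.index.tolist())                        # ['a', 'b', 'c', 'd']
print(df.sort_index(ascending=False).index.tolist())  # ['d', 'c', 'b', 'a']

# axis=1 sorts the columns by their labels instead.
print(df.sort_index(axis=1).columns.tolist())         # ['one', 'two']

# The values travel with their labels: row 'a' still holds one == 7.
print(by_label.loc["a", "one"])                       # 7
```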

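VI. OMG, this is awesome: DataFrame time series

1. Handling time objects in pandas

    # pd.to_datetime(list) converts a batch of strings to time objects and accepts many different ways of writing a date
    pd.to_datetime(["2021-01-10","2021/MAY/1"])
    DatetimeIndex(['2021-01-10', '2021-05-01'], dtype='datetime64[ns]', freq=None)

Note: the output above comes from an older pandas release. From pandas 2.0 on, the format is inferred from the first string, so a list that mixes notations like this one may need format='mixed'.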
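A version-independent sketch: keep one notation per call, or state the format explicitly. The date strings below are my own examples, not the article's.

```python
import pandas as pd

# Strings in one consistent format parse in a single call.
iso = pd.to_datetime(["2021-01-10", "2021-05-01"])
print(type(iso).__name__)  # DatetimeIndex
print(iso[1].month)        # 5

# An explicit format string removes any guessing about the layout.
slashed = pd.to_datetime(["2021/05/01"], format="%Y/%m/%d")
print(slashed[0] == pd.Timestamp("2021-05-01"))  # True
```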

2. Generating time objects automatically in pandas

The start, end and periods parameters of pd.date_range()

start: the start time
end: the end time
periods: the number of timestamps to generate. You can give start and periods instead of end, which generates periods timestamps beginning at start (one per day by default); likewise, you can give end and periods instead of start.

    # Generate dates from start and end
    pd.date_range('2021-01-01','2021-03-01')
    DatetimeIndex(['2021-01-01', '2021-01-02', '2021-01-03', '2021-01-04',
                   '2021-01-05', '2021-01-06', '2021-01-07', '2021-01-08',
                   '2021-01-09', '2021-01-10', '2021-01-11', '2021-01-12',
                   '2021-01-13', '2021-01-14', '2021-01-15', '2021-01-16',
                   '2021-01-17', '2021-01-18', '2021-01-19', '2021-01-20',
                   '2021-01-21', '2021-01-22', '2021-01-23', '2021-01-24',
                   '2021-01-25', '2021-01-26', '2021-01-27', '2021-01-28',
                   '2021-01-29', '2021-01-30', '2021-01-31', '2021-02-01',
                   '2021-02-02', '2021-02-03', '2021-02-04', '2021-02-05',
                   '2021-02-06', '2021-02-07', '2021-02-08', '2021-02-09',
                   '2021-02-10', '2021-02-11', '2021-02-12', '2021-02-13',
                   '2021-02-14', '2021-02-15', '2021-02-16', '2021-02-17',
                   '2021-02-18', '2021-02-19', '2021-02-20', '2021-02-21',
                   '2021-02-22', '2021-02-23', '2021-02-24', '2021-02-25',
                   '2021-02-26', '2021-02-27', '2021-02-28', '2021-03-01'],
                  dtype='datetime64[ns]', freq='D')

    # Generate dates from start and periods; the periods= keyword must be written out, because the second positional argument is end
    pd.date_range('2021-01-01',periods=60)
    DatetimeIndex(['2021-01-01', '2021-01-02', '2021-01-03', '2021-01-04',
                   '2021-01-05', '2021-01-06', '2021-01-07', '2021-01-08',
                   '2021-01-09', '2021-01-10', '2021-01-11', '2021-01-12',
                   '2021-01-13', '2021-01-14', '2021-01-15', '2021-01-16',
                   '2021-01-17', '2021-01-18', '2021-01-19', '2021-01-20',
                   '2021-01-21', '2021-01-22', '2021-01-23', '2021-01-24',
                   '2021-01-25', '2021-01-26', '2021-01-27', '2021-01-28',
                   '2021-01-29', '2021-01-30', '2021-01-31', '2021-02-01',
                   '2021-02-02', '2021-02-03', '2021-02-04', '2021-02-05',
                   '2021-02-06', '2021-02-07', '2021-02-08', '2021-02-09',
                   '2021-02-10', '2021-02-11', '2021-02-12', '2021-02-13',
                   '2021-02-14', '2021-02-15', '2021-02-16', '2021-02-17',
                   '2021-02-18', '2021-02-19', '2021-02-20', '2021-02-21',
                   '2021-02-22', '2021-02-23', '2021-02-24', '2021-02-25',
                   '2021-02-26', '2021-02-27', '2021-02-28', '2021-03-01'],
                  dtype='datetime64[ns]', freq='D')
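The three ways of pinning down a range describe the same 60 days here, which a short sketch can confirm. The variable names are illustrative.

```python
import pandas as pd

by_end = pd.date_range("2021-01-01", "2021-03-01")           # start and end
by_periods = pd.date_range("2021-01-01", periods=60)         # start and a count
back_from_end = pd.date_range(end="2021-03-01", periods=60)  # end and a count

print(len(by_end))                   # 60
print(by_end.equals(by_periods))     # True
print(by_end.equals(back_from_end))  # True
```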

The freq parameter of pd.date_range():

freq: the spacing between generated timestamps. The default is 'D' (day); other values include 'H' (hour) and 'W' (week).
W (week): 'W-MON' outputs every Monday from start onward; plain 'W' means 'W-SUN'.
B: business days only.
The outputs below come from an older pandas release. On pandas 2.2 and later the hour alias is spelled 'h' ('H' is deprecated), and a frequency shown here as '80T' is displayed as '80min'.

    # Starting at 2021-01-01 00:00:00, generate 60 timestamps, one per hour
    pd.date_range('2021-01-01',periods=60,freq='H')
    DatetimeIndex(['2021-01-01 00:00:00', '2021-01-01 01:00:00',
                   '2021-01-01 02:00:00', '2021-01-01 03:00:00',
                   '2021-01-01 04:00:00', '2021-01-01 05:00:00',
                   '2021-01-01 06:00:00', '2021-01-01 07:00:00',
                   '2021-01-01 08:00:00', '2021-01-01 09:00:00',
                   '2021-01-01 10:00:00', '2021-01-01 11:00:00',
                   '2021-01-01 12:00:00', '2021-01-01 13:00:00',
                   '2021-01-01 14:00:00', '2021-01-01 15:00:00',
                   '2021-01-01 16:00:00', '2021-01-01 17:00:00',
                   '2021-01-01 18:00:00', '2021-01-01 19:00:00',
                   '2021-01-01 20:00:00', '2021-01-01 21:00:00',
                   '2021-01-01 22:00:00', '2021-01-01 23:00:00',
                   '2021-01-02 00:00:00', '2021-01-02 01:00:00',
                   '2021-01-02 02:00:00', '2021-01-02 03:00:00',
                   '2021-01-02 04:00:00', '2021-01-02 05:00:00',
                   '2021-01-02 06:00:00', '2021-01-02 07:00:00',
                   '2021-01-02 08:00:00', '2021-01-02 09:00:00',
                   '2021-01-02 10:00:00', '2021-01-02 11:00:00',
                   '2021-01-02 12:00:00', '2021-01-02 13:00:00',
                   '2021-01-02 14:00:00', '2021-01-02 15:00:00',
                   '2021-01-02 16:00:00', '2021-01-02 17:00:00',
                   '2021-01-02 18:00:00', '2021-01-02 19:00:00',
                   '2021-01-02 20:00:00', '2021-01-02 21:00:00',
                   '2021-01-02 22:00:00', '2021-01-02 23:00:00',
                   '2021-01-03 00:00:00', '2021-01-03 01:00:00',
                   '2021-01-03 02:00:00', '2021-01-03 03:00:00',
                   '2021-01-03 04:00:00', '2021-01-03 05:00:00',
                   '2021-01-03 06:00:00', '2021-01-03 07:00:00',
                   '2021-01-03 08:00:00', '2021-01-03 09:00:00',
                   '2021-01-03 10:00:00', '2021-01-03 11:00:00'],
                  dtype='datetime64[ns]', freq='H')

    # Generate the 60 Sundays from 2021-01-01 onward
    pd.date_range('2021-01-01',periods=60,freq='W')
    DatetimeIndex(['2021-01-03', '2021-01-10', '2021-01-17', '2021-01-24',
                   '2021-01-31', '2021-02-07', '2021-02-14', '2021-02-21',
                   '2021-02-28', '2021-03-07', '2021-03-14', '2021-03-21',
                   '2021-03-28', '2021-04-04', '2021-04-11', '2021-04-18',
                   '2021-04-25', '2021-05-02', '2021-05-09', '2021-05-16',
                   '2021-05-23', '2021-05-30', '2021-06-06', '2021-06-13',
                   '2021-06-20', '2021-06-27', '2021-07-04', '2021-07-11',
                   '2021-07-18', '2021-07-25', '2021-08-01', '2021-08-08',
                   '2021-08-15', '2021-08-22', '2021-08-29', '2021-09-05',
                   '2021-09-12', '2021-09-19', '2021-09-26', '2021-10-03',
                   '2021-10-10', '2021-10-17', '2021-10-24', '2021-10-31',
                   '2021-11-07', '2021-11-14', '2021-11-21', '2021-11-28',
                   '2021-12-05', '2021-12-12', '2021-12-19', '2021-12-26',
                   '2022-01-02', '2022-01-09', '2022-01-16', '2022-01-23',
                   '2022-01-30', '2022-02-06', '2022-02-13', '2022-02-20'],
                  dtype='datetime64[ns]', freq='W-SUN')

    # Generate the 60 Fridays from 2021-01-01 onward
    pd.date_range('2021-01-01',periods=60,freq='W-Fri')
    DatetimeIndex(['2021-01-01', '2021-01-08', '2021-01-15', '2021-01-22',
                   '2021-01-29', '2021-02-05', '2021-02-12', '2021-02-19',
                   '2021-02-26', '2021-03-05', '2021-03-12', '2021-03-19',
                   '2021-03-26', '2021-04-02', '2021-04-09', '2021-04-16',
                   '2021-04-23', '2021-04-30', '2021-05-07', '2021-05-14',
                   '2021-05-21', '2021-05-28', '2021-06-04', '2021-06-11',
                   '2021-06-18', '2021-06-25', '2021-07-02', '2021-07-09',
                   '2021-07-16', '2021-07-23', '2021-07-30', '2021-08-06',
                   '2021-08-13', '2021-08-20', '2021-08-27', '2021-09-03',
                   '2021-09-10', '2021-09-17', '2021-09-24', '2021-10-01',
                   '2021-10-08', '2021-10-15', '2021-10-22', '2021-10-29',
                   '2021-11-05', '2021-11-12', '2021-11-19', '2021-11-26',
                   '2021-12-03', '2021-12-10', '2021-12-17', '2021-12-24',
                   '2021-12-31', '2022-01-07', '2022-01-14', '2022-01-21',
                   '2022-01-28', '2022-02-04', '2022-02-11', '2022-02-18'],
                  dtype='datetime64[ns]', freq='W-FRI')

    # Generate the 60 business days from 2021-01-01 onward
    pd.date_range('2021-01-01',periods=60,freq='B')
    DatetimeIndex(['2021-01-01', '2021-01-04', '2021-01-05', '2021-01-06',
                   '2021-01-07', '2021-01-08', '2021-01-11', '2021-01-12',
                   '2021-01-13', '2021-01-14', '2021-01-15', '2021-01-18',
                   '2021-01-19', '2021-01-20', '2021-01-21', '2021-01-22',
                   '2021-01-25', '2021-01-26', '2021-01-27', '2021-01-28',
                   '2021-01-29', '2021-02-01', '2021-02-02', '2021-02-03',
                   '2021-02-04', '2021-02-05', '2021-02-08', '2021-02-09',
                   '2021-02-10', '2021-02-11', '2021-02-12', '2021-02-15',
                   '2021-02-16', '2021-02-17', '2021-02-18', '2021-02-19',
                   '2021-02-22', '2021-02-23', '2021-02-24', '2021-02-25',
                   '2021-02-26', '2021-03-01', '2021-03-02', '2021-03-03',
                   '2021-03-04', '2021-03-05', '2021-03-08', '2021-03-09',
                   '2021-03-10', '2021-03-11', '2021-03-12', '2021-03-15',
                   '2021-03-16', '2021-03-17', '2021-03-18', '2021-03-19',
                   '2021-03-22', '2021-03-23', '2021-03-24', '2021-03-25'],
                  dtype='datetime64[ns]', freq='B')

    # Starting at 2021-01-01 00:00:00, output one timestamp every 1 hour 20 minutes
    pd.date_range('2021-01-01',periods=60,freq='1h20min')
    DatetimeIndex(['2021-01-01 00:00:00', '2021-01-01 01:20:00',
                   '2021-01-01 02:40:00', '2021-01-01 04:00:00',
                   '2021-01-01 05:20:00', '2021-01-01 06:40:00',
                   '2021-01-01 08:00:00', '2021-01-01 09:20:00',
                   '2021-01-01 10:40:00', '2021-01-01 12:00:00',
                   '2021-01-01 13:20:00', '2021-01-01 14:40:00',
                   '2021-01-01 16:00:00', '2021-01-01 17:20:00',
                   '2021-01-01 18:40:00', '2021-01-01 20:00:00',
                   '2021-01-01 21:20:00', '2021-01-01 22:40:00',
                   '2021-01-02 00:00:00', '2021-01-02 01:20:00',
                   '2021-01-02 02:40:00', '2021-01-02 04:00:00',
                   '2021-01-02 05:20:00', '2021-01-02 06:40:00',
                   '2021-01-02 08:00:00', '2021-01-02 09:20:00',
                   '2021-01-02 10:40:00', '2021-01-02 12:00:00',
                   '2021-01-02 13:20:00', '2021-01-02 14:40:00',
                   '2021-01-02 16:00:00', '2021-01-02 17:20:00',
                   '2021-01-02 18:40:00', '2021-01-02 20:00:00',
                   '2021-01-02 21:20:00', '2021-01-02 22:40:00',
                   '2021-01-03 00:00:00', '2021-01-03 01:20:00',
                   '2021-01-03 02:40:00', '2021-01-03 04:00:00',
                   '2021-01-03 05:20:00', '2021-01-03 06:40:00',
                   '2021-01-03 08:00:00', '2021-01-03 09:20:00',
                   '2021-01-03 10:40:00', '2021-01-03 12:00:00',
                   '2021-01-03 13:20:00', '2021-01-03 14:40:00',
                   '2021-01-03 16:00:00', '2021-01-03 17:20:00',
                   '2021-01-03 18:40:00', '2021-01-03 20:00:00',
                   '2021-01-03 21:20:00', '2021-01-03 22:40:00',
                   '2021-01-04 00:00:00', '2021-01-04 01:20:00',
                   '2021-01-04 02:40:00', '2021-01-04 04:00:00',
                   '2021-01-04 05:20:00', '2021-01-04 06:40:00'],
                  dtype='datetime64[ns]', freq='80T')
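A shorter sketch of the same frequency strings, three timestamps each so the pattern is easy to read. It uses the lowercase 'h' spelling, on the assumption that the installed pandas accepts it (the article's own '1h20min' example suggests older releases do as well).

```python
import pandas as pd

hourly = pd.date_range("2021-01-01", periods=3, freq="h")
print(hourly[-1])                         # 2021-01-01 02:00:00

# 2021-01-01 is itself a Friday, so the weekly series starts there.
fridays = pd.date_range("2021-01-01", periods=3, freq="W-FRI")
print([d.day_name() for d in fridays])    # ['Friday', 'Friday', 'Friday']

# 'B' skips the weekend of 2 and 3 January.
business = pd.date_range("2021-01-01", periods=3, freq="B")
print([str(d.date()) for d in business])  # ['2021-01-01', '2021-01-04', '2021-01-05']

# Units can be combined into one step size.
steps = pd.date_range("2021-01-01", periods=3, freq="1h20min")
print(steps[1] - steps[0])                # 0 days 01:20:00
```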

3. Time series

What is a time series?

A time series is a Series or DataFrame whose index consists of time objects. e.g.:

    # Create a Series whose index consists of time objects
    sr = pd.Series(np.arange(1000),index=pd.date_range('2020-01-01',periods=1000))
    sr
    2020-01-01      0
    2020-01-02      1
    2020-01-03      2
    2020-01-04      3
    2020-01-05      4
                 ...
    2022-09-22    995
    2022-09-23    996
    2022-09-24    997
    2022-09-25    998
    2022-09-26    999
    Freq: D, Length: 1000, dtype: int32

    # Inspect sr.index: it is indeed made of time objects, so sr is a time series
    sr.index
    DatetimeIndex(['2020-01-01', '2020-01-02', '2020-01-03', '2020-01-04',
                   '2020-01-05', '2020-01-06', '2020-01-07', '2020-01-08',
                   '2020-01-09', '2020-01-10',
                   ...
                   '2022-09-17', '2022-09-18', '2022-09-19', '2022-09-20',
                   '2022-09-21', '2022-09-22', '2022-09-23', '2022-09-24',
                   '2022-09-25', '2022-09-26'],
                  dtype='datetime64[ns]', length=1000, freq='D')

Special uses of a time series

A time series lets you look up the data for a given year, month or day directly, and even supports slicing by date strings.

    # All data for 2020
    sr['2020']
    2020-01-01      0
    2020-01-02      1
    2020-01-03      2
    2020-01-04      3
    2020-01-05      4
                 ...
    2020-12-27    361
    2020-12-28    362
    2020-12-29    363
    2020-12-30    364
    2020-12-31    365
    Freq: D, Length: 366, dtype: int32

    # All data for March 2020
    sr['2020-3']
    2020-03-01    60
    2020-03-02    61
    2020-03-03    62
    2020-03-04    63
    2020-03-05    64
    2020-03-06    65
    2020-03-07    66
    2020-03-08    67
    2020-03-09    68
    2020-03-10    69
    2020-03-11    70
    2020-03-12    71
    2020-03-13    72
    2020-03-14    73
    2020-03-15    74
    2020-03-16    75
    2020-03-17    76
    2020-03-18    77
    2020-03-19    78
    2020-03-20    79
    2020-03-21    80
    2020-03-22    81
    2020-03-23    82
    2020-03-24    83
    2020-03-25    84
    2020-03-26    85
    2020-03-27    86
    2020-03-28    87
    2020-03-29    88
    2020-03-30    89
    2020-03-31    90
    Freq: D, dtype: int32

    # The value on 19 March 2020
    sr['2020-3-19']
    78

    # All data from March 2020 through 1 May 2021
    sr['2020-03':'2021-05-1']
    2020-03-01     60
    2020-03-02     61
    2020-03-03     62
    2020-03-04     63
    2020-03-05     64
                 ...
    2021-04-27    482
    2021-04-28    483
    2021-04-29    484
    2021-04-30    485
    2021-05-01    486
    Freq: D, Length: 427, dtype: int32
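The same lookups, written with .loc and reduced to their lengths so the results are easy to verify. Using .loc rather than plain brackets is my choice here, in line with the earlier advice to prefer loc; it is not how the article writes them.

```python
import numpy as np
import pandas as pd

sr = pd.Series(np.arange(1000), index=pd.date_range("2020-01-01", periods=1000))

year = sr.loc["2020"]                  # every day of 2020
month = sr.loc["2020-03"]              # every day of March 2020
day = sr.loc["2020-03-19"]             # a single value
span = sr.loc["2020-03":"2021-05-01"]  # both ends included

print(len(year))   # 366 (2020 is a leap year)
print(len(month))  # 31
print(day)         # 78
print(len(span))   # 427
```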

A function for time series: resample(), a powerful tool for statistics

    # resample() takes the same strings as the freq parameter of date_range(); 'W', for example, can be read as grouping all the data by week. Combined with sum(), mean() and similar functions it produces per-period statistics
    sr
    2020-01-01      0
    2020-01-02      1
    2020-01-03      2
    2020-01-04      3
    2020-01-05      4
                 ...
    2022-09-22    995
    2022-09-23    996
    2022-09-24    997
    2022-09-25    998
    2022-09-26    999
    Freq: D, Length: 1000, dtype: int32

    # Total for each week
    sr.resample('W').sum()
    2020-01-05      10
    2020-01-12      56
    2020-01-19     105
    2020-01-26     154
    2020-02-02     203
                  ...
    2022-09-04    6818
    2022-09-11    6867
    2022-09-18    6916
    2022-09-25    6965
    2022-10-02     999
    Freq: W-SUN, Length: 144, dtype: int32

    # Mean for each week
    sr.resample('W').mean()
    2020-01-05      2
    2020-01-12      8
    2020-01-19     15
    2020-01-26     22
    2020-02-02     29
                 ...
    2022-09-04    974
    2022-09-11    981
    2022-09-18    988
    2022-09-25    995
    2022-10-02    999
    Freq: W-SUN, Length: 144, dtype: int32

These outputs are from an older pandas release; current versions may show the weekly means as float64.
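A two-week sketch makes the weekly bins easy to check by hand. The 14-day series below is made up for the example and starts on a Monday, so each bin is one full Monday-to-Sunday week.

```python
import numpy as np
import pandas as pd

# 2021-01-04 is a Monday; 14 days cover exactly two calendar weeks.
sr = pd.Series(np.arange(14), index=pd.date_range("2021-01-04", periods=14))

weekly_sum = sr.resample("W").sum()
weekly_mean = sr.resample("W").mean()

# 'W' means 'W-SUN': each bin is labeled with the Sunday that ends it.
print([str(d.date()) for d in weekly_sum.index])  # ['2021-01-10', '2021-01-17']
print(weekly_sum.tolist())                        # [21, 70]
print([float(v) for v in weekly_mean])            # [3.0, 10.0]
```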

VII. File operations

    # Read a CSV file
    pd.read_csv('maotai.csv')

                 日期        收盘        开盘         高         低    交易量     涨跌幅
    0    2021/11/12  1,773.78  1,778.00  1,785.05  1,767.00  1.76M   0.24%
    1    2021/11/11  1,769.60  1,752.93  1,769.60  1,741.50  2.27M   0.89%
    2    2021/11/10  1,753.99  1,790.01  1,795.00  1,735.00  3.53M  -2.01%
    3     2021/11/9  1,790.01  1,819.98  1,827.87  1,782.00  2.74M  -1.65%
    4     2021/11/8  1,820.10  1,820.00  1,830.80  1,802.05  1.77M   0.01%
    ..          ...       ...       ...       ...       ...    ...     ...
    239  2020/11/18  1,693.65  1,715.00  1,720.53  1,683.16  3.52M  -1.29%
    240  2020/11/17  1,715.80  1,740.00  1,742.35  1,701.07  2.52M  -0.82%
    241  2020/11/16  1,730.05  1,711.00  1,730.05  1,697.26  3.06M   1.47%
    242  2020/11/13  1,705.00  1,724.00  1,728.88  1,691.00  2.82M  -1.72%
    243  2020/11/12  1,734.79  1,730.01  1,750.00  1,722.27  2.35M   0.20%

    244 rows × 7 columns

The column names in the file are 日期 (date), 收盘 (close), 开盘 (open), 高 (high), 低 (low), 交易量 (volume) and 涨跌幅 (change).

    # index_col parameter: sets the index; pass a number n for the nth column, or a column name
    df = pd.read_csv('maotai.csv',index_col=0)
    df

                      收盘        开盘         高         低    交易量     涨跌幅
    日期
    2021/11/12  1,773.78  1,778.00  1,785.05  1,767.00  1.76M   0.24%
    2021/11/11  1,769.60  1,752.93  1,769.60  1,741.50  2.27M   0.89%
    2021/11/10  1,753.99  1,790.01  1,795.00  1,735.00  3.53M  -2.01%
    2021/11/9   1,790.01  1,819.98  1,827.87  1,782.00  2.74M  -1.65%
    2021/11/8   1,820.10  1,820.00  1,830.80  1,802.05  1.77M   0.01%
    ...              ...       ...       ...       ...    ...     ...
    2020/11/18  1,693.65  1,715.00  1,720.53  1,683.16  3.52M  -1.29%
    2020/11/17  1,715.80  1,740.00  1,742.35  1,701.07  2.52M  -0.82%
    2020/11/16  1,730.05  1,711.00  1,730.05  1,697.26  3.06M   1.47%
    2020/11/13  1,705.00  1,724.00  1,728.88  1,691.00  2.82M  -1.72%
    2020/11/12  1,734.79  1,730.01  1,750.00  1,722.27  2.35M   0.20%

    244 rows × 6 columns

    # By default the index labels are read as plain strings
    df.index
    Index(['2021/11/12', '2021/11/11', '2021/11/10', '2021/11/9', '2021/11/8',
           '2021/11/5', '2021/11/4', '2021/11/3', '2021/11/2', '2021/11/1',
           ...
           '2020/11/25', '2020/11/24', '2020/11/23', '2020/11/20', '2020/11/19',
           '2020/11/18', '2020/11/17', '2020/11/16', '2020/11/13', '2020/11/12'],
          dtype='object', name='日期', length=244)

    # parse_dates=True tries to parse the index as time objects; thousands=',' reads numbers written like 1,773.78 as 1773.78
    df = pd.read_csv('maotai.csv',index_col=0,thousands=',',parse_dates=True)
    df

                     收盘       开盘        高        低    交易量     涨跌幅
    日期
    2021-11-12  1773.78  1778.00  1785.05  1767.00  1.76M   0.24%
    2021-11-11  1769.60  1752.93  1769.60  1741.50  2.27M   0.89%
    2021-11-10  1753.99  1790.01  1795.00  1735.00  3.53M  -2.01%
    2021-11-09  1790.01  1819.98  1827.87  1782.00  2.74M  -1.65%
    2021-11-08  1820.10  1820.00  1830.80  1802.05  1.77M   0.01%
    ...             ...      ...      ...      ...    ...     ...
    2020-11-18  1693.65  1715.00  1720.53  1683.16  3.52M  -1.29%
    2020-11-17  1715.80  1740.00  1742.35  1701.07  2.52M  -0.82%
    2020-11-16  1730.05  1711.00  1730.05  1697.26  3.06M   1.47%
    2020-11-13  1705.00  1724.00  1728.88  1691.00  2.82M  -1.72%
    2020-11-12  1734.79  1730.01  1750.00  1722.27  2.35M   0.20%

    244 rows × 6 columns

    df.index
    DatetimeIndex(['2021-11-12', '2021-11-11', '2021-11-10', '2021-11-09',
                   '2021-11-08', '2021-11-05', '2021-11-04', '2021-11-03',
                   '2021-11-02', '2021-11-01',
                   ...
                   '2020-11-25', '2020-11-24', '2020-11-23', '2020-11-20',
                   '2020-11-19', '2020-11-18', '2020-11-17', '2020-11-16',
                   '2020-11-13', '2020-11-12'],
                  dtype='datetime64[ns]', name='日期', length=244, freq=None)

    # parse_dates=[n] converts column n to time objects
    df = pd.read_csv('maotai.csv',index_col=0,parse_dates=[0])
    df.index
    DatetimeIndex(['2021-11-12', '2021-11-11', '2021-11-10', '2021-11-09',
                   '2021-11-08', '2021-11-05', '2021-11-04', '2021-11-03',
                   '2021-11-02', '2021-11-01',
                   ...
                   '2020-11-25', '2020-11-24', '2020-11-23', '2020-11-20',
                   '2020-11-19', '2020-11-18', '2020-11-17', '2020-11-16',
                   '2020-11-13', '2020-11-12'],
                  dtype='datetime64[ns]', name='日期', length=244, freq=None)

    # header=None: do not use row 0 of the CSV file as the column names
    pd.read_csv('maotai.csv',index_col=0,header=None)

                       1         2         3         4      5       6
    0
    日期               收盘        开盘         高         低    交易量     涨跌幅
    2021/11/12  1,773.78  1,778.00  1,785.05  1,767.00  1.76M   0.24%
    2021/11/11  1,769.60  1,752.93  1,769.60  1,741.50  2.27M   0.89%
    2021/11/10  1,753.99  1,790.01  1,795.00  1,735.00  3.53M  -2.01%
    2021/11/9   1,790.01  1,819.98  1,827.87  1,782.00  2.74M  -1.65%
    ...              ...       ...       ...       ...    ...     ...
    2020/11/18  1,693.65  1,715.00  1,720.53  1,683.16  3.52M  -1.29%
    2020/11/17  1,715.80  1,740.00  1,742.35  1,701.07  2.52M  -0.82%
    2020/11/16  1,730.05  1,711.00  1,730.05  1,697.26  3.06M   1.47%
    2020/11/13  1,705.00  1,724.00  1,728.88  1,691.00  2.82M  -1.72%
    2020/11/12  1,734.79  1,730.01  1,750.00  1,722.27  2.35M   0.20%

    245 rows × 6 columns

    # names: supply your own column names
    pd.read_csv('maotai.csv',index_col=0,header=None,names=['a','b','c','d','e','f','g'])

                       b         c         d         e      f       g
    a
    日期               收盘        开盘         高         低    交易量     涨跌幅
    2021/11/12  1,773.78  1,778.00  1,785.05  1,767.00  1.76M   0.24%
    2021/11/11  1,769.60  1,752.93  1,769.60  1,741.50  2.27M   0.89%
    2021/11/10  1,753.99  1,790.01  1,795.00  1,735.00  3.53M  -2.01%
    2021/11/9   1,790.01  1,819.98  1,827.87  1,782.00  2.74M  -1.65%
    ...              ...       ...       ...       ...    ...     ...
    2020/11/18  1,693.65  1,715.00  1,720.53  1,683.16  3.52M  -1.29%
    2020/11/17  1,715.80  1,740.00  1,742.35  1,701.07  2.52M  -0.82%
    2020/11/16  1,730.05  1,711.00  1,730.05  1,697.26  3.06M   1.47%
    2020/11/13  1,705.00  1,724.00  1,728.88  1,691.00  2.82M  -1.72%
    2020/11/12  1,734.79  1,730.01  1,750.00  1,722.27  2.35M   0.20%

    245 rows × 6 columns

    # na_values: the strings that should be read as NaN
    pd.read_csv('maotai.csv',index_col=0,header=None,names=['a','b','c','d','e','f','g'],na_values=['收盘','开盘'])

                       b         c         d         e      f       g
    a
    日期               NaN       NaN         高         低    交易量     涨跌幅
    2021/11/12  1,773.78  1,778.00  1,785.05  1,767.00  1.76M   0.24%
    2021/11/11  1,769.60  1,752.93  1,769.60  1,741.50  2.27M   0.89%
    2021/11/10  1,753.99  1,790.01  1,795.00  1,735.00  3.53M  -2.01%
    2021/11/9   1,790.01  1,819.98  1,827.87  1,782.00  2.74M  -1.65%
    ...              ...       ...       ...       ...    ...     ...
    2020/11/18  1,693.65  1,715.00  1,720.53  1,683.16  3.52M  -1.29%
    2020/11/17  1,715.80  1,740.00  1,742.35  1,701.07  2.52M  -0.82%
    2020/11/16  1,730.05  1,711.00  1,730.05  1,697.26  3.06M   1.47%
    2020/11/13  1,705.00  1,724.00  1,728.88  1,691.00  2.82M  -1.72%
    2020/11/12  1,734.79  1,730.01  1,750.00  1,722.27  2.35M   0.20%

    245 rows × 6 columns

    # df.to_<format>() saves the frame in that file format, for example df.to_csv()
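The read_csv parameters above can be tried without the original file. The three-row sample below is hypothetical: it only imitates the layout of maotai.csv (quoted numbers with thousands separators), and the in-memory `io.StringIO` stands in for the file on disk.

```python
import io
import pandas as pd

# Hypothetical sample in the same layout as maotai.csv (not the real data file).
csv_text = (
    "日期,收盘,开盘\n"
    '2021/11/12,"1,773.78","1,778.00"\n'
    '2021/11/11,"1,769.60","1,752.93"\n'
    '2021/11/10,"1,753.99","1,790.01"\n'
)

df = pd.read_csv(
    io.StringIO(csv_text),
    index_col=0,       # use the first column (日期) as the row index
    parse_dates=True,  # try to parse that index as dates
    thousands=",",     # read "1,773.78" as the number 1773.78
)

print(type(df.index).__name__)                     # DatetimeIndex
print(df.loc[pd.Timestamp("2021-11-12"), "收盘"])  # 1773.78
print(df["开盘"].dtype)                            # float64

# header=None treats the header line as data, so names= must supply the labels.
raw = pd.read_csv(io.StringIO(csv_text), header=None, names=["a", "b", "c"], index_col=0)
print(len(raw))                                    # 4
```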

