[Python Study Notes, Beginner-Friendly Edition] Chapter 4: Pandas, Data Preparation, Data Processing, Data Analysis, and Data Visualization

Published: 2023/12/15


Chapter 4

Feel free to visit my Zhihu account: Coffee
and my Bilibili Marvel-editing account: VideosMan
If these notes help you, please give them a like.

# The IDE used is Spyder; ">>>" marks the input that gets executed
>>> print('Hello World') # code being run
Hello World # output

Chapter 4
1. About Pandas
2. Data preparation
- pandas and numpy
- Creating a DataFrame
  - 1. Standard creation
  - 2. From a dict of equal-length lists
  - 3. From a nested dict (a dict whose values are dicts)
- Adding, deleting, modifying, and querying
  - 1. Adding values
    - 1. Adding a column: assigning to a column that does not exist creates it
    - 2. Adding a row
    - 3. Deleting rows and columns: axis says whether rows or columns are targeted (rows are 0, columns are 1); inplace says whether the original object itself is modified
    - Modifying
- DataFrame
  - I. All values of a column
  - II. All values of a row
  - III. The value at a given row and column: df_signal['a'].iloc[-1]
  - IV. Deleting specific rows
  - V. Filtering a Python DataFrame by condition
  - VI. Sorting
- Importing data
3. Data processing
- 4.3.1 Data cleaning
  - 1. Handling duplicates: drop_duplicates()
  - 2. Handling missing values
    - 1. dropna(): drop the rows that contain missing values
    - 2. df.fillna(): replace NaN with another value; deleting rows with missing data can distort the analysis, so the gaps can be filled instead. [Example 4-8] Replacing missing values with a number or any string
    - 3. df.fillna(method='pad'): replace NaN with the previous value
    - 4. df.fillna(method='bfill'): replace NaN with the next value
    - 5. df.fillna(df.mean()): replace NaN with the mean or another descriptive statistic
    - 6. df.fillna(df.mean()['math':'physical']): fill missing values in selected columns only
    - 7. strip(): remove the given characters from both ends of a string (spaces by default); characters in the middle are kept
  - 3. Replacing specific values: replace('缺考', 0)
  - 4. Deleting the rows that satisfy a condition: drop()
- 4.3.2 Data extraction
  - 1. Field extraction: extract the characters at given positions of a column into a new column
  - 2. Field splitting: split existing strings on a separator sep
  - 3. Record extraction: select rows according to a condition
  - 4. Random sampling: draw a given number or proportion of rows at random
  - PS: selecting data by specified conditions
  - 5. Dict data: three ways to turn dict data into a DataFrame
- 4.3.3 Ranking and indexing
  - Notes: axis, ascending, inplace, by
  - 1. Sorting by index: df.sort_index()
  - 2. Reindexing: .reindex(index=None,**kwargs)
  - 3. Sorting by value: df.sort_values()
  - 4. The na_position parameter of sort_values()
  - 5. Ranking values: rank()
- 4.3.4 Merging data
  - 1. Record merging: combining two data frames with the same structure into one, that is, appending the records of one data frame to another. pd.concat([df1,df2])
  - 2. Field merging: combining different columns of the same data frame into a new column. X = x1+x2+…
  - 3. Field matching: combining data frames with different structures (two or more) on a condition, that is, appending columns. merge(x,y,left_on,right_on), a foreign-key join
- 4.3.5 Data computation
  - 1. Simple computation: add, subtract, multiply, or divide fields and store the result as a new field
  - 2. Normalization: scaling data proportionally into a given interval, usually 0-1 normalization. X*=(x-min)/(max-min)
- 4.3.6 Data grouping
  - Notes: pd.cut(series,bins,right=True,labels=None)
  - bins
  - labels
- 4.3.7 Date handling
  - 1. Date conversion: converting date strings into date-typed data
  - 2. Date formatting: converting date-typed data into strings of a given format
  - 3. Date extraction: pulling the required attributes out of a date
  - 4. Date comparison
  - 5. Date arithmetic
4. Data analysis
- 4.4.1 Basic statistics: describe
- 4.4.2 Group analysis: groupby (grouping discrete values)
- 4.4.3 Distribution analysis: cut+groupby (grouping continuous values)
- 4.4.4 Cross analysis: pivot_table (pivot tables)
- 4.4.5 Structure analysis: pivot_table+sum+div (proportions)
- 4.4.6 Correlation analysis: corr (one-dimensional, two-dimensional)
5. Data visualization
- Things to note
- 4.5.1 Pie chart: plt.pie(gb2.人数, labels=gb2.index, autopct='%.2f%%', colors=['b','pink',(0.5,0.8,0.3)], explode=[0,0,0,0,0.1])
- 4.5.2 Scatter plot: plt.plot(df.高代,df.数分,'o',color='pink')
- 4.5.3 Line chart: plt.plot(df.学号,df.总分,'-',color='r')
- 4.5.4 Bar chart: plt.bar(df.学号后三位,df.总分,width=1,color=['r','b'])
- 4.5.5 Histogram: plt.hist(df2.C语言程序设计,bins=10,color='g',cumulative=True)
These are my online notes and I hope they help you; your likes and bookmarks are my biggest motivation to keep going.

1. About Pandas

The Chinese-language Pandas site has very detailed documentation:
https://www.pypandas.cn/docs/

2. Data preparation

pandas and numpy

import pandas as pd
from pandas import Series
import numpy as np
from pandas import DataFrame

pandas: creates and manipulates data frames
import pandas as pd
Main objects: DataFrame, Series

numpy: supplies some special numeric values and is also used for visualization

import numpy as np

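For example:

Method	Explanation

np.nan	missing value
np.inf	infinity (-inf or +inf)
np.arange(16)	evenly spaced values with a start, an end, and a fixed step
np.random	random number generation
np.array([1,2,3,4])	an array built from the given values

Method	Result

numpy.size	number of elements (head count)
numpy.mean	mean
numpy.var	variance
numpy.std	standard deviation
numpy.max	highest score
numpy.min	lowest score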

Creating a DataFrame

1. Standard creation:

>>> import numpy as np
>>> from pandas import DataFrame
>>> df = DataFrame(np.arange(16).reshape(4,4),index=['a','b','c','d'],columns=['one','two','three','four'])
>>> df
   one  two  three  four
a    0    1      2     3
b    4    5      6     7
c    8    9     10    11
d   12   13     14    15

2. From a dict of equal-length lists:

>>> data = {'a':[5,8],'b':[1,0]}
>>> df = DataFrame(data)
>>> df
   a  b
0  5  1
1  8  0

The column order can be specified at the same time:

>>> df = DataFrame(data,columns=['b','a'])
>>> df
   b  a
0  1  5
1  0  8

3. From a nested dict (a dict whose values are dicts)

The outer keys become the column labels and the inner keys become the row labels. When the inner keys differ, the missing cells are filled with NaN.

>>> nest_dict={'shanghai':{2015:100,2016:101},'beijing':{2015:102,2016:103}}
>>> df = DataFrame(nest_dict)
>>> df
      shanghai  beijing
2015       100      102
2016       101      103
>>> nest_dict={'shanghai':{2015:100,2016:101},'beijing':{2015:102,2014:103}}
>>> df = DataFrame(nest_dict)
>>> df
      shanghai  beijing
2014       NaN    103.0
2015     100.0    102.0
2016     101.0      NaN

Adding, deleting, modifying, and querying

1. Adding values

1. Adding a column: assigning to a column that does not exist creates it

df['Hefei'] = 1
df

shanghai beijing Hefei
2014 NaN 103.0 1
2015 100.0 102.0 1
2016 101.0 NaN 1

2. Adding a row

Use loc. Older pandas versions also had an append method, which needed the new row passed as a dict; it was removed in pandas 2.0.

>>> df.loc[4]={'shanghai':5,'beijing':13, 'Hefei':50}
>>> df
      shanghai  beijing  Hefei
2014       NaN    103.0      1
2015     100.0    102.0      1
2016     101.0      NaN      1
4          5.0     13.0     50
>>> df.loc[2]={'shanghai':5,'beijing':13, 'Hefei':50}
>>> df
      shanghai  beijing  Hefei
2014       NaN    103.0      1
2015     100.0    102.0      1
2016     101.0      NaN      1
4          5.0     13.0     50
2          5.0     13.0     50
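Since DataFrame.append no longer exists in pandas 2.0 and later, here is a minimal sketch of the same step with pd.concat. The frame is rebuilt from the output above, and the names new_row and df2 are only illustrative.

```python
import numpy as np
import pandas as pd

# Rebuild the small frame used above (values copied from the example output).
df = pd.DataFrame(
    {"shanghai": [np.nan, 100.0, 101.0],
     "beijing": [103.0, 102.0, np.nan],
     "Hefei": [1, 1, 1]},
    index=[2014, 2015, 2016],
)

# A one-row frame; its index label (4) becomes the label of the new row.
new_row = pd.DataFrame({"shanghai": [5.0], "beijing": [13.0], "Hefei": [50]}, index=[4])

# pd.concat stacks the frames vertically and returns a new DataFrame;
# the original df is left unchanged.
df2 = pd.concat([df, new_row])
print(df2)
```

Unlike the loc assignment, this does not modify df in place, so the result has to be assigned to a name.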

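3. Deleting rows and columns: axis says whether rows or columns are targeted (rows are 0, columns are 1); inplace says whether the original object itself is modified

>>> df.drop('Hefei',axis = 1,inplace = True)
>>> df
      shanghai  beijing
2014       NaN      103
2015       100      102
2016       101      NaN
4            5       13
2            5       13
3            5       13
>>> df.drop(3,axis = 0,inplace = True)
>>> df
      shanghai  beijing
2014       NaN      103
2015       100      102
2016       101      NaN
4            5       13
2            5       13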

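Modifying

The main thing to remember when modifying is that plain df[...] indexing with a single key addresses columns, while a slice such as df[:3] addresses rows.

>>> df
      shanghai  beijing
2014         6        6
2015       100      102
2016       101      NaN
4            5       13
2            5       13
>>> df[:3]
      shanghai  beijing
2014         6        6
2015       100      102
2016       101      NaN
>>> df[1] = 3
>>> df
      shanghai  beijing  1
2014         6        6  3
2015       100      102  3
2016       101      NaN  3
4            5       13  3
2            5       13  3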
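To make the "start from the column" rule concrete, here is a small sketch on a made-up frame (the values are not from the example above). It assumes that loc is the clearest way to change a single cell or a whole row.

```python
import pandas as pd

df = pd.DataFrame({"shanghai": [100, 101], "beijing": [102, 103]}, index=[2015, 2016])

df["beijing"] = 0              # a single key addresses a column: overwrite the whole column
df.loc[2015, "shanghai"] = 6   # one cell: row label first, then column label
df.loc[2016] = [7, 8]          # one whole row, by row label
df[1] = 3                      # 1 is not an existing column, so a new column named 1 is created
print(df)
```

The last line shows why df[1] = 3 in the example above added a column instead of changing row 1.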

DataFrame

df.info()            # print a summary
df.describe()        # descriptive statistics
df.values            # the data as an <ndarray>
df.to_numpy()        # the data as an <ndarray> (recommended)
df.shape             # shape (number of rows, number of columns)
df.columns           # column labels <Index>
df.columns.values    # column labels <ndarray>
df.index             # row labels <Index>
df.index.values      # row labels <ndarray>
df.head(n)           # first n rows
df.tail(n)           # last n rows
pd.options.display.max_columns = n   # display at most n columns
pd.options.display.max_rows = n      # display at most n rows
df.memory_usage()    # memory used (bytes)

np.random.seed(1234)
d1 = pd.Series(2*np.random.normal(size = 100)+3)
d2 = np.random.f(2,4,size = 100)
d3 = np.random.randint(1,100,size = 100)
d1.count()        # number of non-null elements
d1.min()          # minimum
d1.max()          # maximum
d1.idxmin()       # index of the minimum, like which.min in R
d1.idxmax()       # index of the maximum, like which.max in R
d1.quantile(0.1)  # 10% quantile
d1.sum()          # sum
d1.mean()         # mean
d1.median()       # median
d1.mode()         # mode
d1.var()          # variance
d1.std()          # standard deviation
d1.mad()          # mean absolute deviation (removed in pandas 2.0)
d1.skew()         # skewness
d1.kurt()         # kurtosis
d1.describe()     # several descriptive statistics at once

np.nan            # use this to assign a missing value

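I. All values of a column

df['a']          # column a
df[['a','b']]    # columns a and b

II. All values of a row

# first n rows, last n rows
df.head(n)
df.tail(n)
# iloc takes integer positions only, not labels (left-inclusive, right-exclusive)
df.iloc[0:2]                # first 2 rows
df.iloc[0]                  # row 0
df.iloc[0:2,0:2]            # rows 0 and 1, columns 0 and 1
df.iloc[[0,2],[1,2,3]]      # rows 0 and 2, columns 1, 2, 3
# select the rows equal to some value with ==
df.loc[df['column_name'] == some_value]
# select the rows whose value belongs to a collection with isin
df.loc[df['column_name'].isin(some_values)]
# combine several conditions with &
df.loc[(df['column'] == some_value) & df['other_column'].isin(some_values)]
# select the rows not equal to some value with !=
df.loc[df['column_name'] != some_value]
# isin returns a boolean Series; use ~ to select the rows that do not match
df.loc[~df['column_name'].isin(some_values)]
# extract specific rows and columns
li=list(df.columns)
df.iloc[[3,4,8],[li.index('animal'),li.index('age')]]

III. The value at a given row and column: df_signal['a'].iloc[-1]

# iat gets a single value, by integer position only
df.iat[1,1]          # row 1, column 1
# at gets a single value, by index and column labels only
df.at['one','a']     # row 'one', column 'a'

💡 The index can only be replaced as a whole; changing a single element of it in place is not supported.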
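A short sketch of iat, at, and the index restriction mentioned in the tip, on a made-up two-row frame. The variable names are illustrative, and rename is shown as one possible way to change a single label.

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 2], "b": [3, 4]}, index=["one", "two"])

by_position = df.iat[1, 1]      # row position 1, column position 1
by_label = df.at["one", "a"]    # row label 'one', column label 'a'

# An Index is immutable: assigning to one element raises TypeError.
try:
    df.index[0] = "first"
    single_edit_ok = True
except TypeError:
    single_edit_ok = False

# Replacing the whole index at once works, and rename returns a copy
# with only the listed labels changed.
df.index = ["first", "two"]
renamed = df.rename(index={"two": "second"})
print(by_position, by_label, single_edit_ok, list(renamed.index))
```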

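IV. Deleting specific rows

# Delete every row whose "score" column is below 50 (four equivalent ways):
df = df.drop(df[df.score < 50].index)
df = df.drop(df[df['score'] < 50].index)
df.drop(df[df.score < 50].index, inplace=True)
df.drop(df[df['score'] < 50].index, inplace=True)
# Several conditions
# The operators | (either holds), & (both hold), and ~ (negation) can be used; each condition must be wrapped in parentheses.
# For example, delete every row whose "score" is below 50 and above 20:
df = df.drop(df[(df.score < 50) & (df.score > 20)].index)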
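Here is a runnable sketch of the same idea on made-up scores. It assumes the index labels are unique, because drop removes every row that carries a passed label.

```python
import pandas as pd

df = pd.DataFrame({"name": ["a", "b", "c", "d"], "score": [10, 30, 60, 45]})

# Drop the rows with 20 < score < 50 by passing their index labels to drop().
dropped = df.drop(df[(df.score > 20) & (df.score < 50)].index)

# Equivalent: keep the rows that do NOT satisfy the condition.
kept = df[~((df.score > 20) & (df.score < 50))]
print(dropped)
```

The boolean-mask form on the last line avoids the detour through the index and is often easier to read.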

V. Filtering a Python DataFrame by condition

For example, to see the rows whose id equals 11396:
pdata1[pdata1['id']==11396]
pdata1[pdata1.id==11396]
Rows whose time is less than 25320:
pdata1[pdata1['time']<25320]
pdata1[pdata1.time<25320]
Rows whose time is less than 25320 and at least 25270:
pdata1[(pdata1['time'] < 25320)&(pdata1['time'] >= 25270)]
Only some columns of the filtered rows:
pdata1[(pdata1['time'] < 25320)&(pdata1['time'] >= 25270)][['x','y']]
Note that each condition must be wrapped in parentheses before being combined with & or |.

VI. Sorting

# Sort df by the column xxx. inplace defaults to False; in that case df keeps its original order and a sorted copy is returned.
df.sort_values("xxx",inplace=True)

Importing data

From Excel

from pandas import read_excel
df = read_excel('e://rz2.xlsx')
df

From CSV

from pandas import read_csv
path4 = 'C:\\Users\\admin\\Desktop\\大数据爬虫2\\合并\\主键外连接.csv'
df5 = read_csv(path4,engine='python')

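3. Data processing

df3=pd.merge(df1,df2,left_on='学号',right_on='学号')   # join on the student ID column (foreign-key join)
df3=df3.drop(columns=['手机号码'])                      # drop the phone number column
df3
df3=df3.replace('缺考', 0)                              # replace the special value '缺考' (absent from the exam) with 0

The first step of data analysis is to improve data quality. Data cleaning deals with missing data and removes meaningless information. It is the most critical step in the data value chain: garbage data, even under the best analysis, produces wrong results and misleads the business.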

4.3.1 Data cleaning

1. Handling duplicates: drop_duplicates()

drop_duplicates() removes rows that are identical to another row, keeping one of them.

[Example 4-6] Removing duplicates.
Here df is the raw data; rows 7 and 9 are duplicates of each other, and so are rows 8 and 10.

from pandas import DataFrame
from pandas import read_excel
df = read_excel('e://rz2.xlsx')
df

Out[1]:
YHM TCSJ YWXT IP
0 S1402048 18922254812 1.225790e+17 221.205.98.55
1 S1411023 13522255003 1.225790e+17 183.184.226.205
2 S1402048 13422259938 NaN 221.205.98.55
3 20031509 18822256753 NaN 222.31.51.200
4 S1405010 18922253721 1.225790e+17 120.207.64.3
5 20140007 NaN 1.225790e+17 222.31.51.200
6 S1404095 13822254373 1.225790e+17 222.31.59.220
7 S1402048 13322252452 1.225790e+17 221.205.98.55
8 S1405011 18922257681 1.225790e+17 183.184.230.38
9 S1402048 13322252452 1.225790e+17 221.205.98.55
10 S1405011 18922257681 1.225790e+17 183.184.230.38

newDF=df.drop_duplicates()
newDF

Out[2]:
YHM TCSJ YWXT IP
0 S1402048 18922254812 1.225790e+17 221.205.98.55
1 S1411023 13522255003 1.225790e+17 183.184.226.205
2 S1402048 13422259938 NaN 221.205.98.55
3 20031509 18822256753 NaN 222.31.51.200
4 S1405010 18922253721 1.225790e+17 120.207.64.3
5 20140007 NaN 1.225790e+17 222.31.51.200
6 S1404095 13822254373 1.225790e+17 222.31.59.220
7 S1402048 13322252452 1.225790e+17 221.205.98.55
8 S1405011 18922257681 1.225790e+17 183.184.230.38
In df above, rows 7 and 9 are identical, as are rows 8 and 10; after deduplication one row of each pair is kept (rows 7 and 8).
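As a hedged sketch, the keep and subset parameters of drop_duplicates control which copy survives and which columns are compared. The four rows below are a made-up subset of the data above.

```python
import pandas as pd

df = pd.DataFrame({
    "YHM": ["S1402048", "S1405011", "S1402048", "S1405011"],
    "IP": ["221.205.98.55", "183.184.230.38", "221.205.98.55", "183.184.230.38"],
})

first_kept = df.drop_duplicates()               # default keep='first'
last_kept = df.drop_duplicates(keep="last")     # keep the last occurrence instead
by_column = df.drop_duplicates(subset=["YHM"])  # compare only the YHM column
print(list(first_kept.index), list(last_kept.index))
```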

2. Handling missing values:

dropna(), df.fillna(), df.fillna(method='pad'), df.fillna(method='bfill'), df.fillna(df.mean()), df.fillna(df.mean()['math':'physical']), strip()

Missing data can be handled by filling it in, by deleting the affected rows, or by leaving it untouched.

[Example 4-6] Handling missing values.
Here df is the raw data; rows 2, 3, and 5 contain missing values.

from pandas import DataFrame
from pandas import read_excel
df = read_excel('e://rz2.xlsx')
df

Out[1]:
YHM TCSJ YWXT IP
0 S1402048 18922254812 1.225790e+17 221.205.98.55
1 S1411023 13522255003 1.225790e+17 183.184.226.205
2 S1402048 13422259938 NaN 221.205.98.55
3 20031509 18822256753 NaN 222.31.51.200
4 S1405010 18922253721 1.225790e+17 120.207.64.3
5 20140007 NaN 1.225790e+17 222.31.51.200
6 S1404095 13822254373 1.225790e+17 222.31.59.220
7 S1402048 13322252452 1.225790e+17 221.205.98.55
8 S1405011 18922257681 1.225790e+17 183.184.230.38
9 S1402048 13322252452 1.225790e+17 221.205.98.55
10 S1405011 18922257681 1.225790e+17 183.184.230.38

1. dropna(): drop the rows that contain missing values

[Example 4-7] Deleting the rows that contain empty values

from pandas import DataFrame
from pandas import read_excel
df = read_excel('e://rz2.xlsx')
newDF=df.dropna()
newDF

Out[3]:
YHM TCSJ YWXT IP
0 S1402048 18922254812 1.225790e+17 221.205.98.55
1 S1411023 13522255003 1.225790e+17 183.184.226.205
4 S1405010 18922253721 1.225790e+17 120.207.64.3
6 S1404095 13822254373 1.225790e+17 222.31.59.220
7 S1402048 13322252452 1.225790e+17 221.205.98.55
8 S1405011 18922257681 1.225790e+17 183.184.230.38
9 S1402048 13322252452 1.225790e+17 221.205.98.55
10 S1405011 18922257681 1.225790e+17 183.184.230.38
Rows 2, 3, and 5, which contained NaN, have been removed.

2. df.fillna(): replace NaN with another value. Deleting rows with missing data can distort the analysis, so the gaps can be filled instead.

[Example 4-8] Replacing missing values with a number or any string

from pandas import DataFrame
from pandas import read_excel
df = read_excel('e://rz2.xlsx')
df.fillna('?')

Out[4]:
YHM TCSJ YWXT IP DLSJ
0 S1402048 1.89223e+10 1.22579e+17 221.205.98.55 2014-11-04 08:44:46
1 S1411023 1.35223e+10 1.22579e+17 183.184.226.205 2014-11-04 08:45:06
2 S1402048 1.34223e+10 ? 221.205.98.55 2014-11-04 08:46:39
3 20031509 1.88223e+10 ? 222.31.51.200 2014-11-04 08:47:41
4 S1405010 1.89223e+10 1.22579e+17 120.207.64.3 2014-11-04 08:49:03
5 20140007 ? 1.22579e+17 222.31.51.200 2014-11-04 08:50:06
6 S1404095 1.38223e+10 1.22579e+17 222.31.59.220 2014-11-04 08:50:02
7 S1402048 1.33223e+10 1.22579e+17 221.205.98.55 2014-11-04 08:49:18
8 S1405011 1.89223e+10 1.22579e+17 183.184.230.38 2014-11-04 08:14:55
9 S1402048 1.33223e+10 1.22579e+17 221.205.98.55 2014-11-04 08:49:18
10 S1405011 1.89223e+10 1.22579e+17 183.184.230.38 2014-11-04 08:14:55
The missing values in rows 2, 3, and 5 have been replaced by ?.

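3. df.fillna(method='pad'): replace NaN with the previous value (since pandas 2.1 the method argument is deprecated in favor of df.ffill())

[Example 4-9] Replacing missing values with the previous value
(rows 2, 3, and 5 contain missing values)

from pandas import DataFrame
from pandas import read_excel
df = read_excel('e://rz2.xlsx')
df.fillna(method='pad')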

Out[5]:
YHM TCSJ YWXT IP DLSJ
0 S1402048 18922254812 1.225790e+17 221.205.98.55 2014-11-04 08:44:46
1 S1411023 13522255003 1.225790e+17 183.184.226.205 2014-11-04 08:45:06
2 S1402048 13422259938 1.225790e+17 221.205.98.55 2014-11-04 08:46:39
3 20031509 18822256753 1.225790e+17 222.31.51.200 2014-11-04 08:47:41
4 S1405010 18922253721 1.225790e+17 120.207.64.3 2014-11-04 08:49:03
5 20140007 18922253721 1.225790e+17 222.31.51.200 2014-11-04 08:50:06
6 S1404095 13822254373 1.225790e+17 222.31.59.220 2014-11-04 08:50:02
7 S1402048 13322252452 1.225790e+17 221.205.98.55 2014-11-04 08:49:18
8 S1405011 18922257681 1.225790e+17 183.184.230.38 2014-11-04 08:14:55
9 S1402048 13322252452 1.225790e+17 221.205.98.55 2014-11-04 08:49:18
10 S1405011 18922257681 1.225790e+17 183.184.230.38 2014-11-04 08:14:55

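4. df.fillna(method='bfill'): replace NaN with the next value (since pandas 2.1 the method argument is deprecated in favor of df.bfill())

[Example 4-10] Replacing NaN with the next value
(rows 2, 3, and 5 contain missing values)

from pandas import DataFrame
from pandas import read_excel
df = read_excel('e://rz2.xlsx')
df.fillna(method='bfill')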

Out[6]:
YHM TCSJ YWXT IP DLSJ
0 S1402048 18922254812 1.225790e+17 221.205.98.55 2014-11-04 08:44:46
1 S1411023 13522255003 1.225790e+17 183.184.226.205 2014-11-04 08:45:06
2 S1402048 13422259938 1.225790e+17 221.205.98.55 2014-11-04 08:46:39
3 20031509 18822256753 1.225790e+17 222.31.51.200 2014-11-04 08:47:41
4 S1405010 18922253721 1.225790e+17 120.207.64.3 2014-11-04 08:49:03
5 20140007 13822254373 1.225790e+17 222.31.51.200 2014-11-04 08:50:06
6 S1404095 13822254373 1.225790e+17 222.31.59.220 2014-11-04 08:50:02
7 S1402048 13322252452 1.225790e+17 221.205.98.55 2014-11-04 08:49:18
8 S1405011 18922257681 1.225790e+17 183.184.230.38 2014-11-04 08:14:55
9 S1402048 13322252452 1.225790e+17 221.205.98.55 2014-11-04 08:49:18
10 S1405011 18922257681 1.225790e+17 183.184.230.38 2014-11-04 08:14:55

5. df.fillna(df.mean()): replace NaN with the mean or another descriptive statistic.

[Example 4-11] Filling with the mean.

from pandas import DataFrame
from pandas import read_excel
df = read_excel('e://rz2_0.xlsx')
df
df.fillna(df.mean())

Out[7]:
No math physical Chinese
0 1 76 85 78
1 2 85 56 NaN
2 3 76 95 85
3 4 NaN 75 58
4 5 87 52 68

Out[8]:
No math physical Chinese
0 1 76 85 78.00
1 2 85 56 72.25
2 3 76 95 85.00
3 4 81 75 58.00
4 5 87 52 68.00

6. df.fillna(df.mean()['math':'physical']): fill missing values in selected columns only

[Example 4-12] Filling one column with that column's mean

from pandas import DataFrame
from pandas import read_excel
df = read_excel('e://rz2_0.xlsx')
df.fillna(df.mean()['math':'physical'])

Out[26]:
No math physical Chinese
0 1 76.0 85 78.0
1 2 85.0 56 NaN
2 3 76.0 95 85.0
3 4 NaN 75 58.0
4 5 87.0 52 68.0

Out[9]:
No math physical Chinese
0 1 76 85 78
1 2 85 56 NaN
2 3 76 95 85
3 4 81 75 58
4 5 87 52 68
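The fill strategies from items 3 to 6 can be compared side by side without the Excel file. This is a sketch that rebuilds the math and Chinese columns from the output above; df.ffill() and df.bfill() are used as the counterparts of method='pad' and method='bfill', and the dict form of fillna is one assumed way to restrict the fill to a single column.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "math": [76, 85, 76, np.nan, 87],
    "Chinese": [78, np.nan, 85, 58, 68],
})

forward = df.ffill()             # previous value, like fillna(method='pad')
backward = df.bfill()            # next value, like fillna(method='bfill')
by_mean = df.fillna(df.mean())   # each column's NaN gets that column's mean
only_math = df.fillna({"math": df["math"].mean()})  # fill the math column only
print(by_mean)
```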

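7. strip(): remove the given characters from both ends of a string (spaces by default); characters in the middle are kept.

[Example 4-13] Removing the given characters from both ends of a string.

from pandas import DataFrame
from pandas import read_excel
df = read_excel('e://rz2.xlsx')
newDF=df['IP'].str.strip()   # IP is an object column, so go through the .str accessor first
newDF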

Out[27]:
YHM TCSJ YWXT IP DLSJ
0 S1402048 18922254812.0 1.2257903137349373e+17 221.205.98.55 2014-11-04 08:44:46
1 S1411023 13522255003.0 1.2257903137349373e+17 183.184.226.205 2014-11-04 08:45:06
2 S1402048 13422259938.0 221.205.98.55 2014-11-04 08:46:39
3 20031509 18822256753.0 222.31.51.200 2014-11-04 08:47:41
4 S1405010 18922253721.0 1.2257903137349373e+17 120.207.64.3 2014-11-04 08:49:03
5 20140007 1.2257903137349373e+17 222.31.51.200 2014-11-04 08:50:06
6 S1404095 13822254373.0 1.2257903137349373e+17 222.31.59.220 2014-11-04 08:50:02
7 S1402048 13322252452.0 1.2257903137349373e+17 221.205.98.55 2014-11-04 08:49:18
8 S1405011 18922257681.0 1.2257903137349373e+17 183.184.230.38 2014-11-04 08:14:55
9 S1402048 13322252452.0 1.2257903137349373e+17 221.205.98.55 2014-11-04 08:49:18
10 S1405011 18922257681.0 1.2257903137349373e+17 183.184.230.38 2014-11-04 08:14:55

Out[10]:
0 221.205.98.55
1 183.184.226.205
2 221.205.98.55
3 222.31.51.200
4 120.207.64.3
5 222.31.51.200
6 222.31.59.220
7 221.205.98.55
8 183.184.230.38
9 221.205.98.55
10 183.184.230.38
Name: IP, dtype: object
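The IP values in the example carry no padding, so the effect of strip is invisible there. Here is a small sketch with made-up padded strings, including the variant that strips a chosen set of characters.

```python
import pandas as pd

ip = pd.Series(["  221.205.98.55 ", "183.184.226.205", "**120.207.64.3*"])

no_spaces = ip.str.strip()      # default: whitespace at both ends
no_stars = ip.str.strip("* ")   # any of the listed characters, at both ends
left_only = ip.str.lstrip()     # left side only; rstrip() is the mirror image
print(list(no_stars))
```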

3. Replacing specific values: replace('缺考', 0)

df11 = df11.replace(np.nan,'[正常]')
df11 = df11.replace('none',np.nan)
df11 = df11.replace(' ― ',np.nan)

4. Deleting the rows that satisfy a condition: drop()

df = df.drop(df[condition].index)

# delete the phones priced at 1000 or more

df_s_acc = df_s.drop(df_s[df_s['价格']>=1000].index)

Out[68]:
Unnamed: 0 ID值价格 … 标签变更规范日期
5566 6626 1354676 699 … 安全手机 2020/11/1 2020-11-01
5565 6625 1354673 799 … 安全手机 2020/11/1 2020-11-01
101 102 1346463 2699 … 安全手机 2020/11/11 2020-11-11
64 64 1338710 3199 … 安全手机 2020/11/11 2020-11-11
2884 3382 1352445 1499 … 安全手机 2020/11/26 2020-11-26
2892 3391 1349515 1099 … 安全手机 2020/11/26 2020-11-26
2910 3411 1349516 1299 … 安全手机 2020/11/26 2020-11-26
2844 3340 1348871 999 … 安全手机 2020/11/26 2020-11-26
4046 4845 1350884 799 … 安全手机 2020/12/1 2020-12-01
4036 4834 1350882 699 … 安全手机 2020/12/1 2020-12-01
3394 4023 1349976 799 … 安全手机 2020/12/12 2020-12-12
4740 5656 1353088 2399 … 安全手机 2020/12/22 2020-12-22
4737 5653 1353068 1999 … 安全手机 2020/12/22 2020-12-22
4048 4847 1357947 1099 … 安全手机 2021/1/1 2021-01-01
4038 4836 1357933 999 … 安全手机 2021/1/1 2021-01-01
4043 4842 1357949 1199 … 安全手机 2021/1/1 2021-01-01

Out[72]:
Unnamed: 0 ID值价格 … 标签变更规范日期
5566 6626 1354676 699 … 安全手机 2020/11/1 2020-11-01
5565 6625 1354673 799 … 安全手机 2020/11/1 2020-11-01
2844 3340 1348871 999 … 安全手机 2020/11/26 2020-11-26
4046 4845 1350884 799 … 安全手机 2020/12/1 2020-12-01
4036 4834 1350882 699 … 安全手机 2020/12/1 2020-12-01
3394 4023 1349976 799 … 安全手机 2020/12/12 2020-12-12
4038 4836 1357933 999 … 安全手机 2021/1/1 2021-01-01

Several conditions can be combined as well:

df_clear = df.drop(df[df['x']<0.01].index)
# several conditions
df_clear = df.drop(df[(df['x']<0.01) | (df['x']>10)].index)   # delete the rows where x is below 0.01 or above 10

4.3.2 Data extraction

1. Field extraction: extract the characters at given positions of a column into a new column.

slice(start,stop)
start: start position; stop: end position

[Example 4-14] Extracting part of a column.

from pandas import DataFrame
from pandas import read_excel
df = read_excel('e://rz2.xlsx')
df['TCSJ']=df['TCSJ'].astype(str)   # astype() converts the type
df['TCSJ']
bands = df['TCSJ'].str.slice(0,3)
bands

Out[1]:
0 18922254812
1 13522255003
2 13422259938
3 18822256753
4 18922253721
5 nan
6 13822254373
7 13322252452
8 18922257681
9 13322252452
10 18922257681
Name: TCSJ, dtype: object

Out[2]:
0 189
1 135
2 134
3 188
4 189
5 nan
6 138
7 133
8 189
9 133
10 189
Name: TCSJ, dtype: object

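2. Field splitting: split existing strings on a separator sep.

split(sep,n,expand=False)
sep: the separator used to split the string
n: the number of splits, that is, the number of extra columns added
expand: whether to expand the result into a DataFrame, default False
Returns: a DataFrame if expand is True, a Series if it is False.

[Original data]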

YHM TCSJ YWXT IP DLSJ
0 S1402048 18922254812.0 1.2257903137349373e+17 221.205.98.55 2014-11-04 08:44:46
1 S1411023 13522255003.0 1.2257903137349373e+17 183.184.226.205 2014-11-04 08:45:06
2 S1402048 13422259938.0 221.205.98.55 2014-11-04 08:46:39
3 20031509 18822256753.0 222.31.51.200 2014-11-04 08:47:41
4 S1405010 18922253721.0 1.2257903137349373e+17 120.207.64.3 2014-11-04 08:49:03
5 20140007 1.2257903137349373e+17 222.31.51.200 2014-11-04 08:50:06
6 S1404095 13822254373.0 1.2257903137349373e+17 222.31.59.220 2014-11-04 08:50:02
7 S1402048 13322252452.0 1.2257903137349373e+17 221.205.98.55 2014-11-04 08:49:18
8 S1405011 18922257681.0 1.2257903137349373e+17 183.184.230.38 2014-11-04 08:14:55
9 S1402048 13322252452.0 1.2257903137349373e+17 221.205.98.55 2014-11-04 08:49:18
10 S1405011 18922257681.0 1.2257903137349373e+17 183.184.230.38 2014-11-04 08:14:55

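[Example 4-15] Splitting a string into a given number of columns

from pandas import DataFrame
from pandas import read_excel
df = read_excel('e://rz2.xlsx')
newDF=df['IP'].str.strip()                      # go through .str, then remove leading and trailing spaces
newDF=newDF.str.split('.',n=1,expand=True)      # split at the first "." into two columns; n=1 is the number of added columns
newDF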

Out[1]:
0 1
0 221 205.98.55
1 183 184.226.205
2 221 205.98.55
3 222 31.51.200
4 120 207.64.3
5 222 31.51.200
6 222 31.59.220
7 221 205.98.55
8 183 184.230.38
9 221 205.98.55
10 183 184.230.38

newDF.columns = ['IP1','IP2-4']   # name the first and second columns
newDF

Out[2]:
IP1 IP2-4
0 221 205.98.55
1 183 184.226.205
2 221 205.98.55
3 222 31.51.200
4 120 207.64.3
5 222 31.51.200
6 222 31.59.220
7 221 205.98.55
8 183 184.230.38
9 221 205.98.55
10 183 184.230.38
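A self-contained sketch of str.split on two of the IP strings above, showing how n and expand change the shape of the result. Passing n and expand by keyword is an assumption made here so the call works on both older and newer pandas versions.

```python
import pandas as pd

ip = pd.Series(["221.205.98.55", "183.184.226.205"])

parts = ip.str.split(".", n=1, expand=True)   # split once from the left: 2 columns
parts.columns = ["IP1", "IP2-4"]
all_parts = ip.str.split(".", expand=True)    # split at every ".": 4 columns
as_lists = ip.str.split(".")                  # expand=False: a Series of lists
print(parts)
```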

3. Record extraction: selecting rows according to a condition.

dataframe[condition]
condition: the filter condition
Returns: DataFrame

Common condition types:
Comparison: <, >, >=, <=, !=, e.g. df[df.comments>10000];
Range: between(left,right), e.g. df[df.comments.between(1000,10000)];
Null check: pandas.isnull(column), e.g. df[df.title.isnull()];
String matching: str.contains(pattern,na=False), e.g. df[df.title.str.contains('电台',na=False)];
Logical operators: & (and), | (or), ~ (not); e.g. df[(df.comments>=1000)&(df.comments<=10000)] is equivalent to df[df.comments.between(1000,10000)].

[Original data] Same as above

[Example 4-16] Selecting rows by condition.

import pandas
from pandas import read_excel
df = read_excel('e://rz2.xlsx')
df[df.TCSJ==13322252452]

Out[2]:
YHM TCSJ YWXT IP
7 S1402048 13322252452 1.225790e+17 221.205.98.55
9 S1402048 13322252452 1.225790e+17 221.205.98.55

df[df.TCSJ>13500000000]

Out[3]:
YHM TCSJ YWXT IP DLSJ
0 S1402048 18922254812 1.225790e+17 221.205.98.55 2014-11-04 08:44:46
1 S1411023 13522255003 1.225790e+17 183.184.226.205 2014-11-04 08:45:06
3 20031509 18822256753 NaN 222.31.51.200 2014-11-04 08:47:41
4 S1405010 18922253721 1.225790e+17 120.207.64.3 2014-11-04 08:49:03
6 S1404095 13822254373 1.225790e+17 222.31.59.220 2014-11-04 08:50:02
8 S1405011 18922257681 1.225790e+17 183.184.230.38 2014-11-04 08:14:55
10 S1405011 18922257681 1.225790e+17 183.184.230.38 2014-11-04 08:14:55

df[df.TCSJ.between(13400000000,13999999999)]

Out[4]:
YHM TCSJ YWXT IP DLSJ
1 S1411023 13522255003 1.225790e+17 183.184.226.205 2014-11-04 08:45:06
2 S1402048 13422259938 NaN 221.205.98.55 2014-11-04 08:46:39
6 S1404095 13822254373 1.225790e+17 222.31.59.220 2014-11-04 08:50:02

df[df.YWXT.isnull()]

Out[5]:
YHM TCSJ YWXT IP DLSJ
2 S1402048 13422259938 NaN 221.205.98.55 2014-11-04 08:46:39
3 20031509 18822256753 NaN 222.31.51.200 2014-11-04 08:47:41

df[df.IP.str.contains('222.',na=False)]

Out[6]:
YHM TCSJ YWXT IP DLSJ
3 20031509 18822256753 NaN 222.31.51.200 2014-11-04 08:47:41
5 20140007 NaN 1.225790e+17 222.31.51.200 2014-11-04 08:50:06
6 S1404095 13822254373 1.225790e+17 222.31.59.220 2014-11-04 08:50:02
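The condition types listed above, as a runnable sketch on a four-row frame made up from the sample values. Note the regex=False argument: it is an addition here, because by default str.contains treats "." as a regular-expression wildcard.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "TCSJ": [18922254812, 13522255003, 13422259938, np.nan],
    "IP": ["221.205.98.55", "183.184.226.205", "222.31.51.200", None],
})

equal = df[df.TCSJ == 13522255003]                        # comparison
in_range = df[df.TCSJ.between(13400000000, 13999999999)]  # inclusive on both ends
missing = df[df.TCSJ.isnull()]                            # null check
not_missing = df[~df.TCSJ.isnull()]                       # ~ negates a condition
# regex=False matches the text "222." literally.
has_222 = df[df.IP.str.contains("222.", regex=False, na=False)]
combined = df[(df.TCSJ > 13500000000) & (df.TCSJ < 18000000000)]
print(list(in_range.index))
```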

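4. Random sampling: drawing a given number or proportion of rows at random.

Sampling function: numpy.random.randint(start,end,num)
start: start of the range (inclusive); end: end of the range (exclusive); num: number of draws
Returns: an array of row index values (drawn with replacement, so a value can repeat)

[Original data] Same as above

[Example 4-17] Sampling rows at random.

import numpy
import pandas
from pandas import read_excel
df = read_excel('e://rz2.xlsx')
df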

Out[1]:
YHM TCSJ YWXT IP DLSJ
0 S1402048 18922254812 1.225790e+17 221.205.98.55 2014-11-4 8:44
1 S1411023 13522255003 1.225790e+17 183.184.226.205 2014-11-4 8:45
2 S1402048 13422259938 NaN 221.205.98.55 2014-11-4 8:46
3 20031509 18822256753 NaN 222.31.51.200 2014-11-4 8:47
4 S1405010 18922253721 1.225790e+17 120.207.64.3 2014-11-4 8:49
5 20140007 NaN 1.225790e+17 222.31.51.200 2014-11-4 8:50
6 S1404095 13822254373 1.225790e+17 222.31.59.220 2014-11-4 8:50
7 S1402048 13322252452 1.225790e+17 221.205.98.55 2014-11-4 8:49
8 S1405011 18922257681 1.225790e+17 183.184.230.38 2014-11-4 8:14
9 S1402048 13322252452 1.225790e+17 221.205.98.55 2014-11-4 8:49
10 S1405011 18922257681 1.225790e+17 183.184.230.38 2014-11-4 8:14

r = numpy.random.randint(0,10,3)
r

Out[2]: array([8, 2, 9])

df.loc[r,:]   # select the rows whose labels are in r

Out[3]:
YHM TCSJ YWXT IP DLSJ
8 S1405011 18922257681 1.225790e+17 183.184.230.38 2014-11-04 08:14:55
2 S1402048 13422259938 NaN 221.205.98.55 2014-11-04 08:46:39
9 S1402048 13322252452 1.225790e+17 221.205.98.55 2014-11-04 08:49:18
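As an alternative sketch, pandas has its own DataFrame.sample method, which covers both the "number of rows" and the "proportion" case from the definition above and draws without replacement by default. The frame and the random_state value below are made up for illustration.

```python
import pandas as pd

df = pd.DataFrame({"YHM": [f"user{i}" for i in range(10)], "value": range(10)})

three_rows = df.sample(n=3, random_state=0)        # a fixed number of rows
thirty_pct = df.sample(frac=0.3, random_state=0)   # a fraction of the rows
print(three_rows)
```

Fixing random_state makes the draw repeatable; leave it out to get a different sample on each run.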

PS: selecting data by specified conditions:

1) Selecting by index label: df.loc[row labels, column labels]

df.loc['a':'b']      # rows from a to b, assuming a and b are row labels
df.loc[:,'TCSJ']     # the TCSJ column

The first argument of df.loc is the row label and the second is the column label (optional, all columns by default). Either argument can be a list or a single label; if both are lists the result is a DataFrame, otherwise it is usually a Series.

2) Selecting by position: df.iloc[row positions, column positions]   # iloc takes integer positions only, not labels (left-inclusive, right-exclusive)

df.iloc[1,1]         # the value in the second row, second column; returns a single value
df.iloc[[0,2],:]     # the first and third rows
df.iloc[0:2,:]       # from the first row up to, but not including, the third row
df.iloc[:,1]         # the second column of every record; returns a Series
df.iloc[1,:]         # the second row; returns a Series

Note: loc stands for location and iloc for integer location. Older pandas versions also offered a broader slicer, .ix, which decided from the type of the key whether to slice by position or by label; it was deprecated and then removed in pandas 1.0. In short: iloc indexes by integer position, loc indexes by label, and ix was a mix of the two.

Python numbers rows from 0 by default; call this the row position. People usually count the row at position 0 as row 1; call this the row number, which starts from 1. An index can also hold names such as 'one','two','three','four' or 'a','b','c','d'; call these labels. loc looks up labels, including integer labels, never positions: in the example below df2.loc[1] returns the row labelled 1, which sits at position 0. iloc looks up positions only, never labels or row numbers. ix accepted any of them.

import pandas as pd
index_loc = ['a','b']
index_iloc = [1,2]
data = [[1,2,3,4],[5,6,7,8]]
columns = ['one','two','three','four']
df1 = pd.DataFrame(data=data,index=index_loc,columns=columns)
df2 = pd.DataFrame(data=data,index=index_iloc,columns=columns)
print(df1.loc['a'])

one 1
two 2
three 3
four 4
Name: a, dtype: int64

print(df1.iloc['a'])   # iloc cannot index by string

Traceback (most recent call last):
TypeError: cannot do label indexing on <class ‘pandas.core.index.Index’> with these indexers [a] of <class 'str’>

print(df2.iloc[1])    # iloc indexes by row position
print(df2.loc[1])     # loc[1] indexes by the label 1, which sits at row position 0
print(df1.ix[0])      # .ix is only available in pandas before 1.0
print(df1.ix['a'])

Out[0]:
one 5
two 6
three 7
four 8
Name: 2, dtype: int64

Out[1]:
one 1
two 2
three 3
four 4
Name: 1, dtype: int64

Out[2]:
one 1
two 2
three 3
four 4
Name: a, dtype: int64

Out[3]:
one 1
two 2
three 3
four 4
Name: a, dtype: int64
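Because .ix is gone in current pandas, here is a sketch of the same four lookups using only loc and iloc, on the df1 and df2 frames defined above. The variable names are illustrative.

```python
import pandas as pd

data = [[1, 2, 3, 4], [5, 6, 7, 8]]
columns = ["one", "two", "three", "four"]
df1 = pd.DataFrame(data, index=["a", "b"], columns=columns)
df2 = pd.DataFrame(data, index=[1, 2], columns=columns)

by_label = df2.loc[1]       # label 1: the first row
by_position = df2.iloc[1]   # position 1: the second row, whose label is 2

# Replacements for df1.ix[0] and df1.ix['a']:
first_by_position = df1.iloc[0]
first_by_label = df1.loc["a"]
print(by_position.name, list(by_position))
```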

3) Slicing with a boolean condition: df[condition]

df[df.TCSJ >= 18822256753]                                   # a single condition
df[(df.TCSJ >= 13422259938) & (df.TCSJ < 13822254373)]       # several conditions combined

Slices obtained this way are always DataFrames.

df[df.TCSJ >= 18822256753]

Out[14]:
YHM TCSJ YWXT IP DLSJ
0 S1402048 18922254812 1.225790e+17 221.205.98.55 2014-11-04 08:44:46
3 20031509 18822256753 NaN 222.31.51.200 2014-11-04 08:47:41
4 S1405010 18922253721 1.225790e+17 120.207.64.3 2014-11-04 08:49:03
8 S1405011 18922257681 1.225790e+17 183.184.230.38 2014-11-04 08:14:55
10 S1405011 18922257681 1.225790e+17 183.184.230.38 2014-11-04 08:14:55

5. Dict data: three ways to turn dict data into a DataFrame.

import pandas
from pandas import DataFrame
# 1. The dict's keys and values each become a column
d1={'a':'[1,2,3]','b':'[0,1,2]'}
a1=pandas.DataFrame.from_dict(d1, orient='index')   # convert the dict to a DataFrame, with the keys as the index
a1.index.name = 'key'                               # name the index 'key'
b1=a1.reset_index()                                 # build a fresh index and turn the old one into the 'key' column
b1.columns=['key','value']                          # rename the columns to 'key' and 'value'
b1

Out[1]:
key value
0 b [0,1,2]
1 a [1,2,3]

# 2. Each dict entry becomes a column (equal lengths)
d2={'a':[1,2,3],'b':[4,5,6]}   # the dict values must all have the same length
a2= DataFrame(d2)
a2

Out[2]:
a b
0 1 4
1 2 5
2 3 6

# 3. Each dict entry becomes a column (different lengths)
d = {'one' : pandas.Series([1, 2, 3]),'two' : pandas.Series([1, 2, 3, 4])}   # the dict values may differ in length
df = pandas.DataFrame(d)
df

Out[3]:
one two
0 1.0 1
1 2.0 2
2 3.0 3
3 NaN 4

It can also be done like this:

import pandas
from pandas import Series
import numpy as np
from pandas import DataFrame
d = dict( A = np.array([1,2]), B = np.array([1,2,3,4]))
DataFrame(dict([(k,Series(v)) for k,v in d.items()]))

Out[4]:
A B
0 1.0 1
1 2.0 2
2 NaN 3
3 NaN 4

Or like this:

import numpy as np
import pandas as pd
my_dict = dict( A = np.array([1,2]), B = np.array([1,2,3,4]) )
df = pd.DataFrame.from_dict(my_dict,orient='index').T
df

Out[5]:
A B
0 1.0 1.0
1 2.0 2.0
2 NaN 3.0
3 NaN 4.0

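4.3.3 Ranking and indexing

Notes: axis, ascending, inplace, by

A DataFrame can be sorted in two ways: by its index or by its values.
Index sorting: sort_index(); value sorting: sort_values(); value ranking: rank()
Index sorting involves choosing between the row index and the column index, and between ascending and descending order. The call is df.sort_index(axis= , ascending= , inplace=), and these three parameters deserve attention: axis says whether rows or columns are operated on; ascending says whether the order is ascending or descending.
Value sorting involves the same choices. The call is df.sort_values(by= , axis= , ascending= , inplace=); it takes the same parameters plus by, which names the row or column to sort on.

PS: True and False must be capitalized.

axis=0 operates on rows, axis=1 operates on columns;
ascending=True sorts in ascending order, ascending=False in descending order;
inplace=True operates on the original DataFrame itself, so no assignment is needed; inplace=False works on a copy of the original DataFrame, so the result has to be assigned to a variable to be kept.

1. Sorting by index: df.sort_index()

The sort_index(ascending=True) method of a Series sorts by index; the ascending parameter selects ascending or descending order and defaults to ascending.

On a DataFrame, old pandas versions let .sort_index(axis=0, by=None, ascending=True) take an axis argument and a by argument, where by sorted on one or more columns (it could not be used on rows). The by argument of sort_index has since been removed from pandas; sort_values(by=...) does that job.

axis: 0 sorts by row labels; 1 sorts by column labels

from pandas import DataFrame
df0={'Ohio':[0,6,3],'Texas':[7,4,1],'California':[2,8,5]}
df=DataFrame(df0,index=['a','c','d'])
df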

Out[1]:
Ohio Texas California
a 0 7 2
c 6 4 8
d 3 1 5

df.sort_values(by='Ohio')   # written as df.sort_index(by='Ohio') in old pandas versions

Out[2]:
Ohio Texas California
a 0 7 2
d 3 1 5
c 6 4 8

df.sort_values(by=['California','Texas'])   # written as df.sort_index(by=[...]) in old pandas versions

Out[3]:
Ohio Texas California
a 0 7 2
d 3 1 5
c 6 4 8

df.sort_index(axis=1)   # axis: 0 sorts by row labels; 1 sorts by column labels

California Ohio Texas
a 2 0 7
c 8 6 4
d 5 3 1
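A compact sketch of index sorting in both directions and along both axes. The row labels are deliberately shuffled here (unlike the example above) so that the effect of sort_index on rows is visible.

```python
import pandas as pd

df = pd.DataFrame(
    {"Ohio": [0, 6, 3], "Texas": [7, 4, 1], "California": [2, 8, 5]},
    index=["c", "a", "d"],
)

rows_sorted = df.sort_index()                 # axis=0: sort by row labels
rows_desc = df.sort_index(ascending=False)    # descending row labels
cols_sorted = df.sort_index(axis=1)           # sort by column labels
by_value = df.sort_values(by="Ohio")          # sort rows by the values in Ohio
print(rows_sorted)
```

All four calls return new frames; df keeps its original order because inplace was left at False.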

Ranking (Series.rank(method='average', ascending=True)) differs from sorting in that it replaces the values of the object with their ranks (from 1 to n). Ties are handled through the method parameter, whose options include average, min, max, and first. For example:

from pandas import Series
ser=Series([3,2,0,3],index=list('abcd'))
ser

Out[8]:
a 3
b 2
c 0
d 3

ser.rank()
ser.rank(method='min')
ser.rank(method='max')
ser.rank(method='first')

ser.rank()
Out[9]:
a 3.5
b 2.0
c 1.0
d 3.5
dtype: float64

ser.rank(method='min')
Out[10]:
a 3.0
b 2.0
c 1.0
d 3.0
dtype: float64

ser.rank(method='max')
Out[11]:
a 4.0
b 2.0
c 1.0
d 4.0
dtype: float64

ser.rank(method='first')
Out[12]:
a 3.0
b 2.0
c 1.0
d 4.0
dtype: float64

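💡 Note how the tie between ser[0] and ser[3] gets a different rank under each method. The DataFrame version, .rank(axis=0, method='average', ascending=True), adds an axis parameter for ranking along rows or along columns; there does not seem to be a built-in method that ranks across all elements at once.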
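A few further variations as a hedged sketch: the dense method (also accepted by rank), descending ranks, and DataFrame ranking along each axis. The last step is only one possible workaround for ranking across all elements, done by flattening the values; the frame is made up.

```python
import pandas as pd

ser = pd.Series([3, 2, 0, 3], index=list("abcd"))
dense = ser.rank(method="dense")                       # ties share a rank, no gaps after them
descending = ser.rank(method="min", ascending=False)   # largest value gets rank 1

df = pd.DataFrame({"x": [3, 1, 2], "y": [10, 30, 20]})
per_column = df.rank()        # axis=0: rank within each column
per_row = df.rank(axis=1)     # rank within each row

# Workaround for an overall ranking: flatten, rank, then restore the shape.
flat_ranks = pd.Series(df.to_numpy().ravel()).rank().to_numpy()
overall = pd.DataFrame(flat_ranks.reshape(df.shape), index=df.index, columns=df.columns)
print(overall)
```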

2. Reindexing: .reindex(index=None,**kwargs)

A Series is reindexed through its .reindex(index=None,**kwargs) method. Two keyword arguments are commonly used: method=None and fill_value=np.nan.

ser = Series([4.5,7.2,-5.3,3.6],index=['d','b','a','c'])
A = ['a','b','c','d','e']
ser.reindex(A)

Out[13]:
a -5.3
b 7.2
c 3.6
d 4.5
e NaN
dtype: float64

ser.reindex(A, fill_value=0)
ser.sort_index().reindex(A, method='ffill')  # forward fill needs a monotonic index, so sort first
ser.sort_index().reindex(A, fill_value=0, method='ffill')

Out[15]:
a -5.3
b 7.2
c 3.6
d 4.5
e 0.0
dtype: float64

a -5.3
b 7.2
c 3.6
d 4.5
e 4.5
dtype: float64

.reindex() returns a new object whose index follows the given argument exactly.
The method parameter, one of {'backfill', 'bfill', 'pad', 'ffill', None}, selects the fill strategy for labels that were not present before. When it is omitted, missing entries are filled with fill_value, which defaults to NaN (ffill = pad fills forward from the previous value; bfill = backfill fills backward from the next value).
For a DataFrame, .reindex(index=None, columns=None, **kwargs) only adds an optional columns parameter for reindexing the columns. Usage is similar to the Series example above, except that the method fill only applies along the rows, that is, axis=0.

>>> state = ['Texas', 'Utha', 'California']
>>> df.reindex(columns=state, method='ffill')
Texas Utha California
a 1 NaN 2
c 4 NaN 5
d 7 NaN 8

[3 rows x 3 columns]
>>> df.reindex(index=['a', 'b', 'c', 'd'], columns=state, method='ffill')
Texas Utha California
a 1 NaN 2
b 1 NaN 2
c 4 NaN 5
d 7 NaN 8

[4 rows x 3 columns]

Can filling along the columns be achieved with df.T.reindex(index, method='**').T?

Yes, it can.

Also note that when calling reindex(index, method='**'), the index must be monotonic; otherwise a ValueError: Must be monotonic for forward fill is raised. For instance, the last call in the example above would fail with index=['a', 'b', 'd', 'c'].
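The three fill behaviours described above can be sketched on the same Series. This is a minimal example assuming a recent pandas release; the variable names are mine, and the explicit sort_index() call is there because forward filling expects a monotonic index.

```python
import pandas as pd

ser = pd.Series([4.5, 7.2, -5.3, 3.6], index=["d", "b", "a", "c"])
labels = ["a", "b", "c", "d", "e"]

plain = ser.reindex(labels)                  # new label "e" becomes NaN
zero_filled = ser.reindex(labels, fill_value=0)

# Sort first so the existing index is monotonic, then fill "e" from "d"
forward_filled = ser.sort_index().reindex(labels, method="ffill")
```

In each case ser itself is left unchanged, since reindex returns a new object.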

3. Sorting by values: df.sort_values()

1. Sort by one column in ascending order (the practically useful case)

import pandas as pd
df = pd.DataFrame({"A": [3, 1, 5, 9, 7], "B": [4, 1, 2, 5, 3], "C": [3, 15, 9, 6, 12], "D": [2, 4, 6, 10, 8]}, index=list("acbed"))
display(df)

A B C D
a 3 4 3 2
c 1 1 15 4
b 5 2 9 6
e 9 5 6 10
d 7 3 12 8

df.sort_values(by="A", axis=0, ascending=True, inplace=True)
df

Out[133]:
A B C D
c 1 1 15 4
a 3 4 3 2
b 5 2 9 6
d 7 3 12 8
e 9 5 6 10

2. Sort by one row in descending order (rarely useful in practice)

import pandas as pd
df = pd.DataFrame({"A": [3, 1, 5, 9, 7], "B": [4, 1, 2, 5, 3], "C": [3, 15, 9, 6, 12], "D": [2, 4, 6, 10, 8]}, index=list("acbed"))
display(df)

A B C D
a 3 4 3 2
c 1 1 15 4
b 5 2 9 6
e 9 5 6 10
d 7 3 12 8

df.sort_values(by="a", axis=1, ascending=False, inplace=True)  # reorder the columns by the values in row "a"
df

Out[140]:
B A C D
a 4 3 3 2
c 1 1 15 4
b 2 5 9 6
e 5 9 6 10
d 3 7 12 8

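4. The na_position parameter of sort_values()

na_position sets where missing values are placed: 'first' puts them at the top; 'last' (the default) puts them at the bottom.

import numpy as np
import pandas as pd
df = pd.DataFrame({"A": [10, 8, np.nan, 2, 4], "B": [1, 7, 5, 3, 8], "C": [5, 2, 8, 4, 1]}, index=list("abcde"))
df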

Out[141]:
A B C
a 10.0 1 5
b 8.0 7 2
c NaN 5 8
d 2.0 3 4
e 4.0 8 1

df.sort_values(by="A", axis=0, inplace=True, na_position="first")
df

Out[142]:
A B C
c NaN 5 8
d 2.0 3 4
e 4.0 8 1
b 8.0 7 2
a 10.0 1 5

df.sort_values(by="A", axis=0, inplace=True, na_position="last")
df

Out[143]:
A B C
d 2.0 3 4
e 4.0 8 1
b 8.0 7 2
a 10.0 1 5
c NaN 5 8

5. Ranking by value: the rank() function

1. Common parameters of rank()

Reference: "DataFrame (13): Sorting and ranking with DataFrames", CSDN blog.

4.3.4 Merging data

1. Record merging: combining two DataFrames with the same structure into one, that is, appending the records of one DataFrame to the other. pd.concat([df1, df2])

concat([dataFrame1, dataFrame2, ...])
dataFrame1: a DataFrame
Returns: DataFrame

import pandas
from pandas import DataFrame
from pandas import read_excel
df1 = read_excel('E:\\Python\\第4章數據\\rz2.xlsx')
df1
df2 = read_excel('E:\\Python\\第4章數據\\rz3.xlsx')
df2

Out[1]:
YHM TCSJ YWXT IP
0 S1402048 18922254812 1.225790e+17 221.205.98.55
1 S1411023 13522255003 1.225790e+17 183.184.226.205
2 S1402048 13422259938 NaN 221.205.98.55
3 20031509 18822256753 NaN 222.31.51.200
4 S1405010 18922253721 1.225790e+17 120.207.64.3
5 20140007 13422259313 1.225790e+17 222.31.51.200
6 S1404095 13822254373 1.225790e+17 222.31.59.220
7 S1402048 13322252452 1.225790e+17 221.205.98.55
8 S1405011 18922257681 1.225790e+17 183.184.230.38
9 S1402048 13322252452 1.225790e+17 221.205.98.55
10 S1405011 18922257681 1.225790e+17 183.184.230.38

[11 rows x 5 columns]

Out[2]:
YHM TCSJ YWXT IP
0 S1402011 18603514812 1.225790e+17 221.205.98.55
1 S1411022 13103515003 1.225790e+17 183.184.226.205
2 S1402033 13203559930 NaN 221.205.98.55

[3 rows x 5 columns]

Merged result: the records of both files are now combined, with one set of rows stacked after the other. Each row keeps its original index label (0 to 10, then 0 to 2).

df = pandas.concat([df1, df2])
df

Out[3]:
YHM TCSJ YWXT IP
0 S1402048 18922254812 1.225790e+17 221.205.98.55
1 S1411023 13522255003 1.225790e+17 183.184.226.205
2 S1402048 13422259938 NaN 221.205.98.55
3 20031509 18822256753 NaN 222.31.51.200
4 S1405010 18922253721 1.225790e+17 120.207.64.3
5 20140007 13422259313 1.225790e+17 222.31.51.200
6 S1404095 13822254373 1.225790e+17 222.31.59.220
7 S1402048 13322252452 1.225790e+17 221.205.98.55
8 S1405011 18922257681 1.225790e+17 183.184.230.38
9 S1402048 13322252452 1.225790e+17 221.205.98.55
10 S1405011 18922257681 1.225790e+17 183.184.230.38
0 S1402011 18603514812 1.225790e+17 221.205.98.55
1 S1411022 13103515003 1.225790e+17 183.184.226.205
2 S1402033 13203559930 NaN 221.205.98.55

[14 rows x 5 columns]
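The concatenated result above repeats the index labels 0, 1, 2. If a fresh running index is preferred, concat accepts ignore_index=True. A minimal sketch with two small stand-in frames (the values are taken from the tables above, the frame sizes are shortened for illustration):

```python
import pandas as pd

df1 = pd.DataFrame({"YHM": ["S1402048", "S1411023"],
                    "IP": ["221.205.98.55", "183.184.226.205"]})
df2 = pd.DataFrame({"YHM": ["S1402011"], "IP": ["221.205.98.55"]})

stacked = pd.concat([df1, df2])                         # keeps labels 0, 1, 0
renumbered = pd.concat([df1, df2], ignore_index=True)   # relabels rows 0, 1, 2
```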

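2. Field merging: combining different columns of the same DataFrame into a new column. X = x1 + x2 + ...

X = x1 + x2 + ...
x1: data column 1
x2: data column 2
Returns: a Series holding the combined column; the columns being combined must have the same length.

import pandas
from pandas import DataFrame
from pandas import read_csv
df = read_csv('e://rz4.csv', sep=" ", names=['band', 'area', 'num'])
df
df = df.astype(str)  # convert to strings so that + concatenates instead of adding
tel = df['band'] + df['area'] + df['num']
tel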

Out[1]:
band area num
0 189 2225 4812
1 135 2225 5003
2 134 2225 9938
3 188 2225 6753
4 189 2225 3721
5 134 2225 9313
6 138 2225 4373
7 133 2225 2452
8 189 2225 7681

Out[2]:
0 18922254812
1 13522255003
2 13422259938
3 18822256753
4 18922253721
5 13422259313
6 13822254373
7 13322252452
8 18922257681
dtype: object

3. Field matching: combining DataFrames with different structures (two or more) according to a matching condition, that is, appending columns. merge(x, y, left_on, right_on) joins on key columns.

merge(x, y, left_on, right_on)
x: the first DataFrame
y: the second DataFrame
left_on: the column of the first DataFrame used for matching
right_on: the column of the second DataFrame used for matching
Returns: DataFrame

import pandas
from pandas import DataFrame
from pandas import read_excel
df1 = read_excel('e://rz2.xlsx', sheet_name='Sheet3')  # older pandas spelled this sheetname
df1

Out[1]:
id band num
0 1 130 123
1 2 131 124
2 4 133 125
3 5 134 126

df2 = read_excel('e://rz2.xlsx', sheet_name='Sheet4')  # older pandas spelled this sheetname
df2

Out[2]:
id band area
0 1 130 351
1 2 131 352
2 3 132 353
3 4 133 354
4 5 134 355
5 5 135 356

pandas.merge(df1,df2,left_on='id',right_on='id')

Out[3]:
id band_x num band_y area
0 1 130 123 130 351
1 2 131 124 131 352
2 4 133 125 133 354
3 5 134 126 134 355
4 5 134 126 135 356
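The merge above is an inner join, so id 3 (present only in df2) is dropped. A short sketch of two related options, how and suffixes, using frames rebuilt from the tables above; the suffix names are illustrative choices of mine:

```python
import pandas as pd

df1 = pd.DataFrame({"id": [1, 2, 4, 5],
                    "band": [130, 131, 133, 134],
                    "num": [123, 124, 125, 126]})
df2 = pd.DataFrame({"id": [1, 2, 3, 4, 5, 5],
                    "band": [130, 131, 132, 133, 134, 135],
                    "area": [351, 352, 353, 354, 355, 356]})

# Inner join (the default) with readable suffixes for the clashing "band" column
inner = pd.merge(df1, df2, on="id", suffixes=("_left", "_right"))

# Outer join keeps unmatched rows; indicator=True records where each row came from
outer = pd.merge(df1, df2, on="id", how="outer", indicator=True)
```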

4.3.5 Computing with data

1. Simple computation: applying addition, subtraction, multiplication, or division to the fields and storing the result as a new field.

Before:

id	num	price
1	123	159
2	124	753
3	125	456
4	126	852

After (result = num * price):

id	num	price	result
1	123	159	19557
2	124	753	93372
3	125	456	57000
4	126	852	107352

from pandas import read_csv
df = read_csv('e://rz2.csv', sep=',')
df

Out[1]:
id band num price
0 1 130 123 159
1 2 131 124 753
2 3 132 125 456
3 4 133 126 852

result = df.price * df.num
result

Out[2]:
0 19557
1 93372
2 57000
3 107352
dtype: int64

df['result'] = result
df

Out[3]:
id band num price result
0 1 130 123 159 19557
1 2 131 124 753 93372
2 3 132 125 456 57000
3 4 133 126 852 107352

2. Normalization: rescaling data proportionally so that it falls into a specific interval; min-max (0-1) normalization is the usual choice. X* = (x - min) / (max - min)

from pandas import read_csv
df = read_csv('e://rz2.csv', sep=',')
df

Out[1]:
id band num price
0 1 130 123 159
1 2 131 124 753
2 3 132 125 456
3 4 133 126 852

scale = (df.price - df.price.min()) / (df.price.max() - df.price.min())
scale

Out[2]:
0 0.000000
1 0.857143
2 0.428571
3 1.000000
Name: price, dtype: float64
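The same formula can be wrapped in a small helper and applied to several columns at once. This is a sketch only: min_max_scale is a hypothetical name of mine, and returning zeros for a constant column is an assumption made to avoid dividing by zero.

```python
import pandas as pd

def min_max_scale(column: pd.Series) -> pd.Series:
    """Rescale a numeric column to [0, 1] with (x - min) / (max - min)."""
    span = column.max() - column.min()
    if span == 0:
        # Constant column: every value maps to 0.0 instead of dividing by zero
        return pd.Series(0.0, index=column.index, name=column.name)
    return (column - column.min()) / span

df = pd.DataFrame({"num": [123, 124, 125, 126], "price": [159, 753, 456, 852]})
scaled = df.apply(min_max_scale)   # applied column by column
```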

4.3.6 Grouping data

import numpy as np
import pandas as pd
bins = [0, 180, 210, 240, 270, np.inf]  # np.inf is positive infinity
labels = ["差", "及格", "中", "良", "優"]  # poor, pass, fair, good, excellent
df3["等級"] = pd.cut(df3.總分, bins, right=False, labels=labels)

Signature: pd.cut(series, bins, right=True, labels=None)

Data grouping: based on the characteristics of the objects being analyzed, the data is divided into intervals according to some indicator so that internal relationships and patterns can be studied. Put simply, a new column is added that assigns each original value to a category.

cut(series, bins, right=True, labels=None)
series: the data to be grouped
bins: the bin edges used for grouping
right: whether each interval is closed on the right
labels: custom labels for the groups (optional)

bins

import pandas
# from pandas import DataFrame
from pandas import read_csv
df = read_csv('e://rz2.csv', sep=',')
df

Out[1]:
id band num price
0 1 130 123 159
1 2 131 124 753
2 3 132 125 456
3 4 133 126 852

bins = [min(df.price) - 1, 500, max(df.price) + 1]
labels = ["500以下", "500以上"]  # "below 500", "500 and above"
pandas.cut(df.price, bins)

Out[2]:
0 (158, 500]
1 (500, 853]
2 (158, 500]
3 (500, 853]
Name: price, dtype: category
Categories (2, object): [(158, 500] < (500, 853]]

pandas.cut(df.price,bins,right=False)

Out[3]:
0 [158, 500)
1 [500, 853)
2 [158, 500)
3 [500, 853)
Name: price, dtype: category
Categories (2, object): [[158, 500) < [500, 853)]
right: whether each interval is closed on the right.

labels

pa = pandas.cut(df.price, bins, right=False, labels=labels)
pa

Out[5]:
0 500以下
1 500以上
2 500以下
3 500以上
Name: price, dtype: category
Categories (2, object): [500以下 < 500以上]

df['label'] = pandas.cut(df.price, bins, right=False, labels=labels)
df

Out[6]:
id band num price label
0 1 130 123 159 500以下
1 2 131 124 753 500以上
2 3 132 125 456 500以下
3 4 133 126 852 500以上
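The effect of right is easiest to see on a value that sits exactly on a bin edge. A minimal sketch with made-up prices and English labels (both are illustrative, not from the notes):

```python
import pandas as pd

price = pd.Series([159, 500, 753])
bins = [0, 500, 1000]

right_closed = pd.cut(price, bins)                # bins (0, 500] and (500, 1000]
left_closed = pd.cut(price, bins, right=False)    # bins [0, 500) and [500, 1000)
labelled = pd.cut(price, bins, right=False,
                  labels=["below 500", "500 and above"])
```

With right=True the value 500 falls in the lower bin; with right=False it moves to the upper one.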

4.3.7 Date handling

1. Date conversion: converting dates stored as strings into datetime values.

to_datetime(dateString, format)
Format codes:
%Y: year
%m: month
%d: day
%H: hour
%M: minute
%S: second

Example 4-21: to_datetime(df.注冊時間, format='%Y/%m/%d').

from pandas import read_csv
from pandas import to_datetime
df = read_csv('e://rz3.csv', sep=',', encoding='utf8')
df

Out[1]:
num price year month date
0 123 159 2016 1 2016/6/1
1 124 753 2016 2 2016/6/2
2 125 456 2016 3 2016/6/3
3 126 852 2016 4 2016/6/4
4 127 210 2016 5 2016/6/5
5 115 299 2016 6 2016/6/6
6 102 699 2016 7 2016/6/7
7 201 599 2016 8 2016/6/8
8 154 199 2016 9 2016/6/9
9 142 899 2016 10 2016/6/10

df_dt = to_datetime(df.date, format="%Y/%m/%d")
df_dt

Out[2]:
0 2016-06-01
1 2016-06-02
2 2016-06-03
3 2016-06-04
4 2016-06-05
5 2016-06-06
6 2016-06-07
7 2016-06-08
8 2016-06-09
9 2016-06-10
Name: date, dtype: datetime64[ns]

Make sure the CSV file is UTF-8 encoded; otherwise reading it raises an error. Also, the date column in the CSV is stored as text (strings).

2. Date formatting: converting datetime values into strings in a given format.

apply(lambda x: processing logic)
datetime.strftime(x, format)

Example 4-22: converting datetime values to strings.

from pandas import read_csv
from pandas import to_datetime
from datetime import datetime
df = read_csv('e://rz3.csv', sep=',', encoding='utf8')
df_dt = to_datetime(df.date, format="%Y/%m/%d")
df_dt_str = df_dt.apply(lambda x: datetime.strftime(x, "%Y/%m/%d"))  # see the note on apply below
df_dt_str

Out[1]:
0 2016/06/01
1 2016/06/02
2 2016/06/03
3 2016/06/04
4 2016/06/05
5 2016/06/06
6 2016/06/07
7 2016/06/08
8 2016/06/09
9 2016/06/10
Name: date, dtype: object
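As an alternative to apply with datetime.strftime, a datetime Series also offers a vectorized .dt.strftime. A minimal sketch with two sample dates in the same format as the CSV above:

```python
import pandas as pd

dates = pd.Series(["2016/6/1", "2016/6/10"])

parsed = pd.to_datetime(dates, format="%Y/%m/%d")   # strings -> datetime64
formatted = parsed.dt.strftime("%Y/%m/%d")          # datetime64 -> zero-padded strings
```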

Note: to apply a function f along the rows or columns of a DataFrame, use .apply(f, axis=0, args=(), **kwds). With axis=0 the function is applied to each column; with axis=1 it is applied to each row. For example:

from pandas import DataFrame
df = DataFrame({'ohio': [1, 3, 6], 'texas': [1, 4, 5], 'california': [2, 5, 8]}, index=['a', 'c', 'd'])
df

Out[1]:
california ohio texas
a 2 1 1
c 5 3 4
d 8 6 5

f = lambda x: x.max() - x.min()
df.apply(f)  # applied per column by default, same as df.apply(f, axis=0)

Out[2]:
california 6
ohio 5
texas 4
dtype: int64

df.apply(f, axis=1)  # applied per row

Out[3]:
a 1
c 2
d 3
dtype: int64

3. Date extraction: pulling individual components out of datetime values.

Data_dt.dt.property

second	0-59, seconds
minute	0-59, minutes
hour	0-23, hours
day	1-31, day of the month
month	1-12, month of the year
year	the year
weekday	0-6, day of the week, with Monday as 0

Example 4-23: extracting date components.

from pandas import read_csv
from pandas import to_datetime
df = read_csv('e://rz3.csv', sep=',', encoding='utf8')
df

df_dt = to_datetime(df.date, format='%Y/%m/%d')
df_dt

Out[2]:
0 2016-06-01
1 2016-06-02
2 2016-06-03
3 2016-06-04
4 2016-06-05
5 2016-06-06
6 2016-06-07
7 2016-06-08
8 2016-06-09
9 2016-06-10
Name: date, dtype: datetime64[ns]

df_dt.dt.year

Out[3]:
0 2016
1 2016
2 2016
3 2016
4 2016
5 2016
6 2016
7 2016
8 2016
9 2016
Name: date, dtype: int64

df_dt.dt.day

Out[4]:
0 1
1 2
2 3
3 4
4 5
5 6
6 7
7 8
8 9
9 10
Name: date, dtype: int64
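The value ranges of the .dt properties can be checked on two boundary timestamps. A minimal sketch; the timestamps are chosen by me so that the first is a Monday at midnight and the second a Sunday one second before midnight:

```python
import pandas as pd

stamps = pd.to_datetime(pd.Series(["2016-06-06 00:00:00", "2016-06-12 23:59:59"]))

parts = pd.DataFrame({
    "year": stamps.dt.year,
    "month": stamps.dt.month,
    "day": stamps.dt.day,
    "hour": stamps.dt.hour,        # 0 to 23
    "minute": stamps.dt.minute,    # 0 to 59
    "second": stamps.dt.second,    # 0 to 59
    "weekday": stamps.dt.weekday,  # Monday is 0, Sunday is 6
})
```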

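4. Date comparison:

Reference: "Comparing times in Python and extracting the rows of a DataFrame that satisfy a time condition", siml142857's blog on CSDN.

5. Date arithmetic:

Reference: "Python: subtracting one day from the current time", 菜鳥筆記.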
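Since these two items only point to external posts, here is a small sketch of what date comparison and date arithmetic can look like in pandas. It is my own illustration, not taken from the linked posts; the cutoff date and column names are assumptions.

```python
import pandas as pd

df = pd.DataFrame({
    "date": pd.to_datetime(["2016/6/1", "2016/6/5", "2016/6/10"], format="%Y/%m/%d"),
    "price": [159, 210, 899],
})

# Comparison: keep the rows on or after a cutoff date
cutoff = pd.Timestamp("2016-06-05")
on_or_after = df[df["date"] >= cutoff]

# Arithmetic: shift every date by a fixed duration
day_before = df["date"] - pd.Timedelta(days=1)
week_later = df["date"] + pd.Timedelta(weeks=1)
```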

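4. Data analysis

4.4.1 Basic statistics: describe

If the grouping field is discrete, use groupby directly to compute grouped statistics.

Basic statistical analysis, also called descriptive statistics, typically reports a variable's minimum, first quartile, median, third quartile, and maximum.
describe(): the descriptive statistics function

Common statistical functions:
size: count (an attribute, used without parentheses)
sum(): sum
mean(): mean
var(): variance
std(): standard deviation

Reference: "Summing the rows and columns of a table in Python", OlivierJ's blog on CSDN.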
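A quick sketch of describe() and the individual statistics listed above, using the four price values from the earlier rz2.csv example:

```python
import pandas as pd

price = pd.Series([159, 753, 456, 852], name="price")

summary = price.describe()   # count, mean, std, min, quartiles, max
n = price.size               # attribute, no parentheses
total = price.sum()
average = price.mean()
```

The entries of summary can be read by label, for example summary["50%"] for the median.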

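4.4.2 Group analysis: groupby (grouping on discrete values)

If the grouping field is continuous, discretize it first (cut) and then group with groupby.

Group analysis: dividing the objects of analysis into parts according to a grouping field, in order to compare the differences between the groups.

Common statistics: count, sum, mean
Common form: df.groupby(by=['category1', 'category2', ...])['column to aggregate'].agg({alias1: function1, alias2: function2, ...})
(pandas 1.0 and later no longer accept a renaming dict on a single column; pass keyword arguments instead, as in .agg(alias1='function1', alias2='function2').)
by: the columns to group on
[ ]: the column to aggregate
.agg: each alias names the resulting statistic; each function computes it
size: count
sum: sum
mean: mean

from pandas import read_excel
df = read_excel('e:\\rz4.xlsx')
df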

Out[1]:
學號 班級 姓名 性別 英語 體育 軍訓 數分 高代 解幾 計算機
0 2308024241 23080242 成龍 男 76 78 77 40 23 60 89
1 2308024244 23080242 周怡 女 66 91 75 47 47 44 82
2 2308024251 23080242 張波 男 85 81 75 45 45 60 80
3 2308024249 23080242 朱浩 男 65 50 80 72 62 71 82
4 2308024219 23080242 封印 女 73 88 92 61 47 46 83
5 2308024201 23080242 遲培 男 60 50 89 71 76 71 82
6 2308024347 23080243 李華 女 67 61 84 61 65 78 83
7 2308024307 23080243 陳田 男 76 79 86 69 40 69 82
8 2308024326 23080243 余皓 男 66 67 85 65 61 71 95
9 2308024320 23080243 李嘉 女 62 60 90 60 67 77 95
10 2308024342 23080243 李上初 男 76 90 84 60 66 60 82
11 2308024310 23080243 郭竇 女 79 67 84 64 64 79 85
12 2308024435 23080244 姜毅濤 男 77 71 87 61 73 76 82
13 2308024432 23080244 趙宇 男 74 74 88 68 70 71 85
14 2308024446 23080244 周路 女 76 80 77 61 74 80 85
15 2308024421 23080244 林建祥 男 72 72 81 63 90 75 85
16 2308024433 23080244 李大強 男 79 76 77 78 70 70 89
17 2308024428 23080244 李側通 男 64 96 91 69 60 77 83
18 2308024402 23080244 王慧 女 73 74 93 70 71 75 88
19 2308024422 23080244 李曉亮 男 85 60 85 72 72 83 89

df.groupby(by=['班級', '性別'])['軍訓'].agg(總分='sum', 人數='size', 平均值='mean', 方差='var', 標準差='std', 最高分='max', 最低分='min')  # pandas 1.0+ rejects the older agg({alias: function}) form on a single column

總分 人數 平均值 方差 標準差 最高分 最低分
班級 性別
23080242 女 167 2 83.500000 144.500000 12.020815 92 75
男 321 4 80.250000 38.250000 6.184658 89 75
23080243 女 258 3 86.000000 12.000000 3.464102 90 84
男 255 3 85.000000 1.000000 1.000000 86 84
23080244 女 170 2 85.000000 128.000000 11.313708 93 77
男 509 6 84.833333 25.766667 5.076088 91 77
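The grouped statistics can be reproduced on a small frame with keyword-style (named) aggregation, which recent pandas releases use in place of a renaming dict. The column and alias names below are English stand-ins chosen for illustration, and only five of the rows are included:

```python
import pandas as pd

scores = pd.DataFrame({
    "class_id": [23080242, 23080242, 23080242, 23080243, 23080243],
    "gender": ["F", "M", "F", "M", "M"],
    "training": [75, 77, 92, 86, 84],
})

# Each keyword is the output column name; each value names the statistic
stats = scores.groupby(["class_id", "gender"])["training"].agg(
    total="sum",
    count="size",
    average="mean",
    highest="max",
    lowest="min",
)
```

The result is indexed by (class_id, gender), so one group can be read with stats.loc[(23080242, "F")].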

4.4.3 Distribution analysis: cut + groupby (grouping on continuous values)

Group analysis: dividing the objects of analysis into parts according to a grouping field, in order to compare the differences between the groups. The common statistics and the general groupby(...)[...].agg(...) form are the same as in 4.4.2.

import numpy
import pandas
from pandas import read_excel
df = read_excel('e:\\rz4.xlsx')
df

Out[1]:
學號 班級 姓名 性別 英語 體育 軍訓 數分 高代 解幾 計算機 總分
0 2308024241 23080242 成龍 男 76 78 77 40 23 60 89 443
1 2308024244 23080242 周怡 女 66 91 75 47 47 44 82 452
2 2308024251 23080242 張波 男 85 81 75 45 45 60 80 471
3 2308024249 23080242 朱浩 男 65 50 80 72 62 71 82 482
4 2308024219 23080242 封印 女 73 88 92 61 47 46 83 490
5 2308024201 23080242 遲培 男 60 50 89 71 76 71 82 499
6 2308024347 23080243 李華 女 67 61 84 61 65 78 83 499
7 2308024307 23080243 陳田 男 76 79 86 69 40 69 82 501
8 2308024326 23080243 余皓 男 66 67 85 65 61 71 95 510
9 2308024320 23080243 李嘉 女 62 60 90 60 67 77 95 511
10 2308024342 23080243 李上初 男 76 90 84 60 66 60 82 518
11 2308024310 23080243 郭竇 女 79 67 84 64 64 79 85 522
12 2308024435 23080244 姜毅濤 男 77 71 87 61 73 76 82 527
13 2308024432 23080244 趙宇 男 74 74 88 68 70 71 85 530
14 2308024446 23080244 周路 女 76 80 77 61 74 80 85 533
15 2308024421 23080244 林建祥 男 72 72 81 63 90 75 85 538
16 2308024433 23080244 李大強 男 79 76 77 78 70 70 89 539
17 2308024428 23080244 李側通 男 64 96 91 69 60 77 83 540
18 2308024402 23080244 王慧 女 73 74 93 70 71 75 88 544
19 2308024422 23080244 李曉亮 男 85 60 85 72 72 83 89 546

labels = ['450及其以下', '450到500', '500及其以上']  # labels for the three bins: "450 and below", "450 to 500", "500 and above"
labels

Out[5]: ['450及其以下', '450到500', '500及其以上']

bins = [min(df.總分) - 1, 450, 500, max(df.總分) + 1]  # split the data into three bins
bins

Out[3]: [442, 450, 500, 547]

總分分層 = pandas.cut(df.總分, bins, labels=labels)
總分分層

Out[7]:
0 450及其以下
1 450到500
2 450到500
3 450到500
4 450到500
5 450到500
6 450到500
7 500及其以上
8 500及其以上
9 500及其以上
10 500及其以上
11 500及其以上
12 500及其以上
13 500及其以上
14 500及其以上
15 500及其以上
16 500及其以上
17 500及其以上
18 500及其以上
19 500及其以上
Name: 總分, dtype: category
Categories (3, object): [450及其以下 < 450到500 < 500及其以上]

df['總分分層'] = 總分分層
df

Out[8]:
學號 班級 姓名 性別 英語 體育 軍訓 數分 高代 解幾 計算機基礎 總分 總分分層
0 2308024241 23080242 成龍 男 76 78 77 40 23 60 89 443 450及其以下
1 2308024244 23080242 周怡 女 66 91 75 47 47 44 82 452 450到500
2 2308024251 23080242 張波 男 85 81 75 45 45 60 80 471 450到500
3 2308024249 23080242 朱浩 男 65 50 80 72 62 71 82 482 450到500
4 2308024219 23080242 封印 女 73 88 92 61 47 46 83 490 450到500
5 2308024201 23080242 遲培 男 60 50 89 71 76 71 82 499 450到500
6 2308024347 23080243 李華 女 67 61 84 61 65 78 83 499 450到500
7 2308024307 23080243 陳田 男 76 79 86 69 40 69 82 501 500及其以上
8 2308024326 23080243 余皓 男 66 67 85 65 61 71 95 510 500及其以上
9 2308024320 23080243 李嘉 女 62 60 90 60 67 77 95 511 500及其以上
10 2308024342 23080243 李上初 男 76 90 84 60 66 60 82 518 500及其以上
11 2308024310 23080243 郭竇 女 79 67 84 64 64 79 85 522 500及其以上
12 2308024435 23080244 姜毅濤 男 77 71 87 61 73 76 82 527 500及其以上
13 2308024432 23080244 趙宇 男 74 74 88 68 70 71 85 530 500及其以上
14 2308024446 23080244 周路 女 76 80 77 61 74 80 85 533 500及其以上
15 2308024421 23080244 林建祥 男 72 72 81 63 90 75 85 538 500及其以上
16 2308024433 23080244 李大強 男 79 76 77 78 70 70 89 539 500及其以上
17 2308024428 23080244 李側通 男 64 96 91 69 60 77 83 540 500及其以上
18 2308024402 23080244 王慧 女 73 74 93 70 71 75 88 544 500及其以上
19 2308024422 23080244 李曉亮 男 85 60 85 72 72 83 89 546 500及其以上

df.groupby(by=['總分分層'])['總分'].agg(人數='size')  # pandas 1.0+ rejects the older agg({'人數': numpy.size}) form

Out[9]:
人數
總分分層
450及其以下 1
450到500 6
500及其以上 13
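The whole cut-then-groupby pipeline fits in a few lines. The sketch below uses five of the totals from the table and English labels; passing observed=False is my choice, so that empty bins would still be listed.

```python
import pandas as pd

totals = pd.Series([443, 452, 499, 501, 546], name="total")

bins = [totals.min() - 1, 450, 500, totals.max() + 1]   # three right-closed bins
labels = ["450 and below", "450 to 500", "500 and above"]

band = pd.cut(totals, bins, labels=labels)              # discretize the continuous field
counts = totals.groupby(band, observed=False).size()    # count the members of each band
```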

4.4.4 Cross analysis: pivot_table

Cross analysis is typically used to analyze the relationship between two or more grouping variables, comparing them in a cross-tabulation. The usual combinations are: quantitative with quantitative groups; quantitative with qualitative groups; qualitative with qualitative groups.

pivot_table(values, index, columns, aggfunc, fill_value)
values: the values in the pivot table
index: the rows of the pivot table
columns: the columns of the pivot table
aggfunc: the aggregation function(s)
fill_value: a single replacement for NA values

import numpy
import pandas
from pandas import read_excel
from pandas import pivot_table  # can be omitted, since df.pivot_table is called as a method
df = read_excel('e:\\rz4.xlsx')
bins = [min(df.總分) - 1, 450, 500, max(df.總分) + 1]
labels = ['450及其以下', '450到500', '500及其以上']
總分分層 = pandas.cut(df.總分, bins, labels=labels)
df['總分分層'] = 總分分層
df.pivot_table(values=['總分'], index=['總分分層'], columns=['性別'], aggfunc=[numpy.size, numpy.mean])

Out[1]:
size mean
總分 總分
性別 女 男 女 男
總分分層
450及其以下 NaN 1 NaN 443.000000
450到500 3 3 480.333333 484.000000
500及其以上 4 9 527.500000 527.666667

df.pivot_table(values=['總分'], index=['總分分層'], columns=['性別'], aggfunc=[numpy.size, numpy.mean], fill_value=0)  # cells with no data can be filled with 0; the default is NaN

Out[2]:
size mean
總分 總分
性別 女 男 女 男
總分分層
450及其以下 0 1 0.000000 443.000000
450到500 3 3 480.333333 484.000000
500及其以上 4 9 527.500000 527.666667
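A compact sketch of the same idea on a six-row frame, with one pivot table per statistic to keep the column layout simple. The band and gender codes are English stand-ins of mine:

```python
import pandas as pd

scores = pd.DataFrame({
    "band": ["low", "mid", "mid", "high", "high", "high"],
    "gender": ["M", "F", "M", "F", "M", "M"],
    "total": [443, 452, 471, 522, 527, 530],
})

# Count per (band, gender) cell; empty cells become 0 instead of NaN
counts = scores.pivot_table(values="total", index="band", columns="gender",
                            aggfunc="count", fill_value=0)

# Mean per cell; empty cells stay NaN because no fill_value is given
means = scores.pivot_table(values="total", index="band", columns="gender",
                           aggfunc="mean")
```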

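4.4.5 Structure analysis: pivot_table + sum + div (computing shares)

Structure analysis builds on grouping: it computes the share that each component contributes and uses those shares to describe the internal makeup of the whole.

axis parameter: for sum, axis=0 adds down each column (one total per column) and axis=1 adds across each row (one total per row).

# Suppose we want each class's combined total score
import numpy
import pandas
from pandas import read_excel
from pandas import pivot_table  # can be omitted, since df.pivot_table is called as a method
df = read_excel('e:\\rz4.xlsx')
df_pt = df.pivot_table(values=['總分'], index=['班級'], columns=['性別'], aggfunc=[numpy.sum])
df_pt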

Out[1]:
sum
總分
性別 女 男
班級
23080242 942 1895
23080243 1532 1529
23080244 1077 3220

df_pt.sum()

Out[3]:
性別
sum 總分 女 3551
男 6644
dtype: int64

df_pt.div(df_pt.sum(axis=1), axis=0)  # each cell as a share of its row (class) total; every row sums to 1

Out[5]:
sum
總分
性別 女 男
班級
23080242 0.332041 0.667959
23080243 0.500490 0.499510
23080244 0.250640 0.749360

df_pt.sum(axis=1)  # one total per row (class)

Out[2]:
班級
23080242 2837
23080243 3061
23080244 4297
dtype: int64

df_pt.div(df_pt.sum(axis=0), axis=1)  # each cell as a share of its column (gender) total; every column sums to 1

Out[6]:
sum
總分
性別 女 男
班級
23080242 0.265277 0.285220
23080243 0.431428 0.230132
23080244 0.303295 0.484648

df_pt.sum(axis=0)  # same as df_pt.sum(): one total per column (gender)

Out[4]:
性別
sum 總分 女 3551
男 6644
dtype: int64
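The two share calculations can be checked on the class totals from the table above. A minimal sketch; the column names F and M are English stand-ins for the two gender columns:

```python
import pandas as pd

totals = pd.DataFrame(
    {"F": [942, 1532, 1077], "M": [1895, 1529, 3220]},
    index=[23080242, 23080243, 23080244],
)

# Divide each row by its own total: shares within a class
share_within_class = totals.div(totals.sum(axis=1), axis=0)

# Divide each column by its own total: shares within a gender
share_within_gender = totals.div(totals.sum(axis=0), axis=1)
```

The axis passed to div says which axis of the frame the divisor's labels should be matched against, which is why it is the opposite of the axis passed to sum.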

4.4.6 Correlation analysis: corr (one pair, or all pairs)

Correlation analysis studies whether a dependency exists between phenomena and, where it does, examines its direction and strength; it is a statistical method for studying the relationships between random variables.
Correlation coefficient: describes the relationship between quantitative variables.

|r| range	Degree of correlation
0 <= |r| < 0.3	weak
0.3 <= |r| < 0.8	moderate
0.8 <= |r| <= 1	strong

Correlation functions:

DataFrame.corr()
Series.corr(other)

Called on a DataFrame, corr computes the correlation between every pair of columns. Called on a Series, it computes only the correlation between that Series and the one passed in.

Return value:

DataFrame.corr() returns a DataFrame.
Series.corr(other) returns a single number, the correlation coefficient.

Example

# One pair
df3.英語.corr(df3.高代)
Out[9]: -0.12524513810989527
df3.解幾.corr(df3.高代)
Out[11]: 0.6132805268443008

# All pairs
df3.corr()  # in pandas 2.0+, pass numeric_only=True if the frame also has non-numeric columns
Out[13]:
學號 班級 英語 ... 解幾 計算機基礎 總分
學號 1.000000 0.982617 0.287492 ... 0.636150 0.211420 0.843040
班級 0.982617 1.000000 0.257248 ... 0.671301 0.251736 0.901960
英語 0.287492 0.257248 1.000000 ... 0.027452 -0.119039 0.167927
體育 0.130255 0.088482 0.244323 ... -0.526276 -0.266896 -0.067810
軍訓 0.124176 0.248652 -0.335015 ... 0.249299 0.148933 0.446614
數分 0.435493 0.517529 -0.129588 ... 0.544394 0.123399 0.732137
高代 0.602636 0.635006 -0.125245 ... 0.613281 0.096979 0.779466
解幾 0.636150 0.671301 0.027452 ... 1.000000 0.305934 0.705506
計算機基礎 0.211420 0.251736 -0.119039 ... 0.305934 1.000000 0.223004
總分 0.843040 0.901960 0.167927 ... 0.705506 0.223004 1.000000
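The pairwise and full-matrix calls, together with the strength table above, can be sketched as follows. The sample scores and the strength helper are illustrative assumptions of mine; the thresholds are the ones from the table.

```python
import pandas as pd

scores = pd.DataFrame({
    "algebra": [23, 47, 62, 76, 90],
    "geometry": [60, 44, 71, 71, 75],
    "name": list("abcde"),            # non-numeric column, skipped below
})

pair = scores["algebra"].corr(scores["geometry"])   # a single coefficient
matrix = scores.corr(numeric_only=True)             # every numeric pair

def strength(r: float) -> str:
    """Classify |r| using the thresholds from the table above."""
    size = abs(r)
    if size < 0.3:
        return "weak"
    if size < 0.8:
        return "moderate"
    return "strong"
```

For example, strength(0.6132805268443008), the 解幾/高代 coefficient shown above, falls in the moderate band.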

5. Data visualization

4.5.1 Pie chart: plt.pie(gb2.人數, labels=gb2.index, autopct='%.2f%%', colors=['b', 'pink', (0.5, 0.8, 0.3)], explode=[0, 0, 0, 0, 0.1])

Pie chart (pie graph): also called a circle chart, a circular statistical graphic divided into sectors that shows the proportion of each part to the whole at a glance.

pie(x, labels, colors, explode, autopct)
x: the sequence to plot
labels: the label for each wedge
colors: the color of each wedge, given as RGB values or color names
explode: the offset for each wedge; non-zero entries are pulled out
autopct: the display format for percentages; %.2f keeps two decimal places

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
df = pd.read_excel(r'E:\Python\第4章數據\rz4.xlsx')
df
gb = df.groupby(by=['班級'])['學號'].agg([('人數', np.size)])

plt.pie(gb.人數, labels=gb.index, autopct='%.2f%%', colors=['b', 'pink', (0.5, 0.8, 0.3)], explode=[0, 0.2, 0])

Exercise

plt.rcParams['font.sans-serif'] = ['SimHei']  # font; set before plotting so it takes effect
plt.rcParams['font.size'] = 30  # font size
plt.rcParams['figure.figsize'] = [6, 6]  # square figure, so the pie is a true circle
df2 = pd.read_excel(r'E:\Python\第4章數據\09電動1.xls')
df2['C語言程序設計'] = pd.cut(df2.C語言程序設計, bins=[0, 60, 70, 80, 90, 101], right=False, labels=['不及格', '及格', '中等', '良好', '優秀'])  # right=False makes each bin closed on the left and open on the right; labels: fail, pass, fair, good, excellent
gb2 = df2.groupby(by=['C語言程序設計'])['學號'].agg([('人數', np.size)]).fillna(0)  # fillna(0) replaces missing counts
plt.pie(gb2.人數, labels=gb2.index, autopct='%.2f%%', colors=['b', 'pink', (0.5, 0.8, 0.3)], explode=[0, 0, 0, 0, 0.1])

plt.rcParams holds display settings such as the font.
plt.rcParams
plt.rcParams['font.sans-serif']
Out[43]: ['DejaVu Sans', 'Bitstream Vera Sans', 'Computer Modern Sans Serif', 'Lucida Grande', 'Verdana', 'Geneva', 'Lucid', 'Arial', 'Helvetica', 'Avant Garde', 'sans-serif']
plt.rcParams['font.sans-serif'] = ['SimHei']
plt.rcParams['font.sans-serif'] = ['SimHei', '...', '...']  # fonts are tried in order; a missing one falls through to the next
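A self-contained version of the pie chart can be sketched without the Excel file. The class sizes below are counted from the 20-row table shown earlier; the Agg backend and the output file name are assumptions of mine so that the sketch also runs without a display.

```python
import matplotlib
matplotlib.use("Agg")  # off-screen backend; drop this line for interactive use
import matplotlib.pyplot as plt
import pandas as pd

# Students per class, counted from the table above
counts = pd.Series([6, 6, 8], index=["23080242", "23080243", "23080244"])

fig, ax = plt.subplots(figsize=(6, 6))   # square figure keeps the pie circular
wedges, texts, autotexts = ax.pie(
    counts,
    labels=counts.index,
    autopct="%.2f%%",                    # percentages with two decimals
    colors=["b", "pink", (0.5, 0.8, 0.3)],
    explode=[0, 0.2, 0],                 # pull the second wedge outward
)
percent_labels = [t.get_text() for t in autotexts]
fig.savefig("pie_sketch.png")            # hypothetical output file name
```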

4.5.2 Scatter plot: plt.plot(df.高代, df.數分, 'o', color='pink')

Scatter plot (scatter diagram): a chart that uses one variable as the horizontal coordinate and another as the vertical coordinate, and shows the relationship between them through the distribution of the points.

plot(x, y, '.', color=(r, g, b))
plt.xlabel('x-axis label')
plt.ylabel('y-axis label')
plt.grid(True)
x, y: the sequences for the X and Y axes
'.', 'o': small dots or large dots
color: the marker color, given as an RGB tuple or as a color name or letter
RGB colors: (red, green, blue)

df = pd.read_excel(r'E:\Python\第4章數據\rz4.xlsx')
df
gb = df.groupby(by=['班級'])['學號'].agg([('人數', np.size)])
plt.plot(df.英語, df.數分, '.', color='g')
plt.xlabel('英語')
plt.ylabel('數分')
plt.plot(df.高代, df.數分, 'o', color='pink')

plt.plot(df.高代, df.數分, '-', color='pink')  # joins the points with a line

4.5.3 Line chart: plt.plot(df.學號, df.總分, '-', color='r')

Format string	Meaning
-	solid line
--	dashed line
-.	dash-dot line
:	dotted line
.	point markers (scatter)
o	circle markers (scatter)
,	pixel markers (smaller points)
*	star markers
>	right-pointing triangle markers
<	left-pointing triangle markers
1 (2, 3, 4)	tri-down (tri-up, tri-left, tri-right) markers
s	square markers
p	pentagon markers
v	downward triangle markers
^	upward triangle markers
h	hexagon markers
d	thin diamond markers

df = pd.read_excel(r'E:\Python\第4章數據\rz4.xlsx')
df
plt.plot(df.學號, df.總分, '-', color='r')

學號 班級 姓名 性別 英語 體育 軍訓 數分 高代 解幾 計算機基礎 總分
0 2308024241 23080242 成龍 男 76 78 77 40 23 60 89 443
1 2308024244 23080242 周怡 女 66 91 75 47 47 44 82 452
2 2308024251 23080242 張波 男 85 81 75 45 45 60 80 471
3 2308024249 23080242 朱浩 男 65 50 80 72 62 71 82 482
4 2308024219 23080242 封印 女 73 88 92 61 47 46 83 490
5 2308024201 23080242 遲培 男 60 50 89 71 76 71 82 499
6 2308024347 23080243 李華 女 67 61 84 61 65 78 83 499
7 2308024307 23080243 陳田 男 76 79 86 69 40 69 82 501
8 2308024326 23080243 余皓 男 66 67 85 65 61 71 95 510
9 2308024320 23080243 李嘉 女 62 60 90 60 67 77 95 511
10 2308024342 23080243 李上初 男 76 90 84 60 66 60 82 518
11 2308024310 23080243 郭竇 女 79 67 84 64 64 79 85 522
12 2308024435 23080244 姜毅濤 男 77 71 87 61 73 76 82 527
13 2308024432 23080244 趙宇 男 74 74 88 68 70 71 85 530
14 2308024446 23080244 周路 女 76 80 77 61 74 80 85 533
15 2308024421 23080244 林建祥 男 72 72 81 63 90 75 85 538
16 2308024433 23080244 李大強 男 79 76 77 78 70 70 89 539
17 2308024428 23080244 李側通 男 64 96 91 69 60 77 83 540
18 2308024402 23080244 王慧 女 73 74 93 70 71 75 88 544
19 2308024422 23080244 李曉亮 男 85 60 85 72 72 83 89 546

df = df.sort_values('學號')

df['學號后三位'] = df.學號.astype(str).str.slice(-3,)  # take the last three digits of the student ID as a new column (the existing columns are unchanged)
plt.plot(df.學號后三位, df.總分, '-', color='r')  # draw the line
plt.xticks(rotation=60)  # rotate the tick labels by 60 degrees

plt.rcParams['font.sans-serif'] = ['SimHei']  # set the font and figure size before plotting
plt.rcParams['figure.figsize'] = [10, 6]
df2 = pd.read_excel(r'E:\Python\第4章數據\09電動1.xls')
plt.plot(df2.姓名, df2.C語言程序設計, '--', color='g')
plt.xlabel('姓名')
plt.ylabel('C語言程序設計')
plt.xticks(rotation=90)
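The sort, slice, and plot steps can be tried on three rows from the table without the Excel file. This is a sketch under my own assumptions: English column names, the Agg backend so that no display is needed, and the axes-level API instead of plt.plot.

```python
import matplotlib
matplotlib.use("Agg")  # off-screen backend; drop this line for interactive use
import matplotlib.pyplot as plt
import pandas as pd

df = pd.DataFrame({"student_id": [2308024251, 2308024241, 2308024244],
                   "total": [471, 443, 452]})

df = df.sort_values("student_id")                              # plot in ID order
df["id_suffix"] = df["student_id"].astype(str).str.slice(-3)   # last three digits

fig, ax = plt.subplots()
(line,) = ax.plot(df["id_suffix"], df["total"], "--", color="r")  # dashed red line
ax.tick_params(axis="x", labelrotation=60)                        # rotate tick labels
```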

4.5.4 Bar chart: plt.bar(df.學號后三位, df.總分, width=1, color=['r', 'b'])

A bar chart shows how data changes over a period of time or how items compare with one another. It draws rectangles of unit width whose lengths follow the data values, and is used to compare two or more values (across time or across categories).

bar(left, height, width, color)
barh(bottom, width, height, color)
left: the x positions, usually a sequence produced with arange (named x in current matplotlib; the first argument of barh is likewise named y)
height: the y values, that is, the bar heights; usually the data to display
width: the bar width, usually 1
color: the fill color

df['學號后三位'] = df.學號.astype(str).str.slice(-3,)
plt.bar(df.學號后三位, df.總分, width=1, color=['r', 'b'])  # vertical bars
plt.xticks(rotation=60)
plt.barh(df.學號后三位, df.總分, 0.6, color=['r', 'b'])  # horizontal bars

[Figure: plt.bar output]

[Figure: plt.barh output]

4.5.5 Histogram: plt.hist(df2.C語言程序設計, bins=10, color='g', cumulative=True)

Histogram: drawn as a series of rectangles of equal width and varying height. The width represents an interval of the data range, the height represents how many values fall within that interval, and the overall shape shows the distribution of the data.

It is used to inspect how frequently values occur.

hist(x, color, bins, cumulative=False)
x: the data to plot
color: the fill color
bins: the number of bins
cumulative: whether to accumulate the counts; the default is False

df2 = pd.read_excel(r'E:\Python\第4章數據\09電動1.xls')
plt.hist(df2.C語言程序設計, bins=10, color='g', cumulative=False)
plt.hist(df2.C語言程序設計, bins=10, color='g', cumulative=True)

[Figure: cumulative=False]

[Figure: cumulative=True]

[Figure: bins=20]
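The difference between plain and cumulative counts can be read directly from the values that hist returns. A minimal sketch with ten made-up scores (not the author's data), again using the Agg backend so it runs without a display:

```python
import matplotlib
matplotlib.use("Agg")  # off-screen backend; drop this line for interactive use
import matplotlib.pyplot as plt

scores = [55, 62, 68, 71, 75, 78, 83, 88, 91, 96]   # illustrative data

fig, (left, right) = plt.subplots(1, 2, figsize=(10, 4))

# Plain histogram: each bar is the count inside one bin
counts, edges, _ = left.hist(scores, bins=5, color="g")

# Cumulative histogram: each bar is the running total up to that bin
running, _, _ = right.hist(scores, bins=5, color="g", cumulative=True)
```

The last cumulative bar always equals the number of observations, which makes it a handy sanity check.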

These are my online notes; I hope they help. Your likes and bookmarks are my biggest motivation to keep going.

[Python Study Notes, Step-by-Step Edition] Notion notes
