Pandas Official Documentation (Chinese Translation): Basic Usage, Part 2

Published: 2024/9/15

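呆鸟 says: "Translation is not easy. Sometimes it means agonizing over a single word, sometimes proofreading and revising tens of thousands of characters again and again, all to give you a more accurate translation and a more comfortable read. 呆鸟 asks for nothing in return; if you find this article useful, please recommend it or share it with friends who need it. That is the greatest encouragement for the hard work of translating."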

Descriptive statistics

Series and DataFrame support a large number of methods for computing descriptive statistics and related operations. Most of them are aggregations such as sum(), mean(), and quantile(), whose result is smaller than the original data set; others, such as cumsum() and cumprod(), return a result of the same size as the original. Generally speaking, these methods take an axis argument, just like ndarray.{sum, std, …}, but here the axis can be specified by name or by integer:

Series: no axis argument needed
DataFrame:
- "index", i.e. axis=0, the default
- "columns", i.e. axis=1

For example:

In [77]: df
Out[77]:
        one       two     three
a  1.394981  1.772517       NaN
b  0.343054  1.912123 -0.050390
c  0.695246  1.478369  1.227435
d       NaN  0.279344 -0.613172

In [78]: df.mean(0)
Out[78]:
one      0.811094
two      1.360588
three    0.187958
dtype: float64

In [79]: df.mean(1)
Out[79]:
a    1.583749
b    0.734929
c    1.133683
d   -0.166914
dtype: float64
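As a minimal sketch of the name-or-integer rule, the snippet below uses a small hypothetical frame (not the `df` from the session above) to check that `axis="index"` matches `axis=0` and `axis="columns"` matches `axis=1`.

```python
import pandas as pd

# Hypothetical data, chosen so the means are easy to verify by hand.
df = pd.DataFrame(
    {"one": [1.0, 2.0, 3.0], "two": [4.0, 5.0, 6.0]},
    index=["a", "b", "c"],
)

# axis=0 / axis="index": reduce down the rows, one result per column.
col_means = df.mean(axis="index")   # one -> 2.0, two -> 5.0

# axis=1 / axis="columns": reduce across the columns, one result per row.
row_means = df.mean(axis="columns")  # a -> 2.5, b -> 3.5, c -> 4.5
```

Spelling the axis by name tends to make the direction of the reduction easier to read than a bare `0` or `1`.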

All of these methods accept skipna, a keyword that controls whether missing data is excluded. It defaults to True.

In [80]: df.sum(0, skipna=False)
Out[80]:
one           NaN
two      5.442353
three         NaN
dtype: float64

In [81]: df.sum(axis=1, skipna=True)
Out[81]:
a    3.167498
b    2.204786
c    3.401050
d   -0.333828
dtype: float64
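The effect of skipna is easiest to see on a tiny frame with a single missing value. This is an illustrative sketch with made-up data, not the frame used in the session above.

```python
import numpy as np
import pandas as pd

# Hypothetical data: column "one" contains one missing value.
df = pd.DataFrame({"one": [1.0, np.nan, 3.0], "two": [4.0, 5.0, 6.0]})

# Default (skipna=True): the NaN is ignored, so "one" sums to 4.0.
with_skip = df.sum()

# skipna=False: any NaN in a column makes that column's result NaN.
without_skip = df.sum(skipna=False)
```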

Combined with broadcasting and arithmetic operations, many statistical procedures can be expressed very concisely. Standardization, which rescales data to zero mean and a standard deviation of 1, is one example:

In [82]: ts_stand = (df - df.mean()) / df.std()

In [83]: ts_stand.std()
Out[83]:
one      1.0
two      1.0
three    1.0
dtype: float64

In [84]: xs_stand = df.sub(df.mean(1), axis=0).div(df.std(1), axis=0)

In [85]: xs_stand.std(1)
Out[85]:
a    1.0
b    1.0
c    1.0
d    1.0
dtype: float64
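The two standardizations can be checked on a reproducible frame. The sketch below uses hypothetical data; the point is that the column-wise form relies on default alignment against the column labels, while the row-wise form needs `sub`/`div` with `axis=0` so that the per-row statistics align against the row index.

```python
import numpy as np
import pandas as pd

# Hypothetical data with no missing values.
df = pd.DataFrame({"x": [1.0, 2.0, 3.0, 4.0], "y": [10.0, 20.0, 40.0, 80.0]})

# Column-wise: df.mean() and df.std() are indexed by column label,
# so plain arithmetic broadcasts them across every row.
col_std = (df - df.mean()) / df.std()

# Row-wise: df.mean(axis=1) is indexed by row label, so axis=0 tells
# sub/div to align it with the row index instead of the columns.
row_std = df.sub(df.mean(axis=1), axis=0).div(df.std(axis=1), axis=0)
```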

Note: methods such as cumsum() and cumprod() preserve the position of NaN values. This differs somewhat from expanding() and rolling(); see the documentation on those methods for details.

In [86]: df.cumsum()
Out[86]:
        one       two     three
a  1.394981  1.772517       NaN
b  1.738035  3.684640 -0.050390
c  2.433281  5.163008  1.177045
d       NaN  5.442353  0.563873
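To make the difference concrete, here is a small sketch on a hypothetical Series. With its default settings, `expanding().sum()` reports a value at the NaN position (the sum of the valid observations seen so far), whereas `cumsum()` leaves NaN in place.

```python
import numpy as np
import pandas as pd

# Hypothetical data with a missing value in the second position.
s = pd.Series([1.0, np.nan, 2.0, 3.0])

# cumsum keeps NaN where the input had NaN: [1.0, NaN, 3.0, 6.0]
cumulative = s.cumsum()

# expanding().sum() fills that position with the running total: [1.0, 1.0, 3.0, 6.0]
expanding_total = s.expanding().sum()
```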

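Below is a quick-reference table of common functions. Each also takes an optional level parameter, which applies only when the object has a hierarchical index.

Function	Description

count	Number of non-NA observations
sum	Sum of values
mean	Mean of values
mad	Mean absolute deviation
median	Arithmetic median of values
min	Minimum
max	Maximum
mode	Mode
abs	Absolute value
prod	Product of values
std	Bessel-corrected sample standard deviation
var	Unbiased variance
sem	Standard error of the mean
skew	Sample skewness (3rd moment)
kurt	Sample kurtosis (4th moment)
quantile	Sample quantile (value at a given %)
cumsum	Cumulative sum
cumprod	Cumulative product
cummax	Cumulative maximum
cummin	Cumulative minimum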
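The table reflects the pandas version this translation was written against. To my knowledge, pandas 2.0 and later no longer provide `mad` or the `level` argument on these reductions, so the sketch below shows version-independent equivalents on hypothetical data: `groupby(level=...)` for per-level aggregation and an explicit formula for the mean absolute deviation.

```python
import pandas as pd

# Hypothetical Series with a two-level (hierarchical) index.
idx = pd.MultiIndex.from_product([["x", "y"], [1, 2]], names=["outer", "inner"])
s = pd.Series([1.0, 3.0, 10.0, 20.0], index=idx)

# Per-level aggregation: sum within each value of the "outer" level.
by_outer = s.groupby(level="outer").sum()   # x -> 4.0, y -> 30.0

# Mean absolute deviation around the mean, written out by hand.
mean_abs_dev = (s - s.mean()).abs().mean()  # 6.5
```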

Note: NumPy functions such as mean, std, and sum exclude NA values by default when applied to a Series.

In [87]: np.mean(df['one'])
Out[87]: 0.8110935116651192

In [88]: np.mean(df['one'].to_numpy())
Out[88]: nan
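The same contrast on hypothetical data: calling `np.mean` on a Series skips the NaN, while calling it on the underlying array does not. `np.nanmean` is the NumPy function to reach for when only the raw array is available.

```python
import numpy as np
import pandas as pd

s = pd.Series([1.0, 2.0, np.nan])  # hypothetical data

on_series = np.mean(s)                      # 1.5, NaN skipped
on_array = np.mean(s.to_numpy())            # nan, plain NumPy semantics
nan_aware = np.nanmean(s.to_numpy())        # 1.5, NaN skipped explicitly
```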

Series.nunique() returns the number of unique non-NA values in a Series.

In [89]: series = pd.Series(np.random.randn(500))

In [90]: series[20:500] = np.nan

In [91]: series[10:20] = 5

In [92]: series.nunique()
Out[92]: 11
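A smaller, deterministic sketch of the same idea, using hypothetical data. The `dropna` parameter of `nunique` controls whether NaN counts as a distinct value.

```python
import numpy as np
import pandas as pd

s = pd.Series([1.0, 1.0, 2.0, np.nan, np.nan])  # hypothetical data

distinct_non_na = s.nunique()               # 2: the values 1.0 and 2.0
distinct_with_na = s.nunique(dropna=False)  # 3: NaN counted once as well
```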

Summarizing data: describe

The describe() function computes a variety of summary statistics for a Series or for the columns of a DataFrame. Note that NA values are excluded.

In [93]: series = pd.Series(np.random.randn(1000))

In [94]: series[::2] = np.nan

In [95]: series.describe()
Out[95]:
count    500.000000
mean      -0.021292
std        1.015906
min       -2.683763
25%       -0.699070
50%       -0.069718
75%        0.714483
max        3.160915
dtype: float64

In [96]: frame = pd.DataFrame(np.random.randn(1000, 5),
   ....:                      columns=['a', 'b', 'c', 'd', 'e'])
   ....:

In [97]: frame.iloc[::2] = np.nan

In [98]: frame.describe()
Out[98]:
                a           b           c           d           e
count  500.000000  500.000000  500.000000  500.000000  500.000000
mean     0.033387    0.030045   -0.043719   -0.051686    0.005979
std      1.017152    0.978743    1.025270    1.015988    1.006695
min     -3.000951   -2.637901   -3.303099   -3.159200   -3.188821
25%     -0.647623   -0.576449   -0.712369   -0.691338   -0.691115
50%      0.047578   -0.021499   -0.023888   -0.032652   -0.025363
75%      0.729907    0.775880    0.618896    0.670047    0.649748
max      2.740139    2.752332    3.004229    2.728702    3.240991

You can also choose which percentiles appear in the output:

In [99]: series.describe(percentiles=[.05, .25, .75, .95])
Out[99]:
count    500.000000
mean      -0.021292
std        1.015906
min       -2.683763
5%        -1.645423
25%       -0.699070
50%       -0.069718
75%        0.714483
95%        1.711409
max        3.160915
dtype: float64

By default, the median is always included.
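A deterministic sketch with hypothetical data shows both points from this section: the count ignores NaN, and the requested percentiles appear as labelled rows of the result.

```python
import numpy as np
import pandas as pd

s = pd.Series([1.0, 2.0, 3.0, 4.0, np.nan])  # hypothetical data

summary = s.describe(percentiles=[0.05, 0.95])

n_valid = summary["count"]   # 4.0, the NaN is not counted
average = summary["mean"]    # 2.5
labels = list(summary.index)  # includes "5%" and "95%"
```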

For a non-numeric Series, describe() returns the number of values, the number of unique values, the most frequent value, and how often that value occurs.

In [100]: s = pd.Series(['a', 'a', 'b', 'b', 'a', 'a', np.nan, 'c', 'd', 'a'])

In [101]: s.describe()
Out[101]:
count     9
unique    4
top       a
freq      5
dtype: object
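The fields of a non-numeric summary can be read back by label. A short sketch on hypothetical data:

```python
import numpy as np
import pandas as pd

s = pd.Series(["a", "a", "b", np.nan, "a"])  # hypothetical data

summary = s.describe()

n_values = summary["count"]     # 4 non-NA values
n_unique = summary["unique"]    # 2 distinct values: "a" and "b"
most_common = summary["top"]    # "a"
occurrences = summary["freq"]   # 3
```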

Note: on a mixed-type DataFrame, describe() restricts the summary to the numeric columns. If there are no numeric columns, it summarizes only the categorical columns.

In [102]: frame = pd.DataFrame({'a': ['Yes', 'Yes', 'No', 'No'], 'b': range(4)})

In [103]: frame.describe()
Out[103]:
              b
count  4.000000
mean   1.500000
std    1.290994
min    0.000000
25%    0.750000
50%    1.500000
75%    2.250000
max    3.000000

The include/exclude arguments take a list of types and control which data types are included in or excluded from the summary. The special value all is also accepted:

In [104]: frame.describe(include=['object'])
Out[104]:
          a
count     4
unique    2
top     Yes
freq      2

In [105]: frame.describe(include=['number'])
Out[105]:
              b
count  4.000000
mean   1.500000
std    1.290994
min    0.000000
25%    0.750000
50%    1.500000
75%    2.250000
max    3.000000

In [106]: frame.describe(include='all')
Out[106]:
          a         b
count     4  4.000000
unique    2       NaN
top     Yes       NaN
freq      2       NaN
mean    NaN  1.500000
std     NaN  1.290994
min     NaN  0.000000
25%     NaN  0.750000
50%     NaN  1.500000
75%     NaN  2.250000
max     NaN  3.000000

This feature relies on select_dtypes; see its documentation for the inputs it accepts.
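Since the column selection is delegated to `select_dtypes`, calling it directly is a quick way to preview which columns a given `include`/`exclude` will pick up. The sketch below rebuilds the same small frame and uses the generic `"number"` selector, which I would expect to behave the same across pandas versions.

```python
import pandas as pd

frame = pd.DataFrame({"a": ["Yes", "Yes", "No", "No"], "b": range(4)})

# Preview the selection that describe() would make.
numeric_cols = list(frame.select_dtypes(include="number").columns)      # ["b"]
non_numeric_cols = list(frame.select_dtypes(exclude="number").columns)  # ["a"]

# The same selectors passed to describe().
numeric_summary = frame.describe(include=["number"])
non_numeric_summary = frame.describe(exclude=["number"])
full_summary = frame.describe(include="all")
```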

Index of the maximum and minimum values

The idxmax() and idxmin() functions on Series and DataFrame compute the index labels of the maximum and minimum values.

In [107]: s1 = pd.Series(np.random.randn(5))

In [108]: s1
Out[108]:
0    1.118076
1   -0.352051
2   -1.242883
3   -1.277155
4   -0.641184
dtype: float64

In [109]: s1.idxmin(), s1.idxmax()
Out[109]: (3, 0)

In [110]: df1 = pd.DataFrame(np.random.randn(5, 3), columns=['A', 'B', 'C'])

In [111]: df1
Out[111]:
          A         B         C
0 -0.327863 -0.946180 -0.137570
1 -0.186235 -0.257213 -0.486567
2 -0.507027 -0.871259 -0.111110
3  2.000339 -2.430505  0.089759
4 -0.321434 -0.033695  0.096271

In [112]: df1.idxmin(axis=0)
Out[112]:
A    2
B    3
C    1
dtype: int64

In [113]: df1.idxmax(axis=1)
Out[113]:
0    C
1    A
2    C
3    A
4    C
dtype: object
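With string labels the direction of each call is easier to see. In this sketch (hypothetical data), `idxmin(axis=0)` answers "which row holds each column's minimum?" and `idxmax(axis=1)` answers "which column holds each row's maximum?".

```python
import pandas as pd

# Hypothetical data with labelled rows.
df = pd.DataFrame({"A": [3, 1, 2], "B": [0, 5, 4]}, index=["x", "y", "z"])

# One row label per column: A -> "y", B -> "x".
row_of_min = df.idxmin(axis=0)

# One column label per row: x -> "A", y -> "B", z -> "B".
col_of_max = df.idxmax(axis=1)
```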

When several rows (or columns) share the minimum or maximum value, idxmin() and idxmax() return the index label of the first match:

In [114]: df3 = pd.DataFrame([2, 1, 1, 3, np.nan], columns=['A'], index=list('edcba'))

In [115]: df3
Out[115]:
     A
e  2.0
d  1.0
c  1.0
b  3.0
a  NaN

In [116]: df3['A'].idxmin()
Out[116]: 'd'

::: tip Note

idxmin and idxmax correspond to argmin and argmax in NumPy.

:::
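One way to read that correspondence: NumPy returns an integer position, while pandas returns the index label at that position. The sketch below uses hypothetical data with a tie and a NaN, and uses `np.nanargmin`/`np.nanargmax` on the raw array so that the missing value is skipped, as idxmin()/idxmax() do by default.

```python
import numpy as np
import pandas as pd

# Hypothetical data: the minimum 1.0 occurs twice, and one value is missing.
s = pd.Series([2.0, 1.0, 1.0, 3.0, np.nan], index=list("edcba"))

label_of_min = s.idxmin()   # "d", the first of the two tied labels
label_of_max = s.idxmax()   # "b"

# Position-based equivalents on the underlying array.
pos_of_min = int(np.nanargmin(s.to_numpy()))  # 1
pos_of_max = int(np.nanargmax(s.to_numpy()))  # 3
```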

Value counts (histogramming) and mode

The Series method value_counts(), also available as a top-level function, computes a histogram of a one-dimensional array of values. The top-level function can be applied to regular arrays as well:

In [117]: data = np.random.randint(0, 7, size=50)

In [118]: data
Out[118]:
array([6, 6, 2, 3, 5, 3, 2, 5, 4, 5, 4, 3, 4, 5, 0, 2, 0, 4, 2, 0, 3, 2,
       2, 5, 6, 5, 3, 4, 6, 4, 3, 5, 6, 4, 3, 6, 2, 6, 6, 2, 3, 4, 2, 1,
       6, 2, 6, 1, 5, 4])

In [119]: s = pd.Series(data)

In [120]: s.value_counts()
Out[120]:
6    10
2    10
4     9
5     8
3     8
0     3
1     2
dtype: int64

In [121]: pd.value_counts(data)
Out[121]:
6    10
2    10
4     9
5     8
3     8
0     3
1     2
dtype: int64
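A deterministic sketch on hypothetical data. It wraps the array in a Series rather than calling the top-level function because, as far as I know, recent pandas releases deprecate `pd.value_counts`; treat that as an assumption and check the release notes for your version. `normalize=True` turns the counts into relative frequencies.

```python
import numpy as np
import pandas as pd

data = np.array([1, 2, 2, 3, 3, 3])  # hypothetical data

# Counts are sorted from most to least frequent: 3 -> 3, 2 -> 2, 1 -> 1.
counts = pd.Series(data).value_counts()

# Relative frequencies instead of raw counts: 3 -> 0.5, and so on.
shares = pd.Series(data).value_counts(normalize=True)
```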

Similarly, you can compute the mode, i.e. the most frequently occurring value or values, of a Series or DataFrame:

In [122]: s5 = pd.Series([1, 1, 3, 3, 3, 5, 5, 7, 7, 7])

In [123]: s5.mode()
Out[123]:
0    3
1    7
dtype: int64

In [124]: df5 = pd.DataFrame({"A": np.random.randint(0, 7, size=50),
   .....:                     "B": np.random.randint(-10, 15, size=50)})
   .....:

In [125]: df5.mode()
Out[125]:
     A   B
0  1.0  -9
1  NaN  10
2  NaN  13
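The NaN entries in a DataFrame mode come from padding: each column can have a different number of modes, and shorter columns are filled with NaN. A sketch on hypothetical data where column "A" has two modes and column "B" has one:

```python
import numpy as np
import pandas as pd

# Series with a tie: both 1 and 3 occur twice.
s = pd.Series([1, 1, 3, 3, 5])
series_modes = s.mode().tolist()   # [1, 3]

# "A" has two modes (1 and 2); "B" has a single mode (5).
df = pd.DataFrame({"A": [1, 1, 2, 2], "B": [5, 5, 5, 6]})
frame_modes = df.mode()            # second row of "B" is NaN padding
```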

Discretization and quantiling

The cut() function (bins based on values) and the qcut() function (bins based on sample quantiles) discretize continuous values:

In [126]: arr = np.random.randn(20)

In [127]: factor = pd.cut(arr, 4)

In [128]: factor
Out[128]:
[(-0.251, 0.464], (-0.968, -0.251], (0.464, 1.179], (-0.251, 0.464], (-0.968, -0.251], ..., (-0.251, 0.464], (-0.968, -0.251], (-0.968, -0.251], (-0.968, -0.251], (-0.968, -0.251]]
Length: 20
Categories (4, interval[float64]): [(-0.968, -0.251] < (-0.251, 0.464] < (0.464, 1.179] <
                                    (1.179, 1.893]]

In [129]: factor = pd.cut(arr, [-5, -1, 0, 1, 5])

In [130]: factor
Out[130]:
[(0, 1], (-1, 0], (0, 1], (0, 1], (-1, 0], ..., (-1, 0], (-1, 0], (-1, 0], (-1, 0], (-1, 0]]
Length: 20
Categories (4, interval[int64]): [(-5, -1] < (-1, 0] < (0, 1] < (1, 5]]
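A deterministic sketch of `cut` with explicit bin edges. The data and the label names are hypothetical; the intervals are closed on the right by default, and the `labels` parameter replaces the interval notation with names.

```python
import numpy as np
import pandas as pd

ages = np.array([3, 17, 25, 40, 70])  # hypothetical data
edges = [0, 18, 65, 100]

# Three right-closed bins: (0, 18], (18, 65], (65, 100].
binned = pd.cut(ages, edges)
bin_codes = binned.codes.tolist()     # [0, 0, 1, 1, 2]

# The same bins with hypothetical human-readable labels.
labeled = pd.cut(ages, edges, labels=["minor", "adult", "senior"])
```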

qcut() computes sample quantiles. For example, the following code slices normally distributed data into quartiles of roughly equal size:

In [131]: arr = np.random.randn(30)

In [132]: factor = pd.qcut(arr, [0, .25, .5, .75, 1])

In [133]: factor
Out[133]:
[(0.569, 1.184], (-2.278, -0.301], (-2.278, -0.301], (0.569, 1.184], (0.569, 1.184], ..., (-0.301, 0.569], (1.184, 2.346], (1.184, 2.346], (-0.301, 0.569], (-2.278, -0.301]]
Length: 30
Categories (4, interval[float64]): [(-2.278, -0.301] < (-0.301, 0.569] < (0.569, 1.184] <
                                    (1.184, 2.346]]

In [134]: pd.value_counts(factor)
Out[134]:
(1.184, 2.346]      8
(-2.278, -0.301]    8
(0.569, 1.184]      7
(-0.301, 0.569]     7
dtype: int64
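With twelve distinct values the quartile split is exact, which makes the equal-size property easy to verify. This is a sketch on hypothetical data.

```python
import numpy as np
import pandas as pd

values = np.arange(1, 13)  # hypothetical data: 1, 2, ..., 12

# Quartile edges are taken from the sample itself, not from fixed values.
quartiles = pd.qcut(values, [0, 0.25, 0.5, 0.75, 1])

# Number of observations per bin: three in each of the four quartiles.
per_bin = np.bincount(quartiles.codes.astype(np.intp)).tolist()
```

Passing an integer, as in `pd.qcut(values, 4)`, requests the same four equal-probability bins without spelling out the quantiles.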

Infinite values can also be passed when defining the bins:

In [135]: arr = np.random.randn(20)

In [136]: factor = pd.cut(arr, [-np.inf, 0, np.inf])

In [137]: factor
Out[137]:
[(-inf, 0.0], (0.0, inf], (0.0, inf], (-inf, 0.0], (-inf, 0.0], ..., (-inf, 0.0], (-inf, 0.0], (-inf, 0.0], (0.0, inf], (0.0, inf]]
Length: 20
Categories (2, interval[float64]): [(-inf, 0.0] < (0.0, inf]]
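Open-ended bins are handy for a simple sign split. In this sketch the data and label names are hypothetical; note that 0.0 lands in the lower bin because intervals are closed on the right.

```python
import numpy as np
import pandas as pd

arr = np.array([-2.5, -0.1, 0.0, 0.3, 4.0])  # hypothetical data

# Two bins covering the whole real line: (-inf, 0] and (0, inf].
sign = pd.cut(arr, [-np.inf, 0, np.inf], labels=["non-positive", "positive"])
sign_list = list(sign)
```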
