Pandas Official Documentation in Chinese: Basic Usage, Part 2
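呆鸟 (Dai Niao) says: "Translation is not easy. Sometimes a single term takes repeated deliberation, and sometimes tens of thousands of words go through round after round of proofreading and revision, all to give readers a more accurate translation and a more comfortable read. Dai Niao asks for nothing in return, only that readers who find this article useful give it a like or share it with friends who need it. That is the greatest encouragement for this hard translation work."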
Descriptive statistics
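Series and DataFrame provide a large number of methods for computing descriptive statistics and related operations. Most of them are aggregations such as sum(), mean(), and quantile(), which produce a result smaller than the original data set; others, such as cumsum() and cumprod(), produce a result of the same size as the input. Generally these methods take an axis argument, just like ndarray.{sum,std,…}, but here the axis can be specified by name or by integer: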
Series: no axis argument needed
DataFrame:
"index", i.e. axis=0 (the default)
"columns", i.e. axis=1
For example:

    In [77]: df
    Out[77]:
            one       two     three
    a  1.394981  1.772517       NaN
    b  0.343054  1.912123 -0.050390
    c  0.695246  1.478369  1.227435
    d       NaN  0.279344 -0.613172

    In [78]: df.mean(0)
    Out[78]:
    one      0.811094
    two      1.360588
    three    0.187958
    dtype: float64

    In [79]: df.mean(1)
    Out[79]:
    a    1.583749
    b    0.734929
    c    1.133683
    d   -0.166914
    dtype: float64

All of these methods accept a skipna option that controls whether missing data is excluded; it defaults to True.

    In [80]: df.sum(0, skipna=False)
    Out[80]:
    one           NaN
    two      5.442353
    three         NaN
    dtype: float64

    In [81]: df.sum(axis=1, skipna=True)
    Out[81]:
    a    3.167498
    b    2.204786
    c    3.401050
    d   -0.333828
    dtype: float64

Combined with broadcasting and arithmetic, these methods describe various statistical procedures concisely. Standardization, i.e. rescaling the data to zero mean and a standard deviation of 1, is one example:

    In [82]: ts_stand = (df - df.mean()) / df.std()

    In [83]: ts_stand.std()
    Out[83]:
    one      1.0
    two      1.0
    three    1.0
    dtype: float64

    In [84]: xs_stand = df.sub(df.mean(1), axis=0).div(df.std(1), axis=0)

    In [85]: xs_stand.std(1)
    Out[85]:
    a    1.0
    b    1.0
    c    1.0
    d    1.0
    dtype: float64

Note: methods such as cumsum() and cumprod() preserve the location of NaN values. This differs somewhat from expanding() and rolling(); for details, see the corresponding note in the pandas documentation.
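
    In [86]: df.cumsum()
    Out[86]:
            one       two     three
    a  1.394981  1.772517       NaN
    b  1.738035  3.684640 -0.050390
    c  2.433281  5.163008  1.177045
    d       NaN  5.442353  0.563873

The table below summarizes the common functions. In the pandas version this translation follows, each also takes an optional level parameter, which applies only if the object has a hierarchical index; newer pandas releases dropped level from these methods.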
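| Function | Description |
| --- | --- |
| count | Number of non-NA observations |
| sum | Sum of values |
| mean | Mean of values |
| mad | Mean absolute deviation |
| median | Arithmetic median of values |
| min | Minimum |
| max | Maximum |
| mode | Mode |
| abs | Absolute value |
| prod | Product of values |
| std | Bessel-corrected sample standard deviation |
| var | Unbiased variance |
| sem | Standard error of the mean |
| skew | Sample skewness (3rd moment) |
| kurt | Sample kurtosis (4th moment) |
| quantile | Sample quantile (value at %) |
| cumsum | Cumulative sum |
| cumprod | Cumulative product |
| cummax | Cumulative maximum |
| cummin | Cumulative minimum |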
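
Two entries in this summary depend on the pandas version: as far as I know, mad() and the level argument on reductions are no longer available from pandas 2.0 onward. The sketch below, which assumes a recent pandas and uses made-up data (the index names key and num are hypothetical), shows one way to compute the same quantities by hand and how the Bessel correction changes std.

```python
import numpy as np
import pandas as pd

# Hypothetical data with a two-level (hierarchical) index.
idx = pd.MultiIndex.from_product([["x", "y"], [1, 2]], names=["key", "num"])
s = pd.Series([1.0, 3.0, 10.0, np.nan], index=idx)

# Mean absolute deviation around the mean, written out explicitly.
# mean() skips NaN by default, which matches skipna=True.
mad = (s - s.mean()).abs().mean()

# Per-level aggregation through groupby rather than a level= argument.
sum_by_key = s.groupby(level="key").sum()

# Bessel-corrected (ddof=1, the default) versus population (ddof=0) std.
std_sample = s.std()
std_population = s.std(ddof=0)

print(mad, sum_by_key.to_dict(), std_sample, std_population)
```

The groupby form generalizes to any of the aggregations in the table, and ddof is the parameter to change when a population statistic is wanted instead of the sample one.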
Note: some NumPy functions, such as mean, std, and sum, exclude NAs by default when given a Series as input, but not when given the underlying array:

    In [87]: np.mean(df['one'])
    Out[87]: 0.8110935116651192

    In [88]: np.mean(df['one'].to_numpy())
    Out[88]: nan

Series.nunique() returns the number of unique non-NA values in a Series:

    In [89]: series = pd.Series(np.random.randn(500))

    In [90]: series[20:500] = np.nan

    In [91]: series[10:20] = 5

    In [92]: series.nunique()
    Out[92]: 11

Summarizing data: describe
The describe() function computes a variety of summary statistics for a Series or for the columns of a DataFrame, excluding NAs:

    In [93]: series = pd.Series(np.random.randn(1000))

    In [94]: series[::2] = np.nan

    In [95]: series.describe()
    Out[95]:
    count    500.000000
    mean      -0.021292
    std        1.015906
    min       -2.683763
    25%       -0.699070
    50%       -0.069718
    75%        0.714483
    max        3.160915
    dtype: float64

    In [96]: frame = pd.DataFrame(np.random.randn(1000, 5),
       ....:                      columns=['a', 'b', 'c', 'd', 'e'])
       ....:

    In [97]: frame.iloc[::2] = np.nan

    In [98]: frame.describe()
    Out[98]:
                    a           b           c           d           e
    count  500.000000  500.000000  500.000000  500.000000  500.000000
    mean     0.033387    0.030045   -0.043719   -0.051686    0.005979
    std      1.017152    0.978743    1.025270    1.015988    1.006695
    min     -3.000951   -2.637901   -3.303099   -3.159200   -3.188821
    25%     -0.647623   -0.576449   -0.712369   -0.691338   -0.691115
    50%      0.047578   -0.021499   -0.023888   -0.032652   -0.025363
    75%      0.729907    0.775880    0.618896    0.670047    0.649748
    max      2.740139    2.752332    3.004229    2.728702    3.240991

You can also select specific percentiles to include in the output:

    In [99]: series.describe(percentiles=[.05, .25, .75, .95])
    Out[99]:
    count    500.000000
    mean      -0.021292
    std        1.015906
    min       -2.683763
    5%        -1.645423
    25%       -0.699070
    50%       -0.069718
    75%        0.714483
    95%        1.711409
    max        3.160915
    dtype: float64

By default, the median is always included.
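
To make the link between the percentiles argument and the output labels concrete, here is a minimal sketch. It assumes a seeded NumPy generator so the run is repeatable (the seed and variable names are illustrative, not from the original), and it checks that the rows reported by describe() agree with Series.quantile().

```python
import numpy as np
import pandas as pd

# Seeded generator so the sketch is repeatable; the seed is arbitrary.
rng = np.random.default_rng(0)
series = pd.Series(rng.standard_normal(1000))
series[::2] = np.nan  # blank out every other value

# Ask for the 5th and 95th percentiles only.
summary = series.describe(percentiles=[0.05, 0.95])

# Each requested percentile shows up under a label such as "5%".
p05 = summary["5%"]
p95 = summary["95%"]

# The same numbers come straight from quantile(), which also skips NaN.
q05 = series.quantile(0.05)
q95 = series.quantile(0.95)

print(summary)
```

Because count only tallies non-NA values, summary["count"] is 500 here rather than 1000.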
For a non-numerical Series, describe() gives a simple summary: the number of non-NA values, the number of unique values, the most frequently occurring value, and how often it occurs:

    In [100]: s = pd.Series(['a', 'a', 'b', 'b', 'a', 'a', np.nan, 'c', 'd', 'a'])

    In [101]: s.describe()
    Out[101]:
    count     9
    unique    4
    top       a
    freq      5
    dtype: object

Note: on a mixed-type DataFrame, describe() restricts the summary to the numerical columns only or, if there are none, to the categorical columns only:

    In [102]: frame = pd.DataFrame({'a': ['Yes', 'Yes', 'No', 'No'], 'b': range(4)})

    In [103]: frame.describe()
    Out[103]:
                  b
    count  4.000000
    mean   1.500000
    std    1.290994
    min    0.000000
    25%    0.750000
    50%    1.500000
    75%    2.250000
    max    3.000000

This behavior can be controlled with the include/exclude arguments, which take a list of data types to include or exclude. The special value all is also accepted:

    In [104]: frame.describe(include=['object'])
    Out[104]:
              a
    count     4
    unique    2
    top     Yes
    freq      2

    In [105]: frame.describe(include=['number'])
    Out[105]:
                  b
    count  4.000000
    mean   1.500000
    std    1.290994
    min    0.000000
    25%    0.750000
    50%    1.500000
    75%    2.250000
    max    3.000000

    In [106]: frame.describe(include='all')
    Out[106]:
              a         b
    count     4  4.000000
    unique    2       NaN
    top     Yes       NaN
    freq      2       NaN
    mean    NaN  1.500000
    std     NaN  1.290994
    min     NaN  0.000000
    25%     NaN  0.750000
    50%     NaN  1.500000
    75%     NaN  2.250000
    max     NaN  3.000000

This feature relies on select_dtypes; refer to its documentation for details about the inputs it accepts.
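
Since include/exclude follow the same rules as select_dtypes, it can help to call select_dtypes directly and see which columns a given selector picks. The sketch below reuses the small Yes/No frame; selecting by exclude=["number"] rather than by "object" is my own choice here, made on the assumption that the dtype of string columns may vary between pandas versions.

```python
import pandas as pd

frame = pd.DataFrame({"a": ["Yes", "Yes", "No", "No"], "b": range(4)})

# Columns picked by a numeric selector, and everything else.
numeric_cols = frame.select_dtypes(include=["number"]).columns.tolist()
other_cols = frame.select_dtypes(exclude=["number"]).columns.tolist()

# describe() with the same selector summarizes exactly those columns.
numeric_summary = frame.describe(include=["number"])
other_summary = frame.describe(exclude=["number"])

print(numeric_cols, other_cols)
print(numeric_summary)
print(other_summary)
```

Checking the selector with select_dtypes first is a cheap way to confirm which columns a describe() call will cover before reading its output.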
Index of min/max values
The idxmax() and idxmin() functions on Series and DataFrame compute the index labels of the maximum and minimum values:

    In [107]: s1 = pd.Series(np.random.randn(5))

    In [108]: s1
    Out[108]:
    0    1.118076
    1   -0.352051
    2   -1.242883
    3   -1.277155
    4   -0.641184
    dtype: float64

    In [109]: s1.idxmin(), s1.idxmax()
    Out[109]: (3, 0)

    In [110]: df1 = pd.DataFrame(np.random.randn(5, 3), columns=['A', 'B', 'C'])

    In [111]: df1
    Out[111]:
              A         B         C
    0 -0.327863 -0.946180 -0.137570
    1 -0.186235 -0.257213 -0.486567
    2 -0.507027 -0.871259 -0.111110
    3  2.000339 -2.430505  0.089759
    4 -0.321434 -0.033695  0.096271

    In [112]: df1.idxmin(axis=0)
    Out[112]:
    A    2
    B    3
    C    1
    dtype: int64

    In [113]: df1.idxmax(axis=1)
    Out[113]:
    0    C
    1    A
    2    C
    3    A
    4    C
    dtype: object

When several rows (or columns) match the maximum or minimum value, idxmax() and idxmin() return the first matching index label:

    In [114]: df3 = pd.DataFrame([2, 1, 1, 3, np.nan], columns=['A'], index=list('edcba'))

    In [115]: df3
    Out[115]:
         A
    e  2.0
    d  1.0
    c  1.0
    b  3.0
    a  NaN

    In [116]: df3['A'].idxmin()
    Out[116]: 'd'

::: tip Note
idxmin and idxmax correspond to argmin and argmax in NumPy.
:::
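
The correspondence with NumPy is about what is being located, not about the return value: idxmin() returns an index label, while the argmin family returns an integer position. A minimal sketch, reusing the values of df3 above as a plain Series (np.nanargmin is used on the raw array because the data contains a NaN; that choice is mine, not the original's):

```python
import numpy as np
import pandas as pd

s = pd.Series([2, 1, 1, 3, np.nan], index=list("edcba"))

# Label of the first minimum (NaN is skipped by default).
label = s.idxmin()

# Integer position of the first minimum, pandas and NumPy flavours.
position = s.argmin()
np_position = int(np.nanargmin(s.to_numpy()))

# The position maps back to the same label through the index.
label_from_position = s.index[position]

print(label, position, np_position, label_from_position)
```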
Value counts (histogramming) and mode
The value_counts() Series method and the top-level function of the same name compute a histogram of a one-dimensional array of values. The top-level function can also be used on regular arrays:

    In [117]: data = np.random.randint(0, 7, size=50)

    In [118]: data
    Out[118]:
    array([6, 6, 2, 3, 5, 3, 2, 5, 4, 5, 4, 3, 4, 5, 0, 2, 0, 4, 2, 0, 3, 2,
           2, 5, 6, 5, 3, 4, 6, 4, 3, 5, 6, 4, 3, 6, 2, 6, 6, 2, 3, 4, 2, 1,
           6, 2, 6, 1, 5, 4])

    In [119]: s = pd.Series(data)

    In [120]: s.value_counts()
    Out[120]:
    6    10
    2    10
    4     9
    5     8
    3     8
    0     3
    1     2
    dtype: int64

    In [121]: pd.value_counts(data)
    Out[121]:
    6    10
    2    10
    4     9
    5     8
    3     8
    0     3
    1     2
    dtype: int64

Similarly, you can get the mode, i.e. the most frequently occurring value or values, of a Series or DataFrame:

    In [122]: s5 = pd.Series([1, 1, 3, 3, 3, 5, 5, 7, 7, 7])

    In [123]: s5.mode()
    Out[123]:
    0    3
    1    7
    dtype: int64

    In [124]: df5 = pd.DataFrame({"A": np.random.randint(0, 7, size=50),
       .....:                     "B": np.random.randint(-10, 15, size=50)})
       .....:

    In [125]: df5.mode()
    Out[125]:
         A   B
    0  1.0  -9
    1  NaN  10
    2  NaN  13

Discretization and quantiling
Continuous values can be discretized with the cut() function (bins based on values) and the qcut() function (bins based on sample quantiles):

    In [126]: arr = np.random.randn(20)

    In [127]: factor = pd.cut(arr, 4)

    In [128]: factor
    Out[128]:
    [(-0.251, 0.464], (-0.968, -0.251], (0.464, 1.179], (-0.251, 0.464], (-0.968, -0.251], ..., (-0.251, 0.464], (-0.968, -0.251], (-0.968, -0.251], (-0.968, -0.251], (-0.968, -0.251]]
    Length: 20
    Categories (4, interval[float64]): [(-0.968, -0.251] < (-0.251, 0.464] < (0.464, 1.179] <
                                        (1.179, 1.893]]

    In [129]: factor = pd.cut(arr, [-5, -1, 0, 1, 5])

    In [130]: factor
    Out[130]:
    [(0, 1], (-1, 0], (0, 1], (0, 1], (-1, 0], ..., (-1, 0], (-1, 0], (-1, 0], (-1, 0], (-1, 0]]
    Length: 20
    Categories (4, interval[int64]): [(-5, -1] < (-1, 0] < (0, 1] < (1, 5]]

qcut() computes sample quantiles. For example, the following code slices normally distributed data at evenly spaced quantiles, giving quartile bins of roughly equal size:

    In [131]: arr = np.random.randn(30)

    In [132]: factor = pd.qcut(arr, [0, .25, .5, .75, 1])

    In [133]: factor
    Out[133]:
    [(0.569, 1.184], (-2.278, -0.301], (-2.278, -0.301], (0.569, 1.184], (0.569, 1.184], ..., (-0.301, 0.569], (1.184, 2.346], (1.184, 2.346], (-0.301, 0.569], (-2.278, -0.301]]
    Length: 30
    Categories (4, interval[float64]): [(-2.278, -0.301] < (-0.301, 0.569] < (0.569, 1.184] <
                                        (1.184, 2.346]]

    In [134]: pd.value_counts(factor)
    Out[134]:
    (1.184, 2.346]      8
    (-2.278, -0.301]    8
    (0.569, 1.184]      7
    (-0.301, 0.569]     7
    dtype: int64

You can also pass infinite values to define the bins:

    In [135]: arr = np.random.randn(20)

    In [136]: factor = pd.cut(arr, [-np.inf, 0, np.inf])

    In [137]: factor
    Out[137]:
    [(-inf, 0.0], (0.0, inf], (0.0, inf], (-inf, 0.0], (-inf, 0.0], ..., (-inf, 0.0], (-inf, 0.0], (-inf, 0.0], (0.0, inf], (0.0, inf]]
    Length: 20
    Categories (2, interval[float64]): [(-inf, 0.0] < (0.0, inf]]
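
The contrast between value-based and quantile-based binning can be sketched in a few lines. This example assumes a seeded generator and uses the labels argument of cut() and qcut() to name the bins; the seed, the label strings, and the sample size of 40 are illustrative choices, not part of the original.

```python
import numpy as np
import pandas as pd

# Seeded generator so the sketch is repeatable; the seed is arbitrary.
rng = np.random.default_rng(42)
arr = rng.standard_normal(40)

# Value-based bins: everything up to 0 versus everything above it.
signs = pd.cut(arr, [-np.inf, 0, np.inf], labels=["non-positive", "positive"])

# Quantile-based bins: four bins holding equal shares of the sample.
quartiles = pd.qcut(arr, 4, labels=["q1", "q2", "q3", "q4"])

sign_counts = pd.Series(signs).value_counts()
quartile_counts = pd.Series(quartiles).value_counts()

print(sign_counts)
print(quartile_counts)
```

With cut() the bin edges are fixed and the counts depend on the data; with qcut() the counts are fixed (10 per bin for 40 distinct values) and the edges depend on the data.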