Computing proportions with pandas pivot: Data Analysis with Pandas, Basics (Part 2)

Published: 2025/3/15 | Programming Q&A | 豆豆

Recommended reading: Data Analysis: Pandas Basics (Part 1)

The previous lesson introduced the basic usage of Pandas. In this chapter we go further with pandas by analyzing the Titanic survival data.

titanic_train.csv

Netdisk link: https://pan.baidu.com/s/1hGc19QAGV6H-hDtOdz-GpQ Extraction code: sgu8

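Data description:

PassengerId: passenger ID
Survived: whether the passenger survived; 1 means survived, 0 means did not survive
Pclass: passenger class; "1" is Upper, "2" is Middle, "3" is Lower
Name: passenger name
Sex: sex
Age: age
SibSp: number of siblings or spouses the passenger had aboard
Parch: number of parents or children the passenger had aboard
Ticket: ticket information
Fare: fare
Cabin: cabin number (missing for many passengers)
Embarked: port of embarkation; C = Cherbourg, Q = Queenstown, S = Southampton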

First, read in the data

import pandas as pd
import numpy as np
titanic_survival = pd.read_csv("titanic_train.csv")
titanic_survival.head()  # show the first few rows

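In pandas, NaN marks a value that is empty, that is, missing

Use pd.isnull() (or, equivalently, the Series method .isnull()) to check whether each value in a column is missing

age = titanic_survival["Age"]
age_is_null = pd.isnull(age)  # boolean Series: True where Age is missing
print(age_is_null)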

Inspect the missing entries

age_null_true = age[age_is_null]  # keep only the rows where Age is missing
print(age_null_true)

image-20200618094407869

The figure above shows that the selected Age values have length 177 and dtype float64

You can also use len() directly to get the length

age_null_count = len(age_null_true)
print(age_null_count)

>>> 177
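
The count above can also be taken from the boolean mask itself. The following is a minimal sketch on a made-up Series (the values are illustrative, not taken from titanic_train.csv): since True counts as 1, summing the mask gives the number of missing entries.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the "Age" column (not the real Titanic data)
age = pd.Series([22.0, np.nan, 38.0, np.nan, 35.0], name="Age")

age_is_null = age.isnull()            # True where the value is missing
null_count = int(age_is_null.sum())   # True counts as 1, so the sum is the NaN count

# Same answer as selecting the missing rows and taking len()
assert null_count == len(age[age_is_null])
print(null_count)
```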

When processing data, a NaN in the data makes a plain arithmetic calculation return NaN. The following demonstrates this by computing the average age of the Titanic passengers

mean_age = sum(titanic_survival["Age"]) / len(titanic_survival["Age"])
print(mean_age)

>>> nan

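As shown above, we need to filter out the missing values before computing

good_ages = titanic_survival["Age"][~age_is_null]  # ~ inverts the mask: keep non-missing ages
print(good_ages)

We know that row 888 is missing its age; in the figure below, row 888 has been filtered out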

image-20200618095203606

With the missing values filtered out, compute the mean again:

correct_mean_age = sum(good_ages) / len(good_ages)
print(correct_mean_age)

>>> 29.69911764705882

We can also compute the mean with .mean(), which skips missing values by default

correct_mean_age = titanic_survival["Age"].mean()
print(correct_mean_age)

>>> 29.69911764705882  # same result as above
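
To make the difference between the two approaches concrete, here is a small sketch on a made-up three-value Series (illustrative numbers only). It contrasts the built-in sum(), which propagates NaN, with Series.mean(), whose skipna parameter defaults to True.

```python
import math

import numpy as np
import pandas as pd

# Hypothetical ages with one missing value
ages = pd.Series([20.0, np.nan, 40.0])

naive_mean = sum(ages) / len(ages)                 # built-in sum propagates NaN
skipna_mean = ages.mean()                          # skipna=True by default
strict_mean = ages.mean(skipna=False)              # NaN again when missing values are kept
manual_mean = ages.dropna().sum() / ages.count()   # count() ignores missing values

print(naive_mean, skipna_mean, strict_mean, manual_mean)
```

Note that the denominator matters as well: len(ages) counts the missing entry, while ages.count() does not.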

There are 3 passenger classes in total. Next we compute the average fare for each class

passenger_classes = [1, 2, 3]
fares_by_class = {}
for this_class in passenger_classes:
    # Select the rows that belong to this class, then average their fares
    pclass_rows = titanic_survival[titanic_survival["Pclass"] == this_class]
    pclass_fares = pclass_rows["Fare"]
    fare_for_class = pclass_fares.mean()
    fares_by_class[this_class] = fare_for_class
print(fares_by_class)

>>> {1: 84.1546875, 2: 20.662183152173913, 3: 13.675550101832993}
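
The loop above is a manual split-and-aggregate. As a sketch of an alternative (not the approach used in this article), groupby does the same grouping in one call. The small DataFrame below is invented for illustration.

```python
import pandas as pd

# Hypothetical miniature of the Pclass and Fare columns
df = pd.DataFrame({
    "Pclass": [1, 1, 2, 3, 3],
    "Fare": [80.0, 100.0, 20.0, 8.0, 12.0],
})

# Group rows by class and average the fares within each group
fares_by_class = df.groupby("Pclass")["Fare"].mean().to_dict()
print(fares_by_class)
```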

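To compute the survival rate for each of the 3 classes, we can use the .pivot_table(index, values, aggfunc) method

index: the column to group by (it becomes the index of the result)

values: the target column(s) to aggregate

aggfunc: the aggregation function to apply

Let's first look at the original table. 0 means died and 1 means survived; Pclass is the passenger class, with three levels 1, 2 and 3. Using Pclass as the index and Survived as the values, we compute the survival rate. Because Survived is 0 or 1, its mean is the fraction of passengers who survived.

image-20200618114805599

# "mean" gives the same result as np.mean; newer pandas releases recommend the string form
passenger_survival = titanic_survival.pivot_table(index="Pclass", values="Survived", aggfunc="mean")
print(passenger_survival)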

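
Since the title of this article is about computing proportions, here is a hedged sketch of two kinds of proportion that pivot_table can produce. The eight-row DataFrame is made up for illustration. The first result is the survival rate within each class (mean of a 0/1 column); the second is each class's share of all survivors (per-class sum divided by the grand total).

```python
import pandas as pd

# Hypothetical miniature of the Pclass and Survived columns
df = pd.DataFrame({
    "Pclass":   [1, 1, 2, 2, 3, 3, 3, 3],
    "Survived": [1, 1, 1, 0, 0, 0, 1, 0],
})

# Proportion of survivors within each class: the mean of a 0/1 column
survival_rate = df.pivot_table(index="Pclass", values="Survived", aggfunc="mean")

# Share of all survivors contributed by each class: per-class sum / total
survivors = df.pivot_table(index="Pclass", values="Survived", aggfunc="sum")
survivor_share = survivors["Survived"] / survivors["Survived"].sum()

print(survival_rate)
print(survivor_share)
```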

Average age for each passenger class

# aggfunc is omitted, so pivot_table uses its default, the mean
passenger_age = titanic_survival.pivot_table(index="Pclass", values="Age")
print(passenger_age)

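Use the port of embarkation (Embarked) as the index, with the total fare and the number of survivors as the values

# "sum" gives the same result as np.sum; summing the 0/1 Survived column counts the survivors
port_stats = titanic_survival.pivot_table(index="Embarked", values=["Fare", "Survived"], aggfunc="sum")
print(port_stats)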

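
When several value columns need different aggregations, aggfunc can also be a dict that maps each column to its own function. The sketch below uses invented data to take the mean fare and the survivor count per port in a single call.

```python
import pandas as pd

# Hypothetical miniature of the Embarked, Fare and Survived columns
df = pd.DataFrame({
    "Embarked": ["S", "S", "C", "C", "Q"],
    "Fare":     [10.0, 20.0, 30.0, 50.0, 8.0],
    "Survived": [0, 1, 1, 1, 0],
})

# One aggregation per column: average fare, number of survivors
port_stats = df.pivot_table(
    index="Embarked",
    values=["Fare", "Survived"],
    aggfunc={"Fare": "mean", "Survived": "sum"},
)
print(port_stats)
```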

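Setting axis=1 (or axis="columns") drops every column that contains a null value. With axis=0 and subset, rows that have a null in any of the listed columns are dropped instead

drop_na_columns = titanic_survival.dropna(axis=1)
new_titanic_survival = titanic_survival.dropna(axis=0, subset=["Age", "Sex"])
print(new_titanic_survival)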

image-20200618202841715

Comparing with the figure below, we can see that row 888, whose "Age" value is null, has been removed

image-20200618203144439
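
The two dropna variants are easy to confuse, so here is a minimal sketch on a made-up three-row DataFrame. It shows which columns and rows survive each call, and that the original DataFrame is left unchanged because dropna returns a new object.

```python
import numpy as np
import pandas as pd

# Hypothetical miniature with missing values in Age and Cabin
df = pd.DataFrame({
    "Age":   [22.0, np.nan, 35.0],
    "Sex":   ["male", "female", "female"],
    "Cabin": [np.nan, "C85", np.nan],
})

no_null_columns = df.dropna(axis=1)                  # only "Sex" has no missing value
rows_with_age = df.dropna(axis=0, subset=["Age"])    # row 1 is dropped, Cabin NaNs are ignored

print(list(no_null_columns.columns))
print(list(rows_with_age.index))
```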

Sort the passengers by age in descending order

new_titanic_survival = titanic_survival.sort_values("Age", ascending=False)
print(new_titanic_survival[0:10])  # show the first 10 rows

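Reset the index (after sorting, each row still carries its original index label):

titanic_reindexed = new_titanic_survival.reset_index(drop=True)
print(titanic_reindexed.loc[0:10])  # .loc slices include the end label, so this prints 11 rows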

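
The sketch below, on invented data, walks through what sorting and resetting do to the index, and how label-based .loc slicing differs from position-based .iloc slicing.

```python
import numpy as np
import pandas as pd

# Hypothetical miniature with one missing age
df = pd.DataFrame({
    "Name": ["a", "b", "c", "d"],
    "Age":  [30.0, np.nan, 50.0, 40.0],
})

by_age = df.sort_values("Age", ascending=False)   # NaN rows go last by default
reindexed = by_age.reset_index(drop=True)         # new 0..n-1 index, old labels discarded

print(list(by_age.index))       # original labels, now out of order
print(list(reindexed.index))    # fresh consecutive labels
print(len(reindexed.loc[0:1]), len(reindexed.iloc[0:1]))  # label slice vs position slice
```

.loc[0:1] includes the end label and returns two rows, while .iloc[0:1] follows Python slicing and returns one.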

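Return the information for the 100th passenger

def hundredth_row(column):
    # Extract the hundredth item (label 99 on the default 0-based index)
    hundredth_item = column.loc[99]
    return hundredth_item

# Return the hundredth item from each column
# (stored under a new name so the function is not overwritten)
hundredth_row_values = titanic_survival.apply(hundredth_row)
print(hundredth_row_values)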

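Count the null values in each column of the table

def null_count(column):
    column_null = pd.isnull(column)   # boolean mask of missing values
    null = column[column_null]        # keep only the missing entries
    return len(null)

# apply() calls null_count once per column
column_null_count = titanic_survival.apply(null_count)
print(column_null_count)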

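
The per-column count can also be written without a helper function. As a sketch on a made-up DataFrame, the following checks that DataFrame.isnull().sum() gives the same numbers as the apply-based version.

```python
import numpy as np
import pandas as pd

# Hypothetical miniature with missing values in Age and Cabin
df = pd.DataFrame({
    "Age":   [22.0, np.nan, 35.0],
    "Sex":   ["male", "female", "female"],
    "Cabin": [np.nan, "C85", np.nan],
})

def null_count(column):
    # Same helper as in the article: length of the missing-only selection
    return len(column[pd.isnull(column)])

via_apply = df.apply(null_count)
via_sum = df.isnull().sum()    # column-wise sum of the boolean mask

print(via_sum)
```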

Compute the survival rates of adults and minors separately

First, classify the passengers, using 18 years of age as the threshold

def generate_age_label(row):
    age = row["Age"]
    if pd.isnull(age):
        return "unknown"
    elif age < 18:
        return "minor"
    else:
        return "adult"

# axis=1 passes each row (not each column) to the function
age_labels = titanic_survival.apply(generate_age_label, axis=1)
print(age_labels)

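Compute the survival rate

titanic_survival["age_labels"] = age_labels
# aggfunc defaults to the mean, which for the 0/1 Survived column is the survival rate
age_group_survival = titanic_survival.pivot_table(index="age_labels", values="Survived")
print(age_group_survival)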

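
Row-wise apply works but calls a Python function once per row. As a sketch of a vectorized alternative (an assumption about style, not the method used in this article), the labels can be built with boolean masks. The six-row DataFrame is invented for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical miniature of the Age and Survived columns
df = pd.DataFrame({
    "Age":      [10.0, 40.0, np.nan, 16.0, 30.0, np.nan],
    "Survived": [1, 0, 0, 1, 1, 1],
})

# Start with "adult" everywhere, then overwrite using boolean masks
labels = pd.Series("adult", index=df.index)
labels[df["Age"] < 18] = "minor"          # NaN < 18 is False, so missing ages are untouched
labels[df["Age"].isnull()] = "unknown"

df["age_labels"] = labels
rates = df.pivot_table(index="age_labels", values="Survived")  # default aggfunc: mean
print(rates)
```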
