4.6高級處理-缺失值處理
點擊標題即可獲取文章源代碼和筆記 數據集:https://download.csdn.net/download/weixin_44827418/12548095
Pandas高級處理缺失值處理數據離散化合并交叉表與透視表分組與聚合綜合案例
4.6 高級處理
- 缺失值處理
1 )如何進行缺失值處理兩種思路:
1 )刪除含有缺失值的樣本
2 )替換
/ 插補
4.6 .1 如何處理nan
1 )判斷數據中是否存在NaNpd
. isnull
( df
) pd
. notnull
( df
) 2 )刪除含有缺失值的樣本df
. dropna
( inplace
= False ) 替換
/ 插補df
. fillna
( value
, inplace
= False ) 4.6 .2 不是缺失值nan,有默認標記的
1 )替換 ?
- > np
. nandf
. replace
( to_replace
= "?" , value
= np
. nan
) 2 )處理np
. nan缺失值的步驟
2 )缺失值處理實例
4.7 高級處理
- 數據離散化性別 年齡
A
1 23
B
2 30
C
1 18 物種 毛發
A
1
B
2
C
3 男 女 年齡
A
1 0 23
B
0 1 30
C
1 0 18 狗 豬 老鼠 毛發
A
1 0 0 2
B
0 1 0 1
C
0 0 1 1
one
- hot編碼
& 啞變量
4.7 .1 什么是數據的離散化原始的身高數據:
165 ,
174 ,
160 ,
180 ,
159 ,
163 ,
192 ,
184
4.7 .2 為什么要離散化
4.7 .3 如何實現數據的離散化
1 )分組自動分組sr
= pd
. qcut
( data
, bins
) 自定義分組sr
= pd
. cut
( data
, [ ] ) 2 )將分組好的結果轉換成one
- hot編碼pd
. get_dummies
( sr
, prefix
= )
4.8 高級處理
- 合并numpynp
. concatnate
( ( a
, b
) , axis
= ) 水平拼接np
. hstack
( ) 豎直拼接np
. vstack
( ) 1 )按方向拼接pd
. concat
( [ data1
, data2
] , axis
= 1 ) 2 )按索引拼接pd
. merge實現合并pd
. merge
( left
, right
, how
= "inner" , on
= [ 索引
] )
4.9 高級處理
- 交叉表與透視表找到、探索兩個變量之間的關系
4.9 .1 交叉表與透視表什么作用
4.9 .2 使用crosstab
( 交叉表
) 實現pd
. crosstab
( value1
, value2
) 4.9 .3 pivot_table
4.10 高級處理
- 分組與聚合
4.10 .1 什么是分組與聚合
4.10 .2 分組與聚合APIdataframesr
4.6.1如何處理nan
import pandas
as pd movie
= pd
. read_csv
( "./datas/IMDB-Movie-Data.csv" )
movie
RankTitleGenreDescriptionDirectorActorsYearRuntime (Minutes)RatingVotesRevenue (Millions)Metascore 01 Guardians of the Galaxy Action,Adventure,Sci-Fi A group of intergalactic criminals are forced ... James Gunn Chris Pratt, Vin Diesel, Bradley Cooper, Zoe S... 2014 121 8.1 757074 333.13 76.0 12 Prometheus Adventure,Mystery,Sci-Fi Following clues to the origin of mankind, a te... Ridley Scott Noomi Rapace, Logan Marshall-Green, Michael Fa... 2012 124 7.0 485820 126.46 65.0 23 Split Horror,Thriller Three girls are kidnapped by a man with a diag... M. Night Shyamalan James McAvoy, Anya Taylor-Joy, Haley Lu Richar... 2016 117 7.3 157606 138.12 62.0 34 Sing Animation,Comedy,Family In a city of humanoid animals, a hustling thea... Christophe Lourdelet Matthew McConaughey,Reese Witherspoon, Seth Ma... 2016 108 7.2 60545 270.32 59.0 45 Suicide Squad Action,Adventure,Fantasy A secret government agency recruits some of th... David Ayer Will Smith, Jared Leto, Margot Robbie, Viola D... 2016 123 6.2 393727 325.02 40.0 ...... ... ... ... ... ... ... ... ... ... ... ... 995996 Secret in Their Eyes Crime,Drama,Mystery A tight-knit team of rising investigators, alo... Billy Ray Chiwetel Ejiofor, Nicole Kidman, Julia Roberts... 2015 111 6.2 27585 NaN 45.0 996997 Hostel: Part II Horror Three American college students studying abroa... Eli Roth Lauren German, Heather Matarazzo, Bijou Philli... 2007 94 5.5 73152 17.54 46.0 997998 Step Up 2: The Streets Drama,Music,Romance Romantic sparks occur between two dance studen... Jon M. Chu Robert Hoffman, Briana Evigan, Cassie Ventura,... 2008 98 6.2 70699 58.01 50.0 998999 Search Party Adventure,Comedy A pair of friends embark on a mission to reuni... Scot Armstrong Adam Pally, T.J. Miller, Thomas Middleditch,Sh... 2014 93 5.6 4881 NaN 22.0 9991000 Nine Lives Comedy,Family,Fantasy A stuffy businessman finds himself trapped ins... Barry Sonnenfeld Kevin Spacey, Jennifer Garner, Robbie Amell,Ch... 2016 87 5.3 12435 19.64 11.0
1000 rows × 12 columns
movie
. isnull
( )
RankTitleGenreDescriptionDirectorActorsYearRuntime (Minutes)RatingVotesRevenue (Millions)Metascore 0False False False False False False False False False False False False 1False False False False False False False False False False False False 2False False False False False False False False False False False False 3False False False False False False False False False False False False 4False False False False False False False False False False False False ...... ... ... ... ... ... ... ... ... ... ... ... 995False False False False False False False False False False True False 996False False False False False False False False False False False False 997False False False False False False False False False False False False 998False False False False False False False False False False True False 999False False False False False False False False False False False False
1000 rows × 12 columns
import numpy
as np
np
. any ( movie
. isnull
( ) )
True
pd
. notnull
( movie
)
RankTitleGenreDescriptionDirectorActorsYearRuntime (Minutes)RatingVotesRevenue (Millions)Metascore 0True True True True True True True True True True True True 1True True True True True True True True True True True True 2True True True True True True True True True True True True 3True True True True True True True True True True True True 4True True True True True True True True True True True True ...... ... ... ... ... ... ... ... ... ... ... ... 995True True True True True True True True True True False True 996True True True True True True True True True True True True 997True True True True True True True True True True True True 998True True True True True True True True True True False True 999True True True True True True True True True True True True
1000 rows × 12 columns
np
. all ( pd
. notnull
( movie
) )
False
pd
. isnull
( movie
) . any ( )
Rank False
Title False
Genre False
Description False
Director False
Actors False
Year False
Runtime (Minutes) False
Rating False
Votes False
Revenue (Millions) True
Metascore True
dtype: bool
pd
. notnull
( movie
) . all ( )
Rank True
Title True
Genre True
Description True
Director True
Actors True
Year True
Runtime (Minutes) True
Rating True
Votes True
Revenue (Millions) False
Metascore False
dtype: bool
movie_full
= movie
. dropna
( )
movie_full
. isnull
( ) . any ( )
Rank False
Title False
Genre False
Description False
Director False
Actors False
Year False
Runtime (Minutes) False
Rating False
Votes False
Revenue (Millions) False
Metascore False
dtype: bool
movie
. head
( )
RankTitleGenreDescriptionDirectorActorsYearRuntime (Minutes)RatingVotesRevenue (Millions)Metascore 01 Guardians of the Galaxy Action,Adventure,Sci-Fi A group of intergalactic criminals are forced ... James Gunn Chris Pratt, Vin Diesel, Bradley Cooper, Zoe S... 2014 121 8.1 757074 333.13 76.0 12 Prometheus Adventure,Mystery,Sci-Fi Following clues to the origin of mankind, a te... Ridley Scott Noomi Rapace, Logan Marshall-Green, Michael Fa... 2012 124 7.0 485820 126.46 65.0 23 Split Horror,Thriller Three girls are kidnapped by a man with a diag... M. Night Shyamalan James McAvoy, Anya Taylor-Joy, Haley Lu Richar... 2016 117 7.3 157606 138.12 62.0 34 Sing Animation,Comedy,Family In a city of humanoid animals, a hustling thea... Christophe Lourdelet Matthew McConaughey,Reese Witherspoon, Seth Ma... 2016 108 7.2 60545 270.32 59.0 45 Suicide Squad Action,Adventure,Fantasy A secret government agency recruits some of th... David Ayer Will Smith, Jared Leto, Margot Robbie, Viola D... 2016 123 6.2 393727 325.02 40.0
movie
[ "Revenue (Millions)" ] . mean
( )
82.95637614678897
movie
[ "Revenue (Millions)" ] . fillna
( movie
[ "Revenue (Millions)" ] . mean
( ) , inplace
= True )
movie
[ "Revenue (Millions)" ] . isnull
( ) . any ( )
False
movie
[ "Metascore" ] . fillna
( movie
[ "Metascore" ] . mean
( ) , inplace
= True )
movie
[ "Metascore" ] . isnull
( ) . any ( )
False
movie
. isnull
( ) . any ( )
Rank False
Title False
Genre False
Description False
Director False
Actors False
Year False
Runtime (Minutes) False
Rating False
Votes False
Revenue (Millions) False
Metascore False
dtype: bool
不是缺失值nan,有默認標記的處理方法
data
= pd
. read_csv
( "./datas/GBvideos.csv" , encoding
= "GBK" )
data
video_idtitlechannel_titlecategory_idtagsviewslikesdislikescomment_totalthumbnail_linkdate 0jt2OHQh0HoQ Live Apple Event - Apple September Event 2017 ... Apple Event 28 apple events|apple event|iphone 8|iphone x|iph... 7426393 78240 13548 705 https://i.ytimg.com/vi/jt2OHQh0HoQ/default_liv... 13.09 1AqokkXoa7uE Holly and Phillip Meet Samantha the Sex Robot ... This Morning 24 this morning|interview|holly willoughby|philli... 494203 2651 1309 0 https://i.ytimg.com/vi/AqokkXoa7uE/default.jpg 13.09 2YPVcg45W0z4 My DNA Test Results? I'm WHAT?? emmablackery 24 emmablackery|emma blackery|emma|blackery|briti... 142819 13119 151 1141 https://i.ytimg.com/vi/YPVcg45W0z4/default.jpg 13.09 3T_PuZBdT2iM getting into a conversation in a language you ... ProZD 1 skit|korean|language|conversation|esl|japanese... 1580028 65729 1529 3598 https://i.ytimg.com/vi/T_PuZBdT2iM/default.jpg 13.09 4NsjsmgmbCfc Baby Name Challenge? Sprinkleofglitter 26 sprinkleofglitter|sprinkle of glitter|baby gli... 40592 5019 57 490 https://i.ytimg.com/vi/NsjsmgmbCfc/default.jpg 13.09 ...... ... ... ... ... ... ... ... ... ... ... 1595w8fAellnPns Juicy Chicken Breast - You Suck at Cooking (ep... You Suck At Cooking 26 how to|cooking|recipe|kitchen|chicken|chicken ... 788466 31945 945 2274 https://i.ytimg.com/vi/w8fAellnPns/default.jpg 20.09 1596RsG37JcEQNw Weezer - Beach Boys weezer 10 weezer|pacific daydream|pacificdaydream|beach ... 107927 2435 412 641 https://i.ytimg.com/vi/RsG37JcEQNw/default.jpg 20.09 1597htSiIA2g7G8 Berry Frozen Yogurt Bark Recipe SORTEDfood 26 frozen yogurt bark|frozen yoghurt bark|frozen ... 109222 4840 35 212 https://i.ytimg.com/vi/htSiIA2g7G8/default.jpg 20.09 1598ZQK1F0wz6z4 What Do You Want to Eat?? Wong Fu Productions 24 panda|what should we eat|buzzfeed|comedy|boyfr... 626223 22962 532 1559 https://i.ytimg.com/vi/ZQK1F0wz6z4/default.jpg 20.09 1599DuPXdnSWoLk The Child in Time: Trailer - BBC One BBC 24 BBC|iPlayer|bbc one|bbc 1|bbc1|trailer|the chi... 99228 1699 ? 135 https://i.ytimg.com/vi/DuPXdnSWoLk/default.jpg 20.09
1600 rows × 11 columns
new_data
= data
. replace
( to_replace
= "?" , value
= np
. nan
)
new_data
video_idtitlechannel_titlecategory_idtagsviewslikesdislikescomment_totalthumbnail_linkdate 0jt2OHQh0HoQ Live Apple Event - Apple September Event 2017 ... Apple Event 28 apple events|apple event|iphone 8|iphone x|iph... 7426393 78240 13548 705 https://i.ytimg.com/vi/jt2OHQh0HoQ/default_liv... 13.09 1AqokkXoa7uE Holly and Phillip Meet Samantha the Sex Robot ... This Morning 24 this morning|interview|holly willoughby|philli... 494203 2651 1309 0 https://i.ytimg.com/vi/AqokkXoa7uE/default.jpg 13.09 2YPVcg45W0z4 My DNA Test Results? I'm WHAT?? emmablackery 24 emmablackery|emma blackery|emma|blackery|briti... 142819 13119 151 1141 https://i.ytimg.com/vi/YPVcg45W0z4/default.jpg 13.09 3T_PuZBdT2iM getting into a conversation in a language you ... ProZD 1 skit|korean|language|conversation|esl|japanese... 1580028 65729 1529 3598 https://i.ytimg.com/vi/T_PuZBdT2iM/default.jpg 13.09 4NsjsmgmbCfc Baby Name Challenge? Sprinkleofglitter 26 sprinkleofglitter|sprinkle of glitter|baby gli... 40592 5019 57 490 https://i.ytimg.com/vi/NsjsmgmbCfc/default.jpg 13.09 ...... ... ... ... ... ... ... ... ... ... ... 1595w8fAellnPns Juicy Chicken Breast - You Suck at Cooking (ep... You Suck At Cooking 26 how to|cooking|recipe|kitchen|chicken|chicken ... 788466 31945 945 2274 https://i.ytimg.com/vi/w8fAellnPns/default.jpg 20.09 1596RsG37JcEQNw Weezer - Beach Boys weezer 10 weezer|pacific daydream|pacificdaydream|beach ... 107927 2435 412 641 https://i.ytimg.com/vi/RsG37JcEQNw/default.jpg 20.09 1597htSiIA2g7G8 Berry Frozen Yogurt Bark Recipe SORTEDfood 26 frozen yogurt bark|frozen yoghurt bark|frozen ... 109222 4840 35 212 https://i.ytimg.com/vi/htSiIA2g7G8/default.jpg 20.09 1598ZQK1F0wz6z4 What Do You Want to Eat?? Wong Fu Productions 24 panda|what should we eat|buzzfeed|comedy|boyfr... 626223 22962 532 1559 https://i.ytimg.com/vi/ZQK1F0wz6z4/default.jpg 20.09 1599DuPXdnSWoLk The Child in Time: Trailer - BBC One BBC 24 BBC|iPlayer|bbc one|bbc 1|bbc1|trailer|the chi... 99228 1699 NaN 135 https://i.ytimg.com/vi/DuPXdnSWoLk/default.jpg 20.09
1600 rows × 11 columns
new_data
. isnull
( ) . any ( )
video_id False
title False
channel_title False
category_id False
tags False
views False
likes False
dislikes True
comment_total False
thumbnail_link False
date False
dtype: bool
new_data
. dropna
( inplace
= True )
new_data
. isnull
( ) . any ( )
video_id False
title False
channel_title False
category_id False
tags False
views False
likes False
dislikes False
comment_total False
thumbnail_link False
date False
dtype: bool
4.7 高級處理-數據離散化
import pandas
as pd
data
= pd
. Series
( [ 165 , 174 , 160 , 180 , 159 , 163 , 192 , 184 ] , index
= [ "No1:165" , "No2:174" , "No3:160" , "No4:180" , "No5:159" , "No6:163" , "No7:192" , "No8:184" ] )
data
No1:165 165
No2:174 174
No3:160 160
No4:180 180
No5:159 159
No6:163 163
No7:192 192
No8:184 184
dtype: int64
自動分組
sr
= pd
. qcut
( data
, 3 )
sr
No1:165 (163.667, 178.0]
No2:174 (163.667, 178.0]
No3:160 (158.999, 163.667]
No4:180 (178.0, 192.0]
No5:159 (158.999, 163.667]
No6:163 (158.999, 163.667]
No7:192 (178.0, 192.0]
No8:184 (178.0, 192.0]
dtype: category
Categories (3, interval[float64]): [(158.999, 163.667] < (163.667, 178.0] < (178.0, 192.0]]
sr
. value_counts
( )
(178.0, 192.0] 3
(158.999, 163.667] 3
(163.667, 178.0] 2
dtype: int64
type ( sr
)
pandas.core.series.Series
pd
. get_dummies
( sr
, prefix
= "height" )
height_(158.999, 163.667]height_(163.667, 178.0]height_(178.0, 192.0] No1:1650 1 0 No2:1740 1 0 No3:1601 0 0 No4:1800 0 1 No5:1591 0 0 No6:1631 0 0 No7:1920 0 1 No8:1840 0 1
自定義分組
sr
= pd
. cut
( data
, [ 150 , 165 , 180 , 195 ] )
sr
No1:165 (150, 165]
No2:174 (165, 180]
No3:160 (150, 165]
No4:180 (165, 180]
No5:159 (150, 165]
No6:163 (150, 165]
No7:192 (180, 195]
No8:184 (180, 195]
dtype: category
Categories (3, interval[int64]): [(150, 165] < (165, 180] < (180, 195]]
sr
. value_counts
( )
(150, 165] 4
(180, 195] 2
(165, 180] 2
dtype: int64
pd
. get_dummies
( sr
, prefix
= "身高" )
身高_(150, 165]身高_(165, 180]身高_(180, 195] No1:1651 0 0 No2:1740 1 0 No3:1601 0 0 No4:1800 1 0 No5:1591 0 0 No6:1631 0 0 No7:1920 0 1 No8:1840 0 1
4.8 高級處理-合并
4.8.1 pd.concat實現合并(按方向拼接)
data1
= np
. arange
( 0 , 20 , 1 ) . reshape
( 4 , 5 )
data1
= pd
. DataFrame
( data1
)
data1
01234 00 1 2 3 4 15 6 7 8 9 210 11 12 13 14 315 16 17 18 19
data2
= np
. arange
( 100 , 120 , 1 ) . reshape
( 4 , 5 )
data2
= pd
. DataFrame
( data2
)
data2
01234 0100 101 102 103 104 1105 106 107 108 109 2110 111 112 113 114 3115 116 117 118 119
data_concat
= pd
. concat
( [ data1
, data2
] , axis
= 1 )
data_concat
0123401234 00 1 2 3 4 100 101 102 103 104 15 6 7 8 9 105 106 107 108 109 210 11 12 13 14 110 111 112 113 114 315 16 17 18 19 115 116 117 118 119
data2
. T
0123 0100 105 110 115 1101 106 111 116 2102 107 112 117 3103 108 113 118 4104 109 114 119
data_concat1
= pd
. concat
( [ data1
, data2
. T
] , axis
= 0 )
data_concat1
01234 00 1 2 3 4.0 15 6 7 8 9.0 210 11 12 13 14.0 315 16 17 18 19.0 0100 105 110 115 NaN 1101 106 111 116 NaN 2102 107 112 117 NaN 3103 108 113 118 NaN 4104 109 114 119 NaN
4.8.2 pd.merge實現合并(按索引拼接)
left
= pd
. DataFrame
( { 'key1' : [ 'K0' , 'K0' , 'K1' , 'K2' ] ,
'key2' : [ 'K0' , 'K1' , 'K0' , 'K1' ] ,
'A' : [ 'A0' , 'A1' , 'A2' , 'A3' ] ,
'B' : [ 'B0' , 'B1' , 'B2' , 'B3' ] } )
left
key1key2AB 0K0 K0 A0 B0 1K0 K1 A1 B1 2K1 K0 A2 B2 3K2 K1 A3 B3
right
= pd
. DataFrame
( { 'key1' : [ 'K0' , 'K1' , 'K1' , 'K2' ] , 'key2' : [ 'K0' , 'K0' , 'K0' , 'K0' ] , 'C' : [ 'Co' , 'C1' , 'C2' , 'C3' ] , 'D' : [ 'DO' , 'D1' , 'D2' , 'D3' ] } )
right
key1key2CD 0K0 K0 Co DO 1K1 K0 C1 D1 2K1 K0 C2 D2 3K2 K0 C3 D3
result
= pd
. merge
( left
, right
, on
= [ 'key1' , 'key2' ] , how
= "inner" )
result
key1key2ABCD 0K0 K0 A0 B0 Co DO 1K1 K0 A2 B2 C1 D1 2K1 K0 A2 B2 C2 D2
result_left
= pd
. merge
( left
, right
, on
= [ 'key1' , 'key2' ] , how
= "left" )
result_left
key1key2ABCD 0K0 K0 A0 B0 Co DO 1K0 K1 A1 B1 NaN NaN 2K1 K0 A2 B2 C1 D1 3K1 K0 A2 B2 C2 D2 4K2 K1 A3 B3 NaN NaN
result_right
= pd
. merge
( left
, right
, on
= [ 'key1' , 'key2' ] , how
= "right" )
result_right
key1key2ABCD 0K0 K0 A0 B0 Co DO 1K1 K0 A2 B2 C1 D1 2K1 K0 A2 B2 C2 D2 3K2 K0 NaN NaN C3 D3
result_outer
= pd
. merge
( left
, right
, on
= [ 'key1' , 'key2' ] , how
= "outer" )
result_outer
key1key2ABCD 0K0 K0 A0 B0 Co DO 1K0 K1 A1 B1 NaN NaN 2K1 K0 A2 B2 C1 D1 3K1 K0 A2 B2 C2 D2 4K2 K1 A3 B3 NaN NaN 5K2 K0 NaN NaN C3 D3
4.9 高級處理-交叉表與透視表
4.9.2 使用crosstab(交叉表)實現
data
= pd
. read_excel
( "./datas/szfj_baoan.xls" )
data
districtroomnumhallAREAC_floorfloor_numschoolsubwayper_price 0baoan 3 2 89.3 middle 31 0 0 7.0773 1baoan 4 2 127.0 high 31 0 0 6.9291 2baoan 1 1 28.0 low 39 0 0 3.9286 3baoan 1 1 28.0 middle 30 0 0 3.3568 4baoan 2 2 78.0 middle 8 1 1 5.0769 ...... ... ... ... ... ... ... ... ... 1246baoan 4 2 89.3 low 8 0 0 4.2553 1247baoan 2 1 67.0 middle 30 0 0 3.8060 1248baoan 2 2 67.4 middle 29 1 0 5.3412 1249baoan 2 2 73.1 low 15 1 0 5.9508 1250baoan 3 2 86.2 middle 32 0 1 4.5244
1251 rows × 9 columns
time
= "2020-06-23"
date
= pd
. to_datetime
( time
)
date
Timestamp('2020-06-23 00:00:00')
type ( date
)
pandas._libs.tslibs.timestamps.Timestamp
date
. year
2020
date
. month
6
data
[ "week" ] = date
. weekday
data
. drop
( "week" , axis
= 1 , inplace
= True )
data
districtroomnumhallAREAC_floorfloor_numschoolsubwayper_price 0baoan 3 2 89.3 middle 31 0 0 7.0773 1baoan 4 2 127.0 high 31 0 0 6.9291 2baoan 1 1 28.0 low 39 0 0 3.9286 3baoan 1 1 28.0 middle 30 0 0 3.3568 4baoan 2 2 78.0 middle 8 1 1 5.0769 ...... ... ... ... ... ... ... ... ... 1246baoan 4 2 89.3 low 8 0 0 4.2553 1247baoan 2 1 67.0 middle 30 0 0 3.8060 1248baoan 2 2 67.4 middle 29 1 0 5.3412 1249baoan 2 2 73.1 low 15 1 0 5.9508 1250baoan 3 2 86.2 middle 32 0 1 4.5244
1251 rows × 9 columns
data
[ "feature" ] = np
. where
( data
[ "per_price" ] > 5.0000 , 1 , 0 )
data
districtroomnumhallAREAC_floorfloor_numschoolsubwayper_pricefeature 0baoan 3 2 89.3 middle 31 0 0 7.0773 1 1baoan 4 2 127.0 high 31 0 0 6.9291 1 2baoan 1 1 28.0 low 39 0 0 3.9286 0 3baoan 1 1 28.0 middle 30 0 0 3.3568 0 4baoan 2 2 78.0 middle 8 1 1 5.0769 1 ...... ... ... ... ... ... ... ... ... ... 1246baoan 4 2 89.3 low 8 0 0 4.2553 0 1247baoan 2 1 67.0 middle 30 0 0 3.8060 0 1248baoan 2 2 67.4 middle 29 1 0 5.3412 1 1249baoan 2 2 73.1 low 15 1 0 5.9508 1 1250baoan 3 2 86.2 middle 32 0 1 4.5244 0
1251 rows × 10 columns
data0
= pd
. crosstab
( data
[ "floor_num" ] , data
[ "feature" ] )
data0
feature01 floor_num 16 8 30 1 40 10 63 7 716 25 819 32 92 11 104 9 118 11 121 3 134 20 140 5 158 33 169 19 1720 21 1817 35 1911 5 202 4 211 6 220 1 234 8 2410 26 254 37 269 57 275 38 286 35 2926 68 3030 78 314 151 3221 126 3334 20 341 5 351 2 360 4 371 1 380 1 395 10 401 3 430 1 440 6 450 7 470 1 500 1 510 3 520 2 530 1
data0
. sum ( axis
= 1 )
floor_num
1 14
3 1
4 10
6 10
7 41
8 51
9 13
10 13
11 19
12 4
13 24
14 5
15 41
16 28
17 41
18 52
19 16
20 6
21 7
22 1
23 12
24 36
25 41
26 66
27 43
28 41
29 94
30 108
31 155
32 147
33 54
34 6
35 3
36 4
37 2
38 1
39 15
40 4
43 1
44 6
45 7
47 1
50 1
51 3
52 2
53 1
dtype: int64
data0
. div
( data0
. sum ( axis
= 1 ) , axis
= 0 )
feature01 floor_num 10.428571 0.571429 30.000000 1.000000 40.000000 1.000000 60.300000 0.700000 70.390244 0.609756 80.372549 0.627451 90.153846 0.846154 100.307692 0.692308 110.421053 0.578947 120.250000 0.750000 130.166667 0.833333 140.000000 1.000000 150.195122 0.804878 160.321429 0.678571 170.487805 0.512195 180.326923 0.673077 190.687500 0.312500 200.333333 0.666667 210.142857 0.857143 220.000000 1.000000 230.333333 0.666667 240.277778 0.722222 250.097561 0.902439 260.136364 0.863636 270.116279 0.883721 280.146341 0.853659 290.276596 0.723404 300.277778 0.722222 310.025806 0.974194 320.142857 0.857143 330.629630 0.370370 340.166667 0.833333 350.333333 0.666667 360.000000 1.000000 370.500000 0.500000 380.000000 1.000000 390.333333 0.666667 400.250000 0.750000 430.000000 1.000000 440.000000 1.000000 450.000000 1.000000 470.000000 1.000000 500.000000 1.000000 510.000000 1.000000 520.000000 1.000000 530.000000 1.000000
data_percent
= data0
. div
( data0
. sum ( axis
= 1 ) , axis
= 0 )
data_percent
feature01 floor_num 10.428571 0.571429 30.000000 1.000000 40.000000 1.000000 60.300000 0.700000 70.390244 0.609756 80.372549 0.627451 90.153846 0.846154 100.307692 0.692308 110.421053 0.578947 120.250000 0.750000 130.166667 0.833333 140.000000 1.000000 150.195122 0.804878 160.321429 0.678571 170.487805 0.512195 180.326923 0.673077 190.687500 0.312500 200.333333 0.666667 210.142857 0.857143 220.000000 1.000000 230.333333 0.666667 240.277778 0.722222 250.097561 0.902439 260.136364 0.863636 270.116279 0.883721 280.146341 0.853659 290.276596 0.723404 300.277778 0.722222 310.025806 0.974194 320.142857 0.857143 330.629630 0.370370 340.166667 0.833333 350.333333 0.666667 360.000000 1.000000 370.500000 0.500000 380.000000 1.000000 390.333333 0.666667 400.250000 0.750000 430.000000 1.000000 440.000000 1.000000 450.000000 1.000000 470.000000 1.000000 500.000000 1.000000 510.000000 1.000000 520.000000 1.000000 530.000000 1.000000
data_percent
. plot
( kind
= "bar" , stacked
= True )
<matplotlib.axes._subplots.AxesSubplot at 0x24719dd7488>
data_percent
= data0
. div
( data0
. sum ( axis
= 1 ) , axis
= 0 )
data_percent
<tr><th>50</th><td>0.000000</td><td>1.000000</td>
</tr>
<tr><th>51</th><td>0.000000</td><td>1.000000</td>
</tr>
<tr><th>52</th><td>0.000000</td><td>1.000000</td>
</tr>
<tr><th>53</th><td>0.000000</td><td>1.000000</td>
</tr>
feature01 floor_num 10.428571 0.571429 30.000000 1.000000 40.000000 1.000000 60.300000 0.700000 70.390244 0.609756 80.372549 0.627451 90.153846 0.846154 100.307692 0.692308 110.421053 0.578947 120.250000 0.750000 130.166667 0.833333 140.000000 1.000000 150.195122 0.804878 160.321429 0.678571 170.487805 0.512195 180.326923 0.673077 190.687500 0.312500 200.333333 0.666667 210.142857 0.857143 220.000000 1.000000 230.333333 0.666667 240.277778 0.722222 250.097561 0.902439 260.136364 0.863636 270.116279 0.883721 280.146341 0.853659 290.276596 0.723404 300.277778 0.722222
4.9.3使用pivot_table(透視表)實現
data
. pivot_table
( [ "feature" ] , index
= [ "floor_num" ] )
...
feature floor_num 10.571429 31.000000 41.000000 60.700000 501.000000 511.000000 521.000000 531.000000
4.10 高級處理-分組與聚合
4.10.2 分組與聚合API
col
= pd
. DataFrame
( { 'color' : [ 'white' , 'red' , 'green' , 'red' , 'green' ] , 'object' : [ "pen" , "pencil" , "pencil" , "ashtray" , "pen" ] , 'price1' : [ 4.56 , 4.20 , 1.30 , 0.56 , 2.75 ] , 'price2' : [ 4.75 , 4.12 , 1.68 , 0.75 , 3.15 ] } )
col
colorobjectprice1price2 0white pen 4.56 4.75 1red pencil 4.20 4.12 2green pencil 1.30 1.68 3red ashtray 0.56 0.75 4green pen 2.75 3.15
col
. groupby
( by
= "color" ) [ "price1" ] . max ( )
color
green 2.75
red 4.20
white 4.56
Name: price1, dtype: float64
col
[ 'price1' ] . groupby
( col
[ "color" ] )
<pandas.core.groupby.generic.SeriesGroupBy object at 0x000002471D178D08>
col
[ 'price1' ] . groupby
( col
[ "color" ] ) . max ( )
color
green 2.75
red 4.20
white 4.56
Name: price1, dtype: float64
4.11 綜合案例
movie
= pd
. read_csv
( "./datas/IMDB-Movie-Data.csv" )
movie
RankTitleGenreDescriptionDirectorActorsYearRuntime (Minutes)RatingVotesRevenue (Millions)Metascore 01 Guardians of the Galaxy Action,Adventure,Sci-Fi A group of intergalactic criminals are forced ... James Gunn Chris Pratt, Vin Diesel, Bradley Cooper, Zoe S... 2014 121 8.1 757074 333.13 76.0 12 Prometheus Adventure,Mystery,Sci-Fi Following clues to the origin of mankind, a te... Ridley Scott Noomi Rapace, Logan Marshall-Green, Michael Fa... 2012 124 7.0 485820 126.46 65.0 23 Split Horror,Thriller Three girls are kidnapped by a man with a diag... M. Night Shyamalan James McAvoy, Anya Taylor-Joy, Haley Lu Richar... 2016 117 7.3 157606 138.12 62.0 34 Sing Animation,Comedy,Family In a city of humanoid animals, a hustling thea... Christophe Lourdelet Matthew McConaughey,Reese Witherspoon, Seth Ma... 2016 108 7.2 60545 270.32 59.0 45 Suicide Squad Action,Adventure,Fantasy A secret government agency recruits some of th... David Ayer Will Smith, Jared Leto, Margot Robbie, Viola D... 2016 123 6.2 393727 325.02 40.0 ...... ... ... ... ... ... ... ... ... ... ... ... 995996 Secret in Their Eyes Crime,Drama,Mystery A tight-knit team of rising investigators, alo... Billy Ray Chiwetel Ejiofor, Nicole Kidman, Julia Roberts... 2015 111 6.2 27585 NaN 45.0 996997 Hostel: Part II Horror Three American college students studying abroa... Eli Roth Lauren German, Heather Matarazzo, Bijou Philli... 2007 94 5.5 73152 17.54 46.0 997998 Step Up 2: The Streets Drama,Music,Romance Romantic sparks occur between two dance studen... Jon M. Chu Robert Hoffman, Briana Evigan, Cassie Ventura,... 2008 98 6.2 70699 58.01 50.0 998999 Search Party Adventure,Comedy A pair of friends embark on a mission to reuni... Scot Armstrong Adam Pally, T.J. Miller, Thomas Middleditch,Sh... 2014 93 5.6 4881 NaN 22.0 9991000 Nine Lives Comedy,Family,Fantasy A stuffy businessman finds himself trapped ins... Barry Sonnenfeld Kevin Spacey, Jennifer Garner, Robbie Amell,Ch... 2016 87 5.3 12435 19.64 11.0
1000 rows × 12 columns
movie
[ "Rating" ] . mean
( )
6.723200000000003
movie
[ "Director" ]
0 James Gunn
1 Ridley Scott
2 M. Night Shyamalan
3 Christophe Lourdelet
4 David Ayer...
995 Billy Ray
996 Eli Roth
997 Jon M. Chu
998 Scot Armstrong
999 Barry Sonnenfeld
Name: Director, Length: 1000, dtype: object
np
. unique
( movie
[ "Director" ] )
array(['Aamir Khan', 'Abdellatif Kechiche', 'Adam Leon', 'Adam McKay','Adam Shankman', 'Adam Wingard', 'Afonso Poyart', 'Aisling Walsh','Akan Satayev', 'Akiva Schaffer', 'Alan Taylor', 'Albert Hughes','Alejandro Amenábar', 'Alejandro González I?árritu',...'Tomas Alfredson', 'Tony Gilroy', 'Tony Scott', 'Travis Knight','Tyler Shields', 'Wally Pfister', 'Walt Dohrn', 'Walter Hill','Warren Beatty', 'Werner Herzog', 'Wes Anderson', 'Wes Ball','Wes Craven', 'Whit Stillman', 'Will Gluck', 'Will Slocombe','William Brent Bell', 'William Oldroyd', 'Woody Allen','Xavier Dolan', 'Yimou Zhang', 'Yorgos Lanthimos', 'Zack Snyder','Zackary Adler'], dtype=object)
np
. unique
( movie
[ "Director" ] ) . size
644
movie
[ "Rating" ] . plot
( kind
= "hist" , figsize
= ( 20 , 8 ) , fontsize
= 40 )
<matplotlib.axes._subplots.AxesSubplot at 0x2471ce18708>
import matplotlib
. pyplot
as plt
plt
. figure
( figsize
= ( 20 , 8 ) , dpi
= 100 )
plt
. hist
( movie
[ "Rating" ] , 20 )
plt
. xticks
( np
. linspace
( movie
[ "Rating" ] . min ( ) , movie
[ "Rating" ] . max ( ) , 21 ) )
plt
. grid
( linestyle
= "--" , alpha
= 0.5 )
plt
. show
( )
movie
[ "Rating" ]
0 8.1
1 7.0
2 7.3
3 7.2
4 6.2...
995 6.2
996 5.5
997 6.2
998 5.6
999 5.3
Name: Rating, Length: 1000, dtype: float64
movie_genre
= [ i
. split
( "," ) for i
in movie
[ "Genre" ] ]
movie_genre
[['Action', 'Adventure', 'Sci-Fi'],['Adventure', 'Mystery', 'Sci-Fi'],['Horror', 'Thriller'],['Animation', 'Comedy', 'Family'],['Action', 'Adventure', 'Fantasy'],...['Horror'],['Drama', 'Music', 'Romance'],['Adventure', 'Comedy'],['Comedy', 'Family', 'Fantasy']]
[ j
for i
in movie_genre
for j
in i
]
['Action','Adventure','Sci-Fi','Adventure','Mystery','Sci-Fi',
...'Animation','Action','Adventure','Action','Adventure','Drama',...]
movie_class
= np
. unique
( [ j
for i
in movie_genre
for j
in i
] )
movie_class
array(['Action', 'Adventure', 'Animation', 'Biography', 'Comedy', 'Crime','Drama', 'Family', 'Fantasy', 'History', 'Horror', 'Music','Musical', 'Mystery', 'Romance', 'Sci-Fi', 'Sport', 'Thriller','War', 'Western'], dtype='<U9')
len ( movie_class
)
20
count
= pd
. DataFrame
( np
. zeros
( shape
= [ 1000 , 20 ] , dtype
= "int32" ) , columns
= movie_class
)
count
. head
( )
ActionAdventureAnimationBiographyComedyCrimeDramaFamilyFantasyHistoryHorrorMusicMusicalMysteryRomanceSci-FiSportThrillerWarWestern 00 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 10 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 20 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 30 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 40 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
count
. loc
[ 0 , movie_genre
[ 0 ] ]
Action 0
Adventure 0
Sci-Fi 0
Name: 0, dtype: int32
movie_genre
[ 0 ]
['Action', 'Adventure', 'Sci-Fi']
for i
in range ( 1000 ) : count
. loc
[ i
, movie_genre
[ i
] ] = 1
count
ActionAdventureAnimationBiographyComedyCrimeDramaFamilyFantasyHistoryHorrorMusicMusicalMysteryRomanceSci-FiSportThrillerWarWestern 01 1 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 10 1 0 0 0 0 0 0 0 0 0 0 0 1 0 1 0 0 0 0 20 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 1 0 0 30 0 1 0 1 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 41 1 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 ...... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... 9950 0 0 0 0 1 1 0 0 0 0 0 0 1 0 0 0 0 0 0 9960 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 9970 0 0 0 0 0 1 0 0 0 0 1 0 0 1 0 0 0 0 0 9980 1 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 9990 0 0 0 1 0 0 1 1 0 0 0 0 0 0 0 0 0 0 0
1000 rows × 20 columns
count
. sum ( axis
= 0 )
Action 303
Adventure 259
Animation 49
Biography 81
Comedy 279
Crime 150
Drama 513
Family 51
Fantasy 101
History 29
Horror 119
Music 16
Musical 5
Mystery 106
Romance 141
Sci-Fi 120
Sport 18
Thriller 195
War 13
Western 7
dtype: int64
count
. sum ( axis
= 0 ) . sort_values
( ascending
= False )
Drama 513
Action 303
Comedy 279
Adventure 259
Thriller 195
Crime 150
Romance 141
Sci-Fi 120
Horror 119
Mystery 106
Fantasy 101
Biography 81
Family 51
Animation 49
History 29
Sport 18
Music 16
War 13
Western 7
Musical 5
dtype: int64
count
. sum ( axis
= 0 ) . sort_values
( ascending
= False ) . plot
( kind
= "bar" , fontsize
= 20 , figsize
= ( 20 , 9 ) , colormap
= "cool" )
<matplotlib.axes._subplots.AxesSubplot at 0x2472450c1c8>
總結
以上是生活随笔 為你收集整理的九、Pandas高级处理 的全部內容,希望文章能夠幫你解決所遇到的問題。
如果覺得生活随笔 網站內容還不錯,歡迎將生活随笔 推薦給好友。