Statistics for Business and Economics (13th edition, Python): notes on Chapters 1-2
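Contents
Chapter 1 Data and Statistics
1.1 Applications in Business and Economics
1.2 Data
1.3 Data Sources
1.4 Descriptive Statistics
1.5 Statistical Inference
1.6 Analytics
1.7 Big Data and Data Mining
1.8 Computers and Statistical Analysis
1.9 Ethical Guidelines for Statistical Practice
Chapter 2 Descriptive Statistics 1: Tabular and Graphical Displays
2.1 Summarizing Data for a Categorical Variable
Bar chart and examples (bar chart)
Pie chart and examples (pie chart)
2.2 Summarizing Data for a Quantitative Variable
Single variable: dot plot
Single variable: histogram
Single variable: cumulative distribution (displot)
Single variable: stem-and-leaf display
2.3 Summarizing Data for Two Variables Using Tables
2.4 Summarizing Data for Two Variables Using Graphical Displays
Scatter diagram and trendline
Side-by-side bar chart and stacked bar chart
2.5 Data Visualization: Best Practices in Creating Effective Graphical Displays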
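When I first read this book I already had a foundation from university courses, so I focused on the technical content and skipped over the seemingly simple basics. This is probably a common failing among beginners: always looking at the practical material and neglecting fundamentals. Doing so does help sustain interest and carries a newcomer through the entry stage, after which the basics can be relearned through practice. Still, it is better to recognize the importance of the fundamentals on the first pass and master as much of them as possible. The best way to do that is to work the exercises.
My original goal was to learn data analysis. Yet when practitioners said that the most important knowledge in data analysis is descriptive statistics, I realized I had filed it away as shallow material and swallowed it without digesting it.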
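Chapter 1 Data and Statistics
1.1 Applications in Business and Economics
Accounting, finance, marketing, production, economics, information systems
1.2 Data
Data, data set, elements, variables, observations, categorical data, categorical variable, quantitative data, quantitative variable, cross-sectional data, time series data
1.2.2 Scales of measurement
Nominal, ordinal, interval, and ratio scales: each scale in this order has all the properties of the ones before it. On an ordinal scale, addition and subtraction are meaningless; on an interval scale, multiplication and division are meaningless. Only the interval and ratio scales have units of measurement.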
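The difference between the ordinal, interval, and ratio scales can be sketched in a few lines of pandas. This is only an illustration: the variable names and values below are made up and do not come from the book.

```python
import pandas as pd

# Hypothetical data, invented for illustration only.

# Ordinal scale: the categories have an order, so ranking and comparison work,
# but adding or subtracting ratings has no meaning.
rating = pd.Categorical(
    ["good", "excellent", "very good", "good"],
    categories=["good", "very good", "excellent"],
    ordered=True,
)
print(rating.min(), rating.max())
print(rating > "good")

# Interval scale: differences are meaningful (20 C is 10 degrees warmer than 10 C),
# but ratios are not (20 C is not "twice as hot" as 10 C).
temps_c = pd.Series([10.0, 20.0])
temp_difference = temps_c.iloc[1] - temps_c.iloc[0]
print(temp_difference)

# Ratio scale: zero means "none", so ratios are meaningful as well.
weights_kg = pd.Series([10.0, 20.0])
weight_ratio = weights_kg.iloc[1] / weights_kg.iloc[0]
print(weight_ratio)
```

Marking the categorical as ordered is what lets pandas compare and rank the ratings; an unordered categorical would correspond to the nominal scale.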
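1.3 Data Sources
Sources include existing sources, observational studies, and experiments. Points to watch: time and cost, and data acquisition errors.
1.4 Descriptive Statistics
Statistical methods that summarize data in tabular, graphical, or numerical form
1.5 Statistical Inference
Population, sample, census, sample survey. A major contribution of statistics is using sample data to make estimates and test hypotheses about the characteristics of a population. This is statistical inference.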
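As a rough illustration of estimating a population characteristic from a sample, the sketch below simulates a population, draws a sample, and computes a point estimate of the mean with an approximate 95% interval. The population parameters, the sample size, and the use of the normal critical value 1.96 are all assumptions made for this example.

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulated "population" (made up for illustration): 100,000 values
population = rng.normal(loc=50, scale=10, size=100_000)

# A sample survey: 200 elements drawn without replacement
sample = rng.choice(population, size=200, replace=False)

# Point estimate of the population mean and its standard error
point_estimate = sample.mean()
std_error = sample.std(ddof=1) / np.sqrt(len(sample))

# Approximate 95% confidence interval using the normal critical value 1.96
ci = (point_estimate - 1.96 * std_error, point_estimate + 1.96 * std_error)

print("sample mean:", round(point_estimate, 2))
print("approximate 95% interval:", tuple(round(v, 2) for v in ci))
print("population mean (normally unknown):", round(population.mean(), 2))
```

In practice the population mean is unknown; it is printed here only so the estimate can be compared against it.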
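1.6 Analytics
Analytics includes:
Descriptive analytics: analysis of past data, BI, or retrospectives.
Predictive analytics: forecasting, or identifying the effect of one variable on another.
Prescriptive analytics: the set of analytical techniques that yields a best course of action, that is, guidance for action under real-world constraints.
1.7 Big Data and Data Mining
Big data: volume, velocity, variety (the 3 Vs).
Data mining: automatically extracting predictive information from very large databases.
1.8 Computers and Statistical Analysis
1.9 Ethical Guidelines for Statistical Practice
Statistics is the art and science of collecting, analyzing, presenting, and interpreting data.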
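Chapter 2 Descriptive Statistics 1: Tabular and Graphical Displays
2.1 Summarizing Data for a Categorical Variable
Frequency distribution, relative frequency distribution, percent frequency distribution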
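The three distributions named above can be computed directly with pandas. The following is a minimal sketch on made-up data; the soft drink names and counts are invented for illustration.

```python
import pandas as pd

# Hypothetical categorical data: ten purchases (invented for illustration)
purchases = pd.Series(
    ["Coke", "Pepsi", "Coke", "Sprite", "Coke",
     "Pepsi", "Coke", "Sprite", "Pepsi", "Coke"]
)

# Frequency: number of observations in each category
freq = purchases.value_counts()

# Relative frequency: frequency divided by the number of observations
rel = purchases.value_counts(normalize=True)

# Percent frequency: relative frequency times 100
table = pd.DataFrame({
    "frequency": freq,
    "relative frequency": rel,
    "percent frequency": rel * 100,
})
print(table)
```

The relative frequencies always sum to 1 and the percent frequencies to 100, which is a quick sanity check on any such table.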
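Bar chart and examples (bar chart)
A bar chart depicts a frequency, relative frequency, or percent frequency distribution. For a categorical variable the bars should be separated by gaps. See matplotlib.pyplot.bar and its examples. Basic usage: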
from matplotlib import pyplot as plt

x, y, x2, y2 = [5, 8, 10], [12, 16, 6], [6, 9, 11], [6, 15, 7]
plt.bar(x, y, align='center')
plt.bar(x2, y2, color='g', align='center')
plt.title('Bar graph')
plt.ylabel('Y axis')
plt.xlabel('X axis')
plt.show()
Bar chart in polar coordinates:
import numpy as np
import matplotlib.pyplot as plt

# Fix the random state for reproducibility
np.random.seed(19680801)

# Angle, height, and width of each bar
N = 20
theta = np.linspace(0.0, 2 * np.pi, N, endpoint=False)
radii = 10 * np.random.rand(N)
width = np.pi / 4 * np.random.rand(N)
colors = plt.cm.viridis(radii / 10.)

ax = plt.subplot(111, projection='polar')
ax.bar(theta, radii, width=width, bottom=0.0, color=colors, alpha=0.5)
plt.show()
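seaborn.barplot (see its examples) is much simpler:
import seaborn as sns

tips = sns.load_dataset("tips")  # seaborn's sample restaurant-tips dataset
ax = sns.barplot(x="day", y="total_bill", hue="sex", data=tips)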
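Pie chart and examples (pie chart)
A pie chart depicts a relative frequency or percent frequency distribution. People judge differences in length better than differences in angle, so it is best to label the proportions. See matplotlib.pyplot.pie and its examples. Three examples I like, with their code: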
import matplotlib.pyplot as plt

# Slices are ordered and plotted counter-clockwise
labels = 'Frogs', 'Hogs', 'Dogs', 'Logs'
sizes = [15, 30, 45, 10]
explode = (0, 0.1, 0, 0)  # pull out only the second slice ('Hogs')

fig1, ax1 = plt.subplots()
ax1.pie(sizes, explode=explode, labels=labels, autopct='%1.1f%%', shadow=True, startangle=90)
ax1.axis('equal')  # equal aspect ratio so the pie is drawn as a circle
plt.show()
import numpy as np
import matplotlib.pyplot as plt

fig, ax = plt.subplots(figsize=(6, 3), subplot_kw=dict(aspect="equal"))

recipe = ["375 g flour", "75 g sugar", "250 g butter", "300 g berries"]
data = [float(x.split()[0]) for x in recipe]
ingredients = [x.split()[-1] for x in recipe]

def func(pct, allvals):
    # Convert a wedge's percentage back to an absolute amount in grams
    absolute = int(pct / 100. * np.sum(allvals))
    return "{:.1f}%\n({:d} g)".format(pct, absolute)

wedges, texts, autotexts = ax.pie(data, autopct=lambda pct: func(pct, data), textprops=dict(color="w"))
ax.legend(wedges, ingredients, title="Ingredients", loc="center left", bbox_to_anchor=(1, 0, 0.5, 1))
plt.setp(autotexts, size=8, weight="bold")
ax.set_title("Matplotlib bakery: A pie")
plt.show()
import numpy as np
import matplotlib.pyplot as plt

fig, ax = plt.subplots(figsize=(6, 3), subplot_kw=dict(aspect="equal"))

recipe = ["225 g flour", "90 g sugar", "1 egg", "60 g butter", "100 ml milk", "1/2 package of yeast"]
data = [225, 90, 50, 60, 100, 5]

# A wedge width below 1 turns the pie into a donut
wedges, texts = ax.pie(data, wedgeprops=dict(width=0.5), startangle=-40)

bbox_props = dict(boxstyle="square,pad=0.3", fc="w", ec="k", lw=0.72)
kw = dict(arrowprops=dict(arrowstyle="-"), bbox=bbox_props, zorder=0, va="center")

for i, p in enumerate(wedges):
    # Angle at the middle of the wedge, and the matching point on the unit circle
    ang = (p.theta2 - p.theta1) / 2. + p.theta1
    y = np.sin(np.deg2rad(ang))
    x = np.cos(np.deg2rad(ang))
    horizontalalignment = {-1: "right", 1: "left"}[int(np.sign(x))]
    connectionstyle = "angle,angleA=0,angleB={}".format(ang)
    kw["arrowprops"].update({"connectionstyle": connectionstyle})
    ax.annotate(recipe[i], xy=(x, y), xytext=(1.35 * np.sign(x), 1.4 * y), horizontalalignment=horizontalalignment, **kw)

ax.set_title("Matplotlib bakery: A donut")
plt.show()
For plotting from pandas, a single function, DataFrame.plot, should be enough. Its parameters, as documented when these notes were written (newer pandas releases may differ):
DataFrame.plot(x=None, y=None, kind='line', ax=None, subplots=False, sharex=None, sharey=False,
               layout=None, figsize=None, use_index=True, title=None, grid=None, legend=True,
               style=None, logx=False, logy=False, loglog=False, xticks=None, yticks=None,
               xlim=None, ylim=None, rot=None, xerr=None, secondary_y=False, sort_columns=False, **kwds)
More examples: the Matplotlib examples gallery and the Seaborn example gallery.
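2.2 Summarizing Data for a Quantitative Variable
Number of classes, class width, class limits, class midpoint, relative frequency distribution, percent frequency distribution, cumulative frequency distribution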
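A small sketch of how these quantities fit together, assuming 50 simulated integers and five classes of width 20 (both choices are arbitrary and only for illustration). A common rule of thumb for the approximate class width is (largest value - smallest value) / number of classes, rounded to a convenient number.

```python
import numpy as np
import pandas as pd

# Simulated data, for illustration only
np.random.seed(1900)
x = np.random.randint(1, 99, size=50)

# Rule-of-thumb class width for 5 classes, before rounding to a convenient value
n_classes = 5
approx_width = (x.max() - x.min()) / n_classes

# Class limits chosen by hand: 5 classes of width 20.
# np.histogram treats each class as [lower, upper), except the last, which is closed.
edges = np.array([0, 20, 40, 60, 80, 100])
freq, _ = np.histogram(x, bins=edges)

table = pd.DataFrame({
    "class": [f"{lo}-{hi}" for lo, hi in zip(edges[:-1], edges[1:])],
    "midpoint": (edges[:-1] + edges[1:]) / 2,
    "frequency": freq,
    "relative frequency": freq / freq.sum(),
    "percent frequency": freq / freq.sum() * 100,
    "cumulative frequency": freq.cumsum(),
})
print("approximate class width:", approx_width)
print(table)
```

The last entry of the cumulative frequency column always equals the number of observations, which makes it an easy check that no value fell outside the class limits.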
Single variable: dot plot
Simulated with matplotlib.pyplot.scatter and seaborn.swarmplot
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

fig, ax = plt.subplots(1, 2, figsize=(12, 2))

np.random.seed(1900)
x = np.random.randint(1, 99, size=20)
data = pd.DataFrame(x, columns=['x'])

# Stack repeated values: y is the running count of each x value
# (1 for its first occurrence, 2 for its second, and so on)
data['y'] = data.groupby('x').cumcount() + 1

# Left: dot plot built from a scatter plot
ax[0].scatter(data['x'], data['y'])
ax[0].tick_params(axis='both', which='major')

# Right: seaborn swarm plot of the same data
sns.swarmplot(x="x", y="y", data=data, ax=ax[1])
plt.show()
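Single variable: histogram
The principle is the same as for a bar chart, except that the quantitative variable is first grouped into classes and the bars have no gaps between them.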
from matplotlib import pyplot as plt
import numpy as np

np.random.seed(1900)
x = np.random.randint(1, 99, size=50)
# Five classes of width 20
plt.hist(x, bins=[0, 20, 40, 60, 80, 100])
plt.show()
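Single variable: cumulative distribution (displot)
With matplotlib, a cumulative distribution means computing the cumulative counts yourself (or passing cumulative=True to hist). seaborn can draw four views of one distribution in a single figure, as in its "Distribution plot options" example. That example was written with seaborn.distplot, which has since been deprecated, so the version here uses histplot, kdeplot, and rugplot.
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

sns.set(style="white", palette="muted", color_codes=True)
rs = np.random.RandomState(10)

# Set up a 2x2 grid of axes sharing the x axis
f, axes = plt.subplots(2, 2, figsize=(7, 7), sharex=True)
sns.despine(left=True)

# Generate a random univariate dataset
d = rs.normal(size=100)

# Plain histogram
sns.histplot(d, color="b", ax=axes[0, 0])
# Kernel density estimate with a rug plot
sns.kdeplot(d, color="r", ax=axes[0, 1])
sns.rugplot(d, color="r", ax=axes[0, 1])
# Filled kernel density estimate
sns.kdeplot(d, color="g", fill=True, ax=axes[1, 0])
# Histogram on a density scale with a kernel density estimate
sns.histplot(d, kde=True, stat="density", color="m", ax=axes[1, 1])

plt.setp(axes, yticks=[])
plt.tight_layout()
plt.show()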
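The heading above mentions cumulative distributions, but the example draws ordinary histograms and density curves. A minimal sketch of a cumulative frequency histogram follows, assuming the same kind of simulated data (seed 1900, 50 integers) and class limits used in the histogram section; the axis labels and title are my own.

```python
import numpy as np
import matplotlib.pyplot as plt

# Simulated data, for illustration only
np.random.seed(1900)
x = np.random.randint(1, 99, size=50)
edges = [0, 20, 40, 60, 80, 100]

fig, ax = plt.subplots(figsize=(6, 3))

# cumulative=True makes each bar show the count of values up to and including its class
cum_counts, _, _ = ax.hist(x, bins=edges, cumulative=True, edgecolor="white")

ax.set_title("Cumulative frequency histogram")
ax.set_xlabel("x")
ax.set_ylabel("Cumulative frequency")
plt.show()

print(cum_counts)
```

The returned array holds the cumulative counts per class, so the last value equals the number of observations.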
Single variable: stem-and-leaf display
I have not found a library for stem-and-leaf displays yet, so this one is implemented by hand. Output:
0 | 6 9 8 4
1 | 6 3 7 3 6 1 2
2 | 5 5 9 2
3 | 2 8 0 4
4 | 9 9
5 | 1 5 2 4 9 8 6
6 | 3 6 2
7 | 3 2 1 2
8 | 9 4 1 3 0 7 7 1 9 3 1
9 | 6 2 7 8
import numpy as np

np.random.seed(2019)
data = np.random.randint(1, 99, size=50)

# Stems are the tens digits; sort them so the rows print in order
stems = sorted({x // 10 for x in data})
for m in stems:
    # Leaves are the units digits of the values that share this stem
    leaves = [n % 10 for n in data if n // 10 == m]
    print(m, '|', *leaves)
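2.3 Summarizing Data for Two Variables Using Tables
Simpson's paradox: conclusions drawn from aggregated data and from unaggregated data are reversed. (The cause is that the groups in the unaggregated data carry unequal weights.)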
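A small numeric sketch may make the paradox concrete. All numbers below are invented for illustration: option A has the higher success rate within both the easy and the hard cases, yet the lower rate once the cases are combined, because A handled mostly hard cases and B mostly easy ones.

```python
import pandas as pd

# Hypothetical counts, invented for illustration only
df = pd.DataFrame({
    "option":    ["A", "A", "B", "B"],
    "case":      ["easy", "hard", "easy", "hard"],
    "successes": [90, 300, 800, 20],
    "trials":    [100, 1000, 1000, 100],
})

# Unaggregated view: success rate within each type of case
df["rate"] = df["successes"] / df["trials"]
by_case = df.pivot(index="case", columns="option", values="rate")
print(by_case)

# Aggregated view: pool the counts over both types of case
totals = df.groupby("option")[["successes", "trials"]].sum()
overall = totals["successes"] / totals["trials"]
print(overall)
```

Within each row of the first table A beats B, while the pooled rates point the other way. The unequal number of trials per group is what drives the reversal.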
Crosstabulation
A simulation of the table in the book using pandas.crosstab:
import numpy as np
import pandas as pd

np.random.seed(900)
y = np.random.randint(0, 3, size=300)    # quality rating codes 0-2
z = np.random.randint(11, 49, size=300)  # meal prices from 11 to 48
data = pd.DataFrame({'Quality rating': y, 'Meal price': z})

# Map the codes to labels and assign the result back to the column
data['Quality rating'] = data['Quality rating'].map({0: 'Good', 1: 'Very good', 2: 'Excellent'})

# Group the meal prices into four classes
bins = [10, 19, 29, 39, 49]
price_class = pd.cut(data['Meal price'], bins, labels=['10~19', '20~29', '30~39', '40~49'])
data['Meal price'] = price_class

# Crosstabulation with row and column totals
print(pd.crosstab(data['Quality rating'], data['Meal price'], margins=True, margins_name='Total'))
2.4 Summarizing Data for Two Variables Using Graphical Displays
Scatter diagram and trendline
A good-looking scatter plot (in matplotlib, the trendline is fitted with numpy.polyfit):
import matplotlib.pyplot as plt
import numpy as np

# Fix the random state for reproducibility
np.random.seed(19680801)

x = np.arange(0.0, 50.0, 2.0)
y = x ** 1.3 + np.random.rand(*x.shape) * 30.0
s = np.random.rand(*x.shape) * 800 + 500  # marker sizes
colors = np.random.rand(*x.shape)

plt.figure(figsize=(12, 6))
plt.scatter(x, y, s, c=colors, alpha=0.5, marker=r'$\clubsuit$', label="Luck")

# Trendline: least-squares fit of a degree-1 polynomial
p1 = np.poly1d(np.polyfit(x, y, 1))
l1 = plt.plot(x, p1(x), 'r--', label='trendline')

plt.xlabel("Leprechauns")
plt.ylabel("Gold")
plt.legend(loc='upper left')
plt.show()
With seaborn the result can be even fancier (sns.jointplot takes up too much space, so it is not drawn here):
import seaborn as sns
import matplotlib.pyplot as plt

sns.set()
fig, axes = plt.subplots(2, 2, figsize=(12, 6))
tips = sns.load_dataset("tips")

sns.scatterplot(x="total_bill", y="tip", hue="time", data=tips, ax=axes[0, 0])
sns.residplot(x="total_bill", y="tip", data=tips, ax=axes[0, 1])
sns.regplot(x="size", y="total_bill", data=tips, x_jitter=.1, ax=axes[1, 1])

# lmplot is a figure-level function: it draws on a new figure of its own
sns.lmplot(x="size", y="total_bill", hue="day", col="day", data=tips, height=6, aspect=.4, x_jitter=.1)
plt.show()
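Side-by-side bar chart and stacked bar chart
Building these composite charts in matplotlib is somewhat involved; see the gallery examples "Stacked Bar Graph", "Grouped bar chart with labels", and "Discrete distribution as horizontal bar chart". First, plotting with pandas: the same simulated table as in 2.3, this time aggregated with groupby, then given a total row and transposed.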
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

pd.set_option('display.precision', 1)

np.random.seed(900)
y = np.random.randint(0, 3, size=300)    # quality rating codes 0-2
z = np.random.randint(11, 49, size=300)  # meal prices from 11 to 48
data = pd.DataFrame({'Quality rating': y, 'Meal price': z})
data['Quality rating'] = data['Quality rating'].map({0: 'Good', 1: 'Very good', 2: 'Excellent'})

bins = [10, 19, 29, 39, 49]
price_class = pd.cut(data['Meal price'], bins, labels=['10~19', '20~29', '30~39', '40~49'])

# Count meals per (quality rating, price class), with the price classes as columns
df = data.groupby(['Quality rating', price_class], observed=False)['Meal price'].count().unstack()

# Convert each price-class column to percentages
df = df.apply(lambda x: x / x.sum() * 100)

# Table with a total row (each column sums to 100)
table = df.copy()
table.loc['Total'] = df.sum()
print(table)

# Plot without the total row so that each stacked bar adds up to 100%
df.T.plot(kind='bar', stacked=True)
plt.show()
Grouped bar charts: with seaborn, less code and more charts:
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

sns.set(style="darkgrid")
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(12, 6))
tips = sns.load_dataset("tips")

sns.countplot(y="day", hue="sex", data=tips, ax=ax1)
sns.barplot(x="day", y="total_bill", data=tips, ax=ax2)

# catplot and FacetGrid are figure-level: each creates its own figure
sns.catplot(x="sex", y="total_bill", hue="smoker", col="time", data=tips, kind="bar", height=4, aspect=.7)

g = sns.FacetGrid(tips, row="sex", col="time", margin_titles=True)
bins = np.linspace(0, 60, 13)
g.map(plt.hist, "total_bill", color="steelblue", bins=bins)
plt.show()
Stacked bar chart:
import seaborn as sns
import matplotlib.pyplot as plt

sns.set(style="whitegrid")
f, ax = plt.subplots(figsize=(15, 6))
crashes = sns.load_dataset("car_crashes").sort_values("total", ascending=False)

# Total crashes, drawn first in a light color
sns.set_color_codes("pastel")
sns.barplot(y="total", x="abbrev", data=crashes, label="Total", color="b", ax=ax)

# Alcohol-involved crashes, drawn on top in a darker shade of the same color
sns.set_color_codes("muted")
sns.barplot(y="alcohol", x="abbrev", data=crashes, label="Alcohol-involved", color="b", ax=ax)

# The bars are vertical, so the value limits and the label belong on the y axis
ax.legend(ncol=2, loc="upper right", frameon=True)
ax.set(ylim=(0, 24), xlabel="", ylabel="Automobile collisions per billion miles")
sns.despine(left=True, bottom=True)
plt.show()
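2.5 Data Visualization: Best Practices in Creating Effective Graphical Displays
Creating effective graphical displays
1. Give the display a clear and concise title.
2. Keep the display simple. Do not use three dimensions when two dimensions are sufficient.
3. Clearly label each axis and provide the units of measure.
4. If color is used to distinguish categories, make sure the colors are distinct.
5. If multiple colors or line types are used, identify them with a legend and place the legend close to the data it describes.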
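The five guidelines above can be applied in one short matplotlib figure. This is a sketch only: the products, quarters, and sales numbers are invented for illustration.

```python
import matplotlib.pyplot as plt

# Hypothetical data, invented for illustration only
quarters = ["Q1", "Q2", "Q3", "Q4"]
sales_a = [12, 15, 14, 18]
sales_b = [9, 11, 13, 12]

# Guideline 2: a plain two-dimensional chart
fig, ax = plt.subplots(figsize=(6, 3.5))

# Guideline 4: clearly different colors (and markers) for the two categories
ax.plot(quarters, sales_a, color="tab:blue", marker="o", label="Product A")
ax.plot(quarters, sales_b, color="tab:orange", marker="s", label="Product B")

# Guideline 1: a clear, concise title
ax.set_title("Quarterly sales by product")

# Guideline 3: labeled axes, with units
ax.set_xlabel("Quarter")
ax.set_ylabel("Sales (thousand units)")

# Guideline 5: a legend placed inside the axes, next to the lines
ax.legend(loc="upper left")

plt.show()
```

With only two series, labeling each line directly at its end point would be an alternative to the legend.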