
Python Data Analysis and Machine Learning (NumPy, Pandas, Matplotlib)

Published: 2024/7/5 python 23 豆豆


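How should you learn machine learning?

Machine learning combines mathematical derivation with practical application skills, so you need to understand both how an algorithm is derived and how to apply it.
Deep learning extends the neural-network algorithms of machine learning and is especially effective in computer vision and natural language processing.
Take your own notes from scratch.

How do you get hands-on with machine learning, and where do you find examples?

The best resources: GitHub and Kaggle.
Collecting worked examples pays off, because you rarely write a project from scratch. Learn to imitate first, then create.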

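The scientific computing library NumPy

NumPy (Numerical Python extensions) is a third-party Python package for scientific computing. Its predecessor was an array-computation library whose development began in 1995. After many years of development it has become the foundation of most scientific computing in Python, including all deep learning frameworks that offer a Python interface.
numpy.genfromtxt
Loads data from a text file and handles missing values as specified.

delimiter: the string used to separate values. It can be a str, an int, or a sequence. By default, any run of consecutive whitespace acts as the delimiter.
dtype: the data type of the resulting array. If None, the dtype of each column is determined individually from its contents.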

import numpy

world_alcohol = numpy.genfromtxt("world_alcohol.txt", delimiter=",", dtype=str)
print(type(world_alcohol))
print(world_alcohol)
help(numpy.genfromtxt)  # use help() to look up how numpy.genfromtxt works

Output:
<class 'numpy.ndarray'>  # every NumPy array is an ndarray
[['Year' 'WHO region' 'Country' 'Beverage Types' 'Display Value']
 ['1986' 'Western Pacific' 'Viet Nam' 'Wine' '0']
 ['1986' 'Americas' 'Uruguay' 'Other' '0.5']
 ...,
 ['1987' 'Africa' 'Malawi' 'Other' '0.75']
 ['1989' 'Americas' 'Bahamas' 'Wine' '1.5']
 ['1985' 'Africa' 'Malawi' 'Spirits' '0.31']]
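
To see how genfromtxt treats missing values, here is a minimal sketch. It assumes a small in-memory CSV (the text and the fill value -1.0 are made up for illustration) in place of world_alcohol.txt, so it runs without any file on disk.

```python
import io
import numpy as np

# Hypothetical CSV text standing in for a file; the 1987 value is missing.
csv_text = "Year,Value\n1986,0.5\n1987,\n1989,1.5\n"

data = np.genfromtxt(
    io.StringIO(csv_text),
    delimiter=",",
    skip_header=1,        # skip the "Year,Value" header row
    filling_values=-1.0,  # value substituted for missing fields
)
print(data)
```

Without filling_values, the missing field would be read as nan when the dtype is float.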

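numpy.array
Creates a vector or a matrix (multi-dimensional array)

import numpy as np
a = [1, 2, 4, 3]  # vector
b = np.array(a)   # array([1, 2, 4, 3])
type(b)           # <class 'numpy.ndarray'>

Operations on array elements (1)

b.shape     # (4,): (rows, columns) for a matrix, or the number of elements for a vector
b.argmax()  # 2: index of the maximum value
b.max()     # 4: maximum
b.min()     # 1: minimum
b.mean()    # 2.5: mean

NumPy requires all elements of a numpy.array to have the same data type. The dtype attribute returns the data type of the array's elements.

>>> a = [1, 2, 3, 5]
>>> b = np.array(a)
>>> b.dtype
dtype('int64')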

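Operations on array elements (2)

c = [[1, 2], [3, 4]]  # 2-D list
d = np.array(c)       # 2-D NumPy array
d.shape               # (2, 2)
d[1, 1]               # 4: index by (row, column)
d.size                # 4: number of elements in the array
d.max(axis=0)         # maximum along axis 0 (the first dimension, i.e. per column): array([3, 4])
d.max(axis=1)         # maximum along axis 1 (the second dimension, i.e. per row): array([2, 4])
d.mean(axis=0)        # mean along axis 0 (per column): array([2., 3.])
d.flatten()           # flatten a NumPy array into a 1-D array: array([1, 2, 3, 4])
np.ravel(c)           # flatten any array-like structure into a 1-D array: array([1, 2, 3, 4])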

Operations on array elements (3)

import numpy as np
matrix = np.array([[5, 10, 15], [20, 25, 30], [35, 40, 45]])
print(matrix.sum(axis=1))  # axis=1: sum each row
Output:
[ 30  75 120]

print(matrix.sum(axis=0))  # axis=0: sum each column
Output:
[60 75 90]
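
The axis argument is easy to mix up, so here is a small sketch on a non-square array, where the shapes make the direction visible. The array values are illustrative only.

```python
import numpy as np

# Illustrative 2x3 array (not from the notes above).
m = np.array([[1, 2, 3],
              [4, 5, 6]])

col_sums = m.sum(axis=0)  # collapses the 2 rows    -> shape (3,)
row_sums = m.sum(axis=1)  # collapses the 3 columns -> shape (2,)
row_sums_2d = m.sum(axis=1, keepdims=True)  # keeps the reduced axis -> shape (2, 1)

print(col_sums, row_sums, row_sums_2d.shape)
```

A useful rule of thumb: the axis you pass is the one that disappears from the shape.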

Slicing also works on matrices

import numpy as np
vector = [1, 2, 4, 3]
print(vector[0:3])  # [1, 2, 4]: all elements with index >= 0 and < 3
matrix = np.array([[5, 10, 15], [20, 25, 30], [35, 40, 45]])
print(matrix[:, 1])    # [10 25 40]: the second column (index 1) of every row
print(matrix[:, 0:2])  # the first and second columns of every row
# [[ 5 10]
#  [20 25]
#  [35 40]]

A comparison on an array is applied to every element of the array

import numpy as np
matrix = np.array([[5, 10, 15], [20, 25, 30], [35, 40, 45]])
print(matrix == 25)
Output:
[[False False False]
 [False  True False]
 [False False False]]

second_column_25 = matrix[:, 1] == 25
print(second_column_25)
print(matrix[second_column_25, :])  # a boolean array can also be used as an index
Output:
[False  True False]
[[20 25 30]]

Element-wise AND and OR on arrays

import numpy as np
vector = np.array([5, 10, 15, 20])
equal_to_ten_and_five = (vector == 10) & (vector == 5)
print(equal_to_ten_and_five)
Output:
[False False False False]

vector = np.array([5, 10, 15, 20])
equal_to_ten_or_five = (vector == 10) | (vector == 5)
print(equal_to_ten_or_five)
vector[equal_to_ten_or_five] = 50  # with a boolean index, only the True positions are selected
print(vector)
Output:
[ True  True False False]
[50 50 15 20]
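
Combined masks are most useful for filtering whole rows. The sketch below reuses the 3x3 matrix from these notes; the thresholds are arbitrary choices for illustration.

```python
import numpy as np

matrix = np.array([[5, 10, 15],
                   [20, 25, 30],
                   [35, 40, 45]])

# Rows whose first column is greater than 10 AND whose last column is less than 40.
# The parentheses are required because & binds more tightly than the comparisons.
mask = (matrix[:, 0] > 10) & (matrix[:, 2] < 40)
selected = matrix[mask]

# ~ negates a boolean mask, giving the remaining rows.
others = matrix[~mask]
print(selected)
print(others)
```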

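Converting the element type of an array

import numpy as np
vector = np.array(['5', '10', '15', '20'])  # numeric strings; non-numeric strings would raise ValueError
vector = vector.astype(float)  # astype converts the type of every value in the vector
print(vector.dtype)
print(vector)
Output:
float64
[ 5. 10. 15. 20.]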

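Commonly used NumPy functions

reshape: change the dimensions of a matrix

import numpy as np
print(np.arange(15))
a = np.arange(15).reshape(3, 5)  # turn the vector into a matrix with 3 rows and 5 columns
print(a)
print(a.shape)  # the shape attribute gives (rows, columns)

Output: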
[ 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14]
[[ 0 1 2 3 4]
[ 5 6 7 8 9]
[10 11 12 13 14]]
(3, 5)

Initializing a matrix with zeros or ones

>>> import numpy as np
>>> np.zeros((3, 4))  # a 3x4 matrix initialized to 0
Output:
array([[ 0.,  0.,  0.,  0.],
       [ 0.,  0.,  0.,  0.],
       [ 0.,  0.,  0.,  0.]])

>>> import numpy as np
>>> np.ones((3, 4), dtype=np.int32)  # specify an integer type
Output:
array([[1, 1, 1, 1],
[1, 1, 1, 1],
[1, 1, 1, 1]], dtype=int32)

Building sequences

np.arange(10, 30, 5)  # start at 10, stop before 30, step 5
Output:
array([10, 15, 20, 25])

np.arange(0, 2, 0.3)
Output:
array([ 0. , 0.3, 0.6, 0.9, 1.2, 1.5, 1.8])

The random module

np.random.random((2, 3))  # the random function of the random module: a 2x3 matrix of random values in the half-open interval [0.0, 1.0)
Output:
array([[ 0.40130659,  0.45452825,  0.79776512],
       [ 0.63220592,  0.74591134,  0.64130737]])

The linspace function: returns x evenly spaced values between the start value and the end value, both endpoints included

from numpy import pi
np.linspace(0, 2*pi, 100)
Output (middle values elided):
array([ 0.        ,  0.06346652,  0.12693304,  0.19039955,  0.25386607,
        ...,
        6.02931923,  6.09278575,  6.15625227,  6.21971879,  6.28318531])
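
linspace and arange differ in what you specify (number of points versus step) and in whether the end value is included. A short sketch with illustrative values:

```python
import numpy as np

# linspace takes the number of points; the stop value is included by default.
points = np.linspace(0.0, 1.0, 5)

# With endpoint=False the stop value is excluded, as it is with arange.
open_points = np.linspace(0.0, 1.0, 5, endpoint=False)

# retstep=True also returns the spacing between consecutive points.
_, step = np.linspace(0.0, 1.0, 5, retstep=True)
print(points, open_points, step)
```

For non-integer steps, linspace is usually the safer choice, because the number of elements arange produces can be affected by floating-point rounding.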

Arithmetic on arrays is applied element by element

import numpy as np
a = np.array([20, 30, 40, 50])
b = np.arange(4)  # [0 1 2 3]
c = a - b
print(c)       # [20 29 38 47]
print(b**2)    # [0 1 4 9]
print(a < 35)  # [ True  True False False]

Matrix multiplication

A = np.array([[1, 1], [0, 1]])
B = np.array([[2, 0], [3, 4]])
print(A.dot(B))      # matrix product, method 1
print(np.dot(A, B))  # matrix product, method 2
Output:
[[5 4]
 [3 4]]
[[5 4]
 [3 4]]
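
A common slip is to use * when a matrix product is intended. The sketch below contrasts the two on the same A and B; the @ operator is a third spelling of the matrix product that the notes do not cover.

```python
import numpy as np

A = np.array([[1, 1], [0, 1]])
B = np.array([[2, 0], [3, 4]])

elementwise = A * B   # multiplies matching entries
product = A @ B       # matrix product, same result as A.dot(B)
print(elementwise)
print(product)
```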

Exponential (base e) and square root

import numpy as np
B = np.arange(3)
print(np.exp(B))   # [ 1.          2.71828183  7.3890561 ]: e raised to each element of B
print(np.sqrt(B))  # [ 0.          1.          1.41421356]

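floor: rounding down

import numpy as np
a = np.floor(10 * np.random.random((3, 4)))  # floor rounds down
print(a)
print(a.ravel())  # flatten the matrix into a single row
a.shape = (6, 2)  # a.reshape(6, -1) would also work: -1 lets NumPy infer the number of columns from the number of rows
print(a)
print(a.T)  # transpose of a (rows and columns swapped)

Output (random, so the values differ between runs):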
[[ 8. 7. 2. 1.]
[ 5. 2. 5. 1.]
[ 8. 7. 7. 2.]]
[ 8. 7. 2. 1. 5. 2. 5. 1. 8. 7. 7. 2.]
[[ 8. 7.]
[ 2. 1.]
[ 5. 2.]
[ 5. 1.]
[ 8. 7.]
[ 7. 2.]]
[[ 8. 2. 5. 5. 8. 7.]
[ 7. 1. 2. 1. 7. 2.]]

hstack and vstack concatenate matrices (commonly used to join data)

a = np.floor(10 * np.random.random((2, 2)))
b = np.floor(10 * np.random.random((2, 2)))
print(a)
print(b)
print(np.hstack((a, b)))  # concatenate horizontally
print(np.vstack((a, b)))  # concatenate vertically
Output:
[[ 8.  6.]
 [ 7.  6.]]
[[ 3.  4.]
 [ 8.  1.]]
[[ 8.  6.  3.  4.]
 [ 7.  6.  8.  1.]]
[[ 8.  6.]
 [ 7.  6.]
 [ 3.  4.]
 [ 8.  1.]]

hsplit and vsplit split matrices

a = np.floor(10 * np.random.random((2, 12)))
print(a)
print(np.hsplit(a, 3))       # split column-wise into 3 equal parts
print(np.hsplit(a, (3, 4)))  # split at the given positions: before column index 3 and before column index 4
Output:
[[ 7.  1.  4.  9.  8.  8.  5.  9.  6.  6.  9.  4.]
 [ 1.  9.  1.  2.  9.  9.  5.  0.  5.  4.  9.  6.]]
[array([[ 7.,  1.,  4.,  9.],
       [ 1.,  9.,  1.,  2.]]), array([[ 8.,  8.,  5.,  9.],
       [ 9.,  9.,  5.,  0.]]), array([[ 6.,  6.,  9.,  4.],
       [ 5.,  4.,  9.,  6.]])]
[array([[ 7.,  1.,  4.],
       [ 1.,  9.,  1.]]), array([[ 9.],
       [ 2.]]), array([[ 8.,  8.,  5.,  9.,  6.,  6.,  9.,  4.],
       [ 9.,  9.,  5.,  0.,  5.,  4.,  9.,  6.]])]

a = np.floor(10 * np.random.random((12, 2)))
print(a)
print(np.vsplit(a, 3))  # split row-wise into 3 equal parts
Output:
[[ 6. 4.]
[ 0. 1.]
[ 9. 0.]
[ 0. 0.]
[ 0. 4.]
[ 1. 1.]
[ 0. 4.]
[ 1. 6.]
[ 9. 7.]
[ 0. 9.]
[ 6. 1.]
[ 3. 0.]]
[array([[ 6., 4.],
[ 0., 1.],
[ 9., 0.],
[ 0., 0.]]), array([[ 0., 4.],
[ 1., 1.],
[ 0., 4.],
[ 1., 6.]]), array([[ 9., 7.],
[ 0., 9.],
[ 6., 1.],
[ 3., 0.]])]

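Assigning an array to another name does not copy it: both names refer to the same memory, so an operation through one name affects the other

a = np.arange(12)
b = a  # a and b are two names for the same array object
print(b is a)
b.shape = 3, 4
print(a.shape)
print(id(a))  # id identifies the object; equal ids mean a and b refer to the same object in memory
print(id(b))
Output:
True
(3, 4)
4382560048
4382560048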

The view method creates a new array object that shares the same underlying data

import numpy as np
a = np.arange(12)
c = a.view()
print(id(a))  # the ids differ
print(id(c))
print(c is a)
c.shape = 2, 6
print(a.shape)  # changing the shape of c leaves the shape of a unchanged
c[0, 4] = 1234  # change an element of c
print(a)        # the same element of a changes as well
Output:
4382897216
4382897136
False
(12,)
[   0    1    2    3 1234    5    6    7    8    9   10   11]

The copy method (deep copy) creates a complete, independent copy of the array and its values

d = a.copy()
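
The three cases (plain assignment, view, copy) can be compared side by side. This sketch uses np.shares_memory, which the notes do not mention, to check whether two arrays use the same data.

```python
import numpy as np

a = np.arange(6)
alias = a          # same object
view = a.view()    # new object, shared data
copy = a.copy()    # new object, independent data

view[0] = 100      # visible through a and alias, not through copy
copy[1] = -1       # affects only the copy

print(alias is a, np.shares_memory(a, view), np.shares_memory(a, copy))
print(a)
print(copy)
```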

Finding the maximum along the rows or columns of a matrix, and the index of the maximum

import numpy as np
data = np.sin(np.arange(20)).reshape(5, 4)
print(data)
ind = data.argmax(axis=0)  # row index of the maximum in each column
print(ind)
data_max = data[ind, range(data.shape[1])]  # pick the values by (row, column) index
print(data_max)
Output:
[[ 0.          0.84147098  0.90929743  0.14112001]
 [-0.7568025  -0.95892427 -0.2794155   0.6569866 ]
 [ 0.98935825  0.41211849 -0.54402111 -0.99999021]
 [-0.53657292  0.42016704  0.99060736  0.65028784]
 [-0.28790332 -0.96139749 -0.75098725  0.14987721]]
[2 0 3 1]
[ 0.98935825  0.84147098  0.99060736  0.6569866 ]

The tile function repeats a matrix along its rows and columns

import numpy as np
a = np.arange(0, 40, 10)
b = np.tile(a, (2, 3))  # repeat 2 times along the rows and 3 times along the columns
print(b)
Output:
[[ 0 10 20 30  0 10 20 30  0 10 20 30]
 [ 0 10 20 30  0 10 20 30  0 10 20 30]]

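Two ways to sort
sort orders the values of a matrix; argsort returns the indices that order the elements from smallest to largest, and indexing with them gives the sorted result

a = np.array([[4, 3, 5], [1, 2, 1]])
b = np.sort(a, axis=1)  # sort each row of a in ascending order and assign the result to b
print(b)
a.sort(axis=1)  # sort each row of a in place, in ascending order
print(a)
a = np.array([4, 3, 1, 2])
j = np.argsort(a)  # indices of the elements from smallest to largest
print(j)
print(a[j])  # index a with them to get the sorted values
Output:
[[3 4 5]
 [1 1 2]]
-------
[[3 4 5]
 [1 1 2]]
-------
[2 3 1 0]
-------
[1 2 3 4]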
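
argsort is what makes it possible to reorder one array by the values of another. A small sketch, with a made-up (id, score) table, covering descending order and sorting whole rows by one column:

```python
import numpy as np

a = np.array([4, 3, 1, 2])
descending = a[np.argsort(a)[::-1]]   # reverse the ascending order

# Hypothetical table: each row is (id, score). Reorder whole rows by the score column.
table = np.array([[1, 50],
                  [2, 20],
                  [3, 90]])
order = np.argsort(table[:, 1])       # indices that sort the score column
by_score = table[order]               # apply the same order to every column
print(descending)
print(by_score)
```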

The data analysis library Pandas, built on NumPy

read_csv reads a CSV file

import pandas as pd
food_info = pd.read_csv("food_info.csv")
print(type(food_info))   # a pandas DataFrame can be thought of as a matrix-like structure
print(food_info.dtypes)  # dtypes lists the data type of each column
Output:
<class 'pandas.core.frame.DataFrame'>
NDB_No              int64
Shrt_Desc          object
Water_(g)         float64
Energ_Kcal          int64
......
Cholestrl_(mg)    float64
dtype: object

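Getting information about the loaded file

print(food_info.head(3))  # the first 3 rows; without an argument, head() returns the first 5 rows
print(food_info.tail())   # tail() returns the last 5 rows
print(food_info.columns)  # columns holds all the column names
print(food_info.shape)    # dimensions of the data: (8618, 36)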

Selecting specific rows

print(food_info.loc[0])  # the row with index label 0
food_info.loc[8620]      # a label beyond the last index raises an error: "KeyError: 'the label [8620] is not in the [index]'"
food_info.loc[3:6]       # rows 3 through 6: 3, 4, 5, 6 (a loc slice includes the end label)
two_five_ten = [2, 5, 10]
food_info.loc[two_five_ten]  # rows 2, 5 and 10
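
loc selects by label, which only coincides with row position while the index is 0, 1, 2, and so on. The sketch below uses a made-up frame with a different index to show where loc and the position-based iloc diverge.

```python
import pandas as pd

# Hypothetical frame whose index labels are not 0, 1, 2, ...
df = pd.DataFrame({"value": [1.5, 2.5, 3.5, 4.5]}, index=[10, 20, 30, 40])

by_label = df.loc[20, "value"]   # the row labelled 20
by_position = df.iloc[1, 0]      # the row at position 1, which is the same row
label_slice = df.loc[10:30]      # labels 10 through 30, end included -> 3 rows
position_slice = df.iloc[0:2]    # positions 0 and 1, end excluded  -> 2 rows
print(by_label, by_position, len(label_slice), len(position_slice))
```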

Selecting specific columns

ndb_col = food_info["NDB_No"]  # the data in the first column, NDB_No
print(ndb_col)

columns = ["Zinc_(mg)", "Copper_(mg)"]  # to select several columns, list their names
zinc_copper = food_info[columns]
print(zinc_copper)

Selecting the columns whose names end in (g)

col_names = food_info.columns.tolist()  # tolist() puts the column names in a list
gram_columns = []
for c in col_names:
    if c.endswith("(g)"):
        gram_columns.append(c)
gram_df = food_info[gram_columns]
print(gram_df.head())  # first 5 rows
Output:
   Water_(g)  Protein_(g)  Lipid_Tot_(g)  Ash_(g)  Carbohydrt_(g)  \
0      15.87         0.85          81.11     2.11            0.06
1      15.87         0.85          81.11     2.11            0.06
2       0.24         0.28          99.48     0.00            0.00
3      42.41        21.40          28.74     5.11            2.34
4      41.11        23.24          29.68     3.18            2.79

   Fiber_TD_(g)  Sugar_Tot_(g)  FA_Sat_(g)  FA_Mono_(g)  FA_Poly_(g)
0           0.0           0.06      51.368       21.021        3.043
1           0.0           0.06      50.489       23.426        3.012
2           0.0           0.00      61.924       28.732        3.694
3           0.0           0.50      18.669        7.778        0.800
4           0.0           0.51      18.764        8.598        0.784

Arithmetic on the data in a column

import pandas
food_info = pandas.read_csv("food_info.csv")
iron_grams = food_info["Iron_(mg)"] / 1000  # divide every value in the column by 1000
food_info["Iron_(g)"] = iron_grams          # add a new column, Iron_(g), to hold the result

water_energy = food_info["Water_(g)"] * food_info["Energ_Kcal"]  # multiply two columns element by element

Maximum, minimum and mean of a column

max_calories = food_info["Energ_Kcal"].max()
print(max_calories)
min_calories = food_info["Energ_Kcal"].min()
print(min_calories)
mean_calories = food_info["Energ_Kcal"].mean()
print(mean_calories)
Output:
902
0
226.438616848

Sorting a column with sort_values()

food_info.sort_values("Sodium_(mg)", inplace=True)  # ascending by default; inplace=True sorts food_info itself instead of returning a new DataFrame
print(food_info["Sodium_(mg)"])

food_info.sort_values("Sodium_(mg)", inplace=True, ascending=False)
# ascending=False sorts from largest to smallest
print(food_info["Sodium_(mg)"])
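
The effect of inplace is easiest to see on a tiny frame. In this sketch the three sodium values are made up; only the column name is borrowed from the notes.

```python
import pandas as pd

# Hypothetical data; the column name mirrors the one used above.
df = pd.DataFrame({"Sodium_(mg)": [30.0, 5.0, 12.0]})

sorted_df = df.sort_values("Sodium_(mg)")   # returns a new, sorted DataFrame
print(df["Sodium_(mg)"].tolist())           # the original order is unchanged

result = df.sort_values("Sodium_(mg)", ascending=False, inplace=True)
print(result)                               # None: df itself was reordered
print(df["Sodium_(mg)"].tolist())
```

Note that the rows keep their original index labels after sorting, in both forms.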

Exercises on titanic_train.csv (including the pivot_table() method)

import pandas as pd
import numpy as np
titanic_survival = pd.read_csv("titanic_train.csv")
titanic_survival.head()

age = titanic_survival["Age"]
print(age.loc[0:20])  # print rows 0 through 20 of the column
age_is_null = pd.isnull(age)  # isnull() flags missing values: True if missing, False otherwise
print(age_is_null)
age_null_true = age[age_is_null]  # all rows where the column is missing
print(age_null_true)
age_null_count = len(age_null_true)
print(age_null_count)  # number of rows with a missing value

# With missing values present, a plain sum/len does not give the mean
mean_age = sum(titanic_survival["Age"]) / len(titanic_survival["Age"])  # the built-in sum() adds up the elements of the column
print(mean_age)  # nan

# Missing values have to be removed before computing the mean
good_ages = titanic_survival["Age"][age_is_null == False]  # keep the values that are not missing
correct_mean_age = sum(good_ages) / len(good_ages)
print(correct_mean_age)  # 29.6991176471

# There is an easier way: missing values are common, and the pandas mean() method skips them automatically
correct_mean_age = titanic_survival["Age"].mean()
print(correct_mean_age)  # 29.6991176471
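
The same behaviour can be reproduced without the Titanic file. This sketch uses four made-up ages, one of them missing, and also shows the skipna parameter that controls whether mean() ignores missing values.

```python
import numpy as np
import pandas as pd

ages = pd.Series([22.0, np.nan, 38.0, 30.0])  # illustrative values, one missing

naive = sum(ages) / len(ages)       # nan: NaN propagates through the built-in sum()
skipped = ages.mean()               # NaN is skipped (skipna=True is the default)
strict = ages.mean(skipna=False)    # nan: opt out of skipping
n_missing = ages.isnull().sum()     # number of missing values
print(naive, skipped, strict, n_missing)
```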

# Average ticket price for each passenger class
passenger_classes = [1, 2, 3]
fares_by_class = {}
for this_class in passenger_classes:
    pclass_rows = titanic_survival[titanic_survival["Pclass"] == this_class]
    pclass_fares = pclass_rows["Fare"]  # the Fare column for passengers in this class
    fare_for_class = pclass_fares.mean()
    fares_by_class[this_class] = fare_for_class
print(fares_by_class)
Output:
{1: 84.154687499999994, 2: 20.662183152173913, 3: 13.675550101832993}

# pandas provides a more convenient tool for this kind of summary: the pivot_table() method
# index: the column to group by
# values: the column to compute on
# aggfunc: the aggregation to apply
passenger_survival = titanic_survival.pivot_table(index="Pclass", values="Survived", aggfunc="mean")
print(passenger_survival)
Output:
        Survived
Pclass
1       0.629630
2       0.472826
3       0.242363

# Average age of the passengers in each class
passenger_age = titanic_survival.pivot_table(index="Pclass", values="Age")  # the default aggfunc is the mean
print(passenger_age)
Output:
              Age
Pclass
1       38.233441
2       29.877630
3       25.140620

# index: group by one column
# values: compute on several columns at once
port_stats = titanic_survival.pivot_table(index="Embarked", values=["Fare", "Survived"], aggfunc="sum")
print(port_stats)
Output:
                Fare  Survived
Embarked
C         10072.2962        93
Q          1022.2543        30
S         17439.3988       217
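
A pivot table with only index, values and aggfunc is a group-by in disguise. The sketch below, on a made-up five-row miniature of the Titanic columns, computes the same summary both ways; groupby is not used in the notes above and is shown only for comparison.

```python
import pandas as pd

# Hypothetical miniature of the Titanic columns used above.
df = pd.DataFrame({
    "Pclass":   [1, 1, 2, 3, 3],
    "Fare":     [80.0, 90.0, 20.0, 10.0, 14.0],
    "Survived": [1, 0, 1, 0, 0],
})

pivot = df.pivot_table(index="Pclass", values=["Fare", "Survived"], aggfunc="mean")
grouped = df.groupby("Pclass")[["Fare", "Survived"]].mean()
print(pivot)
print(grouped)
```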

# Drop rows that have missing values
new_titanic_survival = titanic_survival.dropna(axis=0, subset=["Age", "Cabin"])  # with subset, a row is dropped if either Age or Cabin is missing
print(new_titanic_survival)

# Locate a value by row label and column name
row_index_103_age = titanic_survival.loc[103, "Age"]
row_index_766_pclass = titanic_survival.loc[766, "Pclass"]
print(row_index_103_age)
print(row_index_766_pclass)

# sort_values() sorts; reset_index() renumbers the rows
new_titanic_survival = titanic_survival.sort_values("Age", ascending=False)  # ascending=False: largest first
print(new_titanic_survival[0:10])  # the rows keep their original index labels
titanic_reindexed = new_titanic_survival.reset_index(drop=True)  # reset_index(drop=True) replaces the index with 0, 1, 2, ...
print(titanic_reindexed.iloc[0:10])  # iloc selects rows by position

# Wrap an operation in a function, then run it with apply()
def hundredth_row(column):  # returns the value in the 100th row of a column
    # Extract the hundredth item
    hundredth_item = column.iloc[99]
    return hundredth_item

hundredth_row_values = titanic_survival.apply(hundredth_row)  # apply() calls the function on each column
print(hundredth_row_values)
Output:
PassengerId 100
Survived 0
Pclass 2
Name Kantor, Mr. Sinai
Sex male
Age 34
SibSp 1
Parch 0
Ticket 244367
Fare 26
Cabin NaN
Embarked S
dtype: object

# Count the missing values in each column
def null_count(column):
    column_null = pd.isnull(column)
    null = column[column_null]
    return len(null)

column_null_count = titanic_survival.apply(null_count)
print(column_null_count)
Output:
PassengerId 0
Survived 0
Pclass 0
Name 0
Sex 0
Age 177
SibSp 0
Parch 0
Ticket 0
Fare 0
Cabin 687
Embarked 2
dtype: int64

# Convert the passenger class to a label
def which_class(row):
    pclass = row["Pclass"]
    if pd.isnull(pclass):
        return "Unknown"
    elif pclass == 1:
        return "First Class"
    elif pclass == 2:
        return "Second Class"
    elif pclass == 3:
        return "Third Class"

classes = titanic_survival.apply(which_class, axis=1)  # with axis=1, DataFrame.apply() iterates over rows instead of columns
print(classes)

# Use a custom function together with pivot_table to compute the survival rate for each age label
def generate_age_label(row):
    age = row["Age"]
    if pd.isnull(age):
        return "unknown"
    elif age < 18:
        return "minor"
    else:
        return "adult"

age_labels = titanic_survival.apply(generate_age_label, axis=1)

titanic_survival["age_labels"] = age_labels
age_group_survival = titanic_survival.pivot_table(index="age_labels", values="Survived", aggfunc="mean")
print(age_group_survival)
Output:
            Survived
age_labels
adult       0.381032
minor       0.539823
unknown     0.293785
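
Row-wise apply is readable but runs a Python function once per row. As an alternative that the notes do not cover, the same labels can be produced in one vectorized step with np.select. The four ages below are made up for illustration.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"Age": [10.0, np.nan, 40.0, 17.5]})  # illustrative values

# Conditions are checked in order; the first one that is True decides the label.
conditions = [df["Age"].isnull(), df["Age"] < 18]
choices = ["unknown", "minor"]
df["age_labels"] = np.select(conditions, choices, default="adult")
print(df)
```

Comparisons with NaN evaluate to False, so a missing age can only match the first condition.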

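The Series structure

Series (collection of values): a single row or column of a DataFrame is a Series
DataFrame (collection of Series objects): the table returned by read_csv() when reading a file
Panel (collection of DataFrame objects); Panel has been removed from recent pandas versions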

import pandas as pd
fandango = pd.read_csv('fandango_score_comparison.csv')  # read the movie data into a DataFrame
series_film = fandango['FILM']  # select the "FILM" column
print(type(series_film))  # <class 'pandas.core.series.Series'>
print(series_film[0:5])   # slice by position
series_rt = fandango['RottenTomatoes']
print(series_rt[0:5])

from pandas import Series  # Import the Series object from pandas
film_names = series_film.values  # extract the values held by the Series
print(type(film_names))  # <class 'numpy.ndarray'>: the values of a Series are stored as an ndarray
rt_scores = series_rt.values
series_custom = Series(rt_scores, index=film_names)  # create a Series of scores indexed by film name
series_custom[['Minions (2015)', 'Leviathan (2014)']]  # labels can be used as an index
fiveten = series_custom[5:10]  # positional slices work too
print(fiveten)

Sorting a Series

original_index = series_custom.index.tolist()  # put the index values in a list
sorted_index = sorted(original_index)
sorted_by_index = series_custom.reindex(sorted_index)  # reindex reorders the Series to follow the sorted labels
print(sorted_by_index)

sc2 = series_custom.sort_index()   # sort by index
sc3 = series_custom.sort_values()  # sort by value
print(sc3)

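The values of a Series are stored in an ndarray, the core NumPy data type, so NumPy functions can be applied directly

import numpy as np
print(np.add(series_custom, series_custom))  # add two Series element by element
np.sin(series_custom)  # apply sin to every value
np.max(series_custom)  # maximum of the Series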

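Selecting the values of series_custom between 50 and 75
A comparison on a Series is applied to every value and returns a boolean Series

criteria_one = series_custom > 50
criteria_two = series_custom < 75
both_criteria = series_custom[criteria_one & criteria_two]  # index with the combined boolean Series to keep the matching values
print(both_criteria)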

Arithmetic on two Series that share the same index

# data alignment same index
rt_critics = Series(fandango['RottenTomatoes'].values, index=fandango['FILM'])
rt_users = Series(fandango['RottenTomatoes_User'].values, index=fandango['FILM'])
rt_mean = (rt_critics + rt_users) / 2
print(rt_mean)
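
Alignment matters most when the two indexes are not identical. In this sketch the labels and scores are made up: the Series are matched by label rather than by position, and labels present on only one side produce NaN.

```python
import pandas as pd

# Hypothetical scores; "C" appears only in critics and "D" only in users.
critics = pd.Series([80, 60, 90], index=["A", "B", "C"])
users = pd.Series([70, 50, 40], index=["B", "A", "D"])

mean_scores = (critics + users) / 2   # matched by label, not by position
complete = mean_scores.dropna()       # keep only labels present on both sides
print(mean_scores)
print(complete)
```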

Working with a DataFrame
Setting 'FILM' as the index

fandango = pd.read_csv('fandango_score_comparison.csv')
print(type(fandango))  # <class 'pandas.core.frame.DataFrame'>
fandango_films = fandango.set_index('FILM', drop=False)  # returns a new DataFrame indexed by 'FILM'; drop=False keeps the original FILM column

Slicing a DataFrame

# Rows can be sliced with [] or with loc[]
fandango_films["Avengers: Age of Ultron (2015)":"Hot Tub Time Machine 2 (2015)"]  # a string index can be sliced too
fandango_films.loc["Avengers: Age of Ultron (2015)":"Hot Tub Time Machine 2 (2015)"]
fandango_films[0:3]  # positional slicing still works
# Selecting specific movies
# movies = ['Kumiko, The Treasure Hunter (2015)', 'Do You Believe? (2015)', 'Ant-Man (2015)']

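The visualization library Matplotlib

Matplotlib is one of the most widely used visualization tools in Python. It makes it easy to create a wide variety of 2D charts and some basic 3D charts.

2D charts: line charts

The most basic Matplotlib module is pyplot. Start with the simplest point and line plots.
For more properties, see the official site: http://matplotlib.org/api/pyplot_api.html

import pandas as pd
import matplotlib as mpl
import matplotlib.pyplot as plt

unrate = pd.read_csv('unrate.csv')
unrate['DATE'] = pd.to_datetime(unrate['DATE'])  # pd.to_datetime parses the dates into a standard datetime type

first_twelve = unrate[0:12]  # the first 12 rows (0 through 11)
plt.plot(first_twelve['DATE'], first_twelve['VALUE'])  # plot(x values, y values) draws the line
plt.xticks(rotation=45)  # rotate the x-axis tick labels by 45 degrees
plt.xlabel('Month')  # x-axis label
plt.ylabel('Unemployment Rate')  # y-axis label
plt.title('Monthly Unemployment Trends, 1948')  # chart title
plt.show()  # display the chart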

Subplots

Adding a subplot: add_subplot(first, second, index)
first is the number of rows, second the number of columns, and index the position of the subplot in that grid.

import matplotlib.pyplot as plt
fig = plt.figure()  # Creates a new figure.
ax1 = fig.add_subplot(3, 2, 1)  # the first cell of a 3x2 grid of subplots
ax2 = fig.add_subplot(3, 2, 2)  # the second cell of the 3x2 grid
ax3 = fig.add_subplot(3, 2, 6)  # the sixth cell of the 3x2 grid
plt.show()

import numpy as np
# fig = plt.figure()
fig = plt.figure(figsize=(3, 6))  # size of the figure as (width, height) in inches
ax1 = fig.add_subplot(2, 1, 1)
ax2 = fig.add_subplot(2, 1, 2)

ax1.plot(np.random.randint(1, 5, 5), np.arange(5))  # draw in the first subplot
ax2.plot(np.arange(10) * 3, np.arange(10))  # draw in the second subplot
plt.show()
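
plt.subplots is a shorthand that creates the figure and the whole grid of axes in one call. The sketch below is a variant of the 2x1 example; it assumes the non-interactive Agg backend and writes the image to an in-memory buffer so that it runs without a display or a file. The plotted data is made up.

```python
import io

import matplotlib
matplotlib.use("Agg")  # assumption: non-interactive backend, so no window is needed
import matplotlib.pyplot as plt
import numpy as np

# One call creates the figure and a 2x1 grid of axes that share the x-axis.
fig, axes = plt.subplots(nrows=2, ncols=1, figsize=(3, 6), sharex=True)

x = np.arange(10)
axes[0].plot(x, x * 3)
axes[0].set_title("y = 3x")
axes[1].plot(x, x ** 2)
axes[1].set_title("y = x squared")

fig.tight_layout()                 # avoid overlapping titles and tick labels
buffer = io.BytesIO()
fig.savefig(buffer, format="png")  # render to memory instead of calling plt.show()
```

In an interactive session you would drop the matplotlib.use line and call plt.show() instead of savefig.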

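Drawing two lines in the same chart (call plot twice)

unrate['MONTH'] = unrate['DATE'].dt.month  # assumed step: MONTH is not created elsewhere in these notes; here it is the integer month taken from DATE
fig = plt.figure(figsize=(6, 3))
plt.plot(unrate[0:12]['MONTH'], unrate[0:12]['VALUE'], c='red')
plt.plot(unrate[12:24]['MONTH'], unrate[12:24]['VALUE'], c='blue')
plt.show()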

Labeling the plotted curves

fig = plt.figure(figsize=(10, 6))
colors = ['red', 'blue', 'green', 'orange', 'black']
for i in range(5):
    start_index = i * 12
    end_index = (i + 1) * 12
    subset = unrate[start_index:end_index]
    label = str(1948 + i)  # legend label
    plt.plot(subset['MONTH'], subset['VALUE'], c=colors[i], label=label)  # x values, y values, color, label
plt.legend(loc='upper left')  # loc sets the position of the legend box: 'best', 'upper right', 'lower left', etc.; help(plt.legend) shows the options
plt.xlabel('Month, Integer')
plt.ylabel('Unemployment Rate, Percent')
plt.title('Monthly Unemployment Trends, 1948-1952')
plt.show()

2D charts: bar charts and scatter plots

Bar chart

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
reviews = pd.read_csv('fandango_scores.csv')  # read the movie ratings table
cols = ['FILM', 'RT_user_norm', 'Metacritic_user_nom', 'IMDB_norm', 'Fandango_Ratingvalue', 'Fandango_Stars']
norm_reviews = reviews[cols]
num_cols = ['RT_user_norm', 'Metacritic_user_nom', 'IMDB_norm', 'Fandango_Ratingvalue', 'Fandango_Stars']
bar_heights = norm_reviews.loc[0, num_cols].values  # bar heights (.loc replaces the removed .ix indexer)
bar_positions = np.arange(5) + 0.75  # distance of each bar from the left edge
tick_positions = range(1, 6)  # x-axis tick positions: [1, 2, 3, 4, 5]
fig, ax = plt.subplots()

ax.bar(bar_positions, bar_heights, 0.5)  # bar chart: bar positions, bar heights, bar width
ax.set_xticks(tick_positions)  # x-axis tick positions
ax.set_xticklabels(num_cols, rotation=45)  # x-axis tick labels

ax.set_xlabel('Rating Source')
ax.set_ylabel('Average Rating')
ax.set_title('Average User Rating For Avengers: Age of Ultron (2015)')
plt.show()

Scatter plot

fig, ax = plt.subplots()  # fig controls the figure as a whole (for example its size); ax is used for the actual drawing
ax.scatter(norm_reviews['Fandango_Ratingvalue'], norm_reviews['RT_user_norm'])  # scatter(x values, y values) draws the scatter plot
ax.set_xlabel('Fandango')
ax.set_ylabel('Rotten Tomatoes')
plt.show()

Scatter plots as subplots

fig = plt.figure(figsize=(8, 3))
ax1 = fig.add_subplot(1, 2, 1)
ax2 = fig.add_subplot(1, 2, 2)
ax1.scatter(norm_reviews['Fandango_Ratingvalue'], norm_reviews['RT_user_norm'])
ax1.set_xlabel('Fandango')
ax1.set_ylabel('Rotten Tomatoes')
ax2.scatter(norm_reviews['RT_user_norm'], norm_reviews['Fandango_Ratingvalue'])
ax2.set_xlabel('Rotten Tomatoes')
ax2.set_ylabel('Fandango')
plt.show()
