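Datawhale Data Analysis: Titanic, Unit 1
1 Chapter 1: Loading Data and First Observations
1.1 Loading the data
Dataset download: https://www.kaggle.com/c/titanic/overview
1.1.1 Task 1: Import numpy and pandas
# Write your code here
import numpy as np
import pandas as pd
import os
[Hint] If the import fails, learn how to install the numpy and pandas libraries in your Python environment.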
1.1.2 Task 2: Load the data
(1) Load the data using a relative path
(2) Load the data using an absolute path
| 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.2500 | NaN | S |
| 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C |
| 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.9250 | NaN | S |
| 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 113803 | 53.1000 | C123 | S |
| 5 | 0 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 373450 | 8.0500 | NaN | S |
| 0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.2500 | NaN | S | 100 |
| 1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C | 100 |
| 2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.9250 | NaN | S | 100 |
[Hint] If loading with a relative path raises an error, use os.getcwd() to check the current working directory.
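The difference between the two kinds of path can be sketched as follows. This is a minimal illustration, not the tutorial's own answer: it writes a tiny made-up stand-in for train.csv into a temporary folder, and the names `workdir`, `csv_path`, `df_relative` and `df_absolute` are hypothetical.

```python
import os
import tempfile

import pandas as pd

# Build a tiny stand-in for train.csv in a temporary folder (made-up data).
workdir = tempfile.mkdtemp()
csv_path = os.path.join(workdir, "train.csv")
pd.DataFrame({"PassengerId": [1, 2], "Survived": [0, 1]}).to_csv(csv_path, index=False)

# A relative path is resolved against the current working directory.
previous_cwd = os.getcwd()
os.chdir(workdir)
try:
    df_relative = pd.read_csv("train.csv")
    resolved = os.path.abspath("train.csv")  # where the relative path points
finally:
    os.chdir(previous_cwd)

# An absolute path works regardless of the working directory.
df_absolute = pd.read_csv(csv_path)

print(resolved)
print(df_relative.equals(df_absolute))
```

If the relative load fails in your own environment, printing os.getcwd() usually shows that the notebook was started from a different folder than the one holding the file.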
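[Think] Now that you know how to load data, compare pd.read_csv() and pd.read_table(). What do you need to change to make them behave the same? Look up the difference between '.tsv' and '.csv' files and how to load each.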
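One possible answer, sketched on small in-memory strings rather than real files (the sample text and variable names are made up for illustration): the two functions differ in their default separator, so passing `sep` explicitly makes them interchangeable.

```python
import io

import pandas as pd

csv_text = "PassengerId,Survived\n1,0\n2,1\n"
tsv_text = "PassengerId\tSurvived\n1\t0\n2\t1\n"

# read_csv splits on commas by default; read_table splits on tabs by default.
from_csv = pd.read_csv(io.StringIO(csv_text))
from_tsv = pd.read_table(io.StringIO(tsv_text))

# Passing sep explicitly lets either function read either format.
csv_via_table = pd.read_table(io.StringIO(csv_text), sep=",")
tsv_via_csv = pd.read_csv(io.StringIO(tsv_text), sep="\t")

# Without sep=",", read_table treats each comma-separated line as one column.
one_column = pd.read_table(io.StringIO(csv_text))

print(from_csv.equals(csv_via_table), from_tsv.equals(tsv_via_csv))
print(one_column.shape)
```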
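[Summary] Loading data is the first step of all work. You will meet many data formats (e.g. .csv, .tsv, .xlsx), but the approach to loading them is the same. When you run into an unfamiliar problem in later work and projects, look things up, search with Google, understand the business logic, and be clear about what the inputs and outputs are.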
1.1.3 Task 3: Read the data in chunks of 1000 rows
# Write your code here
chunker = pd.read_csv('train.csv', chunksize=1000)
for piece in chunker:
    print(type(piece))
    print(len(piece))
    print(piece)

<class 'pandas.core.frame.DataFrame'>
891
(printed DataFrame: rows 0 to 890 with the middle rows elided as "..", wrapped into three column groups: PassengerId, Survived, Pclass; Name, Sex, Age, SibSp; Parch, Ticket, Fare, Cabin, Embarked)
[891 rows x 12 columns]
[Think] What is chunked reading, and why read in chunks?
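Chunked reading splits the file into pieces and processes chunksize rows at a time. pd.read_csv() then returns an iterable reader object (TextFileReader in current pandas; older material calls it TextParser) instead of a DataFrame, and iterating over it lets you compute statistics chunk by chunk and combine them.
It is useful when the file is too large to load at once, when only part of the data is needed, or when the data has to be processed piece by piece. Because train.csv has only 891 rows, chunksize=1000 yields a single chunk here.
[Hint] What type is chunker (the chunk reader)? What does each piece look like when you print it in a for loop?
Each piece printed in the loop is a DataFrame.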
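To see several chunks rather than one, here is a small sketch on ten rows of made-up data with `chunksize=4`. The data and the running total are illustrative assumptions; the pattern of accumulating a statistic across chunks is the point.

```python
import io

import pandas as pd

# Ten rows of made-up data standing in for a large file.
csv_text = "PassengerId,Fare\n" + "".join(f"{i},{i * 1.5}\n" for i in range(1, 11))

reader = pd.read_csv(io.StringIO(csv_text), chunksize=4)
print(type(reader).__name__)  # the reader object, not a DataFrame

chunk_sizes = []
fare_total = 0.0
for piece in reader:
    # Each piece is an ordinary DataFrame with at most `chunksize` rows.
    chunk_sizes.append(len(piece))
    fare_total += piece["Fare"].sum()

print(chunk_sizes)
print(fare_total)
```

Only one chunk is held in memory at a time, which is why this approach suits files that do not fit in memory.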
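1.1.4 Task 4: Change the header to Chinese and use the passenger ID as the index [for English-language material, translating it can make the data easier to get familiar with]
PassengerId => 乘客ID (passenger ID)
Survived => 是否幸存 (survived or not)
Pclass => 乘客等級(1/2/3等艙位) (passenger class: 1st/2nd/3rd)
Name => 乘客姓名 (passenger name)
Sex => 性別 (sex)
Age => 年齡 (age)
SibSp => 堂兄弟/妹個數 (number of siblings/spouses aboard; the Chinese label literally reads "number of cousins")
Parch => 父母與小孩個數 (number of parents/children aboard)
Ticket => 船票信息 (ticket information)
Fare => 票價 (fare)
Cabin => 客艙 (cabin)
Embarked => 登船港口 (port of embarkation)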
| 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.2500 | NaN | S |
| 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C |
| 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.9250 | NaN | S |
[Think] One way to change the header to Chinese is to replace the English column names with Chinese ones. Are there other ways?
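Two possible approaches are sketched below on a three-column made-up sample (not the tutorial's own solution): replacing the header while reading with `names=` and `header=0`, or reading first and then calling `rename` and `set_index`.

```python
import io

import pandas as pd

csv_text = "PassengerId,Survived,Pclass\n1,0,3\n2,1,1\n"
chinese_names = ["乘客ID", "是否幸存", "乘客等級"]

# Approach 1: replace the header while reading.
# header=0 tells pandas the first line is the old header to discard.
df_names = pd.read_csv(
    io.StringIO(csv_text), names=chinese_names, header=0, index_col="乘客ID"
)

# Approach 2: read as-is, then rename with a mapping and set the index.
mapping = dict(zip(["PassengerId", "Survived", "Pclass"], chinese_names))
df_rename = (
    pd.read_csv(io.StringIO(csv_text)).rename(columns=mapping).set_index("乘客ID")
)

print(df_names.equals(df_rename))
```

The mapping approach is less fragile when the column order of the file may change, because it matches names rather than positions.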
1.2 First observations
After importing the data, you will usually want an overview of its overall structure and a few sample rows: for example, the size of the data, how many columns there are, the type of each column, and whether it contains nulls.
1.2.1 Task 1: View basic information about the data
# Write your code here
train_data.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 891 entries, 1 to 891
Data columns (total 11 columns):
是否幸存 891 non-null int64
倉位等級 891 non-null int64
姓名 891 non-null object
性別 891 non-null object
年齡 714 non-null float64
兄弟姐妹個數 891 non-null int64
父母子女個數 891 non-null int64
船票信息 891 non-null object
票價 891 non-null float64
客艙 204 non-null object
登船港口 889 non-null object
dtypes: float64(2), int64(4), object(5)
memory usage: 83.5+ KB
[Hint] Several functions can do this; try summarizing them.
train_data.describe()
| | 是否幸存 | 倉位等級 | 年齡 | 兄弟姐妹個數 | 父母子女個數 | 票價 |
| count | 891.000000 | 891.000000 | 714.000000 | 891.000000 | 891.000000 | 891.000000 |
| mean | 0.383838 | 2.308642 | 29.699118 | 0.523008 | 0.381594 | 32.204208 |
| std | 0.486592 | 0.836071 | 14.526497 | 1.102743 | 0.806057 | 49.693429 |
| min | 0.000000 | 1.000000 | 0.420000 | 0.000000 | 0.000000 | 0.000000 |
| 25% | 0.000000 | 2.000000 | 20.125000 | 0.000000 | 0.000000 | 7.910400 |
| 50% | 0.000000 | 3.000000 | 28.000000 | 0.000000 | 0.000000 | 14.454200 |
| 75% | 1.000000 | 3.000000 | 38.000000 | 1.000000 | 0.000000 | 31.000000 |
| max | 1.000000 | 3.000000 | 80.000000 | 8.000000 | 6.000000 | 512.329200 |
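As one way to act on the hint about summarizing the available functions, the sketch below calls the common inspection tools side by side on three made-up rows. The sample DataFrame is an assumption for illustration only.

```python
import numpy as np
import pandas as pd

# Made-up rows shaped like the Titanic data.
df = pd.DataFrame(
    {
        "Survived": [0, 1, 1],
        "Age": [22.0, np.nan, 26.0],
        "Sex": ["male", "female", "female"],
    }
)

print(df.shape)       # (rows, columns)
print(df.dtypes)      # data type of each column
df.info()             # index range, non-null counts, dtypes, memory usage
print(df.describe())  # count/mean/std/min/quartiles/max for numeric columns
print(df.describe(include="all"))  # also count/unique/top/freq for text columns
```

Note that describe() counts only non-null values, so a count below the row total is a quick sign of missing data.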
1.2.2 Task 2: Look at the first 10 rows and the last 15 rows of the table
# Write your code here
train_data.head(10)
| 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.2500 | NaN | S |
| 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C |
| 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.9250 | NaN | S |
| 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 113803 | 53.1000 | C123 | S |
| 0 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 373450 | 8.0500 | NaN | S |
| 0 | 3 | Moran, Mr. James | male | NaN | 0 | 0 | 330877 | 8.4583 | NaN | Q |
| 0 | 1 | McCarthy, Mr. Timothy J | male | 54.0 | 0 | 0 | 17463 | 51.8625 | E46 | S |
| 0 | 3 | Palsson, Master. Gosta Leonard | male | 2.0 | 3 | 1 | 349909 | 21.0750 | NaN | S |
| 1 | 3 | Johnson, Mrs. Oscar W (Elisabeth Vilhelmina Berg) | female | 27.0 | 0 | 2 | 347742 | 11.1333 | NaN | S |
| 1 | 2 | Nasser, Mrs. Nicholas (Adele Achem) | female | 14.0 | 1 | 0 | 237736 | 30.0708 | NaN | C |
train_data.tail(15)
| 0 | 3 | Gustafsson, Mr. Alfred Ossian | male | 20.0 | 0 | 0 | 7534 | 9.8458 | NaN | S |
| 0 | 3 | Petroff, Mr. Nedelio | male | 19.0 | 0 | 0 | 349212 | 7.8958 | NaN | S |
| 0 | 3 | Laleff, Mr. Kristo | male | NaN | 0 | 0 | 349217 | 7.8958 | NaN | S |
| 1 | 1 | Potter, Mrs. Thomas Jr (Lily Alexenia Wilson) | female | 56.0 | 0 | 1 | 11767 | 83.1583 | C50 | C |
| 1 | 2 | Shelley, Mrs. William (Imanita Parrish Hall) | female | 25.0 | 0 | 1 | 230433 | 26.0000 | NaN | S |
| 0 | 3 | Markun, Mr. Johann | male | 33.0 | 0 | 0 | 349257 | 7.8958 | NaN | S |
| 0 | 3 | Dahlberg, Miss. Gerda Ulrika | female | 22.0 | 0 | 0 | 7552 | 10.5167 | NaN | S |
| 0 | 2 | Banfield, Mr. Frederick James | male | 28.0 | 0 | 0 | C.A./SOTON 34068 | 10.5000 | NaN | S |
| 0 | 3 | Sutehall, Mr. Henry Jr | male | 25.0 | 0 | 0 | SOTON/OQ 392076 | 7.0500 | NaN | S |
| 0 | 3 | Rice, Mrs. William (Margaret Norton) | female | 39.0 | 0 | 5 | 382652 | 29.1250 | NaN | Q |
| 0 | 2 | Montvila, Rev. Juozas | male | 27.0 | 0 | 0 | 211536 | 13.0000 | NaN | S |
| 1 | 1 | Graham, Miss. Margaret Edith | female | 19.0 | 0 | 0 | 112053 | 30.0000 | B42 | S |
| 0 | 3 | Johnston, Miss. Catherine Helen "Carrie" | female | NaN | 1 | 2 | W./C. 6607 | 23.4500 | NaN | S |
| 1 | 1 | Behr, Mr. Karl Howell | male | 26.0 | 0 | 0 | 111369 | 30.0000 | C148 | C |
| 0 | 3 | Dooley, Mr. Patrick | male | 32.0 | 0 | 0 | 370376 | 7.7500 | NaN | Q |
1.2.3 Task 3: Check whether the data is null, returning True where a value is null and False elsewhere
# Write your code here
train_data.isnull().head()
| False | False | False | False | False | False | False | False | False | True | False |
| False | False | False | False | False | False | False | False | False | False | False |
| False | False | False | False | False | False | False | False | False | True | False |
| False | False | False | False | False | False | False | False | False | False | False |
| False | False | False | False | False | False | False | False | False | True | False |
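[Summary] The operations above are all ways of observing the data itself in data analysis.
[Think] From what other angles can you observe a dataset? Look for answers; this will help a great deal with the analysis that follows.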
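A few further angles, sketched on made-up rows (these are suggestions, not an exhaustive or official answer): how much is missing per column, how many distinct values each column has, and how often each category occurs.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    {
        "Survived": [0, 1, 1, 0],
        "Cabin": [np.nan, "C85", np.nan, np.nan],
        "Embarked": ["S", "C", "S", "S"],
    }
)

missing_per_column = df.isnull().sum()           # number of missing values per column
missing_share = df.isnull().mean()               # fraction of missing values per column
unique_counts = df.nunique()                     # distinct non-null values per column
embarked_counts = df["Embarked"].value_counts()  # frequency of each category

print(missing_per_column)
print(missing_share)
print(unique_counts)
print(embarked_counts)
```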
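1.3 Saving the data
1.3.1 Task 1: Save the data you loaded and modified as a new file, train_chinese.csv, in the working directory
# Write your code here
# Note: the saved file may show garbled characters on some operating systems. If so, try encoding='GBK' or encoding='utf-8'.
# The file name below is spelled 'train_Chinese.csv'; use the same spelling when reading it back.
train_data.to_csv('train_Chinese.csv', encoding='utf-8')
[Summary] That covers loading data and getting started. Next we will work with operations on the data itself, focusing on how numpy and pandas are used in work and project settings.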
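A save-and-reload round trip can be sketched as follows, using a two-row made-up DataFrame and a temporary folder. The `utf-8-sig` encoding is an assumption that goes beyond the two encodings named above; it writes a byte-order mark that some spreadsheet programs rely on to detect UTF-8.

```python
import os
import tempfile

import pandas as pd

df = pd.DataFrame(
    {"是否幸存": [0, 1], "性別": ["male", "female"]},
    index=pd.Index([1, 2], name="乘客ID"),
)

out_path = os.path.join(tempfile.mkdtemp(), "train_Chinese.csv")

# utf-8-sig prepends a byte-order mark to the file (an assumption, see above).
df.to_csv(out_path, encoding="utf-8-sig")

# Read back with the same encoding and restore the index column.
restored = pd.read_csv(out_path, encoding="utf-8-sig", index_col="乘客ID")
print(restored)
```

Reading with the same encoding that was used for writing is what prevents garbled headers; `index_col` keeps the saved index from coming back as an ordinary column.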
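1.4 Know what your data is called
We are learning the basic operations of pandas. So what is the data type of the data we loaded with pandas in the previous section?
Before starting, import numpy and pandas
import numpy as np
import pandas as pd
1.4.1 Task 1: pandas has two data types, DataFrame and Series. Look them up to get a basic understanding, then write a small example of your own for each 🌰 [open-ended]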
https://www.cnblogs.com/lavender1221/p/12664641.html#
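Pandas is built around three core data structures: Series, DataFrame and Index. Most operations revolve around these three.
A Series is a one-dimensional array object that holds a sequence of values and a matching sequence of index labels. A one-dimensional Numpy array accesses elements through an implicit integer index, whereas a Series associates each element with an explicitly defined index. The explicit index makes a Series more capable: labels need not be integers (they can be strings, for example), need not be contiguous, and may repeat.
A DataFrame is the central data structure of Pandas. It represents a two-dimensional table, similar to a table in a relational database, in which each column can hold a different value type such as numbers, strings or booleans. A DataFrame has both a row index and a column index, and can be thought of as a dictionary of Series that share the same index.
There are many ways to create a DataFrame; the most common is to pass a dictionary of equal-length lists or Numpy arrays. You can inspect a DataFrame's columns and index attributes.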
# Write your code here
sdata_1 = [7,-2,567,8]
example_1 = pd.Series(sdata_1,index = ['a','b','c','d'])
example_1

a      7
b     -2
c    567
d      8
dtype: int64

sdata_2 = {'a':7,'b':-2,'c':567,'d':8}
example_2 = pd.Series(sdata_2)
example_2

a      7
b     -2
c    567
d      8
dtype: int64

sdata_3 = {'city':['nanjing','wuxi','wuhan','changsha'],'code':['001','002','003','004']}
example_3 = pd.DataFrame(sdata_3)
example_3
| city | code |
| nanjing | 001 |
| wuxi | 002 |
| wuhan | 003 |
| changsha | 004 |
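1.4.2 Task 2: Load the "train.csv" file using the method from the previous lesson
# Write your code here
train_chinese = pd.read_csv('train_Chinese.csv')
train_chinese.head()
train_data = pd.read_csv('train.csv')
You can also load the "train_chinese.csv" file saved in the previous lesson. Having got familiar with the dataset through the translated train_chinese.csv, we now work with train.csv.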
1.4.3 Task 3: View the name of each column of the DataFrame
# Write your code here
train_chinese.columns
Index(['乘客ID', '是否幸存', '倉位等級', '姓名', '性別', '年齡', '兄弟姐妹個數', '父母子女個數', '船票信息', '票價', '客艙', '登船港口'], dtype='object')
train_data.columns
Index(['PassengerId', 'Survived', 'Pclass', 'Name', 'Sex', 'Age', 'SibSp', 'Parch', 'Ticket', 'Fare', 'Cabin', 'Embarked'], dtype='object')
train_data.head()
| 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.2500 | NaN | S |
| 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C |
| 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.9250 | NaN | S |
| 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 113803 | 53.1000 | C123 | S |
| 5 | 0 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 373450 | 8.0500 | NaN | S |
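1.4.4 Task 4: View all the values in the "Cabin" column [there are several ways]
# Write your code here
train_data['Cabin'].head()

0     NaN
1     C85
2     NaN
3    C123
4     NaN
Name: Cabin, dtype: object

# Write your code here
train_data.Cabin.head()

0     NaN
1     C85
2     NaN
3    C123
4     NaN
Name: Cabin, dtype: object

1.4.5 Task 5: Load the file "test_1.csv", compare it with "train.csv" to see which columns are extra, then delete the extra columns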
After inspection we find that the test set test_1.csv has one redundant column, which we need to delete.
# Write your code here
test_data = pd.read_csv('test_1.csv')
test_data.head()
| 0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.2500 | NaN | S | 100 |
| 1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C | 100 |
| 2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.9250 | NaN | S | 100 |
| 3 | 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 113803 | 53.1000 | C123 | S | 100 |
| 4 | 5 | 0 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 373450 | 8.0500 | NaN | S | 100 |
| 0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.2500 | NaN | S |
| 1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C |
| 2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.9250 | NaN | S |
| 3 | 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 113803 | 53.1000 | C123 | S |
| 4 | 5 | 0 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 373450 | 8.0500 | NaN | S |
| 5 | 6 | 0 | 3 | Moran, Mr. James | male | NaN | 0 | 0 | 330877 | 8.4583 | NaN | Q |
| 6 | 7 | 0 | 1 | McCarthy, Mr. Timothy J | male | 54.0 | 0 | 0 | 17463 | 51.8625 | E46 | S |
| 7 | 8 | 0 | 3 | Palsson, Master. Gosta Leonard | male | 2.0 | 3 | 1 | 349909 | 21.0750 | NaN | S |
| 8 | 9 | 1 | 3 | Johnson, Mrs. Oscar W (Elisabeth Vilhelmina Berg) | female | 27.0 | 0 | 2 | 347742 | 11.1333 | NaN | S |
| 9 | 10 | 1 | 2 | Nasser, Mrs. Nicholas (Adele Achem) | female | 14.0 | 1 | 0 | 237736 | 30.0708 | NaN | C |
| 10 | 11 | 1 | 3 | Sandstrom, Miss. Marguerite Rut | female | 4.0 | 1 | 1 | PP 9549 | 16.7000 | G6 | S |
| 11 | 12 | 1 | 1 | Bonnell, Miss. Elizabeth | female | 58.0 | 0 | 0 | 113783 | 26.5500 | C103 | S |
| 12 | 13 | 0 | 3 | Saundercock, Mr. William Henry | male | 20.0 | 0 | 0 | A/5. 2151 | 8.0500 | NaN | S |
| 13 | 14 | 0 | 3 | Andersson, Mr. Anders Johan | male | 39.0 | 1 | 5 | 347082 | 31.2750 | NaN | S |
| 14 | 15 | 0 | 3 | Vestrom, Miss. Hulda Amanda Adolfina | female | 14.0 | 0 | 0 | 350406 | 7.8542 | NaN | S |
| 15 | 16 | 1 | 2 | Hewlett, Mrs. (Mary D Kingcome) | female | 55.0 | 0 | 0 | 248706 | 16.0000 | NaN | S |
| 16 | 17 | 0 | 3 | Rice, Master. Eugene | male | 2.0 | 4 | 1 | 382652 | 29.1250 | NaN | Q |
| 17 | 18 | 1 | 2 | Williams, Mr. Charles Eugene | male | NaN | 0 | 0 | 244373 | 13.0000 | NaN | S |
| 18 | 19 | 0 | 3 | Vander Planke, Mrs. Julius (Emelia Maria Vande... | female | 31.0 | 1 | 0 | 345763 | 18.0000 | NaN | S |
| 19 | 20 | 1 | 3 | Masselmani, Mrs. Fatima | female | NaN | 0 | 0 | 2649 | 7.2250 | NaN | C |
| 20 | 21 | 0 | 2 | Fynney, Mr. Joseph J | male | 35.0 | 0 | 0 | 239865 | 26.0000 | NaN | S |
| 21 | 22 | 1 | 2 | Beesley, Mr. Lawrence | male | 34.0 | 0 | 0 | 248698 | 13.0000 | D56 | S |
| 22 | 23 | 1 | 3 | McGowan, Miss. Anna "Annie" | female | 15.0 | 0 | 0 | 330923 | 8.0292 | NaN | Q |
| 23 | 24 | 1 | 1 | Sloper, Mr. William Thompson | male | 28.0 | 0 | 0 | 113788 | 35.5000 | A6 | S |
| 24 | 25 | 0 | 3 | Palsson, Miss. Torborg Danira | female | 8.0 | 3 | 1 | 349909 | 21.0750 | NaN | S |
| 25 | 26 | 1 | 3 | Asplund, Mrs. Carl Oscar (Selma Augusta Emilia... | female | 38.0 | 1 | 5 | 347077 | 31.3875 | NaN | S |
| 26 | 27 | 0 | 3 | Emir, Mr. Farred Chehab | male | NaN | 0 | 0 | 2631 | 7.2250 | NaN | C |
| 27 | 28 | 0 | 1 | Fortune, Mr. Charles Alexander | male | 19.0 | 3 | 2 | 19950 | 263.0000 | C23 C25 C27 | S |
| 28 | 29 | 1 | 3 | O'Dwyer, Miss. Ellen "Nellie" | female | NaN | 0 | 0 | 330959 | 7.8792 | NaN | Q |
| 29 | 30 | 0 | 3 | Todoroff, Mr. Lalio | male | NaN | 0 | 0 | 349216 | 7.8958 | NaN | S |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 861 | 862 | 0 | 2 | Giles, Mr. Frederick Edward | male | 21.0 | 1 | 0 | 28134 | 11.5000 | NaN | S |
| 862 | 863 | 1 | 1 | Swift, Mrs. Frederick Joel (Margaret Welles Ba... | female | 48.0 | 0 | 0 | 17466 | 25.9292 | D17 | S |
| 863 | 864 | 0 | 3 | Sage, Miss. Dorothy Edith "Dolly" | female | NaN | 8 | 2 | CA. 2343 | 69.5500 | NaN | S |
| 864 | 865 | 0 | 2 | Gill, Mr. John William | male | 24.0 | 0 | 0 | 233866 | 13.0000 | NaN | S |
| 865 | 866 | 1 | 2 | Bystrom, Mrs. (Karolina) | female | 42.0 | 0 | 0 | 236852 | 13.0000 | NaN | S |
| 866 | 867 | 1 | 2 | Duran y More, Miss. Asuncion | female | 27.0 | 1 | 0 | SC/PARIS 2149 | 13.8583 | NaN | C |
| 867 | 868 | 0 | 1 | Roebling, Mr. Washington Augustus II | male | 31.0 | 0 | 0 | PC 17590 | 50.4958 | A24 | S |
| 868 | 869 | 0 | 3 | van Melkebeke, Mr. Philemon | male | NaN | 0 | 0 | 345777 | 9.5000 | NaN | S |
| 869 | 870 | 1 | 3 | Johnson, Master. Harold Theodor | male | 4.0 | 1 | 1 | 347742 | 11.1333 | NaN | S |
| 870 | 871 | 0 | 3 | Balkic, Mr. Cerin | male | 26.0 | 0 | 0 | 349248 | 7.8958 | NaN | S |
| 871 | 872 | 1 | 1 | Beckwith, Mrs. Richard Leonard (Sallie Monypeny) | female | 47.0 | 1 | 1 | 11751 | 52.5542 | D35 | S |
| 872 | 873 | 0 | 1 | Carlsson, Mr. Frans Olof | male | 33.0 | 0 | 0 | 695 | 5.0000 | B51 B53 B55 | S |
| 873 | 874 | 0 | 3 | Vander Cruyssen, Mr. Victor | male | 47.0 | 0 | 0 | 345765 | 9.0000 | NaN | S |
| 874 | 875 | 1 | 2 | Abelson, Mrs. Samuel (Hannah Wizosky) | female | 28.0 | 1 | 0 | P/PP 3381 | 24.0000 | NaN | C |
| 875 | 876 | 1 | 3 | Najib, Miss. Adele Kiamie "Jane" | female | 15.0 | 0 | 0 | 2667 | 7.2250 | NaN | C |
| 876 | 877 | 0 | 3 | Gustafsson, Mr. Alfred Ossian | male | 20.0 | 0 | 0 | 7534 | 9.8458 | NaN | S |
| 877 | 878 | 0 | 3 | Petroff, Mr. Nedelio | male | 19.0 | 0 | 0 | 349212 | 7.8958 | NaN | S |
| 878 | 879 | 0 | 3 | Laleff, Mr. Kristo | male | NaN | 0 | 0 | 349217 | 7.8958 | NaN | S |
| 879 | 880 | 1 | 1 | Potter, Mrs. Thomas Jr (Lily Alexenia Wilson) | female | 56.0 | 0 | 1 | 11767 | 83.1583 | C50 | C |
| 880 | 881 | 1 | 2 | Shelley, Mrs. William (Imanita Parrish Hall) | female | 25.0 | 0 | 1 | 230433 | 26.0000 | NaN | S |
| 881 | 882 | 0 | 3 | Markun, Mr. Johann | male | 33.0 | 0 | 0 | 349257 | 7.8958 | NaN | S |
| 882 | 883 | 0 | 3 | Dahlberg, Miss. Gerda Ulrika | female | 22.0 | 0 | 0 | 7552 | 10.5167 | NaN | S |
| 883 | 884 | 0 | 2 | Banfield, Mr. Frederick James | male | 28.0 | 0 | 0 | C.A./SOTON 34068 | 10.5000 | NaN | S |
| 884 | 885 | 0 | 3 | Sutehall, Mr. Henry Jr | male | 25.0 | 0 | 0 | SOTON/OQ 392076 | 7.0500 | NaN | S |
| 885 | 886 | 0 | 3 | Rice, Mrs. William (Margaret Norton) | female | 39.0 | 0 | 5 | 382652 | 29.1250 | NaN | Q |
| 886 | 887 | 0 | 2 | Montvila, Rev. Juozas | male | 27.0 | 0 | 0 | 211536 | 13.0000 | NaN | S |
| 887 | 888 | 1 | 1 | Graham, Miss. Margaret Edith | female | 19.0 | 0 | 0 | 112053 | 30.0000 | B42 | S |
| 888 | 889 | 0 | 3 | Johnston, Miss. Catherine Helen "Carrie" | female | NaN | 1 | 2 | W./C. 6607 | 23.4500 | NaN | S |
| 889 | 890 | 1 | 1 | Behr, Mr. Karl Howell | male | 26.0 | 0 | 0 | 111369 | 30.0000 | C148 | C |
| 890 | 891 | 0 | 3 | Dooley, Mr. Patrick | male | 32.0 | 0 | 0 | 370376 | 7.7500 | NaN | Q |
891 rows × 13 columns
[Think] Are there other ways to delete the extra column?
# Answer
del test_data['a']
test_data.head()
| 0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.2500 | NaN | S |
| 1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C |
| 2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.9250 | NaN | S |
| 3 | 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 113803 | 53.1000 | C123 | S |
| 4 | 5 | 0 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 373450 | 8.0500 | NaN | S |
1.4.6 Task 6: Hide the columns ['PassengerId', 'Name', 'Age', 'Ticket'] and look only at the remaining columns
# Write your code here
test_data.drop(['PassengerId','Name','Age','Ticket'],axis=1).head()
| 0 | 0 | 3 | male | 1 | 0 | 7.2500 | NaN | S |
| 1 | 1 | 1 | female | 1 | 0 | 71.2833 | C85 | C |
| 2 | 1 | 3 | female | 0 | 0 | 7.9250 | NaN | S |
| 3 | 1 | 1 | female | 1 | 0 | 53.1000 | C123 | S |
| 4 | 0 | 3 | male | 0 | 0 | 8.0500 | NaN | S |
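[Think] Compare Task 5 and Task 6. Did they use different methods (functions)? How could the same function meet both requirements?
[Answer]
drop() returns a new DataFrame by default and leaves the original untouched, which is what "hiding" columns needs. To remove the columns from the data permanently, pass inplace=True. Because inplace overwrites the original data, it is not used here.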
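The contrast between hiding and permanently removing columns with the same function can be sketched on a small made-up frame (the column `a` mirrors the redundant column in test_1.csv; the other names are illustrative):

```python
import pandas as pd

df = pd.DataFrame({"PassengerId": [1, 2], "Age": [22.0, 38.0], "a": [100, 100]})

# drop returns a new DataFrame by default, so the original is unchanged.
# This suits "hiding" columns for a quick look.
view = df.drop(columns=["PassengerId", "Age"])
columns_after_view = list(df.columns)

# To remove a column for good, reassign the result...
trimmed = df.drop(columns=["a"])

# ...or modify the object in place (same effect as `del df["a"]`).
df.drop(columns=["a"], inplace=True)

print(list(view.columns), columns_after_view, list(df.columns))
```

`drop(columns=[...])` is equivalent to `drop([...], axis=1)`; reassigning the result is often preferred over inplace=True because it keeps each step explicit.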
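1.5 The logic of filtering
One of the most important capabilities when working with tabular data is filtering: selecting the information you need and discarding what you do not.
Below we again learn this pandas feature through hands-on practice.
1.5.1 Task 1: Using "Age" as the filter condition, show the passengers younger than 10.
# Write your code here
test_data[test_data['Age']<10].head()
| 7 | 8 | 0 | 3 | Palsson, Master. Gosta Leonard | male | 2.0 | 3 | 1 | 349909 | 21.0750 | NaN | S |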
| 10 | 11 | 1 | 3 | Sandstrom, Miss. Marguerite Rut | female | 4.0 | 1 | 1 | PP 9549 | 16.7000 | G6 | S |
| 16 | 17 | 0 | 3 | Rice, Master. Eugene | male | 2.0 | 4 | 1 | 382652 | 29.1250 | NaN | Q |
| 24 | 25 | 0 | 3 | Palsson, Miss. Torborg Danira | female | 8.0 | 3 | 1 | 349909 | 21.0750 | NaN | S |
| 43 | 44 | 1 | 2 | Laroche, Miss. Simonne Marie Anne Andree | female | 3.0 | 1 | 2 | SC/Paris 2123 | 41.5792 | NaN | C |
1.5.2 Task 2: Using "Age" as the condition, show the passengers older than 10 and younger than 50, and name the result midage
# Write your code here
midage = test_data[(test_data['Age']>10) & (test_data['Age']<50)]
midage.head()
| 0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.2500 | NaN | S |
| 1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C |
| 2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.9250 | NaN | S |
| 3 | 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 113803 | 53.1000 | C123 | S |
| 4 | 5 | 0 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 373450 | 8.0500 | NaN | S |
[Hint] Learn how conditional filtering works in pandas and how to use intersection and union operations.
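Intersection, union and negation of conditions can be sketched with the `&`, `|` and `~` operators on boolean masks. The six ages below are made up, and one is deliberately missing to show how NaN behaves.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"Age": [2.0, 10.0, 35.0, 50.0, 64.0, np.nan]})

# Intersection (AND): both conditions must hold. Parentheses are required
# because & binds more tightly than the comparison operators.
between = df[(df["Age"] > 10) & (df["Age"] < 50)]

# Union (OR): either condition may hold.
outside = df[(df["Age"] <= 10) | (df["Age"] >= 50)]

# Negation (NOT). A missing age fails every comparison, so it shows up
# here but in neither of the two selections above.
not_between = df[~((df["Age"] > 10) & (df["Age"] < 50))]

print(len(between), len(outside), len(not_between))
```

Python's `and` and `or` keywords do not work on Series; the element-wise operators are needed.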
1.5.3 Task 3: Show the "Pclass" and "Sex" values of row 100 of midage
# Write your code here
midage = midage.reset_index()
midage.head()
| 0 | 0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.2500 | NaN | S |
| 1 | 1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C |
| 2 | 2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.9250 | NaN | S |
| 3 | 3 | 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 113803 | 53.1000 | C123 | S |
| 4 | 4 | 5 | 0 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 373450 | 8.0500 | NaN | S |
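[Hint] When extracting data, we want the relative order of the rows to stay unchanged. Which function achieves this?
reset_index(): returns a new DataFrame or Series whose index is reset to the default 0, 1, 2, ... The old index is kept as a column (named "index") unless drop=True is passed, and the rows keep their relative order.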
| 2 | male |
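The effect of reset_index() after filtering can be sketched on four made-up ages. After the filter the surviving rows keep their original labels, which is why label 100 would not otherwise mean "the row at position 100".

```python
import pandas as pd

df = pd.DataFrame({"Age": [22.0, 2.0, 35.0, 8.0]})
adults = df[df["Age"] > 10]  # keeps the original labels 0 and 2

# Default: the old labels move into a new column called "index".
with_old = adults.reset_index()

# drop=True: discard the old labels instead of keeping them as a column.
without_old = adults.reset_index(drop=True)

print(adults.index.tolist())
print(with_old.columns.tolist())
print(without_old.columns.tolist())
```

The extra leading column in the midage output above comes from the default behavior; passing drop=True would avoid it.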
1.5.4 Task 4: Use loc to show the "Pclass", "Name" and "Sex" values of rows 100, 105 and 108 of midage
# Write your code here
midage.loc[[100,105,108],['Pclass','Name','Sex']]  # passing lists of row and column labels returns a table (DataFrame)
| 2 | Byles, Rev. Thomas Roussel Davids | male |
| 3 | Cribb, Mr. John Hatfield | male |
| 3 | Calic, Mr. Jovo | male |
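1.5.5 Task 5: Use iloc to show the "Pclass", "Name" and "Sex" values of rows 100, 105 and 108 of midage
# Write your code here
midage.iloc[[100,105,108],[4,5,6]]  # iloc takes integer positions for both rows and columns, not column names
| 2 | Byles, Rev. Thomas Roussel Davids | male |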
| 3 | Cribb, Mr. John Hatfield | male |
| 3 | Calic, Mr. Jovo | male |
[Think] Compare iloc and loc: how are they alike and how do they differ?
iloc selects by integer position, while loc selects by index label.
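The difference is easiest to see when labels and positions disagree. The sketch below uses a made-up three-row frame with index labels 10, 20 and 30.

```python
import pandas as pd

df = pd.DataFrame(
    {"Pclass": [3, 1, 2], "Sex": ["male", "female", "male"]},
    index=[10, 20, 30],  # labels deliberately differ from positions
)

by_label = df.loc[[10, 30], ["Pclass", "Sex"]]  # rows labelled 10 and 30
by_position = df.iloc[[0, 2], [0, 1]]           # first and third rows

# Slices differ too: loc includes the end label, iloc excludes the end position.
loc_slice = df.loc[10:20]   # labels 10 and 20
iloc_slice = df.iloc[0:1]   # position 0 only

print(by_label.equals(by_position))
print(len(loc_slice), len(iloc_slice))
```

With a freshly reset index the labels equal the positions, which is why loc and iloc return the same rows in Tasks 4 and 5.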
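Review: Earlier we covered the basics of Pandas and learned how to read CSV data with Pandas and add, delete, look up and modify it. Today we turn to exploratory data analysis: how to use Pandas for sorting and arithmetic, and how to use the descriptive-statistics function describe().
1 Chapter 1: Exploratory Data Analysis
Before starting, import numpy, pandas and the data
# Load the required libraries
import numpy as np
import pandas as pd
# Load the train_chinese.csv data saved earlier; we use this data for the Titanic tasks
train_data = pd.read_csv('train_Chinese.csv')
1.6 Do you understand your data?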
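Textbook: Python for Data Analysis, Chapter 5
1.6.1 Task 1: Sort the sample data with Pandas in ascending order
# See the "Sorting and Ranking" part of Chapter 5 of Python for Data Analysis for details
# Build a DataFrame that contains only numbers
'''
Our example:
pd.DataFrame(): creates a DataFrame object
np.arange(8).reshape((2, 4)): builds a 2x4 two-dimensional array; first row: 0, 1, 2, 3; second row: 4, 5, 6, 7
index=[2, 1]: the row labels (index) of the DataFrame
columns=['d', 'a', 'b', 'c']: the column labels of the DataFrame
'''
frame = pd.DataFrame(np.arange(8).reshape(2,4),index=[2,1],columns=['d','a','b','c'])
frame
| | d | a | b | c |
| 2 | 0 | 1 | 2 | 3 |
| 1 | 4 | 5 | 6 | 7 |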
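[Code walkthrough]
pd.DataFrame(): creates a DataFrame object
np.arange(8).reshape((2, 4)): builds a 2x4 two-dimensional array; first row: 0, 1, 2, 3; second row: 4, 5, 6, 7
index=[2, 1]: the row labels (index) of the DataFrame
columns=['d', 'a', 'b', 'c']: the column labels of the DataFrame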
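[Question] Most of the time we want to sort by the values in a column, so sort the DataFrame you built by one of its columns in ascending order.
# Answer
frame.sort_values(by = 'c',ascending = True)
| | d | a | b | c |
| 2 | 0 | 1 | 2 | 3 |
| 1 | 4 | 5 | 6 | 7 |
[Think] Using the book, can you name other ways Pandas can sort a DataFrame?
sort_index() sorts by the index; axis=1 sorts by the column labels.
| | d | a | b | c |
| 1 | 4 | 5 | 6 | 7 |
| 2 | 0 | 1 | 2 | 3 |
[Summary] The different ways of sorting are summarized below.
1. Sort the row index in ascending order
# Code
frame.sort_index()
| | d | a | b | c |
| 1 | 4 | 5 | 6 | 7 |
| 2 | 0 | 1 | 2 | 3 |
2. Sort the column index in ascending order
# Code
frame.sort_index(axis=1)
| | a | b | c | d |
| 2 | 1 | 2 | 3 | 0 |
| 1 | 5 | 6 | 7 | 4 |
3. Sort the column index in descending order
# Code
frame.sort_index(axis=1,ascending=False)
| | d | c | b | a |
| 2 | 0 | 3 | 2 | 1 |
| 1 | 4 | 7 | 6 | 5 |
4. Sort by any two columns at once, in descending order
# Code
frame.sort_values(['a','c'],ascending=False)
| | d | a | b | c |
| 1 | 4 | 5 | 6 | 7 |
| 2 | 0 | 1 | 2 | 3 |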
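Sorting by two columns is easier to follow when the first column contains ties. The sketch below uses four made-up rows with `Fare` and `Age` columns (English names chosen for illustration) and also shows a per-column `ascending` list.

```python
import pandas as pd

df = pd.DataFrame(
    {
        "Fare": [7.25, 263.0, 263.0, 7.25],
        "Age": [22.0, 19.0, 64.0, None],
    }
)

# Sort by Fare first; ties in Fare are ordered by Age. Both descending.
both_desc = df.sort_values(by=["Fare", "Age"], ascending=False)

# One flag per column: Fare descending, Age ascending.
mixed = df.sort_values(by=["Fare", "Age"], ascending=[False, True])

# Missing values in a sort column go last by default (na_position="last").
print(both_desc)
print(mixed)
```

The second column only matters among rows that tie on the first, so the order of names in `by` sets the priority.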
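1.6.2 Task 2: Sort the Titanic data (train.csv) by fare and age together (descending). What can you learn from the result?
'''
We imported the train_chinese.csv data at the start and have already covered how to load data, so we can sort the target columns directly.
head(20): returns the first 20 rows
'''
train_data.head(20)
| 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.2500 | NaN | S |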
| 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C |
| 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.9250 | NaN | S |
| 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 113803 | 53.1000 | C123 | S |
| 5 | 0 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 373450 | 8.0500 | NaN | S |
| 6 | 0 | 3 | Moran, Mr. James | male | NaN | 0 | 0 | 330877 | 8.4583 | NaN | Q |
| 7 | 0 | 1 | McCarthy, Mr. Timothy J | male | 54.0 | 0 | 0 | 17463 | 51.8625 | E46 | S |
| 8 | 0 | 3 | Palsson, Master. Gosta Leonard | male | 2.0 | 3 | 1 | 349909 | 21.0750 | NaN | S |
| 9 | 1 | 3 | Johnson, Mrs. Oscar W (Elisabeth Vilhelmina Berg) | female | 27.0 | 0 | 2 | 347742 | 11.1333 | NaN | S |
| 10 | 1 | 2 | Nasser, Mrs. Nicholas (Adele Achem) | female | 14.0 | 1 | 0 | 237736 | 30.0708 | NaN | C |
| 11 | 1 | 3 | Sandstrom, Miss. Marguerite Rut | female | 4.0 | 1 | 1 | PP 9549 | 16.7000 | G6 | S |
| 12 | 1 | 1 | Bonnell, Miss. Elizabeth | female | 58.0 | 0 | 0 | 113783 | 26.5500 | C103 | S |
| 13 | 0 | 3 | Saundercock, Mr. William Henry | male | 20.0 | 0 | 0 | A/5. 2151 | 8.0500 | NaN | S |
| 14 | 0 | 3 | Andersson, Mr. Anders Johan | male | 39.0 | 1 | 5 | 347082 | 31.2750 | NaN | S |
| 15 | 0 | 3 | Vestrom, Miss. Hulda Amanda Adolfina | female | 14.0 | 0 | 0 | 350406 | 7.8542 | NaN | S |
| 16 | 1 | 2 | Hewlett, Mrs. (Mary D Kingcome) | female | 55.0 | 0 | 0 | 248706 | 16.0000 | NaN | S |
| 17 | 0 | 3 | Rice, Master. Eugene | male | 2.0 | 4 | 1 | 382652 | 29.1250 | NaN | Q |
| 18 | 1 | 2 | Williams, Mr. Charles Eugene | male | NaN | 0 | 0 | 244373 | 13.0000 | NaN | S |
| 19 | 0 | 3 | Vander Planke, Mrs. Julius (Emelia Maria Vande... | female | 31.0 | 1 | 0 | 345763 | 18.0000 | NaN | S |
| 20 | 1 | 3 | Masselmani, Mrs. Fatima | female | NaN | 0 | 0 | 2649 | 7.2250 | NaN | C |
| 680 | 1 | 1 | Cardeza, Mr. Thomas Drake Martinez | male | 36.00 | 0 | 1 | PC 17755 | 512.3292 | B51 B53 B55 | C |
| 259 | 1 | 1 | Ward, Miss. Anna | female | 35.00 | 0 | 0 | PC 17755 | 512.3292 | NaN | C |
| 738 | 1 | 1 | Lesurer, Mr. Gustave J | male | 35.00 | 0 | 0 | PC 17755 | 512.3292 | B101 | C |
| 439 | 0 | 1 | Fortune, Mr. Mark | male | 64.00 | 1 | 4 | 19950 | 263.0000 | C23 C25 C27 | S |
| 342 | 1 | 1 | Fortune, Miss. Alice Elizabeth | female | 24.00 | 3 | 2 | 19950 | 263.0000 | C23 C25 C27 | S |
| 89 | 1 | 1 | Fortune, Miss. Mabel Helen | female | 23.00 | 3 | 2 | 19950 | 263.0000 | C23 C25 C27 | S |
| 28 | 0 | 1 | Fortune, Mr. Charles Alexander | male | 19.00 | 3 | 2 | 19950 | 263.0000 | C23 C25 C27 | S |
| 743 | 1 | 1 | Ryerson, Miss. Susan Parker "Suzette" | female | 21.00 | 2 | 2 | PC 17608 | 262.3750 | B57 B59 B63 B66 | C |
| 312 | 1 | 1 | Ryerson, Miss. Emily Borie | female | 18.00 | 2 | 2 | PC 17608 | 262.3750 | B57 B59 B63 B66 | C |
| 300 | 1 | 1 | Baxter, Mrs. James (Helene DeLaudeniere Chaput) | female | 50.00 | 0 | 1 | PC 17558 | 247.5208 | B58 B60 | C |
| 119 | 0 | 1 | Baxter, Mr. Quigg Edmond | male | 24.00 | 0 | 1 | PC 17558 | 247.5208 | B58 B60 | C |
| 381 | 1 | 1 | Bidois, Miss. Rosalie | female | 42.00 | 0 | 0 | PC 17757 | 227.5250 | NaN | C |
| 717 | 1 | 1 | Endres, Miss. Caroline Louise | female | 38.00 | 0 | 0 | PC 17757 | 227.5250 | C45 | C |
| 701 | 1 | 1 | Astor, Mrs. John Jacob (Madeleine Talmadge Force) | female | 18.00 | 1 | 0 | PC 17757 | 227.5250 | C62 C64 | C |
| 558 | 0 | 1 | Robbins, Mr. Victor | male | NaN | 0 | 0 | PC 17757 | 227.5250 | NaN | C |
| 528 | 0 | 1 | Farthing, Mr. John | male | NaN | 0 | 0 | PC 17483 | 221.7792 | C95 | S |
| 378 | 0 | 1 | Widener, Mr. Harry Elkins | male | 27.00 | 0 | 2 | 113503 | 211.5000 | C82 | C |
| 780 | 1 | 1 | Robert, Mrs. Edward Scott (Elisabeth Walton Mc... | female | 43.00 | 0 | 1 | 24160 | 211.3375 | B3 | S |
| 731 | 1 | 1 | Allen, Miss. Elisabeth Walton | female | 29.00 | 0 | 0 | 24160 | 211.3375 | B5 | S |
| 690 | 1 | 1 | Madill, Miss. Georgette Alexandra | female | 15.00 | 0 | 1 | 24160 | 211.3375 | B5 | S |
| 857 | 1 | 1 | Wick, Mrs. George Dennick (Mary Hitchcock) | female | 45.00 | 1 | 1 | 36928 | 164.8667 | NaN | S |
| 319 | 1 | 1 | Wick, Miss. Mary Natalie | female | 31.00 | 0 | 2 | 36928 | 164.8667 | C7 | S |
| 269 | 1 | 1 | Graham, Mrs. William Thompson (Edith Junkins) | female | 58.00 | 0 | 1 | PC 17582 | 153.4625 | C125 | S |
| 610 | 1 | 1 | Shutes, Miss. Elizabeth W | female | 40.00 | 0 | 0 | PC 17582 | 153.4625 | C125 | S |
| 333 | 0 | 1 | Graham, Mr. George Edward | male | 38.00 | 0 | 1 | PC 17582 | 153.4625 | C91 | S |
| 499 | 0 | 1 | Allison, Mrs. Hudson J C (Bessie Waldo Daniels) | female | 25.00 | 1 | 2 | 113781 | 151.5500 | C22 C26 | S |
| 709 | 1 | 1 | Cleaver, Miss. Alice | female | 22.00 | 0 | 0 | 113781 | 151.5500 | NaN | S |
| 298 | 0 | 1 | Allison, Miss. Helen Loraine | female | 2.00 | 1 | 2 | 113781 | 151.5500 | C22 C26 | S |
| 306 | 1 | 1 | Allison, Master. Hudson Trevor | male | 0.92 | 1 | 2 | 113781 | 151.5500 | C22 C26 | S |
| 196 | 1 | 1 | Lurette, Miss. Elise | female | 58.00 | 0 | 0 | PC 17569 | 146.5208 | B80 | C |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 612 | 0 | 3 | Jardin, Mr. Jose Neto | male | NaN | 0 | 0 | SOTON/O.Q. 3101305 | 7.0500 | NaN | S |
| 478 | 0 | 3 | Braund, Mr. Lewis Richard | male | 29.00 | 1 | 0 | 3460 | 7.0458 | NaN | S |
| 130 | 0 | 3 | Ekstrom, Mr. Johan | male | 45.00 | 0 | 0 | 347061 | 6.9750 | NaN | S |
| 805 | 1 | 3 | Hedman, Mr. Oskar Arvid | male | 27.00 | 0 | 0 | 347089 | 6.9750 | NaN | S |
| 826 | 0 | 3 | Flynn, Mr. John | male | NaN | 0 | 0 | 368323 | 6.9500 | NaN | Q |
| 412 | 0 | 3 | Hart, Mr. Henry | male | NaN | 0 | 0 | 394140 | 6.8583 | NaN | Q |
| 144 | 0 | 3 | Burke, Mr. Jeremiah | male | 19.00 | 0 | 0 | 365222 | 6.7500 | NaN | Q |
| 655 | 0 | 3 | Hegarty, Miss. Hanora "Nora" | female | 18.00 | 0 | 0 | 365226 | 6.7500 | NaN | Q |
| 203 | 0 | 3 | Johanson, Mr. Jakob Alfred | male | 34.00 | 0 | 0 | 3101264 | 6.4958 | NaN | S |
| 372 | 0 | 3 | Wiklund, Mr. Jakob Alfred | male | 18.00 | 1 | 0 | 3101267 | 6.4958 | NaN | S |
| 819 | 0 | 3 | Holm, Mr. John Fredrik Alexander | male | 43.00 | 0 | 0 | C 7075 | 6.4500 | NaN | S |
| 844 | 0 | 3 | Lemberopolous, Mr. Peter L | male | 34.50 | 0 | 0 | 2683 | 6.4375 | NaN | C |
| 327 | 0 | 3 | Nysveen, Mr. Johan Hansen | male | 61.00 | 0 | 0 | 345364 | 6.2375 | NaN | S |
| 873 | 0 | 1 | Carlsson, Mr. Frans Olof | male | 33.00 | 0 | 0 | 695 | 5.0000 | B51 B53 B55 | S |
| 379 | 0 | 3 | Betros, Mr. Tannous | male | 20.00 | 0 | 0 | 2648 | 4.0125 | NaN | C |
| 598 | 0 | 3 | Johnson, Mr. Alfred | male | 49.00 | 0 | 0 | LINE | 0.0000 | NaN | S |
| 264 | 0 | 1 | Harrison, Mr. William | male | 40.00 | 0 | 0 | 112059 | 0.0000 | B94 | S |
| 807 | 0 | 1 | Andrews, Mr. Thomas Jr | male | 39.00 | 0 | 0 | 112050 | 0.0000 | A36 | S |
| 823 | 0 | 1 | Reuchlin, Jonkheer. John George | male | 38.00 | 0 | 0 | 19972 | 0.0000 | NaN | S |
| 180 | 0 | 3 | Leonard, Mr. Lionel | male | 36.00 | 0 | 0 | LINE | 0.0000 | NaN | S |
| 272 | 1 | 3 | Tornquist, Mr. William Henry | male | 25.00 | 0 | 0 | LINE | 0.0000 | NaN | S |
| 303 | 0 | 3 | Johnson, Mr. William Cahoone Jr | male | 19.00 | 0 | 0 | LINE | 0.0000 | NaN | S |
| 278 | 0 | 2 | Parkes, Mr. Francis "Frank" | male | NaN | 0 | 0 | 239853 | 0.0000 | NaN | S |
| 414 | 0 | 2 | Cunningham, Mr. Alfred Fleming | male | NaN | 0 | 0 | 239853 | 0.0000 | NaN | S |
| 467 | 0 | 2 | Campbell, Mr. William | male | NaN | 0 | 0 | 239853 | 0.0000 | NaN | S |
| 482 | 0 | 2 | Frost, Mr. Anthony Wood "Archie" | male | NaN | 0 | 0 | 239854 | 0.0000 | NaN | S |
| 634 | 0 | 1 | Parr, Mr. William Henry Marsh | male | NaN | 0 | 0 | 112052 | 0.0000 | NaN | S |
| 675 | 0 | 2 | Watson, Mr. Ennis Hastings | male | NaN | 0 | 0 | 239856 | 0.0000 | NaN | S |
| 733 | 0 | 2 | Knight, Mr. Robert J | male | NaN | 0 | 0 | 239855 | 0.0000 | NaN | S |
| 816 | 0 | 1 | Fry, Mr. Richard | male | NaN | 0 | 0 | 112058 | 0.0000 | B102 | S |
891 rows × 12 columns
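[Think] After sorting, suppose we look only at the age and fare columns. Common sense suggests that a higher fare means a better cabin, and the sorted output shows that 14 of the 20 passengers with the highest fares survived, which is a very high proportion. Could we then go on to analyse the relationship between fare and survival, or between age and survival? Once you start noticing relationships in the data, the data analysis has begun.
Of course, this is only my view. You may have many more ideas of your own, and you are welcome to write them in your study notes.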
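As a starting point for the fare-versus-survival question raised above, the following minimal sketch compares survival rates at the two ends of the fare ranking. It uses only eight rows copied from the sorted output rather than the full dataset, and the English column names (`PassengerId`, `Survived`, `Fare`) follow the original Kaggle file, which is an assumption since this tutorial works with translated column names.

```python
import pandas as pd

# Illustrative sample: eight rows copied from the sorted output above.
# Column names follow the original Kaggle file (an assumption).
sample = pd.DataFrame({
    "PassengerId": [743, 312, 300, 119, 612, 478, 130, 805],
    "Survived":    [1, 1, 1, 0, 0, 0, 0, 1],
    "Fare":        [262.3750, 262.3750, 247.5208, 247.5208,
                    7.0500, 7.0458, 6.9750, 6.9750],
})

# Sort by fare, highest first.
by_fare = sample.sort_values("Fare", ascending=False)

# Survived is 0/1, so its mean is the survival rate of the selected rows.
top_rate = by_fare.head(4)["Survived"].mean()
bottom_rate = by_fare.tail(4)["Survived"].mean()

print("survival rate, 4 highest fares:", top_rate)
print("survival rate, 4 lowest fares: ", bottom_rate)
```

On the full dataset the same pattern applies with `head(20)` or any other cut-off; the eight rows here only demonstrate the mechanics and say nothing about the real proportions.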
The relationship between survival and sex
Try sorting by a few more columns
# Code
# Sort keys: number of siblings/spouses, number of parents/children, sex
train_data.sort_values(['兄弟姐妹個數','父母子女個數','性別'], ascending=False).head(20)
| 160 | 0 | 3 | Sage, Master. Thomas Henry | male | NaN | 8 | 2 | CA. 2343 | 69.5500 | NaN | S |
| 202 | 0 | 3 | Sage, Mr. Frederick | male | NaN | 8 | 2 | CA. 2343 | 69.5500 | NaN | S |
| 325 | 0 | 3 | Sage, Mr. George John Jr | male | NaN | 8 | 2 | CA. 2343 | 69.5500 | NaN | S |
| 847 | 0 | 3 | Sage, Mr. Douglas Bullen | male | NaN | 8 | 2 | CA. 2343 | 69.5500 | NaN | S |
| 181 | 0 | 3 | Sage, Miss. Constance Gladys | female | NaN | 8 | 2 | CA. 2343 | 69.5500 | NaN | S |
| 793 | 0 | 3 | Sage, Miss. Stella Anna | female | NaN | 8 | 2 | CA. 2343 | 69.5500 | NaN | S |
| 864 | 0 | 3 | Sage, Miss. Dorothy Edith "Dolly" | female | NaN | 8 | 2 | CA. 2343 | 69.5500 | NaN | S |
| 60 | 0 | 3 | Goodwin, Master. William Frederick | male | 11.0 | 5 | 2 | CA 2144 | 46.9000 | NaN | S |
| 387 | 0 | 3 | Goodwin, Master. Sidney Leonard | male | 1.0 | 5 | 2 | CA 2144 | 46.9000 | NaN | S |
| 481 | 0 | 3 | Goodwin, Master. Harold Victor | male | 9.0 | 5 | 2 | CA 2144 | 46.9000 | NaN | S |
| 684 | 0 | 3 | Goodwin, Mr. Charles Edward | male | 14.0 | 5 | 2 | CA 2144 | 46.9000 | NaN | S |
| 72 | 0 | 3 | Goodwin, Miss. Lillian Amy | female | 16.0 | 5 | 2 | CA 2144 | 46.9000 | NaN | S |
| 183 | 0 | 3 | Asplund, Master. Clarence Gustaf Hugo | male | 9.0 | 4 | 2 | 347077 | 31.3875 | NaN | S |
| 262 | 1 | 3 | Asplund, Master. Edvin Rojj Felix | male | 3.0 | 4 | 2 | 347077 | 31.3875 | NaN | S |
| 851 | 0 | 3 | Andersson, Master. Sigvard Harald Elias | male | 4.0 | 4 | 2 | 347082 | 31.2750 | NaN | S |
| 69 | 1 | 3 | Andersson, Miss. Erna Alexandra | female | 17.0 | 4 | 2 | 3101281 | 7.9250 | NaN | S |
| 120 | 0 | 3 | Andersson, Miss. Ellis Anna Maria | female | 2.0 | 4 | 2 | 347082 | 31.2750 | NaN | S |
| 234 | 1 | 3 | Asplund, Miss. Lillian Gertrud | female | 5.0 | 4 | 2 | 347077 | 31.3875 | NaN | S |
| 542 | 0 | 3 | Andersson, Miss. Ingeborg Constanzia | female | 9.0 | 4 | 2 | 347082 | 31.2750 | NaN | S |
| 543 | 0 | 3 | Andersson, Miss. Sigrid Elisabeth | female | 11.0 | 4 | 2 | 347082 | 31.2750 | NaN | S |
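The sort above applies `ascending=False` to all three keys at once. `sort_values` also accepts one flag per key, which is sketched below on five rows taken from the output above. The English column names are an assumption based on the original Kaggle file.

```python
import pandas as pd

# Illustrative sample taken from the rows above; column names are an assumption.
sample = pd.DataFrame({
    "Name": ["Sage, Mr. Frederick", "Sage, Miss. Stella Anna",
             "Goodwin, Mr. Charles Edward", "Goodwin, Miss. Lillian Amy",
             "Asplund, Miss. Lillian Gertrud"],
    "Sex": ["male", "female", "male", "female", "female"],
    "SibSp": [8, 8, 5, 5, 4],
    "Parch": [2, 2, 2, 2, 2],
})

# One flag per sort key: SibSp and Parch descending, Sex ascending
# (as strings, "female" sorts before "male").
ordered = sample.sort_values(["SibSp", "Parch", "Sex"],
                             ascending=[False, False, True])

print(ordered[["Name", "Sex", "SibSp", "Parch"]])
```

With this setting the largest families still come first, but within each family the female passengers are listed before the male ones, the reverse of the output shown above.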
1.6.3 Task 3: Arithmetic with Pandas: compute the sum of two DataFrames
# See the "Arithmetic and Data Alignment" section in Chapter 5 of "Python for Data Analysis"
# Build two all-numeric DataFrames yourself
"""
Our example:
frame1_a = pd.DataFrame(np.arange(9.).reshape(3, 3), columns=['a', 'b', 'c'], index=['one', 'two', 'three'])
frame1_b = pd.DataFrame(np.arange(12.).reshape(4, 3), columns=['a', 'e', 'c'], index=['first', 'one', 'two', 'second'])
frame1_a
"""
# Code
frame1_a = pd.DataFrame(np.arange(9.).reshape(3, 3), columns=['a', 'b', 'c'], index=['one', 'two', 'three'])
frame1_b = pd.DataFrame(np.arange(12.).reshape(4, 3), columns=['a', 'e', 'c'], index=['first', 'one', 'two', 'second'])
Add frame1_a and frame1_b together
# Code
frame1_a
| | a | b | c |
| one | 0.0 | 1.0 | 2.0 |
| two | 3.0 | 4.0 | 5.0 |
| three | 6.0 | 7.0 | 8.0 |
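[Reminder] Adding two DataFrames returns a new DataFrame. Values whose row and column labels match are added together; any position that exists in only one of the two frames becomes the missing value NaN.
DataFrames support many other arithmetic operations as well, such as subtraction and division. If you are interested, read the "Arithmetic and Data Alignment" section in Chapter 5 of "Python for Data Analysis" and look for further learning material online.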
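frame1_b
| | a | e | c |
| first | 0.0 | 1.0 | 2.0 |
| one | 3.0 | 4.0 | 5.0 |
| two | 6.0 | 7.0 | 8.0 |
| second | 9.0 | 10.0 | 11.0 |
frame1_a + frame1_b
| | a | b | c | e |
| first | NaN | NaN | NaN | NaN |
| one | 3.0 | NaN | 7.0 | NaN |
| second | NaN | NaN | NaN | NaN |
| three | NaN | NaN | NaN | NaN |
| two | 9.0 | NaN | 13.0 | NaN |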
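The alignment behaviour described in the reminder can be checked directly, and the `add` method offers a `fill_value` argument for cases where NaN is not wanted. The sketch below rebuilds the two frames from this task; the variable names `plain_sum` and `filled_sum` are illustrative only.

```python
import numpy as np
import pandas as pd

frame1_a = pd.DataFrame(np.arange(9.).reshape(3, 3),
                        columns=["a", "b", "c"],
                        index=["one", "two", "three"])
frame1_b = pd.DataFrame(np.arange(12.).reshape(4, 3),
                        columns=["a", "e", "c"],
                        index=["first", "one", "two", "second"])

# Plain + leaves NaN wherever a label exists in only one of the frames.
plain_sum = frame1_a + frame1_b

# add(..., fill_value=0) treats a value missing from ONE frame as 0.
# A cell missing from BOTH frames (e.g. row "first", column "b") stays NaN.
filled_sum = frame1_a.add(frame1_b, fill_value=0)

print(plain_sum)
print(filled_sum)
```

The same pattern is available for the other operators through `sub`, `mul` and `div`, each of which also takes `fill_value`.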
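1.6.4 Task 4: Using the Titanic data, how do we find the size of the largest family on board?
'''
We are still using the chinese_train.csv loaded earlier.
If we want to know how many people are in the largest family on board ('兄弟姐妹個數' + '父母子女個數', i.e. siblings/spouses plus parents/children), what should we do?
'''
max(train_data['兄弟姐妹個數'] + train_data['父母子女個數'])
10
[Reminder] We only need to find the largest sum of "兄弟姐妹個數" and "父母子女個數". You can of course come up with other methods and angles, and you are welcome to share your view.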
Try adding a few more columns together and see what you can learn from the results.
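One way to take the family-size calculation a step further is to find out which passenger the maximum belongs to. The sketch below uses four rows from the tables earlier in this section, with English column names assumed from the original Kaggle file. Note that the sum of the two columns counts relatives only, so the passenger is not included in it.

```python
import pandas as pd

# Illustrative sample; names and counts are copied from rows shown earlier.
sample = pd.DataFrame({
    "Name": ["Sage, Master. Thomas Henry",
             "Goodwin, Master. William Frederick",
             "Asplund, Master. Clarence Gustaf Hugo",
             "Braund, Mr. Lewis Richard"],
    "SibSp": [8, 5, 4, 1],
    "Parch": [2, 2, 2, 0],
})

# Relatives on board = siblings/spouses + parents/children.
relatives = sample["SibSp"] + sample["Parch"]

largest = relatives.max()                     # the value max(...) returns above
who = sample.loc[relatives.idxmax(), "Name"]  # row that holds the maximum
family_size = largest + 1                     # relatives plus the passenger

print(largest, who, family_size)
```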
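1.6.5 Task 5: Learn to use the Pandas describe() function to view basic summary statistics
# (1) Work through the key example once (simple data)
# See the "Summarizing and Computing Descriptive Statistics" section in Chapter 5 of "Python for Data Analysis"
# Build a DataFrame yourself that contains both numbers and missing values
"""
Our example:
frame2 = pd.DataFrame([[1.4, np.nan], [7.1, -4.5], [np.nan, np.nan], [0.75, -1.3]], index=['a', 'b', 'c', 'd'], columns=['one', 'two'])
frame2
"""
# Code
frame2 = pd.DataFrame([[1.4, np.nan], [7.1, -4.5], [np.nan, np.nan], [0.75, -1.3]], index=['a', 'b', 'c', 'd'], columns=['one', 'two'])
frame2
| | one | two |
| a | 1.40 | NaN |
| b | 7.10 | -4.5 |
| c | NaN | NaN |
| d | 0.75 | -1.3 |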
Call describe() and look at the basic statistics of frame2
# Code
frame2.describe()
| | one | two |
| count | 3.000000 | 2.000000 |
| mean | 3.083333 | -2.900000 |
| std | 3.493685 | 2.262742 |
| min | 0.750000 | -4.500000 |
| 25% | 1.075000 | -3.700000 |
| 50% | 1.400000 | -2.900000 |
| 75% | 4.250000 | -2.100000 |
| max | 7.100000 | -1.300000 |
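The counts in the describe() output differ between the two columns because missing values are skipped. The following sketch makes that explicit on the same frame2 and checks one of the quartiles against `quantile`; the variable names `summary`, `missing` and `q25` are illustrative only.

```python
import numpy as np
import pandas as pd

frame2 = pd.DataFrame([[1.4, np.nan], [7.1, -4.5], [np.nan, np.nan], [0.75, -1.3]],
                      index=["a", "b", "c", "d"], columns=["one", "two"])

summary = frame2.describe()

# describe() ignores NaN, so "count" is the number of non-missing values.
missing = frame2.isna().sum()
print(summary.loc["count"])
print(missing)

# The "25%" row matches quantile(0.25), which interpolates linearly
# between the sorted non-missing values by default.
q25 = frame2["one"].quantile(0.25)
print(q25, summary.loc["25%", "one"])
```

Checking `isna().sum()` alongside describe() is a quick way to see how much data each statistic is actually based on.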
1.6.6 Task 6: Look at the basic statistics of the fare and parents/children columns in the Titanic dataset. What do you notice?
'''
Basic statistics of the 票價 (fare) column in the Titanic dataset
'''
# Code
train_data['票價'].describe()
count 891.000000
mean 32.204208
std 49.693429
min 0.000000
25% 7.910400
50% 14.454200
75% 31.000000
max 512.329200
Name: 票價, dtype: float64
train_data['父母子女個數'].describe()
count 891.000000
mean 0.381594
std 0.806057
min 0.000000
25% 0.000000
50% 0.000000
75% 0.000000
max 6.000000
Name: 父母子女個數, dtype: float64
[Think] What can we see from the data above? Try writing down your own view below, then compare it with the answer we give.
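[Think] From the data above we can see:
There are 891 fare values in total.
The mean fare is about 32.20.
The standard deviation is about 49.69, larger than the mean, so fares vary widely.
25% of passengers paid less than about 7.91, 50% paid less than about 14.45, and 75% paid no more than 31.00.
The maximum fare is about 512.33 and the minimum is 0.
At least 75% of passengers had no parents or children on board, so most people travelled without a parent or child. (This column does not count siblings or spouses.)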
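Of course, this answer is only my view. You may have many more ideas of your own, and you are welcome to write them in your study notes.
Compute statistics for a few more groups of columns and see what you can learn from them.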
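As one way to follow up on the observations above, describe() can be applied to several columns at once, and a boolean comparison turns a claim such as "most passengers had no parents or children on board" into a number. The sketch uses six rows copied from the sorted table earlier in this section, with English column names assumed from the original Kaggle file, so the resulting figures are illustrative and are not the statistics of the full dataset.

```python
import pandas as pd

# Illustrative sample: Fare and Parch for six passengers from the sorted table.
sample = pd.DataFrame({
    "Fare":  [262.3750, 247.5208, 211.5000, 7.0500, 6.9750, 0.0000],
    "Parch": [2, 1, 2, 0, 0, 0],
})

# describe() on a column subset returns one summary column per input column.
summary = sample[["Fare", "Parch"]].describe()
print(summary)

# The mean of a boolean Series is the share of True values.
no_parch_share = (sample["Parch"] == 0).mean()

# The sum of a boolean Series counts the True values.
free_tickets = (sample["Fare"] == 0).sum()

print(no_parch_share, free_tickets)
```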
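# Write down your other analyses
[Think] If you have more ideas, you are welcome to write them in your study notes.
[Summary] In this section we used some of Pandas' built-in functions to take a first statistical look at the data. The most important part of this process is not memorising the functions but understanding the numbers they produce and building your own way of thinking about data analysis. That is also the main point of Chapter 1: after finishing it you should have a basic feel for the data and know what you are doing and why. In the following chapters we will start cleaning the data and analysing it further.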