Titanic Data-Set Predictive Analysis: Exploratory Data Analysis, a Case Study on the Titanic Data-Set (Part 1)
Imagine your group of friends has decided to spend the vacation travelling to an amazing destination, and you have been given the responsibility of finding one. Interesting? However interesting it may seem, choosing a single location that everyone agrees on is still a hectic task. Various factors need to be considered: the cost of the journey should fit everyone's pocket, there should be proper accommodation options, and then there is the remoteness of the location, the activities available, the best time to visit, and so on.
To learn about all these factors you have to search the internet and gather information from various sources. After collecting it, you then have to compare the locations and find the right trade-off between these factors. This very activity of gathering, understanding and comparing data (possibly in a visual representation) to help make better decisions is known as Exploratory Data Analysis.
In this article we'll be exploring the data of the legendary 'Titanic'. We will use the 'Titanic: Machine Learning from Disaster' data-set on Kaggle to dive deep into it. The main objective of that data-set is to predict the survival of the passengers from the given attributes; here, however, we'll be exploring the data-set to find the hidden story that went down with the sinking of the Titanic. We'll unravel some amazing mysteries behind the sinking that you may hardly have heard of. So put on your detective hat and grab your magnifying glass, because we'll be exploring one of the most interesting data-sets in history!
The basic requirements for this project are a basic understanding of the Python language along with plotting libraries such as Matplotlib and Seaborn. If you don't know any of these, that's completely alright, because by the end of this article you will have a basic understanding of them. I recommend using 'Jupyter Notebook' or the simpler and better option, 'Google Colab'. You can learn how to set up Google Colab by following the simple steps at https://www.geeksforgeeks.org/how-to-use-google-colab/. After opening and setting up Google Colab, it's time to bring out your inner Data Scientist :)
I have also provided a link to the Google Colab notebook containing all of this code.
To import the data from Kaggle into Google Colab, follow these steps:
1. Choose the kaggle.json file that you have downloaded.
2. Make a directory named kaggle and copy the kaggle.json file into it.
3. Change the permissions of the file and download the data-set.
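The original code cells for these steps are not reproduced here, so what follows is only a minimal sketch of the copy-and-permissions part. It assumes the Kaggle command-line tool reads its token from a `.kaggle` folder in your home directory; the helper name `install_kaggle_token` is made up for illustration. The Colab upload call and the download command appear only as comments, since they need Colab and a network connection.

```python
import os
import shutil
import stat
from pathlib import Path


def install_kaggle_token(token_path, home_dir=None):
    """Copy kaggle.json into <home>/.kaggle and restrict it to the owner.

    `install_kaggle_token` is a hypothetical helper, not part of any library.
    """
    home = Path(home_dir) if home_dir else Path.home()
    target_dir = home / ".kaggle"  # assumed location of the Kaggle token
    target_dir.mkdir(parents=True, exist_ok=True)
    target = target_dir / "kaggle.json"
    shutil.copy(token_path, target)
    # Same effect as `chmod 600`: read/write for the owner only.
    os.chmod(target, stat.S_IRUSR | stat.S_IWUSR)
    return target


# In Colab, step 1 is usually an upload dialog:
#     from google.colab import files
#     files.upload()                      # pick kaggle.json
#     install_kaggle_token("kaggle.json")
# and the download itself is a shell command run in a notebook cell, e.g.
#     !kaggle competitions download -c titanic
```

Keeping the token readable by the owner only matters because kaggle.json holds your API key.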
Make sure the Drive is mounted and the appropriate folders are created. In my case, the folders Projects > datasets were created; we'll move the downloaded data-set there to avoid importing it from Kaggle repeatedly.
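As a rough sketch of that move, assuming the Drive is already mounted: the helper `stash_dataset` and the paths in the comments are illustrative guesses, not the author's exact code.

```python
import shutil
from pathlib import Path


def stash_dataset(downloaded_file, dataset_dir):
    """Create the target folder if needed and move the file into it.

    `stash_dataset` is a hypothetical helper name used only for this sketch.
    """
    dataset_dir = Path(dataset_dir)
    dataset_dir.mkdir(parents=True, exist_ok=True)
    destination = dataset_dir / Path(downloaded_file).name
    shutil.move(str(downloaded_file), str(destination))
    return destination


# In Colab the Drive is mounted first, and the folder path below is an assumption:
#     from google.colab import drive
#     drive.mount("/content/drive")
#     stash_dataset("train.csv", "/content/drive/MyDrive/Projects/datasets")
```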
Congratulations, you've successfully imported the data-set from Kaggle and stored it in your Google Drive. The next time we need the data-set, we can simply copy the path of the file and load it.
Now, let's start by importing the required libraries
Before jumping into EDA, let's first load the data-set and take a quick glance at it. We will need Pandas for loading the data, and may also require plotting libraries such as Matplotlib and Seaborn.
The data can be loaded as a Pandas DataFrame, and 'data.shape' gives the dimensions of the data: 891 is the number of records and 12 is the number of attributes. The path to 'train.csv' will vary with your file's location. You can copy the path of the file directly from the 'Files' section right below the 'Table of Contents' section in Colab (as of 2020).
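Here is a small sketch of the loading step. So that it runs anywhere, it reads three hand-made toy rows (not real passenger records) from an in-memory string instead of the real file; with the real data you would pass your own path to 'train.csv' to `pd.read_csv`.

```python
import io

import pandas as pd

# Stand-in for train.csv: three made-up rows with the data-set's 12 column names.
toy_csv = io.StringIO(
    "PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked\n"
    "1,0,3,Passenger A,male,30,0,0,T1,8.0,,S\n"
    "2,1,1,Passenger B,female,40,1,0,T2,70.0,C1,C\n"
    "3,1,2,Passenger C,female,,0,1,T3,15.0,,Q\n"
)

# With the real file: data = pd.read_csv("<your path to>/train.csv")
data = pd.read_csv(toy_csv)

# (rows, columns): (3, 12) for the toy rows, (891, 12) for the real train.csv.
print(data.shape)
```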
Viewing an entire data-set at once can be confusing, so let's view a sample of the data. 'data.head()' gives the first 5 and 'data.tail()' gives the last 5 records/rows of the dataframe, counted by row position.
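To see what the two calls return, here is a tiny sketch on a made-up dataframe of 12 rows (a stand-in for the real data, which has many more columns):

```python
import pandas as pd

# Toy frame: only an id column, 12 rows, purely for illustration.
data = pd.DataFrame({"PassengerId": range(1, 13)})

first_rows = data.head()   # first 5 rows by default
last_rows = data.tail()    # last 5 rows by default

print(first_rows["PassengerId"].tolist())  # [1, 2, 3, 4, 5]
print(last_rows["PassengerId"].tolist())   # [8, 9, 10, 11, 12]

# Both accept a row count if 5 is not what you want.
print(data.head(3)["PassengerId"].tolist())  # [1, 2, 3]
```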
Now, let's print the columns of the dataframe.
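A quick sketch of this step, using an empty dataframe that only carries the 12 attribute names described in this article (the real dataframe would of course come from 'train.csv'):

```python
import pandas as pd

attribute_names = [
    "PassengerId", "Survived", "Pclass", "Name", "Sex", "Age",
    "SibSp", "Parch", "Ticket", "Fare", "Cabin", "Embarked",
]

# Empty stand-in frame; only the column labels matter here.
data = pd.DataFrame(columns=attribute_names)

# data.columns is an Index object; tolist() turns it into a plain Python list.
print(data.columns.tolist())
print(len(data.columns))  # 12
```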
'data.info()' gives, for each attribute, the count of non-null (non-missing) values and the datatype. In this data-set, the attributes 'Age', 'Cabin' and 'Embarked' have some missing values. (The processing of these missing values will be done in later modules.)
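As a sketch, here is 'data.info()' on a toy dataframe with a few gaps put in on purpose (the values are made up, not taken from the real file). The `isnull().sum()` line is an extra that the article does not use; it is just a compact way to count the missing values per attribute.

```python
import pandas as pd

# Made-up rows; None marks a missing value.
data = pd.DataFrame({
    "Age": [22.0, None, 35.0, None],
    "Cabin": [None, "C85", None, None],
    "Embarked": ["S", "C", None, "S"],
    "Fare": [7.25, 71.28, 8.05, 53.1],
})

# Prints each column with its non-null count and dtype.
data.info()

# Number of missing values per column.
missing = data.isnull().sum()
print(missing[missing > 0])
```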
If the data-set has numerical data, 'data.describe()' can be used to get the count, mean, standard deviation and five-number summary, i.e. minimum, 25% (Q1), 50% (median), 75% (Q3) and maximum, of each numerical attribute.
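A small sketch with made-up ages shows what comes back. Notice that the text column is left out, because by default 'describe()' summarises only the numerical attributes when the dataframe has both kinds.

```python
import pandas as pd

# Toy values chosen so the summary is easy to check by hand.
data = pd.DataFrame({
    "Age": [10, 20, 30, 40, 50],
    "Sex": ["male", "female", "female", "male", "female"],
})

summary = data.describe()
print(summary)

# Individual statistics can be picked out by row label.
print(summary.loc["mean", "Age"])  # 30.0
print(summary.loc["25%", "Age"])   # 20.0 (Q1)
print(summary.loc["50%", "Age"])   # 30.0 (median)
```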
Understanding the data
Okay, so we've seen samples of the data. But what does each of the attributes denote? The description of the attributes is provided on Kaggle itself, but I'll try to explain it here to give a better gist of it.
There are a total of 891 instances, each consisting of 12 attributes. Here's brief information about what the data consists of:
1. PassengerId: A unique id given to each passenger in the data-set.
2. Survived: It denotes whether the passenger survived or not.
Here,
- 0 = Not Survived
- 1 = Survived
3. Pclass: Pclass represents the ticket class, which is also considered a proxy for socio-economic status (SES).
Here,
- 1 = Upper Class
- 2 = Middle Class
- 3 = Lower Class
4. Name: Name of the passenger.
5. Sex: Denotes the sex/gender of the passenger, i.e. 'male' or 'female'.
6. Age: Denotes the age of the passenger.
Note: If the passenger is a baby (less than 1 year old), the age is given as a fraction, e.g. 0.33. If the age is estimated, it is in the form xx.5, e.g. 18.5.
7. SibSp: It represents the number of siblings / spouses aboard the Titanic.
The data-set defines family relations in this way…
- Sibling = brother, sister, stepbrother, stepsister
- Spouse = husband, wife (mistresses and fiancés were ignored)
8. Parch: It represents the number of parents / children aboard the Titanic.
The data-set defines family relations in this way…
- Parent = mother, father
- Child = daughter, son, stepdaughter, stepson
- Some children travelled only with a nanny, therefore Parch = 0 for them.
9. Ticket: It represents the ticket number of the passenger.
10. Fare: It represents the passenger fare.
11. Cabin: It represents the cabin number.
12. Embarked: It represents the port of embarkation.
Here,
- C = Cherbourg
- Q = Queenstown
- S = Southampton
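The codes listed above can be turned into readable labels with a few dictionaries. This is only a sketch on made-up rows, and the two extra column names (`AgeIsFractional`, `AgeIsEstimated`) are invented here to illustrate the note on Age; they are not part of the data-set.

```python
import pandas as pd

# Made-up rows that only carry the coded attributes and Age.
data = pd.DataFrame({
    "Survived": [0, 1, 1],
    "Pclass": [3, 1, 2],
    "Embarked": ["S", "C", "Q"],
    "Age": [18.5, 0.33, 30.0],
})

survived_labels = {0: "Not Survived", 1: "Survived"}
pclass_labels = {1: "Upper Class", 2: "Middle Class", 3: "Lower Class"}
port_names = {"C": "Cherbourg", "Q": "Queenstown", "S": "Southampton"}

# map() swaps every code for its label; codes missing from the dict become NaN.
decoded = data.assign(
    Survived=data["Survived"].map(survived_labels),
    Pclass=data["Pclass"].map(pclass_labels),
    Embarked=data["Embarked"].map(port_names),
)

# Following the note on Age: below 1 means a baby, xx.5 means an estimate.
age = data["Age"]
decoded["AgeIsFractional"] = age < 1
decoded["AgeIsEstimated"] = (age >= 1) & (age % 1 == 0.5)

print(decoded)
```

Keeping the original `data` untouched and writing the labels into a copy makes it easy to go back to the numeric codes later, which most models will want.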
Okay, now that we have understood the data, let's move on to understand the relations between the attributes, find out which factors played a major role in the survival of a passenger, and predict whether you would have survived had you been on the Titanic. Click on the link to the next story to find out!
Link to the Notebook: Click Here
Link to Part 2 of the Blog: Click Here
Original article: https://medium.com/@bapreetam/exploratory-data-analysis-a-case-study-on-titanic-data-set-part-1-d1376b2a6cef