泰坦尼克号 数据分析_第1部分:泰坦尼克号-数据分析基础
泰坦尼克號 數據分析
My goal was to get a better understanding of how to work with tabular data so I challenged myself and started with the Titanic -project. I think this was an excellent way to learn the basics of data analysis with python.
我的目標是更好地了解如何使用表格數據,因此我挑戰自我并開始了Titanic項目。 我認為這是學習python數據分析基礎知識的絕佳方法。
You can find the competition here: https://www.kaggle.com/c/titanicI really recommend you to try it yourself if you want to learn how to analyze the data and build machine learning models.
您可以在這里找到比賽: https : //www.kaggle.com/c/titanic如果您想學習如何分析數據和建立機器學習模型,我真的建議您自己嘗試一下。
I started by uploading the packages:
我首先上傳了軟件包:
import pandas as pd import numpy as npimport matplotlib.pyplot as plt
import seaborn as sns
Pandas is a great package for tabular data analysis. Numpy provides a high-performance multidimensional array object and tools for working with these arrays. Matplotlib packages help you to generate plots, histograms, power spectra, bar charts, etc., with just a few lines of code. Seaborn is developed based on the Matplotlib library and it can be used to create attractive and informative statistical graphics.
Pandas是用于表格數據分析的出色軟件包。 Numpy提供了高性能的多維數組對象和用于處理這些數組的工具。 Matplotlib軟件包可幫助您僅用幾行代碼即可生成圖,直方圖,功率譜,條形圖等。 Seaborn是基于Matplotlib庫開發的,可用于創建引人入勝且內容豐富的統計圖形。
After loading these packages I loaded the data:
加載這些軟件包后,我加載了數據:
df=pd.read_csv("train.csv")Then I had a quick look at the data:
然后,我快速瀏覽了一下數據:
df.head()#This prints you the first 5 rows of the table
#If you want to print 10 rows of the table instead of 5, then use
df.head(10)Screenshot of the first rows第一行的屏幕截圖 df.tail()
# This prints you out the last five rows of the table
I recommend starting with a look at the data so that you can be sure everything is as it should be. This is how you can avoid stupid mistakes in further analysis.
我建議先查看數據,以確保所有內容都應該是正確的。 這樣可以避免進一步分析中的愚蠢錯誤。
df.shape#This prints you the number of rows and columns
It is a good habit to print out the shape of the data in the beginning so you can check the number of columns and rows and be sure you haven’t missed any data during the analysis.
在開始時打印出數據的形狀是個好習慣,因此您可以檢查列數和行數,并確保在分析過程中沒有遺漏任何數據。
分析數據 (Analyze the data)
Then I continued to look at the data by counting the values. This gave me a lot of information about the content of the data.
然后,我繼續通過計算值來查看數據。 這給了我很多有關數據內容的信息。
df['Pclass'].value_counts()# Prints out count of classes valuesThe number of persons in each class. 3rd class was the most popular.每個班級的人數。 第三類是最受歡迎的。
I prefer using percentages to showcase values. It is easier to understand the values in percentages.
我更喜歡使用百分比來展示價值。 更容易理解百分比值。
df['Pclass'].value_counts(normalize=True)# same as above just that using "normalize=True" value is printed in percentages55% of people were in 3rd class55%的人在三等艙
I counted values for each column separately. In the future, I challenge myself to do the function which prints out values but it was not my scope in this project.
我分別計算每列的值。 將來,我會挑戰自己執行輸出值的功能,但這不是我在本項目中的工作范圍。
I wanted to understand also the values of different columns so I used the describe() method for that.
我還想了解不同列的值,因此我使用了describe()方法。
df['Fare'].describe()# describe() is used to view basic statistical details like count, mean, minimum and maximum values.“Fare” column values“票價”列值
Here you can see for example that the minimum price for the ticket was 0,00 $ and the maximum price was 512,33 $.
例如,在這里您可以看到門票的最低價格為0,00 $,最高價格為512,33 $。
I did several crosstables to understand which were the determinant values for the surviving.
我做了幾個交叉表,以了解哪些是生存的決定性價值。
pd.crosstab(df['Survived'], df['Sex'])# crosstable number of sex based on surviving.Here I also recommend using percentages instead of numerical values在這里,我還建議使用百分比而不是數值 pd.crosstab(df['Survived'], df['Sex'], normalize=True)
# Using "normalize=True", you get values in percentage.Same as above just in percentages與上面相同,只是百分比
Doing crosstables with different values gives you information about the possible correlations between the variables, for example, sex and surviving. As you can see, 26% of women survived and most of the men, 52%, didn’t survive.
使用不同的值進行交叉表可為您提供有關變量之間可能的相關性的信息,例如性別和存活率。 如您所見,有26%的女性幸存下來,而大多數男性(52%)沒有幸存。
可視化數據 (Visualize the data)
It is nice to have numerical values in tables but it is easier to understand the visualized data, at least for me. This is why I plotted histograms and bar charts. By creating histograms and bar charts I learned how to visualize the data. Here are a few examples:
在表格中有數值很高興,但至少對于我來說,更容易理解可視化數據。 這就是為什么我繪制直方圖和條形圖的原因。 通過創建直方圖和條形圖,我學習了如何可視化數據。 這里有一些例子:
df.hist(column='Age')In this histogram, you can see that passengers were mostly 20–40 years old.在此直方圖中,您可以看到乘客的年齡大多為20-40歲。I used seaborn library for the bar charts.
我使用seaborn庫制作條形圖。
sns.countplot(x='Sex', hue='Survived', data=df);More females survived than males.存活下來的女性多于男性。Also, I used a heatmap to see the correlation between different columns.
另外,我使用熱圖來查看不同列之間的相關性。
corrmat = df.corr()f, ax = plt.subplots(figsize=(12, 9))
sns.heatmap(corrmat, vmax=.8, annot=True, square=True, annot_kws={'size': 15});
Heatmap shows that there is a strong negative correlation between Fares and Classes, so that when one increases other decreases. It is logical because ticket prices in the 1st class are higher than in the 3rd class.
熱圖顯示,票價和艙位之間有很強的負相關性,因此當票價增加時,其他票價會下降。 這是合乎邏輯的,因為第一類的機票價格高于第三類的機票價格。
If we focus on analyzing the correlations between surviving and other values, we see that there is a strong positive correlation between surviving and fare. The probability to survive is higher when the ticket price has been higher.
如果我們專注于分析幸存值與其他值之間的相關性,我們會發現幸存率和票價之間存在很強的正相關性。 當門票價格較高時,生存的可能性較高。
You can find the project in Github. please feel free to try it yourself and comment if there is something that needs clarifying!
您可以在Github中找到該項目。 請隨時嘗試一下,如果有需要澄清的地方,請發表評論!
Thank you for the highly trained monkey (Risto Hinno) for motivating and inspiring me!
感謝您訓練有素的猴子( Risto Hinno )激勵和啟發我!
翻譯自: https://medium.com/swlh/part-1-titanic-basic-of-data-analysis-ab3025d29f6e
泰坦尼克號 數據分析
總結
以上是生活随笔為你收集整理的泰坦尼克号 数据分析_第1部分:泰坦尼克号-数据分析基础的全部內容,希望文章能夠幫你解決所遇到的問題。
- 上一篇: 孕妇梦到狗咬咬自己预示什么
- 下一篇: 梦到自己得绝症意味着什么