eda可视化_5用于探索性数据分析(EDA)的高级可视化
eda可視化
Early morning, a lady comes to meet Sherlock Holmes and Watson. Even before the lady opens her mouth and starts telling the reason for her visit, Sherlock can tell a lot about a person by his sheer power of observation and deduction. Similarly, we can deduce a lot about the data and relationship among the features before complex modelling and feeding the data to algorithms.
清晨,一位女士來(lái)見(jiàn)福爾摩斯和沃森。 甚至在這位女士張開(kāi)嘴并開(kāi)始說(shuō)出拜訪原因之前,Sherlock都能憑借其觀察力和推論的絕對(duì)能力來(lái)講述一個(gè)人的事。 同樣,在進(jìn)行復(fù)雜建模并將數(shù)據(jù)輸入算法之前,我們可以推斷出很多數(shù)據(jù)以及要素之間的關(guān)系。
Objective
目的
In this article, I will discuss five advanced data visualisation options to perform an advanced EDA and become Sherlock Holmes of data science. The goal is to deduce most on the relationship among different data points with minimal coding and quickest built-in options available.
在本文中,我將討論五個(gè)高級(jí)數(shù)據(jù)可視化選項(xiàng),以執(zhí)行高級(jí)EDA并成為數(shù)據(jù)科學(xué)的Sherlock Holmes。 目的是通過(guò)最少的編碼和最快的內(nèi)置選項(xiàng)來(lái)推斷不同數(shù)據(jù)點(diǎn)之間的關(guān)系。
Step 1: We will be using the seaborn package inbuilt datasets and advanced option to illustrate the advanced data visualisation.
步驟1:我們將使用seaborn軟件包內(nèi)置的數(shù)據(jù)集和高級(jí)選項(xiàng)來(lái)說(shuō)明高級(jí)數(shù)據(jù)可視化。
import seaborn as snsimport matplotlib.pyplot as plt
Step 2:Seaborn package comes with a few in-built datasets to quickly prototype a visualisation and evaluate its suitability for EDA with own data points. In the article, we will be using the seaborn “penguins” dataset. From the online seaborn repository, the dataset is loaded with load_dataset method. We can get the list of all the inbuilt Seaborn datasets with get_dataset_names() names method.
第2步: Seaborn軟件包附帶了一些內(nèi)置數(shù)據(jù)集,可快速創(chuàng)建可視化原型并使用自己的數(shù)據(jù)點(diǎn)評(píng)估其對(duì)EDA的適用性。 在本文中,我們將使用原始的“企鵝”數(shù)據(jù)集。 從在線Seaborn存儲(chǔ)庫(kù)中,使用load_dataset方法加載數(shù)據(jù)集。 我們可以使用get_dataset_names()names方法獲取所有內(nèi)置Seaborn數(shù)據(jù)集的列表。
In the below code, in FacetGrid method, dataset name i.e. “bird”, feature by which we want to organise the visualisation of the data i.e. “island”, and feature by which we want to group i.e. hue as “specifies” is mentioned as the parameter. Further, in the “map” method X and Y-axis of the scatter plot i.e. “flipper_length_mm”, and “body_mass_g” mentioned in the example below.
在下面的代碼中,在FacetGrid方法中,數(shù)據(jù)集名稱即“ bird”,我們要通過(guò)其組織數(shù)據(jù)可視化的功能(即“ island”)和我們要對(duì)其進(jìn)行分組(例如“ specify”的色相)的功能如下:參數(shù)。 此外,在“映射”方法中,在以下示例中提到的散點(diǎn)圖的X軸和Y軸,即“ flipper_length_mm”和“ body_mass_g”。
bird= sns.load_dataset("penguins")g = sns.FacetGrid(bird, col="island", hue="species")
g.map(plt.scatter, "flipper_length_mm", "body_mass_g", alpha=.6)
g.add_legend()
plt.show()
Data set visualisation based on the above code plots the data points organised by “island” and colour-coded by “species”.
基于以上代碼的數(shù)據(jù)集可視化將按“島嶼”組織并按“物種”按顏色編碼的數(shù)據(jù)點(diǎn)繪制成圖。
With one glance we can infer that “Gentoo” species are only present on the island “Biscoe”. Gentoo species is heavier and has a longer flipper length that other species. “Adelie” species is available on all three islands and “Chinstrap” is only available on the island “Dream”. You can see that with only 5 lines of code we can get so much information without any modelling.
乍一看,我們可以推斷出“ Gentoo”物種僅存在于“ Biscoe”島上。 Gentoo物種較重,并且鰭狀肢的長(zhǎng)度比其他物種更長(zhǎng)。 在所有三個(gè)島上都可以使用“阿德利”物種,而在“夢(mèng)”島上則可以使用“ Chinstrap”物種。 您會(huì)看到,僅用5行代碼,我們無(wú)需任何建模就可以獲得大量信息。
I will encourage you to post a comment on other information we can deduce from the below visualisation.
我鼓勵(lì)您對(duì)我們可以從下面的圖表中得出的其他信息發(fā)表評(píng)論。
Step 3: We want to quickly get an idea on the range of weights of the penguins by species and islands. Also, identify the concentration of the weight range.
步驟3:我們想快速了解按物種和島嶼劃分的企鵝體重范圍。 另外,確定體重范圍的濃度。
With a strip plot, we can plot the weight of the penguins organised by species for each the islands.
通過(guò)條形圖,我們可以繪制每個(gè)島嶼按物種組織的企鵝的體重。
sns.stripplot(x="island", y="body_mass_g", hue= "species", data=bird, palette="Set1")plt.show()
Strip plot is helpful to get an insight as long as plots are not overcrowded with densely populated data points. In the island “Dream” points are densely populated in the plot, and it is a bit difficult to get meaningful information from it.
只要圖塊不被人口稠密的數(shù)據(jù)點(diǎn)過(guò)度擁擠,帶狀圖就有助于獲得洞察力。 在島上,“夢(mèng)”點(diǎn)在圖中密密麻麻地填充著,很難從中獲取有意義的信息。
Swarmplot can help to visualise the range of weight of the penguins by species in each of the islands non-overlapping points.
Swarmplot可以幫助按島上每個(gè)非重疊點(diǎn)的物種可視化企鵝的體重范圍。
In the below code, we mention similar information like above in strip plot.
在下面的代碼中,我們?cè)趲顖D中提到了類(lèi)似的信息。
sns.swarmplot(x="island", y="body_mass_g", hue="species",data=bird, palette="Set1", dodge=True)
plt.show()
This improves the comprehension of the data points immensely in case of densely populated data. We can infer with a glance that “Adelie” weight ranges from approx 2500 to 4800 grams and a typical Gentoo species is heavier than Chinstrap species. I will leave you to perform other exploratory data analysis based on the below swarm plot.
在數(shù)據(jù)密集的情況下,這極大地改善了數(shù)據(jù)點(diǎn)的理解。 我們可以一目了然地推斷出“阿德利”的重量范圍約為2500至4800克,典型的Gentoo物種比Chinstrap物種重。 我將讓您根據(jù)以下群圖執(zhí)行其他探索性數(shù)據(jù)分析。
Step 4: Next, we would like to understand the relationship between body mass and culmen length of penguins in each of the island based on their sex.
步驟4:接下來(lái),我們想了解根據(jù)島嶼的性別,企鵝的體重與高短長(zhǎng)度之間的關(guān)系。
In the below code, in x and y parameter features between which we are interested in identifying, the relationship is mentioned. Hue is mentioned as “sex” as we want to learn the relation for male and female penguins separately.
在下面的代碼中,在我們希望識(shí)別的x和y參數(shù)特征中,提到了這種關(guān)系。 順化被稱為“性別”,因?yàn)槲覀兿敕謩e學(xué)習(xí)男女企鵝的關(guān)系。
sns.lmplot(x="body_mass_g", y="culmen_length_mm", hue="sex", col="island", markers=["o", "x"],palette="Set1",data=bird)plt.show()
We can gather that body mass and culmen length relationship in Biscoe island penguins is similar for both male and female sex. On the contrary, in the island Dream, the relationship trend is quite the opposite for male and female penguins. Are the body mass and culmen length relationship of the penguins in the island Dream is linear or not linear?
我們可以發(fā)現(xiàn),在Biscoe島上,企鵝的體重和長(zhǎng)度與男性和女性的相似。 相反,在“夢(mèng)島”中,男女企鵝的關(guān)系趨勢(shì)相反。 夢(mèng)島中企鵝的體重和宮長(zhǎng)長(zhǎng)度關(guān)系是線性的還是線性的?
We can visualise the polynomial relationship by specifying the parameter order in the lmplot method.
我們可以通過(guò)在lmplot方法中指定參數(shù)順序來(lái)可視化多項(xiàng)式關(guān)系。
In the case of heavily densely populated data points, we can further extend our exploratory data analysis by visualizing the body mass culmen relationship by male and female sex separately for each island.
在人口稠密的數(shù)據(jù)點(diǎn)的情況下,我們可以通過(guò)可視化每個(gè)島嶼分別以男性和女性性別劃分的身體標(biāo)本關(guān)系來(lái)進(jìn)一步擴(kuò)展探索性數(shù)據(jù)分析。
Visualization is organised by the col and row parameter mentioned in the code.
可視化由代碼中提到的col和row參數(shù)組織。
sns.lmplot(x="body_mass_g", y="culmen_length_mm", hue="sex", col="island",row="sex",order=2, markers=["o", "x"],palette="Set1",data=bird)plt.show()
Step 5: It helps to visualise a scatter plot and histogram side by side to get a holistic view of the spread of data points and also the frequency of observation at the same time. Joint plots are an efficient way to picture it. In the code below, x & y parameter is the feature between which we are trying to identify the relationship. As the data points are densely populated, hence we will plot a hexbin instead of a scatter plot. “Kind” parameter indicates the type of plot. If you want to know more about hexbin plot, then please refer my article 5 Powerful Visualisation with Pandas for Data Preprocessing.
第5步:這有助于并排可視化散點(diǎn)圖和直方圖,從而獲得數(shù)據(jù)點(diǎn)分布的整體視圖以及同時(shí)觀察的頻率。 聯(lián)合圖是描繪它的有效方法。 在下面的代碼中,x&y參數(shù)是我們?cè)噲D識(shí)別關(guān)系的功能。 由于數(shù)據(jù)點(diǎn)密集,因此我們將繪制六邊形而不是散點(diǎn)圖。 “種類(lèi)”參數(shù)指示圖的類(lèi)型。 如果您想了解更多關(guān)于hexbin圖的信息,請(qǐng)參閱我的文章5:使用Pandas進(jìn)行功能強(qiáng)大的可視化以進(jìn)行數(shù)據(jù)預(yù)處理 。
sns.set_palette("gist_rainbow_r")sns.jointplot(x="body_mass_g", y="culmen_length_mm", kind="hex",data=bird )
plt.show()
Based on the histogram we can infer that most numbers of penguins are between 3500 and 4500 grams. Can you deduce the most frequent culmen length range of the penguins?
根據(jù)直方圖,我們可以推斷出大多數(shù)企鵝數(shù)量在3500至4500克之間。 您能推斷出企鵝最常見(jiàn)的莖長(zhǎng)度范圍嗎?
We can also plot the individual data points (as shown in the right side plot) inside the hexbin plot with the code shown below.
我們還可以使用下面所示的代碼在hexbin圖內(nèi)繪制各個(gè)數(shù)據(jù)點(diǎn)(如右側(cè)圖所示)。
g = sns.jointplot(x="body_mass_g", y="culmen_length_mm", data=bird, kind="hex", color="c")g.plot_joint(plt.scatter, c="w", s=30, linewidth=1, marker="+")
g.set_axis_labels("Body Mass (in gram)", "Culmen Length ( in mm)")
plt.show()
Step 6: At last, we would like to get an overview of the spread and relationship among different features by island. Paitplots are very handy to visualise the scatter plot among different feature. The feature “island” is mentioned as the hue as we want to colour code the plot based on it.
第6步:最后,我們希望按島嶼概述不同要素之間的傳播和關(guān)系。 Paitplots非常便于查看不同特征之間的散點(diǎn)圖。 由于我們要根據(jù)其對(duì)地塊進(jìn)行顏色編碼,因此將特征“島”稱為色相。
sns.pairplot(bird, hue="island")plt.show()
We can see from the visualisation that in the island Biscoe most of the penguins have shallow culmen depth but are heavier than penguins in other islands. Similarly, we can conclude that flipper length is longer for most penguins in Biscoe island but have shallow culmen depth.
從可視化中我們可以看到,在比斯科島中,大多數(shù)企鵝的陰莖深度較淺,但比其他島嶼的企鵝重。 同樣,我們可以得出結(jié)論,在比斯科島上的大多數(shù)企鵝,鰭狀肢的長(zhǎng)度更長(zhǎng),但洞室深度卻較淺。
I hope you will use these advanced visualisations for the exploratory data analysis and get a sense of data points relationship before embarking any complex modelling exercise. I would love to hear your favourite visualization plots for EDA and also a list of conclusions we can draw from the examples illustrated in this article.
我希望您可以在進(jìn)行任何復(fù)雜的建模練習(xí)之前,將這些高級(jí)可視化用于探索性數(shù)據(jù)分析,并了解數(shù)據(jù)點(diǎn)之間的關(guān)系。 我很想聽(tīng)聽(tīng)您最喜歡的EDA可視化圖,以及我們可以從本文所示示例中得出的結(jié)論列表。
In case, you would like to learn data visualisation using pandas then please read by trending article on 5 Powerful Visualisation with Pandas for Data Preprocessing.
如果您想使用熊貓學(xué)習(xí)數(shù)據(jù)可視化,那么請(qǐng)閱讀趨勢(shì)文章5關(guān)于數(shù)據(jù)預(yù)處理的熊貓的強(qiáng)大可視化 。
If you are interested in learning different Scikit-Learn scalers then please do read my article Feature Scaling — Effect Of Different Scikit-Learn Scalers: Deep Dive
如果您有興趣學(xué)習(xí)其他Scikit-Learn潔牙機(jī),請(qǐng)閱讀我的文章Feature Scaling —不同Scikit-Learn潔牙機(jī)的效果:深入研究
"""Full Code"""import seaborn as snsimport matplotlib.pyplot as pltbird= sns.load_dataset("penguins")
g = sns.FacetGrid(bird, col="island", hue="species")
g.map(plt.scatter, "flipper_length_mm", "body_mass_g", alpha=.6)
g.add_legend()
plt.show()sns.stripplot(x="island", y="body_mass_g", hue= "species", data=bird, palette="Set1")
plt.show()sns.swarmplot(x="island", y="body_mass_g", hue="species",data=bird, palette="Set1", dodge=True)
plt.show()sns.lmplot(x="body_mass_g", y="culmen_length_mm", hue="sex", col="island", markers=["o", "x"],palette="Set1",data=bird)
plt.show()sns.lmplot(x="body_mass_g", y="culmen_length_mm", hue="sex", col="island",row="sex",order=2, markers=["o", "x"],palette="Set1",data=bird)
plt.show()sns.set_palette("gist_rainbow_r")
sns.jointplot(x="body_mass_g", y="culmen_length_mm", kind="hex",data=bird )
plt.show()g = sns.jointplot(x="body_mass_g", y="culmen_length_mm", data=bird, kind="hex", color="c")
g.plot_joint(plt.scatter, c="w", s=30, linewidth=1, marker="+")
g.set_axis_labels("Body Mass (in gram)", "Culmen Length ( in mm)")
plt.show()sns.pairplot(bird, hue="island")
plt.show()
翻譯自: https://towardsdatascience.com/5-advanced-visualisation-for-exploratory-data-analysis-eda-c8eafeb0b8cb
eda可視化
總結(jié)
以上是生活随笔為你收集整理的eda可视化_5用于探索性数据分析(EDA)的高级可视化的全部?jī)?nèi)容,希望文章能夠幫你解決所遇到的問(wèn)題。
- 上一篇: 梦到什么可以发财
- 下一篇: 梦到别人包粽子是什么意思