The Best Exploratory Data Analysis with Pandas Profiling
Introduction
There are countless ways to perform exploratory data analysis (EDA) in Python (and in R). I do most of mine in the popular Jupyter Notebook. Once I realized there was a library that could summarize my dataset with just one line of code, I made sure to use it for every project, and the ease of this EDA tool has saved me a great deal of effort. Data Scientists should perform EDA before fitting any Machine Learning model, and the developers of Pandas Profiling [2] have made it easy to view your dataset in a clean, well-organized format that also describes its contents well.
The Pandas Profiling report is an excellent EDA tool organized into the following sections: overview, variables, interactions, correlations, missing values, and a sample of your data. I will use randomly generated data as an example of this useful tool.
Overview
Overview example. Screenshot by Author [3].
The overview tab in the report provides a quick glance at how many variables and observations you have, that is, the number of columns and rows. It also calculates how many cells are missing and what percentage of the dataframe they represent. Additionally, it counts duplicate rows and reports their percentage. This tab is most similar to part of the describe function from Pandas, while providing a better user-interface (UI) experience.
The overview is broken into dataset statistics and variable types. You can also refer to the warnings and reproduction sections for more specific information on your data.
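The dataset statistics described above can be approximated by hand with plain pandas. This is only a minimal sketch, not how the report computes them internally; the small dataframe, the injected missing cell, and the duplicated row are hypothetical values chosen so that the numbers are not all zero.

```python
import numpy as np
import pandas as pd

# Hypothetical example data: 15 unique rows, 6 numeric columns.
df = pd.DataFrame(np.arange(90, dtype=float).reshape(15, 6), columns=list("ABCDEF"))
df.loc[0, "A"] = np.nan                                  # inject one missing cell
df = pd.concat([df, df.iloc[[1]]], ignore_index=True)    # append a copy of row 1

n_rows, n_cols = df.shape                                # observations, variables
missing_cells = int(df.isna().sum().sum())               # missing cells in the whole frame
missing_pct = 100 * missing_cells / df.size              # share of all cells
duplicate_rows = int(df.duplicated().sum())              # rows identical to an earlier row
duplicate_pct = 100 * duplicate_rows / n_rows

overview = {
    "variables": n_cols,
    "observations": n_rows,
    "missing cells": missing_cells,
    "missing cells (%)": round(missing_pct, 2),
    "duplicate rows": duplicate_rows,
    "duplicate rows (%)": round(duplicate_pct, 2),
}
print(overview)
```

Running this in a notebook next to the report is a quick way to sanity-check the overview numbers on your own data.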
Next, I will discuss variables, which are also referred to as the columns or features of your dataframe.
Variables
Variables example. Screenshot by Author [4].
To achieve more granularity in your descriptive statistics, the variables tab is the way to go. For each feature of your dataframe, you can look at distinct and missing counts as well as aggregations such as mean, min, and max. You can also see the type of data you are working with (e.g., NUM). Not pictured is what appears when you click on 'Toggle details': this toggle opens a whole set of additional statistics. The details include:
Statistics: quantile and descriptive
Quantile statistics:
Minimum
5th percentile
Q1
Median
Q3
95th percentile
Maximum
Range
Interquartile range (IQR)
Descriptive statistics:
Standard deviation
Coefficient of variation (CV)
Kurtosis
Mean
Median Absolute Deviation (MAD)
Skewness
Sum
Variance
Monotonicity
These statistics overlap with the output of the describe function that I see most Data Scientists using today; however, the report includes several more and presents them in an easy-to-view format.
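For comparison, here is a rough sketch of how the quantile and descriptive statistics listed above could be computed with plain pandas for a single variable. The series values are made up for illustration, and the report may use slightly different conventions (pandas uses the sample standard deviation and reports excess kurtosis), so small differences are to be expected.

```python
import pandas as pd

# Hypothetical values for one numeric variable.
s = pd.Series([3, 7, 7, 12, 18, 25, 31, 40, 55, 90], name="A", dtype=float)

q1, q3 = s.quantile(0.25), s.quantile(0.75)
quantile_stats = {
    "minimum": s.min(),
    "5th percentile": s.quantile(0.05),
    "Q1": q1,
    "median": s.median(),
    "Q3": q3,
    "95th percentile": s.quantile(0.95),
    "maximum": s.max(),
    "range": s.max() - s.min(),
    "IQR": q3 - q1,
}

descriptive_stats = {
    "standard deviation": s.std(),                    # sample std (ddof=1)
    "coefficient of variation": s.std() / s.mean(),
    "kurtosis": s.kurt(),                             # excess kurtosis (normal = 0)
    "mean": s.mean(),
    "median absolute deviation": (s - s.median()).abs().median(),
    "skewness": s.skew(),
    "sum": s.sum(),
    "variance": s.var(),
    "monotonic increasing": bool(s.is_monotonic_increasing),
}

for name, value in {**quantile_stats, **descriptive_stats}.items():
    print(f"{name}: {value}")
```

Calling s.describe() on the same series covers only the count, mean, standard deviation, minimum, quartiles, and maximum, which is why the report's details view feels more complete.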
Histograms
The histograms provide an easily digestible visual of your variables. You can expect to see the frequency of each value range on the y-axis and fixed-size bins on the x-axis (bins=15 is the default).
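To make the idea of fixed-size bins concrete, here is a small sketch that bins a variable into 15 equal-width intervals with NumPy. It only illustrates the binning, not the report's own plotting code, and the random values are hypothetical.

```python
import numpy as np

# Hypothetical variable: 500 random integers between 0 and 199.
rng = np.random.default_rng(42)
values = rng.integers(0, 200, size=500)

# 15 equal-width bins spanning the observed minimum and maximum.
counts, bin_edges = np.histogram(values, bins=15)

# Each bin is half-open [left, right), except the last one, which includes its right edge.
for left, right, count in zip(bin_edges[:-1], bin_edges[1:], counts):
    print(f"{left:7.2f} to {right:7.2f}: {count}")
```

The counts array is what ends up on the y-axis, and the bin edges define the x-axis.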
Common Values
The common values section lists the values that occur most often in your variable, along with their count and frequency.
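A similar table can be sketched with value_counts from pandas. The series below is a made-up example, and the column names are my own choice rather than the report's exact labels.

```python
import pandas as pd

# Hypothetical variable with repeated values.
s = pd.Series([10, 10, 10, 20, 20, 30, 40, 40, 40, 40], name="A")

counts = s.value_counts()  # sorted from most to least frequent
common = pd.DataFrame({
    "count": counts,
    "frequency (%)": 100 * counts / len(s),
})
print(common.head(10))     # the ten most common values
```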
Extreme Values
The extreme values section lists the smallest and largest values of your variable, along with their count and frequency.
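One way to reproduce this view by hand, as a rough sketch, is to count each distinct value and then look at both ends of the sorted index. The data and the choice of three values per end are assumptions for illustration only.

```python
import pandas as pd

# Hypothetical variable.
s = pd.Series([1, 1, 2, 3, 5, 8, 13, 21, 34, 55, 55, 55], name="A")

counts = s.value_counts().sort_index()   # distinct values in ascending order
smallest = counts.head(3)                # three smallest values with their counts
largest = counts.tail(3).iloc[::-1]      # three largest values, largest first

print("Minimum values:")
print(smallest)
print("Maximum values:")
print(largest)
```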
Interactions
Interactions example. Screenshot by Author [5].
The interactions feature of the profiling report is unique in that you can choose which of your columns to place on the x-axis and which on the y-axis. For example, pictured above is variable A against variable A, which is why you see the points overlapping. You can easily switch to other variables or columns to get a different plot and an excellent representation of your data points.
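Outside the report, a single interaction plot is essentially a scatter plot of one column against another. The sketch below assumes matplotlib is installed; plot_interaction is a hypothetical helper name of my own, not part of Pandas Profiling, and the off-screen backend is only there so the snippet also runs without a display (drop that line in a notebook).

```python
import matplotlib
matplotlib.use("Agg")  # off-screen rendering; remove this line inside Jupyter
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Hypothetical data shaped like the article's example dataframe.
rng = np.random.default_rng(1)
df = pd.DataFrame(rng.integers(0, 200, size=(15, 6)), columns=list("ABCDEF"))

def plot_interaction(frame, x, y):
    """Scatter one column against another (hypothetical helper)."""
    fig, ax = plt.subplots()
    ax.scatter(frame[x], frame[y])
    ax.set_xlabel(x)
    ax.set_ylabel(y)
    ax.set_title(f"{y} vs. {x}")
    return fig, ax

fig, ax = plot_interaction(df, "A", "B")
```

Plotting a column against itself, as in the screenshot, puts every point on the diagonal, so choosing two different columns is usually more informative.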
Correlations
Correlations example. Screenshot by Author [6].
Making fancier or more colorful correlation plots can be time-consuming if you build them line by line in Python. With this correlation plot, however, you can easily visualize the relationships between the variables in your data, nicely color-coded. There are four main plots that you can display:
Pearson’s r
Spearman’s ρ
Kendall’s τ
Phik (φk)
You may only be used to one of these correlation methods, so the others may sound confusing or unusable. For that reason, the correlation plot also comes with a toggle that explains the meaning of each correlation you can visualize. This feature really helps when you need a refresher on correlation, as well as when you are deciding which plot(s) to use for your analysis.
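If you want the numbers behind such a plot, pandas can compute correlation matrices directly. This is a minimal sketch on made-up data: it covers Pearson and Spearman only. Kendall is also available through method="kendall" but needs SciPy installed, and Phik is not part of pandas, so both are left out here.

```python
import numpy as np
import pandas as pd

# Hypothetical data: B depends strongly on A, C is independent noise.
rng = np.random.default_rng(7)
x = rng.normal(size=200)
df = pd.DataFrame({
    "A": x,
    "B": 2 * x + rng.normal(scale=0.5, size=200),
    "C": rng.normal(size=200),
})

pearson = df.corr(method="pearson")    # linear relationship
spearman = df.corr(method="spearman")  # monotonic (rank-based) relationship

print(pearson.round(2))
print(spearman.round(2))
```

Pearson measures linear association, while Spearman works on ranks and therefore also picks up monotonic but non-linear relationships.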
Missing Values
Missing Values example. Screenshot by Author [7].
As you can see from the plot above, the report also covers missing values. You can see how much of each variable is missing in both a count view and a matrix view. It is a nice way to inspect your data before you fit any models with it. Ideally, you want to see a plot like the one above, which means you have no missing values.
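The same information can be checked quickly with pandas. The sketch below uses a tiny hypothetical dataframe with a few missing cells, since the article's random data has none.

```python
import numpy as np
import pandas as pd

# Hypothetical data with some missing cells.
df = pd.DataFrame({
    "A": [1.0, np.nan, 3.0, 4.0],
    "B": [np.nan, np.nan, 7.0, 8.0],
    "C": [9, 10, 11, 12],
})

missing_count = df.isna().sum()        # missing cells per variable (count view)
missing_pct = 100 * df.isna().mean()   # share of each variable that is missing
missing_matrix = df.isna()             # True where a cell is missing (matrix view)

print(missing_count)
print(missing_pct)
print(missing_matrix)
```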
Sample
Sample example. Screenshot by Author [8].
Sample acts similarly to the head and tail functions, returning your dataframe’s first few rows or last few rows. In this example, you can see both the first rows and the last rows. I use this tab when I want a sense of where my data starts and where it ends. I recommend ranking or ordering your data first to get more benefit out of this tab, since the first and last rows then show the range of your data.
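Here is a short sketch of that ordering tip using pandas directly. The data is hypothetical, and sorting by column A is just one possible choice.

```python
import numpy as np
import pandas as pd

# Hypothetical data shaped like the article's example dataframe.
rng = np.random.default_rng(3)
df = pd.DataFrame(rng.integers(0, 200, size=(15, 6)), columns=list("ABCDEF"))

# Order by one column so the first and last rows show its range.
ordered = df.sort_values("A", ignore_index=True)

first_rows = ordered.head(10)  # rows with the smallest values of A
last_rows = ordered.tail(10)   # rows with the largest values of A

print(first_rows)
print(last_rows)
```

Passing the ordered dataframe to the report instead of the original one gives the Sample tab the same first and last rows.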
Summary
Photo by Elena Loshina on Unsplash [9].
I hope this article provided you with some inspiration for your next exploratory data analysis. Being a Data Scientist can be overwhelming, and EDA is often forgotten or practiced less than model-building. With the Pandas Profiling report, you can perform EDA with minimal code and still get useful statistics and visualizations. That way, you can focus on the fun part of Data Science and Machine Learning: the modeling process.
To summarize, the main features of the Pandas Profiling report include overview, variables, interactions, correlations, missing values, and a sample of your data.
Here is the code I used to install and import the libraries, generate some dummy data for the example, and finally, the one line of code that generates the Pandas Profiling report from your Pandas dataframe [10].
# install library
#!pip install pandas_profiling
import pandas_profiling  # importing adds profile_report() to DataFrames
import pandas as pd
import numpy as np
# create data: 15 rows and 6 columns of random integers from 0 to 199
df = pd.DataFrame(np.random.randint(0, 200, size=(15, 6)), columns=list('ABCDEF'))
# run your report!
df.profile_report()
# I did get an error and had to reinstall matplotlib to fix it
Please feel free to comment below if you have any questions or have used this feature before. There is still some information I did not describe, but you can find more of it through the Pandas Profiling link I provided above.
Thank you for reading. I hope you enjoyed it!
Translated from: https://towardsdatascience.com/the-best-exploratory-data-analysis-with-pandas-profiling-e85b4d514583