Meet Your New Best Exploratory Data Analysis Friend
Visualization often plays a minimal role in the data science and model-building process, yet Tukey, the creator of Exploratory Data Analysis, specifically advocated for the heavy use of visualization to address the limitations of numerical indicators.
Everyone has heard, and understands, that a picture is worth a thousand words. By the same logic, a visualization of the data is worth at least as much as dozens of statistical metrics, from quartiles, means, and standard deviations to mean absolute errors, kurtosis, and entropy. Wherever there is an abundance of data, it is best understood when visualized.
Exploratory Data Analysis was created to investigate the data, emphasizing visualization because it was more informative. This short article will present one of the most useful tools in visual EDA and how to interpret it.
Seaborn's pairplot is magical: at its simplest, it gives us a rich and informative visual representation of the univariate and bivariate relationships within the data. For instance, consider the two pairplots below, each created with one line of code, sns.pairplot(data) (the second adding hue='species' as a parameter).
There’s so much information to be gleaned about the data, be it the success of classification (how much entropy/overlap is there between classes), potential results of a feature selection process, variance, and what the best choice of model may be, based on these observed attributes. The pairplot is like an unfolding of multidimensional space.
Usually, people stop at the one-liner pairplot, but with a few more lines or even words of code, we can reap even more information and insights.
For one, pairplots can get notoriously large. To select a subset of the variables to be displayed, use the vars parameter, which can be set to a list of variable names. For instance, sns.pairplot(data, vars=['a', 'b']) would only show the relationships between the two columns 'a' and 'b', namely a vs. a, a vs. b, b vs. a, and b vs. b. Alternatively, one can specify x_vars and y_vars (each a list) as the variables for each of those axes.
Setting the vars parameter, as in the first two plots, yields a symmetrical grid of plots:
The third plot sets the y-component to only one variable — ‘sepal_length’ — and the x-component to all the columns of the data. This returns the interactions between that one column and all other columns. Note that for the first column — when it is paired against itself — and the fifth column — where it is paired against a categorical variable, the scatterplot is not an appropriate plot. We’ll explore how to deal with this later.
By adding the kind='reg' keyword to your pairplot, you can get linear regression fits for the data. This is a great gauge of the linearity and variance of your data, which can inform decisions about which types of models, both supervised and unsupervised, to choose. Additionally, since pairplots are symmetrical, setting corner=True removes the upper-right half, which is a duplicate; this a) declutters the plot and b) reduces long loading times.
Regression plot (left) and corner plot (right)
The pairplot alone, however, is relatively limited in its ability to easily and intuitively display several relationships between variables. It is merely a convenience interface to PairGrid, which is the real generator behind pairplot. Handling visualization directly through PairGrid can yield valuable results.
Grids in seaborn are assigned to a variable, most commonly g (for grid). For instance, we may write g = sns.PairGrid(data). When grids are initialized they are completely empty, but they will be filled in with visualizations soon. The grid is a way to access and visualize cross-feature aspects of the data efficiently and cleanly.
We can use the grid's map methods to fill it with plots. For instance, calling g.map(sns.scatterplot) fills the grid with scatterplots. We can also pass the plotting function's own parameters through the mapping: in g.map(sns.kdeplot, shade=True), shade is a parameter of sns.kdeplot, not of the grid (recent seaborn releases spell this option fill=True). Because the grid already knows which columns belong in each cell, we only need to name the type of plot.
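A minimal sketch of the empty-then-filled workflow, using synthetic data (an assumption). To keep the example simple, the keyword pass-through is shown with scatterplot options (`s`, `alpha`) rather than the KDE option mentioned above:

```python
import matplotlib
matplotlib.use("Agg")  # headless backend; drop this line in a notebook
import numpy as np
import pandas as pd
import seaborn as sns

# Synthetic, iris-like data (hypothetical values).
rng = np.random.default_rng(3)
data = pd.DataFrame({
    "sepal_length": rng.normal(5.8, 0.8, 100),
    "petal_length": rng.normal(3.8, 1.7, 100),
})

g = sns.PairGrid(data)  # the grid starts out with empty axes
n_before = sum(len(ax.collections) for ax in g.axes.flat)

# Extra keywords are forwarded to the plotting function, not to the grid.
g.map(sns.scatterplot, s=15, alpha=0.6)
n_after = sum(len(ax.collections) for ax in g.axes.flat)
```

Counting the drawn collections before and after the call shows that `PairGrid(...)` only sets up the layout, and `map` does the drawing.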
Note that the diagonals are still scatterplots. We can change this by using g.map_offdiag(sns.scatterplot) for plots off the diagonal and g.map_diag(plt.hist) for plots on the diagonal. Note that we are able to use plotting functions from other libraries (here, matplotlib's plt.hist).
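The diagonal/off-diagonal split might look like the following sketch (synthetic data, assumed for illustration):

```python
import matplotlib
matplotlib.use("Agg")  # headless backend; drop this line in a notebook
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns

# Synthetic, iris-like data (hypothetical values).
rng = np.random.default_rng(4)
data = pd.DataFrame({
    "sepal_length": rng.normal(5.8, 0.8, 100),
    "petal_length": rng.normal(3.8, 1.7, 100),
})

g = sns.PairGrid(data)
g.map_offdiag(sns.scatterplot)  # bivariate view everywhere off the diagonal
g.map_diag(plt.hist)            # univariate histogram (matplotlib) on the diagonal
```

The histograms are drawn on separate diagonal axes, which the grid exposes as `g.diag_axes`.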
We can do one better. Since the top and bottom halves are identical, we can change the plot type between the top and bottom halves using g.map_upper and g.map_lower. In this example, we compare the fits of quadratic and linear regression on the same data by varying the order parameter in seaborn’s regression plot, regplot.
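One way to sketch the linear-versus-quadratic comparison is shown below. The data is synthetic, and `ci=None` is an added choice that skips the bootstrapped confidence bands so the example stays fast:

```python
import matplotlib
matplotlib.use("Agg")  # headless backend; drop this line in a notebook
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns

# Synthetic data with a mildly curved relationship (hypothetical values).
rng = np.random.default_rng(5)
x = rng.uniform(0, 4, 120)
data = pd.DataFrame({
    "feature_a": x,
    "feature_b": 1.0 + 0.5 * x**2 + rng.normal(0, 0.5, 120),
})

g = sns.PairGrid(data)
g.map_upper(sns.regplot, order=1, ci=None)  # linear fit above the diagonal
g.map_lower(sns.regplot, order=2, ci=None)  # quadratic fit below the diagonal
g.map_diag(plt.hist)
```

Because the two triangles show the same variable pairs, each mirrored pair of cells becomes a direct comparison of the two fits.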
To specify a hue, pass hue='species' when initializing the PairGrid. Note that we cannot write something like g.map(sns.scatterplot, hue='species'), because mapping only draws the data; it does not reprocess it. All the data is processed when the grid is initialized, so anything data-related must be specified at that point.
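A sketch of setting the hue at initialization, on synthetic data (assumed values). The `add_legend()` call is an extra step not mentioned above, added because a hand-built grid does not draw a legend on its own:

```python
import matplotlib
matplotlib.use("Agg")  # headless backend; drop this line in a notebook
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns

# Synthetic, iris-like data (hypothetical values).
rng = np.random.default_rng(6)
shift = np.repeat([0.0, 1.5, 3.0], 40)
data = pd.DataFrame({
    "sepal_length": 5.0 + 0.5 * shift + rng.normal(0, 0.4, 120),
    "petal_length": 1.5 + 1.3 * shift + rng.normal(0, 0.3, 120),
    "species": np.repeat(["setosa", "versicolor", "virginica"], 40),
})

# The grouping column is declared once, when the grid is created.
g = sns.PairGrid(data, hue="species")
g.map_offdiag(sns.scatterplot)
g.map_diag(plt.hist)
g.add_legend()
```

Every mapping applied afterwards picks up the same class colouring.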
Pairgrids are often used to build complex plots, but for the purposes of EDA, the operations covered should be enough.
With a few more lines of code, you’ve been able to maximize the information gained from the pairplot and pairgrids. Here are some tips to take away as much insight as you can from it.
- Look for curvatures and transformations (e.g. Tukey's ladder of powers) that can be used to improve model performance.
- Approach features by how well they work across their entire row or column. For example, petal_width and petal_length separate the classes well along their own axes, whichever feature they are paired with. The same cannot be said for sepal_width, where the classes overlap heavily along its axis. This means it provides less information, which may be good cause to run a feature importance analysis and remove it if it provides a negligible boost in predictive power.
- Find how much data points vary from a regression fit (you can try different degrees as well) to get a visual understanding of how stable/stationary the data is. If data points vary widely from the fit and/or a fit must have a high degree to fit the data well, using methods like standardization or normalization may be helpful.
- Spend a decent amount of time looking at visual bivariate representations of your data, playing around with comparisons and chart types. There are countless operations you can do to your data, and the purpose of EDA is not to give you answers but to spark your interest in taking a particular action. Data is different every time; no standard procedure fits all sizes.
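The tip about comparing regression fits of different degrees can also be checked numerically. The sketch below uses plain NumPy on synthetic curved data; the helper name `residual_std` is hypothetical and not part of seaborn:

```python
import numpy as np

# Synthetic data with a quadratic trend plus noise (hypothetical values).
rng = np.random.default_rng(0)
x = np.linspace(0, 4, 200)
y = 1.0 + 0.5 * x**2 + rng.normal(0, 0.3, x.size)

def residual_std(x, y, degree):
    """Spread of the data around a least-squares polynomial fit."""
    coeffs = np.polyfit(x, y, degree)
    return float(np.std(y - np.polyval(coeffs, x)))

linear_spread = residual_std(x, y, 1)     # straight-line fit
quadratic_spread = residual_std(x, y, 2)  # degree-2 fit

print(f"linear: {linear_spread:.3f}, quadratic: {quadratic_spread:.3f}")
```

A clearly smaller spread for the higher-degree fit is the numeric counterpart of seeing curvature in the pairplot.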
Translated from: https://towardsdatascience.com/meet-your-new-best-exploratory-data-analysis-friend-772a60864227