EDA on Categorical and Ordinal Data
STATISTICS FOR DATA SCIENCE AND MACHINE LEARNING
Categorical variables are those whose possible values come from a set of options, which can be predefined or open-ended. An example is a person's gender. Ordinal variables are categorical variables whose options can be ordered by some rule, as in the Likert scale:
- Like
- Like Somewhat
- Neutral
- Dislike Somewhat
- Dislike
To keep the examples simple, we will use a single data set throughout: a group of students who have each passed or failed 2 distinct exams. The results are represented in the following RxC table:
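The table itself is an image in the original post. A minimal sketch of it in Python follows, assuming cell counts inferred from the totals quoted later in the article (25 students passed exam 2, 30 passed exam 1, 10 passed exam 1 but not exam 2, 85 students in total). The variable names are illustrative only.

```python
# Cell counts inferred from the totals quoted in the article (an assumption,
# since the original table is an image). Keys are (exam 1 result, exam 2 result).
table = {
    ("pass", "pass"): 20,
    ("pass", "fail"): 10,
    ("fail", "pass"): 5,
    ("fail", "fail"): 50,
}

labels = ("pass", "fail")

# Marginal totals per row (exam 1) and per column (exam 2).
row_totals = {r: sum(table[(r, c)] for c in labels) for r in labels}
col_totals = {c: sum(table[(r, c)] for r in labels) for c in labels}
n = sum(table.values())

# Print the table with exam 1 as rows and exam 2 as columns.
print(f"{'exam 1 / exam 2':<16}{'pass':>6}{'fail':>6}{'total':>7}")
for r in labels:
    print(f"{r:<16}{table[(r, 'pass')]:>6}{table[(r, 'fail')]:>6}{row_totals[r]:>7}")
print(f"{'total':<16}{col_totals['pass']:>6}{col_totals['fail']:>6}{n:>7}")
```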
[Table: The example used in the whole article, self-generated.]
Statisticians have developed specific techniques to analyze this kind of data. The most important are:
Measures of Agreement
Percent Agreement
Percent agreement is calculated as the number of ratings that fall in a given class divided by the total number of ratings.
[Table: Adding totals to the example, self-generated.]
- The percent agreement for passing exam 2 is 25/(25+60) = 0.294, so 29.4%.
- The percent agreement for passing exam 1 is 30/85 = 0.353, so 35.3%.
- The percent agreement for passing exam 1 and not passing exam 2 is 10/85 = 0.118, so 11.8%.
The problem with percent agreement is that it does not account for the agreement that could occur purely by chance.
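The three percentages above can be reproduced with a few lines of Python. This is a minimal sketch that uses the counts quoted in the text; the function name `percent_agreement` is our own, not a library call.

```python
def percent_agreement(count, total):
    """Share of cases that fall in a given class."""
    return count / total


n = 85  # total number of students (25 + 60)

pass_exam_2 = percent_agreement(25, n)
pass_exam_1 = percent_agreement(30, n)
pass_1_fail_2 = percent_agreement(10, n)

print(f"Pass exam 2: {pass_exam_2:.1%}")                  # Pass exam 2: 29.4%
print(f"Pass exam 1: {pass_exam_1:.1%}")                  # Pass exam 1: 35.3%
print(f"Pass exam 1, fail exam 2: {pass_1_fail_2:.1%}")   # Pass exam 1, fail exam 2: 11.8%
```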
Cohen’s Kappa
[Table: The example used in the whole article, self-generated.]
To overcome the problems of percent agreement, we calculate Kappa as:
K = (P0 - Pe) / (1 - Pe)
where P0 is the observed agreement and Pe is the agreement expected by chance, calculated as:
P0 = (number of cases in agreement) / n
Pe = sum over classes of (row total x column total) / n^2
with n the total number of cases. In our example:
P0 = 70/85 = 0.82
Pe = 30 x 25 / 85^2 + 55 x 60 / 85^2 = 0.56
K = (0.82 - 0.56) / (1 - 0.56) = 0.26 / 0.44 = 0.59
Kappa can range from -1 to 1: 0 means that the observed agreement equals the agreement expected by chance, 1 means that all cases are in agreement, and negative values mean that agreement is lower than expected by chance, with -1 as the limit of complete disagreement.
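The calculation can be sketched in Python as follows. The cell counts (20, 10, 5, 50) are an assumption inferred from the totals quoted in the article, and `cohens_kappa` is an illustrative helper, not a library function.

```python
def cohens_kappa(table):
    """Cohen's kappa for a square contingency table given as a list of rows."""
    n = sum(sum(row) for row in table)
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    # Observed agreement: cases on the diagonal (same result in both exams).
    p0 = sum(table[i][i] for i in range(len(table))) / n
    # Expected agreement: product of matching marginals, summed over classes.
    pe = sum(r * c for r, c in zip(row_totals, col_totals)) / n ** 2
    return (p0 - pe) / (1 - pe)


# Rows: exam 1 (pass, fail); columns: exam 2 (pass, fail). Inferred counts.
table = [[20, 10], [5, 50]]
kappa = cohens_kappa(table)
print(f"kappa = {kappa:.3f}")  # kappa = 0.598
```

Without rounding the intermediate values, the result is about 0.60; the 0.59 in the text comes from rounding P0 and Pe to two decimals first.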
The Chi-Squared Distribution
To do hypothesis testing with categorical variables, we need distributions suited to this kind of data. The most common is the chi-squared distribution, a continuous theoretical probability distribution.
This distribution has only one parameter, k, the degrees of freedom. As k approaches infinity, the chi-squared distribution approaches the normal distribution.
Chi-Squared Test
This test is used to check whether two categorical variables are independent. We will use the same example to explain how to calculate it.
First, we define the hypotheses that we want to test. In our case, we want to check whether passing exam 1 and passing exam 2 are independent, so:
- H0: Passing exam 1 and passing exam 2 are independent.
- Ha: Passing exam 1 and passing exam 2 are dependent.
This test relies on the difference between the observed and the expected values. To calculate the expected values (what you would expect to find if both variables were independent), we use:
E_ij = (total of row i x total of column j) / n
To simplify the calculations, we first calculate the marginals: the sums per row and per column that we already computed in the second table of this post. The expected values are calculated as:
[Figure: Expected values calculation for our example, self-generated.]
Now we have all we need to calculate the chi-squared statistic:
X^2 = sum over all cells of (O_ij - E_ij)^2 / E_ij
where O_ij is the observed count in each cell. The sum symbol means that we have to calculate the term for every combination of our variables' values (4 in our case) and add up the results:
[Figure: Results for each term of the sum, self-generated.]
The final value is the sum of all 4 terms, 26.96. Now we have to compare this result with the critical value from a chi-squared table. For this we need the degrees of freedom, calculated as (number of rows - 1) * (number of columns - 1); in our case, we have 1 degree of freedom.
According to a chi-squared table (easy to find online, and available as a function in the statistical packages of most languages), the critical value for α = 0.05 with 1 degree of freedom is 3.841. Our result is much larger, so we reject the null hypothesis, which means that passing exam 1 and passing exam 2 are dependent.
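The whole test can be sketched in plain Python. The cell counts (20, 10, 5, 50) are an assumption inferred from the totals quoted in the article, and the helper name is illustrative. With these counts and no continuity correction, the statistic comes out at about 30.99 rather than the 26.96 reported above; either value is far beyond the critical value of 3.841, so the conclusion is the same.

```python
import math


def chi_squared_independence(table):
    """Pearson chi-squared statistic and degrees of freedom for an R x C table."""
    n = sum(sum(row) for row in table)
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    statistic = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Expected count under independence: product of marginals over n.
            expected = row_totals[i] * col_totals[j] / n
            statistic += (observed - expected) ** 2 / expected
    dof = (len(table) - 1) * (len(table[0]) - 1)
    return statistic, dof


# Rows: exam 1 (pass, fail); columns: exam 2 (pass, fail). Inferred counts.
table = [[20, 10], [5, 50]]
statistic, dof = chi_squared_independence(table)

# For 1 degree of freedom only, the upper tail probability of the chi-squared
# distribution equals erfc(sqrt(x / 2)), so no statistics package is needed.
p_value = math.erfc(math.sqrt(statistic / 2))

critical_value = 3.841  # alpha = 0.05, 1 degree of freedom
print(f"chi2 = {statistic:.2f}, dof = {dof}")
print("reject H0" if statistic > critical_value else "fail to reject H0")
```

For tables with more than 1 degree of freedom, the shortcut for the p-value no longer applies and a statistics package would be the natural choice.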
Correlation Statistics for Categorical Data
Because Pearson correlation requires variables measured on at least an interval level, we need a different calculation for binary and ordinal variables. Let's introduce them:
Binary Variables
Phi is a measure of the degree of association between two binary variables. Based on the table introduced in the Cohen’s Kappa section, it is calculated as:
Φ = (ad - bc) / ((a + b)(c + d)(a + c)(b + d))^(1/2)
Φ = (X^2 / n)^(1/2)
where a, b, c, and d are the four cell counts of the 2x2 table (a and d on the diagonal), X^2 is the chi-squared statistic, and n is the total number of cases.
Using the second formula, in our example, Φ = (26.96/85)^(1/2) = 0.56
Notice that the first formula can yield negative values, while the second can only yield positive values. Here we don't care about the direction of the result, so we just analyze the absolute value.
If the data is evenly distributed (a 50–50 split), phi can reach a value of 1; otherwise the potential maximum value is lower. In our case, a value of 0.56 indicates a moderately strong association between passing the two exams.
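Both ways of computing phi can be sketched in Python. The cell counts (20, 10, 5, 50) are an assumption inferred from the totals quoted in the article, and both function names are illustrative. Note that (26.96/85)^(1/2) evaluates to about 0.56, and that the cell-count formula gives about 0.60 with the inferred counts.

```python
import math


def phi_from_counts(a, b, c, d):
    """Signed phi for a 2x2 table [[a, b], [c, d]]."""
    return (a * d - b * c) / math.sqrt((a + b) * (c + d) * (a + c) * (b + d))


def phi_from_chi2(chi2, n):
    """Unsigned phi from a chi-squared statistic and the sample size."""
    return math.sqrt(chi2 / n)


# Inferred counts: pass/pass, pass/fail, fail/pass, fail/fail.
phi_signed = phi_from_counts(20, 10, 5, 50)

# Chi-squared value and sample size as quoted in the article.
phi_unsigned = phi_from_chi2(26.96, 85)

print(f"phi from counts: {phi_signed:.2f}")    # phi from counts: 0.60
print(f"phi from chi2:   {phi_unsigned:.2f}")  # phi from chi2:   0.56
```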
The Point-Biserial Correlation
This measure calculates the correlation between a dichotomous variable and a continuous variable. The formula is:
r_pb = ((x̄1 - x̄2) / s_x) * (p * (1 - p))^(1/2)
where:
x̄1 = mean of the continuous variable for group 1
x̄2 = mean of the continuous variable for group 2
p = proportion of class 1 in the dichotomous variable
s_x = standard deviation of the continuous variable
To follow our example, suppose we obtain the following values when comparing the exam 1 variable with the number of hours studied:
x̄ pass = 5.5
x̄ not pass = 3.1
p = 20/25 = 0.8
s_x = 2
With these values, we obtain (5.5 - 3.1) * (0.8 * 0.2)^(1/2) / 2 = 2.4 * 0.4 / 2 = 0.48, indicating that there is some relationship between our variables.
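The same arithmetic as a small Python function, as a sketch built on the summary values supposed above. The function name and argument names are our own; working from raw data instead would require computing the two group means, p, and the standard deviation first.

```python
import math


def point_biserial(mean_1, mean_2, std_x, p):
    """Point-biserial correlation from summary statistics.

    std_x is assumed to be the standard deviation of the continuous variable
    over all observations (computed with n in the denominator).
    """
    return (mean_1 - mean_2) / std_x * math.sqrt(p * (1 - p))


# Values supposed in the article: mean hours studied per group, p, and s_x.
r_pb = point_biserial(mean_1=5.5, mean_2=3.1, std_x=2, p=0.8)
print(f"r_pb = {r_pb:.2f}")  # r_pb = 0.48
```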
Ordinal Variables
The most widely used correlation coefficient for ordinal variables is Spearman’s rank-order coefficient, usually called Spearman’s r:
r_s = 1 - 6 * (sum of d_i^2) / (n * (n^2 - 1))
where d_i is the difference between the ranks of the 2 variables for individual i, and n is the size of the sample.
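A minimal sketch of this formula in Python follows. The data is hypothetical (two judges ranking the same five students), the helper names are our own, and the simple formula assumes there are no tied values; with ties, a statistics package would be the safer choice.

```python
def ranks(values):
    """Ranks from 1 to n, smallest value first. Assumes no ties."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0] * len(values)
    for rank, index in enumerate(order, start=1):
        result[index] = rank
    return result


def spearman_r(x, y):
    """Spearman's rank-order coefficient for two equally long sequences."""
    n = len(x)
    # d_i is the difference between the two ranks of each individual.
    d_squared = sum((rx - ry) ** 2 for rx, ry in zip(ranks(x), ranks(y)))
    return 1 - 6 * d_squared / (n * (n ** 2 - 1))


# Hypothetical scores given by two judges to the same five students.
judge_a = [10, 20, 30, 40, 50]
judge_b = [12, 25, 41, 33, 60]

r_s = spearman_r(judge_a, judge_b)
print(f"r_s = {r_s:.2f}")  # r_s = 0.90
```

Here only two students swap positions between the judges, so the sum of squared rank differences is 2 and the coefficient stays close to 1.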
Summary
In data science, we are used to drawing scatter plots of binary, categorical, or ordinal variables, or using them as color groupings in other plots. But when we calculate correlations, these variables are easy to skip, because the built-in functions in pandas (Python) or dplyr (R) do not use them.
In this post, we showed how to analyze the distribution of these variables and their correlation with the other variables.
This is the tenth post of my #100daysofML challenge. I will be publishing the progress of this challenge on GitHub, Twitter, and Medium (Adrià Serra).
https://twitter.com/CrunchyML
https://github.com/CrunchyPistacho/100DaysOfML
Translated from: https://medium.com/ai-in-plain-english/eda-on-categorical-and-ordinal-data-22f8a4407836