Back to the Future of Data Analysis: A Tidy Implementation of Tukey's Vacuum
One of John Tukey’s landmark papers, “The Future of Data Analysis”, contains a set of analytical techniques that have gone largely unnoticed, as if they’re hiding in plain sight.
Multiple sources identify Tukey’s paper as a seminal moment in the history of data science. Both Forbes (“A Very Short History of Data Science”) and Stanford (“50 years of Data Science”) have published histories that use the paper as their starting point. I’ve quoted Tukey myself in articles about data science at Microsoft (“Using Azure to understand Azure”).
Independent of the paper, Tukey’s impact on data science has been immense: He was author of Exploratory Data Analysis. He developed the Fast Fourier Transform (FFT) algorithm, the box plot, and multiple statistical techniques that bear his name. He even coined the term “bit.”
But it wasn’t until I actually read “The Future of Data Analysis” that I discovered Tukey’s forgotten techniques. Of course, I already knew the paper was important. But I also knew that if I wanted to understand why — to understand the breakthrough in Tukey’s thinking — I had to read it myself.
Tukey does not disappoint. He opens with a powerful declaration: “For a long time I have thought I was a statistician, interested in inferences from the particular to the general. But as I have watched mathematical statistics evolve, I have had cause to wonder and to doubt” (p 2). Like the opening of Beethoven’s Fifth, the statement is immediate and bold. “All in all,” he says, “I have come to feel that my central interest is in data analysis…” (p 2).
Despite Tukey’s use of first person, his opening statement is not about himself. He’s putting his personal and professional interests aside to make the much bolder assertion that statistics and data analysis are separate disciplines. He acknowledges that the two are related: “Statistics has contributed much to data analysis. In the future it can, and in my view should, contribute much more” (p 2).
Moreover, Tukey states that statistics is “pure mathematics.” And, in his words, “…mathematics is not a science, since its ultimate standard of validity is an agreed-upon sort of logical consistency and provability” (p 6). Data analysis, however, is a science, distinguished by its “reliance upon the test of experience as the ultimate standard of validity” (p 5).
Nothing on CRAN
Not far into the paper, however, I stumbled. About a third of the way in (p 22), Tukey introduces FUNOP, a technique for automating the interpretation of plots. I paged ahead and spotted a number of equations. I worried that — before I could understand the equations — I might need an intuitive understanding of FUNOP. I paged further ahead and spotted a second technique, FUNOR-FUNOM. I soon realized that this pair of techniques, combined with a third that I didn't yet know was waiting for me, makes up nearly half the paper.
To understand “The Future of Data Analysis,” I would definitely need to learn more about FUNOP and FUNOR-FUNOM. I took that realization in stride, though, because I learned long ago that data science is — and will always be — full of terms and techniques that I don’t yet know. I’d do my research and come back to Tukey’s paper.
But when I searched online for FUNOP, I found almost nothing. More surprising, there was nothing in CRAN. Given the thousands of packages in CRAN and the widespread adoption of Tukey’s techniques, I expected there to be multiple implementations of the techniques from such an important paper. Instead, nothing.
FUNOP
Fortunately, Tukey describes in detail how FUNOP and FUNOR-FUNOM work. And, fortunately, he provides examples of how they work. Unfortunately, he provides only written descriptions of these procedures and their effect on example data. So, to understand the procedures, I implemented each of them in R. (See my repository on GitHub.) And to further clarify what they do, I generated a series of charts that make it easier to visualize what’s going on.
Here’s Tukey’s definition of FUNOP (FUll NOrmal Plot):
(b1) Let aᵢ|ₙ be a typical value for the ith ordered observation in a sample of n from a unit normal distribution.

(b2) Let y₁ ≤ y₂ ≤ … ≤ yₙ be the ordered values to be examined. Let ẏ be their median (or let ỹ, read "y trimmed", be the mean of the yᵢ with ⅓n < i ≤ ⅓(2n)).

(b3) For i ≤ ⅓n or > ⅓(2n) only, let zᵢ = (yᵢ − ẏ)/aᵢ|ₙ (or let zᵢ = (yᵢ − ỹ)/aᵢ|ₙ).

(b4) Let ż be the median of the z's thus obtained (about ⅓(2n) in number).

(b5) Give special attention to z's for which both |yᵢ − ẏ| ≥ A · ż and zᵢ ≥ B · ż, where A and B are prechosen.

(b5*) Particularly for small n, zⱼ's with j more extreme than an i for which (b5) selects zᵢ also deserve special attention… (p 23).
The basic idea is very similar to a Q-Q plot.
Tukey gives us an example of 14 data points. On a normal Q-Q plot, if data are normally distributed, they form a straight line. But in the chart below, based upon the example data, we can clearly see that a couple of the points are relatively distant from the straight line. They’re outliers.
The goal of FUNOP is to eliminate the need for visual inspection by automating interpretation.
The first variable in the FUNOP procedure (aᵢ|ₙ) simply gives us the theoretical distribution, where i is the ordinal value in the range 1..n and Gau⁻¹ is the quantile function of the normal distribution (i.e., the "Q" in Q-Q plot):

aᵢ|ₙ = Gau⁻¹[(3i − 1) / (3n + 1)]
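As a quick sketch in R, that lookup is a single call to qnorm; it mirrors the a_qnorm helper in the implementation further below (a_typical is just my name for it here):

# theoretical position of the ith of n ordered observations, a sketch
a_typical <- function(i, n) {
  qnorm((3 * i - 1) / (3 * n + 1))
}
a_typical(1:3, 14)  # positions of the three smallest values in a sample of 14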
The key innovation of FUNOP is to calculate the slope of each point, relative to the median.
If ẏ is the median of the sample, and we presume that it's located at the midpoint of the distribution (where a = 0), then the slope of each point can be calculated as:

zᵢ = (yᵢ − ẏ) / aᵢ|ₙ
The chart above illustrates how the slope of one point (1.2, 454) is calculated, relative to the median (0, 33.5).
Any point that has a slope significantly steeper than the rest of the population is necessarily farther from the straight line. To find such points, FUNOP simply compares each slope (zᵢ) to the median of all calculated slopes (ż).
Note, however, that FUNOP calculates slopes for the top and bottom thirds of the sorted population only, in part because zᵢ won't vary much over the inner third of that population, but also because the value of aᵢ|ₙ for the inner third will be close to 0 and dividing by ≈0 when calculating zᵢ might lead to instability.
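A minimal sketch of that restriction, reusing the a_typical helper from above and assuming y is already sorted:

# slopes for the outer thirds only; the inner third is left as NA
n      <- length(y)
middle <- (floor(n / 3) + 1):ceiling(2 * n / 3)
outer  <- (1:n)[-middle]
z      <- rep(NA_real_, n)
z[outer] <- (y[outer] - median(y)) / a_typical(outer, n)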
Significance — what Tukey calls “special attention” — is partially determined by B, one of two predetermined values (or hyperparameters). For his example, Tukey recommends a value between 1.5 and 2.0, which means that FUNOP simply checks whether the slope of any point, relative to the midpoint, is 1.5 or 2.0 times larger than the median.
The other predetermined value is A, which is roughly equivalent to the number of standard deviations of yᵢ from ẏ and serves as a second criterion for significance.
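Put together, and borrowing the names used in the implementation below (y_split for ẏ and z_split for ż), the two criteria reduce to a single vectorized test; this is just a sketch, not the full procedure:

# flag points that deserve "special attention"
special <- !is.na(z) & z >= B * z_split & abs(y - y_split) >= A * z_split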
The following chart shows how FUNOP works.
Our original values are plotted along the x-axis. The points in green make up the inner third of our sample, and we use them to calculate ẏ, the median of just those points, indicated by the green vertical line.
The points not in green make up the outer thirds (i.e., the top and bottom thirds) of our sample, and we use them to calculate ż, the median slope of just those points, indicated by the black horizontal line.
Our first selection criterion is zᵢ ≥ B · ż. In his example, Tukey sets B = 1.5, so our threshold of interest is 1.5ż, indicated by the blue horizontal line. We'll consider any point above that line (the shaded blue region) as deserving "special attention". We have only one such point, colored red.
Our second criterion is |yᵢ − ẏ| ≥ A · ż. In his example, Tukey sets A = 0, so our threshold of interest is |yᵢ − ẏ| ≥ 0 or (more simply) yᵢ ≠ ẏ. Basically, any point not on the green line. Our one point in the shaded blue region isn't on the green line, so we still have our one point.
Our final criterion is any zⱼ's with j more extreme than any i selected so far. Basically, that's any value more extreme than the ones already identified. In this case, we have one value that's larger (further to the right on the x-axis) than our red dot. That point is colored orange, and we add it to our list.
The two points identified by FUNOP are the same ones that we identified visually in Chart 1.
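For reference, a call to the funop() implementation shown further below might look like this; the data here are invented for illustration, not Tukey's example values:

# hypothetical usage of funop()
set.seed(42)
obs <- c(rnorm(12), 8, -6)   # twelve ordinary values plus two planted outliers
res <- funop(obs, A = 0, B = 1.5)
res[res$special, ]           # rows flagged for special attention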
Technology
FUNOP represents a turning point in the paper.
In the first section, Tukey explores a variety of topics from a much more philosophical perspective: The role of judgment in data analysis, problems in teaching analysis, the importance of practicing the discipline, facing uncertainty, and more.
In the second section, Tukey turns his attention to "spotty data" and its challenges. The subsections get increasingly technical, and the first of many equations appears. Just before he introduces FUNOP, Tukey explores "automated examination", where he discusses the role of technology.
Even though Tukey wrote his paper nearly 60 years ago, he anticipates the dual role that technology continues to play to this day: It will democratize analysis, making it more accessible for casual users, but it will also enable the field’s advances:
“(1) Most data analysis is going to be done by people who are not sophisticated data analysts…. Properly automated tools are the easiest to use for [someone] with a computer.
“(2) …[Sophisticated data analysts] must have both the time and the stimulation to try out new procedures of analysis; hence the known procedures must be made easy for them to apply as possible. Again automation is called for.
“(3) If we are to study and intercompare procedures, it will be much easier if the procedures have been fully specified, as must happen [in] the process of being made routine and automatizable” (p 22).
The juxtaposition of “automated examination” and “FUNOP” made me wonder about Tukey’s reason for including the technique in his paper. Did he develop FUNOP simply to prove his point about technology? It effectively identifies outliers, but it’s complicated enough to benefit from automation.
Feel free to skip ahead if you’re not interested in the code:
# the %>% pipe and the verbs used below come from dplyr
library(dplyr)

# a helper function for FUNOP and FUNOR_FUNOM, which use the output of
# a_qnorm as the denominator for their slope calculations
a_qnorm <- function(i, n) {
  qnorm((3 * i - 1) / (3 * n + 1))
}

funop <- function(x, A = 0, B = 1.5) {
  # (b1)
  # Let a_{i|n} be a typical value for the ith ordered observation in
  # a sample of n from a unit normal distribution.
  n <- length(x)

  # initialize dataframe to hold results
  result <- data.frame(
    y = x,
    orig_order = 1:n,
    a = NA,
    z = NA,
    middle = FALSE,
    special = FALSE
  )

  # put array in order
  result <- result %>%
    dplyr::arrange(y) %>%
    dplyr::mutate(i = dplyr::row_number())

  # calculate a_{i|n}
  result$a <- a_qnorm(result$i, n)

  # (b2)
  # Let y_1 ≤ y_2 ≤ … ≤ y_n be the ordered values to be examined.
  # Let y_split be their median (or let y_trimmed be the mean of the y_i
  # with (1/3)n < i ≤ (2/3)n).
  middle_third <- (floor(n / 3) + 1):ceiling(2 * n / 3)
  outer_thirds <- (1:n)[-middle_third]
  result$middle[middle_third] <- TRUE

  y_split <- median(result$y)
  y_trimmed <- mean(result$y[middle_third])

  # (b3)
  # For i ≤ (1/3)n or > (2/3)n only,
  # let z_i = (y_i – y_split) / a_{i|n}
  # (or let z_i = (y_i – y_trimmed) / a_{i|n}).
  result$z[outer_thirds] <-
    (result$y[outer_thirds] - y_split) / result$a[outer_thirds]

  # (b4)
  # Let z_split be the median of the z's thus obtained.
  z_split <- median(result$z[outer_thirds])

  # (b5)
  # Give special attention to z's for which both
  # |y_i – y_split| ≥ A · z_split and z_i ≥ B · z_split
  # where A and B are prechosen.
  # (z is NA for the middle third, so treat those rows as not special)
  result$special <-
    !is.na(result$z) &
    result$z >= (B * z_split) &
    abs(result$y - y_split) >= (A * z_split)

  # (b5*)
  # Particularly for small n, z_j's with j more extreme than an i
  # for which (b5) selects z_i also deserve special attention.

  # in the top third, look for values larger than ones already found:
  # take advantage of the fact that we've already indexed our result set
  # and simply look for values of i larger than the smallest selected i
  # in the top third (further to the right of our x-axis)
  top_third <- outer_thirds[outer_thirds > max(middle_third)]
  if (any(result$special[top_third])) {
    min_i <- result %>%
      dplyr::filter(special == TRUE, i %in% top_third) %>%
      {min(.$i)}
    result$special[which(result$i > min_i)] <- TRUE
  }

  # in the bottom third, look for values smaller than ones already found:
  # values of i smaller than the largest selected i in the bottom third
  # (further to the left of our x-axis)
  bottom_third <- outer_thirds[outer_thirds < min(middle_third)]
  if (any(result$special[bottom_third])) {
    max_i <- result %>%
      dplyr::filter(special == TRUE, i %in% bottom_third) %>%
      {max(.$i)}
    result$special[which(result$i < max_i)] <- TRUE
  }

  result <- result %>%
    dplyr::arrange(orig_order) %>%
    dplyr::select(y, i, middle, a, z, special)

  attr(result, 'y_split') <- y_split
  attr(result, 'y_trimmed') <- y_trimmed
  attr(result, 'z_split') <- z_split

  result
}

FUNOR-FUNOM
One common reason for identifying outliers is to do something about them, often by trimming or Winsorizing the dataset. The former simply removes an equal number of values from upper and lower ends of a sorted dataset. Winsorizing is similar but doesn’t remove values. Instead, it replaces them with the closest original value not affected by the process.
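For contrast, here is a minimal sketch of those two classical treatments on a vector x, removing or clamping two values at each end (the count of two is arbitrary):

# trimming vs. Winsorizing, a sketch
x_sorted   <- sort(x)
k          <- 2
n          <- length(x_sorted)
trimmed    <- x_sorted[(k + 1):(n - k)]                               # drop k from each end
winsorized <- pmin(pmax(x_sorted, x_sorted[k + 1]), x_sorted[n - k])  # clamp instead of dropping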
Tukey’s FUNOR-FUNOM (FUll NOrmal Rejection-FUll NOrmal Modification) offers an alternate approach. The procedure’s name reflects its purpose: FUNOR-FUNOM uses FUNOP to identify outliers, and then uses separate rejection and modification procedures to treat them.
The technique offers a number of innovations. First, unlike trimming and Winsorizing, which affect all the values at the top and bottom ends of a sorted dataset, FUNOR-FUNOM uses FUNOP to identify individual outliers to treat. Second, FUNOR-FUNOM leverages statistical properties of the dataset to determine individual modifications for those outliers.
FUNOR-FUNOM is specifically designed to operate on two-way (or contingency) tables. Similar to other techniques that operate on contingency tables, it uses the table's grand mean (x..) and the row and column means (xⱼ. and x.ₖ, respectively) to calculate expected values for entries in the table.
The equation below shows how these effects are combined. Because it's unlikely for expected values to match the table's actual values exactly, the equation includes a residual term (yⱼₖ) to account for any deviation:

xⱼₖ = x.. + (xⱼ. − x..) + (x.ₖ − x..) + yⱼₖ
FUNOR-FUNOM is primarily interested in the values that deviate most from their expected values, the ones with the largest residuals. So, to calculate residuals, simply swap the above equation around:

yⱼₖ = xⱼₖ − xⱼ. − x.ₖ + x..
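In R, the whole table of residuals can be formed in one line, assuming x is the two-way table stored as a numeric matrix; this sketch is equivalent to the row-mean and column-mean bookkeeping in the implementation below:

# y_jk = x_jk - row mean - column mean + grand mean
resid <- x - outer(rowMeans(x), colMeans(x), "+") + mean(x)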
FUNOR-FUNOM starts by repeatedly applying FUNOP, looking for outlier residuals. When it finds them, it modifies the outlier with the greatest deviation by applying the following modification:

Δxⱼₖ = zⱼₖ · aᵢ|ₙ · f
where zⱼₖ and aᵢ|ₙ come from FUNOP applied to the residuals, and f is a compensating factor based upon the size of the table (r rows by c columns):

f = (r · c) / ((r − 1)(c − 1))
Recalling the definition of slope (from FUNOP),

zⱼₖ = (yⱼₖ − ẏ) / aᵢ|ₙ
the first portion of the Δxⱼₖ equation reduces to just yⱼₖ − ẏ, the difference of the residual from the median. The second portion of the equation is a factor, based solely upon table size, meant to compensate for the effect of an outlier on the table's grand, row, and column means.
When Δxⱼₖ is applied to the original value, the yⱼₖ terms cancel out, effectively setting the outlier to its expected value (based upon the combined effects of the contingency table) plus a factor of the median residual (≈ xⱼ. + x.ₖ − x.. + ẏ).
FUNOR-FUNOM repeats this same process until it no longer finds values that “deserve special attention.”
In the final phase, the FUNOM phase, the procedure uses a lower threshold of interest — FUNOP with a lower A — to identify a final set of outliers for treatment. The adjustment becomes

Δxⱼₖ = (zⱼₖ − Bₘ · ż) · aᵢ|ₙ
There are a couple of changes here. First, the inclusion of (−Bₘ · ż) effectively sets the residual of outliers to FUNOP's threshold of interest, much like the way that Winsorizing sets affected values to the same cut-off threshold. FUNOM, though, sets only the residual of affected values to that threshold: The greater part of the value is determined by the combined effects of the grand, row, and column means.
Second, because we’ve already taken care of the largest outliers (whose adjustment would have a more significant effect on the table’s means), we no longer need the compensating factor.
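In code, the FUNOM adjustment is a one-liner; the names follow the implementation below, with z and a coming from FUNOP run on the residuals and z_split standing in for ż:

# pull each flagged value back toward the FUNOM threshold of interest
delta_x <- (z - B_m * z_split) * a
x_new   <- x - delta_x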
The chart below shows the result of applying FUNOR-FUNOM to the data in Table 2 of Tukey’s paper.
The black dots represent the original values affected by the procedure. The color of their treated values is based upon whether they were determined by the FUNOR or FUNOM portion of the procedure. The grey dots represent values unaffected by the procedure.
FUNOR handles the largest adjustments, which Tukey accomplishes by setting Aᵣ = 10 and Bᵣ = 1.5 for that portion of the process, and FUNOM handles the finer adjustments by setting Aₘ = 0 and Bₘ = 1.5.
Again, because the procedure leverages the statistical properties of the data, each of the resulting adjustments is unique.
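A hypothetical call, assuming the funor_funom() implementation that follows; the table is invented for illustration, not Tukey's Table 2:

# made-up 6 x 6 two-way table with one planted wild value
set.seed(1)
tbl <- matrix(rnorm(36), nrow = 6)
tbl[2, 3] <- 25
cleaned <- funor_funom(tbl)   # defaults: A_r = 10, B_r = 1.5, A_m = 0, B_m = 1.5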
Here is the code:
funor_funom <- function(x, A_r = 10, B_r = 1.5, A_m = 0, B_m = 1.5) {
  x <- as.matrix(x)

  # Initialize
  r <- nrow(x)
  c <- ncol(x)
  n <- r * c

  # this will be used in step a3, but only need to calc once
  change_factor <- r * c / ((r - 1) * (c - 1))

  # this data frame makes it easy to track all values
  # j and k are the rows and columns of the table
  dat <- data.frame(
    x = as.vector(x),
    j = ifelse(1:n %% r == 0, r, 1:n %% r),
    k = ceiling(1:n / r),
    change_type = 0
  )

  #############
  #   FUNOR   #
  #############
  repeat {
    dat <- dat %>%
      dplyr::select(x, j, k, change_type) # clean up dat from last loop

    # (a1)
    # Fit row and column means to the original observations
    # and form the residuals

    # calculate the row means
    dat <- dat %>%
      dplyr::group_by(j) %>%
      dplyr::summarise(j_mean = mean(x)) %>%
      dplyr::ungroup() %>%
      dplyr::select(j, j_mean) %>%
      dplyr::inner_join(dat, by = 'j')

    # calculate the column means
    dat <- dat %>%
      dplyr::group_by(k) %>%
      dplyr::summarise(k_mean = mean(x)) %>%
      dplyr::ungroup() %>%
      dplyr::select(k, k_mean) %>%
      dplyr::inner_join(dat, by = 'k')

    grand_mean <- mean(dat$x)

    # calculate the residuals
    dat$y <- dat$x - dat$j_mean - dat$k_mean + grand_mean

    # put dat in order based upon y (which will match i in FUNOP)
    dat <- dat %>%
      dplyr::arrange(y) %>%
      dplyr::mutate(i = dplyr::row_number())

    # (a2)
    # apply FUNOP to the residuals
    funop_residuals <- funop(dat$y, A_r, B_r)

    # (a4)
    # repeat until no y_{jk} deserves special attention
    if (!any(funop_residuals$special)) {
      break
    }

    # (a3) modify x_{jk} for largest y_{jk} that deserves special attention
    big_y <- funop_residuals %>%
      dplyr::filter(special == TRUE) %>%
      dplyr::top_n(1, (abs(y)))

    # change x by an amount that's proportional to its
    # position in the distribution
    # here's why it's useful to have z be on same scale as the raw value
    delta_x <- big_y$z * big_y$a * change_factor
    dat$x[which(dat$i == big_y$i)] <-
      dat$x[which(dat$i == big_y$i)] - delta_x
    dat$change_type[which(dat$i == big_y$i)] <- 1
  }

  # Done with FUNOR. To apply subsequent modifications we need
  # the following from the most recent FUNOP
  dat <- funop_residuals %>%
    dplyr::select(i, middle, a, z) %>%
    dplyr::inner_join(dat, by = 'i')

  z_split <- attr(funop_residuals, 'z_split')
  y_split <- attr(funop_residuals, 'y_split')

  #############
  #   FUNOM   #
  #############

  # (a5)
  # identify new interesting values based upon new A & B

  # start with threshold for extreme B
  extreme_B <- B_m * z_split
  dat <- dat %>%
    dplyr::mutate(interesting_values = ((middle == FALSE) &
                                          (z >= extreme_B)))

  # logical AND with threshold for extreme A
  extreme_A <- A_m * z_split
  dat$interesting_values <-
    ifelse(dat$interesting_values &
             (abs(dat$y - y_split) >= extreme_A), TRUE, FALSE)

  # (a6)
  # adjust just the interesting values
  delta_x <- dat %>%
    dplyr::filter(interesting_values == TRUE) %>%
    dplyr::mutate(change_type = 2) %>%
    dplyr::mutate(delta_x = (z - extreme_B) * a) %>%
    dplyr::mutate(new_x = x - delta_x) %>%
    dplyr::select(-x, -delta_x) %>%
    dplyr::rename(x = new_x)

  # select undistinguished values from dat and recombine with
  # adjusted versions of the interesting values
  dat <- dat %>%
    dplyr::filter(interesting_values == FALSE) %>%
    dplyr::bind_rows(delta_x)

  # return data to original shape
  dat <- dat %>%
    dplyr::select(j, k, x, change_type) %>%
    dplyr::arrange(j, k)

  # reshape result into a table of the original shape
  matrix(dat$x, nrow = r, byrow = TRUE)
}

“Foolish not to use it!”
After describing FUNOR-FUNOM, Tukey asserts that it serves a real need — one not previously addressed — and he invites people to begin using it, to explore its properties, even to develop competitors. In the meantime, he says, people would “…be foolish not to use it” (p 32).
Throughout his paper, Tukey uses italics to emphasize important points. Here he’s echoing an earlier point about arguments against the adoption of new techniques. He’d had colleagues suggest that no new technique should be published — much less used — before its power function was given. Tukey recognized the irony, because much of applied statistics depended upon Student’s t. In his paper, he points out,
“Surely the suggested amount of knowledge is not enough for anyone to guarantee either
“(c1) that the chance of error, when the procedure is applied to real data, corresponds precisely to the nominal levels of significance or confidence, or
“(c2) that the procedure, when applied to real data, will be optimal in any one specific sense.
“BUT WE HAVE NEVER BEEN ABLE TO MAKE EITHER OF THESE STATEMENTS ABOUT Student’s t” (p 20).
This is Tukey’s only sentence in all caps. Clearly, he wanted to land the point.
And, clearly, FUNOR-FUNOM was not meant as an example of theoretically possible techniques. Tukey intended for it to be used.
Vacuum cleaner
FUNOR-FUNOM treats the outliers of a contingency table by identifying and minimizing outsized residuals, based upon the grand, row, and column means.
Tukey takes these concepts further with his vacuum cleaner, whose output is a set of residuals, which can be used to better understand sources of variance in the data and enable more informed analysis.
To isolate residuals, Tukey's vacuum cleaner uses regression to break down the values from the contingency table into their constituent components (p 51):

yⱼₖ = aⱼ · coef_c[k] + bₖ · coef_r[j] − aⱼ · bₖ · y_ab + residualⱼₖ

(The names match the implementation below: coef_c[k] and coef_r[j] are the column and row regression coefficients, y_ab is the dual-regression coefficient, and aⱼ and bₖ are the carriers introduced next.)
The idea is very similar to the one based upon the grand, row, and column means. In fact, the first stage of the vacuum cleaner produces the same result as subtracting the combined effect of the means from the original values.
To do this, the vacuum cleaner needs to calculate regression coefficients for each row and column based upon the values in our table (yⱼₖ) and a carrier — or regressor — for both rows (aⱼ) and columns (bₖ). [Apologies for using "k" for columns, but Medium has its limitations.]
Below is the equation used to calculate the regression coefficients for columns:

coef_c[k] = Σⱼ (aⱼ · yⱼₖ) / Σⱼ (aⱼ²)
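Equivalently, all the column coefficients can be computed at once; this vectorized sketch matches the explicit loop in the vacuum() implementation below, with carrier_r playing the role of a and input_table holding the yⱼₖ values:

# column coefficients: sum_j(a_j * y_jk) / sum_j(a_j^2)
coef_c <- as.vector(crossprod(carrier_r, input_table)) / sum(carrier_r ^ 2)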
Conveniently, the equation will give us the mean of a column when we set aⱼ ≡ 1:

coef_c[k] = Σⱼ yⱼₖ / r = y.ₖ
where r is the number of rows. Effectively, the equation iterates through every row (Σⱼ), summing up the individual values in the same column (k) and dividing by the number of rows, the same as calculating the column mean (y.ₖ).
Note, however, that a is a vector, and the procedure also needs it normalized so that it can be reused as a regressor in the next stage. So, rather than literally setting every aⱼ to 1, we need our vector to satisfy this equation:

Σⱼ aⱼ² = 1
For a vector of length r, we can simply assign every member the same value:

aⱼ = √(1/r)
Our initial regressors end up being two sets of vectors, one for rows and one for columns, containing either √(1/r) for rows or √(1/c) for columns.
Finally, in the same way that the mean of all row means or the mean of all column means can be used to calculate the grand mean, either the row coefficients or the column coefficients can be used to calculate a dual-regression (or "grand") coefficient:

y_ab = Σₖ (bₖ · coef_c[k]) / Σₖ (bₖ²) = Σⱼ (aⱼ · coef_r[j]) / Σⱼ (aⱼ²)
The reason for calculating all of these coefficients, rather than simply subtracting the grand, row, and column means from our table's original values, is that Tukey's vacuum cleaner reuses the coefficients from this stage of the procedure as regressors in the next. (To ensure Σⱼ aⱼ² ≡ 1 and Σₖ bₖ² ≡ 1 for the next stage, we normalize both sets of new regressors.)
The second phase is the real innovation here. It takes an earlier idea of Tukey's, one degree of freedom for non-additivity, and applies it separately to each row and column. This, Tukey tells us, "…extracts row-by-row regression upon 'column mean minus grand mean' and column-by-column regression on 'row mean minus grand mean'" (p 53).
The result is a set of residuals, vacuum cleaned of systemic effects.
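Usage is then a single call, assuming the vacuum() implementation that follows; the table here is invented for illustration:

# made-up 5 x 6 two-way table; the result is the table of vacuumed residuals
set.seed(7)
tbl <- matrix(rnorm(30), nrow = 5)
residual_tbl <- vacuum(tbl)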
Here’s the code for the entire procedure:
# helper: Euclidean (L2) norm, used to normalize the regressors
l2_norm <- function(v) {
  sqrt(sum(v ^ 2))
}

vacuum <- function(x) {
  input_table <- x
  r <- nrow(input_table)
  c <- ncol(input_table)

  # number of rows and columns must be at least 3 (p 53)
  if (r < 3 | c < 3)
    stop('Insufficient size')

  ######################
  #  Initial carriers  #
  ######################

  # sqrt(1/n)
  carrier_r <- rep(sqrt(1 / r), r)
  carrier_c <- rep(sqrt(1 / c), c)

  #####################
  #   Start of loop   #
  #####################

  # Tukey passes through the loop twice.
  # Suggests further passes possible in the "attachments" section (p 53)
  for (pass in 1:2) {

    ####################
    #   Coefficients   #
    ####################

    # Calculate the column coefficients
    coef_c <- rep(NA, c)
    for (i in 1:c) {
      # loop through every column, summing every row
      # denominator is based upon the number of rows in the column
      coef_c[i] <-
        sum(carrier_r * input_table[, i]) / sum(carrier_r ^ 2)
    }

    # Calculate the row coefficients
    coef_r <- rep(NA, r)
    for (i in 1:r) {
      # loop through every row, summing every column
      # denominator is based upon the number of columns in the row
      coef_r[i] <-
        sum(carrier_c * input_table[i, ]) / sum(carrier_c ^ 2)
    }

    ############
    #   y_ab   #
    ############

    # either one of these
    y_ab <- sum(coef_c * carrier_c) / sum(carrier_c ^ 2)
    y_ab <- sum(carrier_r * coef_r) / sum(carrier_r ^ 2)

    ##########################
    #   Apply subprocedure   #
    ##########################

    # create a destination for the output
    output_table <- input_table

    for (i in 1:r) {
      for (j in 1:c) {
        output_table[i, j] <- input_table[i, j] -
          carrier_r[i] * (coef_c[j] - carrier_c[j] * y_ab) -
          carrier_c[j] * (coef_r[i] - carrier_r[i] * y_ab) -
          y_ab * carrier_c[j] * carrier_r[i]
      }
    }

    # These are the coefficients that will get carried forward
    coef_c <- coef_c - carrier_c * y_ab
    coef_r <- coef_r - carrier_r * y_ab

    ##########################
    #   Prep for next pass   #
    ##########################

    # normalize coefficients because we want sqrt(sum(a^2)) == 1
    carrier_r <- coef_r / l2_norm(coef_r)
    carrier_c <- coef_c / l2_norm(coef_c)

    input_table <- output_table

    ###################
    #   End of loop   #
    ###################
  }

  output_table
}

Takeaways
When I started this exercise, I honestly expected it to be something of an archaeological endeavor: I thought that I’d be digging through an artifact from 1962. Instead, I discovered some surprisingly innovative techniques.
However, none of the three procedures has survived in its original form. Tukey himself doesn't even mention them in Exploratory Data Analysis, which he published 15 years later. That said, the book's chapters on two-way tables contain the obvious successors to both FUNOR-FUNOM and the vacuum cleaner.
Perhaps one reason they've faded from use is that the base procedure, FUNOP, requires the definition of two parameters, A and B. Tukey himself recognized that "the choice of B is going to be a matter of judgement" (p 47). When I tried applying FUNOR-FUNOM to other data sets, it was clear that use of the technique requires tuning.
Another possibility is that these procedures have a blind spot, which the paper itself demonstrates. One of Tukey’s goals was to avoid “…errors as serious as ascribing the wrong sign to the resulting correlation or regression coefficients” (p 58). So it’s perhaps ironic that one of the values in Tukey’s example table of coefficients (Table 8, p 54) has an inverted sign.
I tested each of Tukey's procedures, and none of them would have caught the typo: Both the error (-0.100) and the corrected value (0.100) are too close to the relevant medians and means to be noticeable. I found it only because the printed row and column means did not agree with the ones that I calculated.
The flaw isn’t fatal. And, ultimately, the utility of these procedures is beside the point. My real goal with this article is simply to encourage people to read Tukey’s paper and to make that task a little easier by providing the intuitive explanations that I myself had wanted.
To be clear, no one should mistake my explanations or my implementations of Tukey's techniques for a substitute for reading his paper. "The Future of Data Analysis" contains much more than I've covered here, and many of Tukey's ideas remain just as fresh — and just as relevant — today, including his famous maxim: "Far better an approximate answer to the right question, which is often vague, than an exact answer to the wrong question" (pp 14–15).
Source: https://towardsdatascience.com/back-to-the-future-of-data-analysis-a-tidy-implementation-of-tukeys-vacuum-87c561cdee18