The Data Never Says Anything
I’m bothered by statements like “Clearly, the data says…” or “It is obvious from the data that…”. Most of the time, it is neither clear nor obvious. We need the reasoning. The data cannot even talk — how can it possibly speak for itself?
Wait, no, let me try that. It sounds fun. Ahem. The conclusion here is clear:
(Image: Tyler Vigen)
I’ve had the displeasure of coming across supremacists who cite some statistics as “proof” that certain races or sexes are inferior to others. Surely, if the “data speaks for itself”, we should be agreeing with them? And yet, many of us don’t agree. Even those without a stats background can feel that something is off: the statistics do not capture the bigger picture, and there are complexities (strong unmeasured confounders or mediators) that prevent us from taking the numbers at face value. Wouldn’t it be nice if we could formalize this level of healthy skepticism?
For the longest time, I had trouble articulating this idea until I came across Cassie’s article. She states:
INFERENCE = DATA + ASSUMPTIONS
It is our assumptions that give voice to the data. I want to decompose this further, because there is a subtle but important distinction:
Inference = Data + Statistical Assumptions + Causal Assumptions
Why the split? Even with single-parameter estimates like “the average height of the population is X” or “there are Y trees in the forest”, we need to assume some data-generating mechanism, such as simple random sampling, and that is a causal statement.
The data scientists / data analysts might be experts in the statistical assumptions, but they might not know much about the causal assumptions. For instance, if you are a statistician who works with medical data, you’d be crazy to think that you know more about medicine than the MD or MPH. Some are experts in both the statistics and the scientific domain, but they are rare.
The quality of assumptions greatly influences the quality of insights. Yet, people rarely talk about assumptions. Teams would do well to hold meetings to discuss them.
Working in inference means setting aside your ego. You have to show humility and defer to others’ expertise so you can incorporate their knowledge into the model. It is a team effort.
Simpson’s Paradox
The UCB Admissions dataset is one of the simplest and best examples to illustrate the importance of causal assumptions. You might know it from Simpson’s Paradox. The data (column names not mine):
The question: is there sexism in UCB’s admission process?
Anna crunches some numbers. She adds up the counts, like so:
And runs logistic regression:
Anna concludes: “The coefficient for female is negative with a small p-value. Clearly, the data says that the admissions process is sexist.”
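Anna's two steps can be sketched without any statistics library. This is a minimal sketch, assuming the article's table matches the counts in R's built-in `UCBAdmissions` dataset; the names `UCB`, `aggregate`, and `female_coefficient` are made up for illustration. It relies on the fact that a logistic regression with an intercept and one binary predictor has a coefficient equal to the log odds ratio of the 2×2 table.

```python
import math

# Assumption: counts as in R's built-in UCBAdmissions table.
# dept -> gender -> (admitted, rejected)
UCB = {
    "A": {"male": (512, 313), "female": (89, 19)},
    "B": {"male": (353, 207), "female": (17, 8)},
    "C": {"male": (120, 205), "female": (202, 391)},
    "D": {"male": (138, 279), "female": (131, 244)},
    "E": {"male": (53, 138), "female": (94, 299)},
    "F": {"male": (22, 351), "female": (24, 317)},
}

def aggregate(table):
    """Sum admitted/rejected counts over departments, per gender."""
    totals = {"male": [0, 0], "female": [0, 0]}
    for by_gender in table.values():
        for gender, (admitted, rejected) in by_gender.items():
            totals[gender][0] += admitted
            totals[gender][1] += rejected
    return totals

def female_coefficient(totals):
    """Log odds ratio (female vs. male), Wald standard error, two-sided p-value."""
    fa, fr = totals["female"]
    ma, mr = totals["male"]
    coef = math.log((fa / fr) / (ma / mr))
    se = math.sqrt(1 / fa + 1 / fr + 1 / ma + 1 / mr)
    p_value = math.erfc(abs(coef / se) / math.sqrt(2))  # normal approximation
    return coef, se, p_value

totals = aggregate(UCB)
coef, se, p_value = female_coefficient(totals)
print(totals)
print(f"female coefficient = {coef:.3f}, SE = {se:.3f}, p = {p_value:.2g}")
```

The sign and size of `coef` are what Anna reads off her regression output. Department never enters the calculation.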
Barbara also crunches the numbers, but she keeps the original data as is. She runs logistic regression, controlling for Dept:
Barbara concludes: “The coefficient for female is positive with a large p-value. Clearly, the data says that the admissions process is not sexist.”
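Barbara's regression needs an actual fit, because Dept adds five dummy variables. Below is a rough sketch that fits the grouped logistic model with Newton's method in plain Python. It again assumes the counts from R's `UCBAdmissions` table, and the helper names (`design_rows`, `solve`, `score_and_information`, `fit_logistic`) are hypothetical. In practice you would likely reach for a statistics package instead.

```python
import math

# Assumption: counts as in R's built-in UCBAdmissions table.
UCB = {
    "A": {"male": (512, 313), "female": (89, 19)},
    "B": {"male": (353, 207), "female": (17, 8)},
    "C": {"male": (120, 205), "female": (202, 391)},
    "D": {"male": (138, 279), "female": (131, 244)},
    "E": {"male": (53, 138), "female": (94, 299)},
    "F": {"male": (22, 351), "female": (24, 317)},
}
DEPTS = sorted(UCB)  # "A" is the reference department

def design_rows(table):
    """One row per (dept, gender): [intercept, female, dept dummies], admitted, total."""
    rows = []
    for dept, by_gender in table.items():
        for gender, (admitted, rejected) in by_gender.items():
            x = [1.0, 1.0 if gender == "female" else 0.0]
            x += [1.0 if dept == d else 0.0 for d in DEPTS[1:]]
            rows.append((x, admitted, admitted + rejected))
    return rows

def solve(a, b):
    """Solve a @ x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (m[i][n] - sum(m[i][j] * x[j] for j in range(i + 1, n))) / m[i][i]
    return x

def score_and_information(rows, beta):
    """Gradient of the binomial log-likelihood and the Fisher information at beta."""
    k = len(beta)
    grad = [0.0] * k
    info = [[0.0] * k for _ in range(k)]
    for x, y, n in rows:
        p = 1 / (1 + math.exp(-sum(b * xi for b, xi in zip(beta, x))))
        for i in range(k):
            grad[i] += x[i] * (y - n * p)
            for j in range(k):
                info[i][j] += n * p * (1 - p) * x[i] * x[j]
    return grad, info

def fit_logistic(rows, iterations=25):
    """Newton's method from beta = 0; returns estimates and final Fisher information."""
    beta = [0.0] * len(rows[0][0])
    for _ in range(iterations):
        grad, info = score_and_information(rows, beta)
        beta = [b + s for b, s in zip(beta, solve(info, grad))]
    return beta, score_and_information(rows, beta)[1]

beta, info = fit_logistic(design_rows(UCB))
unit = [0.0] * len(beta)
unit[1] = 1.0  # picks out the variance of the female coefficient
se_female = math.sqrt(solve(info, unit)[1])
p_value = math.erfc(abs(beta[1] / se_female) / math.sqrt(2))
print(f"female coefficient = {beta[1]:.3f}, SE = {se_female:.3f}, p = {p_value:.2g}")
```

The only difference from Anna's analysis is the extra Dept columns in the design matrix. The data are the same.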
Wait, what?
They both use the same data. They both have reasonable statistical assumptions. Yet they come to wildly different conclusions because of different methodology. Who’s right?
I reiterate what I wish were taught in stats courses:
Data + statistical assumptions are not enough to draw inference. We need causal assumptions.
So how should we approach this? To simplify, assume that we have measured all the relevant variables. We lay out some causal assumptions:
(1) is what we want to test. For formality, we assume (1) is true so that we can test it by putting it into the model.
What about (2) and (3)? If you ask me, both are reasonable assumptions, and we can assume them to be true.
“Whoa, hold your horses!” some might say. “That is mighty uncomfortable! You can’t just assume those things!” But, but, but! We must make those choices; we cannot be agnostic. When I first learned about causal modeling, I had the same reservations and eventually warmed up to the idea. For any given pair of variables X and Y, we must pick one of:
- X directly influences Y
- Y directly influences X
- X and Y do not directly affect each other
In the case of (2), “admission rate influences department” doesn’t make sense while “department and admission rate are not directly related” is certainly false (depending on the other assumptions, it can mean “all departments have the same admission rate”). The only plausible thing that’s left is “department influences admission rate”.
The choice of assumptions is implied by the analyst’s methodology even if it is never stated, though one methodology can correspond to multiple sets of assumptions. Just because they’re unsaid does not mean those assumptions don’t exist. If anything, the unstated assumptions are often the ones that make the least sense (see my previous article). It’s almost like lying by omission for the sake of appearing “objective”.
The three assumptions taken together imply this causal diagram:
Why go through all this trouble? If you are not familiar with causal diagrams, I suggest reading the last third of my article on causality. As it turns out, the assumptions tell us how we should analyze the data:
- (1) or (1) + (3) dictate that we aggregate all the data and use Gender as a predictor
- (1) + (2) dictate that we use the raw data and use Gender as a predictor
- (1) + (2) + (3) dictate that we use the raw data and use Gender and Dept as predictors
Anna’s methodology assumes that departments and admission rate are not directly related. I highly doubt that, so her analysis is faulty. (Be careful. Even faulty analyses can yield correct conclusions by accident.)
Barbara’s methodology assumes (2) and (3). This is much more reasonable than Anna’s assumptions, so I’m more inclined to believe in her conclusion that the admission process is not sexist.
Causal modeling
“Controlling for X” is the same as “slicing and dicing by X”, except your variables do not have to be categorical.
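To see the "slicing and dicing" reading concretely, here is a small sketch that compares the pooled log odds ratio with the one inside each department slice. It assumes the counts from R's `UCBAdmissions` table, and `UCB_ROWS` and `log_odds_ratio` are illustrative names only.

```python
import math

# Assumption: counts as in R's built-in UCBAdmissions table.
# (dept, male_admitted, male_rejected, female_admitted, female_rejected)
UCB_ROWS = [
    ("A", 512, 313, 89, 19),
    ("B", 353, 207, 17, 8),
    ("C", 120, 205, 202, 391),
    ("D", 138, 279, 131, 244),
    ("E", 53, 138, 94, 299),
    ("F", 22, 351, 24, 317),
]

def log_odds_ratio(ma, mr, fa, fr):
    """Log of (female admission odds) / (male admission odds)."""
    return math.log((fa / fr) / (ma / mr))

# No slicing: add up every department first, then compare genders.
pooled = log_odds_ratio(*[sum(row[i] for row in UCB_ROWS) for i in range(1, 5)])

# Slicing by Dept: compare genders within each department separately.
by_dept = {dept: log_odds_ratio(ma, mr, fa, fr) for dept, ma, mr, fa, fr in UCB_ROWS}

print(f"pooled: {pooled:+.2f}")
for dept, value in by_dept.items():
    print(f"dept {dept}: {value:+.2f}")
```

A regression that controls for Dept is, roughly, a weighted summary of the per-department numbers. That is why it can disagree in sign with the pooled figure.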
Causal modeling is the art of slicing and dicing data the “right” way.
FiveThirtyEight made a fun interactive page where you can p-hack your way into statistical significance. Give it a try. If you regress Employment Rate on Governors, Republicans have a positive effect on the economy (p < 0.01). If you regress Employment Rate on Presidents, Republicans have a negative effect on the economy (p = 0.02). So what does the data say? Don’t confuse modeling choices with an inherent quality of the data.
An old joke goes “What do you get when you put 10 economists in a room? 11 opinions.” A more accurate joke would be “What do you get when you put 10 statisticians in a room? 10 estimates.”
And therein lies the problem. People get different conclusions depending on how they choose to slice and dice the data. What is even the “correct” conclusion? Does it exist? Is it a mythical creature? A Platonic ideal?
Instead of arguing about our different conclusions, how about discussing causal assumptions? Perhaps we can get the entire team to agree on one way to slice and dice the data. And, hopefully, all parties involved have enough grace to accept the conclusion that comes from this single way of slicing and dicing. This, in essence, is what causal modeling is.
(Photo by MTSOfan on Flickr)
A/B tests are valuable because that’s the only time we fully know the causal diagram (caveat: not true if you have missing data, or if there is effect modification from other partly overlapping experiments, or if the novelty effect is strong, or…). It is the only case where we do not have to make causal assumptions. We know exactly how to slice and dice the data correctly. The problem reduces purely to statistical assumptions.
Of course, there are plenty of issues with causal modeling, such as this criticism. Yet, despite all its flaws and limitations, causal modeling is the best tool we’ve got.
Will this make you a yes-man?
In the spirit of Betteridge’s Law: no.
In the UCB Admissions example, we can collectively agree on the assumptions (1) + (2) + (3) and yet someone objects to the results. We can ask: which assumption(s) do you have an issue with?
For example, if they disagree with (3), then as long as the team as a whole acts in good faith, other team members should chime in and say that (3) is the most plausible option. Assumptions are a group effort.
If they agree with all the causal assumptions and still object, then perhaps the issue is with the statistical assumptions. This is an entire article on its own, so I’ll skip it.
If they agree with all the causal and statistical assumptions but still doubt the conclusion, well, it’s easier for the team to see that this objection is irrational.
And, just like how experiments can be improved by making them double-blind, the team can discuss the assumptions without anyone knowing what the result of those assumptions will be.
I wonder how much happier and more collaborative teams will be when assumption-driven analytics becomes the norm. Everyone has a say in the analysis instead of leaving it only to the stats folks, and by agreeing on the terms upfront, the outcomes and decisions will be judged as fairer.
Predictive modeling is overrated
There, I said it.
I always tell people, before working on anything, to ask themselves: prediction or inference? You cannot have both. (Well, okay, you can, but only sometimes, as a treat.) Unfortunately, many people are not even aware of the distinction because most of data science is concerned with prediction.
For many applications, we only care about prediction — think putting models into production. The only conclusion we need to draw is “does this model make good predictions?” This is a privilege.
However, decision making from data requires inference. Drawing conclusions from data requires inference. Recommending actions from data requires inference.
Naturally, purely predictive models are bad for those tasks. Consider models like random forest with no semblance of interpretability other than LIME, which makes no statement outside the local neighborhood of a given prediction. What action should you take based on the variable importance plot? How does it aid decision making in any way?
(Photo by Dale Cruse on Flickr)
Even if you use a more interpretable model like elasticnet, you cannot throw in everything, maximize predictive power, and expect to get good inferences. You might be conditioning on a collider and getting a spurious correlation, or you might be blocking a causal pathway and hiding the effect (if you don’t understand what this means, please read the last third of my article on causality). An illustrative quote:
Controlling for barometric pressure, Mount Everest has the same altitude as the Dead Sea. (source)
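The collider half of that warning is easy to reproduce with simulated data. The sketch below is a toy setup of my own, not anything from the article: `x` and `y` are drawn independently, `collider` is their sum, and "conditioning" is done crudely by keeping only rows where the collider exceeds a threshold.

```python
import random

def pearson(xs, ys):
    """Plain Pearson correlation coefficient."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(xs, ys))
    vx = sum((a - mx) ** 2 for a in xs)
    vy = sum((b - my) ** 2 for b in ys)
    return cov / (vx * vy) ** 0.5

rng = random.Random(0)  # fixed seed so the run is repeatable
n = 20000
x = [rng.gauss(0, 1) for _ in range(n)]
y = [rng.gauss(0, 1) for _ in range(n)]   # independent of x by construction
collider = [a + b for a, b in zip(x, y)]  # common effect of x and y

# Conditioning on the collider: keep only rows where it is large.
kept = [(a, b) for a, b, c in zip(x, y, collider) if c > 1.0]

corr_all = pearson(x, y)
corr_kept = pearson([a for a, _ in kept], [b for _, b in kept])
print(f"all rows:       corr(x, y) = {corr_all:+.3f}")
print(f"collider > 1.0: corr(x, y) = {corr_kept:+.3f}")
```

Nothing links `x` and `y` in the full sample, yet a clear negative association appears in the selected rows. A model that "controls for everything" can manufacture a relationship in the same way.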
Surely that’s a silly conclusion, but people unknowingly draw that kind of conclusion all the time by “controlling for everything”. It’s rarely obvious. Or, equally bad, people conclude that turning sprinklers on will prevent rain. (Hey, it is an excellent predictor.)
Causal modeling is a case of “we have the product that customers (business stakeholders) need, but the customers don’t know that this is what they need”.
“Actionable insights” has become some kind of a buzzword in data science. Yet, people chase purely predictive models that cannot possibly yield “actionable insights” while they ignore inferential / causal modeling. It is strange. Causal modeling should be the heart of analytics.
In closing
If someone tries to browbeat others with data by saying “The data clearly says X!”, try asking them what their causal assumptions are. It’s illuminating. If they claim there are no assumptions, either the data comes from an A/B test or it might as well be:
The more I work with data, the less certain I get about conclusions. The ones parading around a statistic like it is a glimpse of ThE oNe TrUtH might very well be under the Dunning-Kruger Effect: they make strong inferential statements without knowing the role that assumptions play in inference.
We rarely know whether or not we are slicing and dicing the data the “right” way, though some ways are more believable than others. As much as we like to think that science and faith are polar opposites, much of science rests on unverifiable assumptions that require a leap of faith. Always be open to the possibility of being wrong.
Translated from: https://towardsdatascience.com/no-the-data-never-says-anything-510edbd9b43a