Paper:《A Few Useful Things to Know About Machine Learning—关于机器学习的一些有用的知识》翻译与解读
目錄
《A Few Useful Things to Know About Machine Learning》翻譯與解讀:了解機器學習的一些有用的知識
Key Insights 重要見解
Learning = Representation + Evaluation + Optimization 學習=表示+評估+優化
It's Generalization that Counts 泛化才是關鍵
Data Alone Is Not Enough 僅有數據是不夠的
Intuition Fails in High Dimensions 直覺在高維中失效
Theoretical Guarantees Are Not What They Seem 理論保證並非表面看上去的那樣
Feature Engineering Is The Key 特徵工程是關鍵
More Data Beats a Cleverer Algorithm 更多數據勝過更聰明的算法
Learn Many Models, Not Just One 學習多個模型,而不僅僅是一個
Simplicity Does Not Imply Accuracy 簡單並不意味著準確
Representable Does Not Imply Learnable 可表示並不意味著可學習
Correlation Does Not Imply Causation 相關並不意味著因果
Conclusion 結論
原文地址:https://homes.cs.washington.edu/~pedrod/papers/cacm12.pdf
2012年10月
《A Few Useful Things to Know About Machine Learning》翻譯與解讀
了解機器學習的一些有用的東西
| Machine learning systems automatically learn programs from data. This is often a very attractive alternative to manually constructing them, and in the last decade the use of machine learning has spread rapidly throughout computer science and beyond. Machine learning is used in Web search, spam filters, recommender systems, ad placement, credit scoring, fraud detection, stock trading, drug design, and many other applications. A recent report from the McKinsey Global Institute asserts that machine learning (a.k.a. data mining or predictive analytics) will be the driver of the next big wave of innovation [15]. Several fine textbooks are available to interested practitioners and researchers (for example, Mitchell [16] and Witten et al. [24]). However, much of the "folk knowledge" that is needed to successfully develop machine learning applications is not readily available in them. As a result, many machine learning projects take much longer than necessary or wind up producing less-than-ideal results. Yet much of this folk knowledge is fairly easy to communicate. This is the purpose of this article. | 機器學習系統自動地從數據中學習程序。這通常是手動構建程序的一種非常有吸引力的替代方法,並且在過去十年中,機器學習的使用已迅速遍及整個計算機科學及其他領域。機器學習被用於 Web 搜索、垃圾郵件過濾、推薦系統、廣告投放、信用評分、欺詐檢測、股票交易、藥物設計以及許多其他應用。麥肯錫全球研究院最近的一份報告斷言,機器學習(又稱數據挖掘或預測分析)將成為下一波創新浪潮的驅動力 [15]。有興趣的從業者和研究人員可以參考幾本優秀的教科書(例如 Mitchell [16] 和 Witten 等 [24])。但是,成功開發機器學習應用所需的許多“民間知識”在這些書中並不容易找到。結果,許多機器學習項目花費的時間遠超必要,或者最終產生不理想的結果。然而,這些民間知識中的大部分其實相當容易傳達,這正是本文的目的。 |
Key Insights 重要見解
| Machine learning algorithms can figure out how to perform important tasks by generalizing from examples. This is often feasible and cost-effective where manual programming is not. As more data becomes available, more ambitious problems can be tackled. Machine learning is widely used in computer science and other fields. However, developing successful machine learning applications requires a substantial amount of "black art" that is difficult to find in textbooks. This article summarizes 12 key lessons that machine learning researchers and practitioners have learned. These include pitfalls to avoid, important issues to focus on, and answers to common questions. | 機器學習算法可以通過從示例中泛化來弄清楚如何執行重要任務。在手動編程不可行的場合,這通常是可行且具有成本效益的。隨著可用數據越來越多,可以解決更有雄心的問題。機器學習廣泛應用於計算機科學和其他領域。但是,開發成功的機器學習應用需要大量教科書中很難找到的“黑魔法”(black art)。本文總結了機器學習研究人員和從業者學到的 12 條關鍵經驗,包括要避免的陷阱、需要重點關注的問題以及常見問題的答案。 |
| Many different types of machine learning exist, but for illustration purposes I will focus on the most mature and widely used one: classification. Nevertheless, the issues I will discuss apply across all of machine learning. A classifier is a system that inputs (typically) a vector of discrete and/or continuous feature values and outputs a single discrete value, the class. For example, a spam filter classifies email messages into "spam" or "not spam," and its input may be a Boolean vector x = (x_1, …, x_j, …, x_d), where x_j = 1 if the j-th word in the dictionary appears in the email and x_j = 0 otherwise. A learner inputs a training set of examples (x_i, y_i), where x_i = (x_{i,1}, …, x_{i,d}) is an observed input and y_i is the corresponding output, and outputs a classifier. The test of the learner is whether this classifier produces the correct output y_t for future examples x_t (for example, whether the spam filter correctly classifies previously unseen email messages as spam or not spam). | 機器學習有許多不同的類型,但為了便於說明,我將重點介紹最成熟、使用最廣泛的一種:分類。儘管如此,我要討論的問題適用於所有類型的機器學習。分類器是這樣一個系統:它(通常)輸入一個由離散和/或連續特徵值組成的向量,並輸出一個離散值,即類別。例如,垃圾郵件過濾器將電子郵件分類為“垃圾郵件”或“非垃圾郵件”,其輸入可以是布爾向量 x = (x_1, …, x_j, …, x_d),如果字典中的第 j 個單詞出現在郵件中,則 x_j = 1,否則 x_j = 0。學習器輸入一個由示例 (x_i, y_i) 組成的訓練集,其中 x_i = (x_{i,1}, …, x_{i,d}) 是觀察到的輸入,y_i 是相應的輸出,並輸出一個分類器。對學習器的檢驗是,這個分類器能否為未來的示例 x_t 給出正確的輸出 y_t(例如,垃圾郵件過濾器能否將以前未見過的郵件正確地分類為垃圾郵件或非垃圾郵件)。 |
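下面是一段極簡的 Python 示意代碼,對應上文的表示方式:把郵件映射為字典上的布爾向量,再由分類器輸出離散類別。其中的小字典 DICTIONARY 和手寫規則 toy_classifier 只是為演示而作的假設,並非論文或某個真實系統的內容。

```python
# Illustrative sketch (not from the paper): email -> Boolean feature vector -> class.
from typing import List

DICTIONARY = ["cheap", "meeting", "viagra", "project", "winner"]  # assumed toy dictionary

def featurize(email_text: str) -> List[int]:
    """Map an email to a Boolean vector: x_j = 1 iff the j-th dictionary word appears."""
    words = set(email_text.lower().split())
    return [1 if w in words else 0 for w in DICTIONARY]

def toy_classifier(x: List[int]) -> str:
    """Hand-written stand-in for a learned classifier: flag two 'spammy' words."""
    return "spam" if x[DICTIONARY.index("cheap")] or x[DICTIONARY.index("viagra")] else "not spam"

if __name__ == "__main__":
    for text in ["cheap viagra winner", "project meeting at noon"]:
        x = featurize(text)
        print(text, "->", x, "->", toy_classifier(x))
```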
Learning = Representation + Evaluation + Optimization 學習=表示+評估+優化
| Suppose you have an application that you think machine learning might be good for. The first problem facing you is the bewildering variety of learning algorithms available. Which one to use? There are literally thousands available, and hundreds more are published each year. The key to not getting lost in this huge space is to realize that it consists of combinations of just three components. The components are: Representation. A classifier must be represented in some formal language that the computer can handle. Conversely, choosing a representation for a learner is tantamount to choosing the set of classifiers that it can possibly learn. This set is called the hypothesis space of the learner. If a classifier is not in the hypothesis space, it cannot be learned. A related question, that I address later, is how to represent the input, in other words, what features to use. Evaluation. An evaluation function (also called objective function or scoring function) is needed to distinguish good classifiers from bad ones. The evaluation function used internally by the algorithm may differ from the external one that we want the classifier to optimize, for ease of optimization and because of the issues discussed below. Optimization. Finally, we need a method to search among the classifiers in the language for the highest-scoring one. The choice of optimization technique is key to the efficiency of the learner, and also helps determine the classifier produced if the evaluation function has more than one optimum. It is common for new learners to start out using off-the-shelf optimizers, which are later replaced by custom-designed ones. | 假設您有一個您認為機器學習可能很適合的應用。您面臨的第一個問題是可用的學習算法種類繁多,令人眼花繚亂。該用哪一個?可用的算法數以千計,每年還會發表數百種新算法。在這個巨大空間裏不迷路的關鍵,是認識到它只由三個組成部分的組合構成。這些組成部分是: 表示(Representation)。分類器必須用計算機可以處理的某種形式語言來表示。反過來說,為學習器選擇一種表示,等同於選擇它可能學到的分類器集合。這個集合稱為學習器的假設空間。如果一個分類器不在假設空間中,它就無法被學到。一個相關的問題(我稍後會討論)是如何表示輸入,換句話說,使用哪些特徵。 評估(Evaluation)。需要一個評估函數(也稱目標函數或打分函數)來區分分類器的好壞。算法內部使用的評估函數可能不同於我們希望分類器優化的外部評估函數,這是為了便於優化,也與下文討論的問題有關。 優化(Optimization)。最後,我們需要一種方法,在該語言所能表達的分類器中搜索得分最高的那個。優化技術的選擇對學習器的效率至關重要;當評估函數有多個最優解時,它也有助於決定最終產生哪個分類器。新的學習器通常先使用現成的優化器,之後再換成定製設計的優化器。 |
| The accompanying table shows common examples of each of these three components. For example, k-nearest neighbor classifies a test example by finding the k most similar training examples and predicting the majority class among them. Hyperplane-based methods form a linear combination of the features per class and predict the class with the highest-valued combination. Decision trees test one feature at each internal node, with one branch for each feature value, and have class predictions at the leaves. Algorithm 1 (above) shows a bare-bones decision tree learner for Boolean domains, using information gain and greedy search [20]. InfoGain(x_j, y) is the mutual information between feature x_j and the class y. MakeNode(x, c0, c1) returns a node that tests feature x and has c0 as the child for x = 0 and c1 as the child for x = 1. Of course, not all combinations of one component from each column of the table make equal sense. For example, discrete representations naturally go with combinatorial optimization, and continuous ones with continuous optimization. Nevertheless, many learners have both discrete and continuous components, and in fact the day may not be far when every single possible combination has appeared in some learner! Most textbooks are organized by representation, and it is easy to overlook the fact that the other components are equally important. There is no simple recipe for choosing each component, but I will touch on some of the key issues here. As we will see, some choices in a machine learning project may be even more important than the choice of learner. | 隨附的表格給出了這三個組成部分中每一部分的常見例子。例如,k 近鄰(k-nearest neighbor)通過找到 k 個最相似的訓練示例並預測其中的多數類,來對測試示例進行分類。基於超平面的方法為每個類構造特徵的線性組合,並預測組合取值最高的類。決策樹在每個內部節點測試一個特徵,每個特徵取值對應一個分支,並在葉子節點給出類別預測。上面的算法 1 展示了一個面向布爾域的極簡決策樹學習器,使用信息增益和貪心搜索 [20]。InfoGain(x_j, y) 是特徵 x_j 與類別 y 之間的互信息。MakeNode(x, c0, c1) 返回一個節點,該節點測試特徵 x,在 x = 0 時以 c0 作為子節點,在 x = 1 時以 c1 作為子節點。 當然,並非從表中每一列各取一個組成部分的所有組合都同樣有意義。例如,離散的表示自然與組合優化搭配,連續的表示則與連續優化搭配。儘管如此,許多學習器兼有離散和連續的組成部分;事實上,每一種可能的組合都出現在某個學習器中的那一天也許並不遙遠! 大多數教科書按表示來組織內容,因此很容易忽略其他組成部分同樣重要這一事實。選擇每個組成部分並沒有簡單的配方,但我會在這裏談到其中一些關鍵問題。正如我們將看到的,機器學習項目中的某些選擇甚至可能比學習器的選擇更重要。 |
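算法 1 的正文未包含在本節中,這裏給出一個與其思路一致的極簡重構(僅作示意,並非論文原始偽代碼):對布爾特徵使用信息增益做貪心搜索的決策樹學習器,其中 entropy、info_gain、learn_tree 等函數名為本文自擬。

```python
# Illustrative sketch (not the paper's Algorithm 1 verbatim): a bare-bones
# decision-tree learner for Boolean features using information gain + greedy search.
import math
from collections import Counter

def entropy(labels):
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values()) if n else 0.0

def info_gain(examples, labels, j):
    """Mutual information between Boolean feature j and the class."""
    gain = entropy(labels)
    for v in (0, 1):
        subset = [y for x, y in zip(examples, labels) if x[j] == v]
        gain -= (len(subset) / len(labels)) * entropy(subset)
    return gain

def learn_tree(examples, labels, features):
    # Leaf: all labels agree, or nothing left to split on.
    if not examples or not features or len(set(labels)) == 1:
        return Counter(labels).most_common(1)[0][0] if labels else 0
    best = max(features, key=lambda j: info_gain(examples, labels, j))
    children = {}
    for v in (0, 1):
        sub_x = [x for x in examples if x[best] == v]
        sub_y = [y for x, y in zip(examples, labels) if x[best] == v]
        if not sub_x:                     # empty branch -> majority class at this node
            children[v] = Counter(labels).most_common(1)[0][0]
        else:
            children[v] = learn_tree(sub_x, sub_y, [f for f in features if f != best])
    return (best, children)               # internal node: (feature tested, {0: child, 1: child})

def predict(tree, x):
    while isinstance(tree, tuple):
        j, children = tree
        tree = children[x[j]]
    return tree

if __name__ == "__main__":
    X = [[0, 0], [0, 1], [1, 0], [1, 1]]
    y = [0, 1, 1, 1]                       # y = x1 OR x2
    tree = learn_tree(X, y, features=[0, 1])
    print([predict(tree, x) for x in X])   # -> [0, 1, 1, 1]
```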
It's Generalization that Counts 泛化才是關鍵
| The fundamental goal of machine learning is to generalize beyond the examples in the training set. This is because, no matter how much data we have, it is very unlikely that we will see those exact examples again at test time. (Notice that, if there are 100,000 words in the dictionary, the spam filter described above has 2^100,000 possible different inputs.) Doing well on the training set is easy (just memorize the examples). The most common mistake among machine learning beginners is to test on the training data and have the illusion of success. If the chosen classifier is then tested on new data, it is often no better than random guessing. So, if you hire someone to build a classifier, be sure to keep some of the data to yourself and test the classifier they give you on it. Conversely, if you have been hired to build a classifier, set some of the data aside from the beginning, and only use it to test your chosen classifier at the very end, followed by learning your final classifier on the whole data. | 機器學習的根本目標是泛化到訓練集中的示例之外。這是因為,無論我們擁有多少數據,在測試時再次看到完全相同的示例的可能性都非常小。(注意,如果字典中有 100,000 個單詞,上述垃圾郵件過濾器就有 2^100,000 種可能的不同輸入。)在訓練集上做得好很容易(只要把示例記住即可)。機器學習初學者最常見的錯誤,就是在訓練數據上測試,從而產生成功的幻覺。如果把選出的分類器放到新數據上測試,它往往並不比隨機猜測更好。因此,如果您僱人構建分類器,一定要自己留下一部分數據,並用它來測試對方交付的分類器。反過來,如果您受僱構建分類器,請從一開始就留出一部分數據,只在最後用它來測試您選定的分類器,然後再在全部數據上學習最終的分類器。 |
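下面用一小段 Python 直觀演示上文所說的“在訓練數據上測試產生成功幻覺”:一個只會死記訓練集的分類器,在訓練集上接近 100% 準確,在新數據上則接近隨機猜測。數據和標籤均為隨機生成,是本示意的假設。

```python
# Illustrative sketch (not from the paper): memorizing the training set looks
# perfect on training data and is no better than chance on new data.
import random

random.seed(0)
d = 20
make_x = lambda: tuple(random.randint(0, 1) for _ in range(d))
data = [(make_x(), random.randint(0, 1)) for _ in range(200)]   # labels carry no signal
train, test = data[:100], data[100:]

memory = {x: y for x, y in train}
def memorizing_classifier(x):
    return memory.get(x, 0)              # unseen examples: guess a fixed class

def accuracy(examples):
    return sum(memorizing_classifier(x) == y for x, y in examples) / len(examples)

print("train accuracy:", accuracy(train))   # ~1.0 -- the illusion of success
print("test accuracy:", accuracy(test))     # ~0.5 -- random guessing
```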
| Contamination of your classifier by ?test data can occur in insidious ways, ?for example, if you use test data to ?tune parameters and do a lot of tuning. ?(Machine learning algorithms ?have lots of knobs, and success often ?comes from twiddling them a lot, ?so this is a real concern.) Of course, ?holding out data reduces the amount ?available for training. This can be mitigated ?by doing cross-validation: randomly ?dividing your training data into ?(say) 10 subsets, holding out each one ?while training on the rest, testing each ?learned classifier on the examples it ?did not see, and averaging the results ?to see how well the particular parameter ?setting does. ? In the early days of machine learning, ?the need to keep training and test ?data separate was not widely appreciated. ?This was partly because, if the ?learner has a very limited representation ?(for example, hyperplanes), the ?difference between training and test ?error may not be large. But with very ?flexible classifiers (for example, decision ?trees), or even with linear classifiers ?with a lot of features, strict separation ?is mandatory. | 測試數據對分類器的污染可能以陰險的方式發生,例如,如果您使用測試數據來調整參數并進行大量調整。 (機器學習算法有很多旋鈕,而成功往往來自于大量的糾纏,因此這是一個真正的問題。)當然,保留數據會減少可用于訓練的數量。可以通過交叉驗證來緩解這種情況:將您的訓練數據隨機分為10個子集(例如10個子集),在其余部分進行訓練時堅持每個子集,在未看到的示例上測試每個學習的分類器,然后平均結果以查看特定參數設置的效果如何。 在機器學習的早期,對訓練和測試數據保持分開的需求并未得到廣泛認可。這部分是因為,如果學習者的表示形式非常有限(例如,超平面),則訓練與測試錯誤之間的差異可能不會很大。但是對于非常靈活的分類器(例如決策樹),甚至對于具有很多功能的線性分類器來說,嚴格的分隔是強制性的。 |
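下面是 10 折交叉驗證的一個極簡示意(假設性代碼,非論文內容):把訓練數據隨機分成 10 份,每次留出一份,在其餘數據上訓練,再在留出的一份上測試,最後對結果取平均。其中的“多數類學習器”只是佔位用的假設。

```python
# Illustrative sketch (not from the paper): 10-fold cross-validation with a
# placeholder majority-class learner so the example runs end to end.
import random
from collections import Counter

def cross_validate(data, train_fn, predict_fn, k=10, seed=0):
    data = data[:]
    random.Random(seed).shuffle(data)
    folds = [data[i::k] for i in range(k)]           # k roughly equal subsets
    scores = []
    for i in range(k):
        held_out = folds[i]
        train = [ex for j, f in enumerate(folds) if j != i for ex in f]
        model = train_fn(train)
        acc = sum(predict_fn(model, x) == y for x, y in held_out) / len(held_out)
        scores.append(acc)
    return sum(scores) / k                            # average over folds

# Placeholder learner: always predict the majority class of the training fold.
train_majority = lambda train: Counter(y for _, y in train).most_common(1)[0][0]
predict_majority = lambda model, x: model

if __name__ == "__main__":
    rng = random.Random(1)
    data = [((rng.random(),), int(rng.random() < 0.7)) for _ in range(200)]
    print("10-fold CV accuracy:", cross_validate(data, train_majority, predict_majority))
```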
| Notice that generalization being ?the goal has an interesting consequence ?for machine learning. Unlike ?in most other optimization problems, ?we do not have access to the function ?we want to optimize! We have to use ?training error as a surrogate for test ?error, and this is fraught with danger. ?(How to deal with it is addressed ?later.) On the positive side, since the ?objective function is only a proxy for ?the true goal, we may not need to fully optimize it; in fact, a local optimum ?returned by simple greedy search may ?be better than the global optimum. | 請注意,泛化是機器學習的目標產生了有趣的結果。 與大多數其他優化問題不同,我們無權訪問我們要優化的功能! 我們必須使用訓練錯誤作為測試錯誤的替代品,這充滿了危險。 從積極的方面來看,由于目標函數只是真實目標的代理,因此我們可能不需要完全優化它; 實際上,通過簡單的貪婪搜索返回的局部最優值可能要好于全局最優值。 |
Data Alone Is Not Enough 僅有數據是不夠的
| Generalization being the goal has another major consequence: Data alone is not enough, no matter how much of it you have. Consider learning a Boolean function of (say) 100 variables from a million examples. There are 2^100 − 10^6 examples whose classes you do not know. How do you figure out what those classes are? In the absence of further information, there is just no way to do this that beats flipping a coin. This observation was first made (in somewhat different form) by the philosopher David Hume over 200 years ago, but even today many mistakes in machine learning stem from failing to appreciate it. Every learner must embody some knowledge or assumptions beyond the data it is given in order to generalize beyond it. This notion was formalized by Wolpert in his famous "no free lunch" theorems, according to which no learner can beat random guessing over all possible functions to be learned [25]. This seems like rather depressing news. How then can we ever hope to learn anything? Luckily, the functions we want to learn in the real world are not drawn uniformly from the set of all mathematically possible functions! In fact, very general assumptions (like smoothness, similar examples having similar classes, limited dependences, or limited complexity) are often enough to do very well, and this is a large part of why machine learning has been so successful. Like deduction, induction (what learners do) is a knowledge lever: it turns a small amount of input knowledge into a large amount of output knowledge. Induction is a vastly more powerful lever than deduction, requiring much less input knowledge to produce useful results, but it still needs more than zero input knowledge to work. And, as with any lever, the more we put in, the more we can get out. | 泛化作為目標還有另一個重大後果:僅有數據是不夠的,無論您擁有多少數據。考慮從一百萬個示例中學習一個(比如)100 個變量的布爾函數。有 2^100 − 10^6 個示例的類別是您不知道的。您如何弄清楚這些類別是什麼?在沒有更多信息的情況下,沒有任何辦法能勝過拋硬幣。這一觀察最早(以略有不同的形式)由哲學家戴維·休謨(David Hume)在 200 多年前提出,但即使在今天,機器學習中的許多錯誤仍然源於未能認識到這一點。每個學習器都必須在給定的數據之外蘊含某些知識或假設,才能泛化到數據之外。Wolpert 在他著名的“沒有免費的午餐”定理中將這一概念形式化:根據該定理,在所有可能要學習的函數上,沒有任何學習器能勝過隨機猜測 [25]。 這似乎是相當令人沮喪的消息。那我們還怎麼指望學到任何東西呢?幸運的是,我們在現實世界中想要學習的函數,並不是從所有數學上可能的函數集合中均勻抽取的!事實上,一些非常一般的假設(例如平滑性、相似的示例具有相似的類別、有限的依賴關係或有限的複雜度)往往就足以做得很好,這也是機器學習如此成功的重要原因。與演繹一樣,歸納(學習器所做的事)是一種知識槓桿:它把少量的輸入知識變成大量的輸出知識。歸納是比演繹強大得多的槓桿,只需要少得多的輸入知識就能產生有用的結果,但它仍然需要多於零的輸入知識才能起作用。而且,就像任何槓桿一樣,投入越多,得到的就越多。 |
| A corollary of this is that one of the ?key criteria for choosing a representation ?is which kinds of knowledge are ?easily expressed in it. For example, if ?we have a lot of knowledge about what ?makes examples similar in our domain, instance-based methods may ?be a good choice. If we have knowledge ?about probabilistic dependencies, ?graphical models are a good fit. ?And if we have knowledge about what ?kinds of preconditions are required by ?each class, “IF . . . THEN . . .” rules may ?be the best option. The most useful ?learners in this regard are those that ?do not just have assumptions hardwired ?into them, but allow us to state ?them explicitly, vary them widely, and ?incorporate them automatically into ?the learning (for example, using firstorder ?logic21 or grammars6 ?). In retrospect, the need for knowledge ?in learning should not be surprising. ?Machine learning is not ?magic; it cannot get something from ?nothing. What it does is get more ?from less. Programming, like all engineering, ?is a lot of work: we have to ?build everything from scratch. Learning ?is more like farming, which lets ?nature do most of the work. Farmers ?combine seeds with nutrients to grow ?crops. Learners combine knowledge ?with data to grow programs. | 一個必然的推論是選擇一種表示形式的關鍵標準之一就是在其中容易表達哪種知識。例如,如果我們對使示例在我們的領域中變得相似有很多了解,那么基于實例的方法可能是一個不錯的選擇。如果我們了解有關概率依賴性的知識,則圖形模型非常適合。如果我們了解每個類都需要哪些先決條件,則“ IF。 。 。然后 。 。 。”規則可能是最佳選擇。在這方面最有用的學習者是那些不僅將假設硬性地扎入其中的假設,而且使我們能夠明確地陳述它們,進行廣泛的變化并將它們自動地納入學習中(例如使用一階logic21或grammars6)。 回想起來,學習中知識的需求不足為奇。機器學習不是魔術;它一無所獲。它所做的就是從更少獲得更多。像所有工程學一樣,編程工作量很大:我們必須從頭開始構建所有內容。學習更像是耕種,讓自然完成大部分工作。農民將種子與養分結合起來種植農作物。學習者將知識與數據相結合以開發程序。 |
| Overfitting Has Many Faces. What if the knowledge and data we have are not sufficient to completely determine the correct classifier? Then we run the risk of just hallucinating a classifier (or parts of it) that is not grounded in reality, and is simply encoding random quirks in the data. This problem is called overfitting, and is the bugbear of machine learning. When your learner outputs a classifier that is 100% accurate on the training data but only 50% accurate on test data, when in fact it could have output one that is 75% accurate on both, it has overfit. Everyone in machine learning knows about overfitting, but it comes in many forms that are not immediately obvious. One way to understand overfitting is by decomposing generalization error into bias and variance [9]. Bias is a learner's tendency to consistently learn the same wrong thing. Variance is the tendency to learn random things irrespective of the real signal. Figure 1 illustrates this by an analogy with throwing darts at a board. A linear learner has high bias, because when the frontier between two classes is not a hyperplane the learner is unable to induce it. Decision trees do not have this problem because they can represent any Boolean function, but on the other hand they can suffer from high variance: decision trees learned on different training sets generated by the same phenomenon are often very different, when in fact they should be the same. Similar reasoning applies to the choice of optimization method: beam search has lower bias than greedy search, but higher variance, because it tries more hypotheses. Thus, contrary to intuition, a more powerful learner is not necessarily better than a less powerful one. Figure 2 illustrates this. Even though the true classifier is a set of rules, with up to 1,000 examples naive Bayes is more accurate than a rule learner. This happens despite naive Bayes's false assumption that the frontier is linear! Situations like this are common in machine learning: strong false assumptions can be better than weak true ones, because a learner with the latter needs more data to avoid overfitting. | 過擬合有許多面孔。如果我們擁有的知識和數據不足以完全確定正確的分類器,會怎樣?那麼我們就有風險學到一個(或部分)憑空臆造的分類器,它並非基於現實,只是把數據中的隨機怪癖編碼了進去。這個問題稱為過擬合,是機器學習的夢魘。當您的學習器輸出的分類器在訓練數據上 100% 準確、但在測試數據上只有 50% 準確,而實際上它本可以輸出一個在兩者上都有 75% 準確率的分類器時,它就過擬合了。 機器學習領域的每個人都知道過擬合,但它以許多並非一眼可見的形式出現。理解過擬合的一種方式是把泛化誤差分解為偏差(bias)和方差(variance)[9]。偏差是學習器持續學到同一種錯誤東西的傾向;方差是學習器學到與真實信號無關的隨機東西的傾向。圖 1 用向靶盤投飛鏢的類比說明了這一點。線性學習器偏差高,因為當兩個類之間的邊界不是超平面時,它無法歸納出這個邊界。決策樹沒有這個問題,因為它們可以表示任何布爾函數;但另一方面,它們可能有很高的方差:在由同一現象生成的不同訓練集上學到的決策樹往往非常不同,而實際上它們應該相同。類似的推理也適用於優化方法的選擇:束搜索(beam search)比貪心搜索偏差更低,但方差更高,因為它嘗試的假設更多。因此,與直覺相反,更強大的學習器不一定比不那麼強大的學習器更好。 圖 2 說明了這一點。即使真實的分類器是一組規則,在多至 1,000 個示例的情況下,樸素貝葉斯也比規則學習器更準確,儘管樸素貝葉斯關於邊界是線性的假設是錯誤的!這樣的情況在機器學習中很常見:強而錯誤的假設可能勝過弱而正確的假設,因為使用後者的學習器需要更多數據才能避免過擬合。 |
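偏差與方差可以粗略地通過反覆抽取訓練集來感受:在多個獨立生成的訓練集上訓練同一個學習器,觀察它在固定測試點上的預測波動。下面的示意代碼(數據、學習器均為本文假設)用“與眾數預測的不一致率”作為方差的粗略替代,比較高方差的 1 近鄰與高偏差、低方差的多數類分類器。

```python
# Illustrative sketch (not from the paper): train the same learner on many freshly
# drawn training sets and measure how much its predictions at fixed points vary.
import random
from collections import Counter

def draw_training_set(rng, n=30):
    data = []
    for _ in range(n):
        x = rng.uniform(0, 1)
        data.append((x, int(x > 0.7) ^ (rng.random() < 0.15)))   # noisy threshold concept
    return data

def one_nn(train):
    return lambda x: min(train, key=lambda ex: abs(ex[0] - x))[1]   # high-variance learner

def majority(train):
    c = Counter(y for _, y in train).most_common(1)[0][0]
    return lambda x: c                                              # high-bias, low-variance learner

def prediction_spread(make_learner, test_points, runs=200, seed=0):
    rng = random.Random(seed)
    preds = {x: [] for x in test_points}
    for _ in range(runs):
        model = make_learner(draw_training_set(rng))
        for x in test_points:
            preds[x].append(model(x))
    # crude variance proxy: how often the prediction at a point disagrees with its modal value
    return sum(1 - Counter(p).most_common(1)[0][1] / len(p) for p in preds.values()) / len(preds)

test_points = [i / 10 for i in range(11)]
print("1-NN spread:    ", round(prediction_spread(one_nn, test_points), 3))
print("majority spread:", round(prediction_spread(majority, test_points), 3))
```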
| Cross-validation can help to combat overfitting, for example by using it to choose the best size of decision tree to learn. But it is no panacea, since if we use it to make too many parameter choices it can itself start to overfit [17]. Besides cross-validation, there are many methods to combat overfitting. The most popular one is adding a regularization term to the evaluation function. This can, for example, penalize classifiers with more structure, thereby favoring smaller ones with less room to overfit. Another option is to perform a statistical significance test like chi-square before adding new structure, to decide whether the distribution of the class really is different with and without this structure. These techniques are particularly useful when data is very scarce. Nevertheless, you should be skeptical of claims that a particular technique "solves" the overfitting problem. It is easy to avoid overfitting (variance) by falling into the opposite error of underfitting (bias). Simultaneously avoiding both requires learning a perfect classifier, and short of knowing it in advance there is no single technique that will always do best (no free lunch). A common misconception about overfitting is that it is caused by noise, like training examples labeled with the wrong class. This can indeed aggravate overfitting, by making the learner draw a capricious frontier to keep those examples on what it thinks is the right side. But severe overfitting can occur even in the absence of noise. For instance, suppose we learn a Boolean classifier that is just the disjunction of the examples labeled "true" in the training set. (In other words, the classifier is a Boolean formula in disjunctive normal form, where each term is the conjunction of the feature values of one specific training example.) This classifier gets all the training examples right and every positive test example wrong, regardless of whether the training data is noisy or not. The problem of multiple testing [13] is closely related to overfitting. Standard statistical tests assume that only one hypothesis is being tested, but modern learners can easily test millions before they are done. As a result what looks significant may in fact not be. For example, a mutual fund that beats the market 10 years in a row looks very impressive, until you realize that, if there are 1,000 funds and each has a 50% chance of beating the market on any given year, it is quite likely that one will succeed all 10 times just by luck. This problem can be combatted by correcting the significance tests to take the number of hypotheses into account, but this can also lead to underfitting. A better approach is to control the fraction of falsely accepted non-null hypotheses, known as the false discovery rate [3]. | 交叉驗證有助於對抗過擬合,例如用它來選擇要學習的決策樹的最佳大小。但它並非萬能藥,因為如果我們用它來做過多的參數選擇,它本身也會開始過擬合 [17]。 除了交叉驗證,還有許多對抗過擬合的方法。最流行的一種是在評估函數中加入正則化項。例如,它可以懲罰結構更多的分類器,從而偏向更小、過擬合餘地更少的分類器。另一種選擇是在加入新結構之前做卡方等統計顯著性檢驗,以判斷加入該結構前後類別的分佈是否真的不同。當數據非常稀缺時,這些技術特別有用。儘管如此,對於“某種技術‘解決了’過擬合問題”這類說法,您應當保持懷疑。落入欠擬合(偏差)這一相反的錯誤,從而避免過擬合(方差),是很容易的;要同時避免兩者,就需要學到一個完美的分類器,而在事先不知道這個分類器的情況下,沒有哪種單一技術總能做到最好(沒有免費的午餐)。 關於過擬合的一個常見誤解是它由噪聲引起,例如類別標注錯誤的訓練示例。這確實會加劇過擬合:學習器會畫出一條反覆無常的邊界,把那些示例留在它認為正確的一側。但即使沒有噪聲,也可能發生嚴重的過擬合。例如,假設我們學到的布爾分類器只是訓練集中標注為“true”的示例的析取。(換句話說,這個分類器是析取範式的布爾公式,其中每一項是某個特定訓練示例各特徵取值的合取。)無論訓練數據有沒有噪聲,這個分類器都會把所有訓練示例分對,卻把每一個正例測試示例分錯。 多重檢驗問題 [13] 與過擬合密切相關。標準統計檢驗假設只檢驗一個假設,但現代學習器在完成之前可以輕鬆檢驗數百萬個假設。結果,看起來顯著的東西實際上可能並不顯著。例如,一只連續 10 年跑贏市場的共同基金看起來非常了不起,直到您意識到:如果有 1,000 只基金,每只基金在任何一年跑贏市場的概率都是 50%,那麼很可能有一只基金僅憑運氣就能連續 10 次成功。可以通過校正顯著性檢驗、把假設數量考慮進去來對付這個問題,但這也可能導致欠擬合。更好的方法是控制被錯誤接受的非原假設的比例,即所謂的錯誤發現率(false discovery rate)[3]。 |
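上文提到,最流行的對抗過擬合的方法是在評估函數中加入正則化項。下面是一個極簡示意(NumPy,假設性代碼,以線性回歸和平方損失代替分類損失,僅作演示):在損失上加 L2 懲罰,lam 越大,對大權重的懲罰越重,從而偏向更“小”的模型。

```python
# Illustrative sketch (not from the paper): add a regularization term to the
# evaluation function, here squared loss + lam * ||w||^2, fit by gradient descent.
import numpy as np

def fit_linear(X, y, lam=0.0, lr=0.1, steps=2000):
    """Minimize (1/n)*||Xw - y||^2 + lam*||w||^2."""
    n, d = X.shape
    w = np.zeros(d)
    for _ in range(steps):
        grad = 2.0 / n * X.T @ (X @ w - y) + 2.0 * lam * w
        w -= lr * grad
    return w

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    X = rng.normal(size=(30, 10))             # few examples, many features: room to overfit
    y = X[:, 0] + 0.1 * rng.normal(size=30)   # only the first feature actually matters
    for lam in (0.0, 0.5):
        w = fit_linear(X, y, lam=lam)
        print(f"lam={lam}: w[0]={w[0]:+.2f}, weight mass on irrelevant features={np.abs(w[1:]).sum():.3f}")
```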
Intuition Fails in High Dimensions 直覺在高維中失效
| After overfitting, the biggest problem in machine learning is the curse of dimensionality. This expression was coined by Bellman in 1961 to refer to the fact that many algorithms that work fine in low dimensions become intractable when the input is high-dimensional. But in machine learning it refers to much more. Generalizing correctly becomes exponentially harder as the dimensionality (number of features) of the examples grows, because a fixed-size training set covers a dwindling fraction of the input space. Even with a moderate dimension of 100 and a huge training set of a trillion examples, the latter covers only a fraction of about 10^−18 of the input space. This is what makes machine learning both necessary and hard. More seriously, the similarity-based reasoning that machine learning algorithms depend on (explicitly or implicitly) breaks down in high dimensions. Consider a nearest neighbor classifier with Hamming distance as the similarity measure, and suppose the class is just x1 ∧ x2. If there are no other features, this is an easy problem. But if there are 98 irrelevant features x3, …, x100, the noise from them completely swamps the signal in x1 and x2, and nearest neighbor effectively makes random predictions. Even more disturbing is that nearest neighbor still has a problem even if all 100 features are relevant! This is because in high dimensions all examples look alike. Suppose, for instance, that examples are laid out on a regular grid, and consider a test example x_t. If the grid is d-dimensional, x_t's 2d nearest examples are all at the same distance from it. So as the dimensionality increases, more and more examples become nearest neighbors of x_t, until the choice of nearest neighbor (and therefore of class) is effectively random. | 除了過擬合之外,機器學習中最大的問題就是維度災難(curse of dimensionality)。這個說法是 Bellman 在 1961 年提出的,指的是許多在低維下工作良好的算法在輸入是高維時變得難以處理。但在機器學習中,它的含義遠不止於此。隨著示例維度(特徵數量)的增長,正確地泛化會變得指數級地困難,因為固定大小的訓練集只覆蓋了輸入空間中越來越小的一部分。即使維度只有 100 這樣適中的規模,而訓練集大到一萬億個示例,後者也只覆蓋了輸入空間中大約 10^−18 的比例。這就是機器學習既必要又困難的原因。 更嚴重的是,機器學習算法(顯式或隱式)所依賴的基於相似性的推理在高維中會失效。考慮一個以漢明距離作為相似性度量的最近鄰分類器,並假設類別就是 x1 ∧ x2。如果沒有其他特徵,這是一個簡單的問題。但如果還有 98 個無關特徵 x3, …, x100,它們帶來的噪聲會完全淹沒 x1 和 x2 中的信號,最近鄰實際上就是在做隨機預測。 更令人不安的是,即使所有 100 個特徵都相關,最近鄰仍然有問題!這是因為在高維中所有示例看起來都很相似。例如,假設示例排布在一個規則網格上,考慮一個測試示例 x_t。如果網格是 d 維的,那麼離 x_t 最近的 2d 個示例與它的距離都相同。因此隨著維度增加,越來越多的示例成為 x_t 的最近鄰,直到最近鄰(從而類別)的選擇實際上變成隨機的。 |
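下面的小模擬對應上文的例子:類別是 x1 ∧ x2,使用漢明距離的 1 近鄰分類器;隨著無關布爾特徵的增加,其準確率明顯下降。樣本數等具體數值為本示意的任意假設。

```python
# Illustrative simulation of the paragraph above: the class is x1 AND x2, and
# adding irrelevant Boolean features swamps the signal for Hamming-distance 1-NN.
import random

def run(n_irrelevant, n_train=200, n_test=500, seed=0):
    rng = random.Random(seed)
    d = 2 + n_irrelevant
    make = lambda: [rng.randint(0, 1) for _ in range(d)]
    label = lambda x: x[0] & x[1]                      # only the first two features matter
    train = [(x, label(x)) for x in (make() for _ in range(n_train))]
    correct = 0
    for _ in range(n_test):
        x = make()
        hamming = lambda a: sum(ai != xi for ai, xi in zip(a, x))
        _, y_hat = min(train, key=lambda ex: hamming(ex[0]))
        correct += (y_hat == label(x))
    return correct / n_test

for k in (0, 8, 48, 98):
    print(f"{k:3d} irrelevant features -> 1-NN accuracy {run(k):.2f}")   # accuracy drops markedly
```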
| This is only one instance of a more ?general problem with high dimensions: ?our intuitions, which come ?from a three-dimensional world, often ?do not apply in high-dimensional ?ones. In high dimensions, most of the ?mass of a multivariate Gaussian distribution ?is not near the mean, but in ?an increasingly distant “shell” around ?it; and most of the volume of a highdimensional ?orange is in the skin, not ?the pulp. If a constant number of examples ?is distributed uniformly in a ?high-dimensional hypercube, beyond ?some dimensionality most examples ?are closer to a face of the hypercube ?than to their nearest neighbor. And if ?we approximate a hypersphere by inscribing ?it in a hypercube, in high dimensions ?almost all the volume of the ?hypercube is outside the hypersphere. ?This is bad news for machine learning, ?where shapes of one type are often approximated ?by shapes of another. ? Building a classifier in two or three ?dimensions is easy; we can find a reasonable ?frontier between examples ?of different classes just by visual inspection. (It has even been said that if ?people could see in high dimensions ?machine learning would not be necessary.) ?But in high dimensions it is difficult ?to understand what is happening. ?This in turn makes it difficult to ?design a good classifier. Naively, one ?might think that gathering more features ?never hurts, since at worst they ?provide no new information about the ?class. But in fact their benefits may ?be outweighed by the curse of dimensionality.? ?Fortunately, there is an effect that ?partly counteracts the curse, which ?might be called the “blessing of nonuniformity.” ?In most applications ?examples are not spread uniformly ?throughout the instance space, but ?are concentrated on or near a lowerdimensional ?manifold. For example, ?k-nearest neighbor works quite well ?for handwritten digit recognition ?even though images of digits have ?one dimension per pixel, because the ?space of digit images is much smaller ?than the space of all possible images. ?Learners can implicitly take advantage ?of this lower effective dimension, ?or algorithms for explicitly reducing ?the dimensionality can be used (for ?example, Tenenbaum22). | 這只是一個更高維度的一般性問題的一個例子:我們的直覺來自三維世界,通常不適用于高維度的直覺。在高維中,多元高斯分布的大部分質量都不在均值附近,而是在其周圍越來越遠的``殼''中;高維橙的大部分體積在皮膚中,而不是果肉中。如果恒定數量的示例均勻分布在一個高維超立方體中,則除了某些維之外,大多數示例比其最近的鄰居更靠近超立方體的一面。并且,如果我們通過將其記錄在超立方體中來近似超球面,則在高維中,幾乎所有超立方體的體積都在超球面之外。這對于機器學習來說是個壞消息,其中一種類型的形狀通常被另一種形狀的形狀近似。 在兩個或三個維度中建立分類器很容易;我們可以通過目視檢查在不同類別的示例之間找到合理的邊界。 (甚至有人說,如果人們可以在高維度上看到機器學習是沒有必要的。)但是在高維度上,很難理解正在發生的事情。反過來,這使得設計好的分類器變得困難。天真的,一個人可能認為收集更多功能永遠不會有害,因為在最壞的情況下,它們不提供有關該類的新信息。但是實際上,它們的好處可能會因維數的詛咒而被抵消。 幸運的是,有一種效果可以部分抵消這種詛咒,這種詛咒可能被稱為“不均勻的祝福”。在大多數應用程序中,示例并非均勻分布在整個實例空間中,而是集中在低維流形上或附近。例如,即使數字圖像每像素具有一維尺寸,k近鄰也能很好地用于手寫數字識別,因為數字圖像的空間比所有可能圖像的空間小得多。學習者可以隱式地利用此較低的有效維度,或者可以使用顯式降低維度的算法(例如Tenenbaum22)。 |
Theoretical Guarantees Are Not What They Seem 理論保證並非表面看上去的那樣
| One of the major developments of recent decades has been the realization that we can have guarantees on the results of induction, particularly if we are willing to settle for probabilistic guarantees. | 近幾十年來的主要進展之一,是認識到我們可以對歸納的結果給出保證,特別是當我們願意接受概率性保證的時候。 |
| Machine learning papers are full of theoretical guarantees. The most common type is a bound on the number of examples needed to ensure good generalization. What should you make of these guarantees? First of all, it is remarkable that they are even possible. Induction is traditionally contrasted with deduction: in deduction you can guarantee that the conclusions are correct; in induction all bets are off. Or such was the conventional wisdom for many centuries. One of the major developments of recent decades has been the realization that in fact we can have guarantees on the results of induction, particularly if we are willing to settle for probabilistic guarantees. The basic argument is remarkably simple [5]. Let's say a classifier is bad if its true error rate is greater than ε. Then the probability that a bad classifier is consistent with n random, independent training examples is less than (1 − ε)^n. Let b be the number of bad classifiers in the learner's hypothesis space H. The probability that at least one of them is consistent is less than b(1 − ε)^n, by the union bound. Assuming the learner always returns a consistent classifier, the probability that this classifier is bad is then less than |H|(1 − ε)^n, where we have used the fact that b ≤ |H|. So if we want this probability to be less than δ, it suffices to make n > ln(δ/|H|)/ln(1 − ε) ≥ (1/ε)(ln|H| + ln(1/δ)). | 機器學習論文充滿了理論保證。最常見的一類是對確保良好泛化所需示例數量的界。您應該如何看待這些保證?首先,它們居然可能存在,這一點就很了不起。傳統上,歸納與演繹相對:在演繹中,您可以保證結論正確;而在歸納中,一切都沒有保障。至少許多世紀以來的傳統看法是這樣。近幾十年來的主要進展之一,是認識到我們實際上可以對歸納的結果給出保證,特別是當我們願意接受概率性保證的時候。 基本論證非常簡單 [5]。當一個分類器的真實錯誤率大於 ε 時,我們稱它是“壞”的。那麼一個壞分類器與 n 個隨機、獨立的訓練示例一致的概率小於 (1 − ε)^n。設 b 是學習器假設空間 H 中壞分類器的數量。由聯合界(union bound)可知,其中至少有一個與訓練集一致的概率小於 b(1 − ε)^n。假設學習器總是返回一個一致的分類器,那麼這個分類器是壞分類器的概率小於 |H|(1 − ε)^n,其中我們用到了 b ≤ |H|。因此,如果希望這個概率小於 δ,只需使 n > ln(δ/|H|)/ln(1 − ε) ≥ (1/ε)(ln|H| + ln(1/δ))。 |
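可以把上面的界 n ≥ (1/ε)(ln|H| + ln(1/δ)) 代入正文的例子做個粗略計算:100 個布爾特徵、深度不超過 10 層的決策樹、δ = ε = 1%。下面對假設空間大小的計數只是本文的粗略估計,算出的結果約為 50 萬個示例,與下一段正文的說法一致。

```python
# Illustrative back-of-the-envelope check of n >= (1/eps)*(ln|H| + ln(1/delta)).
# The hypothesis-space count below is this article's rough estimate, not the paper's.
import math

eps = delta = 0.01
# A complete binary tree of depth 10: 2**10 - 1 internal nodes each choosing one of
# 100 features, and 2**10 leaves each labeled with one of 2 classes (crude estimate).
ln_H = (2**10 - 1) * math.log(100) + (2**10) * math.log(2)
n = (ln_H + math.log(1 / delta)) / eps
print(f"ln|H| ~ {ln_H:.0f},  required n ~ {n:,.0f} examples")   # on the order of half a million
```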
| Unfortunately, guarantees of this type have to be taken with a large grain of salt. This is because the bounds obtained in this way are usually extremely loose. The wonderful feature of the bound above is that the required number of examples only grows logarithmically with |H| and 1/δ. Unfortunately, most interesting hypothesis spaces are doubly exponential in the number of features d, which still leaves us needing a number of examples exponential in d. For example, consider the space of Boolean functions of d Boolean variables. If there are e possible different examples, there are 2^e possible different functions, so since there are 2^d possible examples, the total number of functions is 2^(2^d). And even for hypothesis spaces that are "merely" exponential, the bound is still very loose, because the union bound is very pessimistic. For example, if there are 100 Boolean features and the hypothesis space is decision trees with up to 10 levels, to guarantee δ = ε = 1% in the bound above we need half a million examples. But in practice a small fraction of this suffices for accurate learning. Further, we have to be careful about what a bound like this means. For instance, it does not say that, if your learner returned a hypothesis consistent with a particular training set, then this hypothesis probably generalizes well. What it says is that, given a large enough training set, with high probability your learner will either return a hypothesis that generalizes well or be unable to find a consistent hypothesis. The bound also says nothing about how to select a good hypothesis space. It only tells us that, if the hypothesis space contains the true classifier, then the probability that the learner outputs a bad classifier decreases with training set size. If we shrink the hypothesis space, the bound improves, but the chances that it contains the true classifier shrink also. (There are bounds for the case where the true classifier is not in the hypothesis space, but similar considerations apply to them.) Another common type of theoretical guarantee is asymptotic: given infinite data, the learner is guaranteed to output the correct classifier. This is reassuring, but it would be rash to choose one learner over another because of its asymptotic guarantees. In practice, we are seldom in the asymptotic regime (also known as "asymptopia"). And, because of the bias-variance trade-off I discussed earlier, if learner A is better than learner B given infinite data, B is often better than A given finite data. The main role of theoretical guarantees in machine learning is not as a criterion for practical decisions, but as a source of understanding and driving force for algorithm design. In this capacity, they are quite useful; indeed, the close interplay of theory and practice is one of the main reasons machine learning has made so much progress over the years. But caveat emptor: learning is a complex phenomenon, and just because a learner has a theoretical justification and works in practice does not mean the former is the reason for the latter. | 不幸的是,對這類保證必須抱有很大的保留。這是因為用這種方式得到的界通常極其寬鬆。上面這個界的美妙之處在於,所需的示例數量只隨 |H| 和 1/δ 對數增長。不幸的是,大多數有趣的假設空間的大小是特徵數 d 的雙重指數,這仍然使我們需要的示例數量是 d 的指數。例如,考慮 d 個布爾變量的布爾函數空間。如果有 e 個可能的不同示例,就有 2^e 個可能的不同函數;由於可能的示例有 2^d 個,函數總數就是 2^(2^d)。即使對於“僅僅”是指數大小的假設空間,這個界仍然非常寬鬆,因為聯合界非常悲觀。例如,如果有 100 個布爾特徵,假設空間是深度不超過 10 層的決策樹,要在上面的界中保證 δ = ε = 1%,我們需要 50 萬個示例。但在實踐中,只需其中一小部分就足以準確地學習。 此外,我們必須小心這種界的含義。例如,它並不是說,如果您的學習器返回了一個與特定訓練集一致的假設,那麼這個假設就很可能泛化得好。它說的是:給定足夠大的訓練集,您的學習器很有可能要麼返回一個泛化良好的假設,要麼找不到一致的假設。這個界也沒有說如何選擇一個好的假設空間。它只告訴我們,如果假設空間包含真實分類器,那麼學習器輸出壞分類器的概率會隨訓練集的增大而降低。如果縮小假設空間,這個界會變好,但它包含真實分類器的機會也會變小。(對於真實分類器不在假設空間中的情形也有相應的界,但類似的考慮同樣適用。) 另一類常見的理論保證是漸近的:給定無限數據,保證學習器輸出正確的分類器。這令人安心,但僅僅因為漸近保證就選擇一個學習器而不是另一個,是草率的。在實踐中,我們很少處於漸近狀態(也被稱為 "asymptopia")。而且,由於前面討論過的偏差-方差權衡,如果在無限數據下學習器 A 優於學習器 B,那麼在有限數據下 B 往往優於 A。 理論保證在機器學習中的主要作用,不是作為實際決策的標準,而是作為理解的來源和算法設計的驅動力。就這一點而言,它們非常有用;事實上,理論與實踐的緊密互動正是機器學習多年來取得如此大進展的主要原因之一。但買者自慎:學習是一種複雜的現象,一個學習器既有理論依據又在實踐中有效,並不意味著前者是後者的原因。 |
Feature Engineering Is The Key 特徵工程是關鍵
| A dumb algorithm with lots and lots of data beats a clever one with modest amounts of it. | 擁有海量數據的笨算法,勝過只有適量數據的聰明算法。 |
| At the end of the day, some machine ?learning projects succeed and some ?fail. What makes the difference? Easily ?the most important factor is the ?features used. Learning is easy if you ?have many independent features that ?each correlate well with the class. On ?the other hand, if the class is a very ?complex function of the features, you ?may not be able to learn it. Often, the ?raw data is not in a form that is amenable ?to learning, but you can construct ?features from it that are. This ?is typically where most of the effort in ?a machine learning project goes. It is ?often also one of the most interesting ?parts, where intuition, creativity and ?“black art” are as important as the ?technical stuff. ? First-timers are often surprised by ?how little time in a machine learning ?project is spent actually doing machine learning. But it makes sense if ?you consider how time-consuming it ?is to gather data, integrate it, clean it ?and preprocess it, and how much trial ?and error can go into feature design. ?Also, machine learning is not a oneshot ?process of building a dataset and ?running a learner, but rather an iterative ?process of running the learner, ?analyzing the results, modifying the ?data and/or the learner, and repeating. ?Learning is often the quickest ?part of this, but that is because we ?have already mastered it pretty well! ?Feature engineering is more difficult ?because it is domain-specific, ?while learners can be largely general ?purpose. However, there is no sharp ?frontier between the two, and this is ?another reason the most useful learners ?are those that facilitate incorporating ?knowledge. ? Of course, one of the holy grails ?of machine learning is to automate ?more and more of the feature engineering ?process. One way this is often ?done today is by automatically generating ?large numbers of candidate features ?and selecting the best by (say) ?their information gain with respect ?to the class. But bear in mind that ?features that look irrelevant in isolation ?may be relevant in combination. ?For example, if the class is an XOR of ?k input features, each of them by itself ?carries no information about the ?class. (If you want to annoy machine ?learners, bring up XOR.) On the other ?hand, running a learner with a very ?large number of features to find out ?which ones are useful in combination ?may be too time-consuming, or cause ?overfitting. So there is ultimately no ?replacement for the smarts you put ?into feature engineering. | 最終,一些機器學習項目成功了而有些失敗了。有什么區別?最重要的因素很容易就是所使用的功能。如果您具有許多與班級緊密相關的獨立功能,則學習將很容易。另一方面,如果該類是功能的非常復雜的功能,則您可能無法學習它。通常,原始數據的形式不適合學習,但您可以從中構造特征。這通常是機器學習項目中大部分工作的去向。它通常也是最有趣的部分之一,直覺,創造力和“妖術”與技術同樣重要。初學者通常會對機器學習項目中實際用于機器學習的時間很少感到驚訝。但是,如果您考慮收集數據,集成,清理和預處理數據要花多長時間,以及可以在功能設計中進行多少試驗和錯誤,這是有道理的。此外,機器學習不是構建數據集和運行學習者的一站式過程,而是運行學習者,分析結果,修改數據和/或學習者并重復的迭代過程。學習通常是其中最快的部分,但這是因為我們已經很好地掌握了它!特征工程更加困難,因為它是特定于領域的,而學習者在很大程度上可能是通用的。但是,兩者之間沒有敏銳的疆界,這是最有用的學習者是那些有助于整合知識的學習者的另一個原因。當然,機器學習的圣地之一是使越來越多的特征工程過程自動化。今天通常這樣做的一種方式是通過自動生成大量候選特征并通過(比如說)它們相對于類的信息增益來選擇最佳特征。但是請記住,孤立地看起來無關緊要的功能可能會組合在一起使用。例如,如果類別是k個輸入要素的XOR,則每個類別本身都不攜帶有關類別的信息。 (如果要惹惱機器學習者,請調出XOR。)另一方面,運行具有大量功能的學習器以找出哪些功能組合在一起可能會非常耗時,或導致過度擬合。因此,您投入功能工程的智能最終無法替代。 |
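下面的示意代碼對應上文兩點:按信息增益對候選特徵打分是常見的自動特徵選擇方式,但對 XOR 型的類別,單看任何一個輸入特徵的信息增益都接近零。數據生成方式為本示意的假設。

```python
# Illustrative sketch (not from the paper): rank features by information gain with
# respect to the class, and note the XOR caveat from the paragraph above.
import math, random
from collections import Counter

def entropy(ys):
    n = len(ys)
    return -sum(c / n * math.log2(c / n) for c in Counter(ys).values()) if n else 0.0

def information_gain(xs, ys):
    gain = entropy(ys)
    for v in set(xs):
        sub = [y for x, y in zip(xs, ys) if x == v]
        gain -= len(sub) / len(ys) * entropy(sub)
    return gain

rng = random.Random(0)
data = [[rng.randint(0, 1) for _ in range(3)] for _ in range(1000)]
y_or  = [x[0] | x[1] for x in data]            # each input feature is informative on its own
y_xor = [x[0] ^ x[1] for x in data]            # informative only in combination

for name, y in [("OR class ", y_or), ("XOR class", y_xor)]:
    gains = [information_gain([x[j] for x in data], y) for j in range(3)]
    print(name, "info gain per feature:", [f"{g:.3f}" for g in gains])
```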
More Data Beats a Cleverer Algorithm 更多數據勝過更聰明的算法
| Suppose you have constructed the best set of features you can, but the classifiers you receive are still not accurate enough. What can you do now? There are two main choices: design a better learning algorithm, or gather more data (more examples, and possibly more raw features, subject to the curse of dimensionality). Machine learning researchers are mainly concerned with the former, but pragmatically the quickest path to success is often to just get more data. As a rule of thumb, a dumb algorithm with lots and lots of data beats a clever one with modest amounts of it. (After all, machine learning is all about letting data do the heavy lifting.) This does bring up another problem, however: scalability. In most of computer science, the two main limited resources are time and memory. In machine learning, there is a third one: training data. Which one is the bottleneck has changed from decade to decade. In the 1980s it tended to be data. Today it is often time. Enormous mountains of data are available, but there is not enough time to process it, so it goes unused. This leads to a paradox: even though in principle more data means that more complex classifiers can be learned, in practice simpler classifiers wind up being used, because complex ones take too long to learn. Part of the answer is to come up with fast ways to learn complex classifiers, and indeed there has been remarkable progress in this direction (for example, Hulten and Domingos [11]). Part of the reason using cleverer algorithms has a smaller payoff than you might expect is that, to a first approximation, they all do the same. This is surprising when you consider representations as different as, say, sets of rules and neural networks. But in fact propositional rules are readily encoded as neural networks, and similar relationships hold between other representations. All learners essentially work by grouping nearby examples into the same class; the key difference is in the meaning of "nearby." With nonuniformly distributed data, learners can produce widely different frontiers while still making the same predictions in the regions that matter (those with a substantial number of training examples, and therefore also where most test examples are likely to appear). This also helps explain why powerful learners can be unstable but still accurate. Figure 3 illustrates this in 2D; the effect is much stronger in high dimensions. | 假設您已經構建了您所能構建的最好的特徵集,但得到的分類器仍然不夠準確。現在該怎麼辦?有兩個主要選擇:設計更好的學習算法,或者收集更多數據(更多示例,可能還有更多原始特徵,但要受維度災難的限制)。機器學習研究者主要關心前者,但從務實的角度看,通往成功的最快路徑往往就是獲取更多數據。根據經驗法則,擁有海量數據的笨算法勝過只有適量數據的聰明算法。(畢竟,機器學習的要義就是讓數據來承擔繁重的工作。) 不過,這確實帶來了另一個問題:可擴展性。在計算機科學的大多數領域,兩種主要的有限資源是時間和內存;而在機器學習中,還有第三種:訓練數據。哪一種是瓶頸,每隔十年都在變化。在 1980 年代,瓶頸往往是數據;今天則常常是時間。海量數據唾手可得,卻沒有足夠的時間去處理,於是只能閒置。這導致一個悖論:雖然原則上更多數據意味著可以學習更複雜的分類器,但實踐中最終使用的往往是更簡單的分類器,因為複雜的分類器學習時間太長。部分答案在於想出學習複雜分類器的快速方法,而這個方向上確實已有顯著進展(例如 Hulten 和 Domingos [11])。 使用更聰明的算法的回報比您預期的要小,部分原因在於:粗略地看,它們做的都是同一件事。當您想到規則集和神經網絡這樣差異巨大的表示時,這一點令人吃驚。但事實上,命題規則很容易被編碼為神經網絡,其他表示之間也存在類似的關係。所有學習器本質上都是把鄰近的示例歸入同一類;關鍵區別在於“鄰近”的含義。對於分佈不均勻的數據,不同學習器可以產生截然不同的邊界,同時在真正重要的區域(即訓練示例較多、因而大多數測試示例也可能出現的區域)做出相同的預測。這也有助於解釋為什麼強大的學習器可能不穩定卻仍然準確。圖 3 在二維中說明了這一點;這種效應在高維中要強得多。 |
| As a rule, it pays to try the simplest learners first (for example, naive Bayes before logistic regression, k-nearest neighbor before support vector machines). More sophisticated learners are seductive, but they are usually harder to use, because they have more knobs you need to turn to get good results, and because their internals are more opaque. Learners can be divided into two major types: those whose representation has a fixed size, like linear classifiers, and those whose representation can grow with the data, like decision trees. (The latter are sometimes called nonparametric learners, but this is somewhat unfortunate, since they usually wind up learning many more parameters than parametric ones.) Fixed-size learners can only take advantage of so much data. (Notice how the accuracy of naive Bayes asymptotes at around 70% in Figure 2.) Variable-size learners can in principle learn any function given sufficient data, but in practice they may not, because of limitations of the algorithm (for example, greedy search falls into local optima) or computational cost. Also, because of the curse of dimensionality, no existing amount of data may be enough. For these reasons, clever algorithms (those that make the most of the data and computing resources available) often pay off in the end, provided you are willing to put in the effort. There is no sharp frontier between designing learners and learning classifiers; rather, any given piece of knowledge could be encoded in the learner or learned from data. So machine learning projects often wind up having a significant component of learner design, and practitioners need to have some expertise in it [12]. In the end, the biggest bottleneck is not data or CPU cycles, but human cycles. In research papers, learners are typically compared on measures of accuracy and computational cost. But human effort saved and insight gained, although harder to measure, are often more important. This favors learners that produce human-understandable output (for example, rule sets). And the organizations that make the most of machine learning are those that have in place an infrastructure that makes experimenting with many different learners, data sources, and learning problems easy and efficient, and where there is a close collaboration between machine learning experts and application domain ones. | 一般來說,先嘗試最簡單的學習器是值得的(例如,先試樸素貝葉斯再試邏輯回歸,先試 k 近鄰再試支持向量機)。更複雜的學習器很有誘惑力,但它們通常更難使用,因為要獲得好結果需要調的旋鈕更多,而且它們的內部機制更不透明。 學習器可以分為兩大類:表示大小固定的,如線性分類器;表示可以隨數據增長的,如決策樹。(後者有時被稱為非參數學習器,但這個叫法有些不幸,因為它們最終學到的參數通常比參數學習器多得多。)大小固定的學習器只能利用有限的數據。(請注意圖 2 中樸素貝葉斯的準確率如何在 70% 左右漸近。)可變大小的學習器原則上在給定足夠數據時可以學習任何函數,但實踐中可能做不到,原因是算法的局限(例如貪心搜索陷入局部最優)或計算成本。而且,由於維度災難,現有的任何數據量都可能不夠。基於這些原因,只要您願意投入精力,聰明的算法(即充分利用可用數據和計算資源的算法)最終往往會有回報。設計學習器和學習分類器之間並沒有明確的界限;任何一條給定的知識既可以編碼進學習器,也可以從數據中學到。因此,機器學習項目最終往往包含相當大的學習器設計成分,從業者需要在這方面具備一定的專業知識 [12]。 歸根結底,最大的瓶頸不是數據或 CPU 週期,而是人力週期。在研究論文中,學習器通常按準確率和計算成本來比較。但節省的人力和獲得的洞見雖然更難度量,卻往往更重要。這有利於能產生人類可理解輸出(例如規則集)的學習器。而最能發揮機器學習價值的組織,是那些建立了相應基礎設施、使得對許多不同的學習器、數據源和學習問題進行實驗既容易又高效的組織,也是機器學習專家與應用領域專家密切合作的組織。 |
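呼應“先試最簡單的學習器”這一經驗,下面給出一個極簡的伯努利樸素貝葉斯示意實現(拉普拉斯平滑、玩具數據均為本文假設,非論文內容)。

```python
# Illustrative sketch (not from the paper): a minimal Bernoulli naive Bayes,
# in the spirit of "try the simplest learners first."
import math
import random

def train_nb(X, y):
    classes = set(y)
    prior = {c: sum(1 for yi in y if yi == c) / len(y) for c in classes}
    d = len(X[0])
    cond = {}
    for c in classes:
        Xc = [x for x, yi in zip(X, y) if yi == c]
        cond[c] = [(sum(x[j] for x in Xc) + 1) / (len(Xc) + 2) for j in range(d)]  # Laplace smoothing
    return prior, cond

def predict_nb(model, x):
    prior, cond = model
    def log_post(c):
        return math.log(prior[c]) + sum(
            math.log(cond[c][j] if x[j] else 1 - cond[c][j]) for j in range(len(x)))
    return max(prior, key=log_post)

if __name__ == "__main__":
    rng = random.Random(0)
    X = [[rng.randint(0, 1) for _ in range(5)] for _ in range(500)]
    y = [int(x[0] or x[1]) for x in X]                  # toy target
    model = train_nb(X[:400], y[:400])
    test_X, test_y = X[400:], y[400:]
    acc = sum(predict_nb(model, x) == t for x, t in zip(test_X, test_y)) / len(test_y)
    print("held-out accuracy:", acc)
```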
Learn Many Models, Not Just One 學習多個模型,而不僅僅是一個
| In the early days of machine learning, ?everyone had a favorite learner, ?together with some a priori reasons ?to believe in its superiority. Most effort ?went into trying many variations ?of it and selecting the best one. Then ?systematic empirical comparisons ?showed that the best learner varies ?from application to application, and ?systems containing many different ?learners started to appear. Effort now ?went into trying many variations of ?many learners, and still selecting just ?the best one. But then researchers ?noticed that, if instead of selecting ?the best variation found, we combine ?many variations, the results are better—often ?much better—and at little ?extra effort for the user. ? Creating such model ensembles is ?now standard.1 ? In the simplest technique, ?called bagging, we simply generate ?random variations of the training ?set by resampling, learn a classifier ?on each, and combine the results by ?voting. This works because it greatly ?reduces variance while only slightly ?increasing bias. In boosting, training ?examples have weights, and these are ?varied so that each new classifier focuses ?on the examples the previous ?ones tended to get wrong. In stacking, ?the outputs of individual classifiers ?become the inputs of a “higher-level” ?learner that figures out how best to ?combine them.? | 在機器學習的早期,每個人都有一個喜歡的學習者,加上一些先驗的理由相信它的優越性。最努力的嘗試是嘗試它的多種變體并選擇最佳的一種。然后系統的經驗比較表明,最佳學習者因應用程序而異,并且包含許多不同學習者的系統開始出現。現在,我們努力嘗試許多學習者的許多變體,但仍然只選擇最好的一個。但是隨后研究人員注意到,如果我們不選擇發現的最佳變體,而是結合許多變體,則結果會更好-通常更好得多-并且對用戶來說幾乎沒有額外的精力。 創建這樣的模型集成現在是標準的.1。在最簡單的技術(稱為裝袋)中,我們只需通過重新采樣就可以生成訓練集的隨機變化,在每個學習一個分類器,然后通過投票合并結果。之所以行之有效,是因為它大大減少了方差,而偏差卻稍有增加。在增強方面,訓練示例具有權重,并且這些權重是可變的,因此每個新分類器都將重點放在示例上,而先前的那些往往會出錯。在堆疊中,單個分類器的輸出成為一個``高級''學習器的輸入,該學習器找出了如何最好地組合它們。 |
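下面是 bagging 的一個極簡示意(基學習器採用單特徵“決策樁”,數據為玩具數據,均為本文假設):對訓練集做有放回重採樣,在每個重採樣集上各學一個分類器,最後用多數投票合併預測。

```python
# Illustrative sketch (not from the paper): bagging = resample the training set
# with replacement, learn one classifier per resample, combine by majority vote.
import random
from collections import Counter

def fit_stump(data):
    """Base learner: predict x[j] or its negation, whichever has the lowest training error."""
    candidates = [(sum((x[j] ^ flip) != y for x, y in data), j, flip)
                  for j in range(len(data[0][0])) for flip in (0, 1)]
    _, j, flip = min(candidates)
    return lambda x, j=j, flip=flip: x[j] ^ flip

def bag(data, n_models=25, seed=0):
    rng = random.Random(seed)
    models = [fit_stump([rng.choice(data) for _ in data]) for _ in range(n_models)]
    return lambda x: Counter(m(x) for m in models).most_common(1)[0][0]   # majority vote

if __name__ == "__main__":
    rng = random.Random(1)
    d = 6
    make = lambda: [rng.randint(0, 1) for _ in range(d)]
    data = [(x, x[2] ^ (rng.random() < 0.2)) for x in (make() for _ in range(100))]  # y = x[2] + 20% label noise
    ensemble = bag(data)
    test = [make() for _ in range(10)]
    print([(x[2], ensemble(x)) for x in test])     # the vote mostly recovers the underlying rule
```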
| Many other techniques exist, and the trend is toward larger and larger ensembles. In the Netflix prize, teams from all over the world competed to build the best video recommender system (http://netflixprize.com). As the competition progressed, teams found they obtained the best results by combining their learners with other teams', and merged into larger and larger teams. The winner and runner-up were both stacked ensembles of over 100 learners, and combining the two ensembles further improved the results. Doubtless we will see even larger ones in the future. Model ensembles should not be confused with Bayesian model averaging (BMA), the theoretically optimal approach to learning [4]. In BMA, predictions on new examples are made by averaging the individual predictions of all classifiers in the hypothesis space, weighted by how well the classifiers explain the training data and how much we believe in them a priori. Despite their superficial similarities, ensembles and BMA are very different. Ensembles change the hypothesis space (for example, from single decision trees to linear combinations of them), and can take a wide variety of forms. BMA assigns weights to the hypotheses in the original space according to a fixed formula. BMA weights are extremely different from those produced by (say) bagging or boosting: the latter are fairly even, while the former are extremely skewed, to the point where the single highest-weight classifier usually dominates, making BMA effectively equivalent to just selecting it [8]. A practical consequence of this is that, while model ensembles are a key part of the machine learning toolkit, BMA is seldom worth the trouble. | 還存在許多其他技術,而且趨勢是集成的規模越來越大。在 Netflix 大獎賽中,來自世界各地的團隊競相構建最好的視頻推薦系統(http://netflixprize.com)。隨著比賽的進行,各團隊發現把自己的學習器與其他團隊的學習器結合起來能取得最好的成績,於是合併成越來越大的團隊。冠軍和亞軍都是由 100 多個學習器堆疊(stacking)而成的集成,而把這兩個集成再結合起來又進一步提高了成績。毫無疑問,我們將來會看到更大的集成。 模型集成不應與貝葉斯模型平均(BMA)混淆,後者是理論上最優的學習方法 [4]。在 BMA 中,對新示例的預測是通過對假設空間中所有分類器的各自預測取平均得到的,權重取決於分類器對訓練數據的解釋程度以及我們對它們的先驗信任程度。儘管表面上相似,集成和 BMA 卻非常不同。集成改變了假設空間(例如,從單棵決策樹變為它們的線性組合),並且可以採取多種多樣的形式;BMA 則按照固定公式為原始空間中的假設分配權重。BMA 的權重與(比如)bagging 或 boosting 產生的權重截然不同:後者相當均勻,前者則極度傾斜,以至於權重最高的那個分類器通常佔主導地位,使得 BMA 實際上等同於只選擇它 [8]。這帶來的一個實際結論是:雖然模型集成是機器學習工具箱的關鍵部分,但 BMA 很少值得費那個勁。 |
Simplicity Does Not Imply Accuracy 簡單並不意味著準確
| Just because a function can be represented does not mean it can be learned. | 僅僅因為一個函數可以被表示,並不意味著它可以被學到。 |
| Occam's razor famously states that entities should not be multiplied beyond necessity. In machine learning, this is often taken to mean that, given two classifiers with the same training error, the simpler of the two will likely have the lowest test error. Purported proofs of this claim appear regularly in the literature, but in fact there are many counterexamples to it, and the "no free lunch" theorems imply it cannot be true. We saw one counterexample previously: model ensembles. The generalization error of a boosted ensemble continues to improve by adding classifiers even after the training error has reached zero. Another counterexample is support vector machines, which can effectively have an infinite number of parameters without overfitting. Conversely, the function sign(sin(ax)) can discriminate an arbitrarily large, arbitrarily labeled set of points on the x axis, even though it has only one parameter [23]. Thus, contrary to intuition, there is no necessary connection between the number of parameters of a model and its tendency to overfit. | 奧卡姆剃刀有一句名言:如無必要,勿增實體。在機器學習中,這常常被理解為:給定兩個訓練誤差相同的分類器,其中較簡單的那個很可能測試誤差最低。聲稱證明了這一說法的文章在文獻中屢見不鮮,但實際上它有許多反例,而且“沒有免費的午餐”定理表明它不可能成立。 我們前面已經看到一個反例:模型集成。即使在訓練誤差已經降到零之後,boosting 集成的泛化誤差仍會隨著分類器的加入而繼續改善。另一個反例是支持向量機,它實際上可以擁有無限多個參數而不過擬合。反過來,函數 sign(sin(ax)) 只有一個參數,卻可以區分 x 軸上任意大、任意標注的點集 [23]。因此,與直覺相反,模型的參數數量與其過擬合傾向之間沒有必然聯繫。 |
| A more sophisticated view instead ?equates complexity with the size of ?the hypothesis space, on the basis that ?smaller spaces allow hypotheses to be ?represented by shorter codes. Bounds ?like the one in the section on theoretical ?guarantees might then be viewed ?as implying that shorter hypotheses ?generalize better. This can be further ?refined by assigning shorter codes to ?the hypotheses in the space we have ?some a priori preference for. But ?viewing this as “proof” of a trade-off ?between accuracy and simplicity is ?circular reasoning: we made the hypotheses ?we prefer simpler by design, ?and if they are accurate it is because ?our preferences are accurate, not because ?the hypotheses are “simple” in ?the representation we chose. ? A further complication arises from ?the fact that few learners search their ?hypothesis space exhaustively. A ?learner with a larger hypothesis space ?that tries fewer hypotheses from it ?is less likely to overfit than one that ?tries more hypotheses from a smaller ?space. As Pearl18 points out, the size of ?the hypothesis space is only a rough ?guide to what really matters for relating ?training and test error: the procedure ?by which a hypothesis is chosen. ? Domingos7 ? surveys the main arguments ?and evidence on the issue of ?Occam’s razor in machine learning. ?The conclusion is that simpler hypotheses ?should be preferred because ?simplicity is a virtue in its own right, ?not because of a hypothetical connection ?with accuracy. This is probably ?what Occam meant in the first place. | 相反,一個更復雜的視圖將復雜度與假設空間的大小等同起來,其依據是較小的空間允許用較短的代碼表示假設。像理論保證一節中所述的界限可能會被認為暗示著較短的假設通常會更好。可以通過為我們具有某些先驗偏好的空間中的假設分配較短的代碼來進一步完善。但是,將其視為在準確性和簡單性之間進行權衡的``證明''是循環推理:我們通過設計使假設變得更簡單,如果假設是準確的,那是因為我們的偏好是準確的,而不是因為假設是``簡單的''在我們選擇的表示形式中。 更為復雜的是由于幾乎沒有學習者詳盡地搜索其假設空間這一事實。擁有較大假設空間的學習者從中嘗試較少的假設的可能性比從較小空間嘗試更多假設的學習者的過擬合可能性小。正如Pearl18所指出的那樣,假設空間的大小僅是對與訓練和測試誤差有關的真正重要性的粗略指導:選擇假設的過程。 Domingos7調查了有關機器學習中Occam剃刀問題的主要論點和證據。結論是,應采用更簡單的假設,因為簡單性本身就是一種美德,而不是因為假設與準確性之間的聯系。這可能是Occam首先的意思。 |
Representable Does Not Imply Learnable 可表示並不意味著可學習
| Essentially all representations used in variable-size learners have associated theorems of the form "Every function can be represented, or approximated arbitrarily closely, using this representation." Reassured by this, fans of the representation often proceed to ignore all others. However, just because a function can be represented does not mean it can be learned. For example, standard decision tree learners cannot learn trees with more leaves than there are training examples. In continuous spaces, representing even simple functions using a fixed set of primitives often requires an infinite number of components. Further, if the hypothesis space has many local optima of the evaluation function, as is often the case, the learner may not find the true function even if it is representable. Given finite data, time and memory, standard learners can learn only a tiny subset of all possible functions, and these subsets are different for learners with different representations. Therefore the key question is not "Can it be represented?" to which the answer is often trivial, but "Can it be learned?" And it pays to try different learners (and possibly combine them). Some representations are exponentially more compact than others for some functions. As a result, they may also require exponentially less data to learn those functions. Many learners work by forming linear combinations of simple basis functions. For example, support vector machines form combinations of kernels centered at some of the training examples (the support vectors). Representing parity of n bits in this way requires 2^n basis functions. But using a representation with more layers (that is, more steps between input and output), parity can be encoded in a linear-size classifier. Finding methods to learn these deeper representations is one of the major research frontiers in machine learning [2]. | 基本上,可變大小學習器所使用的所有表示都有相應的定理,其形式是“任何函數都可以用這種表示來表示,或被任意精確地逼近”。有了這個保證,這種表示的擁躉往往就對其他表示視而不見。然而,一個函數可以被表示,並不意味著它可以被學到。例如,標準的決策樹學習器無法學到葉子數多於訓練示例數的樹。在連續空間中,用一組固定的基元表示哪怕是簡單的函數,往往也需要無限多個分量。此外,如果假設空間裏評估函數有許多局部最優(這是常有的情況),那麼即使真實函數可以被表示,學習器也可能找不到它。給定有限的數據、時間和內存,標準學習器只能學到所有可能函數中的一個很小的子集,而且對於採用不同表示的學習器,這些子集是不同的。因此,關鍵問題不是“它能否被表示?”(這個問題的答案往往是平凡的),而是“它能否被學到?”所以嘗試不同的學習器(並可能把它們結合起來)是值得的。 對於某些函數,某些表示比其他表示緊湊得多,可以達到指數級的差別。相應地,它們學習這些函數所需的數據也可能呈指數級減少。許多學習器的工作方式是構造簡單基函數的線性組合。例如,支持向量機構造以某些訓練示例(支持向量)為中心的核函數的組合。用這種方式表示 n 位的奇偶函數(parity)需要 2^n 個基函數;但如果使用層數更多的表示(即輸入和輸出之間有更多步驟),奇偶函數就可以被編碼成一個線性大小的分類器。尋找學習這類更深表示的方法,是機器學習的主要研究前沿之一 [2]。 |
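為了把“淺表示與深表示的規模差距”說得更具體,下面的示意代碼(非論文內容)對比了 n 位奇偶函數的兩種寫法:按奇數個 1 的輸入模式逐一枚舉需要 2^(n−1) 項,而逐位異或的“更深”寫法只需 n − 1 步。

```python
# Illustrative sketch (not from the paper): a "flat" representation of n-bit parity
# (one term per odd-parity input pattern, 2**(n-1) of them) versus a "deeper" one
# (a chain of n-1 pairwise XOR steps between input and output).
from itertools import product

def parity_flat(bits, n):
    """Flat representation: enumerate every odd-parity pattern -> 2**(n-1) terms."""
    odd_patterns = [p for p in product((0, 1), repeat=n) if sum(p) % 2 == 1]
    return int(tuple(bits) in odd_patterns), len(odd_patterns)

def parity_deep(bits):
    """Deeper representation: fold the bits with n-1 two-input XOR steps."""
    acc, steps = bits[0], 0
    for b in bits[1:]:
        acc ^= b
        steps += 1
    return acc, steps

n = 10
for bits in [(1,) * n, tuple(i % 2 for i in range(n))]:
    flat, n_terms = parity_flat(bits, n)
    deep, n_steps = parity_deep(bits)
    print(bits, "parity:", flat, deep, "| flat terms:", n_terms, "| deep steps:", n_steps)
```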
Correlation Does Not Imply Causation 相關並不意味著因果
| The point that correlation does not imply causation is made so often that it is perhaps not worth belaboring. But, even though learners of the kind we have been discussing can only learn correlations, their results are often treated as representing causal relations. Isn't this wrong? If so, then why do people do it? More often than not, the goal of learning predictive models is to use them as guides to action. If we find that beer and diapers are often bought together at the supermarket, then perhaps putting beer next to the diaper section will increase sales. (This is a famous example in the world of data mining.) But short of actually doing the experiment it is difficult to tell. Machine learning is usually applied to observational data, where the predictive variables are not under the control of the learner, as opposed to experimental data, where they are. Some learning algorithms can potentially extract causal information from observational data, but their applicability is rather restricted [19]. On the other hand, correlation is a sign of a potential causal connection, and we can use it as a guide to further investigation (for example, trying to understand what the causal chain might be). | “相關不蘊含因果”這一點被講得太多,也許已不值得再贅述。但是,即使我們一直在討論的這類學習器只能學到相關性,它們的結果卻常常被當作因果關係來對待。這難道不是錯的嗎?如果是錯的,人們為什麼還要這樣做? 學習預測模型的目標往往是把它們用作行動的指南。如果我們發現超市裏啤酒和尿布經常被一起購買,那麼把啤酒擺到尿布貨架旁邊也許會增加銷量。(這是數據挖掘界的一個著名例子。)但不真正做實驗就很難判斷。機器學習通常應用於觀察數據,其中預測變量不受學習器控制;而在實驗數據中,預測變量是受控制的。一些學習算法有可能從觀察數據中提取因果信息,但它們的適用範圍相當有限 [19]。另一方面,相關性是潛在因果聯繫的一個跡象,我們可以把它作為進一步研究的指引(例如,試圖弄清因果鏈可能是什麼)。 |
| Many researchers believe that causality is only a convenient fiction. For example, there is no notion of causality in physical laws. Whether or not causality really exists is a deep philosophical question with no definitive answer in sight, but there are two practical points for machine learners. First, whether or not we call them "causal," we would like to predict the effects of our actions, not just correlations between observable variables. Second, if you can obtain experimental data (for example by randomly assigning visitors to different versions of a Web site), then by all means do so [14]. | 許多研究者認為,因果性只是一種方便的虛構。例如,物理定律中並沒有因果性的概念。因果性是否真的存在是一個深刻的哲學問題,眼下還看不到確定的答案,但對機器學習者來說有兩點實用的啟示。第一,無論我們是否把它們稱為“因果的”,我們都希望預測自己行動的效果,而不僅僅是可觀察變量之間的相關性。第二,如果您能獲得實驗數據(例如,把訪問者隨機分配到網站的不同版本),那就務必這樣做 [14]。 |
Conclusion 結論
| Like any discipline, machine learning has a lot of "folk wisdom" that can be difficult to come by, but is crucial for success. This article summarized some of the most salient items. Of course, it is only a complement to the more conventional study of machine learning. Check out http://www.cs.washington.edu/homes/pedrod/class for a complete online machine learning course that combines formal and informal aspects. There is also a treasure trove of machine learning lectures at http://www.videolectures.net. A good open source machine learning toolkit is Weka [24]. Happy learning! | 像任何學科一樣,機器學習有許多不易獲得、卻對成功至關重要的“民間智慧”。本文總結了其中一些最突出的條目。當然,它只是對更常規的機器學習學習方式的補充。可以訪問 http://www.cs.washington.edu/homes/pedrod/class 獲取一門完整的在線機器學習課程,它兼顧了正式與非正式兩方面的內容。http://www.videolectures.net 上還有大量機器學習講座。Weka [24] 是一個很好的開源機器學習工具包。祝學習愉快! |