Why More Data Is Not Always Better
Over the past few years, there has been a growing consensus that the more data one has, the better the eventual analysis will be.
However, just as humans can become overwhelmed by too much information, so can machine learning models.
Hotel Cancellations as an Example
I was thinking about this issue recently when reflecting on a side project I have been working on for the past year: predicting hotel cancellations with machine learning.
After writing numerous articles about the topic on Medium, it has become clear to me that the landscape for the hospitality industry has changed fundamentally in the past year.
The growing emphasis on “staycations”, or local holidays, fundamentally changes the assumptions that any machine learning model should make when predicting hotel cancellations.
The original data from Antonio, Almeida and Nunes (2016) consists of datasets from Portuguese hotels, with a response variable indicating whether the customer cancelled their booking, along with other information on that customer such as country of origin and market segment.
In the two datasets in question, approximately 55-60% of all customers were international customers.
However, let us assume the following scenario for a moment: this time next year, hotel occupancy is back to normal levels, but the vast majority of customers are domestic, in this case from Portugal. For the purposes of this example, let us take the extreme case in which 100% of customers are domestic.
Such an assumption will radically affect the ability of any previously trained model to accurately forecast cancellations. Let’s take an example.
Classification Using an SVM Model
An SVM model was originally used to predict hotel cancellations: the model was trained on one dataset (H1), and its predictions were then compared against a test set (H2) using the feature data from that test set. The response variable is categorical (1 = booking cancelled by the customer, 0 = booking not cancelled by the customer).
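As a rough sketch, the setup looks something like the following. The H1.csv/H2.csv file names and the IsCanceled response column follow the Antonio, Almeida and Nunes datasets, but the feature subset and SVM settings here are purely illustrative and not necessarily the ones used in the project.

```python
import pandas as pd
from sklearn.svm import SVC
from sklearn.metrics import classification_report, confusion_matrix

# Illustrative numeric feature subset; the actual project uses its own selection.
features = ["LeadTime", "ADR", "RequiredCarParkingSpaces"]

h1 = pd.read_csv("H1.csv")  # training set (first hotel)
h2 = pd.read_csv("H2.csv")  # test set (second hotel)

# Train on H1, predict on H2's feature data, compare against H2's actual outcomes.
model = SVC(class_weight="balanced")
model.fit(h1[features], h1["IsCanceled"])
predictions = model.predict(h2[features])

print(confusion_matrix(h2["IsCanceled"], predictions))
print(classification_report(h2["IsCanceled"], predictions))
```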
Here are the results across three different scenarios, displayed as a confusion matrix and classification report for each.
Scenario 1: Trained on H1 (full dataset), tested on H2 (full dataset)
```
[[25217 21011]
 [ 8436 24666]]

              precision    recall  f1-score   support

           0       0.75      0.55      0.63     46228
           1       0.54      0.75      0.63     33102

    accuracy                           0.63     79330
   macro avg       0.64      0.65      0.63     79330
weighted avg       0.66      0.63      0.63     79330
```
Overall accuracy comes in at 63%, while recall for the positive class (cancellations) comes in at 75%. To clarify, recall in this instance means that, of all the actual cancellation incidences, the model correctly identifies 75% of them.
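This can be read directly off the confusion matrix above, where the second row corresponds to actual cancellations: recall = TP / (TP + FN) = 24666 / (24666 + 8436) = 24666 / 33102 ≈ 0.75.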
Now let’s see what happens when we train the SVM model on the full training set, but only include domestic customers from Portugal in our test set.
Scenario 2: Trained on H1 (full dataset), tested on H2 (domestic only)
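Continuing the sketch from earlier, this scenario only changes the evaluation step: the test set is filtered down to domestic bookings before prediction. The Country column and the "PRT" ISO code for Portugal are assumptions based on the original datasets.

```python
# Scenario 2: same model trained on the full H1, evaluated only on
# Portuguese (domestic) bookings from H2.
h2_domestic = h2[h2["Country"] == "PRT"]

predictions_domestic = model.predict(h2_domestic[features])
print(confusion_matrix(h2_domestic["IsCanceled"], predictions_domestic))
print(classification_report(h2_domestic["IsCanceled"], predictions_domestic))
```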
```
[[10879     0]
 [20081     0]]

              precision    recall  f1-score   support

           0       0.35      1.00      0.52     10879
           1       0.00      0.00      0.00     20081

    accuracy                           0.35     30960
   macro avg       0.18      0.50      0.26     30960
weighted avg       0.12      0.35      0.18     30960
```
Accuracy has dropped dramatically to 35%, while recall for the cancellation class has dropped to 0%, meaning the model fails to predict a single one of the cancellation incidences in the test set. Performance in this instance is clearly very poor.
Scenario 3: Trained on H1 (domestic only), tested on H2 (domestic only)
However, what if the training set were modified to include only customers from Portugal, and the model trained once again?
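A sketch of this retraining step, continuing the snippets above (the same caveats about column names apply):

```python
# Scenario 3: retrain on domestic-only bookings from H1, then evaluate
# on the same domestic-only H2 test set used in Scenario 2.
h1_domestic = h1[h1["Country"] == "PRT"]

model_domestic = SVC(class_weight="balanced")
model_domestic.fit(h1_domestic[features], h1_domestic["IsCanceled"])

predictions_retrained = model_domestic.predict(h2_domestic[features])
print(confusion_matrix(h2_domestic["IsCanceled"], predictions_retrained))
print(classification_report(h2_domestic["IsCanceled"], predictions_retrained))
```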
```
[[ 8274  2605]
 [ 6240 13841]]

              precision    recall  f1-score   support

           0       0.57      0.76      0.65     10879
           1       0.84      0.69      0.76     20081

    accuracy                           0.71     30960
   macro avg       0.71      0.72      0.70     30960
weighted avg       0.75      0.71      0.72     30960
```
Accuracy is back up to 71%, while recall for the cancellation class stands at 69%. Using less, but more relevant, data in the training set allowed the SVM model to predict cancellations across the test set much more accurately.
If the Data Is Wrong, the Model Results Will Also Be Wrong
More data is not better if much of that data is irrelevant to what you are trying to predict. Even machine learning models can be misled if the training set is not representative of reality.
This was cited by a Columbia Business School study as an issue in the 2016 U.S. presidential election, where the polls had given Clinton a firm lead over Trump. However, it turned out that there were many “secret Trump voters” who had not been accounted for in the polls, and this skewed the results towards a predicted Clinton win.
By the way, I am not from the U.S. and am neutral on the subject; I simply use this as an example to illustrate that even data we often think of as “big” can still contain inherent biases and may not be representative of what is actually going on.
Instead, the choice of data needs to be scrutinised as much as model selection, if not more so. Is the inclusion of certain data relevant to the problem we are trying to solve?
Going back to the hotel example, the inclusion of international customer data in the training set did not enhance the model when the goal is to predict cancellations across the domestic customer base.

Conclusion
There is an increasing push to gather more data across all domains. While more data in and of itself is not a bad thing, it should not be assumed that blindly introducing more data into a model will improve its accuracy.
Rather, data scientists still need the ability to determine the relevance of such data to the problem at hand. From this point of view, model selection becomes somewhat of an afterthought: if the data is representative of the problem you are trying to solve in the first place, then even simpler machine learning models will generate strong predictive results.
Many thanks for reading, and feel free to leave any questions or feedback in the comments below.
If you are interested in taking a deeper look at the hotel cancellation example, you can find my GitHub repository here.
Disclaimer: This article is written on an “as is” basis and without warranty. It was written with the intention of providing an overview of data science concepts, and should not be interpreted as professional advice in any way.
Translated from: https://towardsdatascience.com/why-more-data-is-not-always-better-de96723d1499