DengAI — Data Preprocessing
Understanding ML
This article is based on my entry into the DengAI competition on the DrivenData platform. I managed to score within the top 0.2% (14/9069 as of 02 Jun 2020).
In this article, I assume that you’re already familiar with DengAI — EDA. You don’t have to read it to understand everything here, but it will be a lot easier if you do.
Why do we have to preprocess data?
When designing ML models we have to remember that some of them are based on gradient methods. The problem with gradients is that they perform better on normalized/scaled data. Let me show an example:
On the left side, we have a dataset that consists of two features, one of which has a much larger scale than the other. In both cases the gradient method works, but it takes far fewer steps to reach the optimum when the features lie on similar scales (right image).
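A quick way to see this effect is a toy experiment of my own (plain gradient descent on a quadratic bowl; the function and numbers are illustrative, not from the original post):

```python
import numpy as np

def steps_to_converge(scales, tol=1e-6, max_iter=100_000):
    """Gradient descent on f(w) = sum(scales * w**2), using the largest
    learning rate that stays stable for the steepest direction."""
    lr = 0.9 / (2 * scales.max())
    w = np.ones_like(scales)
    for i in range(max_iter):
        w = w - lr * 2 * scales * w  # gradient of f is 2 * scales * w
        if np.abs(w).max() < tol:
            return i + 1
    return max_iter

# One feature on a much larger scale vs. both features on similar scales
print(steps_to_converge(np.array([1.0, 100.0])))  # many more steps
print(steps_to_converge(np.array([1.0, 1.0])))    # converges almost instantly
```

The ill-conditioned case is slow because the learning rate has to be kept small enough for the steep direction, which makes progress along the shallow direction tiny.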
What is Normalization and what is Scaling?
Normalization
In the standard sense, normalization refers to the process of adjusting the value distribution to fit into the <-1, 1> range (it doesn’t have to be exactly -1 to 1, but it should stay within the same order of magnitude). Standard normalization is done by subtracting the mean from each value in the set and dividing the result by the standard deviation.
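As a quick illustration with toy numbers of my own:

```python
import numpy as np

temps = np.array([292.0, 295.0, 298.0, 301.0, 304.0])  # toy Kelvin readings
# Subtract the mean, divide by the standard deviation
normalized = (temps - temps.mean()) / temps.std()
print(normalized)  # zero mean, unit standard deviation
```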
Scaling
You might see it called “min-max normalization”: scaling is another range adjustment, but this time the target range is <0, 1>.
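Again with toy numbers of my own:

```python
import numpy as np

rainfall = np.array([0.0, 12.5, 3.2, 40.0, 7.8])  # toy precipitation values
# Shift by the minimum, divide by the full range
scaled = (rainfall - rainfall.min()) / (rainfall.max() - rainfall.min())
print(scaled)  # every value now lies within <0, 1>
```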
Normalization or Scaling?
There are two types of operations you can perform on a feature: you can either normalize or scale its values. Which one you choose depends on the feature itself. If a feature has both positive and negative values, and those signs carry meaning, you should perform normalization. On a feature where negative values make no sense, you should apply scaling.
It’s not always black and white. Let’s consider a feature like temperature. Depending on which scale you choose (Kelvin or Celsius/Fahrenheit) there might be different interpretations of what that temperature means. The Kelvin scale is an absolute thermodynamic temperature scale (it starts at absolute zero and cannot go below it). On the other hand, we have scales used in everyday life where negative numbers are meaningful to us. When the temperature drops below 0 Celsius, water freezes. The same goes for the Fahrenheit scale: its 0 degrees marks the freezing point of brine (a concentrated solution of salt in water). The straightforward choice would be to scale Kelvins and normalize Celsius and Fahrenheit. That does not always work. We can show it on DengAI’s dataset:
Some of the temperatures are on the Kelvin scale and some on the Celsius scale, but that’s not what matters here. If you look closely, you should be able to group those temperatures by type:
- temperature with an absolute minimum value
- temperature without an absolute minimum value (can be negative)
An example of the first one is station_diur_temp_rng_c. This is the so-called Diurnal temperature variation and describes the spread between the minimum and maximum temperature within some period. It can never be negative (the difference between the maximum and the minimum cannot be lower than 0). That’s where we should use scaling instead of normalization.
Another example is reanalysis_air_temp_k. It is the air temperature and an important feature. We cannot define a minimum value that this temperature could reach. If we really wanted to, we could pick an arbitrary minimum temperature for each city that is never undercut, but that’s not what we want to do. In problems like ours, something like temperature might take on another meaning when training models: the temperature value can have both positive and negative effects. In this case, temperatures below 298K might positively affect the number of cases (fewer mosquitoes). That’s why we should use normalization for this one.
After checking the entire dataset we can come up with the lists of features to normalize, scale, and copy:
Normalized features
'reanalysis_air_temp_k'
'reanalysis_avg_temp_k'
'reanalysis_dew_point_temp_k'
'reanalysis_max_air_temp_k'
'reanalysis_min_air_temp_k'
'station_avg_temp_c'
'station_max_temp_c'
'station_min_temp_c'
Scaled features
'station_diur_temp_rng_c'
'reanalysis_tdtr_k'
'precipitation_amt_mm'
'reanalysis_precip_amt_kg_per_m2'
'reanalysis_relative_humidity_percent'
'reanalysis_sat_precip_amt_mm'
'reanalysis_specific_humidity_g_per_kg'
'station_precip_mm'
'year'
'weekofyear'
Copied features
'ndvi_ne'
'ndvi_nw'
'ndvi_se'
'ndvi_sw'
Why Copy?
If we look at the definition of the NDVI index, we can see there is no reason to scale or normalize those values: NDVI values are already in the <-1, 1> range. Sometimes we might want to copy values directly like that, especially when the original values are within the same order of magnitude as our normalized features. It might be <0, 2> or <1, 4>, but it shouldn’t cause a problem for the model.
The code
Now we have to write some code to preprocess our data. We’re going to use StandardScaler and MinMaxScaler from the sklearn library.
As input to our function, we expect to send 3 or 4 variables. When dealing with the training set we’re sending 3 variables:
- training dataset (as a pandas DataFrame)
- list of columns to normalize
- list of columns to scale
When we’re processing training data we have to define the dataset for the scaling/normalization process. This dataset is used to get values like the mean or standard deviation. Because at the point of processing the training dataset we don’t have any external dataset, we use the training dataset itself. Then we normalize the selected columns using StandardScaler():
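The snippet itself was embedded as an image in the original post, so here is a minimal sketch of what preproc_data could look like. The signature follows the description above, but the exact names and structure are my reconstruction, not the author’s code:

```python
import pandas as pd
from sklearn.preprocessing import MinMaxScaler, StandardScaler

def preproc_data(data, norm_cols, scale_cols, train_scale=None):
    """Normalize norm_cols and scale scale_cols, fitting on train_scale."""
    new_data = data.copy()
    # For the training set there is no external dataset to fit to,
    # so the data itself serves as the scaling dataset
    if train_scale is None:
        train_scale = data
    # Normalize selected columns (zero mean, unit variance)
    std_scaler = StandardScaler()
    std_scaler.fit(train_scale[norm_cols])
    new_data[norm_cols] = std_scaler.transform(new_data[norm_cols])
    # Scale selected columns into <0, 1>
    mm_scaler = MinMaxScaler(feature_range=(0, 1))
    mm_scaler.fit(train_scale[scale_cols])
    new_data[scale_cols] = mm_scaler.transform(new_data[scale_cols])
    # Return train_scale too, so the test set can reuse the same statistics
    return new_data, train_scale
```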
StandardScaler doesn’t require any parameters when initializing, but it requires a dataset to fit to. We could just pass new_data here and it would work, but then we would need to create separate preprocessing for the test dataset.
Next, we’re doing the same thing but with MinMaxScaler().
This time we’re passing one parameter called feature_range to make sure that our scale is in the <0, 1> range. As in the previous example, we’re passing the scaling dataset to fit to and then transforming the selected columns.
In the end, we return the transformed new_data and additionally train_scale for further preprocessing. But wait a second! What further preprocessing? Remember that we’re dealing not only with the training dataset but also with the test dataset. We have to apply the same data processing to both of them so the model receives consistent input. If we simply used preproc_data() in the same way on the test dataset, we would apply a completely different normalization and scaling. The reason is that normalization and scaling are fitted with the .fit() method, which uses a given dataset to calculate the mean and other required statistics. If you use a test dataset that happens to have a different range of values (a hot summer because of global warming, etc.), your value of 28C in the test dataset will be normalized with different parameters. Let me show you an example:
Training Dataset:
Testing Dataset:
Normalizing the Testing Dataset using the mean and SD from the Training Dataset gives us:
[0.11, 0.11, 0.91, 1.71, 0.91, 0.11, 1.71]
But if you use the mean and SD from the test dataset you’ll end up with:
[-1.04, -1.04, 0.17, 1.37, 0.17, -1.04, 1.37]
You might think that the second one describes the dataset better, but that’s only true when dealing with the testing dataset in isolation.
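The numbers above come from the post’s embedded example; the same effect is easy to reproduce with toy data of my own:

```python
import numpy as np

train = np.array([20.0, 22.0, 24.0, 26.0, 28.0])
test = np.array([24.0, 26.0, 28.0, 30.0, 32.0])  # hotter than training

# Correct: normalize the test set with the TRAINING mean and SD
z_train_stats = (test - train.mean()) / train.std()
# Wrong: normalize the test set with its own mean and SD
z_test_stats = (test - test.mean()) / test.std()

print(z_train_stats)  # shifted upward: the model sees "hotter than usual"
print(z_test_stats)   # centered at 0: the shift is silently erased
```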
That’s why, when building our model, we have to execute it like this:
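The original call site was also an image; the essential pattern, sketched with sklearn directly and toy frames of my own, is to fit on the training data once and reuse the fitted scaler for both sets:

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Toy train/test frames with made-up numbers
train_df = pd.DataFrame({"station_avg_temp_c": [25.0, 26.0, 27.0, 28.0]})
test_df = pd.DataFrame({"station_avg_temp_c": [27.0, 28.0, 29.0]})

scaler = StandardScaler().fit(train_df)  # fit on TRAINING data only
train_proc = scaler.transform(train_df)  # ...then transform both sets
test_proc = scaler.transform(test_df)    # with the same fitted parameters
```

With preproc_data, the same idea means passing the returned train_scale when processing the test dataset.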
Conclusion
We’ve just gone through a fairly standard normalization process for our dataset. It is important to understand the difference between normalization and scaling. Another thing, which might be even more important, is choosing which features to normalize (see the example with the different temperature features): you should always try to understand your features, not just apply some hardcoded rules from the internet.
The last thing I have to mention (and you’ve probably already thought about it) is the difference between the data ranges in the training and testing datasets. You know that normalization of the testing data should be done with the statistics from the training data, but shouldn’t we adjust the process to fit a different range? Let’s say the training dataset has a temperature range between 15C and 23C and the testing dataset has a range between 18C and 28C. Isn’t that a problem for our model? Actually it isn’t :) Models don’t really care about small shifts like that because they approximate continuous functions (or distributions), and unless your range differs a lot (i.e. comes from a different distribution) you shouldn’t have any issues with it.
Originally published at https://erdem.pl.
Translated from: https://towardsdatascience.com/dengai-data-preprocessing-28fc541c9470