An Extensive Step-by-Step Guide for Data Preparation
Table of Contents
- Introduction
- What is Data Preparation?
- Exploratory Data Analysis (EDA)
- Data Preprocessing
- Data Splitting

Introduction
Before we get into this, I want to make it clear that there is no rigid process when it comes to data preparation. How you prepare one set of data will most likely differ from how you prepare another. Therefore, this article aims to provide an overarching framework that you can refer to when preparing any particular set of data.

Before we get into the guide, I should probably go over what data preparation is…
什么是數(shù)據(jù)準(zhǔn)備? (What is Data Preparation?)
Data preparation is the step after data collection in the machine learning life cycle and it’s the process of cleaning and transforming the raw data you collected. By doing so, you’ll have a much easier time when it comes to analyzing and modeling your data.
數(shù)據(jù)準(zhǔn)備是機(jī)器學(xué)習(xí)生命周期中數(shù)據(jù)收集之后的步驟,并且是清理和轉(zhuǎn)換您收集的原始數(shù)據(jù)的過程。 這樣,您就可以輕松地分析和建模數(shù)據(jù)。
There are three main parts to data preparation that I'll go over in this article:
1. Exploratory Data Analysis (EDA)

Exploratory data analysis, or EDA for short, is exactly what it sounds like: exploring your data. In this step, you're simply getting an understanding of the data that you're working with. In the real world, datasets are not as clean or intuitive as Kaggle datasets.
The more you explore and understand the data you're working with, the easier it'll be when it comes to data preprocessing.

Below is a list of things that you should consider in this step:
特征和目標(biāo)變量 (Feature and Target Variables)
Determine what the feature (input) variables are and what the target variable is. Don’t worry about determining what the final input variables are, but make sure you can identify both types of variables.
確定什么是要素(輸入)變量,什么是目標(biāo)變量。 不必?fù)?dān)心確定最終輸入變量是什么,但是請(qǐng)確保可以識(shí)別兩種類型的變量。
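For instance, with a pandas DataFrame you might separate the two like this; the file and column names here are just placeholders:

```python
import pandas as pd

# Hypothetical dataset where "price" is the target and every other column is a candidate feature
df = pd.read_csv("housing.csv")

X = df.drop(columns=["price"])  # feature (input) variables
y = df["price"]                 # target variable
```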
Data Types

Figure out what type of data you're working with. Are the variables categorical, numerical, or neither? This is especially important for the target variable, as its data type will narrow down which machine learning models you may want to use. Pandas tools like df.describe() and df.dtypes are useful here.
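A quick first pass with pandas might look like this (the file name is again just a placeholder):

```python
import pandas as pd

df = pd.read_csv("housing.csv")

print(df.dtypes)      # data type of each column (int64, float64, object, ...)
print(df.describe())  # summary statistics for the numeric columns
print(df.nunique())   # number of distinct values per column, handy for spotting categorical features
```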
Check for Outliers

An outlier is a data point that differs significantly from other observations. In this step, you'll want to identify outliers and try to understand why they're in the data. Depending on the reason, you may decide to remove them from the dataset or keep them. There are a couple of ways to identify outliers (a small code sketch follows the two methods below):
Z-score/standard deviations: if we know that 99.7% of the data in a normally distributed dataset lies within three standard deviations of the mean, then we can calculate the size of one standard deviation, multiply it by 3, and identify the data points that fall outside this range. Equivalently, we can calculate the z-score of a given point, and if its absolute value exceeds 3, it's an outlier. Note that there are a few caveats when using this method: the data must be approximately normally distributed, it is not applicable to small datasets, and the presence of too many outliers can throw off the z-scores themselves.
Interquartile Range (IQR): the IQR, the concept used to build boxplots, can also be used to identify outliers. The IQR is equal to the difference between the 3rd quartile (Q3) and the 1st quartile (Q1). A point is then flagged as an outlier if it is less than Q1 - 1.5*IQR or greater than Q3 + 1.5*IQR. For normally distributed data, this comes to approximately 2.698 standard deviations.
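Here is a minimal sketch of both checks on a single numeric column; the sample values are made up:

```python
import numpy as np
import pandas as pd

def zscore_outliers(s: pd.Series, threshold: float = 3.0) -> pd.Series:
    """Return points whose z-score magnitude exceeds the threshold (assumes roughly normal data)."""
    z = (s - s.mean()) / s.std()
    return s[np.abs(z) > threshold]

def iqr_outliers(s: pd.Series) -> pd.Series:
    """Return points outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR]."""
    q1, q3 = s.quantile(0.25), s.quantile(0.75)
    iqr = q3 - q1
    return s[(s < q1 - 1.5 * iqr) | (s > q3 + 1.5 * iqr)]

incomes = pd.Series([35000, 42000, 39000, 41000, 38000, 250000])
print(zscore_outliers(incomes))  # empty here: with so few points the z-score check breaks down
print(iqr_outliers(incomes))     # flags 250000
```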
Ask Questions

There's no doubt that you'll most likely have questions regarding the data that you're working with, especially for a dataset that is outside of your domain knowledge. For example, when Kaggle ran a competition on NFL analytics and injuries, I had to do some research to understand what the different positions were and what function they served for the team.
2. Data Preprocessing

Once you understand your data, the majority of your time as a data scientist is spent on this step: data preprocessing. This is when you spend your time manipulating the data so that it can be modeled properly. Like I said before, there is no universal way to go about this. However, there are a number of essential things that you should consider, which we'll go through below.
Feature Imputation

Feature imputation is the process of filling in missing values. This is important because most machine learning models don't work when there is missing data in the dataset.
One of the main reasons I wanted to write this guide was specifically for this step. Many articles say that you should default to filling missing values with the mean, or simply removing the row, and this is not necessarily true.

Ideally, you want to choose the method that makes the most sense. For example, if you were modeling people's age and income, it wouldn't make sense for a 14-year-old to be making the national average salary.
All things considered, there are a number of ways you can deal with missing values (a small sketch of a few of them follows the list):
Single value imputation: replacing missing values with the mean, median, or mode of a column.
Multiple value imputation: modeling the features that have missing data and imputing the missing values with what your model predicts.
K-nearest neighbors: filling in missing data with a value from another, similar sample.
Deleting the row: this isn't an imputation technique, but it tends to be okay when you have an extremely large sample size and can afford to lose a few rows.
Others include: random imputation, moving window, most frequent, etc.
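As an illustration, here is a minimal sketch of a few of these options using scikit-learn; the data and column names are made up:

```python
import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer, SimpleImputer

df = pd.DataFrame({"age": [23, 35, np.nan, 41, 29],
                   "income": [48000, np.nan, 52000, 61000, 39000]})

# Single value imputation: fill each column with its median
median_filled = pd.DataFrame(
    SimpleImputer(strategy="median").fit_transform(df), columns=df.columns)

# K-nearest neighbors imputation: fill using the most similar rows
knn_filled = pd.DataFrame(
    KNNImputer(n_neighbors=2).fit_transform(df), columns=df.columns)

# Deleting rows that contain any missing value
dropped = df.dropna()
```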
Feature Encoding

Feature encoding is the process of turning values (e.g. strings) into numbers. This is necessary because machine learning models require all values to be numbers.
There are a few ways that you can go about this (a short example follows the two methods below):
Label Encoding: label encoding simply converts a feature's non-numerical values into numerical values, whether the feature is ordinal or not. For example, if a feature called car_colour had distinct values of red, green, and blue, then label encoding would convert these values to 1, 2, and 3 respectively. Be wary when using this method: while some ML models will be able to make sense of the encoding, others won't.
One-Hot Encoding (a.k.a. get_dummies): one-hot encoding works by creating a binary feature (1 or 0) for each non-numerical value of a given feature. Reusing the example above, if we had a feature called car_colour, then one-hot encoding would create three features called car_colour_red, car_colour_green, and car_colour_blue, each holding a 1 or 0 indicating whether the car is that colour.
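Sticking with the car_colour example, a minimal sketch of both encodings with pandas and scikit-learn might look like this (note that scikit-learn's LabelEncoder numbers categories from 0 rather than 1):

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

df = pd.DataFrame({"car_colour": ["red", "green", "blue", "green"]})

# Label encoding: one integer per distinct value (0, 1, 2, ...)
df["car_colour_label"] = LabelEncoder().fit_transform(df["car_colour"])

# One-hot encoding: one binary column per distinct value
one_hot = pd.get_dummies(df["car_colour"], prefix="car_colour", dtype=int)
# -> columns car_colour_blue, car_colour_green, car_colour_red
```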
Feature Normalization

When numerical values are on different scales, e.g. height in centimeters and weight in pounds, most machine learning algorithms don't perform well. The k-nearest neighbors algorithm is a prime example of one that struggles when features are on different scales. Normalizing or standardizing the data can help with this problem (a short sketch follows the two definitions below).
Feature normalization rescales the values so that they fall within the range [0, 1].
Feature standardization rescales the data to have a mean of 0 and a standard deviation of 1.
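A minimal sketch of both with scikit-learn, reusing the height/weight example with made-up values:

```python
import pandas as pd
from sklearn.preprocessing import MinMaxScaler, StandardScaler

df = pd.DataFrame({"height_cm": [150, 165, 180, 195],
                   "weight_lb": [110, 140, 175, 210]})

normalized = MinMaxScaler().fit_transform(df)      # each column rescaled to the [0, 1] range
standardized = StandardScaler().fit_transform(df)  # each column rescaled to mean 0, std 1
```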
Feature Engineering

Feature engineering is the process of transforming raw data into features that better represent the underlying problem one is trying to solve. There's no specific way to go about this step, but here are some things that you can consider (with a short sketch after the list):
- Converting a DateTime variable to extract just the day of the week, the month of the year, etc.
- Creating bins or buckets for a variable (e.g. for a height variable: 100–149cm, 150–199cm, 200–249cm, etc.).
- Combining multiple features and/or values to create a new one. For example, one of the most accurate models for the Titanic challenge engineered a new variable called "Is_women_or_child", which was True if the person was a woman or a child and False otherwise.
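A minimal pandas sketch of these three ideas, with made-up column names and values:

```python
import pandas as pd

df = pd.DataFrame({
    "pickup_datetime": pd.to_datetime(["2020-01-03 08:15", "2020-06-21 17:40"]),
    "height_cm": [148, 202],
    "sex": ["female", "male"],
    "age": [34, 9],
})

# Extract parts of a DateTime variable
df["day_of_week"] = df["pickup_datetime"].dt.dayofweek
df["month"] = df["pickup_datetime"].dt.month

# Bin a numeric variable into buckets
df["height_bucket"] = pd.cut(df["height_cm"], bins=[100, 150, 200, 250],
                             labels=["100-149", "150-199", "200-249"])

# Combine features to create a new one, as in the Titanic example above
df["is_woman_or_child"] = (df["sex"] == "female") | (df["age"] < 13)
```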
Feature Selection

Next is feature selection: choosing the most relevant/valuable features of your dataset. Here are a few methods that I like to use, which you can leverage to help you select your features (a quick sketch follows the two methods below):
Feature importance: some algorithms, like random forests or XGBoost, allow you to determine which features were the most "important" in predicting the target variable's value. By quickly creating one of these models and inspecting its feature importances, you'll get an understanding of which variables are more useful than others.
Dimensionality reduction: one of the most common dimensionality reduction techniques, Principal Component Analysis (PCA), takes a large number of features and uses linear algebra to reduce them to fewer features.
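A quick sketch of both approaches on a built-in scikit-learn dataset; in practice you would usually standardize the features before applying PCA:

```python
import pandas as pd
from sklearn.datasets import load_breast_cancer
from sklearn.decomposition import PCA
from sklearn.ensemble import RandomForestClassifier

data = load_breast_cancer()
X = pd.DataFrame(data.data, columns=data.feature_names)
y = data.target

# Feature importance from a quickly fitted random forest
model = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)
importances = pd.Series(model.feature_importances_, index=X.columns)
print(importances.sort_values(ascending=False).head(10))

# Dimensionality reduction: compress the 30 features down to 5 principal components
X_reduced = PCA(n_components=5).fit_transform(X)
```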
Dealing with Data Imbalances

One other thing that you'll want to consider is data imbalance. For example, if there are 5,000 examples of one class (e.g. not fraudulent) but only 50 examples of another class (e.g. fraudulent), then you'll want to consider one of a few things (a small resampling sketch follows the list):
- Collecting more data: this always works in your favor but is usually not possible or too expensive.
- Oversampling or undersampling the data, for example with the imbalanced-learn package from scikit-learn-contrib.
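For reference, here is a minimal resampling sketch with the imbalanced-learn package on a synthetic dataset mimicking the 5,000-vs-50 example:

```python
from collections import Counter
from sklearn.datasets import make_classification
from imblearn.over_sampling import RandomOverSampler
from imblearn.under_sampling import RandomUnderSampler

# Synthetic imbalanced data: roughly 99% of one class, 1% of the other
X, y = make_classification(n_samples=5050, weights=[0.99, 0.01], random_state=0)
print(Counter(y))

# Oversample the minority class up to the majority class size
X_over, y_over = RandomOverSampler(random_state=0).fit_resample(X, y)
print(Counter(y_over))

# Or undersample the majority class down to the minority class size
X_under, y_under = RandomUnderSampler(random_state=0).fit_resample(X, y)
print(Counter(y_under))
```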
3. Data Splitting

Last comes splitting your data. I'm just going to give a very generic, generally agreed-upon framework that you can use here.
Typically you'll want to split your data into three sets (a small sketch follows the list):
Training set (70–80%): this is what the model learns on.
Validation set (10–15%): the model's hyperparameters are tuned on this set.
Test set (10–15%): the model's final performance is evaluated on this set. If you've prepared the data correctly, the results from the test set should give a good indication of how the model will perform in the real world.
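One way to get roughly a 70/15/15 split with scikit-learn, assuming X and y have already been prepared:

```python
from sklearn.model_selection import train_test_split

# First carve out 70% for training, then split the remaining 30% evenly into validation and test
X_train, X_temp, y_train, y_temp = train_test_split(X, y, test_size=0.3, random_state=0)
X_val, X_test, y_val, y_test = train_test_split(X_temp, y_temp, test_size=0.5, random_state=0)

print(len(X_train), len(X_val), len(X_test))  # approximately 70% / 15% / 15% of the rows
```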
Thanks for Reading!

I hope you've learned a thing or two from this. By reading this, you should now have a general framework in mind when it comes to data preparation. There are many things to consider, but having resources like this to remind you is always helpful.
If you follow these steps and keep these things in mind, you'll definitely have your data better prepared, and you'll ultimately be able to develop a more accurate model!
Terence Shin

Check out my free data science resource with new material every week!
If you enjoyed this, follow me on Medium for more.
Let's connect on LinkedIn.
翻譯自: https://towardsdatascience.com/an-extensive-step-by-step-guide-for-data-preparation-aee4a109051d
詳盡kmp
總結(jié)
以上是生活随笔為你收集整理的详尽kmp_详尽的分步指南,用于数据准备的全部?jī)?nèi)容,希望文章能夠幫你解決所遇到的問題。
- 上一篇: 梦到生吞蛇预示着什么
- 下一篇: 梦到黑熊是胎梦吗