Machine learning is not always the solution to a data problem
Overview
I was given a large dataset of files, what some would call big data, and was told to come up with a solution to the data problem. People often associate big data with machine learning and automatically jump to a machine learning solution. However, after working with this dataset, I realised that machine learning was not the solution. The dataset was provided to me as a case study that I had to complete as part of a four-step interview process.
Task description
The dataset consists of a set of files tracking email activity across multiple construction projects. All data has been anonymised. The task was to explore the dataset and report back with any insights. I was informed that clients are concerned with project duration and the number of site instructions and variations seen on a project as these typically cost money.
Data ETL
Firstly, correspondence data was read in and appended together. The data was checked for duplicates. No duplicates were found.
As clients are concerned with project duration, the difference between the response-required-by date and the sent date was calculated in days. However, there were quite a few missing values for the response-required-by date and some for the sent date. These records were excluded, so the dataset was reduced from 20,006,768 records of 7 variables to 3,895,037. Then, the correspondence data was combined with the mail types file to determine whether the type of correspondence has an impact on duration. Finally, a file containing the number of records for each project was merged in.
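A sketch of this step in data.table might look like the following; the column names `sent_date` and `response_required_by` are assumptions for illustration, not taken from the actual files:

```r
library(data.table)

# Toy stand-in for the correspondence data; assumed column names.
corr <- data.table(
  sent_date            = c("2019-01-02", "2019-01-05", NA),
  response_required_by = c("2019-01-10", NA,           "2019-01-20")
)

# Exclude records with a missing sent or response-required date,
# then compute the gap in days.
corr <- corr[!is.na(sent_date) & !is.na(response_required_by)]
corr[, duration_days := as.numeric(
  as.Date(response_required_by) - as.Date(sent_date)
)]
```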
Usually, it is not good practice to exclude data without a valid reason; however, as this dataset was assigned to me as part of a job application process, I did not have the opportunity to understand the dataset well enough to impute the dates.
As you can see from the code below to import the dataset, we do not have much information on the emails other than projectID, number of records, typeID and typeName.
As a large number of .csv files had to be read in from a single directory, I used lapply with the fread function to read in the files and appended the list of files using rbindlist.
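A minimal, self-contained sketch of that import pattern (the directory and file contents here are placeholders created just so the example runs):

```r
library(data.table)

# Hypothetical directory of per-project CSV extracts.
dir <- file.path(tempdir(), "correspondence")
dir.create(dir, showWarnings = FALSE)
fwrite(data.table(corrID = 1:2, projectID = 10L), file.path(dir, "p10.csv"))
fwrite(data.table(corrID = 3:4, projectID = 20L), file.path(dir, "p20.csv"))

# Read every .csv in the directory with fread and stack them into one table.
files <- list.files(dir, pattern = "\\.csv$", full.names = TRUE)
correspondence <- rbindlist(lapply(files, fread), use.names = TRUE, fill = TRUE)
```

`fill = TRUE` guards against files whose columns do not line up exactly.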
It is good practice to save your consolidated dataset as an R object to avoid having to rerun the import code at a later date, as this process can be quite time-consuming depending on the number of files.
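For example (`correspondence` here is a small stand-in for the consolidated table):

```r
# Stand-in for the consolidated dataset produced by the import step.
correspondence <- data.frame(corrID = 1:3, projectID = c(10L, 10L, 20L))

# Run the import once, save the result; later sessions just reload it.
path <- file.path(tempdir(), "correspondence.rds")
saveRDS(correspondence, path)
correspondence <- readRDS(path)
```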
Feature engineering
New variables were created to understand project duration: the duration in days, and whether the item was submitted after the response-required-by date. If yes, it was late; otherwise, it was early or on time.
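The lateness flag might be derived like this (again with assumed column names):

```r
library(data.table)

# Toy rows: one on-time submission, one late submission.
corr <- data.table(
  sent_date            = as.Date(c("2019-01-02", "2019-01-12")),
  response_required_by = as.Date(c("2019-01-10", "2019-01-10"))
)

corr[, duration_days := as.numeric(response_required_by - sent_date)]
corr[, status := fifelse(sent_date > response_required_by,
                         "late", "early or on time")]
```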
The unique correspondence dataset was then joined to the mail types file on correspondence type ID (primary key). This was later joined to the main file on project ID.
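A sketch of the two joins, using tiny stand-in tables (the key names `typeID` and `projectID` match the text; everything else is illustrative):

```r
library(data.table)

corr       <- data.table(corrID = 1:3, typeID = c(1L, 2L, 1L),
                         projectID = c(10L, 10L, 20L))
mail_types <- data.table(typeID = 1:2, typeName = c("Email", "Fax"))
projects   <- data.table(projectID = c(10L, 20L), n_records = c(150L, 80L))

# Join mail types on typeID, then the project-level record counts on projectID.
joined <- merge(corr, mail_types, by = "typeID", all.x = TRUE)
joined <- merge(joined, projects, by = "projectID", all.x = TRUE)
```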
My initial insights after the joins are as follows:
- Correspondence ID is unique, so no aggregation is required and it will not assist the analysis.
- There are too many organisation IDs and no sensible way to group them, so they were excluded as a predictor.
- There are too many userIDs; they are not a useful predictor, especially as some only have a frequency count of 1, so they were also excluded.
To create a unique row for each correspondence ID, project ID and typeName, I needed to aggregate the other features. I did this by calculating frequencies or counts for the number of late submissions, number of early submissions and by calculating summary statistics such as maximum, minimum, and mean for duration in days.
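The aggregation step can be sketched in data.table as follows (the input table, grouping keys and feature names are assumptions based on the description above):

```r
library(data.table)

# Hypothetical input: one row per email with its project, type and timing.
emails <- data.table(
  projectID     = c(10L, 10L, 10L, 20L),
  typeName      = c("Email", "Email", "Fax", "Email"),
  duration_days = c(3, -1, 5, 2),
  status        = c("early or on time", "late",
                    "early or on time", "early or on time")
)

# One row per projectID/typeName: counts plus summary statistics of duration.
agg <- emails[, .(
  n_late        = sum(status == "late"),
  n_early       = sum(status != "late"),
  min_duration  = min(duration_days),
  max_duration  = max(duration_days),
  mean_duration = mean(duration_days)
), by = .(projectID, typeName)]
```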
Modelling
Grouping the data reduced the dataset to 51,156 observations of 7 variables. The sample size at this point is quite small; however, most of the data had already been dropped due to missing response dates, and keeping every record per project ID would amount to a comparison per organisation and user ID. The client appears to be interested in site instructions and variations, which can probably be found in correspondence type ID and typeName, and data at the individual record level would be too granular for the task requirements.
We do have an issue where we do not know what each ID stands for and whether it is important.
Two types of models were run:
A GLM (Gaussian) was run to determine the linear combination of predictors most likely to have an impact on the increase or decrease of average duration in days.

A GBM (Gaussian) was run to again identify the top predictors and how they relate to average project duration (days).
It is good practice to run multiple models in order to ensure that the model with greater accuracy and interpretability is selected.
I first partitioned the dataset into training (70%) and test (30%) sets using a random split. In retrospect, I could have split the data by date in order to determine how the model does at predicting average duration in the future.
I used the glm and gbm functions from the H2O package to take advantage of the parallel processing it provides, as my laptop was quite slow. I ran 5-fold cross-validation on my training set. In this method, the training set is partitioned into five equal folds. In each round, one fold is held out as the validation set and the model is trained on the remaining four; this repeats until every fold has served as the validation set once. The accuracy of each of the five resulting models on its held-out fold is then averaged into a single model accuracy metric.
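Putting the split and the cross-validated models together, an H2O sketch might look like the following. The response name `mean_duration` and the input frame `agg` are placeholders, and running this requires a local H2O cluster (Java), so treat it as an outline rather than a drop-in script:

```r
library(h2o)
h2o.init()  # starts a local H2O cluster

# agg stands in for the aggregated data.frame from the previous step.
hex    <- as.h2o(agg)
splits <- h2o.splitFrame(hex, ratios = 0.7, seed = 42)  # 70/30 random split
train  <- splits[[1]]
test   <- splits[[2]]

predictors <- setdiff(names(train), "mean_duration")

# 5-fold cross-validated Gaussian GLM and GBM.
glm_fit <- h2o.glm(x = predictors, y = "mean_duration",
                   training_frame = train, family = "gaussian", nfolds = 5)
gbm_fit <- h2o.gbm(x = predictors, y = "mean_duration",
                   training_frame = train, distribution = "gaussian", nfolds = 5)

# Held-out performance (RMSE, R-squared) for each model.
h2o.performance(glm_fit, newdata = test)
h2o.performance(gbm_fit, newdata = test)
```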
Output from the glm model is shown below. For continuous target variables, RMSE and R-squared are commonly used as accuracy metrics. We want the RMSE to be as close to 0 as possible and the R-squared value to be as close to 1 as possible. In our case, we can see that our model accuracy is awful on both metrics.
Now, let’s look at the output from the gbm. The results from this model are marginally better but nothing to rave about.
Results show that of the 1,500 predictors entered into the model (1,500 due to binarisation of the categorical variables), 672 have some degree of influence. Only 2 iterations were run, despite the model having the option of multiple runs to produce the best output. The reason is that the best value of lambda was reached after two iterations, giving a poor goodness-of-fit score of 0.38% R-squared.
Though the mean number of records did not come out as a significant predictor in the GLM, it appears to be the most important in the GBM, followed by correspondence type and typeName.
Due to the low accuracy for both models, I wouldn’t want to make any deductions from the output.
I decided to further investigate the output by plotting the top predictors. From the box plots below, we can see that average duration is longest, and most variable, for "PM request for approval sample", compared to email and fax, which typically have durations close to zero.
In the box plots below, we can see that the average duration is higher for payment claim than design query and non-conformance notice.
Conclusion
The quality of a model depends on the quality of the data and its features. In this case, we started off with a very small number of features and a very poor understanding of the dataset, which led to the removal of a large proportion of the data and to poor model accuracy despite some feature engineering.
In this example, we found that a simple exploration of the dataset would have answered the business question about what impacts project duration: we would have found that payment claims and "PM request for approval sample" correspondence can lead to an increase in duration.
The number of site instructions and variations per project could likewise have been explored by calculating frequencies by project ID and typeName.
This is an example where a predictive model was not required. To build a model with better accuracy, additional features and datasets are needed, along with a better understanding of the dataset.
I would love to hear what you think and whether I could have approached this problem differently! :)
All code can be found here: https://github.com/shedoesdatascience/email_analysis/blob/master/email_analysis_documentation.Rmd
Translated from: https://towardsdatascience.com/machine-learning-is-not-always-the-solution-to-a-data-problem-7f07c000f15