Leverage the Power of PyCaret
PyCaret is an open-source, low-code machine learning library in Python that aims to reduce the cycle time and allows you to go from preparing your data to deploying your model within seconds using your choice of notebook environment.
This article is aimed at someone who is familiar with machine learning concepts, and also knows how to implement the various Machine Learning algorithms using different libraries such as Scikit-Learn. The perfect reader is aware of the need for automation and doesn’t want to spend so much time seeking the optimal algorithm and its hyperparameters.
As machine learning practitioners, we know that a complete data science project involves several steps before model building, evaluation, and prediction can even start. These include data preprocessing (missing and null value treatment, changing data types, encoding categorical features, and data transformations such as log and Box-Cox), feature engineering, and Exploratory Data Analysis (EDA). In Python we normally use libraries such as NumPy, pandas, Matplotlib, and scikit-learn for these tasks. PyCaret is a powerful library that helps automate much of this process.
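To make the manual steps concrete, here is a minimal sketch of three of them (missing value treatment, categorical encoding, and a log transformation) in plain Python. The rows and column names are hypothetical and are not taken from the laptop dataset; this only illustrates the kind of work that gets automated, not how PyCaret implements it internally.

```python
import math
from statistics import median

# Hypothetical toy rows; the column names are made up for illustration.
rows = [
    {"ram_gb": 8, "brand": "A", "price": 40000.0},
    {"ram_gb": None, "brand": "B", "price": 65000.0},
    {"ram_gb": 16, "brand": "A", "price": 90000.0},
]

# 1. Missing value treatment: fill numeric gaps with the column median.
ram_median = median(r["ram_gb"] for r in rows if r["ram_gb"] is not None)
for r in rows:
    if r["ram_gb"] is None:
        r["ram_gb"] = ram_median

# 2. Categorical encoding: one-hot encode the brand column.
brands = sorted({r["brand"] for r in rows})
for r in rows:
    for b in brands:
        r[f"brand_{b}"] = 1 if r["brand"] == b else 0
    del r["brand"]

# 3. Data transformation: log-transform the target.
for r in rows:
    r["log_price"] = math.log(r["price"])

print(rows[1])
```

Even for three rows and three columns this takes a noticeable amount of code, which is the motivation for handing these steps to a library.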
Installing PyCaret
!pip install pycaret==2.0
Once PyCaret is installed, we are ready to go! I will work through a regression problem here, but PyCaret can also be used for classification, anomaly detection, clustering, and natural language processing.
I am going to use the Laptop Prices dataset here, which I obtained by scraping the Flipkart website.
import pandas as pd
from pycaret.regression import *

df = pd.read_csv('changed.csv')  # Reading the dataset
df.head()
reg = setup(data = df, target = 'Price')
The setup() function of PyCaret handles most of the preprocessing that would normally take many lines of code, and it does so in a single line. That's the beauty of this amazing library!
We assign the result of setup() to a variable and pass the name of the dependent variable as target. Here we want to predict the price of the laptop, so Price is the dependent variable.
X = df.drop('Price', axis=1)
Y = df['Price']
Y = pd.DataFrame(Y)
Comparing all the regression models
compare_models()
[Figure: Training all the regression models]
After this, we can create any individual model, such as a CatBoost or XGBoost regressor, and then perform hyperparameter tuning on it.
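The idea behind comparing models can be sketched without any library: fit each candidate on the training folds, score it on the held-out fold, and rank the candidates by their average error. The example below is only an illustration under stated assumptions. The toy data, the three simple candidate "models", and the helper names (fit_mean, fit_median, fit_line, cv_mae) are all hypothetical, and PyCaret compares far more capable estimators than these.

```python
from statistics import mean, median

# Hypothetical toy data: y is roughly 2*x + 1.
xs = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
ys = [3.1, 4.9, 7.2, 9.0, 10.8, 13.1, 15.0, 17.2, 18.9, 21.1]

def fit_mean(x, y):
    m = mean(y)
    return lambda q: m  # always predicts the training mean

def fit_median(x, y):
    m = median(y)
    return lambda q: m  # always predicts the training median

def fit_line(x, y):
    # Ordinary least squares for a single feature.
    mx, my = mean(x), mean(y)
    slope = sum((a - mx) * (b - my) for a, b in zip(x, y)) / sum((a - mx) ** 2 for a in x)
    intercept = my - slope * mx
    return lambda q: slope * q + intercept

def cv_mae(fit, x, y, k=5):
    """Mean absolute error averaged over k contiguous folds."""
    n = len(x)
    fold_errors = []
    for i in range(k):
        test_idx = set(range(i * n // k, (i + 1) * n // k))
        x_tr = [v for j, v in enumerate(x) if j not in test_idx]
        y_tr = [v for j, v in enumerate(y) if j not in test_idx]
        model = fit(x_tr, y_tr)
        fold_errors.append(mean(abs(y[j] - model(x[j])) for j in test_idx))
    return mean(fold_errors)

candidates = {"mean": fit_mean, "median": fit_median, "line": fit_line}
# Sort by cross-validated MAE, lowest (best) first.
leaderboard = sorted((cv_mae(f, xs, ys), name) for name, f in candidates.items())
for mae, name in leaderboard:
    print(f"{name:>6}: MAE = {mae:.3f}")
```

The sorted list plays the role of the comparison table: the candidate with the lowest cross-validated error appears first.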
We can see that our Gradient Boosting Regressor (GBR) model has performed relatively better when compared to all the other models. But I have performed the analysis using the XGBoost model as well, and this model performed better than the GBR Model.
[Figure: Error using Gradient Boosting Regressor model]
Since XGBoost turned out to be the best model, we create an XGBoost model with the create_model function and set max_depth, the maximum depth of each tree in the model.
Creating the model
xgboost = create_model('xgboost', max_depth = 10)
[Figure: Error using XGBoost model]
After creating the model with a maximum tree depth of 10, PyCaret evaluates it with 10-fold cross-validation, the default for create_model. For every fold it calculates the MAE (Mean Absolute Error), MSE (Mean Squared Error), RMSE (Root Mean Squared Error), R2 (R squared) and MAPE (Mean Absolute Percentage Error). Finally, it displays the mean and standard deviation of each metric across these 10 folds. The lower the error, the better the machine learning model! To reduce the error, we try to find the hyperparameters that minimize it.
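As a reference for how these five metrics are defined, here is a small worked computation in plain Python. The prices are hypothetical and the helper name regression_metrics is my own. MAPE is returned as a fraction rather than a percentage, which is an assumption about the reporting convention.

```python
import math

def regression_metrics(actual, predicted):
    """Compute MAE, MSE, RMSE, R2 and MAPE for two equal-length lists."""
    n = len(actual)
    errors = [a - p for a, p in zip(actual, predicted)]
    mae = sum(abs(e) for e in errors) / n
    mse = sum(e * e for e in errors) / n
    rmse = math.sqrt(mse)
    mean_actual = sum(actual) / n
    ss_res = sum(e * e for e in errors)                    # residual sum of squares
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)   # total sum of squares
    r2 = 1 - ss_res / ss_tot
    mape = sum(abs(e / a) for e, a in zip(errors, actual)) / n  # as a fraction
    return {"MAE": mae, "MSE": mse, "RMSE": rmse, "R2": r2, "MAPE": mape}

# Hypothetical laptop prices, for illustration only.
actual = [50000.0, 60000.0, 80000.0, 100000.0]
predicted = [52000.0, 57000.0, 85000.0, 96000.0]

metrics = regression_metrics(actual, predicted)
for name, value in metrics.items():
    print(f"{name}: {value:.4f}")
```

For MAE, MSE, RMSE and MAPE lower is better, while for R2 a value closer to 1 is better.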
For this purpose, we apply the tune_model function and apply K-fold cross-validation to find out the best hyperparameters.
Hyperparameter tuning of the model
xgboost = tune_model(xgboost, fold=5)
[Figure: Errors after hyperparameter tuning]
The tuned model is evaluated on 5 folds, and we get the mean and standard deviation of each error metric. The mean MAE across the 5 folds was almost the same for the GBR and XGBoost models, but after hyperparameter tuning and making the predictions, the XGBoost model had a lower error and performed better than the GBR model.
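The combination of a hyperparameter search with K-fold cross-validation can be sketched as follows. This is a minimal illustration under stated assumptions, not PyCaret's implementation: the toy data, the k-nearest-neighbour regressor, the candidate grid, and the helper names are all hypothetical, and the search here is a plain exhaustive loop, whereas PyCaret's own search strategy may differ.

```python
from statistics import mean

# Hypothetical 1-D toy data.
xs = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
ys = [1.2, 1.9, 3.2, 3.8, 5.1, 6.3, 6.8, 8.1, 9.2, 9.9]

def knn_predict(x_train, y_train, query, k):
    """Average the targets of the k training points nearest to query."""
    nearest = sorted(zip(x_train, y_train), key=lambda pair: abs(pair[0] - query))[:k]
    return mean(t for _, t in nearest)

def kfold_indices(n, n_folds):
    """Yield (train_idx, test_idx) pairs; fold i takes every n_folds-th row."""
    for i in range(n_folds):
        test_idx = list(range(i, n, n_folds))
        train_idx = [j for j in range(n) if j % n_folds != i]
        yield train_idx, test_idx

def cv_mae(k, n_folds=5):
    """Return the mean MAE over the folds and the per-fold MAEs."""
    fold_maes = []
    for train_idx, test_idx in kfold_indices(len(xs), n_folds):
        x_tr = [xs[j] for j in train_idx]
        y_tr = [ys[j] for j in train_idx]
        fold_maes.append(mean(abs(ys[j] - knn_predict(x_tr, y_tr, xs[j], k)) for j in test_idx))
    return mean(fold_maes), fold_maes

grid = [1, 2, 4, 8]  # candidate values for the hyperparameter k
scores = {k: cv_mae(k)[0] for k in grid}
best_k = min(scores, key=scores.get)
print(f"best k = {best_k}, cross-validated MAE = {scores[best_k]:.3f}")
```

Every candidate is scored on the same 5 folds, so the comparison between hyperparameter values is fair, and the selected value is simply the one with the lowest mean error.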
Making predictions using the best model
predict_model(xgboost)
[Figure: Making the Predictions]
Checking the scores after applying cross-validation (we mainly need the Mean Absolute Error). Here we can see that the MAE for the best model has come down to 10847.2257, so the Mean Absolute Error is roughly 10,800.
Checking all the parameters of the xgboost model
print(xgboost)
[Figure: Checking the hyperparameters]
XGBoost model hyperparameters
plot_model(xgboost, plot='parameter')
[Figure: Checking the hyperparameters]
Residuals Plot
The distances (errors) between the actual and predicted values
plot_model(xgboost, plot='residuals')
[Figure: Residuals Plot]
We can clearly see that my model is overfitting: the R squared is 0.999 on the training set but only 0.843 on the test set. This is not surprising, because my dataset contains only 168 rows in total. The main point here, however, is to highlight how PyCaret lets you create plots and curves with just one line of code!
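The overfitting check above is simple arithmetic on the two R squared values, as the short sketch below shows. The threshold of 0.1 is a hypothetical rule of thumb chosen for illustration, not a standard cutoff.

```python
train_r2 = 0.999  # R squared on the training set, from the residuals plot
test_r2 = 0.843   # R squared on the test set, from the residuals plot

# A large gap between training and test scores suggests overfitting.
gap = train_r2 - test_r2

# Hypothetical rule of thumb: flag a gap above 0.1 as likely overfitting.
GAP_THRESHOLD = 0.1
is_overfitting = gap > GAP_THRESHOLD

print(f"R2 gap = {gap:.3f}, overfitting suspected: {is_overfitting}")
```

A gap of about 0.156 means the model explains almost all of the variance in the data it was trained on, but noticeably less in data it has not seen.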
Plotting the Prediction Error
plot_model(xgboost, plot='error')
[Figure: Prediction Error]
The R squared value of the model on the test set is 0.843.
Cook's Distance Plot
plot_model(xgboost, plot='cooks')
[Figure: Cook's Distance Plot]
Learning Curve
plot_model(xgboost, plot='learning')
[Figure: Learning Curve]
Validation Curve
plot_model(xgboost, plot='vc')
[Figure: Validation Curve]
These two plots also show that the model is clearly overfitting!
Plot of Feature Importance
plot_model(xgboost, plot='feature')
[Figure: Feature Importance]
From this plot, we can see that Processor_Type_i9 (an i9 CPU) is a very important feature for determining the price of a laptop.
Splitting the dataset into training and testing sets
from sklearn.model_selection import train_test_split

X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.2)
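To see what an 80/20 split means for a dataset of 168 rows, here is a minimal plain-Python sketch of a shuffled split on row indices. The helper name split_indices and the fixed seed are hypothetical, and rounding the test share up is an assumption about how the fraction is converted into a row count.

```python
import math
import random

def split_indices(n_rows, test_size=0.2, seed=0):
    """Shuffle row indices and hold out a test_size fraction of them."""
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)
    n_test = math.ceil(n_rows * test_size)  # assumption: round the test share up
    return indices[n_test:], indices[:n_test]

train_idx, test_idx = split_indices(168, test_size=0.2)
print(len(train_idx), len(test_idx))
```

With 168 rows this leaves 134 rows for training and 34 for testing, which is a very small test set and one more reason to treat the reported scores with care.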
Final XGBoost parameters for deployment
final_xgboost = finalize_model(xgboost)
[Figure: Final parameters of the XGB model]
Making predictions on the unseen data (test set data)
new_predictions = predict_model(xgboost, data=X_test)
new_predictions.head()
[Figure: Predictions on the test set]
Saving the transformation pipeline and model
save_model(xgboost, model_name = 'deployment_08082020')
Transformation Pipeline and Model Succesfully Saved
deployment_08082020 = load_model('deployment_08082020')
Transformation Pipeline and Model Sucessfully Loaded
deployment_08082020
[Figure: Final Machine Learning Model]
So this is the final Machine Learning model that can be used for deployment.
The model is saved in the pickle format!
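For readers who have not used pickle before, the round trip of saving an object to disk and loading it back looks roughly like this. The dictionary below is a hypothetical stand-in for the fitted pipeline and model, and the file is written to a temporary directory so that the sketch leaves nothing behind.

```python
import os
import pickle
import tempfile

# Hypothetical stand-in for a fitted pipeline; any picklable object works.
artifact = {"model": "xgboost", "params": {"max_depth": 10}, "target": "Price"}

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "deployment_08082020.pkl")
    # Serialize the object to a binary file.
    with open(path, "wb") as fh:
        pickle.dump(artifact, fh)
    # Read it back into a new, equal object.
    with open(path, "rb") as fh:
        restored = pickle.load(fh)

print(restored == artifact)
```

One practical caution: unpickling can execute arbitrary code, so only load pickle files that come from a source you trust.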
For more info, check the documentation here
In this article, I have not discussed everything in detail, but you can always refer to my GitHub repository for the whole code. My conclusion is this: don't expect a perfect model, but do expect something you can use in your own company or project today!
Shout out to Moez Ali for this absolutely brilliant library!
Connect with me on LinkedIn here
The bottom line is that the automation lowers the risk of human error and adds some intelligence to the enterprise system. — Stephen Elliot
I hope you found the article insightful. I would love to hear your feedback so that I can improve it and come back with better content.
Thank you so much for reading!
Translated from: https://towardsdatascience.com/leverage-the-power-of-pycaret-d5c3da3adb9b