mlmachine: Hyperparameter Tuning with Bayesian Optimization

Published: 2023/12/15

mlmachine

TL;DR

mlmachine is a Python library that organizes and accelerates notebook-based machine learning experiments.

In this article, we use mlmachine to accomplish actions that would otherwise take considerable coding and effort, including:

Bayesian Optimization for Multiple Estimators in One Shot
Results Analysis
Model Reinstantiation

Check out the Jupyter Notebook for this article.

Check out the project on GitHub.

And check out past mlmachine articles:

Bayesian Optimization for Multiple Estimators in One Shot

Bayesian optimization is typically described as an advancement beyond exhaustive grid searches, and rightfully so. This hyperparameter tuning strategy succeeds by using prior information to inform future parameter selection for a given estimator. Check out Will Koehrsen’s article on Medium for an excellent overview of the package.

mlmachine uses hyperopt as a foundation for performing Bayesian optimization, and takes the functionality of hyperopt a step further through a simplified workflow that allows for optimization of multiple models in a single process execution. In this article, we are going to optimize four classifiers:

LogisticRegression()
XGBClassifier()
RandomForestClassifier()
KNeighborsClassifier()

Prepare Data

First, we apply data preprocessing techniques to clean up our data. We’ll start by creating two Machine() objects — one for the training data and a second for the validation data:

Now we process the data by imputing nulls and applying various binning, feature engineering and encoding techniques:

Here is the output, still in a DataFrame:

Feature Importance Summary

As a second preparatory step, we want to perform feature selection for each of our classifiers:

Exhaustively Iterative Feature Selection

For our final preparatory step, we use this feature selection summary to perform iterative cross-validation on smaller and smaller subsets of features for each of our estimators:

From this result, we extract our dictionary of optimum feature sets for each estimator:

The keys are estimator names, and the associated values are lists containing the column names of the best performing feature subset for each estimator. Here is the key/value pair for XGBClassifier(), which used only 10 of the available 43 features to achieve the best average cross-validation accuracy on the validation dataset:

With our processed dataset and optimum feature subsets in hand, it’s time to use Bayesian optimization to tune the hyperparameters of our 4 estimators.

Outline Our Feature Space

First, we need to establish our feature space for each parameter for each estimator:

The outermost keys of the dictionary are names of classifiers, represented by strings. The associated values are also dictionaries, where the keys are parameter names, represented as strings, and the values are hyperopt sampling distributions from which parameter values will be chosen.

Run the Bayesian Optimization Job

Now we’re ready to run our Bayesian optimization hyperparameter tuning job. We will use a built-in method belonging to our Machine() object called exec_bayes_optim_search(). Let’s see mlmachine in action:

Let’s review the parameters:

estimator_parameter_space: The dictionary-based feature space we set up above.
data: Our observations.
target: Our target data.
columns: An optional parameter that allows us to subset the input dataset features. Accepts a list of feature names, which will apply equally to all estimators. Also accepts a dictionary, where the keys represent estimator class names and values are lists of feature names to be used with the associated estimator. In this example, we use the latter by passing in the dictionary returned by cross_val_feature_dict() in the FeatureSelector() workflow above.
scoring: The scoring metric to be evaluated.
n_folds: Number of folds to use in the cross-validation procedure.
iters: Total number of iterations to run the hyperparameter tuning process. In this example, we run the experiment for 200 iterations.
show_progressbar: Controls whether the progress bar displays and actively updates during the course of the process.
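
To make the two accepted forms of columns concrete, the following is a small sketch of how a list or a per-estimator dictionary could be resolved into the features for one estimator. The helper `resolve_columns`, the feature names, and the fallback to all features when an estimator is missing from the dictionary are all assumptions for illustration; this is not mlmachine's internal logic.

```python
from typing import Dict, List, Optional, Union

ColumnSpec = Optional[Union[List[str], Dict[str, List[str]]]]


def resolve_columns(
    columns: ColumnSpec, estimator_name: str, all_columns: List[str]
) -> List[str]:
    """Return the feature names to use for one estimator (hypothetical helper)."""
    if columns is None:
        # No subset requested: use every available feature.
        return list(all_columns)
    if isinstance(columns, dict):
        # Per-estimator subsets; fall back to all features if a key is missing.
        return list(columns.get(estimator_name, all_columns))
    # A plain list applies equally to all estimators.
    return list(columns)


all_columns = ["Age", "Fare", "Pclass", "Sex"]  # made-up feature names
per_estimator = {"XGBClassifier": ["Age", "Fare"]}

print(resolve_columns(per_estimator, "XGBClassifier", all_columns))
print(resolve_columns(["Sex"], "KNeighborsClassifier", all_columns))
```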

Anyone familiar with hyperopt will be wondering where the objective function is. mlmachine abstracts away this complexity.

The process runtime depends on several attributes, including hardware, the number and type of estimators used, the number of folds, feature selection, and the number of sampling iterations. Runtimes can be quite lengthy. For this reason, exec_bayes_optim_search() automatically saves the result of each iteration to a CSV.
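
The value of saving after every iteration is that an interrupted run loses nothing already computed. A minimal sketch of that pattern using only the standard library follows; the file name, the column names, and the placeholder values are assumptions for illustration and do not reflect the actual layout of the log mlmachine writes.

```python
import csv
import os
import tempfile

# Hypothetical log columns, chosen for illustration.
FIELDS = ["iteration", "estimator", "scoring", "mean_score", "train_time", "params"]


def append_result(path, record):
    """Append one iteration's record, writing the header if the file is new."""
    is_new = not os.path.exists(path) or os.path.getsize(path) == 0
    with open(path, "a", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=FIELDS)
        if is_new:
            writer.writeheader()
        writer.writerow(record)


log_path = os.path.join(tempfile.mkdtemp(), "bayes_optim_log.csv")
for i in range(3):
    # Placeholder values standing in for a real cross-validation result.
    append_result(log_path, {
        "iteration": i,
        "estimator": "KNeighborsClassifier",
        "scoring": "accuracy",
        "mean_score": 0.0,
        "train_time": 0.0,
        "params": {"n_neighbors": i + 1},
    })

# Read the log back to confirm every iteration was persisted.
with open(log_path, newline="") as f:
    rows = list(csv.DictReader(f))
print(len(rows), rows[-1]["params"])
```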

Results Analysis

Results Summary

Let’s start by loading and reviewing the results:

Our Bayesian optimization log maintains key information about each iteration:

Iteration number, estimator and scoring metric
Cross-validation summary statistics
Iteration training time
Dictionary of parameters used

This log provides an immense amount of data for us to analyze and evaluate the effectiveness of the Bayesian optimization process.

Model Optimization Assessment

First and foremost, we want to see whether performance improved over the iterations.

Let’s visualize the XGBClassifier() loss by iteration:

Each dot represents the performance of one of our 200 experiments. The key detail to notice is that the line of best fit has a clear downward slope, which is exactly what we want: because the plot shows loss, a downward trend means that model performance tends to improve as the iterations progress.
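
The same check can be made numerically by computing the slope of the least-squares line through the loss values. The sketch below uses only the standard library; the loss values are made up for illustration and are not results from this experiment.

```python
def best_fit_slope(xs, ys):
    """Ordinary least-squares slope of ys regressed on xs."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    return cov_xy / var_x


# Made-up losses for illustration only (lower is better).
iterations = list(range(1, 9))
losses = [0.30, 0.28, 0.29, 0.25, 0.26, 0.22, 0.23, 0.20]

slope = best_fit_slope(iterations, losses)
print(f"slope per iteration: {slope:.4f}")
print("improving" if slope < 0 else "not improving")
```

A negative slope says the average loss falls as iterations accumulate, which is the numeric counterpart of the downward line in the plot.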

Parameter Selection Assessment

One of the coolest parts of Bayesian optimization is seeing how parameter selection is optimized.

For each model and for each model’s parameters, we can generate a two-panel visual.

For numeric parameters, such as n_estimators or learning_rate, the two-panel visual includes:

Parameter selection KDE, overlaid on a theoretical distribution KDE
Parameter selection by iteration scatter plot, with line of best fit

For categorical parameters, such as loss function, the two-panel visual includes:

Parameter selection and theoretical distribution bar chart
Parameter selection by iteration scatter plot, faceted by parameter category

Let’s review the parameter selection panels for KNeighborsClassifier():

The built-in method model_param_plot() cycles through each of the estimator’s parameters and presents the appropriate panel given each parameter’s type. Let’s look at a numeric parameter and a categorical parameter separately.

First, we’ll review the panel for the numeric parameter n_neighbors:

On the left, we can see two overlapping kernel density plots summarizing the actual parameter selections and the theoretical parameter distribution. The purple line corresponds to the theoretical distribution, and, as expected, this curve is smooth and evenly distributed. The teal line corresponds to the actual parameter selections, and it is evident that hyperopt prefers values between 5 and 10.

On the right, the scatter plot visualizes the n_neighbors value selections over the iterations. There is a slight downward slope to the line of best fit, as the Bayesian optimization process homes in on values around 7.

Next, we’ll review the panel for the categorical parameter algorithm:

On the left, we see a bar chart displaying the counts of parameter selections, faceted by actual parameter selections and selections from the theoretical distribution. The purple bars, representing selections from the theoretical distribution, are more evenly distributed than the teal bars, which represent the actual selections.

On the right, the scatter plot again visualizes the algorithm value selection over the iterations. There is a clear decrease in selection of “ball_tree” and “auto” in favor of “kd_tree” and “brute” over the iterations.

Model Reinstantiation

Top Model Identification

Our Machine() object has a built-in method called top_bayes_optim_models(), which identifies the best model for each estimator type based on the results in our Bayesian optimization log.

With this method, we can identify the top N models for each estimator based on mean cross-validation score. In this experiment, top_bayes_optim_models() returns the dictionary below, which tells us that LogisticRegression() identified its top model on iteration 30, XGBClassifier() on iteration 61, RandomForestClassifier() on iteration 46, and KNeighborsClassifier() on iteration 109.
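
Conceptually, this is a group-and-rank over the log: group records by estimator, sort each group by mean cross-validation score, and keep the first N iteration numbers. The sketch below shows that idea with the standard library. The function `top_models`, the field names, and the records are made up for illustration and are not mlmachine's implementation or this experiment's results.

```python
from collections import defaultdict


def top_models(log_rows, num_models=1):
    """Map each estimator to the iterations with its highest mean CV scores."""
    by_estimator = defaultdict(list)
    for row in log_rows:
        by_estimator[row["estimator"]].append(row)
    return {
        name: [
            r["iteration"]
            for r in sorted(
                rows, key=lambda r: r["mean_score"], reverse=True
            )[:num_models]
        ]
        for name, rows in by_estimator.items()
    }


# Made-up log records for illustration only.
log_rows = [
    {"estimator": "KNeighborsClassifier", "iteration": 0, "mean_score": 0.71},
    {"estimator": "KNeighborsClassifier", "iteration": 1, "mean_score": 0.78},
    {"estimator": "KNeighborsClassifier", "iteration": 2, "mean_score": 0.75},
    {"estimator": "RandomForestClassifier", "iteration": 0, "mean_score": 0.80},
    {"estimator": "RandomForestClassifier", "iteration": 1, "mean_score": 0.79},
]

print(top_models(log_rows, num_models=2))
```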

Putting the Models to Use

To reinstantiate a model, we leverage our Machine() object’s built-in method BayesOptimClassifierBuilder(). To use this method, we pass in our results log and specify an estimator class and iteration number. This will instantiate a model object with the parameters stored on that record of the log:
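
The underlying idea, rebuilding an estimator from parameters stored as text in a log, can be sketched with plain scikit-learn. This does not use or reproduce BayesOptimClassifierBuilder(); the record layout and the parameter values are assumptions for illustration.

```python
import ast

from sklearn.ensemble import RandomForestClassifier

# Hypothetical log record; the field names and parameter values are
# illustrative assumptions, with parameters stored as a string as in a CSV.
record = {
    "estimator": "RandomForestClassifier",
    "iteration": 0,
    "params": "{'n_estimators': 50, 'max_depth': 4, 'criterion': 'gini'}",
}

# Map estimator names to classes explicitly rather than using eval().
ESTIMATORS = {"RandomForestClassifier": RandomForestClassifier}

params = ast.literal_eval(record["params"])  # safely parse the dict literal
model = ESTIMATORS[record["estimator"]](**params)

print(model.get_params()["n_estimators"], model.get_params()["max_depth"])
```

Using `ast.literal_eval` with an explicit name-to-class mapping avoids executing arbitrary text from the log file.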

Here we see the model parameters:

The models instantiated with BayesOptimClassifierBuilder() use .fit() and .predict() in a way that should feel quite familiar.

Let’s finish this article with a very basic model performance evaluation. We will fit this RandomForestClassifier() on the training data and labels, generate predictions with the training data, and evaluate the model’s performance by comparing these predictions to the ground-truth:
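
The fit, predict, and compare sequence described here looks like the following in plain scikit-learn. This is a sketch only: the synthetic dataset stands in for the processed training data, and the parameter values are illustrative rather than the tuned values from the experiment.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

# Synthetic stand-in for the processed training data and labels.
X_train, y_train = make_classification(
    n_samples=200, n_features=10, random_state=0
)

# Parameter values are illustrative, not the tuned values from the article.
model = RandomForestClassifier(n_estimators=50, max_depth=4, random_state=0)
model.fit(X_train, y_train)

# Predict on the same data the model was fit on, then compare to the labels.
y_pred = model.predict(X_train)
train_accuracy = accuracy_score(y_train, y_pred)
print(f"training accuracy: {train_accuracy:.3f}")
```

Accuracy measured on the data the model was fit on tends to be optimistic, so scoring on held-out data gives a more realistic estimate.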

Anyone familiar with Scikit-learn should feel right at home.

In Closing

mlmachine makes it easy to efficiently optimize the hyperparameters for multiple estimators in one shot, and facilitates the visual inspection of model improvement and parameter selection.

Check out the GitHub repository, and stay tuned for additional column entries.

Translated from: https://towardsdatascience.com/mlmachine-hyperparameter-tuning-with-bayesian-optimization-2de81472e6d