Microsoft NNI --- AutoFeatureENG
01 AutoML Overview
AutoML is not just hyperparameter tuning; it should also include automatic feature engineering. Looking back today, AutoML is a systematic framework made up of three elements: automatic feature engineering, neural architecture search (NAS), and hyperparameter tuning.
02 NNI Overview
NNI (Neural Network Intelligence) is an open-source AutoML toolkit launched by Microsoft that covers all three elements mentioned above.
Steps including feature engineering, neural architecture search (NAS), hyperparameter tuning, and model compression can all be completed with automated machine learning algorithms.
https://github.com/SpongebBob/tabular_automl_NNI
Overall, Microsoft's tools share a notable trait: the techniques are not necessarily novel, but the design is excellent. NNI's AutoFeatureENG covers essentially everything a user could wish for from automatic feature engineering.
03 NNI-AutoFeatureENG in Detail
NNI splits AutoFeatureENG into two modules: exploration and selection. Exploration mainly handles feature derivation and crossing, while selection is about how to filter features.
04 Feature Exploration
For feature derivation, Microsoft breaks the work down, in textbook fashion, into the following operations (a small encoding sketch follows the list):
count: Count encoding is based on replacing categories with their counts computed on the train set, also named frequency encoding.
target: Target encoding is based on encoding categorical variable values with the mean of target variable per value.
embedding: Regard features as sentences, generate vectors using Word2Vec.
crosscount: Count encoding on more than one dimension, similar to CTR (Click Through Rate).
aggregate: Decide the aggregation functions of the features, including min/max/mean/var.
nunique: Statistics of the number of unique values of a feature.
histstat: Statistics of feature buckets, such as histogram statistics.
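To make the first two operations concrete, here is a minimal pandas sketch of count (frequency) encoding and target encoding; the data frame and column names (df, cat_col, target) are hypothetical and not part of NNI's API.

import pandas as pd

# Hypothetical toy data: one categorical column and a binary target.
df = pd.DataFrame({
    "cat_col": ["a", "b", "a", "c", "b", "a"],
    "target":  [1, 0, 1, 0, 1, 0],
})

# Count (frequency) encoding: replace each category with its count on the train set.
df["cat_col_count"] = df["cat_col"].map(df["cat_col"].value_counts())

# Target encoding: replace each category with the mean of the target for that category.
df["cat_col_target"] = df["cat_col"].map(df.groupby("cat_col")["target"].mean())

print(df)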
Which features get crossed, which column is crossed with which, and which derivation method is applied to each column are all controlled through the search_space.json file.
NNI provides count encoding as a 1-order op, as well as crosscount encoding and aggregate statistics (min, max, var, mean, median, nunique) as 2-order ops.
For example, to search for frequency-encoding (value count) features on the columns named {"C1", …, "C26"}, we can write the search space in the following way:
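The original post showed the search space as a screenshot. As a sketch of the idea (assuming the "count" key from the operation list above is what the tabular_automl_NNI example expects), the search_space.json entry could be generated like this:

import json

# Sketch: request frequency (count) encoding on the categorical columns C1..C26.
search_space = {
    "count": ["C" + str(i) for i in range(1, 27)],
}

with open("search_space.json", "w") as f:
    json.dump(search_space, f, indent=4)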
Similarly, we can define cross frequency encoding (value counts on crossed dimensions) on the column pairs {"C1", …, "C26"} x {"C1", …, "C26"} in the following way:
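Again assuming the "crosscount" key from the list above, a sketch of the corresponding 2-order entry, crossing every column in C1..C26 with every other:

import json

# Sketch: cross count encoding over column pairs drawn from C1..C26 x C1..C26.
cols = ["C" + str(i) for i in range(1, 27)]
search_space = {
    "crosscount": [cols, cols],
}

with open("search_space.json", "w") as f:
    json.dump(search_space, f, indent=4)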
The purpose of exploration is to generate new features. You can use the get_next_parameter function to receive the feature candidates for one trial:
import nni
RECEIVED_PARAMS = nni.get_next_parameter()
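A minimal trial sketch built around that call; the "sample_feature" key and the evaluate_features helper are assumptions for illustration, not NNI's documented contract. nni.report_final_result is the standard way to send the trial's metric back to the tuner.

import nni

def evaluate_features(feature_names):
    # Hypothetical placeholder: build the candidate features, train a model on them,
    # and return a validation metric such as AUC.
    return 0.5

if __name__ == "__main__":
    RECEIVED_PARAMS = nni.get_next_parameter()
    # Assumed key: the candidate features proposed by the tuner for this trial.
    candidate_features = RECEIVED_PARAMS.get("sample_feature", [])
    score = evaluate_features(candidate_features)
    # Report the metric so the tuner can judge this set of features.
    nni.report_final_result(score)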
05 Feature Selection
To avoid an explosion of features and to prevent overfitting, there must be a selection mechanism that picks features. Here the Microsoft folks played a small trick: for feature selection they primarily promote LightGBM, an algorithm they also open-sourced themselves.
Anyone familiar with XGBoost or GBDT knows that tree-based algorithms make it easy to compute each feature's contribution to the result, so LightGBM can naturally be used for feature selection (see the sketch below). The drawback is that if the downstream model is a linear algorithm such as LR, it is unclear whether the selected features generalize. After a run completes, the output contains the value and attributes of every feature.
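As an illustration of the idea (a generic sketch, not NNI's exact implementation), ranking and selecting features by LightGBM importance might look like this; the data X, y and the cutoff top_k are made up for the example.

import lightgbm as lgb
import numpy as np
import pandas as pd

# Hypothetical toy data: 100 rows, 10 numeric features, binary target.
rng = np.random.default_rng(0)
X = pd.DataFrame(rng.normal(size=(100, 10)), columns=["f" + str(i) for i in range(10)])
y = rng.integers(0, 2, size=100)

# Train a LightGBM model and rank features by gain-based importance.
model = lgb.LGBMClassifier(n_estimators=50, importance_type="gain")
model.fit(X, y)
importance = pd.Series(model.feature_importances_, index=X.columns)

# Keep only the top_k most important features for the downstream model.
top_k = 5
selected = importance.sort_values(ascending=False).head(top_k).index.tolist()
print(selected)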
06 Summary
NNI's AutoFeature module sets a textbook-like standard for the whole industry, showing everyone how this should be done and what the modules are, and it is very convenient to use. But relying only on such a simple pattern will not necessarily achieve great results.
Suggestions to NNI
About exploration: it would be better to consider using a DNN (like xDeepFM) to extract high-order features.
About selection: there could be more intelligent options, such as an automatic selection system based on the downstream model.
Conclusion: NNI can offer users some design inspiration, and it is a good open-source project. I suggest researchers leverage it to accelerate their AI research.
Source: 如何看待微软最新发布的AutoML平台NNI? by Garvin Li (NNI review article on Zhihu).
NNI: an open source AutoML toolkit for neural architecture search, model compression and hyper-parameter tuning (NNI v2.4).