The Wide and Deep Learning Model (Translation + TensorFlow Source Code Analysis)
Original post, November 3, 2017, 22:14:47. Author: DivinerShi. Tags: deep learning / Google / TensorFlow
This article walks through Google's Wide & Deep Learning model. It starts from the original paper, analyzing it step by step until the paper is understood, and then turns to the officially open-sourced TensorFlow code, explaining how each kind of feature is implemented and how the model itself is constructed.
First, a picture of the model (Figure 1 of the paper, shown as an image in the original post).
1. Paper Translation
ABSTRACT
Generalized linear models with nonlinear feature transformations are widely used for large-scale regression and classification problems with sparse inputs. Memorization of feature interactions through a wide set of cross-product feature transformations are effective and interpretable, while generalization requires more feature engineering effort. With less feature engineering, deep neural networks can generalize better to unseen feature combinations through low-dimensional dense embeddings learned for the sparse features. However, deep neural networks with embeddings can over-generalize and recommend less relevant items when the user-item interactions are sparse and high-rank. In this paper, we present Wide & Deep learning—jointly trained wide linear models and deep neural networks—to combine the benefits of memorization and generalization for recommender systems. We productionized and evaluated the system on Google Play, a commercial mobile app store with over one billion active users and over one million apps. Online experiment results show that Wide & Deep significantly increased app acquisitions compared with wide-only and deep-only models. We have also open sourced our implementation in TensorFlow.
Translation:
Generalized linear models with nonlinear transformations of sparse input features are widely used for large-scale regression and classification problems. Memorizing feature interactions through a wide set of cross-product feature transformations is effective and interpretable, but it demands a lot of feature engineering. In contrast, by learning low-dimensional dense embeddings for the sparse features and feeding them into a deep network, deep learning can generalize to unseen feature combinations with far less feature engineering. However, when the user-item interactions are sparse and high-rank, deep models with embeddings tend to over-generalize and recommend items of low relevance. In this paper we present Wide & Deep learning, a jointly trained combination of a wide linear model and a deep neural network, to combine the benefits of memorization and generalization for recommender systems. We productionized and evaluated the system on Google Play; online experiments show that, compared with wide-only and deep-only models, the Wide & Deep model significantly increased app acquisitions. We have also open sourced the implementation in TensorFlow.
Commentary:
The paper combines a linear model over nonlinear (crossed) features with a deep network over embedded features, and optimizes both with joint training. The idea: a linear model with cross-product features can only capture nonlinearity that has already appeared in the historical data (explicit nonlinearity), while the deep network can discover nonlinearity that has never appeared (implicit nonlinearity).
INTRODUCTION
A recommender system can be viewed as a search ranking system, where the input query is a set of user and contextual information, and the output is a ranked list of items. Given a query, the recommendation task is to find the relevant items in a database and then rank the items based on certain objectives, such as clicks or purchases.
One challenge in recommender systems, similar to the general search ranking problem, is to achieve both memorization and generalization. Memorization can be loosely defined as learning the frequent co-occurrence of items or features and exploiting the correlation available in the historical data. Generalization, on the other hand, is based on transitivity of correlation and explores new feature combinations that have never or rarely occurred in the past. Recommendations based on memorization are usually more topical and directly relevant to the items on which users have already performed actions. Compared with memorization, generalization tends to improve the diversity of the recommended items. In this paper, we focus on the apps recommendation problem for the Google Play store, but the approach should apply to generic recommender systems.
Translation:
A recommender system can be viewed as a search ranking system, where the input query is a set of user and contextual information and the output is a ranked list of items. Given a query, the recommendation task is to find the relevant items in a database and then rank them according to certain objectives, such as clicks or purchases.
As in the general search ranking problem, one challenge in recommender systems is to achieve both memorization and generalization. Memorization can be loosely defined as learning the co-occurrence frequency of items or features and exploiting the correlations available in the historical data. Generalization, on the other hand, is based on transitivity of correlation and explores feature combinations that have never or rarely occurred in the past. Recommendations based on memorization are usually more topical and directly relevant to items the user has already acted on. Compared with memorization, generalization tends to improve the diversity of the recommended items. In this paper we focus on the app recommendation problem for the Google Play store, but the approach applies to recommender systems in general.
Commentary:
I have left memorization and generalization untranslated because I do not have a good rendering for them: memorization digs out the explicit nonlinearity that is already visible in the historical data, while generalization is about finding the implicit combinations that have not appeared before.
For massive-scale online recommendation and ranking systems in an industrial setting, generalized linear models such as logistic regression are widely used because they are simple, scalable and interpretable. The models are often trained on binarized sparse features with one-hot encoding. E.g., the binary feature "user_installed_app=netflix" has value 1 if the user installed Netflix. Memorization can be achieved effectively using cross-product transformations over sparse features, such as AND(user_installed_app=netflix, impression_app=pandora), whose value is 1 if the user installed Netflix and then is later shown Pandora. This explains how the co-occurrence of a feature pair correlates with the target label. Generalization can be added by using features that are less granular, such as AND(user_installed_category=video, impression_category=music), but manual feature engineering is often required. One limitation of cross-product transformations is that they do not generalize to query-item feature pairs that have not appeared in the training data.
Translation:
In industry, for massive-scale online recommendation and ranking systems, generalized linear models such as logistic regression are widely used: they are simple, scalable and interpretable. They are usually fed binarized sparse features with one-hot encoding; for example, the binary feature "user_installed_app=netflix" has value 1 if the user installed Netflix. Memorization can be achieved by applying cross-product transformations to sparse features, i.e. building crossed features: AND(user_installed_app=netflix, impression_app=pandora) is 1 when the user installed Netflix and was later shown Pandora, and 0 otherwise. Such a crossed feature captures how the co-occurrence of a feature pair correlates with the target label. Generalization can be added with less granular features such as AND(user_installed_category=video, impression_category=music), but this requires manual feature engineering. One limitation of cross-product transformations is that they cannot generalize to query-item feature pairs that never appeared in the training data.
Commentary:
This part explains the features the linear model needs: one-hot features, which are sparse, and crossed features, which are simply ANDs, i.e. Cartesian products between features. They let the linear model capture explicit nonlinearity.
Embedding-based models, such as factorization machines[5] or deep neural networks, can generalize to previously unseen query-item feature pairs by learning a low-dimensional dense embedding vector for each query and item feature, with less burden of feature engineering. However, it is difficult to learn effective low-dimensional representations for queries and items when the underlying query-item matrix is sparse and high-rank, such as users with specific preferences or niche items with a narrow appeal. In such cases, there should be no interactions between most query-item pairs, but dense embeddings will lead to nonzero predictions for all query-item pairs, and thus can over-generalize and make less relevant recommendations. On the other hand, linear models with cross-product feature transformations can memorize these “exception rules” with much fewer parameters.
Translation:
Embedding-based models such as FMs or DNNs can generalize to previously unseen query-item feature pairs by learning a low-dimensional dense embedding vector for each query and item feature, with much less feature engineering. However, when the underlying query-item matrix is sparse and high-rank, e.g. users with very specific preferences or niche items with narrow appeal, it is hard to learn effective low-dimensional representations for queries and items. In such cases most query-item pairs should have no interaction, yet dense embeddings still produce nonzero predictions for all query-item pairs, so the model can over-generalize and make less relevant recommendations. On the other hand, a linear model with cross-product features can memorize these "exception rules" with far fewer parameters.
Commentary:
This describes the features the deep network needs: embedding features, i.e. mapping sparse data into dense, low-dimensional vectors.
In this paper, we present the Wide & Deep learning framework to achieve both memorization and generalization in one model, by jointly training a linear model component and a neural network component as shown in Figure 1.
The main contributions of the paper include:
- The Wide & Deep learning framework for jointly training feed-forward neural networks with embeddings and linear model with feature transformations for generic recommender systems with sparse inputs.
- The implementation and evaluation of the Wide & Deep recommender system productionized on Google Play, a mobile app store with over one billion active users and over one million apps.
- We have open-sourced our implementation along with a high-level API in TensorFlow.
While the idea is simple, we show that the Wide & Deep framework significantly improves the app acquisition rate on the mobile app store, while satisfying the training and serving speed requirements.
Translation:
In this paper we present the Wide & Deep learning framework, which achieves both memorization and generalization in a single model by jointly training a linear model component and a neural network component.
The main contributions of the paper:
1. Joint training of a deep network over embeddings and a linear model over crossed features.
2. The Wide & Deep system was productionized on Google Play.
3. The implementation was open-sourced in TensorFlow.
Although the idea is simple, Wide & Deep significantly improved the app acquisition rate on the mobile app store while still meeting the training and serving speed requirements.
RECOMMENDER SYSTEM OVERVIEW
An overview of the app recommender system is shown in Figure 2. A query, which can include various user and contextual features, is generated when a user visits the app store. The recommender system returns a list of apps (also referred to as impressions) on which users can perform certain actions such as clicks or purchases. These user actions, along with the queries and impressions, are recorded in the logs as the training data for the learner. Since there are over a million apps in the database, it is intractable to exhaustively score every app for every query within the serving latency requirements (often O(10) milliseconds). Therefore, the first step upon receiving a query is retrieval. The retrieval system returns a short list of items that best match the query using various signals, usually a combination of machine-learned models and human-defined rules. After reducing the candidate pool, the ranking system ranks all items by their scores. The scores are usually P(y|x), the probability of a user action label y given the features x, including user features (e.g., country, language, demographics), contextual features (e.g., device, hour of the day, day of the week), and impression features (e.g., app age, historical statistics of an app). In this paper, we focus on the ranking model using the Wide & Deep learning framework.
Figure 2 gives an overview of the app recommender system.
query: the set of user and contextual features generated when a user visits the app store.
The recommender system returns a list of apps (also called impressions), on which the user can then perform actions such as clicks or purchases. These user actions, together with the queries and impressions, are logged as training data.
Since there are over a million apps in the database, exhaustively scoring every app for every query is infeasible within the serving latency requirements (on the order of 10 ms). Therefore, the first step upon receiving a query is retrieval: the retrieval system returns a short list of items that best match the query, using a combination of machine-learned models and human-defined rules. After the candidate pool has been reduced, the ranking system ranks the remaining items by score. The score is usually P(y|x), the probability of a user action label y given features x, which include user features (country, language, demographics, ...), contextual features (device, hour of the day, day of the week, ...) and impression features (app age, historical statistics of an app, ...). In this paper we focus on using the Wide & Deep model in the ranking system.
WIDE&DEEP LEARNING
The Wide Component
The wide component is a generalized linear model of the form y = wT x + b, as illustrated in Figure 1 (left). y is the prediction, x = [x1, x2, ..., xd] is a vector of d features, w = [w1, w2, ..., wd] are the model parameters and b is the bias. The feature set includes raw input features and transformed features. One of the most important transformations is the cross-product transformation, which is defined as:
where cki is a boolean variable that is 1 if the i-th feature is part of the k-th transformation φk, and 0 otherwise. For binary features, a cross-product transformation (e.g., "AND(gender=female, language=en)") is 1 if and only if the constituent features ("gender=female" and "language=en") are all 1, and 0 otherwise. This captures the interactions between the binary features, and adds nonlinearity to the generalized linear model.
Translation:
The wide component of the model is a generalized linear model of the form y = w^T x + b, as shown on the left of Figure 1. y is the prediction, x = [x1, x2, ..., xd] is a vector of d features, w = [w1, w2, ..., wd] are the model parameters and b is the bias. The feature set includes the raw input features and transformed features, the most important of which is the cross-product transformation, defined as

    φ_k(x) = ∏_{i=1}^{d} x_i^{c_ki},   c_ki ∈ {0, 1}    (1)

where c_ki is a boolean variable that is 1 if the i-th feature is part of the k-th transformation φ_k and 0 otherwise. For binary features, a cross-product feature such as AND(gender=female, language=en) is 1 if and only if gender=female and language=en are both 1, and 0 otherwise. This captures interactions between the binary features and adds nonlinearity to the generalized linear model.
Commentary:
In short, this just generates crossed features.
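As a quick illustration, here is a minimal, hypothetical sketch (plain Python, not the paper's code) of the cross-product transformation φ_k(x) = ∏ x_i^{c_ki} for binary features: the result is 1 only when every feature participating in the k-th cross is 1.

def cross_product_transform(x, cross_features):
    """x: dict mapping binary feature names to 0/1; cross_features: the names in the k-th cross."""
    result = 1
    for name in cross_features:
        result *= x.get(name, 0)   # any missing/zero feature makes the whole cross 0
    return result

# Example: AND(gender=female, language=en)
x = {"gender=female": 1, "language=en": 1, "user_installed_app=netflix": 0}
print(cross_product_transform(x, ["gender=female", "language=en"]))  # -> 1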
The Deep Component
The deep component is a feed-forward neural network, shown on the right of Figure 1. For categorical features, the raw inputs are strings such as "language=en". These sparse, high-dimensional categorical features are first converted into low-dimensional dense real-valued vectors, i.e. embedding vectors. The embeddings are initialized randomly and updated by back-propagation during training. Once the high-dimensional features have been converted to embeddings, the low-dimensional vectors are fed into the hidden layers of the network, each of which computes

    a^(l+1) = f(W^(l) a^(l) + b^(l))

where l is the layer index, f is the activation function (usually ReLU), and a^(l), b^(l), W^(l) are the activations, bias and weights of the l-th layer.
Commentary:
In other words, the categorical inputs are strings, so they have to be converted and embedded first; the model itself is a fully connected network.
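To make the hidden-layer formula concrete, here is a NumPy sketch with hypothetical layer sizes (the ~1200-dimensional concatenated input feeding a 1024-unit ReLU layer); this is an illustration, not the paper's code.

import numpy as np

def relu(z):
    return np.maximum(0.0, z)

a_l = np.random.rand(1200)                 # a^(l): activations of layer l (here, the concatenated input)
W_l = np.random.randn(1024, 1200) * 0.01   # W^(l): weights of layer l
b_l = np.zeros(1024)                       # b^(l): bias of layer l
a_next = relu(W_l @ a_l + b_l)             # a^(l+1) = f(W^(l) a^(l) + b^(l))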
Joint Training of Wide & Deep Model
For joint training, the weighted sum of the log odds from the wide component and the deep component is used as the prediction and fed into a common logistic loss. Note that joint training is different from ensembling. In an ensemble, the models are trained independently without knowing about each other and their predictions are only combined at the end; with joint training, the two parts are trained together and all parameters are optimized simultaneously. This also affects model size: in an ensemble, because the models are independent, each individual model must be larger (more features, more feature engineering) to reach reasonable accuracy, whereas with joint training each part only needs to complement the other's weaknesses.
Joint training of a Wide & Deep model back-propagates the gradients of the output error to both the wide and the deep part simultaneously using mini-batch stochastic optimization. In the experiments, FTRL with L1 regularization is used as the optimizer for the wide part and AdaGrad for the deep part.
The combined model is shown in Figure 1 (center). For a logistic regression problem, the model's prediction is:

    P(Y=1|x) = σ(w_wide^T [x, φ(x)] + w_deep^T a^(lf) + b)    (3)

where Y is the binary class label, σ(·) is the sigmoid function, φ(x) are the cross-product transformations of the original features x, b is the bias term, w_wide is the vector of wide-model weights, and w_deep are the weights applied to the final hidden-layer activations a^(lf) of the deep part.
Commentary:
Two things here: joint training simply trains both parts together, and the optimizers differ, FTRL for the wide part and AdaGrad for the deep part. Finally the outputs of the two parts are added; w_deep is just the weight from the last hidden layer to the output unit.
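A small NumPy sketch of equation (3), with made-up dimensions (50 wide features including crosses, a 256-unit final hidden layer); purely illustrative.

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

x_and_phi = np.random.rand(50)    # [x, phi(x)]: raw plus cross-product features
a_lf = np.random.rand(256)        # final hidden-layer activations of the deep part
w_wide = np.random.randn(50)
w_deep = np.random.randn(256)
b = 0.0

p = sigmoid(w_wide @ x_and_phi + w_deep @ a_lf + b)   # P(Y=1|x), equation (3)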
Data Generation
In this stage, user and app impression data within a period of time are used to generate training data. Each example corresponds to one impression. The label is app acquisition: 1 if the impressed app was installed, and 0 otherwise. Vocabularies, which are tables mapping categorical feature strings to integer IDs, are also generated in this stage. The system computes the ID space for all the string features that occurred more than a minimum number of times. Continuous real-valued features are normalized to [0, 1] by mapping a feature value x to its cumulative distribution function P(X ≤ x), divided into nq quantiles. The normalized value is (i−1)/(nq−1) for values in the i-th quantile. Quantile boundaries are computed during data generation.
Translation:
App recommendation consists of three stages: data generation, model training and model serving, as shown in Figure 3.
In the data generation stage, user and app impression data from a period of time are used to build training data. Each example corresponds to one impression; the label is app acquisition: 1 if the impressed app was installed, 0 otherwise.
Vocabularies are tables that map categorical feature strings to integer IDs. The system computes an ID space for every string feature that occurs more than a minimum number of times. Continuous real-valued features are normalized to [0, 1] by mapping a value x to its cumulative distribution P(X ≤ x), discretized into nq quantiles; the quantile boundaries are also computed in this stage.
Commentary:
The whole pipeline is split into three parts. Part one, data generation: build crossed features for the linear model, build embedding features for the deep model, and convert string-valued categorical features to integers.
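A sketch of the quantile normalization described above, mapping a value in the i-th of nq quantiles to (i−1)/(nq−1); the boundary computation here is my own illustration, not necessarily the production pipeline.

import numpy as np

def quantile_normalize(values, nq=10):
    """Map each value in the i-th of nq quantiles to (i-1)/(nq-1)."""
    boundaries = np.quantile(values, [q / nq for q in range(1, nq)])   # nq-1 inner boundaries
    idx = np.searchsorted(boundaries, values, side="right")            # 0-based quantile index
    return idx / (nq - 1)

ages = np.array([17, 23, 35, 42, 58, 71])
print(quantile_normalize(ages, nq=4))   # values in [0, 1]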
Model Training
The model structure we used in the experiment is shown in Figure 4. During training, our input layer takes in training data and vocabularies and generate sparse and dense features together with a label. The wide component consists of the cross-product transformation of user installed apps and impression apps. For the deep part of the model, A 32 dimensional embedding vector is learned for each categorical feature. We concatenate all the embeddings together with the dense features, resulting in a dense vector of approximately 1200 dimensions. The concatenated vector is then fed into 3 ReLU layers, and finally the logistic output unit. The Wide & Deep models are trained on over 500 billion examples. Every time a new set of training data arrives, the model needs to be re-trained. However, retraining from scratch every time is computationally expensive and delays the time from data arrival to serving an updated model.
To tackle this challenge, we implemented a warm-starting system which initializes a new model with the embeddings and the linear model weights from the previous model. Before loading the models into the model servers, a dry run of the model is done to make sure that it does not cause problems in serving live traffic. We empirically validate the model quality against the previous model as a sanity check.
Translation:
The model structure used in the experiments is shown in Figure 4. During training, the input layer takes in the training data and the vocabularies and produces sparse and dense features together with a label. The wide component consists of the cross-product transformation of user-installed apps and impression apps. For the deep component, a 32-dimensional embedding vector is learned for each categorical feature. All the embeddings are concatenated with the dense features, giving a dense vector of roughly 1200 dimensions, which is then fed through three ReLU layers and finally into the logistic output unit.
The Wide & Deep models are trained on over 500 billion examples. Every time a new batch of training data arrives, the model must be retrained, but retraining from scratch each time is expensive and delays the time from data arrival to serving an updated model. To address this, a warm-starting system initializes the new model with the embeddings and linear-model weights of the previous model.
Before loading a model onto the model servers, a dry run is performed to make sure it will not cause problems when serving live traffic, and its quality is validated against the previous model as a sanity check.
Model Serving
Once the model is trained and verified, we load it into the model servers. For each request, the servers receive a set of app candidates from the app retrieval system and user features to score each app. Then, the apps are ranked from the highest scores to the lowest, and we show the apps to the users in this order. The scores are calculated by running a forward inference pass over the Wide & Deep model. In order to serve each request on the order of 10 ms, we optimized the performance using multithreading parallelism by running smaller batches in parallel, instead of scoring all candidate apps in a single batch inference step.
Translation:
Once the model has been trained and verified, it is loaded onto the model servers. For each request, a server receives a set of candidate apps from the app retrieval system together with the user features, scores each app with the model, ranks the apps from highest to lowest score, and shows them to the user in that order.
To serve each request within about 10 ms, instead of scoring all candidate apps in a single batch inference step, the work is parallelized with multithreading, running smaller batches in parallel.
Experiments
The remainder of the paper is the experiments section, which is not translated here.
Code Analysis
The TensorFlow source for this model seems to have been updated recently and uses different modules than before: feature construction used to rely on tf.contrib, while the current code uses tf.feature_column, and the model moved from tf.contrib.learn.DNNLinearCombinedClassifier to tf.estimator.DNNLinearCombinedClassifier. I tried it and the old code still runs, so both versions coexist (TF is getting a bit bloated). I will analyze the version I downloaded; the older version has already been analyzed very thoroughly here: http://geek.csdn.net/news/detail/235465.
Data
Census income data.
The dataset contains both continuous and categorical columns; the last column, after discretization, is used as the label.
Viewed in a Jupyter notebook, the data looks roughly as in the screenshots of the original post.
I picked out a few categorical columns (years of education, occupation, native country) and looked at them: some have many distinct values, others only a few.
Data input
The data is read with pandas and passed in directly as raw data, with the original column names as keys; all subsequent feature engineering is built on top of these keys.
See the figure in the original post for details.
特征工程
Feature_column模塊自帶的函數(shù)有這么幾個(gè):
crossed_column
numeric_column
bucketized_column
categorical_column_with_hash_bucket
categorical_column_with_vocabulary_file
categorical_column_with_vocabulary_list
categorical_column_with_identity
weighted_categorical_column
indicator_column
crossed_column builds crossed features; numeric_column handles real-valued features; bucketized_column discretizes continuous features; categorical_column_with_hash_bucket hashes categorical values into buckets; categorical_column_with_vocabulary_file keeps all values of a categorical feature in a file; categorical_column_with_vocabulary_list keeps them in a Python list; categorical_column_with_identity returns an ID equal to the feature value itself; weighted_categorical_column attaches weights to categorical values; indicator_column one-hot encodes a categorical feature.
Below I walk through these methods as they are used in the demo.
1. For categorical features with few distinct values, the demo uses tf.feature_column.categorical_column_with_vocabulary_list() to map the strings to integers. Take gender: the raw values are Female or Male, so we can write
gender = tf.feature_column.categorical_column_with_vocabulary_list(
    "gender", ["Female", "Male"])

The strings Female and Male are encoded by their position in the vocabulary, starting from 0; here Female is 0 and Male is 1.
categorical_column_with_vocabulary_list() also has an OOV (out-of-vocabulary) option: values that do not appear in the vocabulary we define can be sent to dedicated OOV buckets, as in the sketch below. Under the hood this method is just a hash table mapping strings to ints.
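For example (a small sketch; num_oov_buckets is the parameter that reserves extra IDs for out-of-vocabulary values):

gender = tf.feature_column.categorical_column_with_vocabulary_list(
    "gender", ["Female", "Male"],
    num_oov_buckets=1)   # Female -> 0, Male -> 1, anything else -> the single OOV bucket (ID 2)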
2. For categorical features whose set of values is unknown, or which have a very large number of values, use tf.feature_column.categorical_column_with_hash_bucket(). The idea is the same as categorical_column_with_vocabulary_list, but since we do not know the possible values we cannot define a vocabulary, so the values are hashed directly into buckets; each possible value is hashed to an integer ID. For example:

occupation = tf.feature_column.categorical_column_with_hash_bucket(
    "occupation", hash_bucket_size=1000)

This hashes the values of occupation into 1000 buckets numbered 0 to 999, so each occupation value is mapped to one of those integers.
if self.dtype == dtypes.string:
  sparse_values = input_tensor.values
else:
  sparse_values = string_ops.as_string(input_tensor.values)
sparse_id_values = string_ops.string_to_hash_bucket_fast(
    sparse_values, self.hash_bucket_size, name='lookup')

Under the hood it converts the value to a string if needed and then hashes it; in effect it computes:
output_id = Hash(input_feature_string) % bucket_size
3. Continuous variables
There is not much to say about continuous variables: they are simply converted to floating-point values with numeric_column, as in the sketch below.
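A sketch of those columns in the spirit of the census demo (my reconstruction; the exact column set in the demo may differ):

age            = tf.feature_column.numeric_column("age")
education_num  = tf.feature_column.numeric_column("education_num")
capital_gain   = tf.feature_column.numeric_column("capital_gain")
capital_loss   = tf.feature_column.numeric_column("capital_loss")
hours_per_week = tf.feature_column.numeric_column("hours_per_week")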
4. Continuous variables with uneven distributions
For continuous variables whose density varies a lot across ranges, we can bucketize them with tf.feature_column.bucketized_column().
A continuous feature is turned into a discrete one via bucketization; boundaries is a list of floats and must be strictly increasing, as in the code below.
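A reconstructed sketch consistent with the 11 buckets described next (10 increasing boundaries over age; the exact boundary values are an assumption in the style of the census demo):

age_buckets = tf.feature_column.bucketized_column(
    age, boundaries=[18, 25, 30, 35, 40, 45, 50, 55, 60, 65])
# 10 boundaries -> 11 buckets: age 34 falls into bucket 3, age 21 into bucket 1.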
這里就是講age按給定的boundaries分成11個(gè)區(qū)域,比如樣本的age是34,那么輸出的就是3,age是21,那么輸出的就是1。
5. Crossed features
Crossed features are meant to capture nonlinear interactions; use tf.feature_column.crossed_column(). For example:
tf.feature_column.crossed_column(
["education", "occupation"], hash_bucket_size=1000)
This crosses education with occupation and then hashes the result.
Below is an example from the source code:
SparseTensor referred by first key:
  shape = [2, 2]
  {[0, 0]: "a", [1, 0]: "b", [1, 1]: "c"}
SparseTensor referred by second key:
  shape = [2, 1]
  {[0, 0]: "d", [1, 0]: "e"}
then crossed feature will look like:
  shape = [2, 2]
  {[0, 0]: Hash64("d", Hash64("a")) % hash_bucket_size,
   [1, 0]: Hash64("e", Hash64("b")) % hash_bucket_size,
   [1, 1]: Hash64("e", Hash64("c")) % hash_bucket_size}
這里的[0,0]表示的是輸入batch_size含有值的坐標(biāo)(就是sparseTensor,tensorflow里的sparseTensor定義就是由三個(gè)denseTensor組成,一個(gè)id,用于表示那個(gè)位置有值,一個(gè)value,用于表示這個(gè)位置上的值是多少,還有一個(gè)shape,用于表示數(shù)據(jù)的shape),像第一個(gè)key第一個(gè)樣本只有[0,0],即只有第一個(gè)位置有值,第二個(gè)樣本有[1,0],[1,1],那么說(shuō)明其第一個(gè)維度和第二個(gè)維度都有值。
借用下前面提到過(guò)的那位大神畫的圖
本例的交叉特征做的就是這么回事,一行表示一個(gè)樣本。
6.indicator特征,因?yàn)閐nn是不能直接輸入sparseColumn的,怎么說(shuō)呢,之前那些類別特征處理好后,全是將string轉(zhuǎn)化成了int,但是針對(duì)每個(gè)取值返回的還是一個(gè)整形的id值,我們不可能直接將該id傳入網(wǎng)絡(luò),但是線性模型可以直接將這類特征做embedding,來(lái)實(shí)現(xiàn)線性模型。
接著看,具體的方法是tf.feature_column.indicator_column(),該方法主要講一些類別特征進(jìn)行one-hot編碼,如果是多值的就進(jìn)行multi-hot編碼,底層調(diào)用的就是_IndicatorColumn()類,其實(shí)現(xiàn)就是一個(gè)one-hot()
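A usage sketch of indicator_column (the vocabulary below is illustrative, not necessarily the demo's exact list):

relationship = tf.feature_column.categorical_column_with_vocabulary_list(
    "relationship",
    ["Husband", "Not-in-family", "Wife", "Own-child", "Unmarried", "Other-relative"])
# One-hot (multi-hot for multivalent columns) so the DNN can consume the categorical column.
relationship_one_hot = tf.feature_column.indicator_column(relationship)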
If the feature is multivalent, the individual one-hot encodings are collapsed before being returned:

return math_ops.reduce_sum(one_hot_id_tensor, axis=[-2])
Embedding_column
tf.feature_column.embedding_column(native_country, dimension=8)

Looking at the underlying implementation, it essentially builds a table and looks embedding vectors up from it.
As the figure in the original post illustrates, the idea is simply to fetch each ID's embedding vector from a matrix table.
The concrete implementation is in:
return _EmbeddingColumn(
categorical_column=categorical_column,
dimension=dimension,
combiner=combiner,
initializer=initializer,
ckpt_to_load_from=ckpt_to_load_from,
tensor_name_in_ckpt=tensor_name_in_ckpt,
max_norm=max_norm,
trainable=trainable)
然后其初始化了embedding矩陣
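As a sketch rather than the verbatim TF source: _EmbeddingColumn creates a [num_buckets, dimension] weight matrix; my understanding is that when no initializer is given it defaults to a truncated normal with stddev 1/sqrt(dimension).

# Sketch of the embedding table creation (names and exact arguments are assumptions).
embedding_weights = variable_scope.get_variable(
    name='embedding_weights',
    shape=(categorical_column._num_buckets, dimension),
    initializer=initializer,   # defaults to truncated_normal(stddev=1 / sqrt(dimension)) when None
    trainable=trainable,
    collections=weight_collections)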
This weight matrix is effectively part of the network's weights: if trainable is true it is trained like any other weight matrix, but at lookup time it is used as an embedding table from which each feature's embedding is fetched by ID.
The lookup itself goes through _safe_embedding_lookup_sparse(), which fetches embeddings by ID.
To keep any single matrix from getting too large, the implementation also supports partitioning, i.e. splitting one big matrix into several smaller ones, hence the partition_strategy argument, which defines two ways of fetching the data.
https://stackoverflow.com/questions/34870614/what-does-tf-nn-embedding-lookup-function-do
這里分析的挺詳細(xì),說(shuō)白了就是很多個(gè)矩陣現(xiàn)在來(lái)個(gè)id怎么去取數(shù)據(jù),那肯定是按表取,每個(gè)表取完了再去取下一個(gè)表,這就是mod;或者一個(gè)一個(gè)來(lái),這個(gè)表取一個(gè),下一個(gè)表取一個(gè),按順序依次從各個(gè)表取,就是div。
Model construction

m = tf.estimator.DNNLinearCombinedClassifier(
    model_dir=model_dir,
    linear_feature_columns=crossed_columns,
    dnn_feature_columns=deep_columns,
    dnn_hidden_units=[100, 50])
The model is built simply by calling this class, which inherits from Estimator:

class DNNLinearCombinedClassifier(estimator.Estimator)

The concrete implementation is in

_dnn_linear_combined_model_fn

which mainly does two things: set up the optimizers and assemble the model structure (the top-level output and loss are defined separately in the head).
Building the DNN part
First the input layer is built:

net = feature_column_lib.input_layer()

This creates nodes according to each feature's dimension and concatenates all the features into the output tensor. The input comes in as whole batches: the network processes a batch_size worth of data at once rather than computing one example at a time and then summing over the batch.
With the input layer in place, the hidden layers are built according to the hidden-unit sizes passed in by the user.
The only things that can be changed here are the activation function and the number of units; the weight initialization scheme is the default and no interface is exposed for it.
Finally, the output. Because this is binary classification, the output layer has a single node; head.logits_dimension gives the number of output nodes for the problem (for multi-class problems it equals the number of classes, e.g. 3 nodes for a 3-class problem). Note that there is no activation function here, because the deep logits are first added to the linear-model logits and only then passed through the sigmoid. This dnn_logits is the output of the deep part. What happens here is roughly the following:
The Linear part
The linear model simply computes y = W^T x + b. It is implemented in linear_logits = feature_column_lib.linear_model(). Unlike an ordinary linear model, it handles categorical features and real-valued features differently:
for column in sorted(feature_columns, key=lambda x: x.name):
  with variable_scope.variable_scope(None, default_name=column.name):
    ordered_columns.append(column)
    if isinstance(column, _CategoricalColumn):
      weighted_sums.append(_create_categorical_column_weighted_sum(
          column, builder, units, sparse_combiner, weight_collections, trainable))
    else:
      weighted_sums.append(_create_dense_column_weighted_sum(
          column, builder, units, weight_collections, trainable))
For categorical columns the weighted sum is implemented via an embedding lookup, and for dense columns via a matrix product, in _create_categorical_column_weighted_sum() and _create_dense_column_weighted_sum() respectively. Each column's contribution is appended to weighted_sums for aggregation.
_create_categorical_column_weighted_sum()
weight = variable_scope.get_variable(
    name='weights',
    shape=(column._num_buckets, units),  # pylint: disable=protected-access
    initializer=init_ops.zeros_initializer(),
    trainable=trainable,
    collections=weight_collections)
return _safe_embedding_lookup_sparse(
    weight,
    id_tensor,
    sparse_weights=weight_tensor,
    combiner=sparse_combiner,
    name='weighted_sum')
As you can see, it first initializes a weight matrix with zeros and then calls _safe_embedding_lookup_sparse to fetch the weights; it is really just an embedding lookup.
_create_dense_column_weighted_sum()
For the other, real-valued features the implementation is more direct: a plain matrix product.
Finally, all the weighted_sum outputs are added together and a bias is added, roughly as in the sketch below.
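A sketch of what linear_model does at the end (not the verbatim source; names follow the style of the snippets above):

# Sum the per-column weighted sums, then add a zero-initialized bias variable.
predictions_no_bias = math_ops.add_n(weighted_sums, name='weighted_sum_no_bias')
bias = variable_scope.get_variable(
    'bias_weights', shape=[units],
    initializer=init_ops.zeros_initializer(),
    trainable=trainable,
    collections=weight_collections)
predictions = nn_ops.bias_add(predictions_no_bias, bias, name='weighted_sum')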
A simple diagram (in the original post) shows the same thing.
combine
if dnn_logits is not None and linear_logits is not None:
  logits = dnn_logits + linear_logits
As the figure shows, the outputs of the two models are simply added, passed through a sigmoid, and the loss is computed with cross-entropy; all of this happens inside head_lib._binary_logistic_head_with_sigmoid_cross_entropy_loss().
The backward pass for the loss is also straightforward:
def _train_op_fn(loss):
  """Returns the op to optimize the loss."""
  ...
  if dnn_logits is not None:
    train_ops.append(
        dnn_optimizer.minimize(
            loss, ...))
  if linear_logits is not None:
    train_ops.append(
        linear_optimizer.minimize(
            loss, ...))
One thing I do not fully understand: the linear model should converge much faster than the network, so how do the two sides' convergence stay in step? Presumably, as long as regularization is done well and nothing overfits, both end up converging reasonably.
Also, this version of the Wide & Deep code differs a bit from the earlier one: some default learning rates were adjusted, and the old center_bias option is gone.
Summary