Deep Learning: Paper Translation and Notes
Paper title: Deep Learning
Source: Deep Learning, Nature, 2015
Translator: 莫陌莫墨
Deep Learning
Yann LeCun, Yoshua Bengio, Geoffrey Hinton
Abstract
Deep learning allows computational models that are composed of multiple processing layers to learn representations of data with multiple levels of abstraction. These methods have dramatically improved the state-of-the-art in speech recognition, visual object recognition, object detection and many other domains such as drug discovery and genomics. Deep learning discovers intricate structure in large data sets by using the backpropagation algorithm to indicate how a machine should change its internal parameters that are used to compute the representation in each layer from the representation in the previous layer. Deep convolutional nets have brought about breakthroughs in processing images, video, speech and audio, whereas recurrent nets have shone light on sequential data such as text and speech.
Main text
Machine-learning technology powers many aspects of modern society: from web searches to content filtering on social networks to recommendations on e-commerce websites, and it is increasingly present in consumer products such as cameras and smartphones. Machine-learning systems are used to identify objects in images, transcribe speech into text, match news items, posts or products with users’ interests, and select relevant results of search. Increasingly, these applications make use of a class of techniques called deep learning.
Conventional machine-learning techniques were limited in their ability to process natural data in their raw form. For decades, constructing a pattern-recognition or machine-learning system required careful engineering and considerable domain expertise to design a feature extractor that transformed the raw data (such as the pixel values of an image) into a suitable internal representation or feature vector from which the learning subsystem, often a classifier, could detect or classify patterns in the input.
Representation learning is a set of methods that allows a machine to be fed with raw data and to automatically discover the representations needed for detection or classification. Deep-learning methods are representation-learning methods with multiple levels of representation, obtained by composing simple but non-linear modules that each transform the representation at one level (starting with the raw input) into a representation at a higher, slightly more abstract level. With the composition of enough such transformations, very complex functions can be learned. For classification tasks, higher layers of representation amplify aspects of the input that are important for discrimination and suppress irrelevant variations. An image, for example, comes in the form of an array of pixel values, and the learned features in the first layer of representation typically represent the presence or absence of edges at particular orientations and locations in the image. The second layer typically detects motifs by spotting particular arrangements of edges, regardless of small variations in the edge positions. The third layer may assemble motifs into larger combinations that correspond to parts of familiar objects, and subsequent layers would detect objects as combinations of these parts. The key aspect of deep learning is that these layers of features are not designed by human engineers: they are learned from data using a general-purpose learning procedure.
Deep learning is making major advances in solving problems that have resisted the best attempts of the artificial intelligence community for many years. It has turned out to be very good at discovering intricate structures in high-dimensional data and is therefore applicable to many domains of science, business and government. In addition to beating records in image recognition and speech recognition, it has beaten other machine-learning techniques at predicting the activity of potential drug molecules, analysing particle accelerator data, reconstructing brain circuits, and predicting the effects of mutations in non-coding DNA on gene expression and disease. Perhaps more surprisingly, deep learning has produced extremely promising results for various tasks in natural language understanding, particularly topic classification, sentiment analysis, question answering and language translation.
We think that deep learning will have many more successes in the near future because it requires very little engineering by hand, so it can easily take advantage of increases in the amount of available computation and data. New learning algorithms and architectures that are currently being developed for deep neural networks will only accelerate this progress.
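Overview: the paper first describes how widely machine learning is applied. It then notes that conventional machine learning is limited because the raw data must be processed before it can be used as input, and that this processing is a craft requiring a great deal of experience and algorithmic knowledge. This leads to the introduction of representation learning.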
Supervised learning
The most common form of machine learning, deep or not, is supervised learning. Imagine that we want to build a system that can classify images as containing, say, a house, a car, a person or a pet. We first collect a large data set of images of houses, cars, people and pets, each labelled with its category. During training, the machine is shown an image and produces an output in the form of a vector of scores, one for each category. We want the desired category to have the highest score of all categories, but this is unlikely to happen before training. We compute an objective function that measures the error (or distance) between the output scores and the desired pattern of scores. The machine then modifies its internal adjustable parameters to reduce this error. These adjustable parameters, often called weights, are real numbers that can be seen as ‘knobs’ that define the input–output function of the machine. In a typical deep-learning system, there may be hundreds of millions of these adjustable weights, and hundreds of millions of labelled examples with which to train the machine.
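The objective function described above can be illustrated with a small sketch. It assumes a squared-error distance and made-up scores for the four categories; the paper does not fix a particular distance, and the function name `squared_error` is hypothetical.

```python
# Illustrative only: the distance measure and the score values are assumptions.
def squared_error(scores, target):
    """Sum of squared differences between output scores and desired scores."""
    return sum((s - t) ** 2 for s, t in zip(scores, target))

categories = ["house", "car", "person", "pet"]
scores = [0.1, 0.7, 0.1, 0.1]   # output of the machine for one image
target = [0.0, 1.0, 0.0, 0.0]   # desired pattern: the image shows a car

error = squared_error(scores, target)
# The predicted category is the one with the highest score.
predicted = categories[scores.index(max(scores))]
```

Training aims to adjust the weights so that `error` shrinks and the desired category ends up with the highest score.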
To properly adjust the weight vector, the learning algorithm computes a gradient vector that, for each weight, indicates by what amount the error would increase or decrease if the weight were increased by a tiny amount. The weight vector is then adjusted in the opposite direction to the gradient vector.
The objective function, averaged over all the training examples, can be seen as a kind of hilly landscape in the high-dimensional space of weight values. The negative gradient vector indicates the direction of steepest descent in this landscape, taking it closer to a minimum, where the output error is low on average.
In practice, most practitioners use a procedure called stochastic gradient descent (SGD). This consists of showing the input vector for a few examples, computing the outputs and the errors, computing the average gradient for those examples, and adjusting the weights accordingly. The process is repeated for many small sets of examples from the training set until the average of the objective function stops decreasing. It is called stochastic because each small set of examples gives a noisy estimate of the average gradient over all examples. This simple procedure usually finds a good set of weights surprisingly quickly when compared with far more elaborate optimization techniques. After training, the performance of the system is measured on a different set of examples called a test set. This serves to test the generalization ability of the machine — its ability to produce sensible answers on new inputs that it has never seen during training.
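As a minimal sketch of the SGD procedure just described, the following fits a one-weight linear model to noise-free toy data. The model, the data, the function name `sgd_linear` and the hyperparameters (learning rate, batch size, number of passes) are all assumptions chosen for illustration, not values from the paper.

```python
import random

def sgd_linear(data, lr=0.05, batch_size=4, epochs=200, seed=0):
    """Fit y = w * x + b by mini-batch stochastic gradient descent."""
    rng = random.Random(seed)
    data = list(data)
    w, b = 0.0, 0.0
    for _ in range(epochs):
        rng.shuffle(data)  # visit the examples in a new random order
        for start in range(0, len(data), batch_size):
            batch = data[start:start + batch_size]
            # Average gradient of the squared error over this small set.
            grad_w = sum(2 * (w * x + b - y) * x for x, y in batch) / len(batch)
            grad_b = sum(2 * (w * x + b - y) for x, y in batch) / len(batch)
            # Move the weights in the direction opposite to the gradient.
            w -= lr * grad_w
            b -= lr * grad_b
    return w, b

# Toy training set generated from y = 2x + 1.
train = [(x / 10, 2.0 * (x / 10) + 1.0) for x in range(-10, 11)]
w, b = sgd_linear(train)

# Generalization is measured on examples not seen during training.
test_set = [(1.5, 4.0), (-1.5, -2.0)]
test_error = sum((w * x + b - y) ** 2 for x, y in test_set) / len(test_set)
```

Each mini-batch gives only a noisy estimate of the full gradient, yet the weights still settle close to the values that generated the data.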
Many of the current practical applications of machine learning use linear classifiers on top of hand-engineered features. A two-class linear classifier computes a weighted sum of the feature vector components. If the weighted sum is above a threshold, the input is classified as belonging to a particular category.
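A two-class linear classifier of this kind can be sketched in a few lines. The weights, threshold and feature values below are arbitrary illustrative numbers, and `linear_classify` is a hypothetical name.

```python
def linear_classify(features, weights, threshold):
    """Return True if the weighted sum of the features exceeds the threshold."""
    weighted_sum = sum(w * x for w, x in zip(weights, features))
    return weighted_sum > threshold

weights = [1.0, -1.0]   # illustrative values, not learned
threshold = 0.0

in_category = linear_classify([2.0, 0.5], weights, threshold)   # sum is 1.5
not_in_category = linear_classify([0.5, 2.0], weights, threshold)  # sum is -1.5
```

The decision boundary `weighted_sum == threshold` is a hyperplane, which is why such a classifier can only split its input space into two half-spaces.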
Since the 1960s we have known that linear classifiers can only carve their input space into very simple regions, namely half-spaces separated by a hyperplane. But problems such as image and speech recognition require the input–output function to be insensitive to irrelevant variations of the input, such as variations in position, orientation or illumination of an object, or variations in the pitch or accent of speech, while being very sensitive to particular minute variations (for example, the difference between a white wolf and a breed of wolf-like white dog called a Samoyed). At the pixel level, images of two Samoyeds in different poses and in different environments may be very different from each other, whereas two images of a Samoyed and a wolf in the same position and on similar backgrounds may be very similar to each other. A linear classifier, or any other ‘shallow’ classifier operating on raw pixels, could not possibly distinguish the latter two, while putting the former two in the same category. This is why shallow classifiers require a good feature extractor that solves the selectivity–invariance dilemma: one that produces representations that are selective to the aspects of the image that are important for discrimination, but that are invariant to irrelevant aspects such as the pose of the animal. To make classifiers more powerful, one can use generic non-linear features, as with kernel methods, but generic features such as those arising with the Gaussian kernel do not allow the learner to generalize well far from the training examples. The conventional option is to hand design good feature extractors, which requires a considerable amount of engineering skill and domain expertise. But this can all be avoided if good features can be learned automatically using a general-purpose learning procedure. This is the key advantage of deep learning.
A deep-learning architecture is a multilayer stack of simple modules, all (or most) of which are subject to learning, and many of which compute non-linear input–output mappings. Each module in the stack transforms its input to increase both the selectivity and the invariance of the representation. With multiple non-linear layers, say a depth of 5 to 20, a system can implement extremely intricate functions of its inputs that are simultaneously sensitive to minute details (distinguishing Samoyeds from white wolves) and insensitive to large irrelevant variations such as the background, pose, lighting and surrounding objects.
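2. Supervised learning: the network is given many pictures of dogs (the training dataset) and told that each one is a dog (the label). After many repetitions, when it is shown a new picture of a dog, the trained network responds on its own with the result: this is a dog. How is the network actually trained? We define an objective function that measures the distance between the predicted value and the label. The learning task is to shrink the value of this objective function, that is, to bring the prediction ever closer to the true value. The objective function is a function of the weights. The paper then briefly describes an optimization method called stochastic gradient descent (SGD): at each step a small random set of training examples is drawn (in the simplest case, a single example), the partial derivatives (the gradient) of the objective function with respect to the weights are computed, and the weights are moved in the direction of the negative gradient, which is the direction in which the objective function (the error) decreases. This is repeated many times until the value of the objective function stops changing or becomes very small. At that point the iteration stops, and the current weights are what the network has learned. The result is a trained architecture with its own learned understanding, which can judge whether a new object is one it has learned before and output the result.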
Backpropagation to train multilayer architectures
From the earliest days of pattern recognition, the aim of researchers has been to replace hand-engineered features with trainable multilayer networks, but despite its simplicity, the solution was not widely understood until the mid 1980s. As it turns out, multilayer architectures can be trained by simple stochastic gradient descent. As long as the modules are relatively smooth functions of their inputs and of their internal weights, one can compute gradients using the backpropagation procedure. The idea that this could be done, and that it worked, was discovered independently by several different groups during the 1970s and 1980s.
The backpropagation procedure to compute the gradient of an objective function with respect to the weights of a multilayer stack of modules is nothing more than a practical application of the chain rule for derivatives. The key insight is that the derivative (or gradient) of the objective with respect to the input of a module can be computed by working backwards from the gradient with respect to the output of that module (or the input of the subsequent module) (Fig. 1). The backpropagation equation can be applied repeatedly to propagate gradients through all modules, starting from the output at the top (where the network produces its prediction) all the way to the bottom (where the external input is fed). Once these gradients have been computed, it is straightforward to compute the gradients with respect to the weights of each module.
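To make the chain-rule argument concrete, here is a minimal sketch for a stack of two scalar modules, checked against a finite-difference estimate. The architecture (a tanh module followed by a linear module with a squared-error objective), the input values and all function names are assumptions for illustration.

```python
import math

def forward(x, w1, w2):
    """Two-module stack: a = w1 * x, h = tanh(a), z = w2 * h."""
    a = w1 * x
    h = math.tanh(a)
    z = w2 * h
    return a, h, z

def loss(x, t, w1, w2):
    """Squared error between the output z and the target t."""
    return (forward(x, w1, w2)[2] - t) ** 2

def backward(x, t, w1, w2):
    """Work backwards from the output, applying the chain rule at each module."""
    a, h, z = forward(x, w1, w2)
    dz = 2.0 * (z - t)          # gradient w.r.t. the output of the top module
    dw2 = dz * h                # gradient w.r.t. the top module's weight
    dh = dz * w2                # gradient w.r.t. the top module's input
    da = dh * (1.0 - h * h)     # back through tanh, whose derivative is 1 - tanh^2
    dw1 = da * x                # gradient w.r.t. the bottom module's weight
    return dw1, dw2

x, t, w1, w2 = 0.5, 1.0, 0.8, -1.2   # arbitrary illustrative values
dw1, dw2 = backward(x, t, w1, w2)

# Numerical check by central finite differences.
eps = 1e-6
num_dw1 = (loss(x, t, w1 + eps, w2) - loss(x, t, w1 - eps, w2)) / (2 * eps)
num_dw2 = (loss(x, t, w1, w2 + eps) - loss(x, t, w1, w2 - eps)) / (2 * eps)
```

The gradient with respect to each module's input (`dh`, `da`) is computed first and then reused to obtain the gradient with respect to that module's weight, which mirrors the order described in the text.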
Many applications of deep learning use feedforward neural network architectures (Fig. 1), which learn to map a fixed-size input (for example, an image) to a fixed-size output (for example, a probability for each of several categories). To go from one layer to the next, a set of units compute a weighted sum of their inputs from the previous layer and pass the result through a non-linear function. At present, the most popular non-linear function is the rectified linear unit (ReLU), which is simply the half-wave rectifier $f(z) = \max(0, z)$. In past decades, neural nets used smoother non-linearities, such as $\tanh(z)$ or $1/(1 + \exp(-z))$, but the ReLU typically learns much faster in networks with many layers, allowing training of a deep supervised network without unsupervised pre-training. Units that are not in the input or output layer are conventionally called hidden units. The hidden layers can be seen as distorting the input in a non-linear way so that categories become linearly separable by the last layer (Fig. 1).
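The step from one layer to the next can be sketched as follows. The weights and inputs are arbitrary illustrative numbers, and the helper name `layer` is hypothetical. Bias terms are omitted because the text describes only a weighted sum.

```python
import math

def relu(z):
    """Rectified linear unit: max(0, z)."""
    return max(0.0, z)

def sigmoid(z):
    """Logistic function: 1 / (1 + exp(-z))."""
    return 1.0 / (1.0 + math.exp(-z))

def layer(inputs, weights, activation):
    """Each unit computes a weighted sum of the inputs, then applies the non-linearity."""
    return [activation(sum(w * x for w, x in zip(row, inputs))) for row in weights]

inputs = [1.0, -2.0]
weights = [[0.5, -0.25],   # weights of hidden unit 1 (illustrative)
           [-1.0, 1.0]]    # weights of hidden unit 2 (illustrative)

hidden = layer(inputs, weights, relu)   # weighted sums are 1.0 and -3.0
```

The second unit's negative sum is clipped to zero by the ReLU, whereas `math.tanh` or `sigmoid` would squash it smoothly instead.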
In the late 1990s, neural nets and backpropagation were largely forsaken by the machine-learning community and ignored by the computer-vision and speech-recognition communities. It was widely thought that learning useful, multistage, feature extractors with little prior knowledge was infeasible. In particular, it was commonly thought that simple gradient descent would get trapped in poor local minima, that is, weight configurations for which no small change would reduce the average error.
In practice, poor local minima are rarely a problem with large networks. Regardless of the initial conditions, the system nearly always reaches solutions of very similar quality. Recent theoretical and empirical results strongly suggest that local minima are not a serious issue in general. Instead, the landscape is packed with a combinatorially large number of saddle points where the gradient is zero, and the surface curves up in most dimensions and curves down in the remainder. The analysis seems to show that saddle points with only a few downward curving directions are present in very large numbers, but almost all of them have very similar values of the objective function. Hence, it does not much matter which of these saddle points the algorithm gets stuck at.
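A saddle point is easiest to see in two dimensions. The function below is a standard textbook example chosen for illustration, not one taken from the paper: its gradient is zero at the origin, yet the surface curves up along one axis and down along the other.

```python
def f(x, y):
    """Illustrative surface with a saddle point at the origin."""
    return x * x - y * y

def grad(x, y):
    """Analytic gradient of f."""
    return (2 * x, -2 * y)

g = grad(0.0, 0.0)        # both components are zero at the saddle point
up = f(0.1, 0.0)          # moving along x increases f
down = f(0.0, 0.1)        # moving along y decreases f
```

In a network with many weights the same picture holds with many more axes, most curving up and a few curving down.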
Interest in deep feedforward networks was revived around 2006 by a group of researchers brought together by the Canadian Institute for Advanced Research (CIFAR). The researchers introduced unsupervised learning procedures that could create layers of feature detectors without requiring labelled data. The objective in learning each layer of feature detectors was to be able to reconstruct or model the activities of feature detectors (or raw inputs) in the layer below. By ‘pre-training’ several layers of progressively more complex feature detectors using this reconstruction objective, the weights of a deep network could be initialized to sensible values. A final layer of output units could then be added to the top of the network and the whole deep system could be fine-tuned using standard backpropagation. This worked remarkably well for recognizing handwritten digits or for detecting pedestrians, especially when the amount of labelled data was very limited [36].
The first major application of this pre-training approach was in speech recognition, and it was made possible by the advent of fast graphics processing units (GPUs) that were convenient to program and allowed researchers to train networks 10 or 20 times faster. In 2009, the approach was used to map short temporal windows of coefficients extracted from a sound wave to a set of probabilities for the various fragments of speech that might be represented by the frame in the centre of the window. It achieved record-breaking results on a standard speech recognition benchmark that used a small vocabulary and was quickly developed to give record-breaking results on a large vocabulary task. By 2012, versions of the deep net from 2009 were being developed by many of the major speech groups and were already being deployed in Android phones. For smaller data sets, unsupervised pre-training helps to prevent overfitting, leading to significantly better generalization when the number of labelled examples is small, or in a transfer setting where we have lots of examples for some ‘source’ tasks but very few for some ‘target’ tasks. Once deep learning had been rehabilitated, it turned out that the pre-training stage was only needed for small data sets.
There was, however, one particular type of deep, feedforward network that was much easier to train and generalized much better than networks with full connectivity between adjacent layers. This was the convolutional neural network (ConvNet). It achieved many practical successes during the period when neural networks were out of favour and it has recently been widely adopted by the computer-vision community.
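3. The backpropagation (BP) algorithm (Backpropagation to train multilayer architectures): the core idea is the chain rule for derivatives. The paper then recounts that neural networks and backpropagation were largely abandoned by the machine-learning community, until around 2006 a group of researchers introduced unsupervised learning procedures and revived interest. As the name suggests, an unsupervised learning procedure needs no supervision: it can build layers of the network from unlabelled data. Its strength is pre-training, which initializes the weights to sensible values. In addition, for small data sets, unsupervised pre-training helps to prevent overfitting.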
Convolutional neural networks
ConvNets are designed to process data that come in the form of multiple arrays, for example a colour image composed of three 2D arrays containing pixel intensities in the three colour channels. Many data modalities are in the form of multiple arrays: 1D for signals and sequences, including language; 2D for images or audio spectrograms; and 3D for video or volumetric images. There are four key ideas behind ConvNets that take advantage of the properties of natural signals: local connections, shared weights, pooling and the use of many layers.
The architecture of a typical ConvNet (Fig. 2) is structured as a series of stages. The first few stages are composed of two types of layers: convolutional layers and pooling layers. Units in a convolutional layer are organized in feature maps, within which each unit is connected to local patches in the feature maps of the previous layer through a set of weights called a filter bank. The result of this local weighted sum is then passed through a non-linearity such as a ReLU. All units in a feature map share the same filter bank. Different feature maps in a layer use different filter banks. The reason for this architecture is twofold. First, in array data such as images, local groups of values are often highly correlated, forming distinctive local motifs that are easily detected. Second, the local statistics of images and other signals are invariant to location. In other words, if a motif can appear in one part of the image, it could appear anywhere, hence the idea of units at different locations sharing the same weights and detecting the same pattern in different parts of the array. Mathematically, the filtering operation performed by a feature map is a discrete convolution, hence the name.
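The shared-weight, local-connection idea can be sketched as a sliding-window weighted sum followed by a ReLU. This is an illustrative toy, with a hand-picked filter rather than a learned one, and `feature_map` is a hypothetical name. As written it computes a cross-correlation, which equals a discrete convolution with the filter flipped in both directions.

```python
def feature_map(image, kernel):
    """Slide one shared filter over the image; each output unit sees a local patch."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    out = []
    for i in range(out_h):
        row = []
        for j in range(out_w):
            # Local weighted sum over the patch whose top-left corner is (i, j).
            s = sum(kernel[a][b] * image[i + a][j + b]
                    for a in range(kh) for b in range(kw))
            row.append(max(0, s))  # ReLU non-linearity
        out.append(row)
    return out

# Toy image with a dark-to-bright vertical edge between columns 1 and 2.
image = [
    [0, 0, 1, 1],
    [0, 0, 1, 1],
    [0, 0, 1, 1],
]
# Hand-picked filter that responds to dark-to-bright vertical edges.
vertical_edge = [[-1, 1],
                 [-1, 1]]

response = feature_map(image, vertical_edge)
```

Because every unit uses the same four weights, the edge is detected wherever it occurs, and the layer needs only four parameters regardless of the image size.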
Although the role of the convolutional layer is to detect local conjunctions of features from the previous layer, the role of the pooling layer is to merge semantically similar features into one. Because the relative positions of the features forming a motif can vary somewhat, reliably detecting the motif can be done by coarse-graining the position of each feature. A typical pooling unit computes the maximum of a local patch of units in one feature map (or in a few feature maps). Neighbouring pooling units take input from patches that are shifted by more than one row or column, thereby reducing the dimension of the representation and creating an invariance to small shifts and distortions. Two or three stages of convolution, non-linearity and pooling are stacked, followed by more convolutional and fully-connected layers. Backpropagating gradients through a ConvNet is as simple as through a regular deep network, allowing all the weights in all the filter banks to be trained.
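A minimal sketch of max pooling shows how it produces invariance to small shifts. The 2×2 window, the stride of 2 and the toy feature maps are assumptions for illustration, and `max_pool` is a hypothetical name.

```python
def max_pool(fmap, size=2, stride=2):
    """Each pooling unit outputs the maximum of a local patch of one feature map."""
    out = []
    for i in range(0, len(fmap) - size + 1, stride):
        row = []
        for j in range(0, len(fmap[0]) - size + 1, stride):
            row.append(max(fmap[i + a][j + b]
                           for a in range(size) for b in range(size)))
        out.append(row)
    return out

original = [
    [0, 2, 0, 0],
    [0, 0, 0, 0],
    [0, 0, 0, 3],
    [1, 0, 0, 0],
]
# The same activations, each moved by one position within its pooling patch.
shifted = [
    [2, 0, 0, 0],
    [0, 0, 0, 0],
    [0, 0, 3, 0],
    [0, 1, 0, 0],
]

pooled_original = max_pool(original)
pooled_shifted = max_pool(shifted)
```

The 4×4 maps are reduced to 2×2, and both inputs give the same pooled output because each activation stays inside its own patch.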
Deep neural networks exploit the property that many natural signals are compositional hierarchies, in which higher-level features are obtained by composing lower-level ones. In images, local combinations of edges form motifs, motifs assemble into parts, and parts form objects. Similar hierarchies exist in speech and text from sounds to phones, phonemes, syllables, words and sentences. The pooling allows representations to vary very little when elements in the previous layer vary in position and appearance.
The convolutional and pooling layers in ConvNets are directly inspired by the classic notions of simple cells and complex cells in visual neuroscience, and the overall architecture is reminiscent of the LGN–V1–V2–V4–IT hierarchy in the visual cortex ventral pathway. When ConvNet models and monkeys are shown the same picture, the activations of high-level units in the ConvNet explain half of the variance of random sets of 160 neurons in the monkey’s inferotemporal cortex. ConvNets have their roots in the neocognitron, the architecture of which was somewhat similar, but did not have an end-to-end supervised-learning algorithm such as backpropagation. A primitive 1D ConvNet called a time-delay neural net was used for the recognition of phonemes and simple words.
There have been numerous applications of convolutional networks going back to the early 1990s, starting with time-delay neural networks for speech recognition and document reading. The document reading system used a ConvNet trained jointly with a probabilistic model that implemented language constraints. By the late 1990s this system was reading over 10% of all the cheques in the United States. A number of ConvNet-based optical character recognition and handwriting recognition systems were later deployed by Microsoft. ConvNets were also experimented with in the early 1990s for object detection in natural images, including faces and hands and for face recognition.
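Notes on convolutional neural networks: the four key ideas behind ConvNets are:
Local connections: a neuron does not need to perceive the whole image. It only needs to perceive a local region, and the local information is then combined at higher layers to obtain global information;
Shared weights: weight sharing (that is, the convolution operation) reduces the number of weights and the complexity of the network, and can be seen as a form of feature extraction. The underlying principle is that the statistics of one part of an image are the same as those of any other part, so features learned on one part also apply to another, and the same learned features can be used at every position in the image;
Pooling: once features have been obtained by convolution, the next step is to use them for classification. A classifier such as a softmax classifier could be trained on all the extracted features, but this is computationally expensive and prone to over-fitting. To describe large images, features at different positions can therefore be aggregated, for example by taking their mean or maximum (mean-pooling and max-pooling);
The use of many layers.
The paper then describes the structure of a typical ConvNet: convolution layers, non-linearity and pooling. Stacking this structure several times forms the hidden layers of the ConvNet. It goes on to describe the inspiration behind the design of the convolutional and pooling layers.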
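The convolution, non-linearity and pooling stage described above can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: the function names, the toy 5x5 images and the hand-picked edge filter are all made up for the example, and a real ConvNet would learn its filters by backpropagation. As in most deep-learning libraries, the filter is applied without flipping (strictly a cross-correlation).

```python
def conv2d_valid(image, kernel):
    """Slide one shared filter over a 2D image (no padding, stride 1)."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    feature_map = []
    for i in range(out_h):
        row = []
        for j in range(out_w):
            # Local connection: each output unit sees only a kh x kw patch.
            # Shared weights: the same kernel is reused at every position.
            row.append(sum(image[i + di][j + dj] * kernel[di][dj]
                           for di in range(kh) for dj in range(kw)))
        feature_map.append(row)
    return feature_map


def relu(feature_map):
    """Element-wise non-linearity: negative responses become zero."""
    return [[max(0.0, v) for v in row] for row in feature_map]


def max_pool(feature_map, size=2):
    """Non-overlapping max pooling; rows/columns that do not fill a window are dropped."""
    h, w = len(feature_map), len(feature_map[0])
    return [[max(feature_map[i + di][j + dj]
                 for di in range(size) for dj in range(size))
             for j in range(0, w - size + 1, size)]
            for i in range(0, h - size + 1, size)]


def stage(image, kernel):
    """One convolution -> non-linearity -> pooling stage."""
    return max_pool(relu(conv2d_valid(image, kernel)))


# Toy 5x5 images with a vertical dark-to-bright edge (made-up data).
image_a = [[0, 0, 1, 1, 1] for _ in range(5)]
image_b = [[0, 1, 1, 1, 1] for _ in range(5)]   # same edge, one pixel further left
edge_filter = [[-1.0, 1.0], [-1.0, 1.0]]        # responds to left-to-right increases

pooled_a = stage(image_a, edge_filter)
pooled_b = stage(image_b, edge_filter)
print(pooled_a)
print(pooled_b)
```

In this toy case the two pooled maps are identical, which is the small-shift invariance that pooling is meant to provide. The invariance is limited: a larger shift that moves the edge into a different pooling window would change the output.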
Image understanding with deep convolutional networks
Since the early 2000s, ConvNets have been applied with great success to the detection, segmentation and recognition of objects and regions in images. These were all tasks in which labelled data was relatively abundant, such as traffic sign recognition, the segmentation of biological images particularly for connectomics, and the detection of faces, text, pedestrians and human bodies in natural images. A major recent practical success of ConvNets is face recognition.
Importantly, images can be labelled at the pixel level, which will have applications in technology, including autonomous mobile robots and self-driving cars. Companies such as Mobileye and NVIDIA are using such ConvNet-based methods in their upcoming vision systems for cars. Other applications gaining importance involve natural language understanding and speech recognition.
Despite these successes, ConvNets were largely forsaken by the mainstream computer-vision and machine-learning communities until the ImageNet competition in 2012. When deep convolutional networks were applied to a data set of about a million images from the web that contained 1,000 different classes, they achieved spectacular results, almost halving the error rates of the best competing approaches. This success came from the efficient use of GPUs, ReLUs, a new regularization technique called dropout, and techniques to generate more training examples by deforming the existing ones. This success has brought about a revolution in computer vision; ConvNets are now the dominant approach for almost all recognition and detection tasks and approach human performance on some tasks. A recent stunning demonstration combines ConvNets and recurrent net modules for the generation of image captions (Fig. 3).
Recent ConvNet architectures have 10 to 20 layers of ReLUs, hundreds of millions of weights, and billions of connections between units. Whereas training such large networks could have taken weeks only two years ago, progress in hardware, software and algorithm parallelization has reduced training times to a few hours.
The performance of ConvNet-based vision systems has caused most major technology companies, including Google, Facebook, Microsoft, IBM, Yahoo!, Twitter and Adobe, as well as a quickly growing number of start-ups, to initiate research and development projects and to deploy ConvNet-based image understanding products and services.
ConvNets are easily amenable to efficient hardware implementations in chips or field-programmable gate arrays. A number of companies such as NVIDIA, Mobileye, Intel, Qualcomm and Samsung are developing ConvNet chips to enable real-time vision applications in smartphones, cameras, robots and self-driving cars.
深度卷積網絡的圖像理解
自二十一世紀(jì)初以來,ConvNets已成功應(yīng)用于圖像中對象和區(qū)域的檢測、分割和識別。這些都是標(biāo)記數(shù)據(jù)相對豐富的任務(wù),例如交通標(biāo)志識別、生物圖像分割、尤其是用于連接組學(xué),以及在自然圖像中檢測人臉、文字、行人和人體。ConvNets最近在實(shí)踐中取得的主要成功是面部識別[59]。
重要的是,可以在像素級別標(biāo)記圖像,這將在技術(shù)中得到應(yīng)用,包括自動駕駛機(jī)器人和自動駕駛汽車。 Mobileye和NVIDIA等公司正在其即將推出的汽車視覺系統(tǒng)中使用基于ConvNet的方法。其他日益重要的應(yīng)用包括自然語言理解和語音識別。
盡管取得了這些成功,但ConvNet在很大程度上被主流計(jì)算機(jī)視覺和機(jī)器學(xué)習(xí)領(lǐng)域棄用,直到2012年ImageNet競賽為止。當(dāng)深度卷積網(wǎng)絡(luò)應(yīng)用于來自網(wǎng)絡(luò)的大約一百萬個圖像的數(shù)據(jù)集時,其中包含1000個不同的類別,取得了驚人的成績,幾乎使最佳競爭方法的錯誤率降低了一半。成功的原因是有效利用了GPU、ReLU、一種稱為dropout的新正則化技術(shù),以及通過使現(xiàn)有示例變形而生成更多訓(xùn)練示例的技術(shù)。這一成功帶來了計(jì)算機(jī)視覺的一場革命?,F(xiàn)在,ConvNets是幾乎所有識別和檢測任務(wù)的主要方法,并且在某些任務(wù)上達(dá)到了人類水平。最近的一次令人震驚的演示結(jié)合了ConvNets和遞歸網(wǎng)絡(luò)模塊以生成圖像字幕(圖3)。
最新的ConvNet架構(gòu)具有10到20層ReLU,數(shù)億個權(quán)重以及單元之間的數(shù)十億個連接。盡管培訓(xùn)如此大型的網(wǎng)絡(luò)可能僅在兩年前才花了幾周的時間,但是硬件、軟件和算法并行化方面的進(jìn)步已將培訓(xùn)時間減少到幾個小時。
基于ConvNet的視覺系統(tǒng)的性能已引起大多數(shù)主要技術(shù)公司的發(fā)展,其中包括Google、Facebook、Microsoft、IBM、Y ahoo、Twitter和Adobe,以及數(shù)量迅速增長的初創(chuàng)公司啟動了研究和開發(fā)項(xiàng)目,部署基于ConvNet的圖像理解產(chǎn)品和服務(wù)。
卷積網(wǎng)絡(luò)很容易適應(yīng)芯片或現(xiàn)場可編程門陣列中的高效硬件實(shí)現(xiàn)。 NVIDIA、Mobileye、英特爾、高通和三星等多家公司正在開發(fā)ConvNet芯片,以支持智能手機(jī)、相機(jī)、機(jī)器人和自動駕駛汽車中的實(shí)時視覺應(yīng)用。
Image understanding with deep convolutional networks: this section does not seem to contain much technical content. It mainly describes the impressive results that the major internet companies have achieved with deep neural networks.
Distributed representations and language processing
Deep-learning theory shows that deep nets have two different exponential advantages over classic learning algorithms that do not use distributed representations. Both of these advantages arise from the power of composition and depend on the underlying data-generating distribution having an appropriate componential structure. First, learning distributed representations enables generalization to new combinations of the values of learned features beyond those seen during training (for example, $2^n$ combinations are possible with $n$ binary features). Second, composing layers of representation in a deep net brings the potential for another exponential advantage (exponential in the depth).
The hidden layers of a multilayer neural network learn to represent the network’s inputs in a way that makes it easy to predict the target outputs. This is nicely demonstrated by training a multilayer neural network to predict the next word in a sequence from a local context of earlier words. Each word in the context is presented to the network as a one-of-N vector, that is, one component has a value of 1 and the rest are 0. In the first layer, each word creates a different pattern of activations, or word vectors (Fig. 4). In a language model, the other layers of the network learn to convert the input word vectors into an output word vector for the predicted next word, which can be used to predict the probability for any word in the vocabulary to appear as the next word. The network learns word vectors that contain many active components, each of which can be interpreted as a separate feature of the word, as was first demonstrated in the context of learning distributed representations for symbols. These semantic features were not explicitly present in the input. They were discovered by the learning procedure as a good way of factorizing the structured relationships between the input and output symbols into multiple ‘micro-rules’. Learning word vectors turned out to also work very well when the word sequences come from a large corpus of real text and the individual micro-rules are unreliable. When trained to predict the next word in a news story, for example, the learned word vectors for Tuesday and Wednesday are very similar, as are the word vectors for Sweden and Norway. Such representations are called distributed representations because their elements (the features) are not mutually exclusive and their many configurations correspond to the variations seen in the observed data. These word vectors are composed of learned features that were not determined ahead of time by experts, but automatically discovered by the neural network. Vector representations of words learned from text are now very widely used in natural language applications.
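The one-of-N input and the word-vector lookup described above can be illustrated with a small sketch. Everything here is an assumption made for the example: the four-word vocabulary, the three features per word, the random untrained weights, and the simplification to a single context word with no hidden layer. It shows only the mechanics, namely that multiplying a one-of-N vector by the first-layer weights selects one row (the word vector), and that a softmax turns output scores into a probability for every word in the vocabulary.

```python
import math
import random

vocab = ["tuesday", "wednesday", "sweden", "norway"]  # toy vocabulary (assumption)
dim = 3                                               # features per word (assumption)

rng = random.Random(0)
# First-layer weights: one row of `dim` features per vocabulary word.
# Random here; in a trained model these rows would be the learned word vectors.
embedding = [[rng.uniform(-1, 1) for _ in range(dim)] for _ in vocab]
# Output weights mapping a feature vector to one score per vocabulary word.
output_w = [[rng.uniform(-1, 1) for _ in vocab] for _ in range(dim)]


def one_of_n(word):
    """One component is 1 and the rest are 0."""
    return [1.0 if w == word else 0.0 for w in vocab]


def vec_mat(v, m):
    """Row vector times matrix."""
    return [sum(v[i] * m[i][j] for i in range(len(v))) for j in range(len(m[0]))]


def softmax(scores):
    """Turn arbitrary scores into probabilities that sum to 1."""
    top = max(scores)                      # subtract the max for numerical stability
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def next_word_distribution(word):
    word_vector = vec_mat(one_of_n(word), embedding)  # equivalent to a row lookup
    return softmax(vec_mat(word_vector, output_w))


probs = next_word_distribution("tuesday")
for w, p in zip(vocab, probs):
    print(f"{w}: {p:.3f}")
```

Because the weights are random, the printed probabilities carry no meaning. Training would adjust `embedding` and `output_w` so that words used in similar contexts end up with similar rows.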
The issue of representation lies at the heart of the debate between the logic-inspired and the neural-network-inspired paradigms for cognition. In the logic-inspired paradigm, an instance of a symbol is something for which the only property is that it is either identical or non-identical to other symbol instances. It has no internal structure that is relevant to its use; and to reason with symbols, they must be bound to the variables in judiciously chosen rules of inference. By contrast, neural networks just use big activity vectors, big weight matrices and scalar non-linearities to perform the type of fast ‘intuitive’ inference that underpins effortless commonsense reasoning.
Before the introduction of neural language models, the standard approach to statistical modelling of language did not exploit distributed representations: it was based on counting frequencies of occurrences of short symbol sequences of length up to $N$ (called $N$-grams). The number of possible $N$-grams is on the order of $V^N$, where $V$ is the vocabulary size, so taking into account a context of more than a handful of words would require very large training corpora. $N$-grams treat each word as an atomic unit, so they cannot generalize across semantically related sequences of words, whereas neural language models can because they associate each word with a vector of real-valued features, and semantically related words end up close to each other in that vector space (Fig. 4).
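To make the counting approach and the $V^N$ growth concrete, here is a minimal sketch using a made-up nine-word corpus. The helper name `ngram_counts` and the corpus are illustrative assumptions, not part of the paper.

```python
from collections import Counter


def ngram_counts(tokens, n):
    """Count every run of n consecutive tokens."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


tokens = "the cat sat on the mat the cat ran".split()  # toy corpus (assumption)
bigrams = ngram_counts(tokens, 2)
vocab_size = len(set(tokens))

# Number of distinct N-grams that could occur, on the order of V**N.
possible = {n: vocab_size ** n for n in (1, 2, 3, 5)}

print(bigrams[("the", "cat")])   # seen twice in the corpus
print(bigrams[("a", "cat")])     # never seen, so the count is zero
print(possible)
```

With only 6 distinct words there are already 7,776 possible 5-grams, while the corpus contains 8 bigram occurrences in total. An unseen sequence simply gets a count of zero, however similar it is to sequences that were seen, which is the lack of generalization the paragraph above points out.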
分布式表示和語言處理
深度學習理論表明,與不使用分布式表示的經典學習算法相比,深網具有兩個不同的指數優勢。這兩個優點都來自于組合的力量,并取決于具有適當組件結構的底層數據生成分布。首先,學習分布式表示可以將學習到的特征值的新組合推廣到訓練期間看不到的那些新組合(例如使用 $n$ 個二進制特征可以進行 $2^n$ 個組合)。其次,在一個深層網絡中構成表示層會帶來另一個指數優勢(深度指數)。
多層神經網絡的隱藏層學習以易于預測目標輸出的方式來表示網絡的輸入。通過訓練多層神經網絡從較早單詞的局部上下文中預測序列中的下一個單詞,可以很好地證明這一點。上下文中的每個單詞都以 one-of-N 向量的形式呈現給網絡,也就是說,一個組成部分的值為1,其余均為0。在第一層中,每個單詞都會創建不同的激活模式,或者字向量(圖4)。在語言模型中,網絡的其他層學習將輸入的單詞矢量轉換為預測的下一個單詞的輸出單詞矢量,這可用于預測詞匯表中任何單詞出現為下一個單詞的概率。網絡學習包含許多有效成分的單詞向量,每個成分都可以解釋為單詞的一個獨立特征,如在學習符號的分布式表示形式時首先證明的那樣。這些語義特征未在輸入中明確顯示。通過學習過程可以發現它們,這是將輸入和輸出符號之間的結構化關系分解為多個“微規則”的好方法。當單詞序列來自大量的真實文本并且單個微規則不可靠時,學習單詞向量也可以很好地工作。例如,在訓練以預測新聞故事中的下一個單詞時,周二和周三學到的單詞向量非常相似,瑞典和挪威的單詞向量也是如此。這樣的表示稱為分布式表示,因為它們的元素(特征)不是互斥的,并且它們的許多配置對應于在觀察到的數據中看到的變化。這些詞向量由專家事先未確定但由神經網絡自動發現的學習特征組成。從文本中學到的單詞的矢量表示現在已在自然語言應用中得到廣泛使用。
表示問題是邏輯啟發(fā)和神經(jīng)網(wǎng)絡(luò)啟發(fā)的認(rèn)知范式之間爭論的核心。在邏輯啟發(fā)范式中,符號實(shí)例是某些事物,其唯一屬性是它與其他符號實(shí)例相同或不同。它沒有與其使用相關(guān)的內(nèi)部結(jié)構(gòu);為了用符號進(jìn)行推理,必須將它們綁定到明智選擇的推理規(guī)則中的變量。相比之下,神經(jīng)網(wǎng)絡(luò)僅使用較大的活動矢量,較大的權(quán)重矩陣和標(biāo)量非線性來執(zhí)行快速的“直覺”推斷類型,從而支持毫不費(fèi)力的常識推理。
在引入神經語言模型之前,語言統計建模的標準方法并未利用分布式表示形式:它是基于對長度不超過 $N$(稱為 $N$-grams)的短符號序列的出現頻率進行計數。可能的 $N$-grams 的數量在 $V^N$ 的數量級上,其中 $V$ 是詞匯量,因此要考慮超過少數幾個單詞的上下文,將需要非常大的訓練語料庫。$N$-grams 將每個單詞視為一個原子單元,因此它們無法在語義上相關的單詞序列之間進行泛化,而神經語言模型則可以,因為它們將每個單詞與實值特征向量相關聯,語義相關的單詞最終在該向量空間中彼此靠近(圖4)。
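Distributed representations and language processing: deep-learning theory shows that deep nets have two different exponential advantages over classic learning algorithms that do not use distributed representations. Both advantages arise from the power of composition and depend on the underlying data-generating distribution having an appropriate componential structure. The section then describes distributed representations and how their applications have developed.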
Recurrent neural networks
When backpropagation was first introduced, its most exciting use was for training recurrent neural networks (RNNs). For tasks that involve sequential inputs, such as speech and language, it is often better to use RNNs (Fig. 5). RNNs process an input sequence one element at a time, maintaining in their hidden units a ‘state vector’ that implicitly contains information about the history of all the past elements of the sequence. When we consider the outputs of the hidden units at different discrete time steps as if they were the outputs of different neurons in a deep multilayer network (Fig. 5, right), it becomes clear how we can apply backpropagation to train RNNs.
RNNs are very powerful dynamic systems, but training them has proved to be problematic because the backpropagated gradients either grow or shrink at each time step, so over many time steps they typically explode or vanish.
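A scalar toy RNN is enough to see both the state vector and the gradient problem. The sketch below is an illustration under stated assumptions: a single hidden unit, a tanh non-linearity, and hand-picked weights (`w_rec`, `w_in`) rather than learned ones. The gradient of the last state with respect to the first is a product of one factor per time step, so it shrinks or grows geometrically with the number of steps.

```python
import math


def rnn_forward(inputs, w_rec, w_in, h0=0.0):
    """Scalar RNN: h_t = tanh(w_rec * h_{t-1} + w_in * x_t). Returns all states."""
    states = [h0]
    for x in inputs:
        states.append(math.tanh(w_rec * states[-1] + w_in * x))
    return states


def grad_last_wrt_first(states, w_rec):
    """d h_T / d h_0 by the chain rule: a product of per-step factors."""
    grad = 1.0
    for h in states[1:]:
        grad *= w_rec * (1.0 - h * h)   # derivative of tanh(a) is 1 - tanh(a)^2
    return grad


# The state keeps a trace of an input seen three steps ago.
states = rnn_forward([1.0, 0.0, 0.0], w_rec=0.5, w_in=1.0)
print(states)

# With all-zero inputs the state stays at 0, so every factor equals w_rec
# and the gradient over 50 steps is exactly w_rec ** 50.
quiet = [0.0] * 50
vanishing = grad_last_wrt_first(rnn_forward(quiet, 0.5, 1.0), 0.5)
exploding = grad_last_wrt_first(rnn_forward(quiet, 1.5, 1.0), 1.5)
print(vanishing, exploding)
```

Away from the zero state, the extra `1 - h * h` factor is below one, which pushes tanh units further towards vanishing gradients. In a real RNN the scalar `w_rec` is a weight matrix and the per-step factor is a Jacobian, but the repeated-product structure is the same.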
Thanks to advances in their architecture and ways of training them, RNNs have been found to be very good at predicting the next character in the text or the next word in a sequence, but they can also be used for more complex tasks. For example, after reading an English sentence one word at a time, an English ‘encoder’ network can be trained so that the final state vector of its hidden units is a good representation of the thought expressed by the sentence. This thought vector can then be used as the initial hidden state of (or as extra input to) a jointly trained French ‘decoder’ network, which outputs a probability distribution for the first word of the French translation. If a particular first word is chosen from this distribution and provided as input to the decoder network, it will then output a probability distribution for the second word of the translation and so on until a full stop is chosen. Overall, this process generates sequences of French words according to a probability distribution that depends on the English sentence. This rather naive way of performing machine translation has quickly become competitive with the state-of-the-art, and this raises serious doubts about whether understanding a sentence requires anything like the internal symbolic expressions that are manipulated by using inference rules. It is more compatible with the view that everyday reasoning involves many simultaneous analogies that each contribute plausibility to a conclusion.
Instead of translating the meaning of a French sentence into an English sentence, one can learn to ‘translate’ the meaning of an image into an English sentence (Fig. 3). The encoder here is a deep ConvNet that converts the pixels into an activity vector in its last hidden layer. The decoder is an RNN similar to the ones used for machine translation and neural language modelling. There has been a surge of interest in such systems recently.
RNNs, once unfolded in time (Fig. 5), can be seen as very deep feedforward networks in which all the layers share the same weights. Although their main purpose is to learn long-term dependencies, theoretical and empirical evidence shows that it is difficult to learn to store information for very long.
To correct for that, one idea is to augment the network with an explicit memory. The first proposal of this kind is the long short-term memory (LSTM) networks that use special hidden units, the natural behaviour of which is to remember inputs for a long time. A special unit called the memory cell acts like an accumulator or a gated leaky neuron: it has a connection to itself at the next time step that has a weight of one, so it copies its own real-valued state and accumulates the external signal, but this self-connection is multiplicatively gated by another unit that learns to decide when to clear the content of the memory.
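The accumulator behaviour of the memory cell can be sketched as a single update rule. This is a deliberately reduced illustration, not a full LSTM: the gate values are passed in by hand, whereas in an LSTM they are outputs of learned units that depend on the current input and the previous hidden state. The names `memory_cell_step`, `keep_gate` and `write_gate` are made up for this example.

```python
def memory_cell_step(cell, external, keep_gate, write_gate=1.0):
    """One time step of a gated accumulator.

    keep_gate in [0, 1] multiplies the weight-one self-connection:
    1.0 copies the old state forward unchanged, 0.0 clears the memory.
    write_gate in [0, 1] scales how much of the external signal is added.
    """
    return keep_gate * cell + write_gate * external


# (external signal, keep gate) per time step; values chosen by hand.
steps = [(1.0, 1.0), (0.0, 1.0), (0.5, 1.0), (0.0, 0.0), (2.0, 1.0)]

cell = 0.0
trace = []
for external, keep in steps:
    cell = memory_cell_step(cell, external, keep)
    trace.append(cell)

print(trace)
```

While the keep gate stays at 1 the cell holds and accumulates its content across steps, and the fourth step shows the gate clearing it. Because the self-connection has weight one when the gate is open, the stored value is neither shrunk nor amplified from one step to the next.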
LSTM networks have subsequently proved to be more effective than conventional RNNs, especially when they have several layers for each time step, enabling an entire speech recognition system that goes all the way from acoustics to the sequence of characters in the transcription. LSTM networks or related forms of gated units are also currently used for the encoder and decoder networks that perform so well at machine translation.
Over the past year, several authors have made different proposals to augment RNNs with a memory module. Proposals include the Neural Turing Machine in which the network is augmented by a ‘tape-like’ memory that the RNN can choose to read from or write to, and memory networks, in which a regular network is augmented by a kind of associative memory. Memory networks have yielded excellent performance on standard question-answering benchmarks. The memory is used to remember the story about which the network is later asked to answer questions.
Beyond simple memorization, neural Turing machines and memory networks are being used for tasks that would normally require reasoning and symbol manipulation. Neural Turing machines can be taught ‘algorithms’. Among other things, they can learn to output a sorted list of symbols when their input consists of an unsorted sequence in which each symbol is accompanied by a real value that indicates its priority in the list. Memory networks can be trained to keep track of the state of the world in a setting similar to a text adventure game and after reading a story, they can answer questions that require complex inference. In one test example, the network is shown a 15-sentence version of The Lord of the Rings and correctly answers questions such as “where is Frodo now?”.
循環神經網絡
首次引入反向傳播時,其最令人興奮的用途是訓(xùn)練循環(huán)神經(jīng)網(wǎng)絡(luò)(RNN)。對于涉及順序輸入的任務(wù),例如語音和語言,通常最好使用RNN(圖5)。 RNN一次處理一個輸入序列的一個元素,在其隱藏的單元中維護(hù)一個“狀態(tài)向量”,該“狀態(tài)向量”隱式包含有關(guān)該序列的所有過去元素的歷史信息。當(dāng)我們將隱藏單位在不同離散時間步長的輸出視為是深層多層網(wǎng)絡(luò)中不同神經(jīng)元的輸出時(圖5右),顯然我們可以如何應(yīng)用反向傳播來訓(xùn)練RNN。
RNN是非常強(qiáng)大的動態(tài)系統(tǒng),但是事實(shí)證明,訓(xùn)練它們是有問題的,因?yàn)榉聪騻鞑サ奶荻仍诿總€時間步長都會增大或縮小,因此在許多時間步長上它們通常會爆炸或消失。
由于其結(jié)構(gòu)和訓(xùn)練方法的進(jìn)步,人們發(fā)現(xiàn)RNN非常擅長預(yù)測文本中的下一個字符或序列中的下一個單詞,但它們也可以用于更復(fù)雜的任務(wù)。例如,一次讀一個單詞的英語句子后,可以訓(xùn)練英語的“編碼器”網(wǎng)絡(luò),使其隱藏單元的最終狀態(tài)向量很好地表示了該句子表達(dá)的思想。然后,可以將此思想向量用作聯(lián)合訓(xùn)練的法語“解碼器”網(wǎng)絡(luò)的初始隱藏狀態(tài)(或作為其額外輸入),該網(wǎng)絡(luò)將輸出法語翻譯的第一個單詞的概率分布。如果從該分布中選擇了一個特定的第一個單詞,并將其作為輸入提供給解碼器網(wǎng)絡(luò),則它將輸出翻譯的第二個單詞的概率分布,依此類推,直到選擇了句號。總體而言,此過程根據(jù)取決于英語句子的概率分布生成法語單詞序列。這種相當(dāng)幼稚的執(zhí)行機(jī)器翻譯的方式已迅速與最新技術(shù)競爭,這引起了人們對理解句子是否需要諸如通過使用推理規(guī)則操縱的內(nèi)部符號表達(dá)式之類的嚴(yán)重質(zhì)疑。日常推理涉及許多同時進(jìn)行的類比,每個類比都為結(jié)論提供了合理性,這一觀點(diǎn)與觀點(diǎn)更加兼容。
與其將法語句子的含義翻譯成英語句子,不如學(xué)習(xí)將圖像的含義“翻譯”成英語句子(圖3)。這里的編碼器是一個深層的ConvNet,可將像素轉(zhuǎn)換為其最后一個隱藏層中的活動矢量。解碼器是一個RNN,類似于用于機(jī)器翻譯和神經(jīng)語言建模的RNN。近年來,對此類系統(tǒng)的興趣激增。
RNNs隨時間展開(圖5),可以看作是非常深的前饋網(wǎng)絡(luò),其中所有層共享相同的權(quán)重。盡管它們的主要目的是學(xué)習(xí)長期依賴關(guān)系,但理論和經(jīng)驗(yàn)證據(jù)表明,很難長期存儲信息。
為了解決這個問題,一個想法是用顯式內(nèi)存擴(kuò)展網(wǎng)絡(luò)。此類第一種建議是使用特殊隱藏單元的長短期記憶(LSTM)網(wǎng)絡(luò),其自然行為是長時間記住輸入。稱為存儲單元的特殊單元的作用類似于累加器或門控泄漏神經(jīng)元:它在下一時間步與其自身具有連接,其權(quán)重為1,因此它復(fù)制自己的實(shí)值狀態(tài)并累積外部信號,但是此自連接是由另一個單元乘法控制的,該單元學(xué)會確定何時清除內(nèi)存內(nèi)容。
LSTM網(wǎng)絡(luò)隨后被證明比常規(guī)RNN更有效,特別是當(dāng)它們在每個時間步都有多層時,使整個語音識別系統(tǒng)從聲學(xué)到轉(zhuǎn)錄中的字符序列都一路走來。LSTM網(wǎng)絡(luò)或相關(guān)形式的門控單元目前也用于編碼器和解碼器網(wǎng)絡(luò),它們在機(jī)器翻譯方面表現(xiàn)出色。
在過去的一年中,幾位作者提出了不同的建議,以使用內(nèi)存模塊擴(kuò)展RNN。建議包括神經(jīng)圖靈機(jī),其中網(wǎng)絡(luò)由RNN可以選擇讀取或?qū)懭氲摹跋駧А贝鎯ζ鱽碓鰪?qiáng),以及存儲網(wǎng)絡(luò),其中常規(guī)網(wǎng)絡(luò)由一種關(guān)聯(lián)性存儲器來增強(qiáng)。內(nèi)存網(wǎng)絡(luò)在標(biāo)準(zhǔn)問答基準(zhǔn)方面已表現(xiàn)出出色的性能。存儲器用于記住故事,有關(guān)該故事后來被要求網(wǎng)絡(luò)回答問題。
除了簡單的記憶外,神經(jīng)圖靈機(jī)和存儲網(wǎng)絡(luò)還用于執(zhí)行通常需要推理和符號操作的任務(wù)。神經(jīng)圖靈機(jī)可以被稱為“算法”。除其他事項(xiàng)外,當(dāng)他們的輸入由未排序的序列組成時,他們可以學(xué)習(xí)輸出已排序的符號列表,其中每個符號都帶有一個實(shí)數(shù)值,該實(shí)數(shù)值指示其在列表中的優(yōu)先級。可以訓(xùn)練記憶網(wǎng)絡(luò),使其在類似于文字冒險游戲的環(huán)境中跟蹤世界狀況,閱讀故事后,它們可以回答需要復(fù)雜推理的問題。在一個測試示例中,該網(wǎng)絡(luò)顯示了15句的《指環(huán)王》,并正確回答了諸如“ Frodo現(xiàn)在在哪里?”之類的問題。
7. Recurrent neural networks: sections 6 and 7 both cover the development of deep learning in text and language processing.
The future of deep learning
Unsupervised learning had a catalytic effect in reviving interest in deep learning, but has since been overshadowed by the successes of purely supervised learning. Although we have not focused on it in this Review, we expect unsupervised learning to become far more important in the longer term. Human and animal learning is largely unsupervised: we discover the structure of the world by observing it, not by being told the name of every object.
Human vision is an active process that sequentially samples the optic array in an intelligent, task-specific way using a small, high-resolution fovea with a large, low-resolution surround. We expect much of the future progress in vision to come from systems that are trained end-to-end and combine ConvNets with RNNs that use reinforcement learning to decide where to look. Systems combining deep learning and reinforcement learning are in their infancy, but they already outperform passive vision systems at classification tasks and produce impressive results in learning to play many different video games.
Natural language understanding is another area in which deep learning is poised to make a large impact over the next few years. We expect systems that use RNNs to understand sentences or whole documents will become much better when they learn strategies for selectively attending to one part at a time.
Ultimately, major progress in artificial intelligence will come about through systems that combine representation learning with complex reasoning. Although deep learning and simple reasoning have been used for speech and handwriting recognition for a long time, new paradigms are needed to replace rule-based manipulation of symbolic expressions by operations on large vectors.
深度學習的未來
無監(jiān)督學(xué)習(xí)在恢復(fù)對深度學(xué)習(xí)的興趣方面起了催化作用,但此后被監(jiān)督學(xué)習(xí)的成功所掩蓋。盡管我們在本評論中并未對此進(jìn)行關(guān)注,但我們希望從長遠(yuǎn)來看,無監(jiān)督學(xué)習(xí)將變得越來越重要。人類和動物的學(xué)習(xí)在很大程度上不受監(jiān)督:我們通過觀察來發(fā)現(xiàn)世界的結(jié)構(gòu),而不是通過告知每個物體的名稱來發(fā)現(xiàn)世界的結(jié)構(gòu)。
人的視覺是一個活躍的過程,它使用具有高分辨率,低分辨率環(huán)繞的小型高分辨率中央凹,以智能的,針對特定任務(wù)的方式對光學(xué)陣列進(jìn)行順序采樣。我們期望在視覺上未來的許多進(jìn)步都將來自端到端訓(xùn)練的系統(tǒng),并將ConvNets與RNN結(jié)合起來,后者使用強(qiáng)化學(xué)習(xí)來決定在哪里看。結(jié)合了深度學(xué)習(xí)和強(qiáng)化學(xué)習(xí)的系統(tǒng)尚處于起步階段,但在分類任務(wù)上它們已經(jīng)超過了被動視覺系統(tǒng),并且在學(xué)習(xí)玩許多不同的視頻游戲方面產(chǎn)生了令人印象深刻的結(jié)果。
自然語言理解是深度學(xué)習(xí)必將在未來幾年產(chǎn)生巨大影響的另一個領(lǐng)域。我們希望使用RNN理解句子或整個文檔的系統(tǒng)在學(xué)習(xí)一次選擇性地關(guān)注一部分的策略時會變得更好。
最終,人工智能的重大進(jìn)步將通過將表示學(xué)習(xí)與復(fù)雜推理相結(jié)合的系統(tǒng)來實(shí)現(xiàn)。盡管長期以來,深度學(xué)習(xí)和簡單推理已被用于語音和手寫識別,但仍需要新的范例來通過對大向量進(jìn)行運(yùn)算來代替基于規(guī)則的符號表達(dá)操縱。
The future of deep learning: this section mainly offers a promising outlook on the development of unsupervised learning.