The Surprisingly Effective Genetic Approach to Feature Selection
Genetic and evolutionary algorithms are often bashed as not being good enough to compete with the capabilities of neural networks, and for the most part that's true, which is why the industry seldom even considers these types of algorithms. They are too general, where other solutions are designed for specific problems, and they require too much computing power. But there is one fascinating application of genetic algorithms to feature selection, an important part of machine learning.
We’ll explore the genetic/evolutionary model of thinking, how that approach can be applied to feature selection, and why it is effective, alongside diagrams and analogies.
In genetic algorithms, a population of candidate solutions, also known as individuals, creatures, or phenotypes, are evolved towards better solutions in an optimization problem. Each candidate has a set of properties that can be mutated and altered.
These properties can be represented as a binary string (a sequence of zeroes and ones), but other encodings exist. In the case of feature selection, each individual represents one selection of features, and each 'property' represents one feature, which can be turned on or off (1 or 0).
The evolution of individuals begins with a randomly generated population, meaning each individual's properties are randomly initialized. Evolution is an iterative process, and the population in each iteration is referred to as a generation. In genetic feature selection on a dataset with 900 columns, an initial population may consist of 300 individuals, or randomly generated combinations of on/off switches.
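The initialization described above can be sketched in a few lines of NumPy (using the 900-column dataset and 300-individual population from the example; the seed is arbitrary):

```python
import numpy as np

rng = np.random.default_rng(seed=42)

N_FEATURES = 900   # columns in the example dataset
POP_SIZE = 300     # individuals in the initial generation

# Each individual is a binary mask over the columns:
# 1 keeps the feature, 0 drops it.
population = rng.integers(0, 2, size=(POP_SIZE, N_FEATURES))

print(population.shape)   # (300, 900)
```

Each row of `population` is one candidate feature subset, ready to be scored by a fitness function.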
In each generation, the fitness of each individual, a function measuring how well it solves the problem at hand, is evaluated.
One direct fitness function would be to simply evaluate the accuracy of a model when trained on that subset of the data, or any of many other possible model metrics. This can be a bit costly, though, so it should only be used with small datasets or populations.
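As a sketch of this direct approach, here is a fitness function that trains a deliberately cheap stand-in classifier (nearest centroid, chosen here only for speed, not named in the text) on the masked features and returns its accuracy:

```python
import numpy as np

def accuracy_fitness(mask, X_train, y_train, X_test, y_test):
    """Fitness = accuracy of a cheap nearest-centroid classifier
    trained only on the features the binary mask keeps."""
    cols = mask.astype(bool)
    if not cols.any():                 # an empty subset is worthless
        return 0.0
    Xtr, Xte = X_train[:, cols], X_test[:, cols]
    classes = np.unique(y_train)
    # One centroid per class, computed on the selected columns only.
    centroids = np.stack([Xtr[y_train == c].mean(axis=0) for c in classes])
    # Assign each test point to the class with the nearest centroid.
    dists = np.linalg.norm(Xte[:, None, :] - centroids[None, :, :], axis=2)
    preds = classes[dists.argmin(axis=1)]
    return float((preds == y_test).mean())
```

Any model and metric could stand in for the classifier here; the point is only that the mask decides which columns the model sees.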
An alternative is to use a variety of cheaper-to-compute metrics that can assist in evaluating the fitness of each solution. Some include:
- Collinearity. Make sure that features in a subset do not contain similar information by evaluating the overall correlation within each subset.
- Entropy / separability. With the current subset, how well separated are the classes? The more separable the data, the better.
- Hybrid. Combine these metrics with others, like variance or how normally distributed the data is, to yield a combination that satisfies the needs of the model.
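The collinearity metric, for instance, could be sketched as the mean absolute pairwise correlation among the selected columns (a hypothetical scoring helper; lower means a less redundant subset):

```python
import numpy as np

def collinearity_penalty(mask, X):
    """Cheap fitness term: mean absolute pairwise correlation among
    the selected features. Lower = less redundant information."""
    cols = np.flatnonzero(mask)
    if len(cols) < 2:                  # nothing to correlate
        return 0.0
    corr = np.corrcoef(X[:, cols], rowvar=False)
    # Average the absolute values above the diagonal.
    off_diag = np.abs(corr[np.triu_indices_from(corr, k=1)])
    return float(off_diag.mean())
```

A hybrid fitness function would subtract a weighted version of this penalty from an accuracy or separability score.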
With some controllable randomness injected to stimulate proper evolutionary discovery, individuals on the fitter side (scoring better on the fitness function) are randomly selected. Randomness is added, and ranking is not based purely on the highest score, because that would allow for little exploration and is not how evolution works in the real biological world.
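One common way to implement this kind of stochastic, fitness-biased selection is roulette-wheel (fitness-proportionate) selection, sketched here; it is one of several schemes that would fit the description above:

```python
import numpy as np

def roulette_select(population, fitnesses, n_parents, rng):
    """Fitness-proportionate selection: fitter individuals are
    more likely to be picked, but selection stays stochastic,
    which preserves exploration."""
    f = np.asarray(fitnesses, dtype=float)
    f = f - f.min()      # shift so probabilities are non-negative
                         # (the worst individual gets probability 0)
    total = f.sum()
    probs = f / total if total > 0 else np.full(len(f), 1 / len(f))
    idx = rng.choice(len(population), size=n_parents, p=probs)
    return population[idx]
```

Tournament selection (comparing small random groups) is a popular alternative with similar exploration properties.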
Even though individual 1 has a lower fitness level, stochastic selection gives it a chance, to see if a slight alteration can boost its performance. In this case, it turns out it does!

Each individual's genome is modified, either through recombination ('mating') or through random mutation (a slight modification), to form a new generation. There are incredibly sophisticated ways to perform mutations and recombination that build upon the evolutionary discoveries of previous generations and the structure of the current population, so don't think of this process as brute force. Instead, it analyzes previous learning and intelligently tests different hypotheses about what may work and what will not.
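A minimal sketch of these two operators, using uniform crossover and bit-flip mutation (one of many possible styles, and far simpler than the sophisticated variants mentioned above):

```python
import numpy as np

def crossover(parent_a, parent_b, rng):
    """Uniform recombination: each feature switch is inherited
    from one parent or the other at random."""
    take_from_a = rng.random(len(parent_a)) < 0.5
    return np.where(take_from_a, parent_a, parent_b)

def mutate(individual, rng, rate=0.01):
    """Bit-flip mutation: each switch is toggled with a small
    probability, nudging the subset into nearby variations."""
    flips = rng.random(len(individual)) < rate
    return np.where(flips, 1 - individual, individual)
```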
The algorithm completes when the maximum number of generations (iterations) has been reached, or when the population has evolved enough that its fitness level is satisfactory. We may say that the algorithm terminates when one individual reaches over 98% accuracy. Another alternative is to finish the algorithm when the best-performing individual's fitness plateaus, or converges.
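Putting the stopping rules together, a minimal GA loop might look like the sketch below; the tournament selection and 2% mutation rate are arbitrary illustrative choices, not prescribed by the text:

```python
import numpy as np

def run_ga(fitness_fn, n_features, pop_size=50, max_gens=100,
           patience=10, target=None, seed=0):
    """Minimal GA loop with three stopping rules: a generation cap,
    an optional fitness target, and a plateau (patience) check."""
    rng = np.random.default_rng(seed)
    pop = rng.integers(0, 2, size=(pop_size, n_features))
    best_score, best_ind, stall = -np.inf, pop[0], 0
    for _ in range(max_gens):
        scores = np.array([fitness_fn(ind) for ind in pop])
        top = int(scores.argmax())
        if scores[top] > best_score:
            best_score, best_ind, stall = scores[top], pop[top].copy(), 0
        else:
            stall += 1                          # no improvement this generation
        if (target is not None and best_score >= target) or stall >= patience:
            break                               # satisfactory or plateaued
        # Pairwise tournament selection: the fitter of two random
        # individuals becomes a parent for the next generation.
        i, j = rng.integers(pop_size, size=(2, pop_size))
        parents = np.where((scores[i] >= scores[j])[:, None], pop[i], pop[j])
        # Bit-flip mutation at a 2% per-switch rate.
        flips = rng.random(parents.shape) < 0.02
        pop = np.where(flips, 1 - parents, parents)
    return best_ind, float(best_score)
```

Run on a toy fitness like "fraction of switches turned on," the loop climbs steadily and then stops by whichever rule triggers first.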
The result of this approach is an individual that yields the subset of the data that best satisfies the cost function.
There is grandeur in this view of life, with its several powers, having been originally breathed into a few forms or into one; and that, whilst this planet has gone cycling on according to the fixed law of gravity, from so simple a beginning endless forms most beautiful and most wonderful have been, and are being, evolved. — Charles Darwin, The Origin of Species
The genetic approach to feature selection can be expanded such that each value is not a binary 0 or 1 to indicate a presence in the subset, but a scalar multiplier, much like the result that linear discriminant analysis or principal component analysis yield. Initialization would be based on normally distributed noise, a mutation would entail adding or subtracting some amount, and a recombination would yield something like an average.
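That real-valued variant could be sketched like so (the Gaussian parameters and mutation scale are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)
N_FEATURES = 10

def init_weights():
    # Initialization: normally distributed noise instead of random bits.
    return rng.normal(0.0, 1.0, size=N_FEATURES)

def mutate_weights(weights, scale=0.1):
    # Mutation: add or subtract a small amount from each weight.
    return weights + rng.normal(0.0, scale, size=weights.shape)

def recombine(a, b):
    # Recombination: the child is something like the parents' average.
    return (a + b) / 2.0
```

The resulting individual is a weight vector over the features, closer in spirit to a learned projection than to a hard subset.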
So what is the advantage of using genetic algorithms to select features?
It doesn't limit how many features you can choose. If you use something like permutation importance, or almost any other feature selection method, you must decide in advance how many features to select, yet no one really knows how to choose a good number of resulting features. If the limit is five features, why should a sixth be discarded even when it is almost as valuable as the fifth best one?
It is highly customizable. While genetic algorithms generally have the impression of being very computationally expensive, using them properly yields a great bang for your buck. You can control many aspects of evolutionary algorithms, from population size to learning rate to the degree of random selection to the style of mutation and recombination, and tailor them to your specific problem. Permutation importance is also expensive by default, but without the upsides and customizability of genetic feature selection.
It is efficient because it is intelligent. Traditional methods of feature selection essentially entail trying out every combination of potential features. Genetic feature selection takes a different approach: it learns from an exploration/exploitation trade-off, searching a larger space and arriving at a better solution in less time, provided it is programmed properly.
Data scientists and machine learning engineers have usually been quick to discard evolutionary algorithms, and this is partially justified. On the other hand, innovation can only be achieved by opening our eyes to different, sometimes seemingly stupid and pointless, exploration and application.
All images except for header image created by author.
Translated from: https://towardsdatascience.com/the-surprisingly-effective-genetic-approach-to-feature-selection-7eb2b080b713