CNN Approach: Using an Image of the Polymer to Predict Its Glass Transition Temperature
In this article, we will use an image of a polymer's structure to predict its glass transition temperature. The approach follows a methodology similar to the one published in a recent research paper by Luis A. Miccio of the Materials Physics Center and the Donostia International Physics Center in Spain [2].
Introduction
The glass transition temperature (Tg) is one of the crucial properties of polymers. It marks the temperature range below which the atoms of a supercooled liquid become temporarily frozen (without crystallizing) upon cooling. Predicting Tg provides valuable insight into the properties of polymers whose synthesis may otherwise be costly and time-consuming. Researchers have traditionally been keener to build machine learning models that predict one property from several other measured properties (for instance, using several other properties of a material to predict its tensile strength). During the last few years, the emphasis has shifted to Quantitative Structure-Property Relationships (QSPR). This opens the possibility of predicting various properties from just the structure of the molecular compound (i.e., just its image), avoiding the need for additional experimental properties or tedious calculations. In this article, we will use Convolutional Neural Networks (CNNs) to predict the Tg of unknown polymer compounds from the image of the polymer. This is pretty cool: it means that if you just draw the monomeric unit on a whiteboard, that would be enough to predict its Tg. We do not need any other external information or properties of the polymer.
Importing Relevant Packages
Dataset
The dataset used in our study was gathered from a popular polymer database. It comprises 351 polymers, with their SMILES codes and molecular names as input attributes and their glass transition temperatures as the output variable. A subset of 300 polymers and their Tg values was used for training and validating the model, whereas the remaining 51 unseen polymers were used to test the results of both models, the CNN and the proposed ANN. The figure below shows the top 5 rows of the dataset. The dataset for this study can be found here.
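The 300/51 split described above could be sketched as follows. This is only a minimal illustration using Python's standard library: the article does not say how the 51 test polymers were chosen, so the random shuffle, the fixed seed, and the helper name `split_indices` are all assumptions.

```python
import random


def split_indices(n_total, n_train, seed=0):
    """Shuffle row indices and split them into train and test groups.

    Hypothetical helper: the random shuffle and the seed are assumptions,
    not the selection procedure used in the article.
    """
    rng = random.Random(seed)
    indices = list(range(n_total))
    rng.shuffle(indices)
    return indices[:n_train], indices[n_train:]


# 351 polymers in total: 300 for training/validation, 51 held out for testing.
train_idx, test_idx = split_indices(n_total=351, n_train=300)
print(len(train_idx), len(test_idx))
```

The returned index lists can then be used to select rows from whatever table holds the SMILES codes and Tg values.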
Reading and Cleaning the Dataset
Figure: Top 5 rows of the data frame
Classifying Polymers
The dataset was explored manually using the Pandas library in Python and classified into eight classes of polymers: acrylates, styrenes, amides, alkenes, ethers, esters, carbonates, and others.
Exploratory Data Analysis
Pie Plot: The pie plot shows the composition of the dataset, with acrylates and styrenes being the largest classes.
Box Plot: The box plot shows the underlying Tg distribution for each class of polymers. It can be seen that styrenes tend to have higher Tg values, whereas acrylates have a fairly mixed distribution.
Distribution of Charge in Monomer
The open-source RDKit [1] Python package was used to render the molecular structures of the polymers in the dataset as drawings. One function in the RDKit [1] module was used to compute the Gasteiger partial charges for the monomeric units.
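As a rough sketch of the charge calculation mentioned above, the snippet below uses RDKit's `ComputeGasteigerCharges`. The helper name `gasteiger_charges` is hypothetical, and the article's actual code (including how the charges were drawn onto the structures) is not shown in the text, so this should be read as one plausible way to do it rather than the author's implementation.

```python
def gasteiger_charges(smiles):
    """Return (atom symbol, Gasteiger partial charge) pairs for a SMILES string.

    Hypothetical helper; requires the third-party RDKit package.
    """
    # Imported inside the function so this module still loads without RDKit.
    from rdkit import Chem
    from rdkit.Chem import AllChem

    mol = Chem.MolFromSmiles(smiles)
    if mol is None:
        raise ValueError(f"could not parse SMILES: {smiles!r}")

    # Stores the charge on every atom under the '_GasteigerCharge' property.
    AllChem.ComputeGasteigerCharges(mol)
    return [
        (atom.GetSymbol(), atom.GetDoubleProp("_GasteigerCharge"))
        for atom in mol.GetAtoms()
    ]
```

Calling `gasteiger_charges("CC(=O)OC")` (methyl acetate, chosen only as an illustration and not taken from the dataset) would return one pair per heavy atom.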
Feature Engineering
Feature engineering is vital in preparing the data for modeling and presenting attributes in machine-readable form. As per the problem statement, the Tg prediction is based on images of the polymer chemical structure, encoded from SMILES line notations and fed into the CNN architecture. The main aim of feature engineering for this problem is to capture both the chemical structure and the chemical composition of the monomeric unit in order to predict Tg. This is achieved using the SMILES line notation [2].
Introduction to SMILES Notation: SMILES stands for Simplified Molecular-Input Line-Entry System. It is a way of describing a chemical structure as a line notation using different characters. The image below shows the SMILES notation for the given chemical structure.
Molecular Structure to Image Encoding: We first defined a list containing all the unique characters that can be present in any given SMILES linear string for a polymer.
Further, as the second step, the SMILES strings of the polymers are one-hot encoded into machine-readable binary images using this list of unique SMILES characters. The result is a stack of binary images, one per polymer, that can be fed into the CNN architecture. Each binary image is a matrix of dimensions m × n, where n is the number of characters in the unique SMILES list and m is the number of characters in the longest SMILES code among the polymers. The figure given below depicts the encoding process visually for one example polymer, Poly(4-biphenyl acrylate).
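The two steps above (collecting the unique characters, then one-hot encoding each string into an m × n binary image) can be sketched in plain Python. The function names and the three toy SMILES strings are assumptions made for illustration; the strings are not taken from the article's dataset, and the original code may differ in details such as character ordering.

```python
def build_charset(smiles_list):
    """Sorted list of every unique character seen in the SMILES strings."""
    return sorted({ch for smiles in smiles_list for ch in smiles})


def one_hot_encode(smiles, charset, max_len):
    """Return an m x n binary image: row i marks the i-th SMILES character."""
    column = {ch: j for j, ch in enumerate(charset)}
    image = [[0] * len(charset) for _ in range(max_len)]
    for i, ch in enumerate(smiles):
        image[i][column[ch]] = 1
    return image  # rows past len(smiles) stay all-zero (padding)


# Toy SMILES strings for illustration only (ethene, acetic acid, benzene).
toy_smiles = ["C=C", "CC(=O)O", "c1ccccc1"]
charset = build_charset(toy_smiles)        # n = len(charset)
max_len = max(len(s) for s in toy_smiles)  # m = length of the longest string
images = [one_hot_encode(s, charset, max_len) for s in toy_smiles]
print(len(images), "images of shape", (max_len, len(charset)))
```

Note that this encoding is character-level, as the text describes: a two-letter element such as Cl would occupy two rows, one for each character.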
Figure: Molecular Structure to Image Encoding Process
The generated one-hot encoded image takes into account the chemical structure and the composition of the monomeric unit. The encoded image tells us, in binary form, how many atoms of each kind are present in the monomeric structure, along with the arrangement of the atoms in the polymeric chain relative to each other.
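To make the claim above concrete: in a one-hot image, the column sums give the count of each character (composition), while the row order preserves the sequence (structure). The sketch below assumes a small hand-written character list and a toy SMILES string, neither of which comes from the article's dataset, and the helper names are hypothetical.

```python
def one_hot_image(smiles, charset, max_len):
    """Encode a SMILES string as a max_len x len(charset) binary image."""
    image = [[0] * len(charset) for _ in range(max_len)]
    for i, ch in enumerate(smiles):
        image[i][charset.index(ch)] = 1
    return image


def character_counts(image, charset):
    """Column sums: how many times each character occurs (composition)."""
    return {ch: sum(row[j] for row in image) for j, ch in enumerate(charset)}


def decode(image, charset):
    """Row order: read the characters back in sequence (structure)."""
    return "".join(charset[row.index(1)] for row in image if 1 in row)


# Assumed toy character list and molecule (acetic acid), for illustration only.
charset = ["(", ")", "=", "C", "O"]
image = one_hot_image("CC(=O)O", charset, max_len=10)

counts = character_counts(image, charset)
recovered = decode(image, charset)
print(counts["C"], counts["O"], recovered)
```

Because the original string can be read back from the image, no structural information expressed in the SMILES code is lost by the encoding.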
The figure given below shows the encoded images of the first five polymers in the data frame.
Figure: Encoded images for the first five polymers in the data frame
Model Implementation
The image-encoded molecular structure was fed as input to the CNN, and the target variable was the Tg of the given polymer, which is a continuous variable. The model was implemented using the Keras library, which serves as a high-level Application Programming Interface (API) for TensorFlow.
Figure: Schematic of CNN [2]
Proposed Architecture: The final hyper-parameters were chosen by trying various combinations of the different hyper-parameters. The best observed configuration uses 64 filters with a window size of (5, 5) in the first convolutional layer and 32 filters with a window size of (3, 3) in the second. This is followed by a max-pooling layer with a window size of (3, 3). After the max-pooling layer, there are three dense layers with 32, 10, and 1 neurons respectively, with the final dense layer being the output of the model. The ReLU activation function was used by all layers, with L2 regularization. The model achieved its best generalization when trained for up to 180 epochs with a batch size of 64 and a learning rate of 0.03. A validation split of 0.1 and a dropout probability of 0.1 were used while training the network, so that 10% of the training data was held out for validation.
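The configuration described above could be written in Keras roughly as follows. The layer sizes, epochs, batch size, learning rate, validation split, and dropout rate come from the text; everything else is an assumption that the article does not specify, including the "same" padding, the position of the dropout layer, the L2 strength, the Adam optimizer, and the function name `build_model`.

```python
# Hyper-parameters quoted in the article text.
HPARAMS = {
    "conv1_filters": 64, "conv1_window": (5, 5),
    "conv2_filters": 32, "conv2_window": (3, 3),
    "pool_window": (3, 3),
    "dense_units": (32, 10, 1),
    "dropout": 0.1,
    "learning_rate": 0.03,
    "epochs": 180,
    "batch_size": 64,
    "validation_split": 0.1,
}
L2_STRENGTH = 1e-3  # assumption: the article does not give the L2 factor


def build_model(max_len, n_chars):
    """Build the CNN for images of shape (max_len, n_chars, 1)."""
    # Imported here so the hyper-parameter table is usable without TensorFlow.
    from tensorflow import keras
    from tensorflow.keras import layers, regularizers

    reg = regularizers.l2(L2_STRENGTH)
    model = keras.Sequential([
        keras.Input(shape=(max_len, n_chars, 1)),
        layers.Conv2D(HPARAMS["conv1_filters"], HPARAMS["conv1_window"],
                      padding="same",  # assumption: padding is not stated
                      activation="relu", kernel_regularizer=reg),
        layers.Conv2D(HPARAMS["conv2_filters"], HPARAMS["conv2_window"],
                      padding="same", activation="relu",
                      kernel_regularizer=reg),
        layers.MaxPooling2D(pool_size=HPARAMS["pool_window"]),
        layers.Dropout(HPARAMS["dropout"]),  # assumption: placement not stated
        layers.Flatten(),
        layers.Dense(32, activation="relu", kernel_regularizer=reg),
        layers.Dense(10, activation="relu", kernel_regularizer=reg),
        # The text says ReLU is used by all layers, so the output uses it too.
        layers.Dense(1, activation="relu", kernel_regularizer=reg),
    ])
    model.compile(
        # assumption: the optimizer is not named in the article
        optimizer=keras.optimizers.Adam(learning_rate=HPARAMS["learning_rate"]),
        loss="mean_absolute_error",
    )
    return model
```

Training would then look like `model.fit(X, y, epochs=HPARAMS["epochs"], batch_size=HPARAMS["batch_size"], validation_split=HPARAMS["validation_split"])`, where `X` is the stack of encoded images with a trailing channel axis and `y` holds the Tg values.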
Results
The figures given below show the experimental and the predicted values of the glass transition temperature for the training set and the unseen test set. For an ideal model, we would expect the predicted Tg values to be exactly equal to the real values, so that all points fall on a straight line through the origin with unit slope.
Figure: Real vs predicted Tg values for the training set
Figure: Real vs predicted Tg values for the unseen test set
Most of the examples show very accurate predictions when compared to the real Tg values. However, a few polymers contribute a significant level of uncertainty to the predictions because of insufficient training data. These polymers belong to the minority classes of esters and ethers, and because there are too few training examples of either class, their Tg is not learned effectively.
Loss Metrics: We used the mean absolute error as the loss function while training the neural network. For our final evaluation, however, we used the mean relative % error as the evaluation metric for our model. This can be represented as follows:
mean relative % error = (100 / m) × Σ |Ai − Pi| / Ai, summed over i = 1 to m
where Ai is the actual Tg value and Pi is the predicted Tg value. The average of this relative % error was taken over the full dataset of m polymers. After the training process, we computed the respective training and testing mean relative errors.
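The metric described above, the mean of |Ai − Pi| / Ai over m polymers expressed as a percentage, could be computed as in the sketch below. The function name is hypothetical and the Tg values are made-up numbers used only to exercise the function, not results from the article.

```python
def mean_relative_error_pct(actual, predicted):
    """Mean relative error in percent: (100 / m) * sum(|A_i - P_i| / |A_i|)."""
    if len(actual) != len(predicted) or not actual:
        raise ValueError("inputs must be non-empty and of equal length")
    total = sum(abs(a - p) / abs(a) for a, p in zip(actual, predicted))
    return 100.0 * total / len(actual)


# Illustrative values only (not data from the article).
actual_tg = [300.0, 400.0]
predicted_tg = [330.0, 380.0]
error_pct = mean_relative_error_pct(actual_tg, predicted_tg)
print(f"mean relative error: {error_pct:.1f}%")
```

Here the two relative errors are 10% and 5%, so the mean is about 7.5%. Because the metric divides by the actual value, it is only meaningful when Tg is expressed on a scale where the values are far from zero, such as Kelvin.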
The table given below shows the real and predicted Tg values for four unseen polymers. Our predictions lie very close to the experimental Tg values, suggesting that the proposed model generalizes well to these examples.
Conclusions
In this study, we demonstrated the feasibility of using a CNN to predict the Tg of a polymer by taking into account the molecular structure and chemical composition of its monomeric units. We were able to achieve relative errors of 6% and 7% on the training and test sets, respectively. In my next article, I will be using fully connected neural networks to predict the glass transition temperature. This new model will incorporate all kinds of intra-molecular interactions along with the chemical composition and molecular structure to predict Tg.
Credits
Special thanks to Danish for contributing to this project.
[1] G. Landrum et al., “RDKit: Cheminformatics and Machine Learning Software,” rdkit.org, 2013.
[2] Luis A. Miccio and Gustavo A. Schwartz, “From chemical structure to quantitative polymer properties prediction through convolutional neural networks,” Polymer, 2018.
Thank you for reading!
Translated from: https://towardsdatascience.com/cnn-approach-using-image-of-the-polymer-to-predict-its-glass-transition-temperature-4a64ee450450