Bioinformatics Notes | The General Workflow of Text Mining
I. The general text-mining workflow
Reference:
http://www.sthda.com/english/wiki/text-mining-and-word-cloud-fundamentals-in-r-5-simple-steps-you-should-know
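Exercise 1: look for patterns across these 45 articles.
Exercise 2: draw a word cloud from the titles of the CNS-level (Cell, Nature, Science) papers of the TCGA project.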
Step 1: Create a text file
The text can be a local file or come from the web.
Step 2: Install and load the required packages
# Install
install.packages("tm")           # for text mining
install.packages("SnowballC")    # for text stemming
install.packages("wordcloud")    # word-cloud generator
install.packages("RColorBrewer") # color palettes
# Load
library("tm")
library("SnowballC")
library("wordcloud")
library("RColorBrewer")
Step 3: Text mining
# Read a local file
text <- readLines('data/text/text.txt')
# Or read the text file from the internet
filePath <- "http://www.sthda.com/sthda/RDoc/example-files/martin-luther-king-i-have-a-dream-speech.txt"
text <- readLines(filePath)
# Load the data as a corpus
docs <- Corpus(VectorSource(text))
The VectorSource(x) function: a vector source interprets each element of the vector x as a document.
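Because readLines() returns one character string per line, each line of the input ends up as one document in the corpus. A minimal base-R sketch of that behaviour, using an in-memory textConnection with two made-up lines in place of the real file (both the lines and the connection are assumptions for illustration):

```r
# Hypothetical two-line input standing in for data/text/text.txt
con <- textConnection(c("I have a dream", "Let freedom ring"))
text <- readLines(con)
close(con)

# One element per input line; VectorSource(text) would treat each
# element as a separate document
print(length(text))
print(text)
```

For the title lists used in the exercises below, this means one title per line gives one document per title.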
Other functions for reading data:
read.csv() is used for reading comma-separated value (csv) files, where a comma "," is used as the field separator
read.delim() is used for reading tab-separated values (.txt) files
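As a small illustration of the difference between the two readers, the sketch below parses the same made-up table once as comma-separated and once as tab-separated text. The `text` argument is used so that no files are needed; the table contents are hypothetical.

```r
# Hypothetical in-memory data; `text =` replaces a file path here
csv_data <- read.csv(text = "title,year\nPaper A,2012\nPaper B,2018")
tsv_data <- read.delim(text = "title\tyear\nPaper A\t2012\nPaper B\t2018")

# Both calls return a data frame with the same columns and rows
print(csv_data)
print(isTRUE(all.equal(csv_data, tsv_data)))
```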
Inspect the content of the document
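inspect(docs)
Text transformation
Cleaning the text data starts with transformations such as removing special characters. This is done with the tm_map() function, which here replaces special characters such as "/", "@" and "|" with a space. The next step is to remove unnecessary whitespace and convert the text to lower case.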
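The replacement itself is ordinary gsub(). A minimal base-R sketch of what it does to a single made-up string, and of why "|" has to be escaped while "/" and "@" do not (the example strings are assumptions for illustration):

```r
x <- "cancer/genome@atlas|project"

# "/" and "@" have no special meaning in a regular expression
x <- gsub("/", " ", x)
x <- gsub("@", " ", x)

# "|" means alternation in a regex, so it is escaped as "\\|"
x <- gsub("\\|", " ", x)
print(x)

# Alternative: fixed = TRUE treats the pattern as a literal string
y <- gsub("|", " ", "renal|cell", fixed = TRUE)
print(y)
```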
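# Helper that replaces every match of `pattern` with a space
toSpace <- content_transformer(function (x , pattern ) gsub(pattern, " ", x))
docs <- tm_map(docs, toSpace, "/")
docs <- tm_map(docs, toSpace, "@")
docs <- tm_map(docs, toSpace, "\\|")
The tm_map() function is also used to remove unnecessary whitespace, convert the text to lower case, and remove common stopwords such as "the" and "we".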
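Stopwords carry almost no information because they are so common in a language, so it is useful to remove them before further analysis. For stopwords, the supported languages are Danish, Dutch, English, Finnish, French, German, Hungarian, Italian, Norwegian, Portuguese, Russian, Spanish and Swedish. Language names are case sensitive.
You can also remove numbers and punctuation by passing removeNumbers and removePunctuation to tm_map().
Another important preprocessing step is stemming, which reduces words to their root form. In other words, the process strips suffixes from words so that related forms collapse to a common root. For example, stemming aims to reduce the words "moving", "moved" and "movement" to the root word "move".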
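The cleaning steps described above (lower-casing, removing numbers, stopwords and punctuation, collapsing whitespace) can be sketched on a plain character vector in base R. This is only an approximation of what the tm transformations do, not their implementation; `clean_text` and the tiny stopword list are hypothetical, and stemming is left out.

```r
# Hypothetical helper mimicking the order of the tm cleaning steps
clean_text <- function(x, stop_words) {
  x <- tolower(x)                                   # lower case
  x <- gsub("[[:digit:]]+", "", x)                  # remove numbers
  pattern <- paste0("\\b(", paste(stop_words, collapse = "|"), ")\\b")
  x <- gsub(pattern, "", x, perl = TRUE)            # remove whole-word stopwords
  x <- gsub("[[:punct:]]+", "", x)                  # remove punctuation
  x <- gsub("[[:space:]]+", " ", x)                 # collapse whitespace
  trimws(x)                                         # drop leading/trailing spaces
}

cleaned <- clean_text("The 3 Genes, and the Pathways!", c("the", "and"))
print(cleaned)
```

Stopwords are removed before punctuation, matching the order of the tm calls, so that a stopword followed by a comma is still matched as a whole word.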
# Convert the text to lower case
docs <- tm_map(docs, content_transformer(tolower))
# Remove numbers
docs <- tm_map(docs, removeNumbers)
# Remove common English stopwords
docs <- tm_map(docs, removeWords, stopwords("english"))
# Remove your own stop words
# specify your stopwords as a character vector
docs <- tm_map(docs, removeWords, c("blabla1", "blabla2"))
# Remove punctuation
docs <- tm_map(docs, removePunctuation)
# Eliminate extra white spaces
docs <- tm_map(docs, stripWhitespace)
# Text stemming
# docs <- tm_map(docs, stemDocument)
Step 4: Build a term-document matrix
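After cleaning the text data, the next step is to count how often each word occurs in order to identify popular or trending topics. With the TermDocumentMatrix() function from the tm text-mining package you can build a term-document matrix: a table of word frequencies with one row per term and one column per document.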
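To make the structure of such a matrix concrete, here is a minimal base-R sketch that counts terms per document by hand for three made-up, already-cleaned documents. It is an illustration of the idea only, not how tm builds the matrix; all names and data are hypothetical.

```r
# Hypothetical cleaned documents (one string per document)
docs_toy <- c("cancer genome atlas", "cancer cell", "genome cancer cancer")

tokens <- strsplit(docs_toy, " ", fixed = TRUE)
terms <- sort(unique(unlist(tokens)))

# Rows are terms, columns are documents, cells are counts
tdm_toy <- vapply(
  tokens,
  function(tok) tabulate(match(tok, terms), nbins = length(terms)),
  integer(length(terms))
)
rownames(tdm_toy) <- terms
colnames(tdm_toy) <- paste0("doc", seq_along(docs_toy))

# Total frequency of each term, most frequent first
v_toy <- sort(rowSums(tdm_toy), decreasing = TRUE)
d_toy <- data.frame(word = names(v_toy), freq = v_toy)
print(tdm_toy)
print(d_toy)
```

The last two steps mirror the rowSums() and data.frame() calls applied to the real matrix.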
TermDocumentMatrix() can be used as follows:
dtm <- TermDocumentMatrix(docs)
m <- as.matrix(dtm)
v <- sort(rowSums(m), decreasing=TRUE)
d <- data.frame(word = names(v), freq=v)
head(d, 10)
Step 5: Generate the word cloud
The importance of each word can be illustrated with a word cloud, as follows:
set.seed(1234)
wordcloud(words = d$word, freq = d$freq, min.freq = 1,
          max.words=200, random.order=FALSE, rot.per=0.35,
          colors=brewer.pal(8, "Dark2"))
Word Association
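Correlation is a statistical technique that shows whether, and how strongly, pairs of variables are related. It can be used to find which words tend to occur together with the most frequent words (in survey responses, for example), which helps to see the context around those words.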
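As a rough sketch of the idea, assuming the association between two terms is the Pearson correlation of their count vectors across documents, the calculation can be reproduced in base R on a small made-up matrix. `find_assocs_sketch` and the counts are hypothetical and only illustrate the principle.

```r
# Hypothetical term-document counts: rows are terms, columns are documents
m_toy <- rbind(
  cancer   = c(1, 1, 0, 1),
  cervical = c(1, 1, 0, 0),
  liver    = c(0, 0, 1, 0),
  genome   = c(1, 0, 1, 1)
)

# Correlate one term with every other term and keep those >= corlimit
find_assocs_sketch <- function(m, term, corlimit) {
  target <- m[term, ]
  others <- m[rownames(m) != term, , drop = FALSE]
  r <- apply(others, 1, function(row) cor(row, target))
  r <- r[!is.na(r) & r >= corlimit]
  sort(r, decreasing = TRUE)
}

assoc_toy <- find_assocs_sketch(m_toy, "cancer", corlimit = 0.25)
print(round(assoc_toy, 2))
```

Here only "cervical" passes the limit, because it appears in the same documents as "cancer" more often than not, while "liver" and "genome" are negatively correlated with it.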
# Find associations (dtm is the term-document matrix built in Step 4)
findAssocs(dtm, terms = c("good","work","health"), corlimit = 0.25)
You can modify the script above to find terms associated with words that occur at least 50 times, without hard-coding those words in the script.
# Find associations for words that occur at least 50 times
findAssocs(dtm, terms = findFreqTerms(dtm, lowfreq = 50), corlimit = 0.25)
Sentiment Scores
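[Personal opinion: this is not well suited to the natural sciences and is more useful in the social sciences.] Sentiments can be classified as positive, neutral or negative. They can also be expressed numerically, to better convey how positive or negative the sentiment in a body of text is.
This example uses the syuzhet package to generate sentiment scores. The package comes with four sentiment dictionaries and provides a way to access the sentiment extraction tool developed by the NLP group at Stanford. The get_sentiment() function takes two arguments: a character vector (of sentences or words) and a method. The chosen method determines which of the four available sentiment extraction methods is used. The four methods are syuzhet (the default), bing, afinn and nrc. Each method uses a different scale and therefore returns slightly different results. Note that the nrc method returns more than a single numeric score and needs additional interpretation, which is beyond the scope of this article. The description of the get_sentiment() function is taken from:
https://cran.r-project.org/web/packages/syuzhet/vignettes/syuzhet-vignette.html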
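To give a feel for dictionary-based scoring, the sketch below sums word values from a tiny made-up lexicon for each sentence. It assumes the general idea of looking up each word and adding up its value; the lexicon, its values and `score_sentence` are hypothetical and are not taken from the syuzhet package.

```r
# Hypothetical toy lexicon; NOT one of the syuzhet dictionaries
toy_lexicon <- c(good = 1, great = 2, bad = -1, terrible = -2)

# Score one sentence: split into lower-case words, look each one up,
# and sum the values of the words found in the lexicon
score_sentence <- function(sentence, lexicon) {
  words <- strsplit(tolower(sentence), "[^a-z]+")[[1]]
  sum(lexicon[words], na.rm = TRUE)
}

sentences <- c("Good work, but terrible weather", "A great, good day")
scores <- vapply(sentences, score_sentence, numeric(1),
                 lexicon = toy_lexicon, USE.NAMES = FALSE)
print(scores)
```

Because each real dictionary assigns its own values, the same sentence can receive scores on different scales depending on the method.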
# Load the sentiment package
library("syuzhet")
# regular sentiment score using get_sentiment() function and method of your choice
# please note that different methods may have different scales
syuzhet_vector <- get_sentiment(text, method="syuzhet")
# see the first elements of the vector
head(syuzhet_vector)
# see summary statistics of the vector
summary(syuzhet_vector)
Further reading: https://www.red-gate.com/simple-talk/databases/sql-server/bi-sql-server/text-mining-and-sentiment-analysis-with-r/
II. Exercise 1: Look for patterns in these papers
A Novel Copolymer Poly(Lactide-co-b-Malic Acid).pdf
A novel miRNA identified in GRSF1 complex drives the metastasis via the PIK3R3_AKT_NF-κB and TIMP3_MMP9 pathways in cervical cancer cells.pdf
A novel microRNA identified in hepatocellular carcinomas is __responsive to LEF1 and facilitates proliferation and epithelial- mesenchymal transition via targeting of NFIX.pdf
B4GALT3 up-regulation by miR-27a contributes to the oncogenic.pdf
C14orf28 downregulated by miR-519d contributes to oncogenicity and regulates apoptosis and EMT in colorectal __cancer.pdf
Contribution of hydrophobichydrophilic modification on cationic chains.pdf
DCLK1 promotes epithelial-mesenchymal transition via the PI3K_Akt_NF-κB pathway in colorectal cancer.pdf
DNA Methylation-mediated Repression of miR-941 Enhances.pdf
Downregulation of PPP2R5E expression by miR-23a suppresses.pdf
Downregulation of TNFRSF19 and RAB43 by a novel miRNA, miR-HCC3, promotes proliferation and epithelial–mesenchymal __transition in hepatocellular carcinoma cells.pdf
GRSF1-mediated MIR-G-1 promotes malignant behavior and nuclear autophagy by directly upregulating TMED5 and LMNB1 __in cervical cancer cells.pdf
HBV-encoded miR-2 functions as an oncogene by downregulating TRIM35 but upregulating RAN in liver cancer __cells.pdf
HBx-induced MiR-1269b in NF-κB dependent manner upregulates cell division cycle 40 homolog (CDC40) to promote proliferation and migration in hepatoma cells.pdf
ICP4-induced miR-101 attenuates HSV-1 replication.pdf
INPP1 up-regulation by miR-27a contributes to the growth, migration and invasion of human cervical cancer.pdf
KDM4B-mediated epigenetic silencing of miRNA-615-5p augments RAB24 to facilitate malignancy of hepatoma cells.pdf
LncRNA RSU1P2 contributes to tumorigenesis by acting as a _ceRNA against let-7a in cervical cancer cells.pdf
LncRNA n335586_miR-924_CKMT1A axis contributes to cell migration and invasion in hepatocellular carcinoma cells.pdf
Long non-coding RNA Unigene56159 promotes.pdf
MiR-124 represses vasculogenic mimicry and cell motility by.pdf
MiR-23a Facilitates the Replication of.pdf
MiR-346 Up-regulates Argonaute 2 (AGO2) Protein Expression to Augment the Activity of.pdf
MiR-HCC2 Up-regulates BAMBI and ELMO1 Expression to Facilitate the Proliferation and EMT of Hepatocellular Carcinoma Cells.pdf
MicroRNA-142-3p, a new regulator of RAC1, suppresses the migration.pdf
MicroRNA-19a and -19b regulate cervical carcinoma cell proliferation.pdf
MicroRNA-214 Suppresses Growth and Invasiveness.pdf
NF-κB-modulated miR-130a targets TNF in cervical cancer cells.pdf
PIWIL4 regulates cervical cancer cell line growth and is involved in.pdf
TCDD-induced antagonism of MEHP-mediated migration and __invasion partly involves aryl hydrocarbon receptor in MCF7 breast cancer cells.pdf
USP14 de-ubiquitinates vimentin and miR-320a modulates USP14 and vimentin to contribute to malignancy in gastric _cancer cells.pdf
miR-10a suppresses colorectal cancer metastasis by modulating __the epithelial-to-mesenchymal transition and anoikis.pdf
miR-1228 promotes the proliferation and.pdf
miR-17-5p up-regulates YES1 to modulate the cell cycle progression and apoptosis in ovarian cancer cell lines.pdf
miR-212132.pdf
miR-23a Targets Interferon Regulatory Factor 1 and.pdf
miR-23a promotes IKKa expression but.pdf
miR-24-3p Suppresses Malignant Behavior of Lacrimal Adenoid Cystic Carcinoma by Targeting PRKCH to Regulate p53_p21 Pathway.pdf
miR-30a reverses TGF-β2-induced migration and EMT in posterior capsular opacification by targeting Smad2.pdf
miR-371-5p down-regulates pre mRNA processing factor 4 homolog B.pdf
miR-377-3p drives malignancy characteristics via upregulating GSK-3 expression and activating NF-κB pathway in hCRC cells.pdf
miR-3928v is induced by HBx via NF-kB_EGR1 and contributes __to hepatocellular carcinoma malignancy by down-regulating VDAC3.pdf
miR-484 suppresses proliferation and epithelial–mesenchymal __transition by targeting ZEB1 and SMAD2 in cervical cancer cells.pdf
miR-490-3p Modulates Cell Growth and Epithelial to Mesenchymal Transition of Epithelial to Mesenchymal Transition of Targeting Endoplasmic Reticulum-Golgi Intermediate Compartment Protein 3 (ERGIC3).pdf
miR-639 Expression Is Silenced by DNMT3A-Mediated Hypermethylation and Functions as a Tumor Suppressor in Liver Cancer Cells.pdf
microRNA-34a-Upregulated Retinoic Acid-Inducible Gene-I Promotes Apoptosis and Delays Cell Cycle Transition in Cervical Cancer Cells.pdf
Complete code:
#install.packages('wordcloud2')
#devtools::install_github("lchiffon/wordcloud2")
# In the end, wordcloud2 version 0.2.0 was installed from a local file
# Install
# install.packages("tm")           # for text mining
# install.packages("SnowballC")    # for text stemming
# install.packages("wordcloud")    # word-cloud generator
# install.packages("RColorBrewer") # color palettes
# Load
library("tm")
library("SnowballC")
library("wordcloud")
library("RColorBrewer")
text <- readLines('data/text/text.txt')
# Load the data as a corpus
docs <- Corpus(VectorSource(text))
toSpace <- content_transformer(function (x , pattern ) gsub(pattern, " ", x))
docs <- tm_map(docs, toSpace, "/")
docs <- tm_map(docs, toSpace, "@")
docs <- tm_map(docs, toSpace, "\\|")
# Drop the file extension; the dot is escaped so it matches a literal "."
docs <- tm_map(docs, toSpace, "\\.pdf")
# Convert the text to lower case
docs <- tm_map(docs, content_transformer(tolower))
# Remove numbers
docs <- tm_map(docs, removeNumbers)
# Remove English common stopwords
docs <- tm_map(docs, removeWords, stopwords("english"))
# Remove your own stop words
# specify your stopwords as a character vector
docs <- tm_map(docs, removeWords, c("characterization", "molecular", "comprehensive", "cell", "analysis", "landscape"))
# Remove punctuation
docs <- tm_map(docs, removePunctuation)
# Eliminate extra white spaces
docs <- tm_map(docs, stripWhitespace)
# Text stemming
# docs <- tm_map(docs, stemDocument)
dtm <- TermDocumentMatrix(docs)
m <- as.matrix(dtm)
v <- sort(rowSums(m), decreasing=TRUE)
d <- data.frame(word = names(v), freq=v)
head(d, 10)
wordcloud(words = d$word, freq = d$freq, min.freq = 1,
          max.words=200, random.order=FALSE, rot.per=0.35,
          colors=brewer.pal(8, "Dark2"))
Next, look at the words most closely associated with the two frequent keywords, cancer and mir:
findAssocs(dtm, terms = c("cancer","mir"), corlimit = 0.25)
findAssocs(dtm, terms = findFreqTerms(dtm, lowfreq = 10), corlimit = 0.25)
$cancer
             cervical             apoptosis            colorectal            metastasis
                 0.56                  0.36                  0.36                  0.29
epithelialmesenchymal             functions                 liver
                 0.29                  0.29                  0.29
III. Exercise 2: Official papers of the TCGA project
The official TCGA project publications are listed at: https://www.cancer.gov/about-nci/organization/ccg/research/structural-genomics/tcga/publications
Comprehensive genomic characterization defines human glioblastoma genes and core pathways
Integrated genomic analyses of ovarian carcinoma
Comprehensive molecular characterization of human colon and rectal cancer
Comprehensive molecular portraits of human breast tumours
Comprehensive genomic characterization of squamous cell lung cancers
Integrated genomic characterization of endometrial carcinoma
Genomic and epigenomic landscapes of adult de novo acute myeloid leukemia
Comprehensive molecular characterization of clear cell renal cell carcinoma
The Cancer Genome Atlas Pan-Cancer analysis project
The somatic genomic landscape of glioblastoma
Comprehensive molecular characterization of urothelial bladder carcinoma
Comprehensive molecular profiling of lung adenocarcinoma
Multiplatform analysis of 12 cancer types reveals molecular classification within and across tissues of origin
The Somatic Genomic Landscape of Chromophobe Renal Cell Carcinoma
Comprehensive molecular characterization of gastric adenocarcinoma
Integrated genomic characterization of papillary thyroid carcinoma
Comprehensive genomic characterization of head and neck squamous cell carcinomas
Genomic Classification of Cutaneous Melanoma
Comprehensive, Integrative Genomic Analysis of Diffuse Lower-Grade Gliomas
Comprehensive Molecular Portraits of Invasive Lobular Breast Cancer
The Molecular Taxonomy of Primary Prostate Cancer
Comprehensive Molecular Characterization of Papillary Renal-Cell Carcinoma
Comprehensive Pan-Genomic Characterization of Adrenocortical Carcinoma
Distinct patterns of somatic genome alterations in lung adenocarcinomas and squamous cell carcinomas
Integrated genomic characterization of oesophageal carcinoma
Comprehensive Molecular Characterization of Pheochromocytoma and Paraganglioma
Integrated Molecular Characterization of Uterine Carcinosarcoma
Integrative Genomic Analysis of Cholangiocarcinoma Identifies Distinct IDH-Mutant Molecular Profiles
Integrated genomic and molecular characterization of cervical cancer
Comprehensive and Integrative Genomic Characterization of Hepatocellular Carcinoma
Integrative Analysis Identifies Four Molecular and Clinical Subsets in Uveal Melanoma
Integrated Genomic Characterization of Pancreatic Ductal Adenocarcinoma
Comprehensive Molecular Characterization of Muscle-Invasive Bladder Cancer
Comprehensive and Integrated Genomic Characterization of Adult Soft Tissue Sarcomas
The Integrated Genomic Landscape of Thymic Epithelial Tumors
Pan-cancer Alterations of the MYC Oncogene and Its Proximal Network across the Cancer Genome Atlas
Scalable Open Science Approach for Mutation Calling of Tumor Exomes Using Multiple Genomic Pipelines
Molecular Characterization and Clinical Relevance of Metabolic Expression Subtypes in Human Cancers
Systematic Analysis of Splice-Site-Creating Mutations in Cancer
Somatic Mutational Landscape of Splicing Factor Genes and Their Functional Consequences across 33 Cancer Types
The Cancer Genome Atlas Comprehensive Molecular Characterization of Renal Cell Carcinoma
Pan-Cancer Analysis of lncRNA Regulation Supports Their Targeting of Cancer Genes in Each Tumor Context
Spatial Organization and Molecular Correlation of Tumor-Infiltrating Lymphocytes Using Deep Learning on Pathology Images
Machine Learning Detects Pan-cancer Ras Pathway Activation in The Cancer Genome Atlas
Genomic and Molecular Landscape of DNA Damage Repair Deficiency across The Cancer Genome Atlas
Driver Fusions and Their Implications in the Development and Treatment of Human Cancers
Genomic, Pathway Network, and Immunologic Features Distinguishing Squamous Carcinomas
Integrated Genomic Analysis of the Ubiquitin Pathway across Cancer Types
SnapShot: TCGA-Analyzed Tumors
The Cancer Genome Atlas: Creating Lasting Value beyond Its Data
Machine Learning Identifies Stemness Features Associated with Oncogenic Dedifferentiation
Oncogenic Signaling Pathways in The Cancer Genome Atlas
Perspective on Oncogenic Processes at the End of the Beginning of Cancer Genomics
Comprehensive Characterization of Cancer Driver Genes and Mutations
An Integrated TCGA Pan-Cancer Clinical Data Resource to Drive High-Quality Survival Outcome Analytics
Pathogenic Germline Variants in 10,389 Adult Cancers
A Pan-Cancer Analysis of Enhancer Expression in Nearly 9000 Patient Samples
Genomic and Functional Approaches to Understanding Cancer Aneuploidy
A Comprehensive Pan-Cancer Molecular Study of Gynecologic and Breast Cancers
Comparative Molecular Analysis of Gastrointestinal Adenocarcinomas
lncRNA Epigenetic Landscape Analysis Identifies EPIC1 as an Oncogenic lncRNA that Interacts with MYC and Promotes Cell-Cycle Progression in Cancer
The Immune Landscape of Cancer
Integrated Molecular Characterization of Testicular Germ Cell Tumors
Comprehensive Analysis of Alternative Splicing Across Tumors from 8,705 Patients
A Pan-Cancer Analysis Reveals High-Frequency Genetic Alterations in Mediators of Signaling by the TGF-β Superfamily
Integrative Molecular Characterization of Malignant Pleural Mesothelioma
The chromatin accessibility landscape of primary human cancers
Comprehensive Molecular Characterization of the Hippo Signaling Pathway in Cancer
Before and After: Comparison of Legacy and Harmonized TCGA Genomic Data Commons' Data
Comprehensive Analysis of Genetic Ancestry and Its Molecular Correlates in Cancer
Whole-genome characterization of lung adenocarcinomas lacking alterations in the RTK/RAS/RAF pathway
Complete code:
TCGALett <- readLines('data/text/TCGA-literature.txt')
docs <- Corpus(VectorSource(TCGALett))
toSpace <- content_transformer(function (x , pattern ) gsub(pattern, " ", x))
docs <- tm_map(docs, toSpace, "/")
docs <- tm_map(docs, toSpace, "-")
docs <- tm_map(docs, toSpace, "\\|")
# Convert the text to lower case
docs <- tm_map(docs, content_transformer(tolower))
# Remove numbers
docs <- tm_map(docs, removeNumbers)
# Remove English common stopwords
docs <- tm_map(docs, removeWords, stopwords("english"))
# Remove your own stop words
# specify your stopwords as a character vector
docs <- tm_map(docs, removeWords, c("characterization", "molecular", "comprehensive", "cell", "analysis", "landscape"))
# Remove punctuation
docs <- tm_map(docs, removePunctuation)
# Eliminate extra white spaces
docs <- tm_map(docs, stripWhitespace)
# Text stemming
# docs <- tm_map(docs, stemDocument)
dtm <- TermDocumentMatrix(docs)
m <- as.matrix(dtm)
v <- sort(rowSums(m), decreasing=TRUE)
d <- data.frame(word = names(v), freq=v)
head(d, 10)
wordcloud(words = d$word, freq = d$freq, min.freq = 1,
          max.words=200, random.order=FALSE, rot.per=0.35,
          colors=brewer.pal(8, "Dark2"))
findAssocs(dtm, terms = c("cancer","genomic"), corlimit = 0.25)
findAssocs(dtm, terms = findFreqTerms(dtm, lowfreq = 10), corlimit = 0.25)
# wordcloud2() comes from the wordcloud2 package
library("wordcloud2")
wordcloud2(d, size = 0.6)
> findAssocs(dtm, terms = findFreqTerms(dtm, lowfreq = 10), corlimit = 0.25)
$genomic
numeric(0)
$carcinoma
         renal      papillary       analyses        ovarian    endometrial
          0.57           0.40           0.28           0.28           0.28
         clear     urothelial    chromophobe        thyroid adrenocortical
          0.28           0.28           0.28           0.28           0.28
   oesophageal hepatocellular
          0.28           0.28
$integrated
      analyses        ovarian    endometrial        thyroid    oesophageal
          0.27           0.27           0.27           0.27           0.27
carcinosarcoma        uterine       cervical         ductal     pancreatic
          0.27           0.27           0.27           0.27           0.27
      sarcomas           soft         tissue     epithelial         thymic
          0.27           0.27           0.27           0.27           0.27
     ubiquitin      analytics          drive        outcome        quality
          0.27           0.27           0.27           0.27           0.27
      resource       survival           germ     testicular
          0.27           0.27           0.27           0.27
$cancer
       pan      atlas     genome    project   oncogene   proximal    context regulation
      0.56       0.54       0.42       0.31       0.31       0.31       0.31       0.31
  supports  targeting activation    detects        myc     across
      0.31       0.31       0.31       0.31       0.30       0.28
For tips on drawing better-looking word clouds, see the article: R Plotting Notes | Drawing Word Clouds