GSVA和生存分析
為什么一定要是預后相關基因呢 | 生信菜鳥團 (bio-info-trainee.com)?
200塊的代碼我的學徒免費送給你,GSVA和生存分析 (qq.com)
我們生信技能樹B站又免費悄咪咪的上線了一個GSVA生存分析教學視頻,然后學徒馬上就學習了,居然是主動學習的,讓我喜出望外!所以馬上分享了他匯報給我的學習筆記,希望對你有幫助哦!
(現在學習量和彈幕都非常少,大家的機會來了哦!)
https://www.bilibili.com/video/av81874183
前奏
最近做的生存分析都是奇奇怪怪的,從來沒有重復出作者的圖。哈哈哈,但是我相信自己的代碼是沒有錯的,只是參數設置跟作者不同而已。
最近跟著Jimmy老師的視頻,學習用GSVA做生存分析,反反復復做了幾次都無法復現作者的結果。
第一次是我找的基因集跟作者不一致,第二次可能參數不同,也只有一個顯著性結果。
看了這篇推文,幫助你節約200塊錢,可以多搓一頓火鍋了。
看上圖👆,不是我騙你,真的有人代碼賣200塊。好啦,正文開始啦!
學習前必須要知道的生物學知識
01
生存分析(Survival analysis)是指根據試驗或調查得到的數據對生物或人的生存時間進行分析和推斷,研究生存時間和結局與眾多影響因素間關系及其程度大小的方法,也稱生存率分析或存活率分析。常見分析的用壽命表法、Kaplan-Meier,Cox回歸等等. 構建在于病人分組后進行比較!
如果是多個基因,可以通過GSVA來進行分組再去做生存分析。
02
GSVA?(The Gene Set Variation Analysis package for microarray and RNA-seq data),算法太復雜,詳情搜對應的R包。
我是大標題-1.讀文獻
? ? ? ?第一步需要對文獻進行解析,文獻來源為Cell. 2018 May ?:Chemoresistance Evolution in Triple-Negative Breast Cancer Delineated by Single-Cell Sequencing
? ? ? ? 暫時跳過了這一步,因為時間有限,就先看了Jimmy老師的推文使用單細胞多組學探索TNBC病人的新輔助化療療效
大標題-2.數據庫查詢指定基因集的基因列表
-
msigdb數據庫(找出基因集的基因)
-
根據一些feature分類;可能是unique,可能是multi
-
下載msigd數據庫的all.gmt數據后,用Linux處理
2-1.下載gmt數據
打開對應的官網(http://software.broadinstitute.org/gsea/msigdb/index.jsp)
?
?
👆就是對應的gmt文件,我們下載基因名稱的gmt文件
然后打開終端,cd到文件夾
ls?-lh結果如👇
?
這里做了兩次,我分兩次講解,第二次會總結第一次的錯誤
下面是第一次的workflow
找到你想要的基因集的基因
###尋找基因集 grep?-i?epithelial_mesenchymal?* #h.all.v6.2.entrez.gmt:HALLMARK_EPITHELIAL_MESENCHYMAL_TRANSITION????http://www.broadinstitute.org/gsea/msigdb/cards/HALLMARK_EPITHELIAL_MESENCHYMAL_TRANSITION??1281????1290????12892200????1277????2335????1293????5054????1278????1282????1284????1462????3486????7045????66784060????3915????4015????3918????1490????6876????1294????4017????1292????3685????7058????13074837????7168????1000????4313????1301????7057????2191????633?871?11167???10631???70706696????3371????7980????22795???3693????4314#???7431????4016????10516???1303????2006????10091311????6695????649?9235????3909????7076????5768????7078????7412????3491????10085???800?57692???2192????6443????1893????3908????10272???7169????3624????1601????2014????10409???36784256????7422????2919????7474????6382????5352????5118????26585???3688????50509???388?56542247????6641????1647????4232????131578??4982????966?59??30008???4147????26577???52702817????25890???2517????6586????284217??56937???1296????2201????3485????5217????6385????960?4616????3576????11010???290?64175???7424????4323????6444????5351????4148????10398???6535813?5396????51330???2331????3398????2669????5329????4638????7040????6422????8985????3569333?2199????4487????5806????8325????3725????10979???22943???6591????667?7171????16342697????5376????3487????1314????4035????3673????2316????8076????5744????7049????6424????39565999????1004????6303????4907????1809????5479????7052????6445????3690????8572????115908??18429244????374?3600????4176????2619????5645????23705???5021????7857????6372????4312????7128822?10486???25878???2303????50863???2026????355?627?8038????5817????6387????51599353????4853????79709???2882????7456這里我下載成entrezID.gmt,后面重新下載。正確基因集的結果如👇
grep?-i?epithelial_mesenchymal?* #HALLMARK_EPITHELIAL_MESENCHYMAL_TRANSITION????http://www.gsea-msigdb.org/gsea/msigdb/cards/HALLMARK_EPITHELIAL_MESENCHYMAL_TRANSITIONCOL3A1???COL5A2??COL5A1??FBN1????COL1A1??FN1?COL6A3??SERPINE1????COL1A2??COL4A1??COL4A2??VCAN????IGFBP3??TGFBI???SPARC???LUM?LAMC1???LOX?LAMC2???CCN2????TAGLN???COL7A1??LOXL2???COL6A2??ITGAV???THBS2???COL16A1?NNMT????TPM1????CDH2????MMP2????COL11A1?THBS1???FAP?BGN?SERPINH1????FSTL1???POSTN???THY1????SPP1????TNC?TFPI2???NID2????ITGB5???MMP3????VIM?LOXL1???FBLN5???COL12A1?ELN?CDH11???COMP????SPOCK1??BMP1????IL32????LAMA3???TIMP1???QSOX1???TIMP3???VCAM1???CCN1????EDIL3???CALD1???MAGEE1??FBLN1???SGCB????ECM1????LAMA2???FSTL3???TPM2????INHBA???DAB2????EMP3????BASP1???ITGA5???MGP?VEGFA???CXCL1???WNT5A???SDC1????PLOD2???PCOLCE??GREM1???ITGB1???COL5A3??RHOB????HTRA1???FGF2????SNTB1???GADD45A?MEST????LRRC15??TNFRSF11B???CD59????ACTA2???EFEMP2??MATN2???PCOLCE2?SERPINE2????GPC1????ABI3BP??FUCA1???SLIT3???LAMA1???PMEPA1??COL8A2??FBN2????IGFBP2??PFN2????SDC4????CD44????GADD45B?CXCL8???GLIPR1??ANPEP???P3H1????VEGFC???MMP14???SGCD????PLOD1???MATN3???MYL9????SLC6A8??CALU????PRRX1???TNFRSF12A???FMOD????ID2?GEM?PLAUR???MYLK????TGFB1???SFRP1???PLOD3???IL6?APLP1???FBLN2???MSX1????PTX3????FZD8????JUN?FERMT2??DKK1????SNAI2???DST?TPM4????DCN?GJA1????PMP22???IGFBP4??COPA????LRP1????ITGA2???FLNA????MFAP5???PTHLH???TGFBR3??SFRP4???LGALS1??RGS4????CDH6????SAT1????NT5E????DPYSL3??PPIB????TGM2????SGCG????ITGB3???PDLIM4??CTHRC1??ECM2????CRLF1???AREG????IL15????MCM7????GAS1????PRSS2???CADM1???OXTR????SCG2????CXCL6???MMP1????TNFAIP3?CAPG????CAP2????MXRA5???FOXC2???NTM?ENO2????FAS?BDNF????ADAM12??PVR?CXCL12??PDGFRB??SLIT2???NOTCH2??COLGALT1????GPX7????WIPF1 grep?-i?HALLMARK_HYPOXIA?* #msigdb.v7.0.symbols.gmt:HALLMARK_HYPOXIA????http://www.gsea-msigdb.org/gsea/msigdb/cards/HALLMARK_HYPOXIAPGK1???PDK1????GBE1????PFKL????ALDOA???ENO2????PGM1????NDRG1???HK2?ALDOC???GPI?MXI1????SLC2A1??P4HA1ADM????P4HA2???ENO1????PFKP????AK4?FAM162A?PFKFB3??VEGFA???BNIP3L??TPI1????ERO1A???KDM3A???CCNG2???LDHA????GYS1????GAPDH???BHLHE40?ANGPTL4?JUN?SERPINE1????LOX?GCK?PPFIA4??MAFF????DDIT4???SLC2A3??IGFBPNFIL3??FOS?RBPJ????HK1?CITED2??ISG20???GALK1???WSB1????PYGM????STC1????ZNF292??BTG1????PLIN2???CSRP2VLDLR??JMJD6???EXT1????F3??PDK3????ANKZF1??UGP2????ALDOB???STC2????ERRFI1??ENO3????PNRC1???HMOX1???PGF?GAPDHS??CHST2???TMEM45A?BCAN????ATF3????CAV1????AMPD3???GPC3????NDST1???IRS2????SAP30???GAA?SDC4????STBD1IER3???PKLR????IGFBP1??PLAUR???CAVIN3??CCN5????LARGE1??NOCT????S100A4??RRAGD???ZFP36???EGFR????EDN2????IDS?CDKN1A??RORA????DUSP1???MIF?PPP1R3C?DPYSL4??KDELR3??DTNA????ADORA2B?HS3ST1??CAVIN1??NR3C1???KLF6????GPC4????CCN1????TNFAIP3?CA12????HEXA????BGN?PPP1R15A????PGM2????PIM1????PRDX5???NAGK????CDKN1B??BRS3????TKTL1MT1E???ATP7A???MT2A????SDC3????TIPARP??PKP1????ANXA2???PGAM2???DDIT3???PRKCA???SLC37A4?CXCR4???EFNA3???CP??KLF7????CCN2????CHST3???TPD52???LXN?B4GALNT2????PPARGC1A????BCL2????GCNT2???HAS1????KLHL24??SCARBSLC25A1????SDC2????CASP6???VHL?FOXO3???PDGFB???B3GALT6?SLC2A5??SRPX????EFNA1???GLRX????ACKR3???PAM?TGFBIDCN????SIAH2???PLAC8???FBP1????TPST2???PHKG1???MYH9????CDKN1C??GRHPR???PCK1????INHA????HSPA5???NDST2???NEDD4TPBG???XPNPEP1?IL6?SLC6A6??MAP3K1??LDHC????AKAP12??TES?KIF5A???LALBA???COL5A1??GPC1????HDLBP???ILVBLNCAN???TGM2????ETS1????HOXB9???SELENBP1????FOSL2???SULT2B1?TGFB3 grep?-i?HALLMARK_ANGIOGENESIS?* #msigdb.v7.0.symbols.gmt:HALLMARK_ANGIOGENESIS????http://www.gsea-msigdb.org/gsea/msigdb/cards/HALLMARK_ANGIOGENESIS??VCAN????POSTN???FSTL1???LRPAP1??STC1????LPL?VEGFA???PF4?THBD????FGFR1???TNFRSF21????CCND2COL5A2?ITGAV???SERPINA5????KCNJ8???APP?JAG1????COL3A1??SPP1????NRP1????OLR1????PDGFA???PTK2????SLCO2A1?PGLYRP1?VAV2????S100A4??MSX1????VTN?TIMP1???APOH????PRG2????JAG2????LUM?CXCL6 grep?-i?HALLMARK_PI3K_AKT_MTOR_SIGNALING?* #msigdb.v7.0.symbols.gmt:HALLMARK_PI3K_AKT_MTOR_SIGNALING????http://www.gsea-msigdb.org/gsea/msigdb/cards/HALLMARK_PI3K_AKT_MTOR_SIGNALING???MAPK8???PIK3R3??GRB2????NFKBIB??MAP2K6??MAPK9???AKT1????MAPK1???PLCG1???TRIB3GSK3B??MAP2K3??CDKN1A??RAC1????RIPK1???AKT1S1??ACTR2???PRKAR2A?YWHAB???HRAS????PDK1????PIKFYVE?TBK1????ACTR3E2F1???MYD88???ITPR2???SQSTM1??RPS6KA1?PTPN11??MAPKAP1?PLCB1???RAF1????CAMK4???RPTOR???CFL1????CDK4????TRAF2GNGT1??UBE2N???ADCY2???CDKN1B??VAV3????FGF6????ECSIT???RALB????ARF1????MKNK1???CDK1????PTEN????ARHGDIA?GRK2????FGF17???DDIT3???IRAK4???TIAM1???CDK2????SFN?PRKCB???GNA14???EIF4E???CLTC????TSC2????FGF22???PPP1CA??DUSP3HSP90B1????IL4?STAT2???SLA?EGFR????PLA2G12A????MAPK10??CALR????THEM4???RIT1????MKNK2???PPP2R1B?CAB39ARPC3??PITX2???NCK1????IL2RG???PFN1????FASLG???NOD1????DAPP1???UBE2D3??CAB39???AP2M1???MAP3K7??PRKAG1??CSNK2PRKAA2?ATF1????SLC2A1??PIN1????TNFRSF1A????LCK?RPS6KA3?NGF?CXCR4???ACACA???SMAD2???PAK4 grep?-i?BIOCARTA_ECM_PATHWAY?* #msigdb.v7.0.symbols.gmt:BIOCARTA_ECM_PATHWAY????http://www.gsea-msigdb.org/gsea/msigdb/cards/BIOCARTA_ECM_PATHWAY???MAPK3???PIK3CG??HRAS????MYL2????PIK3R1??MAP2K1??RAF1????RHOA????ROCK1???FYN?MAPK1???SHC1????TLN1????PFN1????GSN?DIAPH1??PIK3CA??ARHGAP5?ITGB1這里有一個基因集找不到,只有查看來源文獻Loss of E-cadherin promotes metastasis via multiple downstream transcriptional pathways
結果是文獻套文獻,套娃么。。放棄掉,這5個應該也夠了。
下面是用哪個metabric下載表達矩陣
2-2.metabric下載表達矩陣
官網為[http://www.cbioportal.org/datasets],這里下載需要的BRCA的數據
?
下載完成解壓后內容如👇
?
主要用到臨床信息、表達矩陣與體細胞變異文件
后面就可以放在R語言了完成🌶
2-3.制作GSVA的輸入
主要是制作表達矩陣與臨床信息
###讀入表達矩陣與臨床信息 rm(list?=?ls()) wkdir=getwd()?#為工作目錄賦值 options(stringsAsFactors?=?F) ##臨床信息 clin=read.table(file.path(wkdir,'brca_metabric','data_clinical_patient.txt'),header?=?T,sep='\t') ##表達矩陣 expr=read.table(file.path(wkdir,'brca_metabric','data_expression_median.txt'),header?=?T,sep='\t') expr[1:4,1:4] #??Hugo_Symbol?Entrez_Gene_Id??MB.0362??MB.0346 1????????RERE????????????473?8.676978?9.653589 2??????RNF165?????????494470?6.075331?6.687887 3????CD049690?????????????NA?5.453928?5.454185 4????BC033982?????????????NA?4.994525?5.346010 #需要將第一列作為行名并去掉第一、二列 rownames(expr)=expr$Hugo_Symbol expr=expr[,-c(1,2)] ###這里Jimmy老師為了節省空間,只取了取前兩位數expr?<-?apply(expr,?2,?function(x){as.numeric(format(x,digits?=?2))?#digit代表數字位數})rownames(expr)=gsexpr=na.omit(expr)?#忽略NA值接下來還需要做一些變化,因為表達矩陣的樣本名是以"."分割
?
而臨床信息是"-"分割
?
clin$PATIENT_ID?=?gsub('-','.',clin$PATIENT_ID)?##將“-”變為“.” expr=clin[match(colnames(expr),clin$PATIENT_ID),]?#防止錯位進行配對矯正 ###取TNBC亞型 clin1=clin[clin$CLAUDIN_SUBTYPE%in%c("Basal","claudin-low"),] clin1=expr[,colnames(expr)%in%clin1$PATIENT_ID] save(clin1,clin1,file?=?"GSVA_input.Rdata")可以看到有427個病人屬于這個亞型。如果不看Jimmy老師的視屏,我不會知道哪些是TNBC的。😃,生物學背景真的很重要,見推文沒有生物學背景的數據分析很危險
2-4.gmt文件制作
用Linux將前面得到的基因集復制到一個文檔,后綴為gmt
cat?>?GSVA_input.gmt msigdb.v7.0.symbols.gmt:HALLMARK_EPITHELIAL_MESENCHYMAL_TRANSITION????http://www.gsea-msigdb.org/gsea/msigdb/cards/HALLMARK_EPITHELIAL_MESENCHYMAL_TRANSITIONCOL3A1???COL5A2??COL5A1??FBN1????COL1A1??FN1?COL6A3??SERPINE1????COL1A2??COL4A1??COL4A2??VCAN????IGFBP3??TGFBI???SPARC???LUM?LAMC1???LOX?LAMC2???CCN2????TAGLN???COL7A1??LOXL2???COL6A2??ITGAV???THBS2???COL16A1?NNMT????TPM1????CDH2????MMP2????COL11A1?THBS1???FAP?BGN?SERPINH1????FSTL1???POSTN???THY1????SPP1????TNC?TFPI2???NID2????ITGB5???MMP3????VIM?LOXL1???FBLN5???COL12A1?ELN?CDH11???COMP????SPOCK1??BMP1????IL32????LAMA3???TIMP1???QSOX1???TIMP3???VCAM1???CCN1????EDIL3???CALD1???MAGEE1??FBLN1???SGCB????ECM1????LAMA2???FSTL3???TPM2????INHBA???DAB2????EMP3????BASP1???ITGA5???MGP?VEGFA???CXCL1???WNT5A???SDC1????PLOD2???PCOLCE??GREM1???ITGB1???COL5A3??RHOB????HTRA1???FGF2????SNTB1???GADD45A?MEST????LRRC15??TNFRSF11B???CD59????ACTA2???EFEMP2??MATN2???PCOLCE2?SERPINE2????GPC1????ABI3BP??FUCA1???SLIT3???LAMA1???PMEPA1??COL8A2??FBN2????IGFBP2??PFN2????SDC4????CD44????GADD45B?CXCL8???GLIPR1??ANPEP???P3H1????VEGFC???MMP14???SGCD????PLOD1???MATN3???MYL9????SLC6A8??CALU????PRRX1???TNFRSF12A???FMOD????ID2?GEM?PLAUR???MYLK????TGFB1???SFRP1???PLOD3???IL6?APLP1???FBLN2???MSX1????PTX3????FZD8????JUN?FERMT2??DKK1????SNAI2???DST?TPM4????DCN?GJA1????PMP22???IGFBP4??COPA????LRP1????ITGA2???FLNA????MFAP5???PTHLH???TGFBR3??SFRP4???LGALS1??RGS4????CDH6????SAT1????NT5E????DPYSL3??PPIB????TGM2????SGCG????ITGB3???PDLIM4??CTHRC1??ECM2????CRLF1???AREG????IL15????MCM7????GAS1????PRSS2???CADM1???OXTR????SCG2????CXCL6???MMP1????TNFAIP3?CAPG????CAP2????MXRA5???FOXC2???NTM?ENO2????FAS?BDNF????ADAM12??PVR?CXCL12??PDGFRB??SLIT2???NOTCH2??COLGALT1????GPX7????WIPF1 msigdb.v7.0.symbols.gmt:HALLMARK_HYPOXIA????http://www.gsea-msigdb.org/gsea/msigdb/cards/HALLMARK_HYPOXIAPGK1???PDK1????GBE1????PFKL????ALDOA???ENO2????PGM1????NDRG1???HK2?ALDOC???GPI?MXI1????SLC2A1??P4HA1ADM????P4HA2???ENO1????PFKP????AK4?FAM162A?PFKFB3??VEGFA???BNIP3L??TPI1????ERO1A???KDM3A???CCNG2???LDHA????GYS1????GAPDH???BHLHE40?ANGPTL4?JUN?SERPINE1????LOX?GCK?PPFIA4??MAFF????DDIT4???SLC2A3??IGFBPNFIL3??FOS?RBPJ????HK1?CITED2??ISG20???GALK1???WSB1????PYGM????STC1????ZNF292??BTG1????PLIN2???CSRP2VLDLR??JMJD6???EXT1????F3??PDK3????ANKZF1??UGP2????ALDOB???STC2????ERRFI1??ENO3????PNRC1???HMOX1???PGF?GAPDHS??CHST2???TMEM45A?BCAN????ATF3????CAV1????AMPD3???GPC3????NDST1???IRS2????SAP30???GAA?SDC4????STBD1IER3???PKLR????IGFBP1??PLAUR???CAVIN3??CCN5????LARGE1??NOCT????S100A4??RRAGD???ZFP36???EGFR????EDN2????IDS?CDKN1A??RORA????DUSP1???MIF?PPP1R3C?DPYSL4??KDELR3??DTNA????ADORA2B?HS3ST1??CAVIN1??NR3C1???KLF6????GPC4????CCN1????TNFAIP3?CA12????HEXA????BGN?PPP1R15A????PGM2????PIM1????PRDX5???NAGK????CDKN1B??BRS3????TKTL1MT1E???ATP7A???MT2A????SDC3????TIPARP??PKP1????ANXA2???PGAM2???DDIT3???PRKCA???SLC37A4?CXCR4???EFNA3???CP??KLF7????CCN2????CHST3???TPD52???LXN?B4GALNT2????PPARGC1A????BCL2????GCNT2???HAS1????KLHL24??SCARBSLC25A1????SDC2????CASP6???VHL?FOXO3???PDGFB???B3GALT6?SLC2A5??SRPX????EFNA1???GLRX????ACKR3???PAM?TGFBIDCN????SIAH2???PLAC8???FBP1????TPST2???PHKG1???MYH9????CDKN1C??GRHPR???PCK1????INHA????HSPA5???NDST2???NEDD4TPBG???XPNPEP1?IL6?SLC6A6??MAP3K1??LDHC????AKAP12??TES?KIF5A???LALBA???COL5A1??GPC1????HDLBP???ILVBLNCAN???TGM2????ETS1????HOXB9???SELENBP1????FOSL2???SULT2B1?TGFB3 msigdb.v7.0.symbols.gmt:HALLMARK_ANGIOGENESIS????http://www.gsea-msigdb.org/gsea/msigdb/cards/HALLMARK_ANGIOGENESIS??VCAN????POSTN???FSTL1???LRPAP1??STC1????LPL?VEGFA???PF4?THBD????FGFR1???TNFRSF21????CCND2COL5A2?ITGAV???SERPINA5????KCNJ8???APP?JAG1????COL3A1??SPP1????NRP1????OLR1????PDGFA???PTK2????SLCO2A1?PGLYRP1?VAV2????S100A4??MSX1????VTN?TIMP1???APOH????PRG2????JAG2????LUM?CXCL6 msigdb.v7.0.symbols.gmt:HALLMARK_PI3K_AKT_MTOR_SIGNALING????http://www.gsea-msigdb.org/gsea/msigdb/cards/HALLMARK_PI3K_AKT_MTOR_SIGNALING???MAPK8???PIK3R3??GRB2????NFKBIB??MAP2K6??MAPK9???AKT1????MAPK1???PLCG1???TRIB3GSK3B??MAP2K3??CDKN1A??RAC1????RIPK1???AKT1S1??ACTR2???PRKAR2A?YWHAB???HRAS????PDK1????PIKFYVE?TBK1????ACTR3E2F1???MYD88???ITPR2???SQSTM1??RPS6KA1?PTPN11??MAPKAP1?PLCB1???RAF1????CAMK4???RPTOR???CFL1????CDK4????TRAF2GNGT1??UBE2N???ADCY2???CDKN1B??VAV3????FGF6????ECSIT???RALB????ARF1????MKNK1???CDK1????PTEN????ARHGDIA?GRK2????FGF17???DDIT3???IRAK4???TIAM1???CDK2????SFN?PRKCB???GNA14???EIF4E???CLTC????TSC2????FGF22???PPP1CA??DUSP3HSP90B1????IL4?STAT2???SLA?EGFR????PLA2G12A????MAPK10??CALR????THEM4???RIT1????MKNK2???PPP2R1B?CAB39ARPC3??PITX2???NCK1????IL2RG???PFN1????FASLG???NOD1????DAPP1???UBE2D3??CAB39???AP2M1???MAP3K7??PRKAG1??CSNK2PRKAA2?ATF1????SLC2A1??PIN1????TNFRSF1A????LCK?RPS6KA3?NGF?CXCR4???ACACA???SMAD2???PAK4 msigdb.v7.0.symbols.gmt:BIOCARTA_ECM_PATHWAY????http://www.gsea-msigdb.org/gsea/msigdb/cards/BIOCARTA_ECM_PATHWAY???MAPK3???PIK3CG??HRAS????MYL2????PIK3R1??MAP2K1??RAF1????RHOA????ROCK1???FYN?MAPK1???SHC1????TLN1????PFN1????GSN?DIAPH1??PIK3CA??ARHGAP5?ITGB1結果如👇
大標題-3.GSVA
有了上面的文件就可以利用GSVA計算了
library(GSVA) library(stringr) gs=readLines("GSVA_input.gmt")#讀入基因集 gs1?<-?lapply(gs,?function(x)??{x=strsplit(x,"\t")[[1]]???#通過\t分割y=x[3:length(x)]??????????#不取前兩行,因為不是基因return(y)})???????????????#返回y值 X=expr es.max?<-?gsva(X,?gs1,mx.diff=FALSE,?verbose=FALSE,?parallel.sz=1) rownames(es.max)=unlist(lapply(gs,?function(x)? {x=strsplit(x,"\t")[[1]][1] x=strsplit(x,":")[[1]][2]})) es.dif?<-?gsva(X,?gs1,?mx.diff=TRUE,?verbose=FALSE,?parallel.sz=1) rownames(es.dif)=rownames(es.max)???????#行名命名 save(es.max,es.dif,gs1,file?=?"es_output.Rdata") save(es.max,es.dif,gs1,file?=?"es_output.Rdata")👆是兩種算法的差異結果,具體我也沒有看太懂,根據文章設置參數就好。
大標題-4.生存分析
兩部分:es.max/es.dif
es.max
library(survival) library(survminer) phe$event=ifelse(phe$OS_STATUS=='DECEASED',1,0)?#分為死亡與生存2種情況 phe$time=as.numeric(phe$OS_MONTHS)?#根據月份劃分 rownames(phe)=gsub("-",".",phe$PATIENT_ID)?#同上面一樣需要變一下分割符 ids=intersect(rownames(phe),colnames(es.max))?#取交集 phe=phe[ids,]?#取子集 es.max=es.max[,ids]?#取子集?
篩選完成后只剩398個樣本
根據文獻大于0.1為“high,小于”-0.1“為low,方法就是用的GSVA,所以這里我們對其進行分組
for?(i?in?1:5){ phe1=phe[abs(es.max[i,])?>?0.1,]? es.max1=es.max[,abs(es.max[i,])?>?0.1] phe1$group_list=ifelse(es.max1[i,]?>?0.1,'high','low') #?利用ggsurvplot快速繪制漂亮的生存曲線圖 sfit?<-?survfit(Surv(time,?event)~group_list,?data=phe1) p=ggsurvplot(sfit,?conf.int=F,?pval=TRUE,legend.title=rownames(es.max)[i]) print(p$plot) ggsave(filename?=?paste0("0.1",rownames(es.max)[i],".pdf"))} dev.off()結果如👇,沒有一個顯著
?
但當我把參數調為0.2的時候居然第一個基因集ECM顯著,但是并不是文章中顯著的2個基因集(AKT、HYPOXIA)
?
下面我用另一種方法試一下
es.dif
方法跟👆差不多
phe=clin1[clin1$OS_STATUS?%in%?c('DECEASED','LIVING'),] phe$event=ifelse(phe$OS_STATUS=='DECEASED',1,0) phe$time=as.numeric(phe$OS_MONTHS) colnames(phe) rownames(phe)=gsub("-",".",phe$PATIENT_ID) ids1=intersect(rownames(phe),colnames(es.dif)) phe=phe[ids1,] es.dif=es.dif[,ids1] for?(i?in?1:5){phe2=phe[abs(es.dif[i,])?>?0.1,]?es.dif1=es.dif[,abs(es.dif[i,])?>?0.1]phe2$group_list=ifelse(es.dif1[i,]?>?0.1,'high','low')#?利用ggsurvplot快速繪制漂亮的生存曲線圖sfit?<-?survfit(Surv(time,?event)~group_list,?data=phe2)p=ggsurvplot(sfit,?conf.int=F,?pval=TRUE,legend.title=rownames(es.max)[i])print(p$plot)ggsave(filename?=?paste0("0.1_i",rownames(es.dif)[i],".pdf"))}?結果如👇
參數為0.1,只有第一個基因集ECM顯著
?
參數為0.2時,也是只有第一個基因集ECM顯著
大標題-插曲
前面在下載gmt數據那里做了兩次,第一次在上面已經講解了,下面是第二次的探索過程,第二次會總結第一次的錯誤。
那么為什么有如此大的差別呢?方法與文中一致,那么只可能是數據出現的問題。檢查文章,果然發現篩選的基因集不一致。前面我們主要篩選的是HALLMARK的基因集,是由多個已知的基因集構成的超基因集
?
而文章里用的是以前文章發表的基因集
?
那應該怎么搜索基因集呢?我找到了規律,只需要搜索作者名字+基因集名稱就可以了
grep?HARRIS_HYPOXIA?*結果如下面
?
這就是我們所需要的基因集了,其他基因集方法類似
grep?HARRIS_HYPOXIA?* #msigdb.v7.0.symbols.gmt:HARRIS_HYPOXIA????http://www.gsea-msigdb.org/gsea/msigdb/cards/HARRIS_HYPOXIA?PLAUR???PGK1????JUN?VEGFA???XRCC5???RP1?HIF1A???CA9?PGF?BNIP3???PKM?ALDOA???BNIP3L??BHLHE40?HGF?TFF3????ENO1????HK2?F3??CP??SPP1????FOS?HMOX1???DDIT3???HDAC9???AK3?EDN1????P4HA1???TH??EPAS1???IGFBP3??CCL2????STC1????LDHA????COL5A1??FLT1????CA12????PDGFB???TEK?IGF2????PTGS2???CD99????TXN?CDKN1A??PFKL????XRCC6???SAT1????TGFB1???FTL?TFRC????LRP8????CXCL8???NFKB1???TF??TAGLN???CCNG2???TGM2????EDN2????IL6?BIK?PFKP????ANGPT2??TGFA????EPO?SLC2A1??ADM?VIM?SLC2A3??IGFBP1??L1CAM???GAPDH???CDKN1B??MIF?TGFB3???IGFBP2??APEX1???FGF3????PRPS1???HK1?MMP13???ENPEP grep?CREIGHTON_AKT1_SIGNALING?* #msigdb.v7.0.symbols.gmt:CREIGHTON_AKT1_SIGNALING_VIA_MTOR_UP????http://www.gsea-msigdb.org/gsea/msigdb/cards/CREIGHTON_AKT1_SIGNALING_VIA_MTOR_UP???NEDD8???BRMS1???CDK16???SPINT1??DDR1????LASP1???CLSTN1??MVK?PRKCD???UBE2M???AC004156.1??NEU1????CORO1B??CYB561??KCTD5???SYNJ2BP?CDC34???TJP3????HNRNPAB?SLC37A1?NECTIN2?ADIPOR1?BSG?BIK?AKT1????DUSP10??MMP15???POR?GPX4????ARHGEF16????CLDN3???TMED10??TOLLIP??PMPCA #msigdb.v7.0.symbols.gmt:CREIGHTON_AKT1_SIGNALING_VIA_MTOR_DN????http://www.gsea-msigdb.org/gsea/msigdb/cards/CREIGHTON_AKT1_SIGNALING_VIA_MTOR_DN???TNFRSF12A???RGL2????PFKL????PPP4C???CIB1????ATP6V0B?PPP2R1A?KRT8????DHCR7???MRPS7???TUBB4B??YWHAB???ALDOA???PAFAH1B3????GOT1????TOM1????ATP6AP1?CTSA????ATP6V0C?MIF?GPI?ATP6V1F?TSPAN1 grep?ONDER_CDH1?* #msigdb.v7.0.symbols.gmt:ONDER_CDH1_TARGETS_3_UP????http://www.gsea-msigdb.org/gsea/msigdb/cards/ONDER_CDH1_TARGETS_3_UP????TSC22D3?KDELR3??ZBTB16??MAGED1??STC2????CYP1B1??MAGED2??LEPR????ALDH6A1?TCEA2???WIPI1???KCNMA1??RNASE4??S100A8??CRIP2???FTL?PRKCA???KLF9 #msigdb.v7.0.symbols.gmt:ONDER_CDH1_TARGETS_3_DN????http://www.gsea-msigdb.org/gsea/msigdb/cards/ONDER_CDH1_TARGETS_3_DN????CXCL1???EPN3????PTGS2???DSC2????NAV3????EDN1????SOX9????KRT13???CWH43???LCN2????TGFA????SCEL????HS3ST2??C1orf116????FST?HBEGF???CEACAM6?SPRR2D??IL1R2???MFAP5???TMPRSS11E???CRCT1???IVL?KLK10???IL36G???EHF?ADGRE2??SERPINB2????P2RY2???SERPINB13???ATP12A??IL1B????CXCL8???KRT8????MYO5C???KLK11???KLK7????SPAG1???ARL4C???MARC1???CYB5R2??CXCL2???ST6GALNAC5??ANO1????KRT15???PI3?GPRC5A??SERPINB1????TP63????PTHLH???CSF3????DEPP1???DUSP6???ARHGAP25????CXCL3???CSF2????VGLL1???MMP10???S100A7其他的3個基因集文章沒有并給文獻,只能默認用HALLMARK的基因集
下面是第二次的workflow
其實跟第一次類似,不同的地方做注釋,相似的地方不在贅述。這里以hypoxia基因集作為🌰,其他的基因集類似
rm(list?=?ls()) options(stringsAsFactors?=?F) load("data/metabric_clinical.Rdata") load("data/metabric_expression.Rdata") clin1=clin[clin$CLAUDIN_SUBTYPE%in%c("Basal","claudin-low"),] class(expr) class(expr1) library(GSVA) library(stringr) gs=readLines("harris_hypoxia.gmt")?#記得這里要替換成不同基因集的gmt gs1?<-?lapply(gs,?function(x)? {x=strsplit(x,"\t")[[1]] y=x[3:length(x)] return(y)}) X=expr hy.max?<-?gsva(X,?gs1,mx.diff=FALSE,?verbose=FALSE,?parallel.sz=1)#max法 rownames(hy.max)=unlist(lapply(gs,?function(x)? {x=strsplit(x,"\t")[[1]][1] x=strsplit(x,":")[[1]][2]})) hy.dif?<-?gsva(X,?gs1,?mx.diff=TRUE,?verbose=FALSE,?parallel.sz=1)#diff法 rownames(hy.dif)=rownames(hy.max) save(hy.max,hy.dif,gs1,file?=?"hy_hypoxia.Rdata")#存儲一下 load("hy_hypoxia.Rdata") library(survival) library(survminer) library(stringr) dim(clin) clin[1:4,1:4] #?終點事件(outcome?event)又稱失效事件(failure?event)?或死亡事件(death?event)?? #?這種分組資料的生存分析常采用壽命表法(life-table?method) #?生存分析也經常采用Kaplan-Meier曲線及log-rank檢驗 table(clin$VITAL_STATUS) table(clin$OS_STATUS) #?理論上這個時候需要區分?Died?of?Disease和Other?Caushy phe=clin1[clin1$OS_STATUS?%in%?c('DECEASED','LIVING'),] phe$event=ifelse(phe$OS_STATUS=='DECEASED',1,0) phe$time=as.numeric(phe$OS_MONTHS) colnames(phe) rownames(phe)=gsub("-",".",phe$PATIENT_ID) ids=intersect(rownames(phe),colnames(hy.max)) phe=phe[ids,] hy.max=as.matrix(hy.max)#因為只有一列,需要轉換成矩陣 hy.max=hy.max[,ids] hy.max=as.matrix(hy.max)#因為只有一列,需要轉換成矩陣 phe[1:4,1:4] phe1=phe[abs(hy.max[1,])?>?0.1,]? hy.max1=hy.max[,abs(hy.max[1,])?>?0.1] hy.max1=t(as.matrix(hy.max1))?#因為只有一列,需要轉換成矩陣并轉置,不然會報錯 phe1$group_list=ifelse(hy.max1[1,]?>?0.1,'high','low') #?利用ggsurvplot快速繪制漂亮的生存曲線圖 sfit1?<-?survfit(Surv(time,?event)~group_list,?data=phe1) p1=ggsurvplot(sfit1,?conf.int=F,?pval=TRUE,legend.title="hypoxia") p1 dev.off()###diff phe=clin1[clin1$OS_STATUS?%in%?c('DECEASED','LIVING'),] phe$event=ifelse(phe$OS_STATUS=='DECEASED',1,0) phe$time=as.numeric(phe$OS_MONTHS) colnames(phe) rownames(phe)=gsub("-",".",phe$PATIENT_ID) ids1=intersect(rownames(phe),colnames(hy.dif)) phe=phe[ids1,] hy.dif=hy.dif[,ids1] hy.dif=as.matrix(hy.dif) phe2=phe[abs(hy.dif[1,])?>?0.1,]? hy.dif1=hy.dif[,abs(hy.dif[1,])?>?0.1] hy.dif1=t(as.matrix(hy.dif1)) phe2$group_list=ifelse(hy.dif1[1,]?>?0,'high','low') #?利用ggsurvplot快速繪制漂亮的生存曲線圖 sfit2?<-?survfit(Surv(time,?event)~group_list,?data=phe2) p2=ggsurvplot(sfit2,?conf.int=F,?pval=TRUE,legend.title="hypoxia") p2 dev.off()最終結果如👇
0.1的閾值,MAX法結果只有文章中顯著的1個基因集(AKT)顯著,另一個HYPOXIA并不顯著
?
0.1的閾值,diff法結果文章中顯著的1個基因集(AKT)顯著,另一個HYPOXIA并不顯著,另外ECM基因集也顯著
總結
- 上一篇: MySQL中character的意思_m
- 下一篇: mysql character_set_