當前位置：首頁 > 编程资源 > 编程问答 >内容正文

编程问答

TCGA_改版后STAR-count处理方法

發布時間：2023/12/8 编程问答 30 豆豆

生活随笔收集整理的這篇文章主要介紹了 TCGA_改版后STAR-count处理方法小編覺得挺不錯的,現在分享給大家,幫大家做個參考.

TCGA改版后，workflow.type只有STAR-counts數據，先對所嘗試的幾種處理方法進行記錄：

R version 4.1.2 ； TCGAbiolinks version?2.23.11

方法1

最新版TCGA 矩陣整理，百分百復現成功_sayhello1025的博客-CSDN博客

一、從TCGA網站上下載tsv文件和json文件

1 tsv文件?

query <- GDCquery(project = "TCGA-PRAD", #項目名data.category = "Transcriptome Profiling",data.type = "Gene Expression Quantification",workflow.type = "STAR - Counts")t_samplesDown <- getResults(query,cols=c("cases")) #從sampleDown檢索出TP t_dataSmTP <- TCGAquery_SampleTypes(barcode = t_samplesDown,typesample = "TP") #可選#篩選完成的query queryDown <- GDCquery(project = "TCGA-PRAD",data.category = "Transcriptome Profiling",data.type = "Gene Expression Quantification", workflow.type = "STAR - Counts", barcode = t_dataSmTP) #下載GDCquery的結果 GDCdownload(queryDown,method = "api",directory = "GDCdata_star_count",files.per.chunk = 6) #網速慢可以設小一點

?跟如下從download 中下載cart效果一樣

2 json文件

點擊Metadata下載json文件

?二、整合tsv文件、json文件到一個文件夾

1 右上角搜索“.tsv”

2?復制到新文件夾里面，整理如下：

本機路徑：'E:/R/PRAD Data Mining/PRAD_data_mining/TCGA/GDCdata_star_count.tsv/all/'

?三、提取count矩陣

rm(list=ls()) options(stringsAsFactors = F)library("rjson")

1 整理“all”中的文件

result <- fromJSON(file = "E:/R/PRAD Data Mining/PRAD_data_mining/TCGA/GDCdata_star_count.tsv/metadata.cart.2022-04-17.json") metadata <- data.frame(t(sapply(result,function(x){id <- x$associated_entities[[1]]$entity_submitter_idfile_name <- x$file_nameall <- cbind(id,file_name) }))) rownames(metadata) <- metadata[,2]t_dir <- 'E:/R/PRAD Data Mining/PRAD_data_mining/TCGA/GDCdata_star_count.tsv/all/' t_samples=list.files(t_dir) sampledir <- paste0(t_dir,t_samples) #各個文件路徑

View(metadata)

metadata中記錄barcode文件名的對應關系

example <- data.table::fread('E:/R/PRAD Data Mining/PRAD_data_mining/TCGA/GDCdata_star_count.tsv/all/005d2b9e-722c-40bd-aa5c-bd4e8842cb04.rna_seq.augmented_star_gene_counts.tsv',data.table = F)#讀入一個tsv文件，查看需要的列數，“unstranded”raw <- do.call(cbind,lapply(sampledir, function(x){rt <- data.table::fread(x,data.table = F) #data.table::fread函數rownames(rt) <- rt[,1]rt <- rt[,4]###第4列為“unstranded” }))

View(example)

載入一個例子，需要的 ”unstranded count”?值數據是第4列；同時第5行之后才可讀

?View(raw)

行名、列名未注釋??

2?替換列名和行名

重點說一下sapply(strsplit(sampledir,'/'),'[',8) #數字可選。意思是將每個sampledir按“/”分隔后，取第“8”個，對應的是tsv文件的文件名。比方說我的sampledir[1]：

'E:/R/PRAD Data Mining/PRAD_data_mining/TCGA/GDCdata_star_count.tsv/all/005d2b9e-722c-40bd-aa5c-bd4e8842cb04.rna_seq.augmented_star_gene_counts.tsv'

按“/”分隔后，第“8”個部分是tsv文件的文件名，所以我這里的參數為“8”

colnames(raw)=sapply(strsplit(sampledir,'/'),'[',8) #數字可選,'8'為文件名005d2b9e-722c-40bd-aa5c-bd4e8842cb04.rna_seq.augmented_star_gene_counts.tsv rownames(raw) <- example$gene_id ##行名 raw_t <- t(raw)t_same <- intersect(rownames(metadata),rownames(raw_t))dataPrep2 <- cbind(metadata[t_same,],raw_t[t_same,]) rownames(dataPrep2) <- dataPrep2[,1] dataPrep2 <- t(dataPrep2) dataPrep2 <-dataPrep2[-c(1:6),] #dataPrep2為未注釋count矩陣

View(raw)

View(dataPrep2)

?dataPrep2為未基因注釋的count矩陣，格式為“matrix”

3?dataPrep2中數據類型為“character”，需要轉為“numeric”

puried_data=apply(dataPrep2,2,as.numeric)

?View(puried_data)

行名沒了?

4 基因注釋

puried_data的格式為“matrix”：“matrix”行名可以重復，因此直接替換行名。

此處仍然借用example進行基因注釋

rownames(puried_data)=example[5:nrow(example),'gene_name']

View(puried_data)

dim(puried_data)

?[1] 60660 ? 490

5 去除重復基因名

取行平均count值最大的行

t_index=order(rowMeans(puried_data),decreasing = T)#計算所有行平均值，按降序排列 t_data_order=puried_data[t_index,]#調整表達譜的基因順序 keep=!duplicated(rownames(t_data_order))#對于有重復的基因，保留第一次出現的那個，即行平均值大的那個 puried_data=t_data_order[keep,]#得到最后處理之后的表達譜矩陣

dim(puried_data)

?[1] 59427 ? 490

6 預處理

需要注意DESeq2包要求數據未經過標準化

dataFilt =puried_data[rowMeans(puried_data)>10,] #剔除低表達基因write.csv(dataFilt,file = "dataFilt.csv",quote = FALSE)

得到?dataFilt?后就可以正常分析了！

?方法2（未成功）

嘗試使用TCGAbiolinksR包

library("BiocManager") library("TCGAbiolinks") library("SummarizedExperiment")BiocManager::install("BioinformaticsFMRP/TCGAbiolinksGUI.data") BiocManager::install("BioinformaticsFMRP/TCGAbiolinks") #比BiocManager::install(TCGAbiolinks)安裝更新版的TCGAbiolinks包 query <- GDCquery(project = "TCGA-PRAD", #項目名data.category = "Transcriptome Profiling",data.type = "Gene Expression Quantification",workflow.type = "STAR - Counts")t_samplesDown <- getResults(query,cols=c("cases")) #從sampleDown檢索出TP t_dataSmTP <- TCGAquery_SampleTypes(barcode = t_samplesDown,typesample = "TP") #可選#篩選完成的query queryDown <- GDCquery(project = "TCGA-PRAD",data.category = "Transcriptome Profiling",data.type = "Gene Expression Quantification", workflow.type = "STAR - Counts", barcode = t_dataSmTP) #下載GDCquery的結果 GDCdownload(queryDown,method = "api",directory = "GDCdata_star_count",files.per.chunk = 6) #網速慢可以設小一點dataPrep1 <- GDCprepare(query = queryDown, save = T,directory = 'GDCdata_star_count', #默認為“GDCdata”save.filename ="dataPrep1_PRAD_star.rda")dataPrep2 <- TCGAanalyze_Preprocessing(object = dataPrep1,cor.cut = 0.6#datatype = "HTSeq - Counts")

View(dataPrep2)

t_purityDATA <- TCGAtumor_purity(colnames(dataPrep1), 0, 0, 0, 0, 0.6) t_purity.barcodes<-t_purityDATA$pure_barcodes #腫瘤樣本barcodes，為“character” t_normal.barcodes<-t_purityDATA$filtered #正常組織的數據barcodes，為“character”puried_data <-dataPrep2[,t_purity.barcodes]#篩選后數據rownames(puried_data)<-rowData(dataPrep1)$external_gene_name

rownames(puried_data)<-rowData(dataPrep1)$external_gene_name （基因注釋步驟）

View(puried_data)

?基因注釋未成功，暫時沒有找到解決方法

總結

以上是生活随笔為你收集整理的TCGA_改版后STAR-count处理方法的全部內容，希望文章能夠幫你解決所遇到的問題。

如果覺得生活随笔網站內容還不錯，歡迎將生活随笔推薦給好友。

上一篇：功能测试与抓包工具Fiddler(htt
下一篇：微信第三方开发行业解决方案