【高通量测序】.dna文件批量读取CAG重复序列长度
生活随笔
收集整理的這篇文章主要介紹了
【高通量测序】.dna文件批量读取CAG重复序列长度
小編覺得挺不錯的,現(xiàn)在分享給大家,幫大家做個參考.
191008 CAG讀取思路
#!/usr/bin/env python3 # -*- coding: utf-8 -*- import ref = open(r'C:\Users\kagami\Desktop\TEST\m54045_190927_083209 1.txt',encoding='gb18030', errors = "ignore") f.readline() f.readline()primertext = f.readline() secondtext = str(primertext)f.close() text = re.sub(r"[0123456789 ]","", secondtext) #輸出純凈DNA序列textsearch = re.search("GTCCTTCCAG(\S{0,1000})CCGCCGCCACCG",text)CAGnumber = len(textsearch.group(1)) PolyQnumber = int(CAGnumber/3)print(textsearch.group(1)) print(CAGnumber, PolyQnumber)照著這個思路先把所有的.dna文件修改成了.txt文件,修改代碼如下:
#!/usr/bin/env python3 # -*- coding: utf-8 -*-import os import sys#獲取目錄下文件名清單 files = os.listdir(r"E:\TEST\Origin") #對文件名清單里的每一個文件名進(jìn)行處理 for filename in files:portion = os.path.splitext(filename)#portion為名稱和后綴分離后的列表if portion[1] ==".dna":#如果為JPG則更改名字newname = portion[0]+".txt"#要改的新后綴#改好的新名字print(filename)#打印出要更改的文件名os.chdir(r"E:\TEST\Origin")#修改工作路徑os.rename(filename,newname)(其實可以改動得更好,按照上面的方面會把所有的文件直接改名成.txt格式,源文件無法保留,而且還是在源文件夾下進(jìn)行操作)
之后可以按照以上的思路,批量讀取文件夾內(nèi)的每一個txt文件,輸出一張csv表,包含文件名、CAG數(shù)目、polyQ數(shù)目、CAG序列片段信息。
#!/usr/bin/env python3 # -*- coding: utf-8 -*- import os import re import linecache import csv import codecsListName = list() ListOldPolyQnumber = list() ListNewPolyQnumber = list() ListNewAgainPolyQnumber = list() ListSequencenumber = list()#抽取單個文本內(nèi)CAG序列文本 def ReadPolyQnumber(text):Strtext = str(text)DNAsequence = re.sub(r"[0123456789 ]","", Strtext) #輸出測序完整DNA序列print(DNAsequence)CAGsearch = ""if re.search("GTCCTTCCAG", DNAsequence): #正向序列if re.search("CCGCCGCCACCG", DNAsequence):CAGsearch = re.search("GTCCTTCCAG(\S{0,10000})CCGCCGCCACCG", DNAsequence).group(1) # 輸出全部CAG序列else:if re.search("CCGCCACCGCCG", DNAsequence):CAGsearch = re.search("GTCCTTCCAG(\S{0,10000})CCGCCACCGCCG", DNAsequence).group(1)# 輸出全部CAG序列else:CAGsearch = ""print("正向匹配錯誤1")else:if re.search("CTGGAAGGAC", DNAsequence): # 反向序列if re.search("CGGTGGCGGCGG", DNAsequence):CAGsearch = re.search("CGGTGGCGGCGG(\S{0,10000})CTGGAAGGAC", DNAsequence).group(1) # 輸出全部CAG序列else:if re.search("CGGCGGTGGCGG", DNAsequence):CAGsearch = re.search("CGGCGGTGGCGG(\S{0,10000})CTGGAAGGAC", DNAsequence).group(1) # 輸出全部CAG序列else:CAGsearch = ""print("反向匹配錯誤1")else:CAGsearch = ""print("!!![匹配錯誤]!!!")CAGsequnece = CAGsearchreturn (CAGsequnece)#原始方法計算序列中PolyQ數(shù)量 def OldPolyQnumber(Str):CAGnumber = int(len(Str)/3)return CAGnumber#計算CAG序列文本中PolyQ數(shù)量-不計入錯配 def ChcekPolyQtext(Str):if re.search("CAGCAG", Str): #匹配正向序列print("Enter Forward Change")FirstChangetext = re.sub("CAG","Q",Str) #Chang CAG sequence to QSecondChangetext = re.sub("CAA","Q",FirstChangetext) #Chang CAA sequence to Qelse:if re.search("TGCTGC", Str): #匹配反向序列print("Enter Reverse Change")FirstChangetext = re.sub("TGC", "Q", Str)SecondChangetext = re.sub("TTC", "Q", FirstChangetext)else:print("Enter NO Change")SecondChangetext = "Q" * 9999print("CAG to Q:",SecondChangetext)return SecondChangetextdef CheckPolyQnumber(Str):PolyQnumber = int(Str.count("Q"))return PolyQnumberdef CheckagainPolyQnumber(Str):if re.search(r"QQ(\S{0,10000})QQ", Str):PolyQtext = re.search(r"QQ(\S{0,10000})QQ", Str).group(1)PolyQtoCAGtext = PolyQtext.replace("Q", "CAG")else:PolyQtoCAGtext = StrPolyQnumber = int((len(PolyQtoCAGtext)/3)+4)return PolyQnumberpath = r"C:\Users\kagami\Desktop\Origin" files = os.listdir(path)for file in files:print(file)Oldfilename = str(file)Newfilename = Oldfilename.replace(" ", "_")ListName.append(Newfilename)f = open(path+"/"+file,encoding='gb18030', errors = "ignore")f.readline()f.readline()Textinformation = f.readline()if len(Textinformation)<200:Textinformation = f.readline()f.close()fileCAGTEXT = ReadPolyQnumber(Textinformation)Sequencingnumber = int(len(fileCAGTEXT))ListSequencenumber.append(Sequencingnumber)print(fileCAGTEXT)OldCAGnumber = OldPolyQnumber(fileCAGTEXT)ListOldPolyQnumber.append(OldCAGnumber)PolyQTEXT = ChcekPolyQtext(fileCAGTEXT)print(PolyQTEXT)NewPolyQnumber = CheckPolyQnumber(PolyQTEXT)ListNewPolyQnumber.append(NewPolyQnumber)NewAgainPolyQnumber = CheckagainPolyQnumber(PolyQTEXT)ListNewAgainPolyQnumber.append(NewAgainPolyQnumber)def f(j,k,l,i,m):return j,k,l,i,mFinallist = list(map(f,[j for j in ListName], [k for k in ListOldPolyQnumber], [l for l in ListNewPolyQnumber],[i for i in ListNewAgainPolyQnumber], [m for m in ListSequencenumber]))print(len(Finallist))def data_write_csv(file_name, datas):#file_name為寫入CSV文件的路徑,datas為要寫入數(shù)據(jù)列表file_csv = codecs.open(file_name,'w+','utf-8')#追加writer = csv.writer(file_csv, delimiter=' ', quotechar=' ', quoting=csv.QUOTE_MINIMAL)for data in datas:writer.writerow(data)print("保存文件成功,處理結(jié)束")data_write_csv(r"C:\Users\kagami\Desktop\Result\result-3rd.csv", Finallist)R劃線模擬GEL電泳
利用R中的lines()功能,劃線模擬PCR產(chǎn)物電泳情況,具體實現(xiàn)代碼如下:
PolyQnumber <- read.csv("C:/Users/Administrator/Desktop/Result/Result-Sort.csv",colClasses = c("NULL","NULL",NA), header = FALSE)PolyQ <- PolyQnumber$V3plot(c(0,500),c(0,600),type="n",xlab=" ",ylab="Length of Product")for (i in PolyQ)lines(c(100,300),c(i,i), col = rgb(255, 0, 0, 5, maxColorValue=255)) print(PolyQ)R最終模擬的產(chǎn)物電泳圖如下:
總結(jié)
以上是生活随笔為你收集整理的【高通量测序】.dna文件批量读取CAG重复序列长度的全部內(nèi)容,希望文章能夠幫你解決所遇到的問題。
- 上一篇: C语言之判断直角三角形
- 下一篇: GNN图神经网络