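A Simple Application of Python in Biology: Processing DNA Sequences
Reverse complement of a DNA sequence
Suppose we have a DNA sequence stored in a text file named "dna.txt". How can we use Python to print its reverse, its complement, and its reverse complement?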
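Before doing so, it helps to define a function that opens and reads the txt file. We name it read_seq(), and its parameter is the path to the file. One thing to note is that dna.txt contains line feed (\n) and carriage return (\r) characters (the original post shows them in a screenshot), whereas we only need the uppercase letters that represent the bases. The replace() method can strip them out.
The resulting read_seq() function is as follows: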
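The listing itself did not survive at this point; judging from the complete program later in the post, it looks roughly like the sketch below. The with block and the explicit encoding argument are assumptions added for illustration and are not part of the original code.

```python
def read_seq(inputfile):
    """Read a sequence file and return its contents as one unbroken string."""
    # The with block closes the file even if reading fails.
    with open(inputfile, "r", encoding="utf-8") as file:
        seq = file.read()
    # Drop line feeds and carriage returns so that only the bases remain.
    seq = seq.replace("\n", "")
    seq = seq.replace("\r", "")
    return seq
```

Called as read_seq("dna.txt"), it should return the whole sequence as a single string, however the file was wrapped.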
Next we need to define three functions: dna_complement(), dna_reverse() and dna_revcomp(). For dna_complement() it is tempting to reuse the approach from read_seq() and chain replace() calls on the uppercase base letters, but that does not work: replacing A with T and then T with A turns every A and T into A, and in the same way every C and G ends up as C. A translation table built with str.maketrans() swaps all four bases in a single pass instead. The other two functions are much easier, since we only need to reverse the string. The three functions are defined as follows:

    # Map each base to its pairing partner in a single pass.
    COMPLEMENT = str.maketrans('ATCG', 'TAGC')

    def dna_complement(seq):
        seq = seq.upper()
        return seq.translate(COMPLEMENT)

    def dna_reverse(seq):
        seq = seq.upper()
        return seq[::-1]

    def dna_revcomp(seq):
        seq = seq.upper()
        return dna_complement(seq)[::-1]

The complete code is as follows:

    # Map each base to its pairing partner in a single pass.
    COMPLEMENT = str.maketrans('ATCG', 'TAGC')

    def dna_complement(seq):
        seq = seq.upper()
        return seq.translate(COMPLEMENT)

    def dna_reverse(seq):
        seq = seq.upper()
        return seq[::-1]

    def dna_revcomp(seq):
        seq = seq.upper()
        return dna_complement(seq)[::-1]

    def read_seq(inputfile):
        # The with block closes the file once it has been read.
        with open(inputfile, "r") as file:
            seq = file.read()
        seq = seq.replace("\n", "")
        seq = seq.replace("\r", "")
        return seq

    if __name__ == '__main__':
        dna = read_seq("E:\\python_pycharm\\一些Python程序練習\\DNA\\dna.txt")
        print(dna)                  # original DNA sequence
        print(dna_complement(dna))  # complement
        print(dna_reverse(dna))     # reverse
        print(dna_revcomp(dna))     # reverse complement

The output is as follows:
GGTCAGAAAAAGCCCTCTCCATGTCTACTCACGATACATCCCTGAAAACCACTGAGGAAGTGGCTTTTCAGATCATCTTGCTTTGCCAGTTTGGGGTTGGGACTTTTGCCAATGTATTTCTCTTTGTCTATAATTTCTCTCCAATCTCGACTGGTTCTAAACAGAGGCCCAGACAAGTGATTTTAAGACACATGGCTGTGGCCAATGCCTTAACTCTCTTCCTCACTATATTTCCAAACAACATGATGACTTTTGCTCCAATTATTCCTCAAACTGACCTCAAATGTAAATTAGAATTCTTCACTCGCCTCGTGGCAAGAAGCACAAACTTGTGTTCAACTTGTGTTCTGAGTATCCATCAGTTTGTCACACTTGTTCCTGTTAATTCAGGTAAAGGAATACTCAGAGCAAGTGTCACAAACATGGCAAGTTATTCTTGTTACAGTTGTTGGTTCTTCAGTGTCTTAAATAACATCTACATTCCAATTAAGGTCACTGGTCCACAGTTAACAGACAATAACAATAACTCTAAAAGCAAGTTGTTCTGTTCCACTTCTGATTTCAGTGTAGGCATTGTCTTCTTGAGGTTTGCCCATGATGCCACATTCATGAGCATCATGGTCTGGACCAGTGTCTCCATGGTACTTCTCCTCCATAGACATTGTCAGAGAATGCAGTACATATTCACTCTCAATCAGGACCCCAGGGGCCAAGCAGAGACCACAGCAACCCATACTATCCTGATGCTGGTAGTCACATTTGTTGGCTTTTATCTTCTAAGTCTTATTTGTATCATCTTTTACACCTATTTTATATATTCTCATCATTCCCTGAGGCATTGCAATGACATTTTGGTTTCGGGTTTCCCTACAATTTCTCCTTTACTGTTGACCTTCAGAGACCCTAAGGGTCCTTGTTCTGTGTTCTTCAACTGTTGAAAGCCAGAGTCACTAAAAATGCCAAACACAGAAGACAGCTTTGCTAATACCATTAAATACTTTATTCCATAAATATGTTTTTAAAAGCTTGTATGAACAAGGTATGGTGCTCACTGCTATACTTATAAAAGAGTAAGGTTATAATCACTTGTTGATATGAAAAGATTTCTGGTTGGAATCTGATTGAAACAGTGAGTTATTCACCACCCTCCATTCTCT 
CCAGTCTTTTTCGGGAGAGGTACAGATGAGTGCTATGTAGGGACTTTTGGTGACTCCTTC... (complement; only the first 60 bases are shown)
TCTCTTACCTCCCACCACTTATTGAGTGACAAAGTTAGTCTAAGGTTGGTCTTTAGAAAAGTATAGTTGTTCACTAATATTGGAATGAGAAAATATTCATATCGTCACTCGTGGTATGGAACAAGTATGTTCGAAAATTTTTGTATAAATACCTTATTTCATAAATTACCATAATCGTTTCGACAGAAGACACAAACCGTAAAAATCACTGAGACCGAAAGTTGTCAACTTCTTGTGTCTTGTTCCTGGGAATCCCAGAGACTTCCAGTTGTCATTTCCTCTTTAACATCCCTTTGGGCTTTGGTTTTACAGTAACGTTACGGAGTCCCTTACTACTCTTATATATTTTATCCACATTTTCTACTATGTTTATTCTGAATCTTCTATTTTCGGTTGTTTACACTGATGGTCGTAGTCCTATCATACCCAACGACACCAGAGACGAACCGGGGACCCCAGGACTAACTCTCACTTATACATGACGTAAGAGACTGTTACAGATACCTCCTCTTCATGGTACCTCTGTGACCAGGTCTGGTACTACGAGTACTTACACCGTAGTACCCGTTTGGAGTTCTTCTGTTACGGATGTGACTTTAGTCTTCACCTTGTCTTGTTGAACGAAAATCTCAATAACAATAACAGACAATTGACACCTGGTCACTGGAATTAACCTTACATCTACAATAAATTCTGTGACTTCTTGGTTGTTGACATTGTTCTTATTGAACGGTACAAACACTGTGAACGAGACTCATAAGGAAATGGACTTAATTGTCCTTGTTCACACTGTTTGACTACCTATGAGTCTTGTGTTCAACTTGTGTTCAAACACGAAGAACGGTGCTCCGCTCACTTCTTAAGATTAAATGTAAACTCCAGTCAAACTCCTTATTAACCTCGTTTTCAGTAGTACAACAAACCTTTATATCACTCCTTCTCTCAATTCCGTAACCGGTGTCGGTACACAGAATTTTAGTGAACAGACCCGGAGACAAATCTTGGTCAGCTCTAACCTCTCTTTAATATCTGTTTCTCTTTATGTAACCGTTTTCAGGGTTGGGGTTTGACCGTTTCGTTCTACTAGACTTTTCGGTGAAGGAGTCACCAAAAGTCCCTACATAGCACTCATCTGTACCTCTCCCGAAAAAGACTGG 
AGAGAATGGAGGGTGGTGAATAACTCACTGTTTCAATCAGATTCCAACCAGAAATCTTTT... (reverse complement; only the first 60 bases are shown)
Transcribing the DNA coding strand into mRNA
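Again we implement transcription of the DNA coding strand by defining a function. The mRNA has the same sequence as the coding strand except that uracil takes the place of thymine, so we only need to replace every T in the coding strand with U to obtain the mRNA (here we ignore the promoter and terminator of the DNA). The implementation is as follows: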
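
    def transcription(seq):
        # Here we assume the given DNA sequence is the coding strand.
        seq = seq.upper()
        seq = seq.replace('T', 'U')
        return seq

    def read_seq(inputfile):
        # The with block closes the file once it has been read.
        with open(inputfile, "r") as file:
            seq = file.read()
        seq = seq.replace("\n", "")
        seq = seq.replace("\r", "")
        return seq

    if __name__ == '__main__':
        dna = read_seq("E:\\python_pycharm\\一些Python程序練習\\DNA\\dna.txt")
        print(dna)                 # original DNA sequence (coding strand)
        print(transcription(dna))  # transcribed mRNA sequence

Translating a DNA coding strand or an mRNA sequence into a protein sequence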
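Translation reads a sequence three bases at a time, so it may help to look at that grouping on its own first. The sketch below uses a hypothetical helper, split_codons(), and a short made-up sequence; neither comes from the original post. It starts at the first ATG and discards any incomplete codon at the end.

```python
def split_codons(seq):
    """Return the codons of seq, starting at the first ATG (hypothetical helper)."""
    seq = seq.upper()
    start = seq.find('ATG')
    if start == -1:
        return []  # no start codon, so there is no reading frame
    # Step through the reading frame; len(seq) - 2 skips an incomplete last codon.
    return [seq[i:i + 3] for i in range(start, len(seq) - 2, 3)]


print(split_codons('ccATGAAATGAgg'))  # ['ATG', 'AAA', 'TGA']
```

The two leading bases before ATG and the two trailing bases are ignored, which is the same framing the translation functions in this section rely on.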
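We first discuss translating a protein from the DNA coding strand, and define a new function, dna_translate(). While writing it we need to keep one point in mind: translation of mRNA starts at the start codon and stops at a stop codon. Ignoring this would very likely give a wrong result. The function therefore looks for the first ATG, reads codons from there, and stops at the first stop codon (TAA, TAG or TGA), which the table marks with an underscore. It is defined as follows: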

    import re

    def dna_translate(seq):
        table = {
            'ATA': 'I', 'ATC': 'I', 'ATT': 'I', 'ATG': 'M',
            'ACA': 'T', 'ACC': 'T', 'ACG': 'T', 'ACT': 'T',
            'AAC': 'N', 'AAT': 'N', 'AAA': 'K', 'AAG': 'K',
            'AGC': 'S', 'AGT': 'S', 'AGA': 'R', 'AGG': 'R',
            'CTA': 'L', 'CTC': 'L', 'CTG': 'L', 'CTT': 'L',
            'CCA': 'P', 'CCC': 'P', 'CCG': 'P', 'CCT': 'P',
            'CAC': 'H', 'CAT': 'H', 'CAA': 'Q', 'CAG': 'Q',
            'CGA': 'R', 'CGC': 'R', 'CGG': 'R', 'CGT': 'R',
            'GTA': 'V', 'GTC': 'V', 'GTG': 'V', 'GTT': 'V',
            'GCA': 'A', 'GCC': 'A', 'GCG': 'A', 'GCT': 'A',
            'GAC': 'D', 'GAT': 'D', 'GAA': 'E', 'GAG': 'E',
            'GGA': 'G', 'GGC': 'G', 'GGG': 'G', 'GGT': 'G',
            'TCA': 'S', 'TCC': 'S', 'TCG': 'S', 'TCT': 'S',
            'TTC': 'F', 'TTT': 'F', 'TTA': 'L', 'TTG': 'L',
            'TAC': 'Y', 'TAT': 'Y', 'TAA': '_', 'TAG': '_',
            'TGC': 'C', 'TGT': 'C', 'TGA': '_', 'TGG': 'W',
        }
        # Translation starts at the first start codon (ATG).
        start_sit = re.search('ATG', seq)
        if start_sit is None:
            return ""  # no start codon, nothing to translate
        protein = ""
        # Read whole codons only; len(seq) - 2 skips an incomplete last codon.
        for sit in range(start_sit.start(), len(seq) - 2, 3):
            amino_acid = table[seq[sit:sit + 3]]
            protein += amino_acid
            if amino_acid == '_':  # stop codon: the protein ends here
                break
        return protein

The complete code is as follows:

    import re

    def dna_translate(seq):
        table = {
            'ATA': 'I', 'ATC': 'I', 'ATT': 'I', 'ATG': 'M',
            'ACA': 'T', 'ACC': 'T', 'ACG': 'T', 'ACT': 'T',
            'AAC': 'N', 'AAT': 'N', 'AAA': 'K', 'AAG': 'K',
            'AGC': 'S', 'AGT': 'S', 'AGA': 'R', 'AGG': 'R',
            'CTA': 'L', 'CTC': 'L', 'CTG': 'L', 'CTT': 'L',
            'CCA': 'P', 'CCC': 'P', 'CCG': 'P', 'CCT': 'P',
            'CAC': 'H', 'CAT': 'H', 'CAA': 'Q', 'CAG': 'Q',
            'CGA': 'R', 'CGC': 'R', 'CGG': 'R', 'CGT': 'R',
            'GTA': 'V', 'GTC': 'V', 'GTG': 'V', 'GTT': 'V',
            'GCA': 'A', 'GCC': 'A', 'GCG': 'A', 'GCT': 'A',
            'GAC': 'D', 'GAT': 'D', 'GAA': 'E', 'GAG': 'E',
            'GGA': 'G', 'GGC': 'G', 'GGG': 'G', 'GGT': 'G',
            'TCA': 'S', 'TCC': 'S', 'TCG': 'S', 'TCT': 'S',
            'TTC': 'F', 'TTT': 'F', 'TTA': 'L', 'TTG': 'L',
            'TAC': 'Y', 'TAT': 'Y', 'TAA': '_', 'TAG': '_',
            'TGC': 'C', 'TGT': 'C', 'TGA': '_', 'TGG': 'W',
        }
        # Translation starts at the first start codon (ATG).
        start_sit = re.search('ATG', seq)
        if start_sit is None:
            return ""  # no start codon, nothing to translate
        protein = ""
        # Read whole codons only; len(seq) - 2 skips an incomplete last codon.
        for sit in range(start_sit.start(), len(seq) - 2, 3):
            amino_acid = table[seq[sit:sit + 3]]
            protein += amino_acid
            if amino_acid == '_':  # stop codon: the protein ends here
                break
        return protein

    def read_seq(inputfile):
        # The with block closes the file once it has been read.
        with open(inputfile, "r") as file:
            seq = file.read()
        seq = seq.replace("\n", "")
        seq = seq.replace("\r", "")
        return seq

    if __name__ == '__main__':
        dna = read_seq("E:\\python_pycharm\\一些Python程序練習\\DNA\\dna.txt")
        print(dna)
        print(dna_translate(dna)[:-1])

In the last line of the program above, print(dna_translate(dna)[:-1]), the trailing [:-1] removes the _ (underscore) that marks the stop codon at the end of the protein sequence; this assumes that a stop codon was actually reached. The output is as follows:
GGTCAGAAAAAGCCCTCTCCATGTCTACTCACGATACATCCCTGAAAACCACTGAGGAAGTGGCTTTTCAGATCATCTTGCTTTGCCAGTTTGGGGTTGGGACTTTTGCCAATGTATTTCTCTTTGTCTATAATTTCTCTCCAATCTCGACTGGTTCTAAACAGAGGCCCAGACAAGTGATTTTAAGACACATGGCTGTGGCCAATGCCTTAACTCTCTTCCTCACTATATTTCCAAACAACATGATGACTTTTGCTCCAATTATTCCTCAAACTGACCTCAAATGTAAATTAGAATTCTTCACTCGCCTCGTGGCAAGAAGCACAAACTTGTGTTCAACTTGTGTTCTGAGTATCCATCAGTTTGTCACACTTGTTCCTGTTAATTCAGGTAAAGGAATACTCAGAGCAAGTGTCACAAACATGGCAAGTTATTCTTGTTACAGTTGTTGGTTCTTCAGTGTCTTAAATAACATCTACATTCCAATTAAGGTCACTGGTCCACAGTTAACAGACAATAACAATAACTCTAAAAGCAAGTTGTTCTGTTCCACTTCTGATTTCAGTGTAGGCATTGTCTTCTTGAGGTTTGCCCATGATGCCACATTCATGAGCATCATGGTCTGGACCAGTGTCTCCATGGTACTTCTCCTCCATAGACATTGTCAGAGAATGCAGTACATATTCACTCTCAATCAGGACCCCAGGGGCCAAGCAGAGACCACAGCAACCCATACTATCCTGATGCTGGTAGTCACATTTGTTGGCTTTTATCTTCTAAGTCTTATTTGTATCATCTTTTACACCTATTTTATATATTCTCATCATTCCCTGAGGCATTGCAATGACATTTTGGTTTCGGGTTTCCCTACAATTTCTCCTTTACTGTTGACCTTCAGAGACCCTAAGGGTCCTTGTTCTGTGTTCTTCAACTGTTGAAAGCCAGAGTCACTAAAAATGCCAAACACAGAAGACAGCTTTGCTAATACCATTAAATACTTTATTCCATAAATATGTTTTTAAAAGCTTGTATGAACAAGGTATGGTGCTCACTGCTATACTTATAAAAGAGTAAGGTTATAATCACTTGTTGATATGAAAAGATTTCTGGTTGGAATCTGATTGAAACAGTGAGTTATTCACCACCCTCCATTCTCT
MSTHDTSLKTTEEVAFQIILLCQFGVGTFANVFLFVYNFSPISTGSKQRPRQVILRHMAVANALTLFLTIFPNNMMTFAPIIPQTDLKCKLEFFTRLVARSTNLCSTCVLSIHQFVTLVPVNSGKGILRASVTNMASYSCYSCWFFSVLNNIYIPIKVTGPQLTDNNNNSKSKLFCSTSDFSVGIVFLRFAHDATFMSIMVWTSVSMVLLLHRHCQRMQYIFTLNQDPRGQAETTATHTILMLVVTFVGFYLLSLICIIFYTYFIYSHHSLRHCNDILVSGFPTISPLLLTFRDPKGPCSVFFNC
Similarly, we can combine the programs above to carry out the whole process, DNA → (transcription) → mRNA → (translation) → protein. The code is as follows:

    import re

    def mrna_translate(seq):
        table = {
            'AUA': 'I', 'AUC': 'I', 'AUU': 'I', 'AUG': 'M',
            'ACA': 'T', 'ACC': 'T', 'ACG': 'T', 'ACU': 'T',
            'AAC': 'N', 'AAU': 'N', 'AAA': 'K', 'AAG': 'K',
            'AGC': 'S', 'AGU': 'S', 'AGA': 'R', 'AGG': 'R',
            'CUA': 'L', 'CUC': 'L', 'CUG': 'L', 'CUU': 'L',
            'CCA': 'P', 'CCC': 'P', 'CCG': 'P', 'CCU': 'P',
            'CAC': 'H', 'CAU': 'H', 'CAA': 'Q', 'CAG': 'Q',
            'CGA': 'R', 'CGC': 'R', 'CGG': 'R', 'CGU': 'R',
            'GUA': 'V', 'GUC': 'V', 'GUG': 'V', 'GUU': 'V',
            'GCA': 'A', 'GCC': 'A', 'GCG': 'A', 'GCU': 'A',
            'GAC': 'D', 'GAU': 'D', 'GAA': 'E', 'GAG': 'E',
            'GGA': 'G', 'GGC': 'G', 'GGG': 'G', 'GGU': 'G',
            'UCA': 'S', 'UCC': 'S', 'UCG': 'S', 'UCU': 'S',
            'UUC': 'F', 'UUU': 'F', 'UUA': 'L', 'UUG': 'L',
            'UAC': 'Y', 'UAU': 'Y', 'UAA': '_', 'UAG': '_',
            'UGC': 'C', 'UGU': 'C', 'UGA': '_', 'UGG': 'W',
        }
        # Translation starts at the first start codon (AUG).
        start_sit = re.search('AUG', seq)
        if start_sit is None:
            return ""  # no start codon, nothing to translate
        protein = ""
        # Read whole codons only; len(seq) - 2 skips an incomplete last codon.
        for sit in range(start_sit.start(), len(seq) - 2, 3):
            amino_acid = table[seq[sit:sit + 3]]
            protein += amino_acid
            if amino_acid == '_':  # stop codon: the protein ends here
                break
        return protein

    def transcription(seq):
        # Here we assume the given DNA sequence is the coding strand.
        seq = seq.upper()
        seq = seq.replace('T', 'U')
        return seq

    def read_seq(inputfile):
        # The with block closes the file once it has been read.
        with open(inputfile, "r") as file:
            seq = file.read()
        seq = seq.replace("\n", "")
        seq = seq.replace("\r", "")
        return seq

    if __name__ == '__main__':
        dna = read_seq("E:\\python_pycharm\\一些Python程序練習\\DNA\\dna.txt")
        print(dna)                         # original DNA sequence (coding strand)
        mrna = transcription(dna)
        print(mrna)                        # transcribed mRNA sequence
        print(mrna_translate(mrna)[:-1])   # translated protein sequence

Final words
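One detail from the reverse-complement section is worth checking on a tiny input: chaining replace() calls to complement a sequence cannot work, because each later call also rewrites the letters produced by an earlier one. The sketch below contrasts that approach with a translation table. The function names complement_chained() and complement_table() and the four-base test sequence are made up for illustration.

```python
def complement_chained(seq):
    """Complement via successive replace() calls (kept only to show the pitfall)."""
    seq = seq.upper()
    seq = seq.replace('A', 'T')
    seq = seq.replace('T', 'A')  # also flips the T's created one line earlier
    seq = seq.replace('C', 'G')
    seq = seq.replace('G', 'C')  # likewise undoes the C -> G step
    return seq


def complement_table(seq):
    """Complement via a translation table, which maps every base in one pass."""
    return seq.upper().translate(str.maketrans('ATCG', 'TAGC'))


dna = 'ATGC'  # made-up test sequence
print(complement_chained(dna))      # AACC
print(complement_table(dna))        # TACG
print(complement_table(dna)[::-1])  # GCAT, the reverse complement
```

The chained version can only ever return A and C, which is an easy symptom to look for in a program's output.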
If you have any better ideas, feel free to leave me a comment.
My email: 1398635912@qq.com