2017-04-21 139 views
3

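How to edit a text (.fastq) file in Python

I have a file like the small example below. Every 4 lines belong to one ID, and the second line of each ID's record starts with N. I want to remove that N from the start of the line and leave everything else unchanged. I want to do this in Python. Do you know how to do it?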

Example:

@SRR2163140.1 HISEQ:148:C670LANXX:3:1101:1302:1947 length=50 
NGCGACCTCAGATCAGACGTGGCGACC 
+SRR2163140.1 HISEQ:148:C670LANXX:3:1101:1302:1947 length=50 
#<<ABGGGGGGGGGGGGGGGGGGGGGG 
@SRR2163140.3 HISEQ:148:C670LANXX:3:1101:1381:1997 length=50 
NGCCGACATCGAAGGATCAA 
+SRR2163140.3 HISEQ:148:C670LANXX:3:1101:1381:1997 length=50 
#<<ABFGGGGGGGGGGGGGG 
@SRR2163140.4 HISEQ:148:C670LANXX:3:1101:1705:1940 length=50 
NACAAACCCTTGTGTCGAGGGC 
+SRR2163140.4 HISEQ:148:C670LANXX:3:1101:1705:1940 length=50 
#=ABBGGGGGGGGGGGGGGGGG 
@SRR2163140.7 HISEQ:148:C670LANXX:3:1101:1704:1965 length=50 
NGGGACATGACAGCCTGGACCATCG 
+SRR2163140.7 HISEQ:148:C670LANXX:3:1101:1704:1965 length=50 
#=ABBGGGGGGGGGGGGGGGGGGGG 

Output:

@SRR2163140.1 HISEQ:148:C670LANXX:3:1101:1302:1947 length=50 
GCGACCTCAGATCAGACGTGGCGACC 
+SRR2163140.1 HISEQ:148:C670LANXX:3:1101:1302:1947 length=50 
#<<ABGGGGGGGGGGGGGGGGGGGGGG 
@SRR2163140.3 HISEQ:148:C670LANXX:3:1101:1381:1997 length=50 
GCCGACATCGAAGGATCAA 
+SRR2163140.3 HISEQ:148:C670LANXX:3:1101:1381:1997 length=50 
#<<ABFGGGGGGGGGGGGGG 
@SRR2163140.4 HISEQ:148:C670LANXX:3:1101:1705:1940 length=50 
ACAAACCCTTGTGTCGAGGGC 
+SRR2163140.4 HISEQ:148:C670LANXX:3:1101:1705:1940 length=50 
#=ABBGGGGGGGGGGGGGGGGG 
@SRR2163140.7 HISEQ:148:C670LANXX:3:1101:1704:1965 length=50 
GGGACATGACAGCCTGGACCATCG 
+SRR2163140.7 HISEQ:148:C670LANXX:3:1101:1704:1965 length=50 
#=ABBGGGGGGGGGGGGGGGGGGGG 
+1

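Note that to get a valid fastq file, you also need to remove the first character of the quality line. What you are asking for would not preserve the correspondence between bases and qualities. – bli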

Answer

4

If I did exactly what you ask (remove the N from the start of each sequence), that would leave the FASTQ file in an inconsistent state.

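Every fourth line of a FASTQ file contains the quality values for the sequence two lines earlier. So if you remove the first character from the sequence, you also need to remove the first character from the line with the quality values.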
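To make that pairing concrete, here is a minimal sketch that groups lines into 4-line records and checks that each sequence has the same length as its quality string. The names `read_records` and `is_consistent` are made up for illustration, the record is a shortened, invented one, and the sketch assumes the unwrapped 4-lines-per-record layout shown in the question.

```python
from itertools import islice


def read_records(lines):
    """Yield (header, sequence, plus, quality) tuples from an iterable of lines.

    Assumes the unwrapped layout with exactly four lines per record.
    """
    it = (line.rstrip() for line in lines)
    while True:
        record = tuple(islice(it, 4))
        if not record:
            return
        if len(record) != 4:
            raise ValueError("truncated FASTQ record")
        yield record


def is_consistent(record):
    """Return True if the sequence and quality lines have the same length."""
    _, sequence, _, quality = record
    return len(sequence) == len(quality)


# Shortened, made-up record for illustration.
example = ["@read1 \n", "NGCGACC \n", "+read1 \n", "#<<ABGG \n"]

for header, sequence, plus, quality in read_records(example):
    # Trimming only the sequence breaks the pairing; trimming both keeps it.
    print(is_consistent((header, sequence, plus, quality)))          # True
    print(is_consistent((header, sequence[1:], plus, quality)))      # False
    print(is_consistent((header, sequence[1:], plus, quality[1:])))  # True
```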

You could do something very simple in pure Python like

with open("example.fastq") as f:
    for idx, line in enumerate(f.read().splitlines()):
        # Odd 0-based indices are the sequence and quality lines of each record
        if idx % 2:
            print(line[1:])  # drop the first character
        else:
            print(line)

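but if you want to work with biological data properly, you really should start using a bioinformatics module such as BioPython. It will warn you if what you are trying to do would leave the file in an inconsistent or meaningless state.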

The solution would then look like:

from Bio import SeqIO 
from Bio import Seq 

new_records = [] 
for record in SeqIO.parse("example.fastq", "fastq"): 
    sequence = str(record.seq) 
    letter_annotations = record.letter_annotations 

    # You first need to empty the existing letter annotations 
    record.letter_annotations = {} 

    new_sequence = sequence[1:] 
    record.seq = Seq.Seq(new_sequence) 


    new_letter_annotations = {'phred_quality': letter_annotations['phred_quality'][1:]} 
    record.letter_annotations = new_letter_annotations 

    new_records.append(record) 


with open('without_starting_N.fastq', 'w') as output_handle: 
    SeqIO.write(new_records, output_handle, "fastq") 

which outputs the following, with only a bare + on the third line of each record:

@SRR2163140.1 HISEQ:148:C670LANXX:3:1101:1302:1947 length=50 
GCGACCTCAGATCAGACGTGGCGACC 
+ 
<<ABGGGGGGGGGGGGGGGGGGGGGG 
@SRR2163140.3 HISEQ:148:C670LANXX:3:1101:1381:1997 length=50 
GCCGACATCGAAGGATCAA 
+ 
<<ABFGGGGGGGGGGGGGG 
@SRR2163140.4 HISEQ:148:C670LANXX:3:1101:1705:1940 length=50 
ACAAACCCTTGTGTCGAGGGC 
+ 
=ABBGGGGGGGGGGGGGGGGG 
@SRR2163140.7 HISEQ:148:C670LANXX:3:1101:1704:1965 length=50 
GGGACATGACAGCCTGGACCATCG 
+ 
=ABBGGGGGGGGGGGGGGGGGGGG 

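(i.e. the '+' character is only optionally followed by the same sequence identifier and description as on the line two lines before it)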
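A similar result can be approximated without BioPython. The following is only a sketch under stated assumptions, not the method used above: `trim_leading_n` is a hypothetical helper that assumes unwrapped 4-line records with no empty reads, drops the first base and its quality character only when the read actually starts with N, and writes a bare + separator as in the output above. The record is a shortened, made-up one.

```python
def trim_leading_n(lines):
    """Yield FASTQ lines with a leading N (and its quality score) removed.

    Assumes unwrapped records of exactly four lines each; blank lines
    (for example a trailing one at the end of the file) are skipped.
    """
    stripped = [line.rstrip() for line in lines if line.strip()]
    if len(stripped) % 4:
        raise ValueError("line count is not a multiple of 4")
    for i in range(0, len(stripped), 4):
        header, sequence, _, quality = stripped[i:i + 4]
        if len(sequence) != len(quality):
            raise ValueError(f"length mismatch in record {header!r}")
        if sequence.startswith("N"):
            # Drop the base and the matching quality character together.
            sequence, quality = sequence[1:], quality[1:]
        yield from (header, sequence, "+", quality)


# Shortened, made-up record for illustration.
example = ["@read1 \n", "NGCGACC \n", "+read1 \n", "#<<ABGG \n"]
print("\n".join(trim_leading_n(example)))
```

To process a real file, an open file object can be passed in place of `example` and each yielded line written out followed by a newline. The sketch reads all lines into memory, so for very large files BioPython or a streaming variant is likely the better choice.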