Deleting items from a dictionary generated by SeqIO.index

I am using Python 2.6.6 and I am trying to remove the reads that overlap with (i.e. are identical to) the reads in file1. Here is the code I want to implement:

from Bio import SeqIO

ref_reads = SeqIO.index("file1.fastq", "fastq")
spk_reads = SeqIO.index("file2.fastq", "fastq")

for spk in spk_reads:
    if spk in ref_reads:
        del ref_reads[spk]

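However, I get this error from my use of del:

AttributeError: _IndexedSeqFileDict instance has no attribute '__delitem__'

Is it possible to delete an item with the current approach? How can I delete items from a dictionary generated with SeqIO.index()?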

I also tried the following:

# import read data 
ref_reads = SeqIO.index("main.fastq", "fastq") 
spk_reads = SeqIO.index("over.fastq", "fastq") 

# note that ref_reads.keys() doesn't return a list but a 'dictionary-keyiterator',
# so we turn it into a set to work with it 
ref_keys = set(ref_reads.keys()) 
spk_keys = set(spk_reads.keys()) 

# loop to remove overlap reads 
for spk in spk_keys:
    if spk in ref_keys:
        del ref_keys[spk]

# output data 
output_handle = open(fname_out, "w") 
SeqIO.write(ref_reads[ref_keys], output_handle, "fastq") 
output_handle.close()

Source

2017-09-13 wa3j

SeqIO.index() does not return a true dictionary, but a dictionary-like object, giving the SeqRecord objects as values:

Note that this pseudo dictionary will not support all the methods of a true Python dictionary, for example values() is not defined since this would require loading all of the records into memory at once.

This dictionary-like object is an _IndexedSeqFileDict instance. Its docstring mentions:

Note that this dictionary is essentially read only. You cannot add or change values, pop values, nor clear the dictionary.
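To illustrate why del fails on such an object, here is a minimal sketch of a read-only mapping. The ReadOnlyIndex class and its toy data are hypothetical stand-ins written for this example, not Biopython code; the point is only that an object which defines lookup and iteration but no __delitem__ cannot have items removed.

```python
from collections.abc import Mapping


class ReadOnlyIndex(Mapping):
    """Toy stand-in for a read-only, dictionary-like index (not Biopython code)."""

    def __init__(self, records):
        self._records = dict(records)

    def __getitem__(self, key):
        return self._records[key]

    def __iter__(self):
        return iter(self._records)

    def __len__(self):
        return len(self._records)


index = ReadOnlyIndex({"read1": "ACGT", "read2": "TTGA"})

try:
    del index["read1"]
    deletion_worked = True
except TypeError:
    # Python 3 raises TypeError for a missing __delitem__; the Python 2
    # traceback in the question reports an AttributeError instead.
    deletion_worked = False

# Lookups and membership tests are unaffected
has_read1 = "read1" in index
```

Because nothing was deleted, the index still holds both entries after the failed del.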

So you need to use SeqIO.parse() and SeqIO.to_dict() to convert your FASTQ file into an in-memory Python dictionary:

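from Bio import SeqIO

ref_reads = SeqIO.parse("file1.fastq", "fastq")
spk_reads = SeqIO.parse("file2.fastq", "fastq")

# load every record from file1 into an in-memory dict keyed by record ID
ref_reads_dict = SeqIO.to_dict(ref_reads)

# drop each file1 record whose ID also appears in file2
for spk in spk_reads:
    if spk.id in ref_reads_dict:
        del ref_reads_dict[spk.id]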

If your files are so large that working with SeqIO.parse() is not feasible, then I would do something like this:

from Bio import SeqIO 

ref_reads = SeqIO.index("file1.fastq", "fastq") 
spk_reads = SeqIO.index("file2.fastq", "fastq") 

# note that ref_reads.keys() doesn't return a list but a 'dictionary-keyiterator', 
# so we turn it into a set to work with it 
ref_keys = set(ref_reads.keys()) 
spk_keys = set(spk_reads.keys()) 

unique_ref_keys = ref_keys - spk_keys 

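# this step might take a long time if your files are large
# (dict() with a generator expression also works on Python 2.6,
# which does not support dict comprehensions)
unique_ref_reads = dict((key, ref_reads[key]) for key in unique_ref_keys)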
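The set arithmetic in this approach can be tried on its own with toy data. In the sketch below, the read IDs are hypothetical placeholders for the keys of the two indexes. It also shows one caveat worth knowing: a set has no defined order, so if the original file order matters, it is safer to filter an ordered list of IDs than to iterate over the set difference.

```python
# Hypothetical read IDs standing in for the keys of the two indexes
ref_order = ["read1", "read2", "read3", "read4"]  # order in which IDs appear in file1
ref_keys = set(ref_order)
spk_keys = set(["read2", "read4", "read9"])

# IDs present in file1 but not in file2 (unordered)
unique_ref_keys = ref_keys - spk_keys

# IDs present in both files, i.e. the reads to be dropped
overlap = ref_keys & spk_keys

# Same selection as unique_ref_keys, but preserving file1's order
kept_in_file_order = [key for key in ref_order if key not in spk_keys]
```

Membership tests against a set take roughly constant time, so the filtering pass stays linear in the number of reads.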

Edit, in answer to your comment:

how can I again solve the original problem of deleting items from SeqIO.index("file1.fastq", "fastq")?

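As I stated above, SeqIO.index("file1.fastq", "fastq") returns a read-only _IndexedSeqFileDict object. So, by design, you cannot delete items from it.

The updated code below shows how to create a new FASTQ file from which the overlapping reads have been removed.

If you really want a new SeqIO.index() object, you can read this file in again with SeqIO.index().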

from Bio import SeqIO 

ref_reads = SeqIO.index("file1.fastq", "fastq") 
spk_reads = SeqIO.index("file2.fastq", "fastq") 

ref_keys = set(ref_reads.keys()) 
spk_keys = set(spk_reads.keys()) 

unique_ref_keys = ref_keys - spk_keys 

# conserve memory by using a generator expression 
unique_ref_records = (ref_reads[key] for key in unique_ref_keys) 

# output new file with overlapping reads removed 
with open(fname_out, "w") as output_handle: 
    SeqIO.write(unique_ref_records, output_handle, "fastq")

# optionally, create a new SeqIO.index() object 
unique_ref_reads = SeqIO.index(fname_out, "fastq")
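If speed is the main concern, one alternative is to skip record parsing and copy the raw FASTQ lines. The sketch below uses only the standard library rather than Biopython, and it is a different technique from the code above. It assumes plain four-line FASTQ records (no wrapped sequence lines, no blank lines inside the file) and takes the read ID to be the first word of the header line after the "@". The function names and the toy files are hypothetical.

```python
import os
import tempfile


def read_id(header_line):
    """Return the read ID: the header text after '@', up to the first whitespace."""
    return header_line[1:].split()[0]


def fastq_records(handle):
    """Yield each record as a list of 4 lines (assumes unwrapped 4-line FASTQ)."""
    while True:
        lines = [handle.readline() for _ in range(4)]
        if not lines[0].strip():
            # End of file (or a blank line): stop reading
            return
        yield lines


def write_unique_reads(ref_path, spk_path, out_path):
    """Copy records from ref_path whose IDs do not occur in spk_path."""
    with open(spk_path) as spk_handle:
        spk_ids = set(read_id(rec[0]) for rec in fastq_records(spk_handle))
    kept = 0
    with open(ref_path) as ref_handle, open(out_path, "w") as out_handle:
        for rec in fastq_records(ref_handle):
            if read_id(rec[0]) not in spk_ids:
                out_handle.writelines(rec)
                kept += 1
    return kept


# Toy demonstration with two tiny temporary files
tmp_dir = tempfile.mkdtemp()
ref_path = os.path.join(tmp_dir, "ref.fastq")
spk_path = os.path.join(tmp_dir, "spk.fastq")
out_path = os.path.join(tmp_dir, "out.fastq")

with open(ref_path, "w") as handle:
    handle.write("@read1\nACGT\n+\nIIII\n@read2\nTTGA\n+\nIIII\n")
with open(spk_path, "w") as handle:
    handle.write("@read2\nTTGA\n+\nIIII\n")

kept_count = write_unique_reads(ref_path, spk_path, out_path)

with open(out_path) as handle:
    result = handle.read()
```

Only the IDs of the second file are held in memory, the first file is streamed once, and the output keeps the original record order. The combined with statement needs Python 2.7 or later. As in the code above, reads are matched by ID only, not by sequence.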

Source

2017-09-13 12:20:37 BioGeek

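Thank you for the helpful suggestions. The first solution works, but it is slow compared with the code I was trying to improve. Could you provide more information about your second code block? I am trying to remove the reads from file1.fastq that are also in file2.fastq. Based on your second solution, how can I again solve the original problem of deleting items from SeqIO.index("file1.fastq", "fastq")? I have updated the question to reflect my latest attempt. – wa3j

@wa3j: See my edit above. – BioGeek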
