SeqIO.index() does not return a true dictionary, but a dictionary-like object that gives the SeqRecord objects as values:
Note that this pseudo dictionary will not support all the methods of a true Python dictionary, for example values() is not defined since this would require loading all of the records into memory at once.
This dictionary-like object is an _IndexedSeqFileDict instance. Its docstring mentions:
Note that this dictionary is essentially read only. You cannot add or change values, pop values, nor clear the dictionary.
So you need to use SeqIO.parse() and SeqIO.to_dict() to convert your fastq file into an in-memory Python dictionary:
from Bio import SeqIO
ref_reads = SeqIO.parse("file1.fastq", "fastq")
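spk_reads = SeqIO.parse("file2.fastq", "fastq")
# load the reference reads into memory, keyed by record id
ref_reads_dict = SeqIO.to_dict(ref_reads)
# drop every reference read that also occurs in file2.fastq
for spk in spk_reads:
    if spk.id in ref_reads_dict:
        del ref_reads_dict[spk.id]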
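To make the delete-by-ID logic above concrete, here is a minimal sketch that uses plain dictionaries as stand-ins for the parsed reads. The read IDs and sequences are made up, and `remove_shared` is a hypothetical helper, not part of Biopython. Collecting the IDs to drop into a set first keeps each membership test cheap, and building a new dictionary leaves the original untouched.

```python
def remove_shared(ref_records, spk_ids):
    """Return a new dict holding only the entries whose key is not in spk_ids."""
    spk_ids = set(spk_ids)  # set membership tests are O(1) on average
    return {key: value for key, value in ref_records.items() if key not in spk_ids}


# Made-up stand-ins for the SeqIO.to_dict(...) result and for the IDs in file2.fastq
ref_reads_dict = {"read1": "ACGT", "read2": "GGCC", "read3": "TTAA"}
spk_ids = ["read2", "read4"]

unique_reads = remove_shared(ref_reads_dict, spk_ids)
print(sorted(unique_reads))
```

With real data, the values would be SeqRecord objects rather than strings; the filtering itself does not depend on the value type.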
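If your files are so large that loading them into memory with SeqIO.parse() and SeqIO.to_dict() is not feasible, then I would do something like this: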
from Bio import SeqIO
ref_reads = SeqIO.index("file1.fastq", "fastq")
spk_reads = SeqIO.index("file2.fastq", "fastq")
# note that ref_reads.keys() doesn't return a list but a 'dictionary-keyiterator',
# so we turn it into a set to work with it
ref_keys = set(ref_reads.keys())
spk_keys = set(spk_reads.keys())
unique_ref_keys = ref_keys - spk_keys
# this step might take a long time if your files are large
unique_ref_reads = {key: ref_reads[key] for key in unique_ref_keys}
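One thing to keep in mind with the set difference above is that sets are unordered, so the surviving keys no longer come back in any particular order. If the output order matters, a possible alternative is to walk the reference keys in their own order and skip the ones found in the other set. The sketch below uses plain dictionaries with made-up IDs as stand-ins for the two indexes; whether SeqIO.index() yields its keys in file order is an assumption worth checking for your Biopython version.

```python
# Made-up stand-ins for the two indexed files; only the keys matter here
ref_reads = {"read3": "TTAA", "read1": "ACGT", "read2": "GGCC"}
spk_reads = {"read2": "GGCC", "read4": "CCGG"}

spk_keys = set(spk_reads.keys())

# Set difference: correct membership, but iteration order is arbitrary
unique_unordered = set(ref_reads.keys()) - spk_keys

# Order-preserving alternative: walk the reference keys in their own order
unique_ordered = [key for key in ref_reads.keys() if key not in spk_keys]
print(unique_ordered)
```

Both variants select the same keys; only the iteration order differs.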
Edit, in reply to your comment:
how can I again solve the original problem of deleting items from SeqIO.index("file1.fastq", "fastq")?
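As I said above, SeqIO.index("file1.fastq", "fastq") returns a read-only _IndexedSeqFileDict object. So, by design, you cannot delete items from it.
The updated code below shows how to create a new fastq file with the overlapping reads removed.
If you really want a new SeqIO.index() object, you can read that file back in with SeqIO.index().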
from Bio import SeqIO
ref_reads = SeqIO.index("file1.fastq", "fastq")
spk_reads = SeqIO.index("file2.fastq", "fastq")
ref_keys = set(ref_reads.keys())
spk_keys = set(spk_reads.keys())
unique_ref_keys = ref_keys - spk_keys
# conserve memory by using a generator expression
unique_ref_records = (ref_reads[key] for key in unique_ref_keys)
# output new file with overlapping reads removed
fname_out = "file1_unique.fastq"  # placeholder output path; choose your own
with open(fname_out, "w") as output_handle:
    SeqIO.write(unique_ref_records, output_handle, "fastq")
# optionally, create a new SeqIO.index() object
unique_ref_reads = SeqIO.index(fname_out, "fastq")
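If speed is the main concern, one option that avoids building SeqRecord objects altogether is to stream both files as plain text. This is not what the code above does; it is a rough sketch using only the standard library, and it assumes every record is exactly four lines (no wrapped sequence or quality lines) and that the read ID is the first whitespace-delimited word of the title line. The names `fastq_records` and `write_unique_reads` are hypothetical helpers introduced for illustration.

```python
from itertools import islice


def fastq_records(handle):
    """Yield (read_id, record_text) for each record in an open FASTQ handle.

    Assumes every record is exactly four lines (no wrapped sequence lines).
    """
    while True:
        lines = list(islice(handle, 4))
        if not lines:
            return  # clean end of file
        if len(lines) < 4 or not lines[0].startswith("@"):
            raise ValueError("input does not look like four-line FASTQ")
        # The ID is the title line without '@', up to the first whitespace
        fields = lines[0][1:].split()
        read_id = fields[0] if fields else ""
        yield read_id, "".join(lines)


def write_unique_reads(ref_path, spk_path, out_path):
    """Copy records from ref_path to out_path, skipping IDs found in spk_path.

    Returns the number of records written.
    """
    # Only the IDs of the second file are held in memory
    with open(spk_path) as spk_handle:
        spk_ids = {read_id for read_id, _ in fastq_records(spk_handle)}
    kept = 0
    with open(ref_path) as ref_handle, open(out_path, "w") as out_handle:
        for read_id, text in fastq_records(ref_handle):
            if read_id not in spk_ids:
                out_handle.write(text)
                kept += 1
    return kept
```

A call such as `write_unique_reads("file1.fastq", "file2.fastq", "file1_unique.fastq")` (the output name is a placeholder) would write the filtered file in the original read order, and the result can then be opened with SeqIO.index() if an index is needed.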
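Thanks for the helpful suggestions. The first solution works, but it is slow compared to the code I am trying to improve. Could you provide additional information on your second code block? I am trying to remove reads from file1.fastq that are also in file2.fastq. Based on your second solution, how can I again solve the original problem of deleting items from SeqIO.index("file1.fastq", "fastq")? I updated the question to reflect my latest attempt. – wa3j
@wa3j: See my edit above. – BioGeek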