1
我有一个BLAST表格输出,有数百万个匹配.Con是我的序列,P是蛋白质命中。我有兴趣区分与以下3种情况对应的点击。它们应该全部打印在3个独立的新文件中,并且在文件1中的重叠群不应该在文件2,3等中。如何做到这一点?查找重叠区域和非重叠区域
con1 ----------------------- (Contigs with both overlapping and non overlapping hits)
p1---- p2 ------ p4---
p3-----
con2 --------------------- (only overlapping) con3 ----------------(only non overlp)
p1 ----- p1 ---- p2 -----
p2 -------
p3 -----
如果我知道蛋白质开始和结束位点,就有可能确定重叠或非重叠;如果S1 < E2 < S2和E1 < S2 < E2或S2 - E1> 0 我输入文件,即
contig protein start end
con1 P1 481 931
con1 P2 140 602
con1 P3 232 548
con2 P4 335 406
con2 P5 642 732
con2 P6 2282 2433
con2 P7 729 812
con3 P8 17 148
con3 P9 289 45
我的脚本(这我打印仅命中不重叠)
from itertools import groupby
def nonoverlapping(hits):
"""Returns a list of non-overlapping hits."""
nonover = []
overst = False
for i in range(1,len(hits)):
(p, c) = hits[i-1], hits[i]
if c[2] > p[3]:
if not overst: nonover.append(p)
nonover.append(c)
overst = True
return nonover
fh = open('file.txt')
oh = open('result.txt', 'w')
for qid, grp in groupby(fh, lambda l: l.split()[0]):
hits = []
for line in grp:
hsp = line.split()
hsp[2], hsp[3] = int(hsp[2]), int(hsp[3])
hits.append(hsp)
if len(hits) > 1:
hits.sort(key=lambda x: x[2])
for hit in nonoverlapping(hits):
oh.write('\t'.join([str(f) for f in hit])+'\n')
谢谢你,我把你的算法的路线,但不反刍运行script..what如果我只补充,如果C [2] <= P [3]或c [2]> p [3]:行? – user3224522
我测试了整个事情,现在就把它。这个对我有用。问题是,对于每个重叠群,你需要看它是否有重叠的读取,看看你想写在哪个文件中,对吗?这就是我所理解的。让我知道如果这可以解决问题:) – cnluzon
你知道,它似乎工作,我只需要仔细检查,因为我的数据是如此巨大,非常感谢你! – user3224522