2014-04-02 51 views
1

我有一个BLAST表格输出,有数百万个匹配.Con是我的序列,P是蛋白质命中。我有兴趣区分与以下3种情况对应的点击。它们应该全部打印在3个独立的新文件中,并且在文件1中的重叠群不应该在文件2,3等中。如何做到这一点?查找重叠区域和非重叠区域

con1 ----------------------- (Contigs with both overlapping and non overlapping hits) 
    p1---- p2 ------ p4--- 
       p3----- 
con2 --------------------- (only overlapping) con3 ----------------(only non overlp) 
    p1 -----          p1 ---- p2 ----- 
     p2 ------- 
      p3 ----- 

如果我知道蛋白质开始和结束位点,就有可能确定重叠或非重叠;如果S1 < E2 < S2和E1 < S2 < E2或S2 - E1> 0 我输入文件,即

contig protein start end 
con1 P1 481 931 
con1 P2 140 602 
con1 P3 232 548 
con2 P4 335 406 
con2 P5 642 732 
con2 P6 2282 2433 
con2 P7 729 812 
con3 P8 17 148 
con3 P9 289 45 

我的脚本(这我打印仅命中不重叠)

from itertools import groupby 

def nonoverlapping(hits): 
    """Returns a list of non-overlapping hits.""" 
    nonover = [] 
    overst = False 
    for i in range(1,len(hits)): 
     (p, c) = hits[i-1], hits[i] 
     if c[2] > p[3]: 
      if not overst: nonover.append(p) 
      nonover.append(c) 
      overst = True 
    return nonover 

fh = open('file.txt') 
oh = open('result.txt', 'w') 
for qid, grp in groupby(fh, lambda l: l.split()[0]): 
    hits = [] 
    for line in grp: 
     hsp = line.split() 
     hsp[2], hsp[3] = int(hsp[2]), int(hsp[3]) 
     hits.append(hsp) 
    if len(hits) > 1: 
     hits.sort(key=lambda x: x[2]) 
     for hit in nonoverlapping(hits): 
      oh.write('\t'.join([str(f) for f in hit])+'\n') 

回答

2

我会做这样的事情。为两个命中定义一个“重叠”函数,然后测试每个重叠群是全部,部分还是不重叠。然后写所有的重叠群到所需的文件:

from itertools import groupby 

def overlaps(a, b): 
    result = True 
    # Supposing a[2] is the start, a[3] the end. 
    # If end before start, they are not overlapping 
    if a[3] < b[2] or b[3] < a[2]: 
     result = False 

    return result 

def test_overlapping(hits): 
    overlapping = 'None' 
    overlapping_count = 0 
    for i in range(len(hits)-1): 
     if overlaps(hits[i], hits[i+1]): 
      overlapping_count += 1 


    if overlapping_count == 0: 
     overlapping = 'None' 
    elif overlapping_count == len(hits) -1: 
     overlapping = 'All' 
    else: 
     overlapping = 'Some' 

    return overlapping 


fh = open('file.txt') 

file_all = open('result_all.txt', 'w') 
file_some = open('result_some.txt', 'w') 
file_none = open('result_none.txt', 'w') 

line = fh.readline() # quit header 

for qid, grp in groupby(fh, lambda l: l.split()[0]): 
    hits = [] 
    for line in grp: 
     hsp = line.split() 
     hsp[2], hsp[3] = int(hsp[2]), int(hsp[3]) 
     hits.append(hsp) 
    if len(hits) > 1: 
     hits.sort(key=lambda x: x[2]) 
     overlapping = test_overlapping(hits) 
     out_file = file_none 
     if overlapping == 'All': 
      out_file = file_all 
     elif overlapping == 'Some': 
      out_file = file_some 

     for h in hits: 
      out_file.write('\t'.join([str(v) for v in h])) 
      out_file.write('\n') 


file_all.close() 
file_some.close() 
file_none.close() 
+0

谢谢你,我把你的算法的路线,但不反刍运行script..what如果我只补充,如果C [2] <= P [3]或c [2]> p [3]:行? – user3224522

+0

我测试了整个事情,现在就把它。这个对我有用。问题是,对于每个重叠群,你需要看它是否有重叠的读取,看看你想写在哪个文件中,对吗?这就是我所理解的。让我知道如果这可以解决问题:) – cnluzon

+1

你知道,它似乎工作,我只需要仔细检查,因为我的数据是如此巨大,非常感谢你! – user3224522