I have a tab-delimited file like this (grouping and sorting a file in Python):
gene_name length
Traes_3AS_4F141FD24.2 24.8
Traes_4AL_A00EF17B2.1 0.0
Traes_4AL_A00EF17B2.1 0.9
Traes_4BS_6943FED4B.1 4.5
Traes_4BS_6943FED4B.1 42.9
UCW_Tt-k25_contig_29046 0.4
UCW_Tt-k25_contig_29046 2.8
UCW_Tt-k25_contig_29046 11.4
UCW_Tt-k25_contig_29046 12.3
UCW_Tt-k25_contig_29046 14.4
UCW_Tt-k25_contig_29046 14.2
UCW_Tt-k25_contig_29046 19.6
UCW_Tt-k25_contig_29046 19.6
UCW_Tt-k25_contig_29046 21.1
UCW_Tt-k25_contig_29046 23.7
UCW_Tt-k25_contig_29046 23.7
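I need to group by gene_name and split the rows into 3 files: 1) if the gene_name is unique; 2) if the difference in length between the genes within a group is > 10; 3) if the difference in length within the group is < 10. This is my attempt: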
from itertools import groupby

def iter_hits(hits):
    for i in range(1, len(hits)):
        (p, c) = hits[i-1], hits[i]
        yield p, c

def is_overlap(hits):
    for p, c in iter_hits(hits):
        if c[1] - p[1] > 10:
            return True

fh = open('my_file', 'r')
oh1 = open('a', 'w')
oh2 = open('b', 'w')
oh3 = open('c', 'w')

for qid, grp in groupby(fh, lambda l: l.split()[0]):
    hits = []
    for line in grp:
        hsp = line.split()
        hsp[1] = float(hsp[1])
        hits.append(hsp)
    hits.sort(key=lambda x: x[1])
    if len(hits) == 1:
        oh = oh3
    elif is_overlap(hits):
        oh = oh1
    else:
        oh = oh2
    for hit in hits:
        oh.write('\t'.join([str(f) for f in hit]) + '\n')
The output I need is:
c) Traes_3AS_4F141FD24.2 24.8
b) Traes_4AL_A00EF17B2.1 0.0
Traes_4AL_A00EF17B2.1 0.9
a) Traes_4BS_6943FED4B.1 4.5
Traes_4BS_6943FED4B.1 42.9
UCW_Tt-k25_contig_29046 0.4
UCW_Tt-k25_contig_29046 2.8
UCW_Tt-k25_contig_29046 11.4
UCW_Tt-k25_contig_29046 12.3
UCW_Tt-k25_contig_29046 14.4
UCW_Tt-k25_contig_29046 14.2
UCW_Tt-k25_contig_29046 19.6
UCW_Tt-k25_contig_29046 19.6
UCW_Tt-k25_contig_29046 21.1
UCW_Tt-k25_contig_29046 23.7
UCW_Tt-k25_contig_29046 23.7
P.S. I apologize for such a long question, but otherwise it would be hard for me to explain clearly.
What exactly is your question? Are you getting any errors? –
The gene UCW_Tt-k25_contig_29046 ends up in file b. I think that's because I am subtracting the length of the previous gene. How can I improve this? – user3224522
If there are two values that differ by more than 10, do you need them to end up in the 'c' file? –
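One possible way around the adjacent-subtraction problem mentioned in the comments is to compare the largest and smallest length in each group instead of neighbouring rows. The sketch below is only an illustration under a few assumptions that are not stated in the question: "difference within the group" is taken to mean max minus min, a difference of exactly 10 is treated as not greater than 10, the header line has already been skipped, and rows with the same gene_name are adjacent in the input. The names `classify` and `split_groups` are hypothetical helpers, not part of the original attempt.

```python
from itertools import groupby


def classify(lengths, threshold=10.0):
    """Pick a target file label for one group of lengths.

    'c' -> the gene_name occurs only once
    'a' -> max - min of the group is greater than the threshold
    'b' -> everything else (assumption: exactly 10 also lands here)
    """
    if len(lengths) == 1:
        return 'c'
    if max(lengths) - min(lengths) > threshold:
        return 'a'
    return 'b'


def split_groups(lines, threshold=10.0):
    """Group consecutive rows by the first column and bucket them by label.

    Assumes each non-empty line holds a gene_name and a numeric length,
    separated by whitespace, and that the header line is not included.
    """
    buckets = {'a': [], 'b': [], 'c': []}
    rows = (line.split() for line in lines if line.strip())
    for _name, grp in groupby(rows, key=lambda r: r[0]):
        # Sort each group by length, as the original attempt does.
        hits = sorted(((r[0], float(r[1])) for r in grp), key=lambda h: h[1])
        label = classify([length for _, length in hits], threshold)
        buckets[label].extend(hits)
    return buckets
```

Each bucket could then be written to its file with something like `'\t'.join([name, str(length)])` per row. With the sample data from the question, UCW_Tt-k25_contig_29046 spans 0.4 to 23.7, so under these assumptions it would be labelled 'a' rather than 'b'.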