Split on '|', then skip everything until the first gb; the next element is the ID:
from itertools import dropwhile
text = iter(text.split('|'))
next(dropwhile(lambda s: s != 'gb', text))
id = next(text)
Demo:
>>> text = '>gi|124486857|ref|NP_001074751.1| inhibitor of Bruton tyrosine kinase [Mus musculus] >gi|341941060|sp|Q6ZPR6.3|IBTK_MOUSE RecName: Full=Inhibitor of Bruton tyrosine kinase; Short=IBtk >gi|148694536|gb|EDL26483.1| mCG128548, isoform CRA_d [Mus musculus] >gi|223460980|gb|AAI37799.1| Ibtk protein [Mus musculus]'
>>> text = iter(text.split('|'))
>>> next(dropwhile(lambda s: s != 'gb', text))
'gb'
>>> id = next(text)
>>> id
'EDL26483.1'
In other words, there is no need for a regular expression.
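This works because `dropwhile` and the later `next()` call share one iterator: whatever `dropwhile` consumes is gone, so the following `next()` picks up right after the `gb` field. A minimal sketch on a made-up, shortened string (the values `X1` and `X2` are placeholders, not real accession numbers):

```python
from itertools import dropwhile

# Placeholder input; the real text is a '|'-separated FASTA header.
fields = iter('a|b|gb|X1|c|gb|X2'.split('|'))

# dropwhile discards 'a' and 'b', then hands back the first 'gb'.
marker = next(dropwhile(lambda s: s != 'gb', fields))

# The shared iterator now sits just past 'gb', so this is the ID.
first_id = next(fields)

# Everything after the first ID is still unconsumed.
rest = list(fields)
```

After this runs, `marker` is `'gb'`, `first_id` is `'X1'`, and `rest` is `['c', 'gb', 'X2']`.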
Turn the split-and-skip approach into a generator function to get all the IDs:
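from itertools import dropwhile
def extract_ids(text):
    text = iter(text.split('|'))
    while True:
        # Skip up to and including the next 'gb'. The None defaults end the generator
        # cleanly; a bare StopIteration in a generator is a RuntimeError since Python 3.7.
        if next(dropwhile(lambda s: s != 'gb', text), None) is None:
            return
        id_ = next(text, None)
        if id_ is None:
            return
        yield id_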
This gives:
>>> text = '>gi|124486857|ref|NP_001074751.1| inhibitor of Bruton tyrosine kinase [Mus musculus] >gi|341941060|sp|Q6ZPR6.3|IBTK_MOUSE RecName: Full=Inhibitor of Bruton tyrosine kinase; Short=IBtk >gi|148694536|gb|EDL26483.1| mCG128548, isoform CRA_d [Mus musculus] >gi|223460980|gb|AAI37799.1| Ibtk protein [Mus musculus]'
>>> list(extract_ids(text))
['EDL26483.1', 'AAI37799.1']
Or you can use it in a simple loop:
for id in extract_ids(text):
    print(id)
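One caveat with the single-ID snippet: if the text contains no `gb` field at all, the bare `next()` calls raise `StopIteration`. A minimal sketch of a guarded version follows; the helper name `first_gb_id` and its `default` parameter are hypothetical, not part of the original answer:

```python
from itertools import dropwhile

def first_gb_id(text, default=None):
    """Return the field right after the first 'gb' field, or default."""
    fields = iter(text.split('|'))
    # If there is no 'gb', dropwhile exhausts the iterator; the None
    # default keeps next() from raising StopIteration.
    next(dropwhile(lambda s: s != 'gb', fields), None)
    # Nothing left (no 'gb', or 'gb' was the last field): fall back to default.
    return next(fields, default)
```

Called on a header without a `gb` entry, this returns `default` instead of raising.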
Is the '>' character part of the text? – 2013-02-13 20:56:56
Something along the lines of \|gb\|(.*?\|) – dutt 2013-02-13 20:59:23
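For comparison, the pattern from the comment above can be tried with `re.findall`. This is only a sketch on a shortened sample string; the second pattern is a variant that assumes the trailing pipe should sit outside the capture group so it is not returned as part of the ID:

```python
import re

# Shortened sample built from the two 'gb' entries in the question's text.
sample = ('>gi|148694536|gb|EDL26483.1| mCG128548, isoform CRA_d '
          '>gi|223460980|gb|AAI37799.1| Ibtk protein')

# Pattern as written in the comment: the captured group includes the pipe.
with_pipe = re.findall(r'\|gb\|(.*?\|)', sample)

# Variant (an assumption): move the closing pipe outside the group.
ids = re.findall(r'\|gb\|(.*?)\|', sample)
```

Here `with_pipe` holds `'EDL26483.1|'` and `'AAI37799.1|'`, while `ids` holds the bare identifiers.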