Parsing IDs from text with Python

I have text like this:

>gi|124486857|ref|NP_001074751.1| inhibitor of Bruton tyrosine kinase [Mus musculus] >gi|341941060|sp|Q6ZPR6.3|IBTK_MOUSE RecName: Full=Inhibitor of Bruton tyrosine kinase; Short=IBtk >gi|148694536|gb|EDL26483.1| mCG128548, isoform CRA_d [Mus musculus] >gi|223460980|gb|AAI37799.1| Ibtk protein [Mus musculus]

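From this text I want to parse the IDs that come after |gb| and write them to a list.

I tried using regular expressions but could not get it to work.

Source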

2013-02-13 Leo.peis

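Is the '>' character part of the text? – 2013-02-13 20:56:56

Something along the lines of \|gb\|(.*?\|) – dutt 2013-02-13 20:59:23

A regular expression should work, as long as the | pipes are escaped: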

import re
re.findall(r'gb\|([^|]*)\|', 'gb|AB1234|')

Source

2013-02-13 20:59:40 Hoopdady

Split on '|', then skip everything until the first gb; the next element is the ID:

from itertools import dropwhile 

text = iter(text.split('|')) 
next(dropwhile(lambda s: s != 'gb', text)) 
id = next(text)

Demo:

>>> text = '>gi|124486857|ref|NP_001074751.1| inhibitor of Bruton tyrosine kinase [Mus musculus] >gi|341941060|sp|Q6ZPR6.3|IBTK_MOUSE RecName: Full=Inhibitor of Bruton tyrosine kinase; Short=IBtk >gi|148694536|gb|EDL26483.1| mCG128548, isoform CRA_d [Mus musculus] >gi|223460980|gb|AAI37799.1| Ibtk protein [Mus musculus]' 
>>> text = iter(text.split('|')) 
>>> next(dropwhile(lambda s: s != 'gb', text)) 
'gb' 
>>> id = next(text) 
>>> id 
'EDL26483.1'

In other words, there is no need for a regular expression.

Turn it into a generator function to get all the IDs:

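from itertools import dropwhile

def extract_ids(text):
    text = iter(text.split('|'))
    while True:
        try:
            # Skip ahead to the next 'gb' field; the field after it is the ID.
            next(dropwhile(lambda s: s != 'gb', text))
            yield next(text)
        except StopIteration:
            # Input exhausted; needed on Python 3.7+ (PEP 479).
            return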

This gives:

>>> text = '>gi|124486857|ref|NP_001074751.1| inhibitor of Bruton tyrosine kinase [Mus musculus] >gi|341941060|sp|Q6ZPR6.3|IBTK_MOUSE RecName: Full=Inhibitor of Bruton tyrosine kinase; Short=IBtk >gi|148694536|gb|EDL26483.1| mCG128548, isoform CRA_d [Mus musculus] >gi|223460980|gb|AAI37799.1| Ibtk protein [Mus musculus]' 
>>> list(extract_ids(text)) 
['EDL26483.1', 'AAI37799.1']

Or you can use it in a simple loop:

for id in extract_ids(text):
    print(id)

Source

2013-02-13 21:00:15

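That seems like a lot of work for something a simple regex could handle. – Hoopdady 2013-02-13 21:06:09

@Hoopdady: No; I used more text to explain how it works, but the method is all of 4 lines. It is an alternative approach, and besides that, it works just fine. – 2013-02-13 21:07:26

But you're right, it probably isn't worth a downvote – Hoopdady 2013-02-13 21:07:34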

In [1]: import re 

In [2]: text = ">gi|124486857|ref|NP_001074751.1| inhibitor of Bruton tyrosine kinase [Mus musculus] >gi|341941060|sp|Q6ZPR6.3|IBTK_MOUSE RecName: Full=Inhibitor of Bruton tyrosine kinase; Short=IBtk >gi|148694536|gb|EDL26483.1| mCG128548, isoform CRA_d [Mus musculus] >gi|223460980|gb|AAI37799.1| Ibtk protein [Mus musculus]" 

In [3]: re.findall(r'gb\|([^\|]+)', text)[0] 
Out[3]: 'EDL26483.1'

Source

2013-02-13 21:01:38 brwnj

re.findall(r'gi\|([0-9]+)\|', u'''>gi|124486857|ref|NP_001074751.1| inhibitor of Bruton tyrosine kinase [Mus musculus] >gi|341941060|sp|Q6ZPR6.3|IBTK_MOUSE RecName: Full=Inhibitor of Bruton tyrosine kinase; Short=IBtk >gi|148694536|gb|EDL26483.1| mCG128548, isoform CRA_d [Mus musculus] >gi|223460980|gb|AAI37799.1| Ibtk protein [Mus musculus]''')

works for me: [u'124486857', u'341941060', u'148694536', u'223460980']

Source

2013-02-13 21:02:36 hd1

That is the wrong information; the ID comes after the 'gb' key, not 'gi'. – 2013-02-13 21:05:54

In this case you can get by without a regex: just split on '|gb|', then split the second part on '|' and take the first item:

s = 'the string from the question'
r = s.split('|gb|')
r[1].split('|')[0]

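Of course, you will have to add checks for when there are more or fewer than 2 items, but I think splitting and taking the first item of the returned list will be faster than using a regex.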
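The checks mentioned above could look something like the following. This is only a sketch: the function name `ids_after_gb` is made up for illustration, and it assumes every '|gb|' is directly followed by the ID, as in the header from the question.

```python
def ids_after_gb(header):
    """Return every ID that directly follows a '|gb|' marker (hypothetical helper)."""
    # The first chunk is whatever precedes the first '|gb|', so drop it.
    # If there is no '|gb|' at all, the slice is empty and so is the result.
    chunks = header.split('|gb|')[1:]
    # Each remaining chunk starts with an ID, which ends at the next '|'.
    return [chunk.split('|', 1)[0] for chunk in chunks]


header = (
    '>gi|124486857|ref|NP_001074751.1| inhibitor of Bruton tyrosine kinase [Mus musculus] '
    '>gi|341941060|sp|Q6ZPR6.3|IBTK_MOUSE RecName: Full=Inhibitor of Bruton tyrosine kinase; Short=IBtk '
    '>gi|148694536|gb|EDL26483.1| mCG128548, isoform CRA_d [Mus musculus] '
    '>gi|223460980|gb|AAI37799.1| Ibtk protein [Mus musculus]'
)
print(ids_after_gb(header))
```

Slicing off the first chunk handles zero, one, or many '|gb|' markers with the same code path, so no explicit length check is needed.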

Source

2013-02-13 21:03:35

>>> import re
>>> match_object = re.findall(r"\|gb\|(.*?)\|", ">gi|124486857|ref|NP_001074751.1| inhibitor of Bruton tyrosine kinase [Mus musculus] >gi|341941060|sp|Q6ZPR6.3|IBTK_MOUSE RecName: Full=Inhibitor of Bruton tyrosine kinase; Short=IBtk >gi|148694536|gb|EDL26483.1| mCG128548, isoform CRA_d [Mus musculus] >gi|223460980|gb|AAI37799.1| Ibtk protein [Mus musculus]")
>>> print(match_object)
['EDL26483.1', 'AAI37799.1']

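The regular expression means: "match any character (.), any number of times (*), but as few of them as possible (?), and save only that group (the parentheses); the match must immediately follow '|gb|' and be immediately followed by another '|'."

I used "\|" because the "|" character denotes alternation in regular expressions.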
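To make the role of each piece visible, here is the same pattern written with `re.VERBOSE`, plus a greedy variant for comparison. It is a sketch for illustration: the names `GB_ID`, `lazy_ids` and `greedy_ids` are made up, and the header is shortened to the two gb records from the question.

```python
import re

# Same pattern as above, spread out so each part can carry a comment.
GB_ID = re.compile(r"""
    \|gb\|   # a literal |gb| ; the pipes are escaped
    (.*?)    # lazily capture as few characters as possible: the ID
    \|       # stop at the very next literal pipe
""", re.VERBOSE)

header = (
    ">gi|148694536|gb|EDL26483.1| mCG128548, isoform CRA_d [Mus musculus] "
    ">gi|223460980|gb|AAI37799.1| Ibtk protein [Mus musculus]"
)

lazy_ids = GB_ID.findall(header)

# Without the '?', '.*' is greedy and runs on to the LAST pipe in the string,
# so both records collapse into a single oversized match.
greedy_ids = re.findall(r"\|gb\|(.*)\|", header)

print(lazy_ids)
print(greedy_ids)
```

The lazy version yields one ID per record, while the greedy version returns a single string that starts at the first ID and ends at the last one.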

Source

2013-02-13 21:04:50 rkday

Assuming a is the variable holding your string...

>>> import re 
>>> a = ">gi|124486857|ref|NP_001074751.1| ..." 
>>> re.findall(r"(?:\|gb\|)([a-zA-Z0-9.]+)(?:\|)", a) 
['EDL26483.1', 'AAI37799.1']
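One thing to keep in mind with the character class `[a-zA-Z0-9.]`: it would not match an accession containing an underscore, such as the `ref` entry `NP_001074751.1` in the same header. If other database tags are ever needed, a possible generalization is sketched below; the function name `extract_accessions` and its `db` parameter are hypothetical, and `[^|]+` is swapped in for the character class.

```python
import re

def extract_accessions(header, db="gb"):
    """Return the accessions that follow |<db>| in a header (hypothetical helper)."""
    # re.escape guards against tags containing regex metacharacters;
    # [^|]+ accepts any run of characters up to the next pipe.
    pattern = r"\|" + re.escape(db) + r"\|([^|]+)\|"
    return re.findall(pattern, header)


a = (
    ">gi|124486857|ref|NP_001074751.1| inhibitor of Bruton tyrosine kinase [Mus musculus] "
    ">gi|341941060|sp|Q6ZPR6.3|IBTK_MOUSE RecName: Full=Inhibitor of Bruton tyrosine kinase; Short=IBtk "
    ">gi|148694536|gb|EDL26483.1| mCG128548, isoform CRA_d [Mus musculus] "
    ">gi|223460980|gb|AAI37799.1| Ibtk protein [Mus musculus]"
)
print(extract_accessions(a))
print(extract_accessions(a, db="ref"))
```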

Source

2013-02-13 21:10:27 obimod
