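How to text mine specific data

I have a list of IDs, each with a long description whose entries are separated by semicolons. Below is an example of one ID and its description.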

ID  Description 
O95831 activation of cysteine-type endopeptidase activity involved in apoptotic process; apoptotic DNA fragmentation; apoptotic process; cell redox homeostasis; chromosome condensation; DNA catabolic process; intrinsic apoptotic signaling pathway in response to endoplasmic reticulum stress; mitochondrial respiratory chain complex I assembly; NAD(P)H oxidase activity; neuron apoptotic process; neuron differentiation; oxidoreductase activity, acting on NAD(P)H; positive regulation of apoptotic process; regulation of apoptotic DNA fragmentation

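Question: I want to come up with a way to text mine the descriptions that mention "mitochondria", "mitochondrial", or "mitochondrion". Can regex be used to solve this problem? Or are there other methods that might be useful?

Expected result: extract the description in which "mitochondrial" is mentioned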

O95831 ;mitochondrial respiratory chain complex I assembly;

Your help is appreciated.

Source

2014-11-24 MEhsan

I am familiar with Python/Perl – MEhsan 2014-11-24 16:58:33

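The regex selects the description perfectly. What if the description mentioning "mitochondria", "mitochondrial", or "mitochondrion" does not come right after a semicolon? Let's take this as an example: 'P55957 \t;apoptotic mitochondrial changes; apoptotic process; brain development; ' – MEhsan 2014-11-24 17:15:44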

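I used that regex because your expected output is preceded by a ';'. If the keyword can occur anywhere, just remove the ';' match from the regex and it will match there too. '(\d+).*(\s(?:mitochondria|mitochondrial|mitochondrion)[^;]+;)' – nu11p01n73R 2014-11-24 17:17:45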

You can use a regex like

(\d+).*(.\s(?:mitochondria|mitochondrial|mitochondrion)[^;]+;)

Capture groups 1 and 2 will contain

O95831 ;mitochondrial respiratory chain complex I assembly;

Example: http://regex101.com/r/mR8xA7/1

The Python code would look like

>>> import re
>>> re.findall(r"""(\d+).*(.\s(?:mitochondria|mitochondrial|mitochondrion)[^;]+;)""", text)
[('095831', '; mitochondrial respiratory chain complex I assembly;')]
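
Two properties of this pattern are worth keeping in mind. `(\d+)` matches digits only, so for an ID written with a leading letter, such as O95831, it captures just 95831. And because `.*` is greedy, it returns at most one description per line. As a possible alternative, here is a minimal sketch that splits each line on semicolons and keeps every description containing one of the keywords, wherever the keyword sits in the description. The function name `mitochondrial_terms` is hypothetical, and the sketch assumes the ID is the first whitespace-delimited token on each line; the first sample line is shortened from the question and the second is adapted from the comment above.

```python
import re

# Matches "mitochondria", "mitochondrial" and "mitochondrion" in any case.
KEYWORD = re.compile(r"mitochondri(?:a|al|on)\b", re.IGNORECASE)


def mitochondrial_terms(line):
    """Return (ID, [descriptions that mention the keyword]) for one line."""
    # Assumption: the ID is the first whitespace-delimited token.
    parts = line.split(None, 1)
    if not parts:  # blank line
        return "", []
    record_id = parts[0]
    rest = parts[1] if len(parts) > 1 else ""
    # Descriptions are separated by semicolons; drop empty pieces.
    terms = [t.strip() for t in rest.split(";") if t.strip()]
    return record_id, [t for t in terms if KEYWORD.search(t)]


lines = [
    "O95831 apoptotic process; mitochondrial respiratory chain complex I assembly; neuron differentiation",
    "P55957 \t;apoptotic mitochondrial changes; apoptotic process; brain development; ",
]
for line in lines:
    print(mitochondrial_terms(line))
```

Splitting first keeps the whole description intact, so a keyword in the middle of a description (as in the P55957 line) still returns the full text between semicolons.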

Source

2014-11-24 16:54:42 nu11p01n73R
