正则表达式搜索使用列表元素在大文档中查找匹配

这是我的第一个脚本，我试图比较两个基因组文件，其中一个比另一个具有更多的数据点。正则表达式搜索使用列表元素在大文档中查找匹配

文件的内容是这样的：

rs3094315  1  742429 AA 
rs12562034  1  758311 GG 
rs3934834  1  995669 CC

有每个字段之间的制表符。每个文件中大约有500,000行。

为了方便比较，我只想保留两个文件都包含的数据点，并放弃其中任何一个都独有的任何数据点。为此，我创建了所有独特的DNA位置列表，现在我试图搜索原始数据文件的每一行，并将不包含这些独特DNA位置的所有行打印到新文件中。

我的代码中的所有东西都已经运行起来，直到我尝试使用正则表达式来搜索基因组文件以打印所有非独特的DNA位置。我能拿到剧本打印for循环内的LaurelSNP_left列表中的所有项目，但是当我尝试使用re.match每个项目，我收到此错误信息：

Traceback (most recent call last): 
    File "/Users/laurelhochstetler/scripts/identify_SNPs.py", line 57, in <module> 
    if re.match(item,"(.*)", Line): 
    File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/re.py", line 137, in match 
    return _compile(pattern, flags).match(string) 
    File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/re.py", line 242, in _compile 
    p = sre_compile.compile(pattern, flags) 
    File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/sre_compile.py", line 500, in compile 
    p = sre_parse.parse(p, flags) 
    File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/sre_parse.py", line 673, in parse 
    p = _parse_sub(source, pattern, 0) 
    File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/sre_parse.py", line 308, in _parse_sub 
    itemsappend(_parse(source, state)) 
    File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/sre_parse.py", line 401, in _parse 
    if state.flags & SRE_FLAG_VERBOSE: 
TypeError: unsupported operand type(s) for &: 'str' and 'int'

我的问题是双重的：

如何在正则表达式中使用我的列表？
有没有更好的方法来完成我在这里要做的事情？

这里是我的代码：

#!/usr/bin/env python 
import re #this imports regular expression module 
import collections 

MomGenome=open('/Users/laurelhochstetler/Documents/genetics fun/genome_Mary_Maloney_Full_20110514145353.txt', 'r') 
LaurelGenome=open('/Users/laurelhochstetler/Documents/genetics fun/genome_Laurel_Hochstetler_Full_20100411230740.txt', 'r') 
LineNumber = 0 
momSNP = [] 
LaurelSNP = [] 
f = open("mom_edit.txt","w") 
for Line in MomGenome: 
    if LineNumber > 0: 
     Line=Line.strip('\n') 
     ElementList=Line.split('\t') 

     momSNP.append(ElementList[0]) 

     LineNumber = LineNumber + 1 
MomGenome.close() 
for Line in LaurelGenome: 
    if LineNumber > 0: 
     Line=Line.strip('\n') 
     ElementList=Line.split('\t') 

     LaurelSNP.append(ElementList[0]) 

     LineNumber = LineNumber + 1 
momSNP_multiset = collections.Counter(momSNP)    
LaurelSNP_multiset = collections.Counter(LaurelSNP) 
overlap = list((momSNP_multiset and LaurelSNP_multiset).elements()) 
momSNP_left = list((momSNP_multiset - LaurelSNP_multiset).elements()) 
LaurelSNP_left = list((LaurelSNP_multiset - momSNP_multiset).elements()) 
LaurelGenome=open('/Users/laurelhochstetler/Documents/genetics fun/genome_Laurel_Hochstetler_Full_20100411230740.txt', 'r') 
i = 0 
for Line in LaurelGenome: 
    for item in LaurelSNP_left: 
      if i < 1961: 
       if re.match(item, Line): 
        pass 

       else: 
        print Line 

      i = i + 1 
    LineNumber = LineNumber + 1

来源

2012-02-18 blizpix

你能提供一个输入文件的例子吗？ – Thomas 2012-02-18 22:26:26

酷，看到人们使用他们的23和我的原始数据！ – Stedy 2012-02-18 22:27:41

好的，在文件中添加了一些行的例子！是的，Stedy，我为了家谱目的而疯狂地使用等位基因数据！ – blizpix 2012-02-18 22:41:12

要打印的文件，从2的每一行，其ID不会出现在文件1.一组文件1的ID，并通过文件2把它们作为你的循环：

momSNP = set() 
for line in MomGenome: 
    snp, rest = line.split(None, 1) # Split into two pieces only 
    momSNP.add(snp) 

for line in MyGenome: 
    snp, rest = line.split(None, 1) 
    if snp in momSNP: 
     print line

这只需要存储500k个SNP，因此记忆方面不应该太多。

来源

2012-02-20 13:06:27 alexis

这是最棒的。我没有意识到我的代码是多么的低效，但是它会一直持续下去。这正是我想要它在很短的时间内完成的，非常感谢！ – blizpix 2012-02-27 21:17:52

re.match的第三个参数是对于选择（请参阅手册）。你打电话给一些虚假（行号）

来源

2012-02-18 22:30:31 alexis

好吧，我实际输入了错误信息，我已将其更正为原来的内容：if re.match（item，Line）： – blizpix 2012-02-18 22:41:52

然后您仍然无法获取错误消息。这是否解决了你的问题？ – alexis 2012-02-18 22:48:55

你是对的，它不再给我那个错误了！但是，我相信我的嵌套循环有一些不妥之处，因为一旦它开始运行，就会一遍又一遍地打印相同的行。 – blizpix 2012-02-18 23:04:00

简短回答：我不认为你需要正则表达式来处理你正在做的事情。

长答案：让我们来分析一下你的代码。

在开始的时候有：

LineNumber = 0 
MomGenome = open('20110514145353.txt', 'r') 
for Line in MomGenome: 
    if LineNumber > 0: 
     Line = Line.strip('\n') 
     ElementList = Line.split('\t') 

     momSNP.append(ElementList[0]) 

     LineNumber = LineNumber + 1 

MomGenome.close()

，它可以与with说法得到改善，我想你的行计数只是没有跳过某些类型的头，我将使用next()为：

with open('20110514145353.txt') as mom_genome: 
    next(mom_genome) # skipping the first line 
    for line in mom_genome: 
     elements = line.strip().split('\t') 
     mom_SNP.append(elements[0])

如果你注意到我也试图避免使用CamelCase名变量，这样你会遵循一些风格指南。我也将.strip('\n')更改为.strip()检查官方str.strip()，看看是否仍然做你想要的。
以上内容也可以在您的其他文件上完成。

在阅读这些文件有这样一行：

overlap = list((momSNP_multiset and LaurelSNP_multiset).elements())

你确定这是否你想要什么？
难道不应该and是&，如：

overlap = list((momSNP_multiset & LaurelSNP_multiset).elements())

让我们来看看下面这个例子：

>>> from collections import Counter 
>>> a = Counter(a=4, b=2, c=0, d=-2) 
>>> b = Counter(a=2, b=0, c=0) 
>>> a 
Counter({'a': 4, 'b': 2, 'c': 0, 'd': -2}) 
>>> b 
Counter({'a': 2, 'c': 0, 'b': 0}) 
>>> a and b # This will return b 
Counter({'a': 2, 'c': 0, 'b': 0}) 
>>> c & d # this will return the common elements 
Counter({'a': 2})

a and b它将返回b因为bool(a)被evaulated到True，看一看的official doc。

之后它来匹配，这真的不清楚。你这样做：

LaurelGenome = open('20100411230740.txt', 'r') 
i = 0 
for Line in LaurelGenome: 
    for item in LaurelSNP_left: 
     if i < 1961: 
      if re.match(item, Line): 
       pass 
      else: 
       print Line 

     i = i + 1 
    LineNumber = LineNumber + 1

所以正如我在开始时所说，我认为你根本不需要regexp。
我想你想这样做：

with open('20100411230740.txt') as laural_genome: 
    for line in laureal_genome: 
     i = 0 
     for item in laurelSNP_left: 
      if i > 1960: 
       break 

      if line.strip().split('\t')[0] == item: 
       print line 

      i += 1

我已经猜到这个答案期间很多，所以随时给更多的信息，并告诉我在哪里，我猜错了:)

来源

2012-02-18 23:16:09

这很漂亮。我还没有尝试别人发布的其他建议，但是这个工作很有魅力。我真的很感谢你为我简单地写出了一些东西，作为一名初学者，这是一个真正的帮助。非常感谢！ – blizpix 2012-02-27 20:25:23

我是否错过了关于这个问题的批评？我觉得这个解决方案比现有解决方案涉及的要少得多。对于两种不同的基因组中的文件，我有：

with file('zzz.txt') as f1: 
    first = frozenset([i.strip() for i in f1 if i.strip()]) 

with file('yyy.txt') as f2: 
    common = [i.strip().split('\t') for i in f2 if i.strip() in first] 

genomes = {} 
for i in common: 
    genomes[i[0]] = i[1:]

这应该打印出所有副本（条目两者共同文件），而无需比第一读取文件的大小更多的空间。因此，您可以通过事先检查哪个文件较小（可能是文件大小）来加快速度，以最大限度地减少内存影响。

正则表达式在这里似乎没有必要 - 如果不是这个解决方案，frozensets也有交集，如果你不喜欢使用列表解析。

编辑：已更新每个迭代在Python字典中。

来源

2012-02-18 23:59:06 hexparrot

作为旁注，四个.strip（）调用仅在这里，因为示例内容似乎包含空行，但如果这些新行是复制/粘贴工件，则可以删除这些过程中的每一个。 – hexparrot 2012-02-19 00:09:57

您需要在选项卡和索引上按照第一个字段进行拆分，请参阅他对SNP的评论。 – alexis 2012-02-20 12:54:57

正则表达式搜索使用列表元素在大文档中查找匹配

回答

相关问题