2010-03-04 71 views
2

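Regex redefinition error

I am using Python and I am getting some redefinition errors. I know the groups are redefined, but logically the pattern can never reach both definitions in the same match. Is there a way around this? Thanks in advance for any help.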

/python-2.5/lib/python2.5/re.py", line 233, in _compile
    raise error, v # invalid expression
sre_constants.error: redefinition of group name 'id' as group 9; was group 6


import re 

DOB_RE = "(^|;)DOB +(?P<dob>\d{2}-\d{2}-\d{4})" 
ID_RE = "(^|;)ID +(?P<id>[A-Z0-9]{12})" 
INFO_RE = "- (?P<info>.*)" 

PERSON_RE = "((" + DOB_RE + ".*" + ID_RE + ")|(" + \
        ID_RE + ".*" + DOB_RE + ")|(" + \
        DOB_RE + "|" + ID_RE + ")).*(" + INFO_RE + ")*"

PARSER = re.compile(PERSON_RE) 

samplestr1 = "garbage;DOB 10-10-2010;more garbage\nID PARI12345678;more garbage"
samplestr2 = "garbage;ID PARI12345678;more garbage\nDOB 10-10-2010;more garbage"
samplestr3 = "garbage;DOB 10-10-2010"
samplestr4 = "garbage;ID PARI12345678;more garbage- I am cool"

Answers

2

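Regular expression syntax simply does not allow multiple occurrences of a same-named group. Groups that are not "reached" during a match are still defined; they are just "empty" (None).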
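A minimal sketch of that rule, written so it also runs on Python 3. The helper name `try_compile` and the toy patterns are made up for illustration and are not part of the question's code:

```python
import re

def try_compile(pattern):
    """Return the compiled pattern, or the error text if it does not compile."""
    try:
        return re.compile(pattern)
    except re.error as exc:
        return str(exc)

# The same name in two alternatives is rejected at compile time,
# even though only one branch can ever take part in a match.
duplicate = try_compile(r"(?P<id>a)|(?P<id>b)")

# Distinct names compile; the group in the branch that did not
# participate is reported as None.
distinct = re.compile(r"(?P<id0>a)|(?P<id1>b)")
groups = distinct.match("b").groupdict()
```

Here `duplicate` ends up holding the "redefinition of group name" message, while `groups` maps `id0` to None and `id1` to the matched text.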

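So you have to change those names, e.g. to dob0, dob1, dob2 and id0, id1, id2 (once you have a match's group dictionary, you can easily "collapse" those sets of keys to build the dict you actually want).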

For example, make DOB_RE a function instead of a constant, say:

def DOB_RE(i): return "(^|;)DOB +(?P<dob%s>\d{2}-\d{2}-\d{4})" % i 

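Do the same for the others, and change the three occurrences of DOB_RE in the statement that computes PERSON_RE to DOB_RE(0), DOB_RE(1), and so on (and likewise for the other patterns).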
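The renaming and the "collapse" step described above could look roughly like this. It is only a sketch: the lowercase helper names, the `collapse` function, and the single-line test string are assumptions for illustration, and raw strings are used so the code also runs cleanly on Python 3.

```python
import re

def dob_re(i):
    # Numbered group name (dob0, dob1, ...) so each use is unique.
    return r"(^|;)DOB +(?P<dob%s>\d{2}-\d{2}-\d{4})" % i

def id_re(i):
    return r"(^|;)ID +(?P<id%s>[A-Z0-9]{12})" % i

INFO_RE = r"- (?P<info>.*)"

# Same shape as the question's PERSON_RE, with a distinct index per occurrence.
PERSON_RE = ("((" + dob_re(0) + ".*" + id_re(0) + ")|(" +
             id_re(1) + ".*" + dob_re(1) + ")|(" +
             dob_re(2) + "|" + id_re(2) + ")).*(" + INFO_RE + ")*")
PARSER = re.compile(PERSON_RE)

def collapse(groupdict):
    """Fold dob0/dob1/dob2 into 'dob' (etc.), keeping the non-None value."""
    merged = {}
    for key, value in groupdict.items():
        base = key.rstrip("0123456789")
        if value is not None or base not in merged:
            merged[base] = value
    return merged

match = PARSER.search("x;DOB 10-10-2010;y;ID PARI12345678;z")
person = collapse(match.groupdict())
```

This only addresses the redefinition error. The rest of the pattern behaves as in the question, so, for example, the greedy `.*` before the info group still swallows any trailing text.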

1

Maybe in this case it would be better to iterate over a list of regular expressions.

>>> strs=[ 
... "garbage;DOB 10-10-2010;more garbage\nID PARI12345678;more garbage", 
... "garbage;ID PARI12345678;more garbage\nDOB 10-10-2010;more garbage", 
... "garbage;DOB 10-10-2010", 
... "garbage;ID PARI12345678;more garbage- I am cool"] 
>>> import re 
>>> 
>>> DOB_RE = "(^|;|\n)DOB +(?P<dob>\d{2}-\d{2}-\d{4})" 
>>> ID_RE = "(^|;|\n)ID +(?P<id>[A-Z0-9]{12})" 
>>> INFO_RE = "(- (?P<info>.*))?" 
>>> 
>>> REGEX = map(re.compile,[DOB_RE + ".*" + ID_RE + "[^-]*" + INFO_RE, 
...       ID_RE + ".*" + DOB_RE + "[^-]*" + INFO_RE, 
...       DOB_RE + "[^-]*" + INFO_RE, 
...       ID_RE + "[^-]*" + INFO_RE]) 
>>> 
>>> def get_person(s): 
...  for regex in REGEX: 
...   res = re.search(regex,s) 
...   if res: 
...    return res.groupdict() 
... 
>>> for s in strs: 
...  print get_person(s) 
... 
{'dob': '10-10-2010', 'info': None, 'id': 'PARI12345678'} 
{'dob': '10-10-2010', 'info': None, 'id': 'PARI12345678'} 
{'dob': '10-10-2010', 'info': None} 
{'info': 'I am cool', 'id': 'PARI12345678'} 
2

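I was originally going to post a pyparsing example using the Each class (which picks out expressions that can appear in any order), but then I saw that there was garbage mixed in, so searching through your string with searchString seemed a better fit. This intrigued me, because searchString returns a sequence of ParseResults, one for each match (including any corresponding named results). So I thought, "What if I combine the returned ParseResults using sum? What a hack!", er, "How novel!" So here is a never-before-seen pyparsing hack: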

from pyparsing import * 
# define the separate expressions to be matched, with results names 
dob_ref = "DOB" + Regex(r"\d{2}-\d{2}-\d{4}")("dob") 
id_ref = "ID" + Word(alphanums,exact=12)("id") 
info_ref = "-" + restOfLine("info") 

# create an overall expression 
person_data = dob_ref | id_ref | info_ref 

for test in (samplestr1,samplestr2,samplestr3,samplestr4,): 
    # retrieve a list of separate matches 
    separate_results = person_data.searchString(test) 

    # combine the results using sum 
    # (NO ONE HAS EVER DONE THIS BEFORE!) 
    person = sum(separate_results, ParseResults([])) 

    # now we have a uber-ParseResults object! 
    print person.id 
    print person.dump() 
    print 

This gives the following output:

PARI12345678 
['DOB', '10-10-2010', 'ID', 'PARI12345678'] 
- dob: 10-10-2010 
- id: PARI12345678 

PARI12345678 
['ID', 'PARI12345678', 'DOB', '10-10-2010'] 
- dob: 10-10-2010 
- id: PARI12345678 


['DOB', '10-10-2010'] 
- dob: 10-10-2010 

PARI12345678 
['ID', 'PARI12345678', '-', ' I am cool'] 
- id: PARI12345678 
- info: I am cool 

But I also speak regex. Here is a similar approach using re.

import re 

# define each individual re, with group names 
dobRE = r"DOB +(?P<dob>\d{2}-\d{2}-\d{4})" 
idRE = r"ID +(?P<id>[A-Z0-9]{12})" 
infoRE = r"- (?P<info>.*)" 

# one re to rule them all 
person_dataRE = re.compile('|'.join([dobRE, idRE, infoRE])) 

# using findall with person_dataRE will return a 3-tuple, so let's create 
# a tuple-merger 
merge = lambda a,b : tuple(aa or bb for aa,bb in zip(a,b)) 

# let's create a Person class to collect the different data bits 
# (or if you are running Py2.6, use a namedtuple)
class Person:
    def __init__(self,*args):
        self.dob, self.id, self.info = args
    def __str__(self):
        return "- id: %s\n- dob: %s\n- info: %s" % (self.id, self.dob, self.info)

for test in (samplestr1,samplestr2,samplestr3,samplestr4,): 
    # could have used reduce here, but let's err on the side of explicitness
    persontuple = ('','','')
    for data in person_dataRE.findall(test):
        persontuple = merge(persontuple,data)

    # make a person 
    person = Person(*persontuple) 

    # print out the collected results 
    print person.id 
    print person 
    print 

With this output:

PARI12345678 
- id: PARI12345678 
- dob: 10-10-2010 
- info: 

PARI12345678 
- id: PARI12345678 
- dob: 10-10-2010 
- info: 


- id: 
- dob: 10-10-2010 
- info: 

PARI12345678 
- id: PARI12345678 
- dob: 
- info: I am cool 
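
The comments in the re code above mention namedtuple and reduce as alternatives. A minimal sketch of that variant follows; the function name `parse_person` is made up for illustration, and `reduce` is imported from functools so the code also runs on Python 3:

```python
import re
from collections import namedtuple
from functools import reduce

# Same three patterns as above, joined into one alternation.
dobRE = r"DOB +(?P<dob>\d{2}-\d{2}-\d{4})"
idRE = r"ID +(?P<id>[A-Z0-9]{12})"
infoRE = r"- (?P<info>.*)"
person_dataRE = re.compile("|".join([dobRE, idRE, infoRE]))

# Field order matches the group order that findall returns: dob, id, info.
Person = namedtuple("Person", ["dob", "id", "info"])

def merge(a, b):
    # Keep whichever side has a non-empty value in each position.
    return tuple(aa or bb for aa, bb in zip(a, b))

def parse_person(text):
    # Fold every 3-tuple from findall into one, starting from empty fields.
    fields = reduce(merge, person_dataRE.findall(text), ("", "", ""))
    return Person(*fields)

person = parse_person("garbage;ID PARI12345678;more garbage- I am cool")
```

As in the loop version, fields that never matched stay as empty strings rather than None.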
+0

@Paul: Is pyparsing also available for Python 3? – 2010-03-04 11:08:38

+0

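@Tim: Yes, the current release includes a pyparsing_py3 module, which will be installed if you are running Python 3 (there is a benign installation error, which I will fix in the next release). – PaulMcG 2010-03-04 13:04:29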