2017-08-17 56 views

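Grouping multiple sections (matches) with pyparsing

I can't figure out how to use pyparsing to group zero or more repeated sections of a text. In other words, I want to merge multiple matches into a single named result set. Note that I want to use pyparsing because I have many sections with different rules.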

from pyparsing import *  

input_text = """ 
Projects 
project a created in c# 

Education 
university of college 

Projects 
project b created in python 
""" 

project_marker = LineStart() + Literal('Projects') + LineEnd() 
education_marker = LineStart() + Literal('Education') + LineEnd() 
markers = project_marker^education_marker 

project_section = Group(
    project_marker + SkipTo(markers | stringEnd).setResultsName('project') 
).setResultsName('projects') 
education_section = Group(
    education_marker + SkipTo(markers | stringEnd).setResultsName('education') 
).setResultsName('educations') 
sections = project_section^education_section 

text = StringStart() + SkipTo(sections | StringEnd()) 
doc = Optional(text) + ZeroOrMore(sections) 
result = doc.parseString(input_text) 

print(result) 
# ['', ['Projects', '\n', 'project a created in c#'], ['Education', '\n', 'university of college'], ['Projects', '\n', 'project b created in python']] 
print(result.projects) 
# ['Projects', '\n', 'project b created in python'] 
print(result.projects[0].project) 
# AttributeError: 'str' object has no attribute 'project' 

Can the 'Projects' section contain multiple lines describing projects? And likewise for the 'Education' section?

Add 'listAllMatches=True' when calling 'setResultsName' on project_section and education_section. Then I think your code will work as is. – PaulMcG

@PaulMcG Adding 'listAllMatches=True' grouped the results together, thanks!

Answers

Thanks to @PaulMcG, the solution is to pass listAllMatches=True to setResultsName; see https://pythonhosted.org/pyparsing/pyparsing.ParserElement-class.html#setResultsName

project_marker = LineStart() + Literal('Projects') + LineEnd() 
education_marker = LineStart() + Literal('Education') + LineEnd() 
markers = project_marker^education_marker 

project_section = Group(
    project_marker + SkipTo(markers | stringEnd).setResultsName('project') 
).setResultsName('projects', listAllMatches=True) 
education_section = Group(
    education_marker + SkipTo(markers | stringEnd).setResultsName('education') 
).setResultsName('educations', listAllMatches=True) 
sections = project_section^education_section 

text = StringStart() + SkipTo(sections | StringEnd()) 
doc = Optional(text) + ZeroOrMore(sections) 
result = doc.parseString(input_text) 
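
To show what listAllMatches=True changes when reading the results, here is a minimal sketch on a deliberately simplified grammar. The 'project:' and 'education:' prefixes and the one-word items are assumptions made for illustration; they are not the grammar from the question. Each Group is collected under its plural name, and the inner name is then available on every collected group.

```python
import pyparsing as pp

# Hypothetical toy grammar: one-word items introduced by a prefix.
item = pp.Word(pp.alphanums + "#")

project = pp.Group(
    pp.Suppress("project:") + item("project")
).setResultsName("projects", listAllMatches=True)
education = pp.Group(
    pp.Suppress("education:") + item("education")
).setResultsName("educations", listAllMatches=True)

doc = pp.ZeroOrMore(project | education)
result = doc.parseString("project: a education: college project: b")

# Every matched group is kept, not only the last one.
project_names = [p.project for p in result.projects]
education_names = [e.education for e in result.educations]
print(project_names)    # ['a', 'b']
print(education_names)  # ['college']
```

Without listAllMatches=True, result.projects would hold only the last matched group, which is why result.projects[0] in the question was a plain string.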

Here is my tentative answer, not that I'm proud of it. I borrowed heavily from https://stackoverflow.com/a/5824309/131187.

>>> import pyparsing as pp 
>>> pp.ParserElement.setDefaultWhitespaceChars(" \t") 
>>> EOL = pp.LineEnd().suppress() 
>>> keyword = pp.Or([pp.Keyword('Projects'), pp.Keyword('Education')]) 
>>> line = pp.LineStart() + pp.NotAny(keyword) + pp.SkipTo(pp.LineEnd(), failOn=pp.LineStart()+pp.LineEnd()) + EOL 
>>> lines = pp.OneOrMore(line) 
>>> section = pp.Or([pp.Keyword('Projects'), pp.Keyword('Education')])('section') + EOL + lines('lines') 
>>> sections = pp.OneOrMore(section) 
>>> r = sections.parseString(input_text) 

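As the repr below shows, the parser does collect the information correctly, and in a form from which the desired grouping could be assembled. However, I could not find a way to access all of the results that are clearly present in what parseString returns.

So I resorted to calling eval on the repr. Having done that, I was able to pick out all of the sections and assign them to a dictionary-like object.

To be honest, this can be done without pyparsing. Read a line and check whether it is a keyword. If it is, remember it. Then, until you read another keyword, put every line you read into a dictionary under the most recent keyword.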
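
The plain-Python approach just described could be sketched roughly as follows. The function name collect_sections and the KEYWORDS set are illustrative assumptions; blank lines are skipped and lines appearing before the first heading are ignored, which is one possible reading of the description.

```python
from collections import defaultdict

# Section headings assumed from the question's sample input.
KEYWORDS = {"Projects", "Education"}


def collect_sections(text):
    """Group non-blank lines under the most recently seen heading."""
    sections = defaultdict(list)
    current = None
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue  # skip blank lines
        if line in KEYWORDS:
            current = line  # remember the latest heading
        elif current is not None:
            sections[current].append(line)
    return dict(sections)


sample = """
Projects
project a created in c#

Education
university of college

Projects
project b created in python
"""
sections = collect_sections(sample)
print(sections)
```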

>>> repr(r) 
"(['Projects', 'project 1', 'project 2', 'project 3', '', 'Education', 'institution 1', 'institution 2', 'institution 3', 'institution 4', '', 'Projects', 'assignment 5', 'assignment 8', 'assignment 10', ''], {'lines': [(['project 1', 'project 2', 'project 3', ''], {}), (['institution 1', 'institution 2', 'institution 3', 'institution 4', ''], {}), (['assignment 5', 'assignment 8', 'assignment 10', ''], {})], 'section': ['Projects', 'Education', 'Projects']})" 
>>> evil_r = eval(repr(r)) 
>>> evil_r 
(['Projects', 'project 1', 'project 2', 'project 3', '', 'Education', 'institution 1', 'institution 2', 'institution 3', 'institution 4', '', 'Projects', 'assignment 5', 'assignment 8', 'assignment 10', ''], {'lines': [(['project 1', 'project 2', 'project 3', ''], {}), (['institution 1', 'institution 2', 'institution 3', 'institution 4', ''], {}), (['assignment 5', 'assignment 8', 'assignment 10', ''], {})], 'section': ['Projects', 'Education', 'Projects']}) 
>>> evil_r[1]['lines'] 
[(['project 1', 'project 2', 'project 3', ''], {}), (['institution 1', 'institution 2', 'institution 3', 'institution 4', ''], {}), (['assignment 5', 'assignment 8', 'assignment 10', ''], {})] 
>>> evil_r[1]['section'] 
['Projects', 'Education', 'Projects'] 
>>> from collections import defaultdict 
>>> section_info = defaultdict(list) 
>>> for k, kind in enumerate(evil_r[1]['section']): 
...  section_info[kind].extend(evil_r[1]['lines'][k][0][:-1]) 
... 
>>> for section in section_info: 
...  section, section_info[section] 
...  
('Education', ['institution 1', 'institution 2', 'institution 3', 'institution 4']) 
('Projects', ['project 1', 'project 2', 'project 3', 'assignment 5', 'assignment 8', 'assignment 10']) 

Edit: Or you could do it this way. It needs tidying up, but at least it doesn't use anything unorthodox.

>>> input_text = open('temp.txt').read() 
>>> import pyparsing as pp 
>>> pp.ParserElement.setDefaultWhitespaceChars(" \t") 
>>> from collections import defaultdict 
>>> class Accum: 
...  def __init__(self): 
...   self.current_section = None 
...   self.result = defaultdict(list) 
...  def __call__(self, s): 
...   if s[0] in ['Projects', 'Education']: 
...    self.current_section = s[0] 
...   else: 
...    self.result[self.current_section].extend(s[:-1]) 
... 
>>> accum = Accum() 
>>> EOL = pp.LineEnd().suppress() 
>>> keyword = pp.Or([pp.Keyword('Projects'), pp.Keyword('Education')]) 
>>> line = pp.LineStart() + pp.NotAny(keyword) + pp.SkipTo(pp.LineEnd(), failOn=pp.LineStart()+pp.LineEnd()) + EOL 
>>> lines = pp.OneOrMore(line) 
>>> section = pp.Or([pp.Keyword('Projects'), pp.Keyword('Education')]).setParseAction(accum) + EOL + lines.setParseAction(accum) 
>>> sections = pp.OneOrMore(section) 
>>> r = sections.parseString(input_text) 
>>> accum.result['Education'] 
['institution 1', 'institution 2', 'institution 3', 'institution 4'] 
>>> accum.result['Projects'] 
['project 1', 'project 2', 'project 3', 'assignment 5', 'assignment 8', 'assignment 10'] 

Added another approach.

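Please use the '|' or '^' operators to combine alternative expressions; that cleans up the grammar a bit: 'keyword = pp.Keyword('Projects') | pp.Keyword('Education')'. Pyparsing includes 'restOfLine' as a simple way to read everything up to the next newline, so line simplifies to: 'line = pp.LineStart() + pp.NotAny(keyword) + pp.restOfLine + EOL'. The last thing you need is the Group class, to group each keyword with its associated lines: 'section = pp.Group(keyword('section') + EOL + lines('lines'))'. – PaulMcG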

Finally, print the results with print(r.dump()). That shows how to access each parsed section, for example 'for section_data in r: print(section_data.section)'. No repr or eval is needed. Read more about pyparsing's 'ParseResults' class at https://pythonhosted.org/pyparsing/pyparsing.ParseResults-class.html. – PaulMcG
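
A hedged sketch along the lines of the comments above, grouping each keyword with its lines and then merging repeated sections. It departs from the suggested grammar in a few places, all of which are assumptions made here: a Regex that requires a non-blank line stands in for restOfLine so that a blank line ends a section, LineStart is left out, the lines are wrapped in an inner Group, and the sample text is invented to resemble the transcript.

```python
from collections import defaultdict

import pyparsing as pp

# Keep newlines significant; this must run before the expressions are built.
pp.ParserElement.setDefaultWhitespaceChars(" \t")

EOL = pp.LineEnd().suppress()
keyword = pp.Keyword("Projects") | pp.Keyword("Education")
# Assumption: a content line is any non-blank line that is not a heading.
line = ~keyword + pp.Regex(r"\S[^\n]*") + EOL
section = pp.Group(
    keyword("section") + EOL
    + pp.Group(pp.OneOrMore(line))("lines")
    + pp.ZeroOrMore(EOL)  # swallow blank lines after the section
)
doc = pp.ZeroOrMore(EOL) + pp.OneOrMore(section)

sample = """
Projects
project 1
project 2

Education
institution 1

Projects
assignment 5
"""
result = doc.parseString(sample)

# Merge repeated sections using the named results, no repr or eval needed.
merged = defaultdict(list)
for s in result:
    merged[s.section].extend(s.lines.asList())
print(dict(merged))
```

Calling print(result.dump()) on the same result lists each group with its 'section' and 'lines' names, which is a quick way to see what is accessible.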