Grouping multiple sections (matches) with pyparsing

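I can't figure out how to group zero or more repeated sections of a text with pyparsing. In other words, I want to merge multiple matches into a single named result set. Note that I want to use pyparsing because I have many sections with different rules.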

from pyparsing import *  

input_text = """ 
Projects 
project a created in c# 

Education 
university of college 

Projects 
project b created in python 
""" 

project_marker = LineStart() + Literal('Projects') + LineEnd() 
education_marker = LineStart() + Literal('Education') + LineEnd() 
markers = project_marker^education_marker 

project_section = Group(
    project_marker + SkipTo(markers | stringEnd).setResultsName('project') 
).setResultsName('projects') 
education_section = Group(
    education_marker + SkipTo(markers | stringEnd).setResultsName('education') 
).setResultsName('educations') 
sections = project_section^education_section 

text = StringStart() + SkipTo(sections | StringEnd()) 
doc = Optional(text) + ZeroOrMore(sections) 
result = doc.parseString(input_text) 

print(result) 
# ['', ['Projects', '\n', 'project a created in c#'], ['Education', '\n', 'university of college'], ['Projects', '\n', 'project b created in python']] 
print(result.projects) 
# ['Projects', '\n', 'project b created in python'] 
print(result.projects[0].project) 
# AttributeError: 'str' object has no attribute 'project'

2017-08-17 scott.werner.vt

Can the 'Projects' section contain multiple lines describing a project? And the same for the 'Education' section?

Add 'listAllMatches=True' to the 'setResultsName' calls on project_section and education_section. Then I think your code will work as is. – PaulMcG

@PaulMcG Adding 'listAllMatches=True' grouped the results together, thanks!

As per @PaulMcG's solution, add listAllMatches=True to setResultsName; see https://pythonhosted.org/pyparsing/pyparsing.ParserElement-class.html#setResultsName.

project_marker = LineStart() + Literal('Projects') + LineEnd() 
education_marker = LineStart() + Literal('Education') + LineEnd() 
markers = project_marker^education_marker 

project_section = Group(
    project_marker + SkipTo(markers | stringEnd).setResultsName('project') 
).setResultsName('projects', listAllMatches=True) 
education_section = Group(
    education_marker + SkipTo(markers | stringEnd).setResultsName('education') 
).setResultsName('educations', listAllMatches=True) 
sections = project_section^education_section 

text = StringStart() + SkipTo(sections | StringEnd()) 
doc = Optional(text) + ZeroOrMore(sections) 
result = doc.parseString(input_text)
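
To make the effect of listAllMatches=True concrete, here is a minimal sketch on a simplified grammar, not the exact one above. It assumes default whitespace skipping and recognises headers with Keyword rather than LineStart()/LineEnd(), so a header word appearing inside body text would also be treated as a header. The names sample, header and body are illustrative only.

```python
import pyparsing as pp

# Illustrative input; the section names are taken from the question.
sample = "Projects\nproject a\n\nEducation\nsome school\n\nProjects\nproject b\n"

# Simplified header detection (assumption: headers never occur in body text).
header = pp.Keyword("Projects") | pp.Keyword("Education")
body = pp.SkipTo(header | pp.StringEnd())

# listAllMatches=True keeps every match under the name, not just the last one.
project_section = pp.Group(
    pp.Keyword("Projects") + body("project")
).setResultsName("projects", listAllMatches=True)
education_section = pp.Group(
    pp.Keyword("Education") + body("education")
).setResultsName("educations", listAllMatches=True)

doc = pp.ZeroOrMore(project_section | education_section)
result = doc.parseString(sample)

# Each entry of result.projects is one Group, so its inner name is reachable.
projects = [section.project.strip() for section in result.projects]
educations = [section.education.strip() for section in result.educations]
print(projects)
print(educations)
```

Without listAllMatches=True, result.projects would hold only the last matching group, which is why indexing it in the question returned a plain string.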

2017-08-18 17:21:29

Here is my tentative answer, not that I'm proud of it. I borrowed heavily from https://stackoverflow.com/a/5824309/131187.

>>> import pyparsing as pp 
>>> pp.ParserElement.setDefaultWhitespaceChars(" \t") 
>>> EOL = pp.LineEnd().suppress() 
>>> keyword = pp.Or([pp.Keyword('Projects'), pp.Keyword('Education')]) 
>>> line = pp.LineStart() + pp.NotAny(keyword) + pp.SkipTo(pp.LineEnd(), failOn=pp.LineStart()+pp.LineEnd()) + EOL 
>>> lines = pp.OneOrMore(line) 
>>> section = pp.Or([pp.Keyword('Projects'), pp.Keyword('Education')])('section') + EOL + lines('lines') 
>>> sections = pp.OneOrMore(section) 
>>> r = sections.parseString(input_text)

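As you can see from the repr output below, the parser does collect the information correctly, and in a form from which it can be assembled. However, I was unable to find a way to access all of the results that are clearly present in what parseString returns.

I resorted to calling eval on the repr. Having done that, I was able to pick out all the parts and assign them to a dictionary-like object.

To be honest, this can be done without pyparsing. Read a line and note whether it is a keyword. If it is, remember it. Then, until you read another keyword, just put every line you read into a dictionary under the most recent keyword.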
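
The line-by-line idea described above could be sketched roughly as follows, using only the standard library. The function name collect_sections and the KEYWORDS set are made up for illustration, and the sketch assumes that a header is a line consisting of exactly one keyword and that blank lines can be ignored.

```python
from collections import defaultdict

# Section headers assumed from the question's sample input.
KEYWORDS = {"Projects", "Education"}


def collect_sections(text):
    """Group non-blank lines under the most recently seen keyword line."""
    sections = defaultdict(list)
    current = None  # no section seen yet
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if not line:
            continue  # skip blank lines
        if line in KEYWORDS:
            current = line  # remember the latest header
        elif current is not None:
            sections[current].append(line)
    return dict(sections)


sample = """
Projects
project a created in c#

Education
university of college

Projects
project b created in python
"""

grouped = collect_sections(sample)
print(grouped)
```

Lines that appear before the first header are dropped in this sketch; whether that is the right behaviour depends on the input.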

>>> repr(r) 
"(['Projects', 'project 1', 'project 2', 'project 3', '', 'Education', 'institution 1', 'institution 2', 'institution 3', 'institution 4', '', 'Projects', 'assignment 5', 'assignment 8', 'assignment 10', ''], {'lines': [(['project 1', 'project 2', 'project 3', ''], {}), (['institution 1', 'institution 2', 'institution 3', 'institution 4', ''], {}), (['assignment 5', 'assignment 8', 'assignment 10', ''], {})], 'section': ['Projects', 'Education', 'Projects']})" 
>>> evil_r = eval(repr(r)) 
>>> evil_r 
(['Projects', 'project 1', 'project 2', 'project 3', '', 'Education', 'institution 1', 'institution 2', 'institution 3', 'institution 4', '', 'Projects', 'assignment 5', 'assignment 8', 'assignment 10', ''], {'lines': [(['project 1', 'project 2', 'project 3', ''], {}), (['institution 1', 'institution 2', 'institution 3', 'institution 4', ''], {}), (['assignment 5', 'assignment 8', 'assignment 10', ''], {})], 'section': ['Projects', 'Education', 'Projects']}) 
>>> evil_r[1]['lines'] 
[(['project 1', 'project 2', 'project 3', ''], {}), (['institution 1', 'institution 2', 'institution 3', 'institution 4', ''], {}), (['assignment 5', 'assignment 8', 'assignment 10', ''], {})] 
>>> evil_r[1]['section'] 
['Projects', 'Education', 'Projects'] 
>>> from collections import defaultdict 
>>> section_info = defaultdict(list) 
>>> for k, kind in enumerate(evil_r[1]['section']): 
...  section_info[kind].extend(evil_r[1]['lines'][k][0][:-1]) 
>>> for section in section_info: 
...  section, section_info[section] 
...  
('Education', ['institution 1', 'institution 2', 'institution 3', 'institution 4']) 
('Projects', ['project 1', 'project 2', 'project 3', 'assignment 5', 'assignment 8', 'assignment 10'])

Edit: Or you could do this. It needs tidying, but at least it doesn't use anything unorthodox.

>>> input_text = open('temp.txt').read() 
>>> import pyparsing as pp 
>>> pp.ParserElement.setDefaultWhitespaceChars(" \t") 
>>> from collections import defaultdict 
>>> class Accum: 
...  def __init__(self): 
...   self.current_section = None 
...   self.result = defaultdict(list) 
...  def __call__(self, s): 
...   if s[0] in ['Projects', 'Education']: 
...    self.current_section = s[0] 
...   else: 
...    self.result[self.current_section].extend(s[:-1]) 
... 
>>> accum = Accum() 
>>> EOL = pp.LineEnd().suppress() 
>>> keyword = pp.Or([pp.Keyword('Projects'), pp.Keyword('Education')]) 
>>> line = pp.LineStart() + pp.NotAny(keyword) + pp.SkipTo(pp.LineEnd(), failOn=pp.LineStart()+pp.LineEnd()) + EOL 
>>> lines = pp.OneOrMore(line) 
>>> section = pp.Or([pp.Keyword('Projects'), pp.Keyword('Education')]).setParseAction(accum) + EOL + lines.setParseAction(accum) 
>>> sections = pp.OneOrMore(section) 
>>> r = sections.parseString(input_text) 
>>> accum.result['Education'] 
['institution 1', 'institution 2', 'institution 3', 'institution 4'] 
>>> accum.result['Projects'] 
['project 1', 'project 2', 'project 3', 'assignment 5', 'assignment 8', 'assignment 10']

2017-08-17 20:21:10

Added another approach.

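Please use the '|' or '^' operators to combine alternative expressions, which cleans up the grammar a bit: 'keyword = pp.Keyword('Projects') | pp.Keyword('Education')'. Pyparsing includes 'restOfLine' as a simple way to read everything up to the next newline, so line simplifies to: 'line = pp.LineStart() + pp.NotAny(keyword) + pp.restOfLine + EOL'. The last thing you need is the Group class, to group each keyword with its associated lines: 'section = pp.Group(keyword('section') + EOL + lines('lines'))'. – PaulMcG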

Finally, use print(r.dump()) to print the results. This shows how to access each parsed section, e.g. 'for section_data in r: print(section_data.section)'. No repr or eval is needed. Read more about pyparsing's 'ParseResults' class at https://pythonhosted.org/pyparsing/pyparsing.ParseResults-class.html. – PaulMcG
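
As a rough illustration of the Group-and-iterate approach suggested in these comments, the sketch below wraps each keyword and its text in a Group and then loops over the results. It is a simplification rather than the commenter's exact grammar: it keeps pyparsing's default whitespace handling and uses SkipTo for the section body instead of a per-line expression, so a keyword appearing inside body text would start a new section. The sample text and the names body and collected are assumptions for the example.

```python
import pyparsing as pp

# Illustrative input modelled on the answer's output.
sample = (
    "Projects\nproject 1\nproject 2\n\n"
    "Education\ninstitution 1\n\n"
    "Projects\nassignment 5\n"
)

keyword = pp.Keyword("Projects") | pp.Keyword("Education")
# Everything up to the next keyword (or end of input) is the section body.
body = pp.SkipTo(keyword | pp.StringEnd())
# Group keeps each keyword together with its own body text.
section = pp.Group(keyword("section") + body("lines"))
sections = pp.OneOrMore(section)

r = sections.parseString(sample)

# Each element of r is one Group; its named fields are plain attributes.
collected = {}
for section_data in r:
    lines = [ln for ln in section_data.lines.splitlines() if ln.strip()]
    collected.setdefault(section_data.section, []).extend(lines)

print(collected)
```

Calling print(r.dump()) on the same result lists the nested names, which is a quick way to see what can be accessed.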
