这是我的试探性答案,不是我为它感到自豪。我从https://stackoverflow.com/a/5824309/131187中分一大堆。
>>> import pyparsing as pp
>>> pp.ParserElement.setDefaultWhitespaceChars(" \t")
>>> EOL = pp.LineEnd().suppress()
>>> keyword = pp.Or([pp.Keyword('Projects'), pp.Keyword('Education')])
>>> line = pp.LineStart() + pp.NotAny(keyword) + pp.SkipTo(pp.LineEnd(), failOn=pp.LineStart()+pp.LineEnd()) + EOL
>>> lines = pp.OneOrMore(line)
>>> section = pp.Or([pp.Keyword('Projects'), pp.Keyword('Education')])('section') + EOL + lines('lines')
>>> sections = pp.OneOrMore(section)
>>> r = sections.parseString(input_text)
正如你可以看到下方的这句话,解析器成功地正确收集信息,并在这样一种方式,它可以被组装,这将在目前所示的收集它。但是,我无法找到访问parseString
中清晰可用的所有结果的方法。
我采用了eval
来代替repr
。做完这些之后,我能够挑选出所有的部分,并将它们分配给一个类似于字典的对象。
说实话,没有pyparsing就可以做到这一点。阅读一行,注意它是否是关键字。如果是,请记住它。然后,直到您阅读另一个关键字,只需将您阅读的所有行放在最近的关键字下的字典中即可。
>>> repr(r)
"(['Projects', 'project 1', 'project 2', 'project 3', '', 'Education', 'institution 1', 'institution 2', 'institution 3', 'institution 4', '', 'Projects', 'assignment 5', 'assignment 8', 'assignment 10', ''], {'lines': [(['project 1', 'project 2', 'project 3', ''], {}), (['institution 1', 'institution 2', 'institution 3', 'institution 4', ''], {}), (['assignment 5', 'assignment 8', 'assignment 10', ''], {})], 'section': ['Projects', 'Education', 'Projects']})"
>>> evil_r = eval(repr(r))
>>> evil_r
(['Projects', 'project 1', 'project 2', 'project 3', '', 'Education', 'institution 1', 'institution 2', 'institution 3', 'institution 4', '', 'Projects', 'assignment 5', 'assignment 8', 'assignment 10', ''], {'lines': [(['project 1', 'project 2', 'project 3', ''], {}), (['institution 1', 'institution 2', 'institution 3', 'institution 4', ''], {}), (['assignment 5', 'assignment 8', 'assignment 10', ''], {})], 'section': ['Projects', 'Education', 'Projects']})
>>> evil_r[1]['lines']
[(['project 1', 'project 2', 'project 3', ''], {}), (['institution 1', 'institution 2', 'institution 3', 'institution 4', ''], {}), (['assignment 5', 'assignment 8', 'assignment 10', ''], {})]
>>> evil_r[1]['section']
['Projects', 'Education', 'Projects']
>>> from collections import defaultdict
>>> section_info = defaultdict(list)
>>> for k, kind in enumerate(evil_r[1]['section']):
... section_info[kind].extend(evil_r[1]['lines'][k][0][:-1])
>>> for section in section_info:
... section, section_info[section]
...
('Education', ['institution 1', 'institution 2', 'institution 3', 'institution 4'])
('Projects', ['project 1', 'project 2', 'project 3', 'assignment 5', 'assignment 8', 'assignment 10'])
编辑:或者你可以做到这一点。需要整理。至少它不使用任何非正统的东西。
>>> input_text = open('temp.txt').read()
>>> import pyparsing as pp
>>> pp.ParserElement.setDefaultWhitespaceChars(" \t")
>>> from collections import defaultdict
>>> class Accum:
... def __init__(self):
... self.current_section = None
... self.result = defaultdict(list)
... def __call__(self, s):
... if s[0] in ['Projects', 'Education']:
... self.current_section = s[0]
... else:
... self.result[self.current_section].extend(s[:-1])
...
>>> accum = Accum()
>>> EOL = pp.LineEnd().suppress()
>>> keyword = pp.Or([pp.Keyword('Projects'), pp.Keyword('Education')])
>>> line = pp.LineStart() + pp.NotAny(keyword) + pp.SkipTo(pp.LineEnd(), failOn=pp.LineStart()+pp.LineEnd()) + EOL
>>> lines = pp.OneOrMore(line)
>>> section = pp.Or([pp.Keyword('Projects'), pp.Keyword('Education')]).setParseAction(accum) + EOL + lines.setParseAction(accum)
>>> sections = pp.OneOrMore(section)
>>> r = sections.parseString(input_text)
>>> accum.result['Education']
['institution 1', 'institution 2', 'institution 3', 'institution 4']
>>> accum.result['Projects']
['project 1', 'project 2', 'project 3', 'assignment 5', 'assignment 8', 'assignment 10']
'项目'部分是否可以包含多条描述项目的行?对于“教育”部分也是如此? –
在调用project_section和education_section上的'setResultsName'时,添加'listAllMatches = True'。然后我认为你的代码将按原样运行。 – PaulMcG
@PaulMcG添加'listAllMatches = True'将结果分组在一起,谢谢! –