2017-03-06 33 views
3

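Accessing parsed elements using Pyparsing

I have a bunch of sentences that I need to parse and convert into corresponding regex search code. Examples of my sentences -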

LINE_CONTAINS phrase one BEFORE {phrase2 AND phrase3} AND LINE_STARTSWITH Therefore we

- This means that, in the line, phrase one comes somewhere before phrase2 and phrase3. Also, the line must start with Therefore we

LINE_CONTAINS abc {upto 4 words} xyz {upto 3 words} pqr

- This means I need to allow up to 4 words between the first 2 phrases and up to 3 words between the last 2 phrases

With help from Paul McGuire (here), the following grammar was written -

from pyparsing import (CaselessKeyword, Word, alphanums, nums, MatchFirst, quotedString, 
    infixNotation, Combine, opAssoc, Suppress, pyparsing_common, Group, OneOrMore, ZeroOrMore) 

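# LINE_ENDSWITH is used further down, so it has to be defined here as well
LINE_CONTAINS, LINE_STARTSWITH, LINE_ENDSWITH = map(CaselessKeyword, 
    """LINE_CONTAINS LINE_STARTSWITH LINE_ENDSWITH""".split()) 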

NOT, AND, OR = map(CaselessKeyword, "NOT AND OR".split()) 
BEFORE, AFTER, JOIN = map(CaselessKeyword, "BEFORE AFTER JOIN".split()) 

lpar=Suppress('{') 
rpar=Suppress('}') 

keyword = MatchFirst([LINE_CONTAINS, LINE_STARTSWITH, LINE_ENDSWITH, NOT, AND, OR, 
         BEFORE, AFTER, JOIN]) # declaring all keywords and assigning order for all further use 

phrase_word = ~keyword + (Word(alphanums + '_')) 

upto_N_words = Group(lpar + 'upto' + pyparsing_common.integer('numberofwords') + 'words' + rpar) 

phrase_term = Group(OneOrMore(phrase_word) + ZeroOrMore((upto_N_words) + OneOrMore(phrase_word))) 



phrase_expr = infixNotation(phrase_term, 
          [ 
          ((BEFORE | AFTER | JOIN), 2, opAssoc.LEFT,), # (opExpr, numTerms, rightLeftAssoc, parseAction) 
          (NOT, 1, opAssoc.RIGHT,), 
          (AND, 2, opAssoc.LEFT,), 
          (OR, 2, opAssoc.LEFT), 
          ], 
          lpar=Suppress('{'), rpar=Suppress('}') 
          ) # structure of a single phrase with its operators 

line_term = Group((LINE_CONTAINS | LINE_STARTSWITH | LINE_ENDSWITH)("line_directive") + 
        Group(phrase_expr)("phrase")) # basically giving structure to a single sub-rule having line-term and phrase 
line_contents_expr = infixNotation(line_term, 
            [(NOT, 1, opAssoc.RIGHT,), 
            (AND, 2, opAssoc.LEFT,), 
            (OR, 2, opAssoc.LEFT), 
            ] 
            ) # grammar for the entire rule/sentence 

sample1 = """ 
LINE_CONTAINS phrase one BEFORE {phrase2 AND phrase3} AND LINE_STARTSWITH Therefore we 
""" 
sample2 = """ 
LINE_CONTAINS abcd {upto 4 words} xyzw {upto 3 words} pqrs BEFORE something else 
""" 

My question now is - how do I access the parsed elements so that I can convert the sentences into my regex code? For this, I tried the following -

import pprint 

parsed = line_contents_expr.parseString(sample1) # run once with sample1, then again with sample2 
print (parsed[0].asDict()) 
print (parsed) 
pprint.pprint(parsed) 

The result of the above code for sample1 was -

{}

[[['LINE_CONTAINS', [[['sentence', 'one'], 'BEFORE', [['sentence2'], 'AND', ['sentence3']]]]], 'AND', ['LINE_STARTSWITH', [['Therefore', 'we']]]]]

([([(['LINE_CONTAINS', ([([(['sentence', 'one'], {}), 'BEFORE', ([(['sentence2'], {}), 'AND', (['sentence3'], {})], {})], {})], {})], {'phrase': [(([([(['sentence', 'one'], {}), 'BEFORE', ([(['sentence2'], {}), 'AND', (['sentence3'], {})], {})], {})], {}), 1)], 'line_directive': [('LINE_CONTAINS', 0)]}), 'AND', (['LINE_STARTSWITH', ([(['Therefore', 'we'], {})], {})], {'phrase': [(([(['Therefore', 'we'], {})], {}), 1)], 'line_directive': [('LINE_STARTSWITH', 0)]})], {})], {})

The result of the above code for sample2 was -

{'phrase': [[['abcd', {'numberofwords': 4}, 'xyzw', {'numberofwords': 3}, 'pqrs'], 'BEFORE', ['something', 'else']]], 'line_directive': 'LINE_CONTAINS'}

[['LINE_CONTAINS', [[['abcd', ['upto', 4, 'words'], 'xyzw', ['upto', 3, 'words'], 'pqrs'], 'BEFORE', ['something', 'else']]]]]

([(['LINE_CONTAINS', ([([(['abcd', (['upto', 4, 'words'], {'numberofwords': [(4, 1)]}), 'xyzw', (['upto', 3, 'words'], {'numberofwords': [(3, 1)]}), 'pqrs'], {}), 'BEFORE', (['something', 'else'], {})], {})], {})], {'phrase': [(([([(['abcd', (['upto', 4, 'words'], {'numberofwords': [(4, 1)]}), 'xyzw', (['upto', 3, 'words'], {'numberofwords': [(3, 1)]}), 'pqrs'], {}), 'BEFORE', (['something', 'else'], {})], {})], {}), 1)], 'line_directive': [('LINE_CONTAINS', 0)]})], {})

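My questions based on the above output are -

  1. Why does pprint (pretty print) show a more detailed parse than the normal print?
  2. Why does the asDict() method give no output for sample1 but does for sample2?
  3. Whenever I try to access the parsed elements using print (parsed.numberofwords), parsed.line_directive or parsed.line_term, it gives me nothing. How can I access these elements in order to use them to build my regex code?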

Answers

2

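To answer your printing questions: 1) pprint is there to pretty-print a nested list of tokens, without showing any results names - it is essentially a wrapper around calling pprint.pprint(results.asList()). 2) asDict() is there to convert your parsed results to an actual Python dict, so it only shows the results names (with nesting if you have names within names).

To view the contents of your parsed output, you are best off using print(result.dump()). dump() shows both the nesting of the results and any named results along the way.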

result = line_contents_expr.parseString(sample2) 
print(result.dump()) 
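To make the difference between the three views concrete, here is a minimal sketch on a made-up two-token grammar. The names fruit, qty and amount are hypothetical and exist only for this illustration; they are not part of the grammar in the question.

```python
from pyparsing import Group, Word, alphas, pyparsing_common

# Hypothetical mini-grammar: a named word followed by a named group
# that itself holds a named integer.
item = Word(alphas)("fruit") + Group(pyparsing_common.integer("amount"))("qty")

result = item.parseString("apples 4")

print(result.asList())  # the nested tokens only, results names are dropped
print(result.asDict())  # only what was given a results name, as a plain dict
print(result.dump())    # the tokens, followed by the names at each nesting level

# Named results can be read with attribute syntax or dict syntax.
print(result.fruit, result["qty"].amount)
```

A name set inside a Group is only visible on that group, which is why amount has to be reached through qty here.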

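I would also suggest using expr.runTests to give you the dump() output as well as any exceptions and exception locators. With your code, you could most easily do this using: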

line_contents_expr.runTests([sample1, sample2]) 

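But I also suggest you step back a second and think about just what this {upto n words} business is all about. Look at your samples and draw rectangles around the line terms, and then within the line terms draw circles around the phrase terms. (This would be a good exercise leading up to writing a BNF description of this grammar for yourself, which I always recommend as a way of getting your head wrapped around the problem.) What if you treated the upto expressions as another operator? To see this, change phrase_term back to the way you had it: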

phrase_term = Group(OneOrMore(phrase_word)) 

and then change the first precedence entry in your definition of phrase_expr to:

((BEFORE | AFTER | JOIN | upto_N_words), 2, opAssoc.LEFT,), 

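Or give some thought to whether the upto operator should have a higher or lower precedence than BEFORE, AFTER and JOIN, and adjust the precedence list accordingly.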
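As a rough sketch of what adjusting the precedence means in practice, the variant below gives upto its own level ahead of BEFORE, AFTER and JOIN, so that it binds more tightly. It is a cut-down grammar written only for this illustration (no LINE_xxx directives, no NOT/AND/OR), not a replacement for the full one.

```python
from pyparsing import (CaselessKeyword, Group, OneOrMore, Suppress, Word,
                       alphanums, infixNotation, opAssoc, pyparsing_common)

BEFORE, AFTER, JOIN = map(CaselessKeyword, "BEFORE AFTER JOIN".split())
keyword = BEFORE | AFTER | JOIN
lpar, rpar = Suppress('{'), Suppress('}')

phrase_word = ~keyword + Word(alphanums + '_')
phrase_term = Group(OneOrMore(phrase_word))
upto_N_words = Group(lpar + 'upto' + pyparsing_common.integer('numberofwords')
                     + 'words' + rpar)

# Earlier entries in the list bind more tightly, so upto is listed first.
phrase_expr = infixNotation(phrase_term, [
    (upto_N_words, 2, opAssoc.LEFT),
    (BEFORE | AFTER | JOIN, 2, opAssoc.LEFT),
], lpar=lpar, rpar=rpar)

nested = phrase_expr.parseString("abcd {upto 4 words} xyzw BEFORE something")
print(nested.asList())
```

With this ordering the upto chain is grouped into a single operand of BEFORE, instead of sitting in the same flat list as the BEFORE operator.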

With this change, I get this output from calling runTests on your samples:

LINE_CONTAINS phrase one BEFORE {phrase2 AND phrase3} AND LINE_STARTSWITH Therefore we 

[[['LINE_CONTAINS', [[['phrase', 'one'], 'BEFORE', [['phrase2'], 'AND', ['phrase3']]]]], 'AND', ['LINE_STARTSWITH', [['Therefore', 'we']]]]] 
[0]: 
    [['LINE_CONTAINS', [[['phrase', 'one'], 'BEFORE', [['phrase2'], 'AND', ['phrase3']]]]], 'AND', ['LINE_STARTSWITH', [['Therefore', 'we']]]] 
    [0]: 
    ['LINE_CONTAINS', [[['phrase', 'one'], 'BEFORE', [['phrase2'], 'AND', ['phrase3']]]]] 
    - line_directive: 'LINE_CONTAINS' 
    - phrase: [[['phrase', 'one'], 'BEFORE', [['phrase2'], 'AND', ['phrase3']]]] 
     [0]: 
     [['phrase', 'one'], 'BEFORE', [['phrase2'], 'AND', ['phrase3']]] 
     [0]: 
      ['phrase', 'one'] 
     [1]: 
      BEFORE 
     [2]: 
      [['phrase2'], 'AND', ['phrase3']] 
      [0]: 
      ['phrase2'] 
      [1]: 
      AND 
      [2]: 
      ['phrase3'] 
    [1]: 
    AND 
    [2]: 
    ['LINE_STARTSWITH', [['Therefore', 'we']]] 
    - line_directive: 'LINE_STARTSWITH' 
    - phrase: [['Therefore', 'we']] 
     [0]: 
     ['Therefore', 'we'] 



LINE_CONTAINS abcd {upto 4 words} xyzw {upto 3 words} pqrs BEFORE something else 

[['LINE_CONTAINS', [[['abcd'], ['upto', 4, 'words'], ['xyzw'], ['upto', 3, 'words'], ['pqrs'], 'BEFORE', ['something', 'else']]]]] 
[0]: 
    ['LINE_CONTAINS', [[['abcd'], ['upto', 4, 'words'], ['xyzw'], ['upto', 3, 'words'], ['pqrs'], 'BEFORE', ['something', 'else']]]] 
    - line_directive: 'LINE_CONTAINS' 
    - phrase: [[['abcd'], ['upto', 4, 'words'], ['xyzw'], ['upto', 3, 'words'], ['pqrs'], 'BEFORE', ['something', 'else']]] 
    [0]: 
     [['abcd'], ['upto', 4, 'words'], ['xyzw'], ['upto', 3, 'words'], ['pqrs'], 'BEFORE', ['something', 'else']] 
     [0]: 
     ['abcd'] 
     [1]: 
     ['upto', 4, 'words'] 
     - numberofwords: 4 
     [2]: 
     ['xyzw'] 
     [3]: 
     ['upto', 3, 'words'] 
     - numberofwords: 3 
     [4]: 
     ['pqrs'] 
     [5]: 
     BEFORE 
     [6]: 
     ['something', 'else'] 

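You can iterate over these results and pick them apart, but you are fast reaching the point where you should look at building executable nodes from the different precedence levels - see the SimpleBool.py example on the pyparsing wiki for how to do this.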
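For the "iterate and pick apart" route, a minimal sketch could look like the following. It rebuilds the grammar with upto treated as an operator and then reads line_directive, phrase and numberofwords by name. The helper upto_counts is hypothetical and written only for this example.

```python
from pyparsing import (CaselessKeyword, Group, MatchFirst, OneOrMore,
                       ParseResults, Suppress, Word, alphanums,
                       infixNotation, opAssoc, pyparsing_common)

LINE_CONTAINS, LINE_STARTSWITH, LINE_ENDSWITH = map(
    CaselessKeyword, "LINE_CONTAINS LINE_STARTSWITH LINE_ENDSWITH".split())
NOT, AND, OR = map(CaselessKeyword, "NOT AND OR".split())
BEFORE, AFTER, JOIN = map(CaselessKeyword, "BEFORE AFTER JOIN".split())
lpar, rpar = Suppress('{'), Suppress('}')
keyword = MatchFirst([LINE_CONTAINS, LINE_STARTSWITH, LINE_ENDSWITH,
                      NOT, AND, OR, BEFORE, AFTER, JOIN])

phrase_word = ~keyword + Word(alphanums + '_')
upto_N_words = Group(lpar + 'upto' + pyparsing_common.integer('numberofwords')
                     + 'words' + rpar)
phrase_term = Group(OneOrMore(phrase_word))
phrase_expr = infixNotation(phrase_term, [
    ((BEFORE | AFTER | JOIN | upto_N_words), 2, opAssoc.LEFT),
    (NOT, 1, opAssoc.RIGHT),
    (AND, 2, opAssoc.LEFT),
    (OR, 2, opAssoc.LEFT),
], lpar=lpar, rpar=rpar)
line_term = Group((LINE_CONTAINS | LINE_STARTSWITH | LINE_ENDSWITH)("line_directive")
                  + Group(phrase_expr)("phrase"))


def upto_counts(tokens):
    """Hypothetical helper: depth-first walk collecting every numberofwords."""
    found = []
    for tok in tokens:
        if isinstance(tok, ParseResults):
            # "in" tests for a results name on this particular group
            if "numberofwords" in tok:
                found.append(tok.numberofwords)
            found.extend(upto_counts(tok))
    return found


parsed = line_term.parseString(
    "LINE_CONTAINS abcd {upto 4 words} xyzw {upto 3 words} pqrs BEFORE something else")
rule = parsed[0]                 # names live on the group, not on the top level
print(rule.line_directive)       # attribute syntax
print(rule["phrase"].asList())   # dict syntax
print(upto_counts(rule.phrase))
```

The point to notice is that the names are attached to the Group they were parsed in, so parsed.line_directive on the outermost result is empty while parsed[0].line_directive is not.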

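EDIT: Please review this pared-down version of a parser for phrase_expr, and how it creates Node instances that themselves generate the output. See how numberofwords is accessed on the operator in the UpToNode class. See how "xyz abc" gets interpreted as "xyz AND abc" with an implicit AND operator.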

from pyparsing import * 
import re 

UPTO, WORDS, AND, OR = map(CaselessKeyword, "upto words and or".split()) 
keyword = UPTO | WORDS | AND | OR 
LBRACE,RBRACE = map(Suppress, "{}") 
integer = pyparsing_common.integer() 

word = ~keyword + Word(alphas) 
upto_expr = Group(LBRACE + UPTO + integer("numberofwords") + WORDS + RBRACE) 

class Node(object): 
    def __init__(self, tokens): 
        self.tokens = tokens 

    def generate(self): 
        pass 

class LiteralNode(Node): 
    def generate(self): 
        return "(%s)" % re.escape(self.tokens[0]) 
    def __repr__(self): 
        return repr(self.tokens[0]) 

class AndNode(Node): 
    def generate(self): 
        tokens = self.tokens[0] 
        return '.*'.join(t.generate() for t in tokens[::2]) 

    def __repr__(self): 
        return ' AND '.join(repr(t) for t in self.tokens[0].asList()[::2]) 

class OrNode(Node): 
    def generate(self): 
        tokens = self.tokens[0] 
        return '|'.join(t.generate() for t in tokens[::2]) 

    def __repr__(self): 
        return ' OR '.join(repr(t) for t in self.tokens[0].asList()[::2]) 

class UpToNode(Node): 
    def generate(self): 
        tokens = self.tokens[0] 
        ret = tokens[0].generate() 
        word_re = r"\s+\S+" 
        space_re = r"\s+" 
        for op, operand in zip(tokens[1::2], tokens[2::2]): 
            # op contains the parsed "upto" expression 
            ret += "((%s){0,%d}%s)" % (word_re, op.numberofwords, space_re) + operand.generate() 
        return ret 

    def __repr__(self): 
        tokens = self.tokens[0] 
        ret = repr(tokens[0]) 
        for op, operand in zip(tokens[1::2], tokens[2::2]): 
            # op contains the parsed "upto" expression 
            ret += " {0-%d WORDS} " % (op.numberofwords) + repr(operand) 
        return ret 

IMPLICIT_AND = Empty().setParseAction(replaceWith("AND")) 

phrase_expr = infixNotation(word.setParseAction(LiteralNode), 
     [ 
     (upto_expr, 2, opAssoc.LEFT, UpToNode), 
     (AND | IMPLICIT_AND, 2, opAssoc.LEFT, AndNode), 
     (OR, 2, opAssoc.LEFT, OrNode), 
     ]) 

tests = """\ 
     xyz 
     xyz abc 
     xyz {upto 4 words} def""".splitlines() 

for t in tests: 
    t = t.strip() 
    if not t: 
        continue 
    print(t) 
    try: 
        parsed = phrase_expr.parseString(t) 
    except ParseException as pe: 
        print(' '*pe.loc + '^') 
        print(pe) 
        continue 
    print(parsed) 
    print(parsed[0].generate()) 
    print() 

Prints:

xyz 
['xyz'] 
(xyz) 

xyz abc 
['xyz' AND 'abc'] 
(xyz).*(abc) 

xyz {upto 4 words} def 
['xyz' {0-4 WORDS} 'def'] 
(xyz)((\s+\S+){0,4}\s+)(def) 
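As a quick sanity check of the last generated pattern, it can be tried with the standard re module. This is only a sketch; the helper name gap_ok and the test sentences are made up for the illustration.

```python
import re

# The regex generated above for "xyz {upto 4 words} def".
pattern = re.compile(r"(xyz)((\s+\S+){0,4}\s+)(def)")

def gap_ok(text):
    """Hypothetical helper: True if text satisfies the generated pattern."""
    return pattern.search(text) is not None

print(gap_ok("xyz def"))                          # no words in between
print(gap_ok("xyz one two three four def"))       # four words in between
print(gap_ok("xyz one two three four five def"))  # five words in between
```

Note that the pattern has no word boundaries, so xyz would also match at the end of a longer word such as abcxyz; adding \b around the literals is one possible refinement if that matters for your rules.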

Expand on this to support your LINE_xxx expressions.
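One possible direction for that last step, sketched under assumptions: the function line_regex and its mapping from directive to anchor are not from the code above. They are one guess at how a LINE_xxx node might wrap the regex that a phrase node generates.

```python
import re

def line_regex(directive, phrase_regex):
    """Hypothetical helper: anchor a phrase-level regex per LINE_xxx directive.

    The mapping is an assumption made for illustration. The phrase regex is
    wrapped in a non-capturing group so that a top-level '|' from an OR node
    cannot escape the anchor.
    """
    wrapped = "(?:%s)" % phrase_regex
    if directive == "LINE_STARTSWITH":
        return "^" + wrapped
    if directive == "LINE_ENDSWITH":
        return wrapped + "$"
    if directive == "LINE_CONTAINS":
        return wrapped
    raise ValueError("unknown directive: %r" % directive)

# "(Therefore).*(we)" has the shape the implicit-AND node produces for "Therefore we".
starts = line_regex("LINE_STARTSWITH", r"(Therefore).*(we)")
print(starts)
print(bool(re.search(starts, "Therefore we conclude")))
print(bool(re.search(starts, "And Therefore we conclude")))
```

In a full solution this logic would more naturally live in the generate() method of a node class attached to line_term, in the same way UpToNode is attached to the upto operator.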

+0

Paul, are you suggesting that I work from the contents of the 'results.dump()' string in order to process the elements for further work? – user1993

+1

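No, absolutely not! I was simply directing you to use 'results.dump()' to display the contents of 'results'. You should be able to iterate over 'results' directly, as you would a list, and you can reference fields by name using dict or object syntax. The 'dump()' output should guide you as to which mode to use and when. – PaulMcG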

+0

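Paul, you mentioned how I can use 'dump' and 'runTests' to easily display the parsed results. But as I said in the question, I am trying to access the parsed elements in order to turn them into regex. Specifically, my question 3: how do you suggest I access things like 'numberofwords' and 'line_term' for a parsed line? – user1993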