2011-12-27 87 views
17

我想实现像缩进语法的Python。ANTLR什么是最简单的方法来实现像缩进语法的Python?

来源例如:

ABC QWE 
    CDE EFG 
    EFG CDE 
    ABC 
    QWE ZXC 

我看到的,我需要的是实现两个记号INDENT和DEDENT语言,所以我可以写的东西,如:

grammar mygrammar; 
text: (ID | block)+; 
block: INDENT (ID|block)+ DEDENT; 
INDENT: ????; 
DEDENT: ????; 

有没有什么简单的方法来使用ANTLR来实现这一点?

(我更喜欢,如果有可能,使用标准的ANTLR词法分析器)

回答

17

我不知道什么是最简单的方法来处理它,但下面是一个比较简单的方法。每当在词法分析器中匹配换行符时,可以选择匹配一个或多个空格。如果换行符后有空格,请将这些空格的长度与当前的缩进大小进行比较。如果它大于当前缩进大小,则发出一个Indent标记,如果它小于当前缩进大小,则发出Dedent标记,如果它相同,则不要执行任何操作。

您还需要在文件的结尾发出了一些Dedent令牌让每一个Indent有一个匹配Dedent令牌。

为了这个正常工作,你必须前缘和后换行添加到您的输入源文件!

ANTRL3

一个快速演示:

grammar PyEsque; 

options { 
    output=AST; 
} 

tokens { 
    BLOCK; 
} 

@lexer::members { 

    private int previousIndents = -1; 
    private int indentLevel = 0; 
    java.util.Queue<Token> tokens = new java.util.LinkedList<Token>(); 

    @Override 
    public void emit(Token t) { 
    state.token = t; 
    tokens.offer(t); 
    } 

    @Override 
    public Token nextToken() { 
    super.nextToken(); 
    return tokens.isEmpty() ? Token.EOF_TOKEN : tokens.poll(); 
    } 

    private void jump(int ttype) { 
    indentLevel += (ttype == Dedent ? -1 : 1); 
    emit(new CommonToken(ttype, "level=" + indentLevel)); 
    } 
} 

parse 
: block EOF -> block 
; 

block 
: Indent block_atoms Dedent -> ^(BLOCK block_atoms) 
; 

block_atoms 
: (Id | block)+ 
; 

NewLine 
: NL SP? 
    { 
    int n = $SP.text == null ? 0 : $SP.text.length(); 
    if(n > previousIndents) { 
     jump(Indent); 
     previousIndents = n; 
    } 
    else if(n < previousIndents) { 
     jump(Dedent); 
     previousIndents = n; 
    } 
    else if(input.LA(1) == EOF) { 
     while(indentLevel > 0) { 
     jump(Dedent); 
     } 
    } 
    else { 
     skip(); 
    } 
    } 
; 

Id 
: ('a'..'z' | 'A'..'Z')+ 
; 

SpaceChars 
: SP {skip();} 
; 

fragment NL  : '\r'? '\n' | '\r'; 
fragment SP  : (' ' | '\t')+; 
fragment Indent : ; 
fragment Dedent : ; 

您可以用类测试解析器:

import org.antlr.runtime.*; 
import org.antlr.runtime.tree.*; 
import org.antlr.stringtemplate.*; 

public class Main { 
    public static void main(String[] args) throws Exception { 
    PyEsqueLexer lexer = new PyEsqueLexer(new ANTLRFileStream("in.txt")); 
    PyEsqueParser parser = new PyEsqueParser(new CommonTokenStream(lexer)); 
    CommonTree tree = (CommonTree)parser.parse().getTree(); 
    DOTTreeGenerator gen = new DOTTreeGenerator(); 
    StringTemplate st = gen.toDOT(tree); 
    System.out.println(st); 
    } 
}  

如果你现在把下面的一个叫in.txt文件:

 

AAA AAAAA 
    BBB BB B 
    BB BBBBB BB 
    CCCCCC C CC 
    BB BBBBBB 
    C CCC 
     DDD DD D 
     DDD D DDD 

(注意前后的换行符!)

然后你就会看到对应于以下AST输出:

enter image description here

注意,在继承我的演示也不会产生足够的dedents,就像从ccc dedenting到aaa(2 DEDENT语言需要标记):

aaa 
    bbb 
    ccc 
aaa 

您需要调整内部else if(n < previousIndents) { ... }代码可能发出超过1个DEDENT语言符号BAS编辑了npreviousIndents之间的差异。关闭我的头顶,这可能是这样的:

else if(n < previousIndents) { 
    // Note: assuming indent-size is 2. Jumping from previousIndents=6 
    // to n=2 will result in emitting 2 `Dedent` tokens 
    int numDedents = (previousIndents - n)/2; 
    while(numDedents-- > 0) { 
    jump(Dedent); 
    } 
    previousIndents = n; 
} 

ANTLR4

对于ANTLR4,做这样的事情:

grammar Python3; 

tokens { INDENT, DEDENT } 

@lexer::members { 
    // A queue where extra tokens are pushed on (see the NEWLINE lexer rule). 
    private java.util.LinkedList<Token> tokens = new java.util.LinkedList<>(); 
    // The stack that keeps track of the indentation level. 
    private java.util.Stack<Integer> indents = new java.util.Stack<>(); 
    // The amount of opened braces, brackets and parenthesis. 
    private int opened = 0; 
    // The most recently produced token. 
    private Token lastToken = null; 
    @Override 
    public void emit(Token t) { 
    super.setToken(t); 
    tokens.offer(t); 
    } 

    @Override 
    public Token nextToken() { 
    // Check if the end-of-file is ahead and there are still some DEDENTS expected. 
    if (_input.LA(1) == EOF && !this.indents.isEmpty()) { 
     // Remove any trailing EOF tokens from our buffer. 
     for (int i = tokens.size() - 1; i >= 0; i--) { 
     if (tokens.get(i).getType() == EOF) { 
      tokens.remove(i); 
     } 
     } 

     // First emit an extra line break that serves as the end of the statement. 
     this.emit(commonToken(Python3Parser.NEWLINE, "\n")); 

     // Now emit as much DEDENT tokens as needed. 
     while (!indents.isEmpty()) { 
     this.emit(createDedent()); 
     indents.pop(); 
     } 

     // Put the EOF back on the token stream. 
     this.emit(commonToken(Python3Parser.EOF, "<EOF>")); 
    } 

    Token next = super.nextToken(); 

    if (next.getChannel() == Token.DEFAULT_CHANNEL) { 
     // Keep track of the last token on the default channel. 
     this.lastToken = next; 
    } 

    return tokens.isEmpty() ? next : tokens.poll(); 
    } 

    private Token createDedent() { 
    CommonToken dedent = commonToken(Python3Parser.DEDENT, ""); 
    dedent.setLine(this.lastToken.getLine()); 
    return dedent; 
    } 

    private CommonToken commonToken(int type, String text) { 
    int stop = this.getCharIndex() - 1; 
    int start = text.isEmpty() ? stop : stop - text.length() + 1; 
    return new CommonToken(this._tokenFactorySourcePair, type, DEFAULT_TOKEN_CHANNEL, start, stop); 
    } 

    // Calculates the indentation of the provided spaces, taking the 
    // following rules into account: 
    // 
    // "Tabs are replaced (from left to right) by one to eight spaces 
    // such that the total number of characters up to and including 
    // the replacement is a multiple of eight [...]" 
    // 
    // -- https://docs.python.org/3.1/reference/lexical_analysis.html#indentation 
    static int getIndentationCount(String spaces) { 
    int count = 0; 
    for (char ch : spaces.toCharArray()) { 
     switch (ch) { 
     case '\t': 
      count += 8 - (count % 8); 
      break; 
     default: 
      // A normal space char. 
      count++; 
     } 
    } 

    return count; 
    } 

    boolean atStartOfInput() { 
    return super.getCharPositionInLine() == 0 && super.getLine() == 1; 
    } 
} 

single_input 
: NEWLINE 
| simple_stmt 
| compound_stmt NEWLINE 
; 

// more parser rules 

NEWLINE 
: ({atStartOfInput()}? SPACES 
    | ('\r'? '\n' | '\r') SPACES? 
    ) 
    { 
    String newLine = getText().replaceAll("[^\r\n]+", ""); 
    String spaces = getText().replaceAll("[\r\n]+", ""); 
    int next = _input.LA(1); 
    if (opened > 0 || next == '\r' || next == '\n' || next == '#') { 
     // If we're inside a list or on a blank line, ignore all indents, 
     // dedents and line breaks. 
     skip(); 
    } 
    else { 
     emit(commonToken(NEWLINE, newLine)); 
     int indent = getIndentationCount(spaces); 
     int previous = indents.isEmpty() ? 0 : indents.peek(); 
     if (indent == previous) { 
     // skip indents of the same size as the present indent-size 
     skip(); 
     } 
     else if (indent > previous) { 
     indents.push(indent); 
     emit(commonToken(Python3Parser.INDENT, spaces)); 
     } 
     else { 
     // Possibly emit more than 1 DEDENT token. 
     while(!indents.isEmpty() && indents.peek() > indent) { 
      this.emit(createDedent()); 
      indents.pop(); 
     } 
     } 
    } 
    } 
; 

// more lexer rules 

摘自:https://github.com/antlr/grammars-v4/blob/master/python3/Python3.g4

+0

Thanx,它的工作原理:) – Astronavigator 2011-12-27 11:32:34

+0

不客气@Astronavigator。 – 2011-12-27 11:54:52

+0

Hi @Bart Kiers,我该如何克服领先和尾随的linebreaks限制?我试图让它在开始解析之前以编程方式发出缩进标记,但没有运气。 – ains 2013-09-05 10:02:12

3

你是否看了Python ANTLR grammar

编辑:增加了对创建INDENT/DEDENT语言伪Python代码令牌

UNKNOWN_TOKEN = 0 
INDENT_TOKEN = 1 
DEDENT_TOKEN = 2 

# filestream has already been processed so that each character is a newline and 
# every tab outside of quotations is converted to 8 spaces. 
def GetIndentationTokens(filestream): 
    # Stores (indentation_token, line, character_index) 
    indentation_record = list() 
    line = 0 
    character_index = 0 
    column = 0 
    counting_whitespace = true 
    indentations = list() 
    for c in filestream: 
     if IsNewLine(c): 
      character_index = 0 
      column = 0 
      line += 1 
      counting_whitespace = true 
     elif c != ' ' and counting_whitespace: 
      counting_whitespace = false 
      if(len(indentations) == 0): 
       indentation_record.append((token, line, character_index)) 
      else: 
       while(len(indentations) > 0 and indentations[-1] != column: 
        if(column < indentations[-1]): 
         indentations.pop() 
         indentation_record.append((
          DEDENT, line, character_index)) 
        elif(column > indentations[-1]): 
         indentations.append(column) 
         indentation_record.append((
          INDENT, line, character_index)) 

     if not IsNewLine(c): 
      column += 1 

     character_index += 1 
    while(len(indentations) > 0): 
     indentations.pop() 
     indentation_record.append((DEDENT_TOKEN, line, character_index)) 
    return indentation_record 
+0

是的。该语法没有实现INDENT和DEDENT规则。看起来这个语法不使用标准的词法分析器...... – Astronavigator 2011-12-27 09:15:31

+0

@Astronavigator那么,看过[Python的词法分析的缩进方法](http://docs.python.org/reference/lexical_analysis.html#indentation),他们的INDENT和DEDENT令牌是在单独的过程中生成的(可以在传递给ANTLR之前执行)。当你以他们的方式来看待它时,它就更简单了。 – 2011-12-27 10:09:22

+0

Thanx for answer,JSPerfUnknown。那么,在传递给ANTLR之前执行INDENT和DEDENT令牌是一个很好的观点。我会考虑一下。现在我宁愿只使用标准词法分析器,所以接受Bart的答案。 – Astronavigator 2011-12-27 11:44:14

4

有ANTLR v4的开放源代码库antlr-denter,可帮助为您解析缩进和缩进。看看它的README如何使用它。

由于它是一个库,而不是将代码片段复制并粘贴到您的语法中,所以它的缩进处理可以与其他语法分开更新。

1

有一个相对简单的方法来做这个ANTLR,我写作一个实验:Dent.g4。该解决方案与本页提到的其他解决方案不同,这些解决方案由Kiers和Shavit撰写。它仅通过超驰Lexer的nextToken()方法与运行时间集成。它通过检查令牌来完成它的工作:(1)一个NEWLINE令牌触发“记录缩进”阶段的开始; (2)设置为通道HIDDEN的空格和注释分别在该阶段被计数和忽略;和(3)任何非HIDDEN令牌结束阶段。因此控制缩进逻辑是设置令牌通道的简单事情。

这个页面提到的两个解决方案都需要一个NEWLINE标记来获取所有后续的空白,但这样做不能处理中断该空白的多行注释。相反,Dent将NEWLINE和空白标记分开,并可处理多行注释。

你的语法会被设置成如下所示。请注意,NEWLINE和WS词法分析器规则具有控制pendingDent状态并通过indentCount变量跟踪缩进级别的操作。

grammar MyGrammar; 

tokens { INDENT, DEDENT } 

@lexer::members { 
    // override of nextToken(), see Dent.g4 grammar on github 
    // https://github.com/wevrem/wry/blob/master/grammars/Dent.g4 
} 

script : (NEWLINE | statement)* EOF ; 

statement 
    : simpleStatement 
    | blockStatements 
    ; 

simpleStatement : LEGIT+ NEWLINE ; 

blockStatements : LEGIT+ NEWLINE INDENT statement+ DEDENT ; 

NEWLINE : ('\r'? '\n' | '\r') { 
    if (pendingDent) { setChannel(HIDDEN); } 
    pendingDent = true; 
    indentCount = 0; 
    initialIndentToken = null; 
} ; 

WS : [ \t]+ { 
    setChannel(HIDDEN); 
    if (pendingDent) { indentCount += getText().length(); } 
} ; 

BlockComment : '/*' (BlockComment | .)*? '*/' -> channel(HIDDEN) ; // allow nesting comments 
LineComment : '//' ~[\r\n]* -> channel(HIDDEN) ; 

LEGIT : ~[ \t\r\n]+ ~[\r\n]*; // Replace with your language-specific rules... 
相关问题