2017-10-10 47 views
2

Java - Matching re-read words

I'm trying to create a lexical analyzer for Delphi in Java. Here's the sample code:

String[] keywords={"array","as","asm","begin","case","class","const","constructor","destructor","dispinterface","div","do","downto","else","end","except","exports","file","finalization","finally","for","function","goto","if","implementation","inherited","initialization","inline","interface","is","label","library","mod","nil","object","of","out","packed","procedure","program","property","raise","record","repeat","resourcestring","set","shl","shr","string","then","threadvar","to","try","type","unit","until","uses","var","while","with"}; 
String[] relation={"=","<>","<",">","<=",">="}; 
String[] logical={"and","not","or","xor"}; 
Matcher matcher = null; 
for(int i=0;i<keywords.length;i++){ 
    matcher=Pattern.compile(keywords[i]).matcher(line); 
    if(matcher.find()){ 
    System.out.println("Keyword"+"\t\t"+matcher.group()); 
    } 
} 
for(int i1=0;i1<logical.length;i1++){ 
    matcher=Pattern.compile(logical[i1]).matcher(line); 
    if(matcher.find()){ 
    System.out.println("logic_op"+"\t\t"+matcher.group()); 
    } 
}  
for(int i2=0;i2<relation.length;i2++){ 
    matcher=Pattern.compile(relation[i2]).matcher(line); 
    if(matcher.find()){ 
    System.out.println("relational_op"+"\t\t"+matcher.group()); 
    } 
} 

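When I run the program it works, but it re-reads certain words, so the program reports 2 tokens for them. For example, record is a keyword, but it is read again and the logical operator token "or" is reported from rec"or"d. How can I stop words from being re-read? Thanks!

Answers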

1

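As the answer by EvanM says, you need to add a \b word-boundary matcher before and after each keyword to prevent it from matching a substring inside a longer word.

For better performance, you should also use the | alternation operator to match one of several values instead of creating multiple matchers. That way only one regex has to be compiled, and line is scanned once per token type instead of once per keyword.

You can even combine the 3 different kinds of tokens you're looking for into a single regex and use capture groups to tell them apart, so line only has to be scanned once in total.

Like this: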

String regex = "\\b(array|as|asm|begin|case|class|const|constructor|destructor|dispinterface|div|do|downto|else|end|except|exports|file|finalization|finally|for|function|goto|if|implementation|inherited|initialization|inline|interface|is|label|library|mod|nil|object|of|out|packed|procedure|program|property|raise|record|repeat|resourcestring|set|shl|shr|string|then|threadvar|to|try|type|unit|until|uses|var|while|with)\\b" + 
       "|(=|<[>=]?|>=?)" + 
       "|\\b(and|not|or|xor)\\b"; 
for (Matcher m = Pattern.compile(regex).matcher(line); m.find();) { 
    if (m.start(1) != -1) {          // group 1: keywords
        System.out.println("Keyword\t\t" + m.group(1)); 
    } else if (m.start(2) != -1) {   // group 2: =, <, <>, <=, >, >=
        System.out.println("relational_op\t\t" + m.group(2)); 
    } else {                         // group 3: and, not, or, xor
        System.out.println("logic_op\t\t" + m.group(3)); 
    } 
} 

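You could optimize it even further by combining common prefixes of the keywords, e.g. as|asm could become asm?, i.e. as optionally followed by m. That makes the keyword list less readable, but it performs better.

In the code above, I did that for the relational operators, to show how, and also to fix a matching error in the original code, where a >= in line would be reported 3 times, as =, >, and >=, in that order. That problem is similar to the sub-keyword problem asked about in the question.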
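To make the two points above concrete, here is a minimal, self-contained sketch of the combined-regex approach. The class name CombinedLexerDemo, the tokenize helper, the sample input, and the shortened keyword list are assumptions made for illustration; they are not part of the original answer.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class CombinedLexerDemo {
    // Shortened keyword list for illustration only (assumption);
    // "asm?" covers both "as" and "asm" via the shared prefix.
    private static final Pattern TOKEN = Pattern.compile(
            "\\b(asm?|begin|end|if|record|then)\\b"   // group 1: keywords
            + "|(=|<[>=]?|>=?)"                        // group 2: relational operators
            + "|\\b(and|not|or|xor)\\b");              // group 3: logical operators

    // Scans the line once and labels each match by the group that captured it.
    public static List<String> tokenize(String line) {
        List<String> tokens = new ArrayList<>();
        Matcher m = TOKEN.matcher(line);
        while (m.find()) {
            if (m.start(1) != -1) {
                tokens.add("Keyword " + m.group(1));
            } else if (m.start(2) != -1) {
                tokens.add("relational_op " + m.group(2));
            } else {
                tokens.add("logic_op " + m.group(3));
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        for (String token : tokenize("if a >= b or c then record")) {
            System.out.println(token);
        }
    }
}
```

With this sketch, >= is reported as a single relational operator, and record is reported only as a keyword, because the matcher resumes scanning after the end of each match.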

+0

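Thanks! I found that it read certain combined symbols just as you said, where '>=' was split into 3 logical symbols. This helped me with that too. Thanks! – quSci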

3

Add \b as a word break around the keyword in your regex. So:

Pattern.compile("\\b" + keywords[i] + "\\b") 

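will ensure that the characters on either side of your word are not word characters (letters, digits, or underscore).

That way "record" will only match "record", not "or".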
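As a rough illustration of the effect, the sketch below applies the \b-wrapped pattern to the logical operators from the question. The class name WordBoundaryDemo, the findWholeWords helper, and the sample line are assumptions made for this example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class WordBoundaryDemo {
    // Collects every whole-word occurrence of any of the given words in line.
    public static List<String> findWholeWords(String[] words, String line) {
        List<String> found = new ArrayList<>();
        for (String word : words) {
            // \b on both sides rejects matches inside a longer word.
            Matcher m = Pattern.compile("\\b" + word + "\\b").matcher(line);
            while (m.find()) {
                found.add(m.group());
            }
        }
        return found;
    }

    public static void main(String[] args) {
        String[] logical = {"and", "not", "or", "xor"};
        // The "or" inside "record" is skipped; only the standalone "or" is found.
        System.out.println(findWholeWords(logical, "record x or y"));
    }
}
```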

+0

Thank you very much! It works! – quSci

+1

Although the keywords are unlikely to contain special characters, you should still escape them: 'Pattern.compile("\\b" + Pattern.quote(keywords[i]) + "\\b")' – Andreas
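A small sketch of why the escaping matters, assuming a contrived token "a.b" that contains a regex metacharacter. The token, the class name QuoteDemo, and the wholeWord helper are made up for this example.

```java
import java.util.regex.Pattern;

public class QuoteDemo {
    // Builds a whole-word pattern that treats the token as literal text.
    public static Pattern wholeWord(String token) {
        return Pattern.compile("\\b" + Pattern.quote(token) + "\\b");
    }

    public static void main(String[] args) {
        // Unquoted, the dot matches any character, so "axb" is accepted.
        System.out.println(Pattern.compile("\\ba.b\\b").matcher("axb").find());
        // Quoted, only the literal text "a.b" is accepted.
        System.out.println(wholeWord("a.b").matcher("axb").find());
        System.out.println(wholeWord("a.b").matcher("a.b").find());
    }
}
```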