Regex to parse a config file with comments

Edit: I'm really just curious how I can get this regex to work. Please don't tell me there are easier ways to do this. That's obvious! :P

I'm writing a regular expression (using Python) to parse lines in a configuration file. Lines look like this:

someoption1 = some value # some comment 
# this line is only a comment 
someoption2 = some value with an escaped \# hash 
someoption3 = some value with a \# hash # some comment

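The idea is that anything after a hash symbol is considered a comment, unless the hash is escaped with a backslash.

I'm trying to use a regex to break each line into its individual parts: leading whitespace, left side of the assignment, right side of the assignment, and comment. For the first line in the example, the breakdown would be:

Whitespace: ""
Assignment left: "someoption1 ="
Assignment right: "some value"
Comment: "# some comment"

This is the regex I have so far: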

^(\s)?(\S+\s?=)?(([^\#]*(\\\#)*)*)?(\#.*)?$

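I'm no regex expert, so feel free to tear it apart!

Using Python's re.findall(), this is what comes back:

Index 0: the whitespace, as it should be
Index 1: the left side of the assignment
Index 2: the right side of the assignment (incorrect)
Index 5: the first hash, whether escaped or not, and everything after it (incorrect)

There's probably something fundamental about regular expressions that I'm missing. If someone can solve this, I'll be forever grateful...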

2010-09-24 apeace

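Is this question about regular expressions, or about parsing configuration files in Python? If the latter, why are you writing a config file parser at all? Python's standard ConfigParser module (http://docs.python.org/library/configparser.html) should do the job! – 2010-09-24 01:41:34

I'm asking specifically about that regex. I just want to know how to do this with a regular expression. I realize there are other ways to achieve the same goal, including Python's built-in configparser module. Thanks though! – apeace 2010-09-24 01:52:28

If you don't show up front that you're not clueless, answerers will assume the worst :-) – 2010-09-24 02:19:49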

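Your regex doesn't match as intended because of the greedy matching behaviour of regular expressions: each part matches the longest substring such that the rest of the string can still be matched by the rest of the regex.

For one of your lines with an escaped #, this means:

The [^\#]* (there is no need to escape # there, by the way) matches everything before the first hash, including the backslash.
The (\\\#)* matches nothing, because at that point the string starts with #.
Finally, (\#.*) matches the rest of the string.

A simple example to highlight this potentially unintuitive behaviour: in the regex (a*)(ab)?(b*), the (ab)? will never match anything.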
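The greedy behaviour is easy to check in an interactive session. This is a minimal sketch using only the standard re module; the sample string "aabb" and the variable name greedy_groups are just illustrations:

```python
import re

# (a*) is greedy and takes every leading "a", so (ab)? is left with
# nothing it can match and its group stays unset.
m = re.match(r"(a*)(ab)?(b*)", "aabb")
greedy_groups = m.groups()
print(greedy_groups)  # ('aa', None, 'bb')
```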

I believe this regex (based on the original) should work: ^\s*(\S+\s*=([^\\#]|\\#?)*)?(#.*)?$
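One way to try that pattern against the four sample lines from the question is sketched below. The helper name split_line is made up for this illustration; it returns the assignment part (group 1) and the comment part (group 3), and note that the assignment part still contains the backslash escapes and any trailing whitespace.

```python
import re

# The pattern proposed above, copied unchanged.
PATTERN = re.compile(r"^\s*(\S+\s*=([^\\#]|\\#?)*)?(#.*)?$")

def split_line(line):
    """Return (assignment, comment); a part is None when it is absent.

    Returns None if the line does not match the pattern at all.
    """
    m = PATTERN.match(line)
    return (m.group(1), m.group(3)) if m else None

samples = [
    r"someoption1 = some value # some comment",
    r"# this line is only a comment",
    r"someoption2 = some value with an escaped \# hash",
    r"someoption3 = some value with a \# hash # some comment",
]
for sample in samples:
    print(split_line(sample))
```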

2010-09-24 02:07:48 dave

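Thanks for the info! – apeace 2010-09-24 02:12:56

Heh, hashes do have to be escaped in regular expressions in Python. Now you know! – apeace 2010-09-24 02:13:17

@apeace Really? The Python docs don't mention that, and I seem to be able to use unescaped #s without any problem... – dave 2010-09-24 02:21:32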

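I wouldn't use a regex for this, for the same reason I wouldn't try to kill a fly with a thermonuclear warhead.

Assuming you're reading a line at a time, just:

If the first character is a #, set the comment to the whole line and the line to empty.
Otherwise, find the first occurrence of # that does not come right after a \, set the comment to that plus the rest of the line, and set the line to everything before it.
Replace all occurrences of \# with #.

That's it, you now have a proper line part and comment part. By all means use regexes to split the new line part.

For example: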

import re

def fn(line):
    # Split line into non-comment and comment.

    comment = ""
    if line.startswith("#"):
        comment = line
        line = ""
    else:
        # Find the first '#' that is not directly preceded by a backslash.
        idx = re.search(r"[^\\]#", line)
        if idx is not None:
            comment = line[idx.start()+1:]
            line = line[:idx.start()+1]

    # Split non-comment into key and value.

    idx = re.search(r"=", line)
    if idx is None:
        key = line
        val = ""
    else:
        key = line[:idx.start()]
        val = line[idx.start()+1:]
    val = val.replace("\\#", "#")

    return (key.strip(), val.strip(), comment.strip())

print(fn(r"someoption1 = some value # some comment"))
print(fn(r"# this line is only a comment"))
print(fn(r"someoption2 = some value with an escaped \# hash"))
print(fn(r"someoption3 = some value with a \# hash # some comment"))

produces:

('someoption1', 'some value', '# some comment') 
('', '', '# this line is only a comment') 
('someoption2', 'some value with an escaped # hash', '') 
('someoption3', 'some value with a # hash', '# some comment')

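If you must use a regex (against my advice), your specific problem lies here:

[^\#]

This (assuming you meant the properly escaped r"[^\\#]") tries to match any single character other than \ or #, not the sequence \# as you wanted. You can do this with negative look-behinds, but I always say that once a regex becomes unreadable to a moron in a hurry, it's best to revert to procedural code :-)

On reflection, a better way to do this is a multi-level split (so that the regex doesn't get too hideous handling missing fields), as follows: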

import re

def fn(line):
    line = line.strip()                            # remove spaces
    first = re.split(r"\s*(?<!\\)#\s*", line, 1)   # get non-comment/comment
    if len(first) == 1: first.append("")           # ensure we have a comment
    first[0] = first[0].replace("\\#", "#")        # unescape non-comment

    second = re.split(r"\s*=\s*", first[0], 1)     # get key and value
    if len(second) == 1: second.append("")         # ensure we have a value
    second.append(first[1])                        # create 3-tuple
    return second                                  # and return it

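This uses a negative look-behind to correctly match the comment separator, then splits the non-comment part into key and value. Whitespace is handled properly in this one as well, giving the following: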

['someoption1', 'some value', 'some comment'] 
['', '', 'this line is only a comment'] 
['someoption2', 'some value with an escaped # hash', ''] 
['someoption3', 'some value with a # hash', 'some comment']

2010-09-24 01:37:59 paxdiablo

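Point taken. I realize it would be easy to write it another way :P I mainly just wanted to know why that regex doesn't work. That's why I asked the question! – apeace 2010-09-24 01:51:27

This solution (using a negative look-behind) does not work. See my comment on John Machin's answer, which suffers from the same problem. – ridgerunner 2011-03-12 22:03:56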

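I've already left a comment about the purpose of this question, but assuming it is purely about regular expressions, I'll still give an answer a shot.

Assuming you process the input one line at a time, I would do it in two passes. That means you'll have 2 regular expressions.

Something along the lines of (.*?(?<!\\))#(.*): split on the first # not preceded by \ (see the documentation on negative look-behind);
A regular expression that parses the assignment statement.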
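The two passes could be sketched roughly as follows. The first pattern is the one given above; the second pattern (ASSIGN_RE) and the function name parse_line are assumptions made for this illustration, not part of the answer. Like any single-character look-behind, this treats every # directly preceded by a backslash as escaped, so it does not tell \# apart from \\#.

```python
import re

# Pass 1: split on the first '#' that is not preceded by a backslash.
COMMENT_RE = re.compile(r"(.*?(?<!\\))#(.*)")
# Pass 2 (hypothetical pattern): key, '=', value with outer whitespace trimmed.
ASSIGN_RE = re.compile(r"\s*(\S+)\s*=\s*(.*?)\s*$")

def parse_line(line):
    """Return (key, value, comment) for one config line."""
    m = COMMENT_RE.match(line)
    if m:
        body, comment = m.group(1), "#" + m.group(2)
    else:
        body, comment = line, ""

    m = ASSIGN_RE.match(body)
    if m:
        # Unescape the hashes only after the comment has been removed.
        key, value = m.group(1), m.group(2).replace("\\#", "#")
    else:
        key, value = "", ""
    return key, value, comment

print(parse_line(r"someoption3 = some value with a \# hash # some comment"))
# ('someoption3', 'some value with a # hash', '# some comment')
```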

2010-09-24 01:48:34

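This seems to be what I was looking for. I'll go read up on negative look-behind. Thanks for the tip! – apeace 2010-09-24 01:56:15

This solution (using a negative look-behind) does not work. See my comment on John Machin's answer, which suffers from the same problem. – ridgerunner 2011-03-12 21:54:18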

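Try breaking it into two steps:

Handle escapes to identify the true comment (the first # not preceded by \ (hint: "negative lookbehind")), remove the true comment, then replace r"\#" with "#".
Process the comment-free remainder.

BIG HINT: use re.VERBOSE with comments.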
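A sketch of the first step written in re.VERBOSE style is shown below. The names COMMENT_SPLIT and strip_comment are made up for this illustration. Note that in VERBOSE mode an unescaped # outside a character class starts a regex comment, so the literal hash has to be written as \# here.

```python
import re

COMMENT_SPLIT = re.compile(r"""
    \s*        # optional whitespace before the comment
    (?<!\\)    # the hash must not be preceded by a backslash
    \#         # the hash itself, escaped because of VERBOSE mode
    \s*        # optional whitespace after the hash
    """, re.VERBOSE)

def strip_comment(line):
    """Return (body, comment), with escaped hashes in body unescaped."""
    parts = COMMENT_SPLIT.split(line, maxsplit=1)
    body = parts[0].replace("\\#", "#")
    comment = parts[1] if len(parts) > 1 else ""
    return body, comment

print(strip_comment(r"someoption3 = some value with a \# hash # some comment"))
# ('someoption3 = some value with a # hash', 'some comment')
```

The remainder (body) can then be handed to whatever parses the assignment.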

2010-09-24 01:48:56

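Negative lookbehind doesn't work in this case. There is no way to know how many backslashes precede the #, and the meaning changes with the count (e.g. '#', '\\#' and '\\\\#' start a comment, while '\#' and '\\\#' do not). Gumbo's solution demonstrates the correct way to solve this. But I agree 1000% with your "VERBOSE mode with comments" recommendation! – ridgerunner 2011-03-12 21:46:17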
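The limitation described in this comment can be reproduced with a few lines. This is only an illustration; the helper name has_comment is made up, and the "expected" reading of \\# as an escaped backslash followed by a real comment is the convention stated in the comment above.

```python
import re

# A hash counts as a comment start only if no backslash sits right before it.
UNESCAPED_HASH = re.compile(r"(?<!\\)#")

def has_comment(line):
    return UNESCAPED_HASH.search(line) is not None

print(has_comment(r"value # comment"))    # True
print(has_comment(r"value \# literal"))   # False
# Two backslashes: the backslash itself is escaped, so the hash should
# start a comment, but the look-behind only sees one character back.
print(has_comment(r"value \\# comment"))  # False
```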

I would use this regex in multiline mode:

^\s*([a-zA-Z_][a-zA-Z_0-9]*)\s*=\s*((?:[^\\#]|\\.)+)

This allows any character to be escaped (\\.). If you only want to allow # to be escaped, use \\# instead.
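A minimal sketch of trying the pattern on the sample lines follows. It applies the pattern to one line at a time, which is an assumption of this sketch: as written, [^\\#] also matches newline characters, so running it over a whole file at once could let a value without a trailing comment run on into the next line. The helper name key_value is hypothetical.

```python
import re

# The pattern from above, compiled in multiline mode.
ASSIGNMENT = re.compile(
    r"^\s*([a-zA-Z_][a-zA-Z_0-9]*)\s*=\s*((?:[^\\#]|\\.)+)",
    re.MULTILINE)

def key_value(line):
    """Return (key, raw_value) for an assignment line, else None."""
    m = ASSIGNMENT.match(line)
    return m.groups() if m else None

for line in [
    r"someoption1 = some value # some comment",
    r"# this line is only a comment",
    r"someoption3 = some value with a \# hash # some comment",
]:
    print(key_value(line))
```

The raw value keeps its backslashes and any whitespace before the comment, so it would still need unescaping and stripping afterwards.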

2010-09-24 05:49:11 Gumbo

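This is a correct solution that actually works. It is only missing an expression to match the optional comment after the value, and the '$' anchor at the end of the string. Also, by applying Friedl's "unrolling-the-loop" technique, the '(?:[^\\#]|\\.)+' subexpression can be made more efficient: '[^#\\]*(?:\\.[^#\\]*)*' – ridgerunner 2011-03-12 21:28:26

Of the 5 solutions presented so far, only Gumbo's actually works. Here is my solution, which also works and is heavily commented: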

import re

def fn(line):
    match = re.search(
        r"""^               # Anchor to start of line
        (\s*)               # $1: Zero or more leading ws chars
        (?:                 # Begin group for optional var=value.
          (\S+)             # $2: Variable name. One or more non-spaces.
          (\s*=\s*)         # $3: Assignment operator, optional ws
          (                 # $4: Everything up to comment or EOL.
            [^#\\]*         # Unrolling the loop 1st normal*.
            (?:             # Begin (special normal*)* construct.
              \\.           # special is backslash-anything.
              [^#\\]*       # More normal*.
            )*              # End (special normal*)* construct.
          )                 # End $4: Value.
        )?                  # End group for optional var=value.
        ((?:\#.*)?)         # $5: Optional comment.
        $                   # Anchor to end of line""",
        line, re.MULTILINE | re.VERBOSE)
    return match.groups()

print(fn(r" # just a comment"))
print(fn(r" option1 = value"))
print(fn(r" option2 = value # no escape == IS a comment"))
print(fn(r" option3 = value \# 1 escape == NOT a comment"))
print(fn(r" option4 = value \\# 2 escapes == IS a comment"))
print(fn(r" option5 = value \\\# 3 escapes == NOT a comment"))
print(fn(r" option6 = value \\\\# 4 escapes == IS a comment"))

The above script produces the following (correct) output (tested with Python 3.0.1):

(' ', None, None, None, '# just a comment') 
(' ', 'option1', ' = ', 'value', '') 
(' ', 'option2', ' = ', 'value ', '# no escape == IS a comment') 
(' ', 'option3', ' = ', 'value \\# 1 escape == NOT a comment', '') 
(' ', 'option4', ' = ', 'value \\\\', '# 2 escapes == IS a comment') 
(' ', 'option5', ' = ', 'value \\\\\\# 3 escapes == NOT a comment', '') 
(' ', 'option6', ' = ', 'value \\\\\\\\', '# 4 escapes == IS a comment')

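Note that this solution uses Jeffrey Friedl's "unrolling the loop" efficiency technique (which eliminates slow alternation). It also uses no lookaround and is very fast. Friedl's book Mastering Regular Expressions is a must-read for anyone claiming to "know" regular expressions. (And when I say "know", I mean it in the Neo "I know Kung-Fu!" sense :)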

2011-03-12 23:21:02 ridgerunner
