Python: How to find all matches in a multi-line string, but not those preceded by a specific word?

I have some SQL code, and I want to extract the table name that follows the "insert" keyword.

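Basically, I want to extract it using the following rules:

The statement contains the word "insert"
It is followed by the word "into", which is optional
Exclude the match if a "--" (a single-line comment in SQL) appears anywhere before the insert (into, optional) keyword on the same line.
Exclude the match if the insert (into, optional) keyword lies between "/*" and "*/" (a multi-line comment in SQL).
Get the next word (the table name) after the insert (into, optional) keyword

Example: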

import re 

lines = """begin insert into table_1 end 
    begin insert table_2 end 
    select 1 --This is will not insert into table_3 
    begin insert into 
     table_4 
    end 
    /* this is a comment 
    insert into table_5 
    */ 
    insert into table_6 
    """ 

p = re.compile(r'^((?!--).)*\binsert\b\s+(?:into\s*)?.*', flags=re.IGNORECASE | re.MULTILINE) 
for m in re.finditer(p, lines): 
    line = lines[m.start(): m.end()].strip() 

    starts_with_insert = re.findall(r'insert.*', line, flags=re.IGNORECASE|re.MULTILINE|re.DOTALL) 
    print(re.compile(r'insert\s+(?:into\s+)?', flags=re.IGNORECASE|re.MULTILINE|re.DOTALL).split(' '.join(starts_with_insert))[1].split()[0])

Actual result:

table_1 
table_2 
table_4 
table_5 
table_6

Expected result: table_5 should not be returned, because it lies between /* and */

table_1 
table_2 
table_4 
table_6

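Is there an elegant way to do this?

Thanks in advance.

Edit: Thanks for your solutions. Is it possible to do this with a pure regex, without stripping lines from the original text?

I want to show the line number at which each table name is found in the original string.

Updated code below: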

import re 

lines = """begin insert into table_1 end 
    begin insert table_2 end 
    select 1 --This is will not insert into table_3 
    begin insert into 
     table_4 
    end 
    /* this is a comment 
    insert into table_5 
    */ 
    insert into table_6 
    """ 

p = re.compile(r'^((?!--).)*\binsert\b\s+(?:into\s*)?.*', flags=re.IGNORECASE | re.MULTILINE) 
for m in re.finditer(p, lines): 
    line = lines[m.start(): m.end()].strip() 
    line_no = str(lines.count("\n", 0, m.end()) + 1).zfill(6) 

    table_names = re.findall(r'(?:\binsert\s*(?:into\s*)?)(\S+)', line, flags=re.IGNORECASE|re.MULTILINE|re.DOTALL) 
    print('[line number: ' + line_no + '] ' + '; '.join(table_names))

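I tried using lookahead/lookbehind to exclude the matches between /* and */, but it does not produce my expected result.

Hoping for your help. Thanks!

Source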

2017-10-05 pren

You are forgetting that '--' and '/*' do not start a comment when they appear inside a string...
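To illustrate the point about strings: one way to handle it is to let a single pattern consume string literals as well as comments, and then drop only the comments. The following is a minimal sketch, assuming SQL strings are single-quoted with '' as the escaped quote; the helper name strip_sql_comments and the sample statement are made up for illustration and are not from the question.

```python
import re

# Order matters: a quote that opens a string is consumed as a string before
# the comment alternative gets a chance to see a "--" or "/*" inside it.
TOKEN = re.compile(
    r"""
    (?P<string>'(?:[^']|'')*')         # single-quoted literal, '' is an escaped quote
  | (?P<comment>--[^\n]*|/\*.*?\*/)    # line comment or block comment
    """,
    re.VERBOSE | re.DOTALL,
)


def strip_sql_comments(sql):
    """Remove comments but leave string literals untouched (hypothetical helper)."""
    # Strings are put back unchanged; comments are replaced by nothing.
    return TOKEN.sub(lambda m: m.group("string") or "", sql)


sample = "insert into log values ('a -- b /* c */') -- insert into fake\n"
print(repr(strip_sql_comments(sample)))
# "insert into log values ('a -- b /* c */') \n"
```

The string literal survives intact, while the trailing line comment is removed. Other quoting styles (double-quoted identifiers, dialect-specific escapes) would need their own alternatives.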

I think you should read up on the 'lookbehind assertion'
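As background for this suggestion: the standard-library re module only accepts fixed-width lookbehind, so a condition such as "no -- anywhere earlier on the line" cannot be written as a lookbehind directly. The small check below is an illustration of that limitation, not a solution to the question.

```python
import re

# A fixed-width lookbehind compiles fine, but it only rejects an "insert"
# that is immediately preceded by "--".
fixed = re.compile(r"(?<!--)\binsert\b", re.IGNORECASE)

# A variable-width lookbehind ("--" followed by any amount of text) is
# rejected by the standard-library `re` module at compile time.
try:
    re.compile(r"(?<!--.*)\binsert\b", re.IGNORECASE)
    variable_width_ok = True
except re.error:
    variable_width_ok = False

print(variable_width_ok)                         # False
print(fixed.search("--insert") is None)          # True: directly preceded by "--"
print(fixed.search("-- note: insert") is None)   # False: text sits in between
```

The last line shows why the fixed-width form is not enough here: an "insert" further along in a comment is still matched.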

In 2 steps, with the re.sub() and re.findall() functions:

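import re

# `lines` is the multi-line SQL string from the question
# removing single line/multiline comments
# (flags must be passed by keyword: the 4th positional argument of re.sub() is `count`)
stripped_lines = re.sub(r'/\*[\s\S]+?\*/\s*|.*--.*(?=\binsert).*\n?', '', lines, flags=re.I) 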

# extracting table names preceded by `insert` statement 
tbl_names = re.findall(r'(?:\binsert\s*(?:into\s*)?)(\S+)', stripped_lines, re.I) 
print(tbl_names)

Output:

['table_1', 'table_2', 'table_4', 'table_6']

Source

2017-10-05 11:11:24 RomanPerekhrest

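Hi Roman, your findall solution is simpler than my original one. I have updated the code above with your solution. Is it possible to achieve the same result without stripping the original text? – pren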
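One possible way to keep the line numbers of the original string while still ignoring comments is to blank the comments out instead of deleting them, so that every offset stays where it was. This is only a sketch under stated assumptions: the names COMMENT, INSERT and blank_out are illustrative, the sample is a de-indented copy of the question's text, and string literals containing -- or /* are not handled.

```python
import re

sql = """begin insert into table_1 end
begin insert table_2 end
select 1 --This will not insert into table_3
begin insert into
 table_4
end
/* this is a comment
insert into table_5
*/
insert into table_6
"""

COMMENT = re.compile(r"/\*.*?\*/|--[^\n]*", re.DOTALL)
INSERT = re.compile(r"\binsert\s+(?:into\s+)?(\S+)", re.IGNORECASE)


def blank_out(match):
    # Replace every character except newlines with a space, so offsets and
    # line numbers in the masked copy still line up with the original text.
    return re.sub(r"[^\n]", " ", match.group())


masked = COMMENT.sub(blank_out, sql)

# Line number = number of newlines before the table name, plus one.
found = [
    (masked.count("\n", 0, m.start(1)) + 1, m.group(1))
    for m in INSERT.finditer(masked)
]
print(found)
# [(1, 'table_1'), (2, 'table_2'), (5, 'table_4'), (10, 'table_6')]
```

Because the masked copy has the same length and the same newlines as the original, m.start(1) can also be used to slice the original string directly.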

import re

lines = """begin insert into table_1 end
    begin insert table_2 end
    select 1 --This is will not insert into table_3
    begin insert into
     table_4
    end
    /* this is a comment
    insert into table_5
    */
    insert into table_6
    """

# remove all /* */ and -- comments
comments = re.compile(r'/\*(?:.*\n)+.*\*/|--.*?\n', flags=re.IGNORECASE | re.MULTILINE)
for comment in comments.findall(lines):
    # str.replace() works on Python 2 and 3; string.replace() exists only in Python 2
    lines = lines.replace(comment, '')

fullSet = re.compile(r'insert\s+(?:into\s+)*(\S+)', flags=re.IGNORECASE | re.MULTILINE)
print(fullSet.findall(lines))

Gives:

['table_1', 'table_2', 'table_4', 'table_6']

Source

2017-10-05 12:35:51

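Thanks, Calvin, for the nice solution. It really works, but I would like to know whether it is possible to use a regex directly, without removing lines? I have updated the question above. Thanks – pren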

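Regular expressions have no mechanism for deciphering context. Removing the comments entirely guarantees that you will never match inside them. As you may be realizing, regexes quickly degenerate into an unreadable mess. I will condense my answer further. I don't think you can get this without the removal step; there are too many "if" cases. I will play with it a bit more.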
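A possible middle ground, offered as a sketch rather than a definitive answer: a single pattern can match comments as ordinary alternatives and capture a table name only in the insert branch. The regex still does not understand context; it merely consumes each comment whole, so the scan resumes after it and never sees an "insert" inside. The names PATTERN and find_insert_targets and the sample string are illustrative assumptions.

```python
import re

PATTERN = re.compile(
    r"""
    /\*.*?\*/                                  # block comment: matched, then ignored
  | --[^\n]*                                   # line comment: matched, then ignored
  | \binsert\s+(?:into\s+)?(?P<table>\S+)      # the only branch that captures
    """,
    re.IGNORECASE | re.DOTALL | re.VERBOSE,
)


def find_insert_targets(sql):
    """Return (line_number, table_name) pairs (hypothetical helper)."""
    results = []
    for m in PATTERN.finditer(sql):
        # Comment branches leave the "table" group unset, so they are skipped.
        if m.group("table") is not None:
            line_no = sql.count("\n", 0, m.start("table")) + 1
            results.append((line_no, m.group("table")))
    return results


sql = "insert into t1 -- insert into t2\n/* insert into t3\n*/ insert t4\n"
print(find_insert_targets(sql))
# [(1, 't1'), (3, 't4')]
```

The original text is never modified, so the line numbers refer to it directly. Edge cases such as a comment placed between "insert into" and the table name, or comment markers inside string literals, are not covered by this sketch.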
