Exact string search in XML files?

-1

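I need to search a number of XML files (all named pom.xml, including those in subfolders) for exactly the following text sequence, so that I am warned if someone has written any text, or even whitespace, inside it: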

 <!-- 
    | Startsection 
    |-->   
    <!-- 
    | Endsection 
    |-->

I run the Python script below, but it still does not match exactly: I also get the warning when only part of the text is present:

import re
import os
from os.path import join

comment = re.compile(r"<!--\s+| Startsection\s+|-->\s+<!--\s+| Endsection\s+|-->")
tag = "<module>"

for root, dirs, files in os.walk("."):
    if "pom.xml" in files:
        p = join(root, "pom.xml")
        print("Checking", p)
        with open(p) as f:
            s = f.read()
        if tag in s and comment.search(s):
            print("Matched", p)

Update #3

If <module> tags are present, I expect the script to print the content found between |--> and <!-- in the searched sequence:

<!-- 
| Startsection 
|-->   
<!-- 
| Endsection 
|-->

For example, after printing "Matched" and the file name, it should also print "example.test1" in the following case:

 <!-- 
    | Startsection 
    |-->   
     <module>example.test1</module> 
    <!-- 
    | Endsection 
    |-->

Update #4

The following should be used:

import re
import os
from os.path import join

comment = re.compile(r"<!--\s+\| Startsection\s+\|-->\s+<!--\s+\| Endsection\s+\|-->", re.MULTILINE)
tag = "<module>"

for root, dirs, files in os.walk("/home/temp/test_folder/"):
    # Prune directories that should not be searched
    for skipped in ("test1", "test2", ".repotest"):
        if skipped in dirs:
            dirs.remove(skipped)

    if "pom.xml" in files:
        p = join(root, "pom.xml")
        print("Checking", p)
        with open(p) as f:
            s = f.read()
        if tag in s and comment.search(s):
            print("The following files are corrupted ", p)

Update #5

import re
import os
import xml.etree.ElementTree as etree
from bs4 import BeautifulSoup
from bs4 import Comment

from os.path import join

comment = re.compile(r"<!--\s+\| Startsection\s+\|-->\s+<!--\s+\| Endsection\s+\|-->", re.MULTILINE)
tag = "<module>"

for root, dirs, files in os.walk("myfolder"):
    # Prune directories that should not be searched
    for skipped in ("model", "doc"):
        if skipped in dirs:
            dirs.remove(skipped)

    if "pom.xml" in files:
        p = join(root, "pom.xml")
        print("Checking", p)
        with open(p) as f:
            s = f.read()
        if tag in s and comment.search(s):
            print("ERROR: The following file is corrupted", p)

        soup = BeautifulSoup(s, "html.parser")
        # Extract all comments
        comments = soup.find_all(string=lambda text: isinstance(text, Comment))
        for c in comments:
            # Check if it's the start of the code
            if "Start of user code" in c:
                modules = [m for m in c.find_next_siblings(name="module")]
                for mod in modules:
                    print(mod.text)

Source

2016-08-17 user2961008

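Please don't parse XML with regular expressions. It is a terrible idea that makes experienced programmers cry. Try [BeautifulSoup](https://www.crummy.com/software/BeautifulSoup/) or its underlying library [lxml](https://pypi.python.org/pypi/lxml) –

I want to store the exact sequence in an external file. How can I achieve that? Can you help me? Thanks! – user2961008

@AdamSmith, ...the difficulty here is that they want to find comments, which are not really something that shows up in the DOM tree. –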

Don't parse XML files with regular expressions. The best Stack Overflow answer ever explains why.

You can use BeautifulSoup to help with this task.

See how simple it is to extract something from your snippet:

from bs4 import BeautifulSoup 

content = """ 
    <!-- 
    | Start of user code (user defined modules) 
    |--> 

    <!-- 
    | End of user code 
    |--> 
""" 

bs = BeautifulSoup(content, "html.parser") 
print(''.join(bs.contents))

Of course, you can also use your XML file instead of the literal I am using:

bs = BeautifulSoup(open("pom.xml"), "html.parser")

A small example using your expected input:

from bs4 import BeautifulSoup
from bs4 import Comment

# "pom.xml" stands in for the path of the file being checked
with open("pom.xml") as f:
    soup = BeautifulSoup(f, "html.parser")
# Extract all comments
comments = soup.find_all(string=lambda text: isinstance(text, Comment))
for c in comments:
    # Check if it's the start of the code
    if "Start of user code" in c:
        # <module> tags that follow the comment at the same nesting level
        for mod in c.find_next_siblings(name="module"):
            print(mod.text)

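However, if your code always sits inside module tags, I don't see why you need to care about the comments before and after it; you can find the code directly through the module tags.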
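As a rough illustration of looking up the module tags directly, here is a minimal sketch. It uses the standard library's xml.etree.ElementTree rather than BeautifulSoup, a swapped-in parser chosen only so that the example has no third-party dependency. The sample POM content and the helper name `module_names` are assumptions made up for this illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical pom.xml content, made up for this example.
SAMPLE_POM = """<project xmlns="http://maven.apache.org/POM/4.0.0">
  <modules>
    <!--
    | Startsection
    |-->
    <module>example.test1</module>
    <!--
    | Endsection
    |-->
  </modules>
</project>"""


def module_names(xml_text):
    """Return the text of every <module> element, ignoring XML namespaces."""
    root = ET.fromstring(xml_text)  # the default parser drops comments
    names = []
    for elem in root.iter():
        # Namespaced tags look like "{uri}module"; keep only the local name.
        local_name = elem.tag.rsplit("}", 1)[-1]
        if local_name == "module" and elem.text:
            names.append(elem.text.strip())
    return names


print(module_names(SAMPLE_POM))  # prints ['example.test1']
```

Because the default parser discards comments, this approach only answers "which modules are declared?"; it cannot tell whether a module sits between the two marker comments.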

Source

2016-08-18 09:58:15 danielfranca

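Thank you very much! How can I embed this in my code? – user2961008

For the cases we print because they match, is it also possible to print what is written between |--> and <!--? Thanks! :) – user2961008

Yes, you can call .text or .find; see the documentation for a full overview of the BS API: https://www.crummy.com/software/BeautifulSoup/bs4/doc/ – danielfranca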

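The "|()" characters must be escaped; also add re.MULTILINE to the regular expression.

comment=re.compile(r"<!--\s+\| Startsection\s+\|-->\s+<!--\s+\| Endsection\s+\|-->", re.MULTILINE)

Edit: you can also put a newline in your regular expression: \n

Any amount of whitespace (or none) would be: \s*

You can find more information about Python regular expressions here: https://docs.python.org/2/library/re.html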
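To make the escaping advice above concrete, and to catch a stray blank line as well, the sketch below builds a stricter pattern: `[ \t]*` still permits indentation and trailing blanks, but each line break must be exactly one `\n`, so an extra ENTER no longer matches. It assumes Unix line endings; the names `STRICT_EMPTY_SECTION`, `SECTION_BODY`, and `check_section` are hypothetical and not part of the original answer.

```python
import re

# The two comments with nothing but a single line break between them.
# [ \t]* allows indentation and trailing blanks; every \n must occur exactly once.
STRICT_EMPTY_SECTION = re.compile(
    r"<!--[ \t]*\n[ \t]*\| Startsection[ \t]*\n[ \t]*\|-->[ \t]*\n"
    r"[ \t]*<!--[ \t]*\n[ \t]*\| Endsection[ \t]*\n[ \t]*\|-->"
)

# Looser pattern that captures whatever sits between the two comments.
SECTION_BODY = re.compile(
    r"\|[ \t]*Startsection\s*\|-->(.*?)<!--\s*\|[ \t]*Endsection",
    re.DOTALL,
)


def check_section(text):
    """Return None if the section is untouched, otherwise the stripped text
    found between the markers (empty if only whitespace was added or the
    markers are missing)."""
    if STRICT_EMPTY_SECTION.search(text):
        return None
    match = SECTION_BODY.search(text)
    return match.group(1).strip() if match else ""


clean = "<!--\n    | Startsection\n    |-->   \n    <!--\n    | Endsection\n    |-->"
edited = clean.replace("|-->   \n", "|-->   \n    <module>example.test1</module>\n", 1)

print(check_section(clean))   # prints None
print(check_section(edited))  # prints <module>example.test1</module>
```

Files saved with Windows line endings would need `\r?\n` in place of each `\n`; that variation is left out to keep the sketch short.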

Source

2016-08-17 23:38:11 user2592704

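Thank you very much! This is a good solution, but can it be made stricter? For example, what if someone inserts an ENTER between lines 3 and 4? If possible, I would like to cover that case too. – user2961008

Any hints regarding my previous comment, please? – user2961008

Can an ENTER between lines 3 and 4 of the input also be detected? I can only detect when there are more or fewer characters; I would like to detect spaces or TABs as well. Thanks! :) – user2961008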
