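Exact string search in XML files?

I need to search a set of XML files (all named pom.xml, including those in subfolders) for exactly the following text sequence, so that I get a warning if someone has written any text, or even whitespace, inside it: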
<!--
| Startsection
|-->
<!--
| Endsection
|-->
I run the Python script below, but it still does not match exactly: I get the warning even when only part of the sequence is present:
import re
import os
from os.path import join
comment=re.compile(r"<!--\s+| Startsection\s+|-->\s+<!--\s+| Endsection\s+|-->")
tag="<module>"
for root, dirs, files in os.walk("."):
    if "pom.xml" in files:
        p=join(root, "pom.xml")
        print("Checking",p)
        with open(p) as f:
            s=f.read()
        if tag in s and comment.search(s):
            print("Matched",p)
Update #3
If a <module> tag is present, I expect its content to be printed when it sits between `|-->` and `<!--` in the searched sequence:
<!--
| Startsection
|-->
<!--
| Endsection
|-->
For example, after reporting the match and the file name, it should also print "example.test1" in the following case:
<!--
| Startsection
|-->
<module>example.test1</module>
<!--
| Endsection
|-->
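To make the expected behaviour concrete, here is a minimal sketch, assuming the two marker comments appear exactly once per file and that `<module>` has no attributes. The names `SECTION`, `MODULE`, and `modules_in_section` are hypothetical and are not part of my script:

```python
import re

# The two marker comments, with a non-greedy capture of whatever lies between them.
# The literal "|" characters are escaped so they are not treated as alternation.
SECTION = re.compile(
    r"<!--\s+\| Startsection\s+\|-->(.*?)<!--\s+\| Endsection\s+\|-->",
    re.DOTALL,
)
# Assumption: <module> elements carry no attributes and are not nested.
MODULE = re.compile(r"<module>(.*?)</module>", re.DOTALL)

def modules_in_section(text):
    """Return the module names found between the markers.

    Returns None when the markers themselves are missing or altered,
    and an empty list when the section contains no <module> element.
    """
    match = SECTION.search(text)
    if match is None:
        return None
    return [name.strip() for name in MODULE.findall(match.group(1))]
```

An empty list means the section is untouched, a non-empty list gives the names to print, and `None` means the markers could not be found at all.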
Update #4
The following should be used:
import re
import os
from os.path import join
comment=re.compile(r"<!--\s+\| Startsection\s+\|-->\s+<!--\s+\| Endsection\s+\|-->", re.MULTILINE)
tag="<module>"
for root, dirs, files in os.walk("/home/temp/test_folder/"):
    for skipped in ("test1", "test2", ".repotest"):
        if skipped in dirs: dirs.remove(skipped)
    if "pom.xml" in files:
        p=join(root, "pom.xml")
        print("Checking",p)
        with open(p) as f:
            s=f.read()
        if tag in s and comment.search(s):
            print("The following files are corrupted ",p)
Update #5
import re
import os
import xml.etree.ElementTree as etree
from bs4 import BeautifulSoup
from bs4 import Comment
from os.path import join
comment=re.compile(r"<!--\s+\| Startsection\s+\|-->\s+<!--\s+\| Endsection\s+\|-->", re.MULTILINE)
tag="<module>"
for root, dirs, files in os.walk("myfolder"):
    for skipped in ("model", "doc"):
        if skipped in dirs: dirs.remove(skipped)
    if "pom.xml" in files:
        p=join(root, "pom.xml")
        print("Checking",p)
        with open(p) as f:
            s=f.read()
        if tag in s and comment.search(s):
            print("ERROR: The following file are corrupted",p)
        # Parse the text already read, so the file is not opened a second time
        soup = BeautifulSoup(s, "html.parser")
        # Extract all comments
        comments=soup.find_all(string=lambda text:isinstance(text,Comment))
        for c in comments:
            # Check if it's the start of the code
            if "Start of user code" in c:
                # Collect the <module> siblings that follow this comment
                modules = [m for m in c.find_next_siblings(name='module')]
                for mod in modules:
                    print(mod.text)
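Please don't parse XML with regular expressions. It's a terrible idea that makes experienced programmers cry. Try [BeautifulSoup](https://www.crummy.com/software/BeautifulSoup/) or its underlying library [lxml](https://pypi.python.org/pypi/lxml) –
I would like to store the exact sequence in an external file. How can I achieve that? Can you help me? Thanks! – user2961008
@AdamSmith, ...the difficulty here is that they want to look for comments, so it isn't really something that shows up in the DOM tree. –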
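On the last point, a parser can be told to keep comments as nodes, which would let the check run without a regex. The following is only a minimal sketch, assuming Python 3.8 or later (for `TreeBuilder(insert_comments=True)`) and assuming the marker comments sit inside the same parent element as the `<module>` tags. The function name `modules_between_markers` and its `start`/`end` parameters are hypothetical:

```python
import xml.etree.ElementTree as ET

def modules_between_markers(xml_text, start="Startsection", end="Endsection"):
    """Collect the text of <module> elements that lie between two marker comments."""
    # insert_comments=True (Python 3.8+) keeps comments in the parsed tree
    parser = ET.XMLParser(target=ET.TreeBuilder(insert_comments=True))
    root = ET.fromstring(xml_text, parser=parser)
    found = []
    for parent in root.iter():
        inside = False
        for child in parent:
            if child.tag is ET.Comment:
                # Comment nodes carry their content in .text
                text = child.text or ""
                if start in text:
                    inside = True
                elif end in text:
                    inside = False
            elif inside and isinstance(child.tag, str):
                # Strip a namespace prefix such as "{http://...}module"
                if child.tag.rsplit("}", 1)[-1] == "module":
                    found.append((child.text or "").strip())
    return found
```

A non-empty result for a given pom.xml would mean that someone has placed modules inside the section. This does not detect stray text or whitespace between the markers, so it complements the exact-match check instead of replacing it.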