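Python: BeautifulSoup modifying text

I need to post-process a large number of XHTML files that I did not generate, so I can't fix the code that generated them. I can't use regular expressions to blast through the whole file, only highly selective parts, because there are links and IDs containing digits that I must not change globally.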

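I have simplified the example a lot, because the original files contain RTL text. I only want to modify the digits in the visible text, not in the markup. There appear to be 3 different cases.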

Case 1:
from bk1.xhtml
Snippet of a linked cross-reference: the xt span has embedded bookref text that contains the digits

<aside epub:type='footnote' id="FN96"><p class="x"><a class="notebackref" href="#bk1_21_9"><span class="notemark">*</span>text</a> 
<span class="xt"> <a class='bookref' href='bk50.xhtml#bk50_118_26'>some text with these digits: 26:118</a></span></p></aside>

Case 2: cross-reference with no link: the xt span holds the digits directly, with no embedded bookref text

<aside epub:type='footnote' id="FN100"><p class="x"><a class="notebackref" href="#bk1_21_42"><span class="notemark">*</span>text</a> 
<span class="xt">some text with these digits: 26:118</span></p></aside>

Case 3: footnote with no link, but with digits in the ft text

<aside epub:type='footnote' id="FN107"><p class="f"><a class="notebackref" href="#bk1_22_44"><span class="notemark">§</span>text</a> 
<span class="ft">some text with these digits: 22</span></p></aside>

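I'm trying to figure out how to identify the text strings inside the user-visible parts, so that I can modify only the relevant digits:

Case 1: I need to capture just <a class='bookref' href='bk1.xhtml#bk1_118_26'>some text 26:118</a>, assign the "some text 26:118" substring to a variable, and run a regular expression against that variable; then put the substring back into the original file.

Case 2: I only need to capture <span class="xt">some text 26:118</span>, change the digits in the "some text 26:118" substring by running a regular expression against that variable, and then put the substring back into the original file.

Case 3: I only need to capture <span class="ft">some text 22</span>, change the digits in the "some text 22" substring by running a regular expression against that variable, and then put the substring back into the original file.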

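I have thousands of these to do, spread across many files. I know how to iterate over the files.

After I have processed all the patterns in one file, I need to write out the changed tree.

I just need to post-process it to fix the text.

I have been googling, reading, and watching a lot of tutorials, and I'm confused.

Thank you for any help.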

Source

2017-08-09 rmcape

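It looks like you want the .replaceWith() method (spelled replace_with() in current Beautiful Soup releases). First you have to find every occurrence of the text you want to match: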

from bs4 import BeautifulSoup

cases = '''
<aside epub:type='footnote' id="FN96"><p class="x"><a class="notebackref" href="#bk1_21_9"><span class="notemark">*</span>text</a>
<span class="xt"> <a class='bookref' href='bk50.xhtml#bk50_118_26'>some text with these digits: 26:118</a></span></p></aside>

<aside epub:type='footnote' id="FN100"><p class="x"><a class="notebackref" href="#bk1_21_42"><span class="notemark">*</span>text</a>
<span class="xt">some text with these digits: 26:118</span></p></aside>

<aside epub:type='footnote' id="FN107"><p class="f"><a class="notebackref" href="#bk1_22_44"><span class="notemark">§</span>text</a>
<span class="ft">some text with these digits: 22</span></p></aside>
'''

soup = BeautifulSoup(cases, 'lxml')  # requires the lxml package

# One query per case; find_all is the current name of findAll
case1 = soup.find_all('a', {'class': 'bookref'})
case2 = soup.find_all('span', {'class': 'xt'})
case3 = soup.find_all('span', {'class': 'ft'})

for match in case1 + case2 + case3:
    # .string is None when the tag has more than one child (the case 1 span)
    text = match.string
    print(text)
    if text:
        newText = text.replace('some text', 'modified!')  # this line is your regex things
        text.replace_with(newText)  # current name of replaceWith

The print(text) in the loop prints:

some text with these digits: 26:118 
None 
some text with these digits: 26:118 
some text with these digits: 22

If we run the loop again, it now prints:

modified! with these digits: 26:118 
None 
modified! with these digits: 26:118 
modified! with these digits: 22
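For the "regex things" step, one possible sketch is to walk every text node under the xt and ft spans and run the regular expression on each node, which leaves href and id values untouched because they are attributes, not text nodes. The names fix_digits and fix_visible_digits are hypothetical, and wrapping digits in brackets is only a placeholder for whatever change is really needed. The sketch uses the built-in html.parser so it runs without lxml; that choice is an assumption, not something the question requires.

```python
import re
from bs4 import BeautifulSoup

DIGITS = re.compile(r"\d+")

def fix_digits(text):
    # Placeholder transformation (an assumption): wrap each digit run in brackets.
    return DIGITS.sub(lambda m: "[" + m.group(0) + "]", text)

def fix_visible_digits(soup):
    # Visit every text node under the xt/ft spans, including text that is
    # nested inside an <a class="bookref"> (case 1).
    for span in soup.find_all("span", class_=["xt", "ft"]):
        for node in span.find_all(string=True):
            new_text = fix_digits(node)
            if new_text != node:
                node.replace_with(new_text)
    return soup

sample = '''<p class="x"><a class="notebackref" href="#bk1_21_9">text</a>
<span class="xt"> <a class="bookref" href="bk50.xhtml#bk50_118_26">digits: 26:118</a></span>
<span class="ft">digits: 22</span></p>'''

soup = fix_visible_digits(BeautifulSoup(sample, "html.parser"))
result = str(soup)
print(result)
```

Because the inner loop works on text nodes rather than on .string, the span that printed None above is handled by the same code path as the other two cases.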

Source

2017-08-09 23:31:22

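Does this address the requirement "After I have processed all the patterns in one file, I need to write out the changed tree"? – LarsH

@LarsH I missed that requirement, but I think it can easily be done by just writing 'text' to a file. –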
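On writing out the changed tree: a minimal sketch, assuming it is the whole modified soup (not a single text string) that should be saved, is to serialize it with str(soup) and write that to a file. The helper names process_file and demo_modify, the file names, and the use of html.parser are all assumptions made for illustration. The parser choice affects how the markup is serialized, so it is worth diffing one output file against its original before running this over thousands of files.

```python
import tempfile
from pathlib import Path
from bs4 import BeautifulSoup

def process_file(src, dst, modify):
    # Parse the file, let `modify` edit the tree in place, then serialize it.
    soup = BeautifulSoup(Path(src).read_text(encoding="utf-8"), "html.parser")
    modify(soup)
    Path(dst).write_text(str(soup), encoding="utf-8")

def demo_modify(soup):
    # Stand-in for the real digit fix: change 22 to 23 in ft spans only.
    for span in soup.find_all("span", class_="ft"):
        if span.string:
            span.string.replace_with(span.string.replace("22", "23"))

with tempfile.TemporaryDirectory() as tmp:
    src = Path(tmp) / "in.xhtml"    # hypothetical file names
    dst = Path(tmp) / "out.xhtml"
    src.write_text('<aside id="FN107"><span class="ft">digits: 22</span></aside>',
                   encoding="utf-8")
    process_file(src, dst, demo_modify)
    output = dst.read_text(encoding="utf-8")

print(output)
```

Writing to a separate destination path keeps the original files intact until the results have been checked.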
