2014-09-20

How to remove all "document.write('');" with BeautifulSoup

I have the following raw HTML, in which the <table> </table> markup is wrapped in document.write calls:

document.write('<table>'); 
document.write(' 
<tr> 
    <td> 
    <span class="prod"> 
    some text 
    </span> 
    </td> 
    '); 
document.write(' 
    <td> 
    <span class="prod"> 
    7.70.022 
    </span> 
    </td> 
</tr> 
'); 
document.write('</table>'); 

I need the following result with BeautifulSoup:

<table> 
<tr> 
    <td> 
    <span class="prod"> 
    some text 
    </span> 
    </td> 
    <td> 
    <span class="prod"> 
    7.70 
    </span> 
    </td> 
</tr> 
</table> 

Answer

Why not just use a regex to remove the parts you don't want, and then parse the result with BeautifulSoup?

import re 

data = """document.write('<table>'); 
document.write(' 
<tr> 
    <td> 
    <span class="prod"> 
    some text 
    </span> 
    </td> 
    '); 
document.write(' 
    <td> 
    <span class="prod"> 
    7.70.022 
    </span> 
    </td> 
</tr> 
'); 
document.write('</table>');""" 

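# Match each document.write('...'); call and capture only the markup between the quotes.
pattern = re.compile(r"document\.write\('\n?([^']*?)(?:\n\s*)?'\);")
# Replace each call with its captured markup; the raw string keeps \g<1> from being read as an escape.
data = pattern.sub(r'\g<1>', data)
print(data)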

Output

<table> 
<tr> 
    <td> 
    <span class="prod"> 
    some text 
    </span> 
    </td> 
    <td> 
    <span class="prod"> 
    7.70.022 
    </span> 
    </td> 
</tr> 
</table>
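
The answer stops after the regex step, so here is a minimal sketch of the full round trip: strip the wrappers, then hand the recovered markup to BeautifulSoup. The helper names `strip_document_write` and `cell_texts` are hypothetical, the pattern is a slightly loosened variant of the one above (it tolerates any whitespace just inside the quotes), and the sample input is a shortened version of the HTML in the question. It assumes the `bs4` package is installed for the parsing step.

```python
import re

# Loosened variant of the answer's pattern (an assumption, not the original):
# optional whitespace is allowed on both sides of the quoted markup.
WRITE_CALL = re.compile(r"document\.write\('\s*([^']*?)\s*'\);")


def strip_document_write(js_source):
    """Return the markup with every document.write('...'); wrapper removed."""
    return WRITE_CALL.sub(r"\g<1>", js_source)


def cell_texts(html):
    """Parse the recovered markup and return the text of each span.prod."""
    # Imported lazily so the regex step still works without bs4 installed.
    from bs4 import BeautifulSoup

    soup = BeautifulSoup(html, "html.parser")
    return [span.get_text(strip=True)
            for span in soup.find_all("span", class_="prod")]


# Shortened version of the input from the question.
sample = (
    "document.write('<table>');\n"
    "document.write('\n<tr>\n<td><span class=\"prod\">some text</span></td>\n');\n"
    "document.write('\n<td><span class=\"prod\">7.70.022</span></td>\n</tr>\n');\n"
    "document.write('</table>');"
)

html = strip_document_write(sample)
print(html)
```

Calling `cell_texts(html)` then returns the text of the two `span.prod` cells. Note that the pattern relies on `[^']`, so it will not match a `document.write` call whose markup itself contains an apostrophe; such input would need a more careful approach.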