How to remove the XML declaration using BeautifulSoup4

I am using BeautifulSoup4 and have an XHTML file structured like this:

<?xml version="1.0" encoding="UTF-8"?> 
<!DOCTYPE html> 
<html lang="en"> 
<head> 
... 
</head> 
<body> 
... 
</body> 
</html>

I want to remove the XML declaration from the file so that it looks like this:

<!DOCTYPE html> 
<html lang="en"> 
<head> 
... 
</head> 
<body> 
... 
</body> 
</html>

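I can't find a way to get hold of the XML declaration in order to remove it. As far as I can tell, it is not a Doctype, Declaration, Tag, or NavigableString. Is there a way to find it so that I can extract it?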

As a working example, I can remove the doctype with code like this (assuming the text of the file is in the variable `html`):

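from bs4 import BeautifulSoup, Doctype
soup = BeautifulSoup(html, 'html.parser')
# Iterate over a copy so that extracting nodes does not disturb the loop.
[item.extract() for item in list(soup.contents) if isinstance(item, Doctype)]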
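For reference, here is a self-contained sketch of the same doctype removal. The sample document and the `result` variable are made up for illustration, and the choice of `html.parser` is an assumption (the original snippet did not name a parser).

```python
from bs4 import BeautifulSoup
from bs4.element import Doctype

# Hypothetical sample input standing in for the real file contents.
html = """<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE html>
<html lang="en">
<head><title>t</title></head>
<body><p>hi</p></body>
</html>"""

soup = BeautifulSoup(html, "html.parser")

# Copy the top-level children first so extraction does not disturb iteration.
for item in list(soup.contents):
    if isinstance(item, Doctype):
        item.extract()

result = str(soup)
print(result)
```

Only the doctype is removed here; the XML declaration is left in place, which is exactly the problem described above.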

Source

2015-10-19 Jason Champion

You can use the following approach:

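from bs4 import BeautifulSoup
from bs4.element import ProcessingInstruction

soup = BeautifulSoup(html, 'html.parser')

# With html.parser, the XML declaration is a ProcessingInstruction node.
# Remove the first one found at the top level, then stop.
for e in soup:
    if isinstance(e, ProcessingInstruction):
        e.extract()
        break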
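As a rough, self-contained sketch of this idea, the helper below wraps the loop in a function. The name `strip_xml_declaration` and the sample input are hypothetical. Unlike the loop above, it removes every top-level processing instruction rather than stopping at the first. It also assumes `html.parser`; other parsers may represent the declaration differently, or drop it, so the `isinstance` check may not match there.

```python
from bs4 import BeautifulSoup
from bs4.element import ProcessingInstruction


def strip_xml_declaration(markup: str) -> str:
    """Return the markup with top-level processing instructions removed.

    Hypothetical helper; assumes html.parser exposes the XML declaration
    as a ProcessingInstruction node.
    """
    soup = BeautifulSoup(markup, "html.parser")
    # Iterate over a copy because extract() modifies soup.contents.
    for node in list(soup.contents):
        if isinstance(node, ProcessingInstruction):
            node.extract()
    # Drop the newline left behind where the declaration used to be.
    return str(soup).lstrip()


# Hypothetical sample input.
sample = """<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE html>
<html lang="en">
<head><title>t</title></head>
<body><p>hi</p></body>
</html>"""

cleaned = strip_xml_declaration(sample)
print(cleaned)
```

To see how a given parser classifies each top-level node, printing `type(node).__name__` for every item in `soup.contents` is a quick check.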

Source

2015-10-19 06:25:32

Perfect, thank you. :)
