lxml: clean_html replaces the html tag with a div?

I'm using lxml 3.1.0 (installed with easy_install) and I'm seeing a strange result:

>>> from lxml.html.clean import clean_html
>>> clean_html("<html><body><h1>hi</h1></body></html>")
'<div><body><h1>hi</h1></body></div>'

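The html tag has been replaced with a div.

The same thing happens with the sample HTML at http://lxml.de/lxmlhtml.html#cleaning-up-html

What's going on? Have I hit an lxml bug, is my version incompatible with libxml2, or is this somehow expected?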

2013-03-21 ykaganovich

No, I don't think this is the expected behavior... Could you post a larger code snippet? – Murkantilism 2013-03-21 19:36:03

I think you need a Cleaner that leaves page_structure alone:

>>> from lxml.html.clean import Cleaner               
>>> cleaner = Cleaner(page_structure=False)           
>>> cleaner.clean_html("<html><body><h1>hi</h1></body></html>") 
'<html><body><h1>hi</h1></body></html>'

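As stated here, page_structure defaults to True. I suspect the documentation on the site you linked is incorrect or outdated.

Edit #1: Further confirmation that this is the expected behavior can be found in this test in the source code. A pull request has been submitted to correct the documentation.

Edit #2: The pull request was merged into master on 2013-04-28.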
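To see the default for yourself, you can compare a default Cleaner against one that leaves the page structure alone. This is only a minimal sketch, assuming lxml is installed. The fallback import is an assumption about newer environments, where the cleaner ships as the separate lxml_html_clean package. The variable names are made up for illustration.

```python
try:
    from lxml.html.clean import Cleaner
except ImportError:
    # Assumption: a newer lxml, where the cleaner is installed separately
    from lxml_html_clean import Cleaner

html = "<html><body><h1>hi</h1></body></html>"

# A Cleaner built without arguments uses the class defaults
default_cleaner = Cleaner()
print(default_cleaner.page_structure)

# Default: the root <html> element is turned into a <div>
default_result = default_cleaner.clean_html(html)
print(default_result)

# page_structure=False: the <html> element is kept
preserved_result = Cleaner(page_structure=False).clean_html(html)
print(preserved_result)
```

The module-level clean_html used in the question behaves like the default case here, which is why the html tag comes back as a div.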

2013-03-21 19:38:51 crayzeewulf

Since page_structure=True is the default, the structural parts of the page (such as <head>, <html>, and <title>) are removed. To change this:

import lxml.html.clean as clean 
content = '<html><body><h1>hi</h1></body></html>' 
cleaner = clean.Cleaner(page_structure=False) 
cleaned = cleaner.clean_html(content) 
print(cleaned) 
# <html><body><h1>hi</h1></body></html>

See the docstring of the clean.Cleaner class:

In [105]: clean.Cleaner? 
Type:  type 
String Form:<class 'lxml.html.clean.Cleaner'> 
File:  /usr/lib/python2.7/dist-packages/lxml/html/clean.py 
Definition: clean.Cleaner(self, doc) 
Docstring: 
Instances cleans the document of each of the possible offending 
elements. The cleaning is controlled by attributes; you can 
override attributes in a subclass, or set them in the constructor. 

``scripts``: 
    Removes any ``<script>`` tags. 

``javascript``: 
    Removes any Javascript, like an ``onclick`` attribute. 

``comments``: 
    Removes any comments. 

``style``: 
    Removes any style tags or attributes. 

``links``: 
    Removes any ``<link>`` tags 

``meta``: 
    Removes any ``<meta>`` tags 

``page_structure``: 
    Structural parts of a page: ``<head>``, ``<html>``, ``<title>``. 

``processing_instructions``: 
    Removes any processing instructions. 

``embedded``: 
    Removes any embedded objects (flash, iframes) 

``frames``: 
    Removes any frame-related tags 

``forms``: 
    Removes any form tags 

``annoying_tags``: 
    Tags that aren't *wrong*, but are annoying. ``<blink>`` and ``<marquee>`` 

``remove_tags``: 
    A list of tags to remove. 

``allow_tags``: 
    A list of tags to include (default include all). 

``remove_unknown_tags``: 
    Remove any tags that aren't standard parts of HTML. 

``safe_attrs_only``: 
    If true, only include 'safe' attributes (specifically the list 
    from `feedparser 
    <http://feedparser.org/docs/html-sanitization.html>`_). 

``add_nofollow``: 
    If true, then any <a> tags will have ``rel="nofollow"`` added to them. 

``host_whitelist``: 
    A list or set of hosts that you can use for embedded content 
    (for content like ``<object>``, ``<link rel="stylesheet">``, etc). 
    You can also implement/override the method 
    ``allow_embedded_url(el, url)`` or ``allow_element(el)`` to 
    implement more complex rules for what can be embedded. 
    Anything that passes this test will be shown, regardless of 
    the value of (for instance) ``embedded``. 

    Note that this parameter might not work as intended if you do not 
    make the links absolute before doing the cleaning. 

``whitelist_tags``: 
    A set of tags that can be included with ``host_whitelist``. 
    The default is ``iframe`` and ``embed``; you may wish to 
    include other tags like ``script``, or you may want to 
    implement ``allow_embedded_url`` for more control. Set to None to 
    include all tags. 

This modifies the document *in place*. 
Constructor information: 
Definition:clean.Cleaner(self, **kw)
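
The docstring mentions two things that are easy to miss: options can be overridden in a subclass, and calling a Cleaner on a parsed tree modifies it in place. Here is a minimal sketch of both, assuming lxml is installed. The class name StructurePreservingCleaner and the sample markup are made up for illustration. The fallback import assumes a newer lxml where the cleaner lives in the separate lxml_html_clean package.

```python
import lxml.html

try:
    from lxml.html.clean import Cleaner
except ImportError:
    # Assumption: a newer lxml, where the cleaner is installed separately
    from lxml_html_clean import Cleaner


class StructurePreservingCleaner(Cleaner):
    """Hypothetical subclass: keep the page structure, leave other defaults alone."""
    page_structure = False


doc = lxml.html.fromstring(
    "<html><body><h1>hi</h1><script>alert(1)</script></body></html>"
)

cleaner = StructurePreservingCleaner()
cleaner(doc)  # cleans the parsed tree in place

# Serialize the cleaned tree back to a string
result = lxml.html.tostring(doc, encoding="unicode")
print(result)
```

With the other options left at their defaults, the script element should be dropped while the html element stays. Use clean_html instead if you want a cleaned copy and the original left untouched.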

2013-03-21 19:38:31 unutbu
