malformed start tag error - Python, BeautifulSoup, and Sipie - Ubuntu 10.04

I just installed python, mplayer, beautifulsoup and sipie on my Ubuntu 10.04 machine to run Sirius. I followed some seemingly straightforward documentation but ran into problems. I'm not familiar with Python, so this may be out of my league.

As far as I can tell everything installed, but running sipie then gives this:

/usr/bin/Sipie/Sipie/Config.py:12: DeprecationWarning: the md5 module is deprecated; use hashlib instead
  import md5
Traceback (most recent call last):
  File "/usr/bin/Sipie/sipie.py", line 22, in <module>
    Sipie.cliPlayer()
  File "/usr/bin/Sipie/Sipie/cliPlayer.py", line 74, in cliPlayer
    completer = Completer(sipie.getStreams())
  File "/usr/bin/Sipie/Sipie/Factory.py", line 374, in getStreams
    streams = self.tryGetStreams()
  File "/usr/bin/Sipie/Sipie/Factory.py", line 298, in tryGetStreams
    soup = BeautifulSoup(data)
  File "/usr/local/lib/python2.6/dist-packages/BeautifulSoup-3.1.0.1-py2.6.egg/BeautifulSoup.py", line 1499, in __init__
    BeautifulStoneSoup.__init__(self, *args, **kwargs)
  File "/usr/local/lib/python2.6/dist-packages/BeautifulSoup-3.1.0.1-py2.6.egg/BeautifulSoup.py", line 1230, in __init__
    self._feed(isHTML=isHTML)
  File "/usr/local/lib/python2.6/dist-packages/BeautifulSoup-3.1.0.1-py2.6.egg/BeautifulSoup.py", line 1263, in _feed
    self.builder.feed(markup)
  File "/usr/lib/python2.6/HTMLParser.py", line 108, in feed
    self.goahead(0)
  File "/usr/lib/python2.6/HTMLParser.py", line 148, in goahead
    k = self.parse_starttag(i)
  File "/usr/lib/python2.6/HTMLParser.py", line 226, in parse_starttag
    endpos = self.check_for_whole_start_tag(i)
  File "/usr/lib/python2.6/HTMLParser.py", line 301, in check_for_whole_start_tag
    self.error("malformed start tag")
  File "/usr/lib/python2.6/HTMLParser.py", line 115, in error
    raise HTMLParseError(message, self.getpos())
HTMLParser.HTMLParseError: malformed start tag, at line 100, column 3

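I looked through these files and line numbers, but since I'm not familiar with Python they don't mean much to me. Any suggestions on what to do next?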

2010-07-07 nicorellius

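The issues you are having are pretty common, and they deal specifically with malformed HTML. In my case, there was an HTML element that had double-quoted an attribute's value. I actually ran into this issue today and came across your post while looking into it. I was finally able to resolve it by parsing the HTML with html5lib before handing it off to BeautifulSoup 4.

First, you'll need to: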

sudo easy_install bs4 
sudo apt-get install python-html5lib

Then, run this example code:

from bs4 import BeautifulSoup 
import html5lib 
from html5lib import sanitizer 
from html5lib import treebuilders 
import urllib 

url = 'http://the-url-to-scrape' 
fp = urllib.urlopen(url) 

# Create an html5lib parser. Not sure if the sanitizer is required. 
parser = html5lib.HTMLParser(tree=treebuilders.getTreeBuilder("beautifulsoup"), tokenizer=sanitizer.HTMLSanitizer) 
# Load the source file's HTML into html5lib 
html5lib_object = parser.parse(fp)  # fp is the response opened above
# In theory we shouldn't need to convert this to a string before passing to BS. Didn't work passing directly to BS for me however. 
html_string = str(html5lib_object) 

# Load the string into BeautifulSoup for parsing. 
soup = BeautifulSoup(html_string) 

for content in soup.findAll('div'): 
    print content

If you have any questions about this code or need more detailed guidance, let me know. :)

2012-02-10 18:22:04

I get 'ValueError: Unrecognised treebuilder "beautifulsoup"' (Python 2.7.5, beautifulsoup 4.3.2, html5lib 0.999) – 2014-03-16 16:20:43

-2

Look at line 100, column 3 of the "data" mentioned in File "/usr/bin/Sipie/Sipie/Factory.py", line 298.
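One way to look at that spot is to print the lines around the reported position. The sketch below is only an illustration: show_context is a hypothetical helper (not part of Sipie or BeautifulSoup), and it assumes the parser's convention of 1-based line numbers and 0-based column offsets. In Sipie, the string to pass in would be the `data` variable handed to BeautifulSoup at Factory.py line 298; the sample string here is made up.

```python
def show_context(data, line, column, radius=2):
    """Return the lines around a 1-based line number, with a caret under the column."""
    lines = data.splitlines()
    first = max(line - radius, 1)
    last = min(line + radius, len(lines))
    out = []
    for number in range(first, last + 1):
        out.append("%5d: %s" % (number, lines[number - 1]))
        if number == line:
            # The "%5d: " prefix is 7 characters wide; column is a 0-based offset.
            out.append(" " * (7 + column) + "^")
    return "\n".join(out)


# Made-up markup with a doubly quoted attribute value on line 3.
sample = '<html>\n<body>\n<a href=""x"">link</a>\n</body>\n</html>'
print(show_context(sample, 3, 0, radius=1))
```

For the error in the question, the call would be show_context(data, 100, 3), placed temporarily just before the BeautifulSoup(data) line.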

2010-07-07 21:23:27

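I see what you mean, but I'm having a hard time finding that data... still searching. I'm still unfamiliar with how all these programs work together... any other tips? – nicorellius 2010-07-08 14:56:20

Newer versions of BeautifulSoup (such as the 3.1.0.1 shown in the traceback) use HTMLParser rather than SGMLParser, because SGMLParser was removed from the Python 3.0 standard library. As a result, BeautifulSoup can no longer handle many malformed HTML documents correctly, which is what I believe you are running into here.

One way to solve the problem is to uninstall BeautifulSoup and install an older version (which will still work on Ubuntu 10.04 LTS with Python 2.6):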

sudo apt-get remove python-beautifulsoup 
sudo easy_install -U "BeautifulSoup==3.0.7a"

Be aware that this temporary workaround will not work with Python 3.0 (which may become the default in future Ubuntu releases).

2010-08-29 04:09:49

Assuming you are using BeautifulSoup4, I found this in the official documentation: http://www.crummy.com/software/BeautifulSoup/bs4/doc/#installing-a-parser

If you’re using a version of Python 2 earlier than 2.7.3, or a version of Python 3 earlier than 3.2.2, it’s essential that you install lxml or html5lib–Python’s built-in HTML parser is just not very good in older versions.

I tried this and it works well, much like what @Joshua did:

soup = BeautifulSoup(r.text, 'html5lib')
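The version rule quoted from the documentation can be written down as a small check. This is a rough sketch, not anything BeautifulSoup provides: builtin_parser_is_reliable and choose_parser are hypothetical names, and preferring html5lib over lxml is an assumption made only because html5lib is the parser used in this thread. It uses importlib.util.find_spec, so it needs Python 3.4 or later to run.

```python
import importlib.util
import sys


def builtin_parser_is_reliable(version=sys.version_info):
    """True when Python is at or past the versions named in the quoted docs."""
    version = tuple(version[:3])
    if version[0] == 2:
        return version >= (2, 7, 3)
    return version >= (3, 2, 2)


def choose_parser(version=sys.version_info, available=None):
    """Pick a parser name to pass as BeautifulSoup's second argument."""
    preferred = ("html5lib", "lxml")  # assumed preference order
    if available is None:
        # Detect which third-party parsers are importable without importing them.
        available = [name for name in preferred
                     if importlib.util.find_spec(name) is not None]
    for name in preferred:
        if name in available:
            return name
    if builtin_parser_is_reliable(version):
        return "html.parser"
    raise RuntimeError("install lxml or html5lib for this Python version")


print(choose_parser())
```

The returned name would then be used as in the line above, for example BeautifulSoup(r.text, choose_parser()).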

2012-04-30 03:11:10 Drake

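+1, nice find! – 2012-09-26 14:47:05

Is the "r" in the code above an HTML object from the requests library? Either way, this great one-liner also works like a charm with the pycurl library. +1 – FredTheWebGuy 2013-07-17 06:46:47

@Dreadful_Code: r = requests.get(url) – dannyroa 2013-09-17 18:06:15

Command line: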

$ pip install beautifulsoup4 
$ pip install html5lib

Python 3:

from bs4 import BeautifulSoup 
from urllib.request import urlopen 

url = 'http://www.example.com' 
page = urlopen(url) 
soup = BeautifulSoup(page.read(), 'html5lib') 
links = soup.find_all('a', href=True)  # skip anchors without an href so link['href'] cannot fail

for link in links: 
    print(link.string, link['href'])

2014-03-16 16:52:57

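@Ryan Allen I'm also getting the malformed start tag message, but I need to parse an HTML file saved to disk rather than an opened URL. Is there a way to do that? – ShaunO 2017-06-30 19:59:02

You just open the file instead of using urlopen: 'page = open('your/file/path/')' – 2017-07-05 17:49:42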
