2010-07-07 68 views
9

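malformed start tag error - Python, BeautifulSoup, and Sipie - Ubuntu 10.04

I just installed python, mplayer, beautifulsoup and sipie on my Ubuntu 10.04 machine to run Sirius. I followed some seemingly straightforward documentation but ran into problems. I'm not very familiar with Python, so this may be out of my league.

I was able to get everything installed, but running sipie then gives this: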

/usr/bin/Sipie/Sipie/Config.py:12: DeprecationWarning: the md5 module is deprecated; use hashlib instead
  import md5
Traceback (most recent call last):
  File "/usr/bin/Sipie/sipie.py", line 22, in <module>
    Sipie.cliPlayer()
  File "/usr/bin/Sipie/Sipie/cliPlayer.py", line 74, in cliPlayer
    completer = Completer(sipie.getStreams())
  File "/usr/bin/Sipie/Sipie/Factory.py", line 374, in getStreams
    streams = self.tryGetStreams()
  File "/usr/bin/Sipie/Sipie/Factory.py", line 298, in tryGetStreams
    soup = BeautifulSoup(data)
  File "/usr/local/lib/python2.6/dist-packages/BeautifulSoup-3.1.0.1-py2.6.egg/BeautifulSoup.py", line 1499, in __init__
    BeautifulStoneSoup.__init__(self, *args, **kwargs)
  File "/usr/local/lib/python2.6/dist-packages/BeautifulSoup-3.1.0.1-py2.6.egg/BeautifulSoup.py", line 1230, in __init__
    self._feed(isHTML=isHTML)
  File "/usr/local/lib/python2.6/dist-packages/BeautifulSoup-3.1.0.1-py2.6.egg/BeautifulSoup.py", line 1263, in _feed
    self.builder.feed(markup)
  File "/usr/lib/python2.6/HTMLParser.py", line 108, in feed
    self.goahead(0)
  File "/usr/lib/python2.6/HTMLParser.py", line 148, in goahead
    k = self.parse_starttag(i)
  File "/usr/lib/python2.6/HTMLParser.py", line 226, in parse_starttag
    endpos = self.check_for_whole_start_tag(i)
  File "/usr/lib/python2.6/HTMLParser.py", line 301, in check_for_whole_start_tag
    self.error("malformed start tag")
  File "/usr/lib/python2.6/HTMLParser.py", line 115, in error
    raise HTMLParseError(message, self.getpos())
HTMLParser.HTMLParseError: malformed start tag, at line 100, column 3

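I looked through these files and line numbers, but since I'm unfamiliar with Python, it didn't make much sense to me. Any suggestions on what to do next?

Answers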

8

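The problem you are running into is quite common, and it comes specifically from malformed HTML. In my case, an HTML element had an attribute whose value was double-quoted twice. I actually hit this problem today and came across your post while working on it. I was finally able to get around it by parsing the HTML with html5lib before handing it off to BeautifulSoup 4.

First, to fix this you need to run: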

sudo easy_install bs4 
sudo apt-get install python-html5lib 

Then run this sample code:

from bs4 import BeautifulSoup 
import html5lib 
from html5lib import sanitizer 
from html5lib import treebuilders 
import urllib 

url = 'http://the-url-to-scrape' 
fp = urllib.urlopen(url) 

# Create an html5lib parser. Not sure if the sanitizer is required. 
parser = html5lib.HTMLParser(tree=treebuilders.getTreeBuilder("beautifulsoup"), tokenizer=sanitizer.HTMLSanitizer) 
# Load the source file's HTML into html5lib 
html5lib_object = parser.parse(fp) 
# In theory we shouldn't need to convert this to a string before passing to BS. Didn't work passing directly to BS for me however. 
html_string = str(html5lib_object) 

# Load the string into BeautifulSoup for parsing. 
soup = BeautifulSoup(html_string) 

for content in soup.findAll('div'): 
    print content 

If you have any questions about this code or need more detailed guidance, let me know. :)

+2

I get 'ValueError: Unrecognised treebuilder "beautifulsoup"' (Python 2.7.5, beautifulsoup 4.3.2, html5lib 0.999) – 2014-03-16 16:20:43

-2

Look at line 100, column 3 of the "data" referred to in File "/usr/bin/Sipie/Sipie/Factory.py", line 298.
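One way to inspect that position, as a rough sketch: once the `data` string that Factory.py passes to BeautifulSoup has been captured somehow (for example by temporarily writing it to a file, which is an assumption about how you would get at it), a small helper can print the surrounding lines with a caret under the reported column. The name `show_position` is hypothetical, and the column is treated as 1-based here, which is how I believe the error message counts it; shift by one if the caret looks off.

```python
def show_position(data, line, column, context=2):
    """Return the lines around (line, column) with a caret under the column.

    Both line and column are treated as 1-based (an assumption about how
    the parser error message reports them).
    """
    lines = data.splitlines()
    start = max(line - 1 - context, 0)
    end = min(line + context, len(lines))
    out = []
    for index in range(start, end):
        # "%4d: " is a 6-character prefix in front of every source line.
        out.append("%4d: %s" % (index + 1, lines[index]))
        if index + 1 == line:
            out.append(" " * (6 + column - 1) + "^")
    return "\n".join(out)
```

Calling `print(show_position(data, 100, 3))` would then show lines 98 to 102 of the markup, which is usually enough to spot a stray quote or an unclosed tag.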

+0

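I see what you mean, but I'm having a hard time finding that data... still searching. I'm still unfamiliar with how all these programs work together... any other hints? – nicorellius 2010-07-08 14:56:20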

2

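Newer versions of BeautifulSoup use HTMLParser rather than SGMLParser (because SGMLParser was removed from the standard library in Python 3.0). As a result, BeautifulSoup can no longer properly handle many malformed HTML documents, which is what I believe you are running into here.

One likely fix is to uninstall BeautifulSoup and install an older version (this will still work on Ubuntu 10.04 LTS with Python 2.6):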

sudo apt-get remove python-beautifulsoup 
sudo easy_install -U "BeautifulSoup==3.0.7a" 

Be aware that this temporary workaround will no longer work with Python 3.0 (which may become the default in future Ubuntu releases).
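To see how the standard-library parser on a given interpreter reacts to a suspect fragment, independent of BeautifulSoup, something like the following sketch may help. It assumes Python 3, where the `HTMLParser` module from the traceback lives at `html.parser`; `TagCollector` and `try_parse` are illustrative names, not part of any library.

```python
from html.parser import HTMLParser


class TagCollector(HTMLParser):
    """Records every start tag the parser manages to recognise."""

    def __init__(self):
        super().__init__()
        self.tags = []

    def handle_starttag(self, tag, attrs):
        self.tags.append((tag, attrs))


def try_parse(markup):
    """Return (tags, None) on success or (None, exception) on failure."""
    collector = TagCollector()
    try:
        collector.feed(markup)
        collector.close()
    except Exception as exc:  # older parsers raised on malformed markup
        return None, exc
    return collector.tags, None
```

Feeding it a fragment such as `try_parse('<a href="foo""bar">text</a>')` shows whether the parser raises or recovers, and what attributes it ends up with, on the interpreter at hand.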

15

Assuming you are using BeautifulSoup4, I found something relevant in the official documentation: http://www.crummy.com/software/BeautifulSoup/bs4/doc/#installing-a-parser

If you’re using a version of Python 2 earlier than 2.7.3, or a version of Python 3 earlier than 3.2.2, it’s essential that you install lxml or html5lib–Python’s built-in HTML parser is just not very good in older versions.

I tried this and it worked well, just like what @Joshua described:

soup = BeautifulSoup(r.text, 'html5lib') 
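If it is not known in advance which parser is installed on a machine, a small helper can choose one before the call above. This is only a sketch: `pick_parser` is a hypothetical name, and the preference order (html5lib, then lxml, then the built-in parser) is my assumption based on the documentation passage quoted above.

```python
import importlib.util


def pick_parser(preferred=("html5lib", "lxml")):
    """Return the name of the first installed parser, else the built-in one.

    The returned string is meant to be passed as the second argument
    to BeautifulSoup.
    """
    for name in preferred:
        # find_spec returns None when the top-level package is absent.
        if importlib.util.find_spec(name) is not None:
            return name
    return "html.parser"
```

It would be used as `soup = BeautifulSoup(r.text, pick_parser())`. Keep in mind that the parsers can build different trees from the same broken markup, so pinning a single parser is the safer choice when results must be reproducible.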
+1

+1, nice find! – 2012-09-26 14:47:05

+0

Is the "r" in the code above a response object from the requests library? Either way, this great one-liner also works like a charm with the pycurl library. +1 – FredTheWebGuy 2013-07-17 06:46:47

+1

@Dreadful_Code: r = requests.get(url) – dannyroa 2013-09-17 18:06:15

2

Command line:

$ pip install beautifulsoup4 
$ pip install html5lib 

Python 3:

from bs4 import BeautifulSoup 
from urllib.request import urlopen 

url = 'http://www.example.com' 
page = urlopen(url) 
soup = BeautifulSoup(page.read(), 'html5lib') 
links = soup.find_all('a') 

for link in links: 
    # .get() returns None instead of raising KeyError when an <a> has no href
    print(link.string, link.get('href')) 
+0

@Ryan Allen I'm also getting the malformed start tag message, but I need to parse an HTML file saved to disk rather than an opened URL. Is there a way to do that? – ShaunO 2017-06-30 19:59:02

+0

You just open the file instead of using urlopen: 'page = open('your/file/path')' – 2017-07-05 17:49:42
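Spelled out a little more, the file-on-disk variant might look like the sketch below. `parse_html_file` is a hypothetical helper, the UTF-8 default is an assumption about how the page was saved, and beautifulsoup4 plus html5lib are assumed to be installed as in the pip commands above.

```python
def parse_html_file(path, parser="html5lib", encoding="utf-8"):
    """Parse an HTML file saved on disk and return the BeautifulSoup tree."""
    # Imported here so that merely loading this module does not require bs4.
    from bs4 import BeautifulSoup

    # BeautifulSoup accepts an open file object as well as a string, and it
    # reads the whole file inside the constructor, so the handle can be
    # closed as soon as the call returns.
    with open(path, encoding=encoding) as handle:
        return BeautifulSoup(handle, parser)
```

After `soup = parse_html_file('your/file/path')`, the rest of the loop over `soup.find_all('a')` stays the same as in the URL version.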