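BeautifulSoup chokes on a jQuery script, any known workarounds?

I am feeding BeautifulSoup an HTML document by simply constructing a BeautifulSoup object instance with the complete HTML, and it seems to choke on the following line of a jQuery script embedded inline in the HTML: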

 var txt = "Logged in as: <a href=\"http://somedomain.com/the-blah/\">" + uname + "</a> <small>(<a href=\"http://somedomain.com/the-blah/\">The Blah</a> | <a href=\"http://somedomain.com/the-blah/?action=logout\">logout</a>)</small>";

The full error stack trace is as follows:

/usr/local/lib/python2.6/dist-packages/BeautifulSoup-3.1.0.1-py2.6.egg/BeautifulSoup.pyc in __init__(self, *args, **kwargs) 
    1497    kwargs['smartQuotesTo'] = self.HTML_ENTITIES 
    1498   kwargs['isHTML'] = True 
-> 1499   BeautifulStoneSoup.__init__(self, *args, **kwargs) 
    1500 
    1501  SELF_CLOSING_TAGS = buildTagMap(None, 

/usr/local/lib/python2.6/dist-packages/BeautifulSoup-3.1.0.1-py2.6.egg/BeautifulSoup.pyc in __init__(self, markup, parseOnlyThese, fromEncoding, markupMassage, smartQuotesTo, convertEntities, selfClosingTags, isHTML, builder) 
    1228   self.markupMassage = markupMassage 
    1229   try: 
-> 1230    self._feed(isHTML=isHTML) 
    1231   except StopParsing: 
    1232    pass 

/usr/local/lib/python2.6/dist-packages/BeautifulSoup-3.1.0.1-py2.6.egg/BeautifulSoup.pyc in _feed(self, inDocumentEncoding, isHTML) 
    1261   self.builder.reset() 
    1262 
-> 1263   self.builder.feed(markup) 
    1264   # Close out any unfinished strings and close all the open tags. 
    1265   self.endData() 

/usr/lib/python2.6/HTMLParser.pyc in feed(self, data) 
    106   """ 
    107   self.rawdata = self.rawdata + data 
--> 108   self.goahead(0) 
    109 
    110  def close(self): 

/usr/lib/python2.6/HTMLParser.pyc in goahead(self, end) 
    146    if startswith('<', i): 
    147     if starttagopen.match(rawdata, i): # < + letter 
--> 148      k = self.parse_starttag(i) 
    149     elif startswith("</", i): 
    150      k = self.parse_endtag(i) 

/usr/lib/python2.6/HTMLParser.pyc in parse_starttag(self, i) 
    227  def parse_starttag(self, i): 
    228   self.__starttag_text = None 
--> 229   endpos = self.check_for_whole_start_tag(i) 
    230   if endpos < 0: 
    231    return endpos 

/usr/lib/python2.6/HTMLParser.pyc in check_for_whole_start_tag(self, i) 
    302     return -1 
    303    self.updatepos(i, j) 
--> 304    self.error("malformed start tag") 
    305   raise AssertionError("we should not get here!") 
    306 

/usr/lib/python2.6/HTMLParser.pyc in error(self, message) 
    113 
    114  def error(self, message): 
--> 115   raise HTMLParseError(message, self.getpos()) 
    116 
    117  __starttag_text = None 

HTMLParseError: malformed start tag, at line 193, column 110

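From what I can gather, it has something to do with the angle brackets being inside a quoted string, and the parser seems to be thrown off by that. Is there any kind of workaround, or is there another library that handles these edge cases better? Alternatively, is there a way to tell it to ignore all JavaScript content?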

2010-11-14 Silly Kids

The easiest approach is probably to remove all the scripts. See the "Removing elements" section of the documentation: http://www.crummy.com/software/BeautifulSoup/documentation.html#Removing%20elements
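
If the parser raises before a soup object exists at all, one possible fallback is to drop the script elements from the raw markup first. The following is only a minimal sketch using the standard library; the names SCRIPT_RE and strip_scripts and the sample markup are hypothetical, and a regular expression is a rough pre-filter rather than a real HTML parser.

```python
import re

# Hypothetical helper, not part of BeautifulSoup.
# Matches a whole <script ...>...</script> element, including its body.
SCRIPT_RE = re.compile(
    r"<script\b[^>]*>.*?</script\s*>",
    re.IGNORECASE | re.DOTALL,
)


def strip_scripts(markup):
    """Return the markup with every <script> element removed."""
    return SCRIPT_RE.sub("", markup)


# Sample document (an assumption for illustration) with tags inside a JS string.
html = (
    '<html><head><script type="text/javascript">'
    'var txt = "Logged in as: <a href=\\"http://somedomain.com/the-blah/\\">" + uname + "</a>";'
    '</script></head><body><p>Hello</p></body></html>'
)

cleaned = strip_scripts(html)
print(cleaned)  # <html><head></head><body><p>Hello</p></body></html>
```

The cleaned string can then be passed to the BeautifulSoup constructor as usual. Note that the match ends at the first closing script tag, so a script whose string literals contain that closing tag would be cut short.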

2010-11-14 18:14:52

Tested and working: trashed = [script.extract() for script in soup.findAll('script')] – 2010-11-14 18:25:03

Wow, that really is beautiful! – 2010-11-14 19:10:48
