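Windmill is not getting all of the HTML content

I am trying to scrape data from a web page using the Python Windmill framework. However, I cannot get the HTML table content from the page. The table is generated by JavaScript, which is why I am using Windmill to fetch the content. The content that comes back does not include the table, though, and if I try to parse the content with BeautifulSoup, I get an error.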

from windmill.authoring import WindmillTestClient 
from BeautifulSoup import BeautifulSoup 

from copy import copy 
import re 

def get_massage(): 
    my_massage = copy(BeautifulSoup.MARKUP_MASSAGE) 
    my_massage.append((re.compile(u"document.write(.+);"), lambda match: "")) 
    my_massage.append((re.compile(u'alt=".+">'), lambda match: ">")) 
    return my_massage 

def test_scrape(): 
    my_massage = get_massage() 
    client = WindmillTestClient(__name__) 
    client.open(url='http://marinetraffic.com/ais/datasheet.aspx?MMSI=636092060&TIMESTAMP=2&menuid=&datasource=POS&app=&mode=&B1=Search') 
    client.waits.forPageLoad(timeout='60000') 
    html = client.commands.getPageText() 
    assert html['status'] 
    assert html['result'] 
    soup=BeautifulSoup(html['result'],markupMassage=my_massage) 
    print soup.prettify()

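As you can see, the table is missing from the soup output, but if you inspect the page with something like Firebug, the table is there. Overall, I am trying to get the table contents and parse them into some kind of data structure for further processing. Any help is much appreciated!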

2012-03-09 user1242670

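The problem is that the markup massage you are using does not work correctly for the page you are processing; that is, it removes more HTML than it should.

To check whether BeautifulSoup can parse the page you need, I simply tried this: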

soup = BeautifulSoup(html['result']) 
soup.table

It worked fine, so it seems that no markup massage is needed in this case after all.
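
As a rough illustration of how a massage pattern can strip more than intended, the sketch below applies the `alt=".+">` pattern from the question to a made-up one-line table row. The `sample` string is hypothetical and is not taken from the real page, and the `lazy` pattern is only one possible narrower alternative. It assumes that each massage pattern is applied to the markup with `re.sub`. Because `.+` is greedy, the match runs from the first `alt="` to the last `">` on the same line, taking the table cells in between with it.

```python
import re

# Hypothetical one-line markup standing in for part of the real page.
sample = ('<tr><td><img alt="flag"></td><td>SHIP NAME</td>'
          '<td><img alt="icon"></td></tr>')

# Pattern from the question: ".+" is greedy, so it extends to the
# last '">' on the line rather than stopping at the first one.
greedy = re.compile(r'alt=".+">')

# A narrower alternative (an assumption, not the original author's code):
# match only up to the closing quote of the same attribute.
lazy = re.compile(r'alt="[^"]*">')

greedy_result = greedy.sub(">", sample)
lazy_result = lazy.sub(">", sample)

print(greedy_result)  # the cell containing SHIP NAME is gone
print(lazy_result)    # only the alt attributes are removed
```

If the page puts a whole table on one long line, a greedy pattern like this could plausibly remove most of it, which would be consistent with the table missing from the soup output.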

2012-03-11 18:47:16 jcollado

Thanks for your help - it is working fine now! – user1242670 2012-03-12 00:17:48
