
How can I parse the next page with Beautiful Soup? I am using the code below to parse the next page:

def parseNextThemeUrl(url):
    ret = []
    ret1 = []
    html = urllib.request.urlopen(url)
    html = BeautifulSoup(html, PARSER)
    html = html.find('a', class_='pager_next')
    if html:
        html = urljoin(url, html.get('href'))
        ret1 = parseNextThemeUrl(html)

        for r in ret1:
            ret.append(r)
    else:
        ret.append(url)
    return ret

But I got the error below. How can I parse the next link when there is one?

Traceback (most recent call last): 
html = urllib.request.urlopen(url) 
File "/Library/Frameworks/Python.framework/Versions/3.5/lib/python3.5/urllib/request.py", line 162, in urlopen 
return opener.open(url, data, timeout) 
File "/Library/Frameworks/Python.framework/Versions/3.5/lib/python3.5/urllib/request.py", line 456, in open 
req.timeout = timeout 
AttributeError: 'list' object has no attribute 'timeout' 
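The traceback points at req.timeout = timeout inside OpenerDirector.open, which is what happens when urlopen receives a list instead of a URL string or Request object. A minimal sketch that reproduces the same error (the URL is only a placeholder):

import urllib.request

# urlopen expects a URL string or a Request object; anything else is handed
# straight to OpenerDirector.open, which then tries to set .timeout on it.
pages = ['http://example.com/ProductList.aspx?PageIndex=1']  # a list, not a string
urllib.request.urlopen(pages)
# AttributeError: 'list' object has no attribute 'timeout'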

Can you give us the link to the page? Without seeing the page we can't be sure. – Seekheart


'http://003.b2btoys.net/en/ProductList.aspx?Class1=12' 'http://003.b2btoys.net/en/ProductList.aspx?PageIndex=2&Class1=13&Class2=0&type=&keyWord=' – mikezang

Answer


I found my own answer, as follows:

import urllib.request
from urllib.parse import urljoin
from bs4 import BeautifulSoup


def parseNextThemeUrl(url):
    # Collect the current page, then recurse while a "next page" link exists.
    urls = []
    urls.append(url)
    html = urllib.request.urlopen(url)
    soup = BeautifulSoup(html, 'lxml')
    new_page = soup.find('a', class_='pager_next')

    if new_page:
        # Resolve the relative href against the current page URL and recurse.
        new_url = urljoin(url, new_page.get('href'))
        urls1 = parseNextThemeUrl(new_url)

        for url1 in urls1:
            urls.append(url1)
    return urls
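
A quick usage sketch, assuming one of the product-list URLs from the comments as the starting page:

start = 'http://003.b2btoys.net/en/ProductList.aspx?Class1=12'
all_pages = parseNextThemeUrl(start)
for page_url in all_pages:
    print(page_url)  # every page in the pager chain, starting page included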