
I am trying to scrape web pages with the code below, looping over multiple URLs:

import requests 
from bs4 import BeautifulSoup 

page = requests.get("http://www.realcommercial.com.au/sold/property-offices-retail-showrooms+bulky+goods-land+development-hotel+leisure-medical+consulting-other-in-vic/list-1?includePropertiesWithin=includesurrounding&activeSort=list-date&autoSuggest=true") 

soup = BeautifulSoup(page.content, 'html.parser') 
links = soup.find_all('a', attrs={'class': 'details-panel'})
hrefs = [link['href'] for link in links]

for urls in hrefs:
    pages = requests.get(urls)
    soup_2 = BeautifulSoup(pages.content, 'html.parser')

    Date = soup_2.find_all('li', attrs={'class': 'sold-date'})
    Sold_Date = [Sold_Date.text.strip() for Sold_Date in Date]
    Address_1 = soup_2.find_all('p', attrs={'class': 'full-address'})
    Address = [Address.text.strip() for Address in Address_1]

The code above only returns the details from the first URL in hrefs:

['Mon 05-Jun-17'] ['261 Keilor Road, Essendon, Vic 3040'] 

I need it to loop through every URL in hrefs and return the same details from each one. Please suggest what I should add or change in the code above. Any help would be appreciated.

Thanks

Answers


It is behaving correctly. You need to accumulate the information in a list outside the loop and then return it.

import requests 
from bs4 import BeautifulSoup 

page = requests.get("http://www.realcommercial.com.au/sold/property-offices-retail-showrooms+bulky+goods-land+development-hotel+leisure-medical+consulting-other-in-vic/list-1?includePropertiesWithin=includesurrounding&activeSort=list-date&autoSuggest=true") 

soup = BeautifulSoup(page.content, 'html.parser') 
links = soup.find_all('a', attrs={'class': 'details-panel'})
hrefs = [link['href'] for link in links]
Data = []
for urls in hrefs:
    pages = requests.get(urls)
    soup_2 = BeautifulSoup(pages.content, 'html.parser')

    Date = soup_2.find_all('li', attrs={'class': 'sold-date'})
    Sold_Date = [Sold_Date.text.strip() for Sold_Date in Date]
    Address_1 = soup_2.find_all('p', attrs={'class': 'full-address'})
    Address = [Address.text.strip() for Address in Address_1]
    Data.append(Sold_Date + Address)

print(Data)  # 'return Data' is only valid inside a function; print or otherwise use the list here

Thank you so much Anubhav, it works for me now. –


Could you also please guide me on how to run the same code on, say, 10 or 20 pages of the same site, without having to supply the link for each new page every time? –
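A minimal sketch of one way to do this, assuming the site paginates by incrementing the list-1 segment of the URL (list-2, list-3, and so on) — that pattern is inferred from the original URL and should be verified against the site, and the BASE_URL and page_number names below are introduced for illustration:

import requests
from bs4 import BeautifulSoup

# Assumption: page N of the results lives at .../list-N?... (inferred from
# the 'list-1' segment of the original URL; verify this before relying on it)
BASE_URL = ("http://www.realcommercial.com.au/sold/property-offices-retail-"
            "showrooms+bulky+goods-land+development-hotel+leisure-medical+"
            "consulting-other-in-vic/list-{page}?includePropertiesWithin="
            "includesurrounding&activeSort=list-date&autoSuggest=true")

hrefs = []
for page_number in range(1, 11):  # pages 1 to 10; adjust the range as needed
    page = requests.get(BASE_URL.format(page=page_number))
    soup = BeautifulSoup(page.content, 'html.parser')
    links = soup.find_all('a', attrs={'class': 'details-panel'})
    hrefs += [link['href'] for link in links]
# then loop over hrefs exactly as in the answer above

This only changes how hrefs is built; the per-property scraping loop stays the same.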


If it is working, please accept the answer to close the question. –


You are overwriting the Address and Sold_Date objects on every iteration:

# each assignment discards the data from the previous iteration
Sold_Date = [Sold_Date.text.strip() for Sold_Date in Date] 
Address = [Address.text.strip() for Address in Address_1] 

Try initializing empty lists outside the loop and extending them:

import requests 
from bs4 import BeautifulSoup 

page = requests.get("http://www.realcommercial.com.au/sold/property-offices-retail-showrooms+bulky+goods-land+development-hotel+leisure-medical+consulting-other-in-vic/list-1?includePropertiesWithin=includesurrounding&activeSort=list-date&autoSuggest=true") 

soup = BeautifulSoup(page.content, 'html.parser') 
links = soup.find_all('a', attrs={'class': 'details-panel'}) 
hrefs = [link['href'] for link in links] 

addresses = [] 
sold_dates = [] 
for urls in hrefs: 
    pages = requests.get(urls) 
    soup_2 = BeautifulSoup(pages.content, 'html.parser') 

    dates_tags = soup_2.find_all('li', attrs={'class': 'sold-date'}) 
    sold_dates += [date_tag.text.strip() for date_tag in dates_tags] 
    addresses_tags = soup_2.find_all('p', attrs={'class': 'full-address'}) 
    addresses += [address_tag.text.strip() for address_tag in addresses_tags] 

This gives us:

>>> sold_dates
[u'Tue 06-Jun-17', 
u'Tue 06-Jun-17', 
u'Tue 06-Jun-17', 
u'Tue 06-Jun-17', 
u'Tue 06-Jun-17', 
u'Tue 06-Jun-17', 
u'Tue 06-Jun-17', 
u'Mon 05-Jun-17', 
u'Mon 05-Jun-17', 
u'Mon 05-Jun-17'] 
>>> addresses
[u'141 Napier Street, Essendon, Vic 3040', 
u'5 Loupe Crescent, Leopold, Vic 3224', 
u'80 Ryrie Street, Geelong, Vic 3220', 
u'18 Boase Street, Brunswick, Vic 3056', 
u'130-186 Buckley Street, West Footscray, Vic 3012', 
u'223 Park Street, South Melbourne, Vic 3205', 
u'48-50 The Centreway, Lara, Vic 3212', 
u'14 Webster Street, Ballarat, Vic 3350', 
u'323 Nepean Highway, Frankston, Vic 3199', 
u'341 Buckley Street, Aberfeldie, Vic 3040'] 

Thank you so much for your reply, Azat!! –


@Renusharma: did it work? –