2017-10-07

I have been trying to collect data from Zillow without any success. What is the best way to scrape data from Zillow?

Example:

url = https://www.zillow.com/homes/for_sale/Los-Angeles-CA_rb/?fromHomePage=true&shouldFireSellPageImplicitClaimGA=false&fromHomePageTab=buy

I want to pull information such as the address, price, Zestimate, and location for all homes in Los Angeles.

I have already tried HTML scraping with packages like BeautifulSoup, and I have also tried working with JSON. I am almost certain that Zillow's API will not help. My understanding is that the API is best suited for collecting information about a specific property.
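For example, as far as I can tell the API's GetSearchResults call works roughly like the sketch below (the zws-id key, parameter names, and XML layout are my reading of the API overview and may not be exact); it answers questions about a single address, which is why it does not help with pulling every listing in Los Angeles:

import requests
import xml.etree.ElementTree as ET

# Rough sketch of a Zillow API (GetSearchResults) lookup for ONE property.
# 'YOUR-ZWS-ID' is a placeholder for an API key; the exact parameter names
# and XML structure are assumptions based on the API overview page.
params = {
    'zws-id': 'YOUR-ZWS-ID',
    'address': '2114 Bigelow Ave',
    'citystatezip': 'Seattle, WA',
}
resp = requests.get('https://www.zillow.com/webservice/GetSearchResults.htm', params=params)
root = ET.fromstring(resp.content)

# Print any Zestimate amounts found in the XML response.
for amount in root.iter('amount'):
    print(amount.text)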

I have been able to pull information from other websites, but it seems that Zillow uses dynamic IDs (they change on every refresh), which makes accessing the information more difficult.

UPDATE: I tried using the code below, but I am still not getting any results:

import requests
from bs4 import BeautifulSoup

url = 'https://www.zillow.com/homes/for_sale/Los-Angeles-CA_rb/?fromHomePage=true&shouldFireSellPageImplicitClaimGA=false&fromHomePageTab=buy'

page = requests.get(url)
data = page.content

soup = BeautifulSoup(data, 'html.parser')

for li in soup.find_all('div', {'class': 'zsg-photo-card-caption'}):
    try:
        # There are sponsored links in the list. You might need to take care
        # of that.
        # Better to check for null values, which we are not doing here.
        print(li.find('span', {'class': 'zsg-photo-card-price'}).text)
        print(li.find('span', {'class': 'zsg-photo-card-info'}).text)
        print(li.find('span', {'class': 'zsg-photo-card-address'}).text)
        print(li.find('span', {'class': 'zsg-photo-card-broker-name'}).text)
    except:
        print('An error occurred')

https://www.zillow.com/howto/api/APIOverview.htm –

I have already checked out the API; it doesn't give me exactly what I need. –

You may find that's because Zillow's API Terms of Use (and the site's terms) specifically prohibit scraping. – toonarmycaptain

Answer

This is probably because you are not passing headers.

If you take a look at the Network tab in Chrome's developer tools, these are the headers the browser sends:

:authority:www.zillow.com 
:method:GET 
:path:/homes/for_sale/Los-Angeles-CA_rb/?fromHomePage=true&shouldFireSellPageImplicitClaimGA=false&fromHomePageTab=buy 
:scheme:https 
accept:text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8 
accept-encoding:gzip, deflate, br 
accept-language:en-US,en;q=0.8 
upgrade-insecure-requests:1 
user-agent:Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/61.0.3163.100 Safari/537.36 

However, if you try to send all of them, it will fail, because requests does not let you send headers whose names begin with a colon ':'.
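You can see that for yourself with a short sketch; the exact exception type depends on the requests/Python version, so this just catches it broadly:

import requests

# Sketch: trying to send an HTTP/2 pseudo-header such as ':authority'.
# On current versions, requests (via the standard library's http.client)
# rejects header names containing a colon, so an exception is raised
# instead of the request being sent.
try:
    requests.get('https://www.zillow.com', headers={':authority': 'www.zillow.com'})
except Exception as exc:
    print(type(exc).__name__, exc)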

I tried skipping those four and used the other five in this script, and it worked. So, try this:

from bs4 import BeautifulSoup 
import requests 

req_headers = { 
    'accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8', 
    'accept-encoding': 'gzip, deflate, br', 
    'accept-language': 'en-US,en;q=0.8', 
    'upgrade-insecure-requests': '1', 
    'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/61.0.3163.100 Safari/537.36' 
} 

with requests.Session() as s: 
    url = 'https://www.zillow.com/homes/for_sale/Los-Angeles-CA_rb/?fromHomePage=true&shouldFireSellPageImplicitClaimGA=false&fromHomePageTab=buy' 
    r = s.get(url, headers=req_headers) 

After that, you can use BeautifulSoup to extract the information you need:

soup = BeautifulSoup(r.content, 'lxml') 
price = soup.find('span', {'class': 'zsg-photo-card-price'}).text 
info = soup.find('span', {'class': 'zsg-photo-card-info'}).text 
address = soup.find('span', {'itemprop': 'address'}).text 
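Those find calls only grab the first card on the page. To walk every listing (reusing the zsg-photo-card-caption container from your own code and skipping sponsored cards with missing fields), something like this should work, continuing from the soup object above:

# Iterate over every listing card instead of only the first match.
# The class names are the same ones used above and in the question;
# they reflect Zillow's markup at the time and may change.
for card in soup.find_all('div', {'class': 'zsg-photo-card-caption'}):
    price = card.find('span', {'class': 'zsg-photo-card-price'})
    info = card.find('span', {'class': 'zsg-photo-card-info'})
    address = card.find('span', {'itemprop': 'address'})
    # Sponsored cards can lack some of these spans, so check for None.
    if price and address:
        print(price.text, address.text, info.text if info else '')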

Here is a sample of the data extracted from the page:

+--------------+-----------------------------------------------------------+ 
| $615,000  | 121 S Hope St APT 435 Los Angeles CA 90012    | 
| $330,000  | 4859 Coldwater Canyon Ave APT 14A Sherman Oaks CA 91423 | 
| $3,495,000 | 13446 Valley Vista Blvd Sherman Oaks CA 91423   | 
| $1,199,000 | 6241 Crescent Park W UNIT 410 Los Angeles CA 90094  | 
| $771,472+ | Chase St. And Woodley Ave # HGS0YX North Hills CA 91343 | 
| $369,000  | 8650 Gulana Ave UNIT L2179 Playa Del Rey CA 90293  | 
| $595,000  | 6427 Klump Ave North Hollywood CA 91606     | 
+--------------+-----------------------------------------------------------+ 