2015-11-03 122 views
1

How to scrape latitude/longitude data from an HTML page using BeautifulSoup

I am trying to scrape latitude and longitude data from this website:

http://www.healthgrades.com/provider-search-directory/search?q=Dentistry&prof.type=provider&search.type=&method=&loc=New+York+City%2C+NY+&pt=40.71455%2C-74.007118&isNeighborhood=&locType=%7Cstate%7Ccity&locIsSolrCity=false 

For each provider, if you inspect the element, it looks like this:

div class="listing" data-lat="40.66862" data-lng="-73.98574" data-listing="22" 

How can I use BeautifulSoup to extract the latitude and longitude values here?

I tried using a regular expression in my script.

Here is my script:

Geo = soup.find("div", class_="providerSearchResults") 
print Geo.findAll("div", data-lat_= re.compile('[0-9.]')) 

But I get this error message: "SyntaxError: keyword can't be an expression"

Also, the class part of the div varies from provider to provider. It can be:

div class="listing" data-lat="40.66862" data-lng="-73.98574" data-listing="22" 

div class="listingfirst" data-lat="40.66862" data-lng="-73.98574" data-listing="22" 

or even

div class="listing enhancedlisting" data-lat="40.66862" data-lng="-73.98574" data-listing="22" 
+1

The Python regular expression package (['re'](https://docs.python.org/3.5/library/re.html)) has no attribute/method '.find', which is why you're getting that error. – Rejected

Answers

1

First, a few requirements:

pip install requests 
pip install beautifulsoup4 
pip install lxml 

latlongbs4.py:

import requests 
from bs4 import BeautifulSoup 

r = requests.get('http://www.healthgrades.com/provider-search-directory/search?q=Dentistry&prof.type=provider&search.type=&method=&loc=New+York+City%2C+NY+&pt=40.71455%2C-74.007118&isNeighborhood=&locType=%7Cstate%7Ccity&locIsSolrCity=false') 
soup = BeautifulSoup(r.text, 'lxml') 
latlonglist = soup.find_all(attrs={"data-lat": True, "data-lng": True}) 
for latlong in latlonglist: 
    print latlong['data-lat'], latlong['data-lng'] 

Edit: removed class from the attrs dictionary.

Output:

(latlongbs4)macbook:latlongbs4 joeyoung$ python latlongbs4.py 
40.71851 -74.00984 
40.77536 -73.97707 
40.71961 -74.00347 
40.71395 -74.008 
40.711614 -74.015901 
40.724576 -74.001771 
40.7175 -74.00087 
40.71961 -74.00347 
40.71766 -73.99293 
40.71961 -74.00347 
40.71848 -73.99648 
40.709917 -74.009884 
40.71553 -74.00977 
40.71702 -73.996 
40.71254 -73.99994 
40.70869 -74.01164 
40.70994 -74.00764 
40.707325 -74.003982 
40.7184 -74.00098 
40.71373 -74.00812 
40.710474 -74.009844 
40.7175 -74.00087 
40.727582 -73.894632 
40.763469 -73.963106 
40.724853 -73.841097 

A few notes:

I used the attrs keyword with a dictionary because:

Some attributes, like the data-* attributes in HTML 5, have names that can’t be used as the names of keyword arguments:

You can use these attributes in searches by putting them into a dictionary and passing the dictionary into find_all() as the attrs argument:

Source: http://www.crummy.com/software/BeautifulSoup/bs4/doc/#the-keyword-arguments
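To see the attrs approach without hitting the live site, here is a minimal offline sketch. The SAMPLE_HTML markup and the extract_coordinates helper are assumptions made up for illustration (modelled on the div variants quoted in the question), and it uses Python 3 with the built-in "html.parser" backend rather than lxml:

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup, modelled on the div variants quoted in the question.
SAMPLE_HTML = """
<div class="providerSearchResults">
  <div class="listing" data-lat="40.66862" data-lng="-73.98574" data-listing="22"></div>
  <div class="listingfirst" data-lat="40.71851" data-lng="-74.00984" data-listing="1"></div>
  <div class="listing enhancedlisting" data-lat="40.77536" data-lng="-73.97707" data-listing="2"></div>
  <div class="listingInner">no coordinates here</div>
</div>
"""


def extract_coordinates(html):
    """Return (lat, lng) float pairs for every tag that has both data-* attributes."""
    soup = BeautifulSoup(html, "html.parser")
    # "data-lat" contains a hyphen, so it cannot be written as a Python keyword
    # argument; passing a dict through attrs avoids the syntax error.
    # A value of True means "the attribute exists, whatever its value".
    tags = soup.find_all(attrs={"data-lat": True, "data-lng": True})
    return [(float(tag["data-lat"]), float(tag["data-lng"])) for tag in tags]


coordinates = extract_coordinates(SAMPLE_HTML)
for lat, lng in coordinates:
    print(lat, lng)
```

Because the search keys on the data-* attributes rather than on the class, all three listing variants are picked up and the div without coordinates is skipped.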

+0

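I just realized there is a problem with this code. As I said, the class on the div changes from provider to provider. So if I only use div class="listing", I will miss some providers. – backpackerice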

+0

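As long as the div still contains the 'data-lat' and 'data-lng' attributes, you can take '"class": "listing"' out of the dictionary. When I tried it on the URL, I didn't see any cases like that. –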

+0

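You can find more details in my original question. Also, I tried using a regular expression on "listing", such as "^listing.*". However, this gives me some useless data, such as div class=listingInner or div class=listingBody – backpackerice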
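If filtering by class is still wanted, one possible approach is to anchor the pattern at both ends and test each class token separately, so that "listingInner" and "listingBody" no longer match. This is only a sketch: the sample markup, the LISTING_CLASS pattern and the is_provider_listing helper are assumptions for illustration, and the pattern covers only the three class variants mentioned in the question.

```python
import re

from bs4 import BeautifulSoup

# Hypothetical sample markup; the nested divs mimic the unwanted matches
# mentioned in the comment above.
SAMPLE_HTML = """
<div class="listing" data-lat="40.66862" data-lng="-73.98574">
  <div class="listingInner"><div class="listingBody">details</div></div>
</div>
<div class="listingfirst" data-lat="40.71851" data-lng="-74.00984"></div>
<div class="listing enhancedlisting" data-lat="40.77536" data-lng="-73.97707"></div>
"""

# Anchored at both ends: matches "listing" or "listingfirst" only.
LISTING_CLASS = re.compile(r"^listing(first)?$")


def is_provider_listing(tag):
    """True for a div where at least one class token matches LISTING_CLASS."""
    if tag.name != "div":
        return False
    # BeautifulSoup stores class as a list of tokens, e.g. ["listing", "enhancedlisting"].
    return any(LISTING_CLASS.match(token) for token in tag.get("class", []))


soup = BeautifulSoup(SAMPLE_HTML, "html.parser")
matched = soup.find_all(is_provider_listing)
matched_classes = [" ".join(tag["class"]) for tag in matched]
print(matched_classes)
```

The unanchored "^listing.*" matches any class that merely starts with "listing", which is why listingInner and listingBody showed up; the trailing "$" is what excludes them here.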