How to scrape latitude/longitude data from an HTML page using BeautifulSoup

I am trying to scrape the latitude and longitude data from this website:

http://www.healthgrades.com/provider-search-directory/search?q=Dentistry&prof.type=provider&search.type=&method=&loc=New+York+City%2C+NY+&pt=40.71455%2C-74.007118&isNeighborhood=&locType=%7Cstate%7Ccity&locIsSolrCity=false

For each provider, if you inspect the element, it looks like this:

div class="listing" data-lat="40.66862" data-lng="-73.98574" data-listing="22"

How can I use BeautifulSoup to extract the latitude and longitude values here?

I tried using a regular expression in my script.

Here is my script:

Geo = soup.find("div", class_="providerSearchResults") 
print Geo.findAll("div", data-lat_= re.compile('[0-9.]'))

But I get this error message: "SyntaxError: keyword can't be an expression"

In addition, the part after "div" varies from provider to provider. It can be:

div class="listing" data-lat="40.66862" data-lng="-73.98574" data-listing="22"

or

div class="listingfirst" data-lat="40.66862" data-lng="-73.98574" data-listing="22"

or even

div class="listing enhancedlisting" data-lat="40.66862" data-lng="-73.98574" data-listing="22"

2015-11-03 backpackerice

Python's regular expression package ([re](https://docs.python.org/3.5/library/re.html)) has no attribute/method '.find', which is why you're getting that error. – Rejected

First, a few requirements:

pip install requests 
pip install beautifulsoup4
pip install lxml

latlongbs4.py:

import requests
from bs4 import BeautifulSoup

# Fetch the search results page.
r = requests.get('http://www.healthgrades.com/provider-search-directory/search?q=Dentistry&prof.type=provider&search.type=&method=&loc=New+York+City%2C+NY+&pt=40.71455%2C-74.007118&isNeighborhood=&locType=%7Cstate%7Ccity&locIsSolrCity=false')
soup = BeautifulSoup(r.text, 'lxml')
# data-* names are not valid keyword arguments, so pass them through attrs.
latlonglist = soup.find_all(attrs={"data-lat": True, "data-lng": True})
for latlong in latlonglist:
    print(latlong['data-lat'], latlong['data-lng'])

Edit: removed class from the attrs dictionary.

Output:

(latlongbs4)macbook:latlongbs4 joeyoung$ python latlongbs4.py 
40.71851 -74.00984 
40.77536 -73.97707 
40.71961 -74.00347 
40.71395 -74.008 
40.711614 -74.015901 
40.724576 -74.001771 
40.7175 -74.00087 
40.71961 -74.00347 
40.71766 -73.99293 
40.71961 -74.00347 
40.71848 -73.99648 
40.709917 -74.009884 
40.71553 -74.00977 
40.71702 -73.996 
40.71254 -73.99994 
40.70869 -74.01164 
40.70994 -74.00764 
40.707325 -74.003982 
40.7184 -74.00098 
40.71373 -74.00812 
40.710474 -74.009844 
40.7175 -74.00087 
40.727582 -73.894632 
40.763469 -73.963106 
40.724853 -73.841097

A few notes:

I used the attrs keyword with a dictionary because:

Some attributes, like the data-* attributes in HTML 5, have names that can’t be used as the names of keyword arguments:

You can use these attributes in searches by putting them into a dictionary and passing the dictionary into find_all() as the attrs argument:

Source: http://www.crummy.com/software/BeautifulSoup/bs4/doc/#the-keyword-arguments
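As a minimal offline sketch of the point quoted above: `data-lat=...` cannot be written as a keyword argument because Python parses it as the expression `data - lat`, which is what triggers the SyntaxError in the question. Passing the names through `attrs` avoids that. The HTML string below is a made-up sample modeled on the element shown in the question, and `html.parser` is used here only so the sketch does not depend on lxml.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup modeled on the element in the question.
html = (
    '<div class="listing" data-lat="40.66862" data-lng="-73.98574" '
    'data-listing="22"></div>'
    '<div class="listingInner"></div>'
)
soup = BeautifulSoup(html, "html.parser")

# soup.find_all("div", data-lat=True) would be a SyntaxError, because
# "data-lat" is parsed as the subtraction "data - lat".
# Putting the attribute names in a dictionary sidesteps the problem;
# True means "the attribute is present, whatever its value".
matches = soup.find_all("div", attrs={"data-lat": True, "data-lng": True})

# Attribute values come back as strings, so convert them if numbers are needed.
pairs = [(float(d["data-lat"]), float(d["data-lng"])) for d in matches]
print(pairs)
```

The second div has no `data-lat` attribute, so it is skipped without any class-based filtering.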

2015-11-03 23:16:06

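I just realized there is a problem with using this code. As I said, the class name after div changes from provider to provider. So if I only use div class="listing", I will miss some providers. – backpackerice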

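As long as the div still contains the 'data-lat' and 'data-lng' attributes, you can take "class": "listing" out of the dictionary. When I tried it on the URL, I didn't see any cases like that. –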

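You can find more details in my original question. Also, I tried using a regular expression on "listing", for example "^listing.*". However, that gives me some useless data, such as div class=listingInner or div class=listingBody – backpackerice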
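If filtering by class is still wanted, one possible approach (a sketch, not something tested against the live page) is to compare whole class tokens instead of using a prefix regex, so that "listing", "listingfirst" and "listing enhancedlisting" match while "listingInner" and "listingBody" do not. The HTML, the `WANTED` set and the `is_provider` helper below are all hypothetical names and sample data assembled from the class names mentioned in this thread.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup built from the class names mentioned in the thread.
html = """
<div class="providerSearchResults">
  <div class="listing" data-lat="40.66862" data-lng="-73.98574" data-listing="22">
    <div class="listingInner"><div class="listingBody">A</div></div>
  </div>
  <div class="listingfirst" data-lat="40.71851" data-lng="-74.00984" data-listing="1"></div>
  <div class="listing enhancedlisting" data-lat="40.77536" data-lng="-73.97707" data-listing="2"></div>
</div>
"""

# Assumed set of class tokens that mark a provider entry.
WANTED = {"listing", "listingfirst"}


def is_provider(tag):
    """Return True for div tags carrying at least one wanted class token."""
    classes = tag.get("class") or []  # class is parsed into a list of tokens
    return tag.name == "div" and bool(WANTED.intersection(classes))


soup = BeautifulSoup(html, "html.parser")
coords = [(t["data-lat"], t["data-lng"]) for t in soup.find_all(is_provider)]
print(coords)
```

Because the comparison is on complete tokens, "listingInner" never matches "listing". Matching on the `data-lat` and `data-lng` attributes alone, as in the answer, remains the simpler option when every provider div carries them.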
