2014-10-01 64 views
2

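Printing all occurrences of certain document elements of a web page

I am scraping this particular web page, https://www.zomato.com/srijata, for all the "restaurant reviews" posted by the user "Sri" (the reviews themselves, not her own comments on them).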

import urllib2
from bs4 import BeautifulSoup

zomato_ind = urllib2.urlopen('https://www.zomato.com/srijata')
zomato_info = zomato_ind.read()
open('zomato_info.html', 'w').write(zomato_info)
# Parse the saved copy and take the text of the first matching div
soup = BeautifulSoup(open('zomato_info.html'))
soup.find('div', 'mtop0 rev-text').text

This prints her first restaurant review, i.e. "Sri reviewed Big Straw - Chew On This", as:

u'Rated  This is situated right in the heart of the city. The items on the menu are alright and I really had to compromise for bubble tea. The tapioca was not fresh. But the latte and the soda pop my friends tried was good. Another issue which I faced was mosquitos... They almost had me.. Lol..' 

I also tried another option:

My question is:

How do I print the next restaurant review? I tried findNextSiblings and similar methods, but none of them seem to work.

Why save the HTML to a file and then read that file into the soup object? – 2014-10-01 12:22:02

It is a precaution I take to avoid hitting the site repeatedly, so that I stay within its anti-scraping safeguards! – shalini 2014-10-02 05:41:56
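
If the goal is to avoid repeated requests, one possible way to make that explicit is to download only when no local copy exists yet. This is a minimal sketch, not code from the thread: it is written for Python 3 (where `urllib.request.urlopen` replaces `urllib2.urlopen`), and `load_page`, `cache_path` and the `fetch` hook are hypothetical names introduced here for illustration.

```python
import os
from urllib.request import urlopen  # Python 3 counterpart of urllib2.urlopen


def load_page(url, cache_path, fetch=None):
    """Return the page bytes, downloading only when no cached copy exists."""
    if os.path.exists(cache_path):
        # A local copy is already there: read it and skip the network.
        with open(cache_path, "rb") as f:
            return f.read()
    # `fetch` is a hypothetical hook so the download step can be swapped out.
    if fetch is None:
        fetch = lambda u: urlopen(u).read()
    html = fetch(url)
    # Store the bytes so later runs reuse them.
    with open(cache_path, "wb") as f:
        f.write(html)
    return html
```

Calling `load_page('https://www.zomato.com/srijata', 'zomato_info.html')` would then contact the site on the first run only, and the returned bytes can be passed straight to the BeautifulSoup constructor.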

Answers

1

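First, you don't need to write the output to a file; pass the result of the urlopen() call directly to the BeautifulSoup constructor.

To get the reviews, iterate over all div tags with the rev-text class and, for each one, take the .next_sibling of the div element nested inside it: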

import urllib2 
from bs4 import BeautifulSoup 

soup = BeautifulSoup(urllib2.urlopen('https://www.zomato.com/srijata')) 
for div in soup.find_all('div', class_='rev-text'): 
    print div.div.next_sibling 

Prints:

This is situated right in the heart of the city. The items on the menu are alright and I really had to compromise for bubble tea. The tapioca was not fresh. But the latte and the soda pop my friends tried was good. Another issue which I faced was mosquitos... They almost had me.. Lol.. 

The ambience is good. The food quality is good. I Didn't find anything to complain. I wanted to visit the place fir a very long time and had dinner today. The meals are very good and if u want the better quality compared to other Andhra restaurants then this is the place. It's far better than nandhana. The staffs are very polite too. 

... 
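
To see why `div.div.next_sibling` yields the review text, it may help to run the same loop on a small inline snippet. This is a sketch under assumptions: the markup below is invented to resemble the output in the question (a nested "Rated" element followed by the review text) and is not copied from the site, the `left` class is hypothetical, and the code uses Python 3 syntax with the built-in `html.parser`.

```python
from bs4 import BeautifulSoup

# Assumed markup, modelled on the output shown in the question.
html = """
<div class="mtop0 rev-text"><div class="left">Rated</div> First review text. </div>
<div class="rev-text"><div class="left">Rated</div> Second review text. </div>
"""

soup = BeautifulSoup(html, "html.parser")

reviews = []
for div in soup.find_all("div", class_="rev-text"):
    # div.div is the first nested <div> (the "Rated" element);
    # its next_sibling is the text node that follows it.
    reviews.append(div.div.next_sibling.strip())

print(reviews)  # ['First review text.', 'Second review text.']
```

The sibling here is a NavigableString, which is a string rather than a tag, so it is printed directly instead of through `.text`.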

Thanks alecxe, this works, but I am still trying to figure out how. For example, why did you use only "rev-text" rather than "mtop0 rev-text"? – shalini 2014-10-01 14:44:19

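@shalini I used the browser developer tools, inspected several reviews and found that they all follow the 'rev-text' class pattern. That said, there are certainly many ways to locate the reviews on the page. You are free to choose whichever works for you and whichever you consider reliable. Thanks. – alecxe 2014-10-01 14:46:58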

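Alecxe, the issue is that the developer tools show class="mtop0 rev-text". So if I replace "rev-text" with "mtop0 rev-text" in your code, it prints nothing at all. According to the developer tools, "mtop0 rev-text" should work too, shouldn't it? – shalini 2014-10-01 14:59:26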
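
One possible explanation, offered as a guess since the thread does not settle it: when the value passed as `class_` contains a space, Beautiful Soup compares it against the exact string of the class attribute, so a tag whose classes are ordered differently or include extras is skipped. The sketch below uses hypothetical markup (not taken from the site) and Python 3 syntax to contrast the three lookups.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: both divs carry the same two classes, in different order.
html = """
<div class="mtop0 rev-text">first</div>
<div class="rev-text mtop0">second</div>
"""
soup = BeautifulSoup(html, "html.parser")

# A single class name matches any tag that has that class among its classes.
single = [d.text for d in soup.find_all("div", class_="rev-text")]

# A value containing a space is compared with the exact attribute string.
exact = [d.text for d in soup.find_all("div", class_="mtop0 rev-text")]

# A CSS selector requires both classes but ignores their order.
css = [d.text for d in soup.select("div.mtop0.rev-text")]

print(single)  # ['first', 'second']
print(exact)   # ['first']
print(css)     # ['first', 'second']
```

Searching on the single class that all reviews share, as the answer does, sidesteps the ordering question altogether.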

0

You should use a for loop with find_all instead of find:

import urllib2
from bs4 import BeautifulSoup

zomato_ind = urllib2.urlopen('https://www.zomato.com/srijata')
zomato_info = zomato_ind.read()
open('zomato_info.html', 'w').write(zomato_info)
soup = BeautifulSoup(open('zomato_info.html'))
# find_all returns every matching div, not just the first one
for div in soup.find_all('div', 'rev-text'):
    print div.text

One more question: why save the HTML to a file and then read that file into the soup object?

This does not work: print div.text ==> AttributeError: 'NavigableString' object has no attribute 'text' – shalini 2014-10-01 12:19:49

Sorry, try this. I forgot to change find to find_all. – 2014-10-01 12:21:39

It stops after printing only the first review. – shalini 2014-10-01 12:23:22