2017-08-11 82 views

Web Crawler: TypeError: coercing to Unicode: need string or buffer, NoneType found

I am new to Python. I have written my own web crawler as a practice exercise, and it targets Yelp.


I keep getting this error and can't seem to get past the first page:

Traceback (most recent call last): 
    File "<stdin>", line 1, in <module> 
    File "<stdin>", line 26, in yelpSpider 
    TypeError: coercing to Unicode: need string or buffer, NoneType found 

Here is my code:

import requests 
from BeautifulSoup import BeautifulSoup 
def yelpSpider(maxPages): 
    page = 0 
    listURL = [] 
    listRATE = [] 
    listAREA = [] 
    listADDRESS = [] 
    listType = [] 
    while page <= maxPages: 
        url = 'https://www.yelp.com/search?find_desc=Restaurants&find_loc=Manhattan,+NY&start=0' + str(page)
        sourceCode = requests.get(url)
        plainText = sourceCode.text
        soup = BeautifulSoup(plainText)
        for bizName in soup.findAll('a',{'class':'biz-name js-analytics-click'}):
            href = 'https://www.yelp.com.com' + bizName.get('href')
            listURL.append(href)
        for rating in soup.findAll('img',{'class':'offscreen'}):
            stars = rating.get('alt')
            listRATE.append(stars)
        for area in soup.findAll('span',{'class':'neighborhood-str-list'}):
            listAREA.append(area.string)
        for type in soup.findAll('span',{'class':'category-str-list'}):
            listType.append(type)
        for tracker in range(int(page),int(page) + 10):
            print(listURL[tracker])
            print(' ')
            print(listAREA[tracker] + ' | ' + listRATE[tracker])
        page += 10

yelpSpider(20) 

Thanks for your help!

Comment: Change the final print to the following to cope with the None values in your listRATE: print('{} | {}'.format(listAREA[tracker], listRATE[tracker]))

Answer

The problem is in print(listAREA[tracker] + ' | ' + listRATE[tracker]).

It happens because your listRATE comes out as:

['4.5 star rating', 
'4.5 star rating', 
'4.5 star rating', 
'4.0 star rating', 
'4.0 star rating', 
'4.0 star rating', 
'4.0 star rating', 
'5.0 star rating', 
'4.5 star rating', 
'4.0 star rating', 
None, 
None, 
'4.0 star rating', 
'4.5 star rating', 
'4.0 star rating', 
'3.0 star rating', 
'4.0 star rating', 
'3.5 star rating', 
'4.5 star rating', 
'4.5 star rating', 
'5.0 star rating', 
'4.0 star rating', 
None, 
None] 

As you can see, the value at index 10 is None, so the error occurs when tracker reaches 10. None cannot be used in string concatenation.
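
To see the failure in isolation, here is a minimal sketch. The sample values are made up for illustration, and it assumes that the None entries come from rating.get('alt') being called on an img tag that has no alt attribute. On Python 3 the exception message is worded differently from the Python 2 message in the traceback, but the exception type is the same.

```python
area = 'Midtown West'  # hypothetical sample value, not taken from the real page
rating = None          # assumed result of rating.get('alt') when alt is missing

try:
    # Same shape as the failing print() argument in the question.
    line = area + ' | ' + rating
except TypeError as exc:
    # str + None is not defined, so Python raises TypeError.
    line = 'TypeError: {}'.format(exc)

print(line)
```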

So you have a few options. One is to use an or expression to substitute '' for None. Your code would become:

print((listAREA[tracker] or '') + ' | ' + (listRATE[tracker] or '')) 
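
The or fallback can be tried out on a few rows without hitting the network. This is only a sketch: join_fields and the sample rows are hypothetical names and data, not part of the original crawler.

```python
def join_fields(area, rate, fallback=''):
    """Join two fields with ' | ', replacing None with a fallback.

    Hypothetical helper for illustration only.
    """
    return (area or fallback) + ' | ' + (rate or fallback)


# Made-up rows standing in for (listAREA[tracker], listRATE[tracker]).
rows = [
    ('Midtown West', '4.5 star rating'),
    ('Chelsea', None),
    (None, None),
]

lines = [join_fields(area, rate) for area, rate in rows]
for line in lines:
    print(line)
```

Note that or replaces any falsy value, so an empty string is also swapped for the fallback. With '' as the fallback that makes no difference, but it matters if a visible placeholder such as 'N/A' is used.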

Another option is to replace the None entries in the list before printing:

listRATE = list(map(lambda text: text if text is not None else 'N/A', listRATE)) 

After running the line above, your list will look like this:

['4.5 star rating', 
'4.5 star rating', 
'4.5 star rating', 
'4.0 star rating', 
'4.0 star rating', 
'4.0 star rating', 
'4.0 star rating', 
'5.0 star rating', 
'4.5 star rating', 
'4.0 star rating', 
'N/A', 
'N/A', 
'4.0 star rating', 
'4.5 star rating', 
'4.0 star rating', 
'3.0 star rating', 
'4.0 star rating', 
'3.5 star rating', 
'4.5 star rating', 
'4.5 star rating', 
'5.0 star rating', 
'4.0 star rating', 
'N/A', 
'N/A'] 
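
The same replacement can also be written as a list comprehension, which some readers find easier to follow than map with a lambda. A minimal sketch on a shortened, made-up sample list (list_rate and cleaned are illustrative names):

```python
# Shortened sample standing in for the real listRATE.
list_rate = ['4.5 star rating', None, '4.0 star rating', None]

# Keep each entry unless it is None, in which case use 'N/A'.
cleaned = [text if text is not None else 'N/A' for text in list_rate]

print(cleaned)
```

Unlike the or approach, the explicit "is not None" check leaves empty strings untouched and only replaces None.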