2017-10-15

When I run my code, I get this UnicodeEncodeError with Python 3 and BeautifulSoup4:

UnicodeEncodeError: 'ascii' codec can't encode character '\u0303' in position 71: ordinal not in range(128)

Here is my full code:

from urllib.request import urlopen as uReq
from urllib.request import urlretrieve as uRet
from bs4 import BeautifulSoup as soup
import urllib

for x in range(143, 608):
    myUrl = "example.com/" + str(x)
    try:
        uClient = uReq(myUrl)
        page_html = uClient.read()
        uClient.close()
        page_soup = soup(page_html, "html.parser")

        container = page_soup.findAll("div", {"id": "videoPostContent"})

        img_container = container[0].findAll("img")
        images = img_container[0].findAll("img")

        imgCounter = 0

        if len(images) == "":
            for image in images:
                print('Downloading image from ' + image['src'] + '...')
                imgCounter += 1
                uRet(image['src'], 'pictures/' + str(x) + '.jpg')
        else:
            for image in img_container:
                print('Downloading image from ' + image['src'] + '...')
                imgCounter += 1
                uRet(image['src'], 'pictures/' + str(x) + '_' + str(imgCounter) + '.jpg')
    except urllib.error.HTTPError:
        continue

Attempted solutions:

I tried adding .encode('utf-8')/.decode('utf-8') and .text.encode('utf-8')/.decode('utf-8') to page_soup, but that gives this error:

AttributeError: 'str'/'bytes' object has no attribute 'findAll'


Wrap your urlretrieve calls in a try/except for whatever error is being thrown, to skip them? –


Converting page_soup to a string means it is no longer a BeautifulSoup object, so you can't use findAll on it. Which line throws the error? – TheF1rstPancake


On the uRet() line [that is, urlretrieve] – Axis

Answer


At least one of the image src URLs contains non-ASCII characters, and urlretrieve can't handle them:

>>> url = 'http://example.com/' + '\u0303' 
>>> urlretrieve(url) 
Traceback (most recent call last): 
    File "<stdin>", line 1, in <module> 
    ... 
UnicodeEncodeError: 'ascii' codec can't encode character '\u0303' in position 5: ordinal not in range(128) 

You can try one of the following approaches to fix this.

  1. Assume the URLs are valid, and retrieve them with a library that has better Unicode handling, such as requests.

  2. Assume the URLs are valid but contain Unicode characters that must be escaped before being passed to urlretrieve. That requires splitting each URL into scheme, domain, path, etc., quoting the path and any query parameters, then unsplitting; all the tools are in the urllib.parse package (though this is probably what requests does anyway, so just use requests).

  3. Assume the URLs are broken, and skip them by wrapping the urlretrieve call in a try/except UnicodeEncodeError.
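Option 2 could be sketched like this (a minimal illustration, assuming an ASCII hostname; quote_url is a hypothetical helper name, not part of urllib):

```python
from urllib.parse import urlsplit, urlunsplit, quote

def quote_url(url):
    """Percent-encode the non-ASCII parts of a URL so urlretrieve accepts it."""
    parts = urlsplit(url)
    return urlunsplit((
        parts.scheme,
        parts.netloc,                   # assumes an ASCII hostname
        quote(parts.path),              # '/' is kept; non-ASCII bytes are escaped
        quote(parts.query, safe='=&'),  # preserve the key=value&... structure
        quote(parts.fragment),
    ))

# The combining tilde '\u0303' from the original error is now escaped:
# quote_url('http://example.com/pa\u0303th.jpg')
# -> 'http://example.com/pa%CC%83th.jpg'
```

The cleaned URL can then be passed on, e.g. uRet(quote_url(image['src']), ...); wrapping that call in try/except UnicodeEncodeError additionally covers option 3 for URLs that are genuinely broken.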