2014-10-28 69 views

I received a URL: https://www.packtpub.com/virtualization-and-cloud/citrix-xenapp®-75-desktop-virtualization-solutions; it came from BeautifulSoup. How do I pass this URL to urllib2.urlopen?

url=u'https://www.packtpub.com/virtualization-and-cloud/citrix-xenapp\xae-75-desktop-virtualization-solutions' 

I want to feed it back into urllib2.urlopen.

import urllib2 
source = urllib2.urlopen(url).read() 

The error I get:

UnicodeEncodeError: 'gbk' codec can't encode character u'\xae' in position 43: illegal multibyte sequence 

So I tried:

source = urllib2.urlopen(url.encode("utf-8")).read() 

This fetches the page source, but it differs from what the original address returns.

originalUrl = 'https://www.packtpub.com/virtualization-and-cloud/citrix-xenapp®-75-desktop-virtualization-solutions' 
originalSource = urllib2.urlopen(originalUrl).read() 
originalSource == source 

The result is False. Any ideas on how to handle this URL? How do I convert u'\xae' back to the original ®?

Answer


URLs must be valid bytestrings, with non-ASCII code points properly encoded. You need to encode the URL to UTF-8, then URL-quote its path:

import urllib 
import urllib2 
import urlparse 

originalUrl = u'https://www.packtpub.com/virtualization-and-cloud/citrix-xenapp\xae-75-desktop-virtualization-solutions' 
parsed_link = urlparse.urlsplit(originalUrl.encode('utf8')) 
parsed_link = parsed_link._replace(path=urllib.quote(parsed_link.path)) 
encoded_link = parsed_link.geturl() 
source = urllib2.urlopen(encoded_link).read() 

Demo:

>>> import urllib 
>>> import urllib2 
>>> import urlparse 
>>> originalUrl = u'https://www.packtpub.com/virtualization-and-cloud/citrix-xenapp\xae-75-desktop-virtualization-solutions' 
>>> parsed_link = urlparse.urlsplit(originalUrl.encode('utf8')) 
>>> parsed_link = parsed_link._replace(path=urllib.quote(parsed_link.path)) 
>>> encoded_link = parsed_link.geturl() 
>>> encoded_link 
'https://www.packtpub.com/virtualization-and-cloud/citrix-xenapp%C2%AE-75-desktop-virtualization-solutions' 
>>> source = urllib2.urlopen(encoded_link).read() 
>>> len(source) 
68758 
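
For readers on Python 3, where urllib2 and urlparse no longer exist, the same idea can be sketched with urllib.parse. This is only an illustration under stated assumptions, not part of the original answer: the helper name quote_url_path is hypothetical, and the sketch stops before the network request. In Python 3, quote() accepts a str and encodes it as UTF-8 by default, so the explicit .encode('utf8') step is not needed.

```python
from urllib.parse import urlsplit, quote


def quote_url_path(url):
    """Percent-encode only the path component of a URL (hypothetical helper)."""
    parts = urlsplit(url)
    # quote() leaves '/' and unreserved characters alone and encodes
    # everything else as UTF-8 percent-escapes.
    parts = parts._replace(path=quote(parts.path))
    return parts.geturl()


original_url = ('https://www.packtpub.com/virtualization-and-cloud/'
                'citrix-xenapp\xae-75-desktop-virtualization-solutions')
encoded_url = quote_url_path(original_url)
print(encoded_url)
```

The resulting string should match the encoded_link shown in the demo above and could then be passed to urllib.request.urlopen. Note that calling the helper on a path that is already percent-encoded would encode the '%' signs a second time.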

Is there any other convenient way to handle the whole URL, not just URL.path? – user4181172 2014-10-29 01:51:54


Not sure what you mean; if you try to apply 'urllib.quote' to the whole URL, the wrong things will be encoded (such as the colon). – 2014-10-29 07:55:21


@Martijin, thanks. You've answered my question. Just use urllib.quote to encode URL.path. – user4181172 2014-11-01 21:44:03
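
The point made in the comments, that quoting the whole URL encodes the wrong characters, can be checked with a short sketch. It is written for Python 3 (urllib.parse.quote) as an assumption, since the thread itself uses Python 2. The second URL, on example.com, is a made-up example that is not from the thread.

```python
from urllib.parse import quote

url = ('https://www.packtpub.com/virtualization-and-cloud/'
       'citrix-xenapp\xae-75-desktop-virtualization-solutions')

# Quoting the whole URL with the default safe='/' also escapes the colon
# after the scheme, so the result is no longer a usable URL.
whole = quote(url)
print(whole)

# Adding ':' to the safe set happens to work for this particular URL...
relaxed = quote(url, safe=':/')
print(relaxed)

# ...but it still mangles a URL with a query string (hypothetical example),
# because '?' and '=' are escaped as well.
with_query = quote('http://example.com/a b?x=1', safe=':/')
print(with_query)
```

This is why the answer splits the URL first and quotes only the path: each component has its own set of characters that must stay unescaped.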
