Printing extracted terms

I am using the urllib2, BeautifulSoup and topia.termextract modules in Python 2.7 to extract terms from the text of a website.

>>> extractor("he is Programmer, Visionary Entrepreneur and Investor ") 
[('Entrepreneur', 1, 1), ('Programmer', 1, 1), ('Visionary', 1, 1), ('Investor', 1, 1), ('Visionary Entrepreneur', 1, 2)]

This works fine for a single paragraph of text.

But inside the loop below, the tuples come out distorted:

>>> def getTerms(website):
...     page = urllib2.urlopen(website)
...     text = page.read()
...     soup = BeautifulSoup(text)
...     for para in soup.findAll('p'):
...         print extractor(para.text)

Passing a web page URL to the function above prints:

[(u'Entrepreneur', 1, 1), (u'Programmer', 1, 1), (u'Visionary', 1, 1), (u'Investor', 1, 1), (u'Visionary Entrepreneur', 1, 2)] .....

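Why is a u printed at the start of each string in the tuples? How can I get the tuples in plain form?

Note: printing only para.text inside the loop above prints the paragraphs as plain text.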

2014-12-07 Suman K.C

These are Unicode strings (hence the u''). The 'u' is not part of the string; it only indicates the string's type.

>>> s='abc' 
>>> type(s) 
<type 'str'> 
>>> s=u'abc' 
>>> type(s) 
<type 'unicode'>
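
To illustrate the point, the prefix only shows up when Python 2 prints the repr() of a container; formatting each element yourself gives plain text. This is a minimal sketch, assuming tuples shaped like the output in the question. The helper name `format_term` is hypothetical and not part of topia.termextract.

```python
from __future__ import print_function, unicode_literals

# Sample data shaped like the extractor output shown in the question
# (copied by hand, not produced by a live extractor).
terms = [("Entrepreneur", 1, 1), ("Visionary Entrepreneur", 1, 2)]


def format_term(term_tuple):
    """Render one 3-item tuple as text without relying on repr().

    Hypothetical helper: str.format() inserts the string itself,
    so no u'' marker appears even under Python 2.
    """
    term, first, second = term_tuple
    return "({0}, {1}, {2})".format(term, first, second)


lines = [format_term(t) for t in terms]
for line in lines:
    print(line)  # first line printed: (Entrepreneur, 1, 1)
```

The tuples themselves are unchanged; only the way they are displayed differs.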

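If you are dealing with third-party websites, you will need to handle Unicode (because sooner or later you will run into a site that is not in US English).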
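
One common way to handle this, offered here as an assumption rather than something the answer prescribes, is to decode bytes to Unicode as soon as they arrive and encode only when writing them out. The sample bytes and the choice of UTF-8 are illustrative; a real page declares its own encoding.

```python
# Bytes as they might arrive from a non-English page (UTF-8 assumed here).
raw = b"Caf\xc3\xa9 M\xc3\xbcnchen"

# Decode once at the boundary; work with Unicode text from here on.
text = raw.decode("utf-8")

# Lengths differ because two characters take two bytes each in UTF-8.
char_count = len(text)  # 12 characters
byte_count = len(raw)   # 14 bytes

# Encode only when the data leaves the program.
round_trip = text.encode("utf-8")
```
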

Please read this part of the Python documentation thoroughly: https://docs.python.org/2/howto/unicode.html

Or better yet, switch to Python 3, where Unicode is the default string type.
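
For comparison, here is a small sketch of the same data under Python 3 (the interpreter is an assumption, and the tuples are copied from the question rather than produced by a live extractor). Plain string literals are already Unicode, so the repr carries no prefix.

```python
# Python 3: str is Unicode by default.
terms = [("Entrepreneur", 1, 1), ("Visionary Entrepreneur", 1, 2)]

# The repr of the list no longer carries a u marker.
shown = repr(terms)
print(shown)

# Each term is an ordinary str, which is a Unicode string in Python 3.
is_text = isinstance(terms[0][0], str)
```
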

2014-12-07 16:15:43 kdopen
