I am using the urllib2, BeautifulSoup, and topia.termextract modules in Python 2.7 to extract terms from a website. Calling the extractor directly on a string prints:
>>> extractor("he is Programmer, Visionary Entrepreneur and Investor ")
[('Entrepreneur', 1, 1), ('Programmer', 1, 1), ('Visionary', 1, 1), ('Investor', 1, 1), ('Visionary Entrepreneur', 1, 2)]
It also works fine on a whole paragraph of text, but inside the loop below the tuples come out differently:
>>> def getTerms(website):
...     page = urllib2.urlopen(website)
...     text = page.read()
...     soup = BeautifulSoup(text)
...     for para in soup.findAll('p'):
...         print extractor(para.text)
Passing a web page URL to the function above prints:
[(u'Entrepreneur', 1, 1), (u'Programmer', 1, 1), (u'Visionary', 1, 1), (u'Investor', 1, 1), (u'Visionary Entrepreneur', 1, 2)] .....
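Why is a u printed at the start of each string in the tuples? How can I get the plain tuple form back?
Note: printing only para.text inside the loop above prints the paragraph as plain text, with no u.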
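
The difference described in the note matches how Python 2 displays containers. Printing a list or tuple shows the repr of each element, and the repr of a unicode string carries the u prefix. Printing the string itself does not. Below is a minimal sketch of formatting each tuple before printing. The helper name format_terms is hypothetical, and the sample data is copied from the output above rather than fetched from a real page.

```python
# Sample data copied from the output shown above (not fetched from a site).
terms = [(u'Entrepreneur', 1, 1), (u'Visionary Entrepreneur', 1, 2)]


def format_terms(terms):
    """Hypothetical helper: turn each (term, count, strength) tuple into text.

    String formatting uses the text of the term, not its repr, so the
    u prefix never appears in the result.
    """
    return [u"%s (%d, %d)" % (term, count, strength)
            for term, count, strength in terms]


for line in format_terms(terms):
    print(line)
```

The tuples themselves are unchanged: the u only marks the strings as unicode when they are displayed inside a container.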