How do I extract text from an HTML page?

https://www.architecture.com/FindAnArchitect/FAAPractices.aspx?display=50

I need the company names along with their addresses and websites. I tried the following to convert the HTML to text:

import nltk
from urllib import urlopen

url = "https://www.architecture.com/FindAnArchitect/FAAPractices.aspx?display=50"
html = urlopen(url).read()
raw = nltk.clean_html(html)
print(raw)

But it returns this error:

ImportError: cannot import name 'urlopen'

Source

2015-11-06 Nique

You are using the [Python 3 `urllib`](https://docs.python.org/3/library/urllib.html), which is different from the [Python 2 `urllib`](https://docs.python.org/2/library/urllib.html). –

Pretty sure you will be disappointed once you get that working: [`clean_html`](http://www.nltk.org/_modules/nltk/util.html#clean_html) is not implemented. See [this question](http://stackoverflow.com/questions/26002076/python-nltk-clean-html-not-implemented). –

Peter Wood has already answered your question (link). In Python 3, `urlopen` lives in `urllib.request`:

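import urllib.request

url = "https://www.architecture.com/FindAnArchitect/FAAPractices.aspx?display=50"
uf = urllib.request.urlopen(url)  # Python 3: urlopen moved to urllib.request
html = uf.read()  # returns bytes; decode before treating it as text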
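If the same script has to run under both Python 2 and Python 3, one common pattern is to try the Python 3 import first and fall back to the Python 2 location. The sketch below is only an illustration: `fetch_html` is a hypothetical helper name, and the UTF-8 default is an assumption rather than something the site guarantees.

```python
from contextlib import closing

try:
    from urllib.request import urlopen  # Python 3 location
except ImportError:
    from urllib import urlopen  # Python 2 location


def fetch_html(url, charset="utf-8"):
    """Download a page and return it as text (hypothetical helper)."""
    # closing() guarantees the connection is released on both Python versions
    with closing(urlopen(url)) as response:
        raw = response.read()
    # read() returns bytes; undecodable bytes are replaced instead of raising
    return raw.decode(charset, "replace")
```

Calling `fetch_html(url)` with the URL from the question would then give you a string you can hand to a parser.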

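However, if you want to extract data (such as the company name, address, and website), you will need to fetch the HTML source and parse it with an HTML parser.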

I suggest using requests to fetch the HTML source and BeautifulSoup to parse the resulting HTML and extract the text you need.
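As an aside on the plain-text conversion the question attempted: since `nltk.clean_html` is not implemented, BeautifulSoup's `get_text()` can do roughly the same job. This is a minimal sketch, not a drop-in replacement. `html_to_text` and the sample markup are made up for illustration, and `"html.parser"` is used here only so the example does not depend on lxml.

```python
from bs4 import BeautifulSoup


def html_to_text(html):
    """Strip markup and return the visible text with whitespace collapsed."""
    soup = BeautifulSoup(html, "html.parser")
    # Script and style bodies are not visible text, so remove them first
    for tag in soup.find_all(["script", "style"]):
        tag.decompose()
    text = soup.get_text(separator=" ")
    # Collapse runs of whitespace left behind by the removed tags
    return " ".join(text.split())


# Made-up sample markup, not taken from the real page
sample = (
    "<html><head><style>p {color: red}</style></head>"
    "<body><h1>Practices</h1><p>Example <b>Architects</b> Ltd</p></body></html>"
)
print(html_to_text(sample))
```

This flattens the whole page into one string, so it is fine for a quick look at the content but loses the structure you need for picking out individual fields.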

Here is a small snippet that should give you a good start.

import requests
from bs4 import BeautifulSoup

link = "https://www.architecture.com/FindAnArchitect/FAAPractices.aspx?display=50"

# If you do not want to use requests, fetch the page with urllib instead
# (see the urllib.request snippet above); the parsing code stays the same.
html = requests.get(link).text

# The "lxml" parser requires the lxml package to be installed
soup = BeautifulSoup(html, "lxml")
res = soup.find_all("article", {"class": "listingItem"})
for r in res:
    print("Company Name: " + r.find('a').text)
    print("Address: " + r.find("div", {'class': 'address'}).text)
    # The website is read from the fourth "pageMeta-item" block of each listing
    print("Website: " + r.find_all("div", {'class': 'pageMeta-item'})[3].text)
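One thing to watch out for: `r.find(...)` returns `None` when an element is missing, and `[3]` raises `IndexError` when a listing has fewer than four `pageMeta-item` blocks. A slightly more defensive variant might look like the sketch below. The class names come from the snippet above, but `SAMPLE_HTML`, `text_or_none`, and `parse_listings` are made up for illustration, and the real page may be structured differently.

```python
from bs4 import BeautifulSoup

# Made-up markup that only mimics the class names used in the answer
SAMPLE_HTML = """
<article class="listingItem">
  <h3><a href="/practice/1">Example Architects Ltd</a></h3>
  <div class="address">1 Sample Street, London</div>
  <div class="pageMeta-item">Tel: 000</div>
  <div class="pageMeta-item">Fax: 000</div>
  <div class="pageMeta-item">info@example.com</div>
  <div class="pageMeta-item">www.example.com</div>
</article>
<article class="listingItem">
  <h3><a href="/practice/2">No Address Studio</a></h3>
</article>
"""


def text_or_none(tag):
    """Return the stripped text of a tag, or None when the tag was not found."""
    return tag.get_text(strip=True) if tag is not None else None


def parse_listings(html):
    """Collect name, address and website for every listing in the HTML."""
    soup = BeautifulSoup(html, "html.parser")
    results = []
    for item in soup.find_all("article", class_="listingItem"):
        meta = item.find_all("div", class_="pageMeta-item")
        results.append({
            "name": text_or_none(item.find("a")),
            "address": text_or_none(item.find("div", class_="address")),
            # Fourth pageMeta-item as in the answer; None when it is absent
            "website": text_or_none(meta[3]) if len(meta) > 3 else None,
        })
    return results


for listing in parse_listings(SAMPLE_HTML):
    print(listing)
```

Returning dictionaries instead of printing directly also makes it easier to write the results to a CSV file later.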

Source

2015-11-06 12:34:28 JRodDynamite

This doesn't help them understand the error. –

@PeterWood - I have updated my answer. Hope it helps. – JRodDynamite
