Extracting meta keywords from a web page?

I need to extract the meta keywords from a web page using Python. I was thinking this could be done with urllib or urllib2, but I'm not sure. Does anyone have any ideas?

I am using Python 2.6 on Windows XP.

2010-07-09 Zac Brown

Make sure to use cached content whenever possible: https://developer.yahoo.com/python/python-caching.html – fedmich 2014-12-18 07:26:14

lxml is faster than BeautifulSoup (I think) and has much better functionality, while remaining relatively easy to use. For example:

>>> from urllib import urlopen
>>> from lxml import etree

>>> f = urlopen("http://www.google.com").read()
>>> tree = etree.HTML(f)
>>> m = tree.xpath("//meta")

>>> for i in m:
...     print etree.tostring(i)
...
<meta http-equiv="content-type" content="text/html; charset=ISO-8859-2"/>

Edit: another example.

>>> f = urlopen("http://www.w3schools.com/XPath/xpath_syntax.asp").read()
>>> tree = etree.HTML(f)
>>> tree.xpath("//meta[@name='Keywords']")[0].get("content")
"xml,tutorial,html,dhtml,css,xsl,xhtml,javascript,asp,ado,vbscript,dom,sql,colors,soap,php,authoring,programming,training,learning,beginner's guide,primer,lessons,school,howto,reference,examples,samples,source code,tags,demos,tips,links,FAQ,tag list,forms,frames,color table,w3c,cascading style sheets,active server pages,dynamic html,internet,database,development,Web building,Webmaster,html guide"

By the way: XPath is worth learning.

Another edit:

Alternatively, you can use a regular expression:

>>> f = urlopen("http://www.w3schools.com/XPath/xpath_syntax.asp").read()
>>> import re
>>> re.search("<meta name=\"Keywords\".*?content=\"([^\"]*)\"", f).group(1)
"xml,tutorial,html,dhtml,css,xsl,xhtml,javascript,asp,ado,vbscript,dom,sql, ...etc...

...but I find it less readable and more error-prone (though it needs only standard modules and still fits on one line).
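If staying within the standard library matters but the regular expression feels too fragile, a parser-based version is one possible middle ground. The sketch below is written for Python 3 and uses `html.parser` (in Python 2.6 the same class lives in the `HTMLParser` module). The names `MetaKeywordsParser` and `extract_meta_keywords` are made up for illustration, and the sample HTML is invented rather than taken from a real page.

```python
from html.parser import HTMLParser


class MetaKeywordsParser(HTMLParser):
    """Collects the keywords of every <meta name="keywords"> tag (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.keywords = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names, but not attribute values.
        if tag != "meta":
            return
        attr_map = dict(attrs)
        name = attr_map.get("name") or ""
        if name.lower() == "keywords":
            content = attr_map.get("content") or ""
            # Split the comma-separated list and drop empty entries.
            self.keywords.extend(
                part.strip() for part in content.split(",") if part.strip()
            )


def extract_meta_keywords(html_text):
    """Return the list of meta keywords found in html_text."""
    parser = MetaKeywordsParser()
    parser.feed(html_text)
    return parser.keywords


# Invented sample page; attribute order and letter case do not matter here.
sample = '<html><head><meta content="xml, tutorial,html" name="Keywords"></head></html>'
print(extract_meta_keywords(sample))
```

Because the parser hands over attributes as name/value pairs, this version does not care whether `name` comes before `content` or whether the tag is self-closing.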


2010-07-09 19:34:10 cji

OK, but where are the keywords in the document? I need to check the keywords in the metadata against my list. – 2010-07-09 19:51:42

As you can see, they are in the 'content' attribute of the <meta> tag whose 'name' attribute is 'Keywords' :) – cji 2010-07-09 20:07:30
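Once the 'content' string is in hand, checking it against a list comes down to splitting on commas and comparing. A minimal sketch in Python 3, assuming a case-insensitive comparison is wanted; `split_keywords`, `find_matches`, and `my_list` are hypothetical names, and the keyword string is a shortened copy of the output shown in the answer:

```python
def split_keywords(content):
    """Split a meta keywords string into lowercase, whitespace-stripped keywords."""
    return [part.strip().lower() for part in content.split(",") if part.strip()]


def find_matches(content, wanted):
    """Return the entries of `wanted` that appear among the page's keywords."""
    page_keywords = set(split_keywords(content))
    return [word for word in wanted if word.lower() in page_keywords]


content = "xml,tutorial,html,dhtml,css"   # value of the 'content' attribute
my_list = ["HTML", "python", "css"]        # hypothetical list to check against
print(find_matches(content, my_list))
```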

Also make sure to use cached content whenever possible: https://developer.yahoo.com/python/python-caching.html – fedmich 2014-12-18 07:25:52
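One simple form of caching is to keep downloaded pages in a dictionary so the same URL is fetched only once per run. This is a sketch of that idea in Python 3, not necessarily the approach described at the linked page; `fetch_page`, `_download`, and `_page_cache` are hypothetical names, and the cache never expires entries.

```python
from urllib.request import urlopen

# In-memory cache: URL -> raw page bytes. It lives only as long as the process.
_page_cache = {}


def _download(url):
    """Fetch the raw bytes of a page over the network."""
    with urlopen(url) as response:
        return response.read()


def fetch_page(url, download=_download):
    """Return the page body for url, downloading it only on the first request."""
    if url not in _page_cache:
        _page_cache[url] = download(url)
    return _page_cache[url]
```

The `download` parameter exists only so that a stand-in function can be passed during testing instead of hitting the network.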

BeautifulSoup is a good way to parse HTML in Python.

In particular, check out the findAll method: http://www.crummy.com/software/BeautifulSoup/documentation.html


2010-07-09 19:17:55

Why not use a regular expression?

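import re

# html is assumed to already hold the page source as a string
keywordregex = re.compile(r'<meta\sname=["\']keywords["\']\scontent=["\'](.*?)["\']\s/>')

keywordlist = keywordregex.findall(html)
if len(keywordlist) > 0:
    # keep the first match and split it into individual keywords
    keywordlist = keywordlist[0]
    keywordlist = keywordlist.split(", ")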


2013-10-23 15:01:51

Because http://stackoverflow.com/a/1732454/476716 – OrangeDog 2016-06-23 15:12:32
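To make the objection concrete, here is a small Python 3 check of the pattern from this answer against a few invented `<meta>` tags that a browser would treat the same way. Only the pattern comes from the answer; the sample strings are made up for illustration.

```python
import re

# The pattern from the answer, joined onto one line.
keywordregex = re.compile(r'<meta\sname=["\']keywords["\']\scontent=["\'](.*?)["\']\s/>')

samples = [
    '<meta name="keywords" content="python, html" />',  # exact shape the pattern expects
    '<meta content="python, html" name="keywords" />',  # attributes in the other order
    '<meta name="Keywords" content="python, html" />',  # different letter case
    '<meta name="keywords" content="python, html">',    # no closing slash
]

results = [keywordregex.findall(s) for s in samples]
for sample, found in zip(samples, results):
    print(found, sample)
```

Only the first sample yields a match; the other three return an empty list, which is the kind of silent failure an HTML parser avoids.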
