Can we use XPath with BeautifulSoup?

I am using BeautifulSoup to scrape a URL, and I have the following code:

import urllib 
import urllib2 
from BeautifulSoup import BeautifulSoup 

url = "http://www.example.com/servlet/av/ResultTemplate=AVResult.html" 
req = urllib2.Request(url) 
response = urllib2.urlopen(req) 
the_page = response.read() 
soup = BeautifulSoup(the_page) 
soup.findAll('td',attrs={'class':'empformbody'})

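In the code above we can use findAll to get the tags and the information related to them, but I want to use XPath. Is it possible to use XPath with BeautifulSoup? If so, could anyone give me some example code? That would be more helpful.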

2012-07-13 shiva krishna

108

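No, BeautifulSoup by itself does not support XPath expressions.

An alternative library, lxml, does support XPath 1.0. It has a BeautifulSoup-compatible mode in which it tries to parse broken HTML the way Soup does. However, the default lxml HTML parser does just as good a job of parsing broken HTML, and I believe it is faster.

Once you have parsed your document into an lxml tree, you can use the .xpath() method to search for elements.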

import urllib2 
from lxml import etree 

url = "http://www.example.com/servlet/av/ResultTemplate=AVResult.html" 
response = urllib2.urlopen(url) 
htmlparser = etree.HTMLParser() 
tree = etree.parse(response, htmlparser) 
tree.xpath(xpathselector)
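
For readers on Python 3, here is a minimal, self-contained sketch of the same idea. It parses an inline HTML string instead of fetching a URL, so the sample markup and the concrete XPath expression are assumptions made for illustration and are not part of the original answer.

```python
from lxml import etree

# Hypothetical sample markup standing in for the fetched page.
SAMPLE_HTML = """
<html><body><table>
  <tr><td class="empformbody">Alice</td><td class="other">x</td></tr>
  <tr><td class="empformbody">Bob</td></tr>
</table></body></html>
"""

# The HTML parser tolerates markup that is not well-formed XML.
parser = etree.HTMLParser()
tree = etree.fromstring(SAMPLE_HTML, parser)

# Select every <td> whose class attribute is exactly "empformbody".
cells = tree.xpath('//td[@class="empformbody"]')
cell_texts = [cell.text for cell in cells]
print(cell_texts)
```

Note that `@class="empformbody"` only matches when the attribute holds that single class name; an element carrying several classes would need a `contains()`-style test instead.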

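You may also be interested in the CSS Selector support; the CSSSelector class translates CSS statements into XPath expressions, which makes your search for td.empformbody that much easier: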

from lxml.cssselect import CSSSelector 

td_empformbody = CSSSelector('td.empformbody') 
for elem in td_empformbody(tree): 
    # Do something with these table cells.
    pass  # placeholder so the loop body is valid Python

Coming full circle: BeautifulSoup itself does have quite decent CSS selector support:

for cell in soup.select('table#foobar td.empformbody'): 
    # Do something with these table cells.
    pass  # placeholder so the loop body is valid Python
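
As a rough, runnable illustration of the select() call, the sketch below applies the same selector to a small inline document. The sample markup is hypothetical, and it uses the bs4 package with the built-in "html.parser" backend, which is an assumption rather than something the answer specifies.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup; only the first table has id="foobar".
SAMPLE_HTML = """
<table id="foobar">
  <tr><td class="empformbody">Alice</td><td class="other">x</td></tr>
  <tr><td class="empformbody">Bob</td></tr>
</table>
<table id="elsewhere"><tr><td class="empformbody">Carol</td></tr></table>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

# The CSS selector restricts the match to cells inside table#foobar.
cell_texts = [cell.get_text() for cell in soup.select("table#foobar td.empformbody")]
print(cell_texts)
```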

2012-07-13 07:31:41

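Thank you very much, Pieters. I got two pieces of information from your code: 1. a clarification that we cannot use XPath with BS, and 2. a nice example of how to use lxml. Can we see it stated in some particular documentation that "we cannot implement XPath using BS"? We should show some proof to those who ask for clarification, right? – 2012-07-13 08:01:16

Anyway, thanks for your precious help – 2012-07-13 08:01:42

It is hard to prove a negative; the [BeautifulSoup 4 documentation](http://www.crummy.com/software/BeautifulSoup/bs4/doc/) has a search function, and there are no hits for 'xpath'. – 2012-07-13 08:06:58

I searched through their docs and there seems to be no XPath option. Also, as you can see in a similar question on SO, the OP is asking for a translation from XPath to BeautifulSoup, so my conclusion would be: no, there is no XPath parsing available.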

2012-07-13 07:30:25 Nikola

[`scrapy`](http://scrapy.org/) is another option for getting lxml to work with BS, actually – inspectorG4dget 2012-07-13 07:38:33

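Yes, actually until now I used scrapy, which uses XPath to fetch the data inside tags. It is very handy and makes fetching data easy, but I need to do the same with BeautifulSoup, so I am looking into it. – 2012-07-13 07:46:48

I can confirm that there is no XPath support within Beautiful Soup.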

2012-07-13 11:44:45

+46

Note: Leonard Richardson is the author of Beautiful Soup, as you will see if you click through to his user profile. – senshin 2014-05-14 05:30:37

+13

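It would be very nice to be able to use XPath in BeautifulSoup – DarthOpto 2014-12-02 20:42:30

So what is the alternative? – 2017-05-08 11:04:22

BeautifulSoup has a function named findNext that searches forward from the current element, starting with its children, so: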

father.findNext('div',{'class':'class_value'}).findNext('div',{'id':'id_value'}).findAll('a')

The code above can roughly imitate the following XPath:

div[@class="class_value"]/div[@id="id_value"]//a
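
To make the comparison concrete, here is a small sketch of that chain using the bs4 spellings find_next and find_all. The sample markup and the `father` lookup are hypothetical. One caveat worth stating: find_next walks forward through the document and is not limited to children, so the chain only approximates a child path. A CSS selector with the `>` combinator is included for comparison, since it does enforce the parent/child step.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup for illustration.
SAMPLE_HTML = """
<div id="father">
  <div class="class_value">
    <div id="id_value"><a href="/one">one</a><a href="/two">two</a></div>
  </div>
</div>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")
father = soup.find("div", id="father")

# Chain of find_next calls, as in the answer (bs4 method names).
chained = (
    father.find_next("div", {"class": "class_value"})
    .find_next("div", {"id": "id_value"})
    .find_all("a")
)

# CSS selector that requires div#id_value to be a direct child.
selected = father.select("div.class_value > div#id_value a")

hrefs_chained = [a["href"] for a in chained]
hrefs_selected = [a["href"] for a in selected]
print(hrefs_chained, hrefs_selected)
```

On this sample both approaches find the same links, but on a document where the second div sits outside the first, only the find_next chain would still match.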

2014-07-09 13:11:19 user3820561

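Martijn's code no longer works properly (it is 4+ years old by now...): the etree.parse() line prints to the console and does not assign a value to the tree variable. Working from another reference, I was able to figure out that the following works using requests and lxml: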

from lxml import html 
import requests 

page = requests.get('http://econpy.pythonanywhere.com/ex/001.html') 
tree = html.fromstring(page.content) 
# This will create a list of buyers:
buyers = tree.xpath('//div[@title="buyer-name"]/text()')
# This will create a list of prices:
prices = tree.xpath('//span[@class="item-price"]/text()')

print('Buyers: ', buyers)
print('Prices: ', prices)

2017-01-06 21:38:07 wordsforthewise

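This is a pretty old thread, but there is a workaround now that may not have been in BeautifulSoup at the time.

Here is an example of what I did. I use the "requests" module to read an RSS feed and get its text content in a variable called "rss_text". With that, I run it through BeautifulSoup, search for the XPath /rss/channel/title, and retrieve its contents. It is not exactly XPath in all its glory (wildcards, multiple paths, etc.), but if you just have a basic path you want to locate, this works.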

from bs4 import BeautifulSoup 
rss_obj = BeautifulSoup(rss_text, 'xml') 
cls.title = rss_obj.rss.channel.title.get_text()
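
A self-contained variant of this approach might look like the sketch below. The inline feed text replaces the requests call and is made up for illustration, and the 'xml' parser assumes that lxml is installed alongside bs4.

```python
from bs4 import BeautifulSoup

# Hypothetical RSS text standing in for the downloaded feed.
rss_text = """<rss version="2.0">
  <channel>
    <title>Example Feed</title>
    <item><title>First post</title></item>
  </channel>
</rss>"""

rss_obj = BeautifulSoup(rss_text, "xml")

# Attribute-style navigation roughly mirrors the path /rss/channel/title.
feed_title = rss_obj.rss.channel.title.get_text()
print(feed_title)
```

Each attribute step returns the first descendant with that tag name, not strictly a direct child, so this mirrors the path only when the document is laid out the way the path assumes.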

2017-12-15 08:35:00
