Scrapy XPath drops the text after a '<' character

I want to get product information from this page. To get the description (which appears at the bottom of the page), I use the XPath

response.xpath('//*[@itemprop="description"]/table//text()').extract()[3].strip()

This gives me the description:

u'Color: White, Size:Free Size, With the body: Braided, Buckle: Automatic Deduction, With the body width: section ('

whereas the one shown on the website is

Color: White, Size:Free Size, With the body: Braided, Buckle: Automatic Deduction, With the body width: section (<2cm), Belt Length: 93cm 
Product Type: Belts, Accessories

I have verified that the content on the website loads even with JavaScript disabled. What am I missing here?

Source

2015-11-03 Pravesh Jain

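It looks like the text is being cut off because of the '<' symbol; even BeautifulSoup drops the text after '<'... very strange – heinst

This is a 'parsel' bug; I'll follow up on it in the repository [here](https://github.com/scrapy/parsel/issues/23) – eLRuLL

Did that help? – eLRuLL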

This should really be handled without any hack, but you can get it working with:

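from parsel import Selector
...

# `response` is the Scrapy response received in the spider callback.
# Re-parse the body with the XML parser instead of the default HTML parser.
# (Newer Scrapy versions expose the same decoded string as `response.text`.)
s = Selector(text=response.body_as_unicode(), type='xml')
s.xpath('//*[@itemprop="description"]/table//text()').extract()[3].strip()
# gives u'Color: White, Size:Free Size, With the body: Braided, Buckle: Automatic Deduction, With the body width: section (2cm), Belt Length: 93cm'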

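The problem here is that parsel (the parser used inside Scrapy) builds its HTML tree with lxml.etree.HTMLParser(recover=True, encoding='utf8'), which, in recovery mode, discards this kind of stray character to avoid parse errors, and the text after the '<' is lost with it.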
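If switching the selector type to 'xml' is not an option, another possible workaround is to escape the bare '<' before the markup reaches the parser. The sketch below is only an illustration and is not part of parsel or Scrapy: the helper name `escape_stray_lt` is hypothetical, and it assumes that a '<' which does not begin a tag is never followed by a letter, '/', '!' or '?'.

```python
import re

# Assumption: a '<' that is NOT followed by a letter, '/', '!' or '?'
# cannot start a tag, closing tag, comment, doctype or processing
# instruction, so it is treated as literal text.
_STRAY_LT = re.compile(r"<(?![A-Za-z/!?])")


def escape_stray_lt(html):
    """Hypothetical helper: replace bare '<' characters with '&lt;'."""
    return _STRAY_LT.sub("&lt;", html)


snippet = "<td>With the body width: section (<2cm), Belt Length: 93cm</td>"
cleaned = escape_stray_lt(snippet)
print(cleaned)
```

The cleaned string could then be handed to `Selector(text=cleaned)` with the default HTML type. A pattern like this will misfire on markup such as inline scripts that contain comparisons like `a<b`, so it is best applied only to pages where that is known not to occur.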

Source

2015-11-03 15:53:11 eLRuLL
