2016-02-12 73 views

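Scrapy: excluding <p> tags that contain only a non-breaking space, using XPath

In my Scrapy spider, I want to select only the <p> elements that have text content: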

item['Description'] = response.xpath('//*[@id="textepresentation"]//p[string(.)]').extract() 

It works fine, but unfortunately, done this way, I also get empty <p> elements that contain only a non-breaking space:

u'<p>\xa0</p>', 

How can I avoid selecting <p> elements that contain only a non-breaking space, using XPath?

Answers

2

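You can use XPath's normalize-space() string function for this, with a couple of predicates:

  • [normalize-space()] gets you the elements whose string value is non-empty once leading and trailing whitespace has been stripped
  • [not(contains(normalize-space(), "\u00a0"))] is needed because NO-BREAK SPACE is not stripped by normalize-space() (see this other answer, where I checked which characters are handled; you may need to add other characters to the test). Note that this predicate drops any <p> whose text contains a no-break space anywhere, not only those made up of one.

Sample: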

>>> import scrapy 
>>> selector = scrapy.Selector(text=u''' 
... <html> 
...  <p>&nbsp;</p> 
...  <p>something</p> 
...  <p> </p> 
...  <p><a href="http://example.com">some link</a></p> 
... </html> 
... ''') 
>>> selector.xpath(u''' 
...  //p[normalize-space()] 
...  [not(contains(normalize-space(), "\u00a0"))] 
... ''').extract() 
[u'<p>something</p>', u'<p><a href="http://example.com">some link</a></p>'] 
>>> 
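
To see why the second predicate is needed, here is a minimal pure-Python sketch that approximates normalize-space() as XPath 1.0 defines it, where only space, tab, carriage return and line feed count as whitespace. The helper name xpath_normalize_space is hypothetical and this is not how Scrapy or lxml implement the function; it only mimics the behaviour on plain strings.

```python
import re

# XPath 1.0 takes its whitespace definition from the XML "S" production:
# space, tab, carriage return and line feed only. NO-BREAK SPACE is not in it.
_XPATH_WS = re.compile(r"[\x20\x09\x0d\x0a]+")


def xpath_normalize_space(text):
    """Hypothetical helper: approximate XPath 1.0 normalize-space()."""
    # Collapse each whitespace run to one space, then trim both ends.
    return _XPATH_WS.sub(" ", text).strip(" ")


# String values of the four <p> elements from the sample document.
samples = ["\u00a0", "something", " ", "some link"]

kept = [
    s for s in samples
    if xpath_normalize_space(s)                      # [normalize-space()]
    and "\u00a0" not in xpath_normalize_space(s)     # [not(contains(...))]
]
print(kept)
```

The lone "\u00a0" survives normalization as a non-empty string, so the first test alone would keep it; only the second test filters it out.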

Edit:

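Following @Kimmy's answer, here is a single-predicate alternative that also covers other whitespace characters:

  • take the whitespace characters that normalize-space() does not handle
  • replace those characters with ' ' (a plain space) by passing them to an XPath translate() call
  • normalize the spaces, which trims the leading and trailing ones

Here it goes: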

>>> chars = ''' 
... #CHARACTER TABULATION 
... #LINE FEED 
... #LINE TABULATION 
... #FORM FEED 
... #CARRIAGE RETURN 
... #SPACE 
... #NEXT LINE 
... NO-BREAK SPACE 
... OGHAM SPACE MARK 
... MONGOLIAN VOWEL SEPARATOR 
... EN QUAD 
... EM QUAD 
... EN SPACE 
... EM SPACE 
... THREE-PER-EM SPACE 
... FOUR-PER-EM SPACE 
... SIX-PER-EM SPACE 
... FIGURE SPACE 
... PUNCTUATION SPACE 
... THIN SPACE 
... HAIR SPACE 
... ZERO WIDTH SPACE 
... ZERO WIDTH NON-JOINER 
... ZERO WIDTH JOINER 
... LINE SEPARATOR 
... PARAGRAPH SEPARATOR 
... NARROW NO-BREAK SPACE 
... MEDIUM MATHEMATICAL SPACE 
... WORD JOINER 
... IDEOGRAPHIC SPACE 
... ZERO WIDTH NO-BREAK SPACE 
... ''' 
>>> import unicodedata 
>>> wsp = [unicodedata.lookup(c.strip())
...  for c in chars.splitlines()
...  if c.strip() and not c.startswith('#')]
>>> 
>>> # somehow NEXT LINE (U+0085) does not work with unicodedata 
... wsp.append(u'\u0085') 
>>> 
>>> selector.xpath(u''' 
...  //p[normalize-space(translate(., "%(in)s", "%(out)s"))] 
...  ''' % {'in': ''.join(wsp), 
...   'out': ' '*len(wsp) 
...  }).extract() 
[u'<p>something</p>', u'<p><a href="http://example.com">some link</a></p>'] 
>>> 
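
The translate-then-normalize idea can also be sketched outside XPath, in plain Python 3. This is only an illustration under assumptions: EXTRA_WS holds a handful of the characters from the list above rather than all of them, and has_visible_text is a hypothetical helper, not part of Scrapy.

```python
# A few of the characters listed above, written as escapes (not the full list):
# NO-BREAK SPACE, EM SPACE, THIN SPACE, ZERO WIDTH SPACE,
# IDEOGRAPHIC SPACE, NEXT LINE.
EXTRA_WS = "\u00a0\u2003\u2009\u200b\u3000\u0085"

# Equal-length mapping, mirroring translate(., in, out) where "out" is
# one plain space per input character.
TABLE = str.maketrans(EXTRA_WS, " " * len(EXTRA_WS))


def has_visible_text(text):
    """Hypothetical helper mirroring [normalize-space(translate(...))]."""
    replaced = text.translate(TABLE)
    # An XPath predicate on a string is true when the string is non-empty,
    # so trimming XML whitespace and testing for emptiness is enough here.
    return bool(replaced.strip(" \t\r\n"))


paragraphs = ["\u00a0", "something", " ", "some link", " \u2003\u00a0 "]
print([p for p in paragraphs if has_visible_text(p)])
```

Unlike the two-predicate version, this keeps a paragraph such as "a\u00a0b", because the no-break space is turned into a plain space instead of disqualifying the whole element.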
+0

Thank you for this valuable, detailed explanation! It works as expected. Thanks! – jacquesseite

0
//p[translate(string(.),"\xa0","")] 
+0

Nice try, but item['Description'] = response.xpath('//*[@id="textepresentation"]//p[translate(string(.),'\xa0','')]').extract() gives "SyntaxError: unexpected character after line continuation character" – jacquesseite

+0

@jacquesseite That is a string delimiter conflict. Always use double quotes inside the XPath expression, i.e. translate(string(.),"\xa0","") – har07

+0

Edited to use double quotes. –
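
The quoting issue discussed in the comments can be checked without running a spider. A minimal sketch, assuming Python 3; the variable name xpath_expr is made up for illustration:

```python
# Single quotes delimit the Python string; double quotes delimit the
# XPath string literals, so the two never collide.
xpath_expr = '//*[@id="textepresentation"]//p[translate(string(.),"\xa0","")]'

# XPath 1.0 string literals have no escape syntax, so the expression must
# contain the real NO-BREAK SPACE character. A non-raw Python string takes
# care of that: the \xa0 escape is resolved before XPath ever sees it.
print("\u00a0" in xpath_expr)   # the real character is present
print("\\xa0" in xpath_expr)    # no literal backslash sequence is left

# In a spider this would be used as: response.xpath(xpath_expr).extract()
```

Using single quotes both for the Python string and inside the expression ends the Python string early, which is what produced the SyntaxError quoted above.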