2015-11-20

LXML XPath does not ignore "&nbsp;"

I have HTML like this:

<td class="0"> 
<b>Bold Text</b>&nbsp; 
<a href=""></a> 
</td> 

<td class="0"> 
Regular Text&nbsp; 
<a href=""></a> 
</td> 

and I query it with this XPath:

new_html = tree.xpath('//td[@class="0"]/text() | //td[@class="0"]/b/text()') 

It prints:

['Bold Text', '', 'Regular Text'] 

As you can see, the &nbsp; character is not ignored; it is actually read as an extra entry in the td. How can I get cleaner output?

Answers


Instead, I would iterate over all the desired td elements and call .text_content() on each one:

[td.text_content().strip() for td in tree.xpath('//td[@class="0"]')] 

Prints:

[u'Bold Text', u'Regular Text'] 
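
Note that .strip() only trims the ends of the string. If a cell may also contain &nbsp; between words, a split-and-join pass collapses those as well. The following is a minimal sketch for Python 3, not part of the original answer: the helper name clean_cell and the third cell ("Two&nbsp;&nbsp;Words") are made up for illustration.

```python
import lxml.html

# The question's two cells, plus a hypothetical third cell with interior &nbsp;.
html = """
<table><tr>
<td class="0">
<b>Bold Text</b>&nbsp;
<a href=""></a>
</td>
<td class="0">
Regular Text&nbsp;
<a href=""></a>
</td>
<td class="0">Two&nbsp;&nbsp;Words</td>
</tr></table>
"""

tree = lxml.html.fromstring(html)


def clean_cell(td):
    # In Python 3, str.split() with no arguments splits on any Unicode
    # whitespace, which includes U+00A0 (NO-BREAK SPACE).
    return " ".join(td.text_content().split())


cells = [clean_cell(td) for td in tree.xpath('//td[@class="0"]')]
print(cells)
```

This trades fidelity for convenience: every run of whitespace, including line breaks inside the cell, becomes a single space.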

A much more efficient way to do it, thanks! – Prof


Note: I'm posting this not so much as an answer but as an interesting fact (one I didn't know) about XPath's normalize-space(). It may help other users.

It looks like normalize-space(), which I would have suggested here, does not strip 'NO-BREAK SPACE' (U+00A0):

>>> import lxml.html
>>> text = '''<html> 
... <table> 
... <tr> 
... <td class="0"> 
... <b>Bold Text</b>&nbsp; 
... <a href=""></a> 
... </td> 
... 
... <td class="0"> 
... Regular Text&nbsp; 
... <a href=""></a> 
... </td> 
... </tr> 
... </table> 
... </html>''' 
>>> doc = lxml.html.fromstring(text) 
>>> 
>>> # ouch, &nbsp; is not stripped... 
>>> [td.xpath('normalize-space(.)') for td in doc.xpath('.//td[@class="0"]')] 
[u'Bold Text\xa0', u'Regular Text\xa0'] 
>>> 
>>> # one needs to strip() like in @alecxe's answer 
>>> [td.xpath('normalize-space(.)').strip() for td in doc.xpath('.//td[@class="0"]')] 
[u'Bold Text', u'Regular Text'] 
>>> 
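
If you would rather stay inside XPath, one possible workaround is to map U+00A0 to a plain space with translate() before calling normalize-space(). This is a sketch for Python 3 and an assumption on my part rather than something tested in the session above; the XPath variable name $nbsp is arbitrary and is passed in through lxml's keyword-argument support for XPath variables.

```python
import lxml.html

html = """
<table><tr>
<td class="0">
<b>Bold Text</b>&nbsp;
<a href=""></a>
</td>
<td class="0">
Regular Text&nbsp;
<a href=""></a>
</td>
</tr></table>
"""

doc = lxml.html.fromstring(html)

# translate() swaps each NO-BREAK SPACE for an ordinary space, so that
# normalize-space() can then trim and collapse it like any other space.
expr = 'normalize-space(translate(., $nbsp, " "))'

result = [td.xpath(expr, nbsp="\u00a0") for td in doc.xpath('.//td[@class="0"]')]
print(result)
```

The same idea extends to other code points by listing them in the second argument of translate() and padding the third argument with one space per character.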

Edit:

So I kept looking into whitespace characters and whether they are stripped by Python's strip() or by XPath's normalize-space().

The following turned out a bit longer than I first expected, but here is the whole script that tests the Unicode whitespace code points:

>>> import lxml.html 
>>> import requests 
>>> 
>>> whitespace_chars_wikipedia = 'https://en.wikipedia.org/wiki/Whitespace_character#Unicode' 
>>> r = requests.get(whitespace_chars_wikipedia) 
>>> 
>>> doc = lxml.html.fromstring(r.text) 
>>> 
>>> 
>>> import collections 
>>> import re 
>>> 
>>> WhitespaceChar = collections.namedtuple('WhitespaceChar', ['codepoint', 'name', 'decimal', 'named_entity']) 
>>> r = re.compile('') 
>>> wchars = {} 
>>> for table in doc.xpath(''' 
...  .//div[@class="NavHead"][.//strong="Whitespace"] 
...  /following-sibling::div[@class="NavContent"] 
...   //table[1] 
...  | 
...  .//table[caption="Related characters"] 
...  '''): 
...  for row in table.xpath('.//tr[position()>1]'): 
...   codepoint = row.xpath('string(./td[1]/text()[last()])') 
...   name = row.xpath('normalize-space(./td[2])').upper() 
...   decimal = int(row.xpath('string(./td[3])')) 
...   named_entity = row.xpath('''string(
...    ./td[last()]/text()[contains(., "HTML/XML named entity: ")] 
...       /following-sibling::code 
...   )''') 
...   wchars[decimal] = WhitespaceChar(codepoint, name, decimal, named_entity or None) 
... 
>>> 
>>> listitems = "\n".join(
...  '<li><i>&#x{wchar.decimal:04X};</i> <b data-decimal="{wchar.decimal}">{wchar.codepoint}</b> <i>&#x{wchar.decimal:04X};</i></li>'.format(wchar=c) 
...  for c in sorted(wchars.values(), key=lambda c: c.decimal) 
...) 
>>> text = ''' 
... <html> 
...  <body> 
...   <ul> 
... {} 
...   </ul> 
...  </body> 
... </html> 
... '''.format(listitems) 
>>> print text 

<html> 
    <body> 
     <ul> 
<li><i>&#x0009;</i> <b data-decimal="9">U+0009</b> <i>&#x0009;</i></li> 
<li><i>&#x000A;</i> <b data-decimal="10">U+000A</b> <i>&#x000A;</i></li> 
<li><i>&#x000B;</i> <b data-decimal="11">U+000B</b> <i>&#x000B;</i></li> 
<li><i>&#x000C;</i> <b data-decimal="12">U+000C</b> <i>&#x000C;</i></li> 
<li><i>&#x000D;</i> <b data-decimal="13">U+000D</b> <i>&#x000D;</i></li> 
<li><i>&#x0020;</i> <b data-decimal="32">U+0020</b> <i>&#x0020;</i></li> 
<li><i>&#x0085;</i> <b data-decimal="133">U+0085</b> <i>&#x0085;</i></li> 
<li><i>&#x00A0;</i> <b data-decimal="160">U+00A0</b> <i>&#x00A0;</i></li> 
<li><i>&#x1680;</i> <b data-decimal="5760">U+1680</b> <i>&#x1680;</i></li> 
<li><i>&#x180E;</i> <b data-decimal="6158">U+180E</b> <i>&#x180E;</i></li> 
<li><i>&#x2000;</i> <b data-decimal="8192">U+2000</b> <i>&#x2000;</i></li> 
<li><i>&#x2001;</i> <b data-decimal="8193">U+2001</b> <i>&#x2001;</i></li> 
<li><i>&#x2002;</i> <b data-decimal="8194">U+2002</b> <i>&#x2002;</i></li> 
<li><i>&#x2003;</i> <b data-decimal="8195">U+2003</b> <i>&#x2003;</i></li> 
<li><i>&#x2004;</i> <b data-decimal="8196">U+2004</b> <i>&#x2004;</i></li> 
<li><i>&#x2005;</i> <b data-decimal="8197">U+2005</b> <i>&#x2005;</i></li> 
<li><i>&#x2006;</i> <b data-decimal="8198">U+2006</b> <i>&#x2006;</i></li> 
<li><i>&#x2007;</i> <b data-decimal="8199">U+2007</b> <i>&#x2007;</i></li> 
<li><i>&#x2008;</i> <b data-decimal="8200">U+2008</b> <i>&#x2008;</i></li> 
<li><i>&#x2009;</i> <b data-decimal="8201">U+2009</b> <i>&#x2009;</i></li> 
<li><i>&#x200A;</i> <b data-decimal="8202">U+200A</b> <i>&#x200A;</i></li> 
<li><i>&#x200B;</i> <b data-decimal="8203">U+200B</b> <i>&#x200B;</i></li> 
<li><i>&#x200C;</i> <b data-decimal="8204">U+200C</b> <i>&#x200C;</i></li> 
<li><i>&#x200D;</i> <b data-decimal="8205">U+200D</b> <i>&#x200D;</i></li> 
<li><i>&#x2028;</i> <b data-decimal="8232">U+2028</b> <i>&#x2028;</i></li> 
<li><i>&#x2029;</i> <b data-decimal="8233">U+2029</b> <i>&#x2029;</i></li> 
<li><i>&#x202F;</i> <b data-decimal="8239">U+202F</b> <i>&#x202F;</i></li> 
<li><i>&#x205F;</i> <b data-decimal="8287">U+205F</b> <i>&#x205F;</i></li> 
<li><i>&#x2060;</i> <b data-decimal="8288">U+2060</b> <i>&#x2060;</i></li> 
<li><i>&#x3000;</i> <b data-decimal="12288">U+3000</b> <i>&#x3000;</i></li> 
<li><i>&#xFEFF;</i> <b data-decimal="65279">U+FEFF</b> <i>&#xFEFF;</i></li> 
     </ul> 
    </body> 
</html> 

>>> 
>>> 
>>> doc2 = lxml.html.fromstring(text) 
>>> 
>>> from prettytable import PrettyTable 
>>> 
>>> x = PrettyTable([ 
...   #"#", 
...   #"Code point", 
...   "Name", 
...   #"Char Python repr", 
...   "Test string", 
...   "strip()", 
...   "normalize-space()" 
...  ]) 
>>> 
>>> for cnt, li in enumerate(doc2.xpath('.//ul/li'), start=1): 
...  codepoint = li.xpath('string(b)') 
...  wc = wchars[li.xpath('number(b/@data-decimal)')] 
...  tstring = li.xpath('string(.)') 
...  x.add_row([ 
...    #cnt, 
...    #wc.codepoint, 
...    wc.name, 
...    #repr([unichr(wc.decimal)]).strip('[]'), 
...    repr([tstring]).strip('[]'), 
...    tstring.strip() == codepoint, 
...    li.xpath('normalize-space(.)') == codepoint 
...   ]) 
... 

Do strip() and normalize-space() strip these whitespace characters?

>>> print x 
+-------------------------------+-------------------------+---------+-------------------+
|              Name             |       Test string       | strip() | normalize-space() |
+-------------------------------+-------------------------+---------+-------------------+
|      CHARACTER TABULATION     |      '\t U+0009 \t'     |   True  |        True       |
|           LINE FEED           |      '\n U+000A \n'     |   True  |        True       |
|        LINE TABULATION        |        ' U+000B '       |   True  |        True       |
|           FORM FEED           |        ' U+000C '       |   True  |        True       |
|        CARRIAGE RETURN        |      '\r U+000D \r'     |   True  |        True       |
|             SPACE             |        ' U+0020 '       |   True  |        True       |
|           NEXT LINE           |   u'\x85 U+0085 \x85'   |   True  |       False       |
|         NO-BREAK SPACE        |   u'\xa0 U+00A0 \xa0'   |   True  |       False       |
|        OGHAM SPACE MARK       | u'\u1680 U+1680 \u1680' |   True  |       False       |
|   MONGOLIAN VOWEL SEPARATOR   | u'\u180e U+180E \u180e' |   True  |       False       |
|            EN QUAD            | u'\u2000 U+2000 \u2000' |   True  |       False       |
|            EM QUAD            | u'\u2001 U+2001 \u2001' |   True  |       False       |
|            EN SPACE           | u'\u2002 U+2002 \u2002' |   True  |       False       |
|            EM SPACE           | u'\u2003 U+2003 \u2003' |   True  |       False       |
|       THREE-PER-EM SPACE      | u'\u2004 U+2004 \u2004' |   True  |       False       |
|       FOUR-PER-EM SPACE       | u'\u2005 U+2005 \u2005' |   True  |       False       |
|        SIX-PER-EM SPACE       | u'\u2006 U+2006 \u2006' |   True  |       False       |
|          FIGURE SPACE         | u'\u2007 U+2007 \u2007' |   True  |       False       |
|       PUNCTUATION SPACE       | u'\u2008 U+2008 \u2008' |   True  |       False       |
|           THIN SPACE          | u'\u2009 U+2009 \u2009' |   True  |       False       |
|           HAIR SPACE          | u'\u200a U+200A \u200a' |   True  |       False       |
|        ZERO WIDTH SPACE       | u'\u200b U+200B \u200b' |  False  |       False       |
|     ZERO WIDTH NON-JOINER     | u'\u200c U+200C \u200c' |  False  |       False       |
|       ZERO WIDTH JOINER       | u'\u200d U+200D \u200d' |  False  |       False       |
|         LINE SEPARATOR        | u'\u2028 U+2028 \u2028' |   True  |       False       |
|      PARAGRAPH SEPARATOR      | u'\u2029 U+2029 \u2029' |   True  |       False       |
|     NARROW NO-BREAK SPACE     | u'\u202f U+202F \u202f' |   True  |       False       |
|   MEDIUM MATHEMATICAL SPACE   | u'\u205f U+205F \u205f' |   True  |       False       |
|          WORD JOINER          | u'\u2060 U+2060 \u2060' |  False  |       False       |
|       IDEOGRAPHIC SPACE       | u'\u3000 U+3000 \u3000' |   True  |       False       |
| ZERO WIDTH NON-BREAKING SPACE | u'\ufeff U+FEFF \ufeff' |  False  |       False       |
+-------------------------------+-------------------------+---------+-------------------+
>>> 
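
The pattern in the table matches what the XPath 1.0 definition would lead you to expect: normalize-space() only treats the four XML whitespace characters (space, tab, line feed, carriage return) as whitespace, while Python's strip() on a Unicode string uses a much wider set. Here is a rough, standard-library-only emulation for Python 3. The function normalize_space and the shortened list of code points are my own illustration, not lxml's implementation, and Python 3 may classify a few code points differently from the Python 2 session above.

```python
import re

# The only characters XPath 1.0 normalize-space() treats as whitespace.
XML_WHITESPACE = " \t\n\r"


def normalize_space(s):
    """Rough emulation of XPath 1.0 normalize-space() (illustrative only)."""
    collapsed = re.sub("[ \t\n\r]+", " ", s)
    return collapsed.strip(XML_WHITESPACE)


# A hand-picked subset of the code points from the table above.
code_points = [0x0009, 0x0020, 0x0085, 0x00A0, 0x2003, 0x200B, 0x3000, 0xFEFF]

results = {}
for cp in code_points:
    label = "U+{:04X}".format(cp)
    # Same shape as the test strings above: <char> U+XXXX <char>
    sample = "{0} {1} {0}".format(chr(cp), label)
    results[label] = (sample.strip() == label, normalize_space(sample) == label)

for label, (stripped, normalized) in results.items():
    print(label, "strip():", stripped, "normalize-space():", normalized)
```

For this subset the booleans agree with the corresponding rows of the table: only tab and space pass both checks, and the zero-width characters pass neither.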

The whitespace characters:

>>> import pprint
>>> pprint.pprint(wchars) 
{9: WhitespaceChar(codepoint='U+0009', name='CHARACTER TABULATION', decimal=9, named_entity=None), 
10: WhitespaceChar(codepoint='U+000A', name='LINE FEED', decimal=10, named_entity='&NewLine;'), 
11: WhitespaceChar(codepoint='U+000B', name='LINE TABULATION', decimal=11, named_entity=None), 
12: WhitespaceChar(codepoint='U+000C', name='FORM FEED', decimal=12, named_entity=None), 
13: WhitespaceChar(codepoint='U+000D', name='CARRIAGE RETURN', decimal=13, named_entity=None), 
32: WhitespaceChar(codepoint='U+0020', name='SPACE', decimal=32, named_entity=None), 
133: WhitespaceChar(codepoint='U+0085', name='NEXT LINE', decimal=133, named_entity=None), 
160: WhitespaceChar(codepoint='U+00A0', name='NO-BREAK SPACE', decimal=160, named_entity='&nbsp;'), 
5760: WhitespaceChar(codepoint='U+1680', name='OGHAM SPACE MARK', decimal=5760, named_entity=None), 
6158: WhitespaceChar(codepoint='U+180E', name='MONGOLIAN VOWEL SEPARATOR', decimal=6158, named_entity=None), 
8192: WhitespaceChar(codepoint='U+2000', name='EN QUAD', decimal=8192, named_entity=None), 
8193: WhitespaceChar(codepoint='U+2001', name='EM QUAD', decimal=8193, named_entity=None), 
8194: WhitespaceChar(codepoint='U+2002', name='EN SPACE', decimal=8194, named_entity='&ensp;'), 
8195: WhitespaceChar(codepoint='U+2003', name='EM SPACE', decimal=8195, named_entity='&emsp;'), 
8196: WhitespaceChar(codepoint='U+2004', name='THREE-PER-EM SPACE', decimal=8196, named_entity='&emsp13;'), 
8197: WhitespaceChar(codepoint='U+2005', name='FOUR-PER-EM SPACE', decimal=8197, named_entity='&emsp14;'), 
8198: WhitespaceChar(codepoint='U+2006', name='SIX-PER-EM SPACE', decimal=8198, named_entity=None), 
8199: WhitespaceChar(codepoint='U+2007', name='FIGURE SPACE', decimal=8199, named_entity='&numsp;'), 
8200: WhitespaceChar(codepoint='U+2008', name='PUNCTUATION SPACE', decimal=8200, named_entity='&puncsp;'), 
8201: WhitespaceChar(codepoint='U+2009', name='THIN SPACE', decimal=8201, named_entity='&thinsp;'), 
8202: WhitespaceChar(codepoint='U+200A', name='HAIR SPACE', decimal=8202, named_entity='&hairsp;'), 
8203: WhitespaceChar(codepoint='U+200B', name='ZERO WIDTH SPACE', decimal=8203, named_entity=None), 
8204: WhitespaceChar(codepoint='U+200C', name='ZERO WIDTH NON-JOINER', decimal=8204, named_entity='&zwnj;'), 
8205: WhitespaceChar(codepoint='U+200D', name='ZERO WIDTH JOINER', decimal=8205, named_entity='&zwj;'), 
8232: WhitespaceChar(codepoint='U+2028', name='LINE SEPARATOR', decimal=8232, named_entity=None), 
8233: WhitespaceChar(codepoint='U+2029', name='PARAGRAPH SEPARATOR', decimal=8233, named_entity=None), 
8239: WhitespaceChar(codepoint='U+202F', name='NARROW NO-BREAK SPACE', decimal=8239, named_entity=None), 
8287: WhitespaceChar(codepoint='U+205F', name='MEDIUM MATHEMATICAL SPACE', decimal=8287, named_entity='&MediumSpace;'), 
8288: WhitespaceChar(codepoint='U+2060', name='WORD JOINER', decimal=8288, named_entity='&NoBreak;'), 
12288: WhitespaceChar(codepoint='U+3000', name='IDEOGRAPHIC SPACE', decimal=12288, named_entity=None), 
65279: WhitespaceChar(codepoint='U+FEFF', name='ZERO WIDTH NON-BREAKING SPACE', decimal=65279, named_entity=None)} 
>>>
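
As a cross-check that does not depend on scraping Wikipedia, the standard unicodedata module can report the name and general category of each code point, and str.isspace() tells you whether strip() would remove it. This is a small Python 3 sketch with a hand-picked subset of code points; the "<control>" placeholder is my own choice for characters that have no formal Unicode name.

```python
import unicodedata

# A few code points from the list above (subset chosen for illustration).
code_points = [0x0009, 0x00A0, 0x2028, 0x200B, 0xFEFF]

info = {}
for cp in code_points:
    ch = chr(cp)
    # Control characters have no formal name, so fall back to a placeholder.
    name = unicodedata.name(ch, "<control>")
    info[cp] = (name, unicodedata.category(ch), ch.isspace())

for cp, (name, category, is_space) in sorted(info.items()):
    print("U+{:04X}  {:<28} {}  isspace={}".format(cp, name, category, is_space))
```

The zero-width characters fall in category Cf (format), not in a separator category, which is consistent with strip() leaving them alone in the table above.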