lxml splits the text into separate items while BeautifulSoup does not

lxml returns two items, while BeautifulSoup returns only one. Is that because the <br/> should not be there, and BeautifulSoup is more tolerant of bad HTML?

Is there a better way to extract the location using lxml? The <br/> is not always there.

from lxml import html 
from bs4 import BeautifulSoup as bs 

s = '''<td class="location"> 
    <p> 
    TRACY,<br/>&nbsp;CA&nbsp;95304&nbsp; 
    </p></td> 
''' 

tree = html.fromstring(s) 
r = tree.xpath('//td[@class="location"]/p/text()') 
print r 

soup = bs(s, 'lxml') 
r = soup.find_all('td', class_='location')[0].get_text() 
print r

2016-11-06 foosion

"Is there a better way to extract the location using lxml? The <br/> is not always there."

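If by "better" you mean returning a result closer to its BeautifulSoup counterpart, then an XPath expression that more closely emulates your BeautifulSoup code would be: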

>>> print tree.xpath('string(//td[@class="location"])') 


    TRACY, CA 95304

And if you prefer to have the redundant whitespace removed, use normalize-space() instead of string():

>>> print tree.xpath('normalize-space(//td[@class="location"])') 
TRACY, CA 95304
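
Since the question says the <br/> is not always present, here is a minimal sketch (Python 3, assuming lxml is installed) that checks normalize-space() gives the same text either way. The helper name location_text and the sample without <br/> are made up for illustration, and the cell is wrapped in a table only to keep the sample markup well formed.

```python
from lxml import html

def location_text(markup):
    """Return the whitespace-normalized text of the first td.location cell."""
    tree = html.fromstring(markup)
    # normalize-space() takes the string value of the first matching node,
    # trims it, and collapses runs of space/tab/CR/LF into a single space.
    return tree.xpath('normalize-space(//td[@class="location"])')

# Hypothetical samples: the same cell with and without the <br/>.
with_br = ('<table><tr><td class="location"><p>\n    TRACY,<br/>'
           '&nbsp;CA&nbsp;95304&nbsp;\n    </p></td></tr></table>')
without_br = ('<table><tr><td class="location"><p>\n    TRACY,'
              '&nbsp;CA&nbsp;95304&nbsp;\n    </p></td></tr></table>')

print(repr(location_text(with_br)))
print(repr(location_text(without_br)))
```

Both calls should print the same string, because string() and normalize-space() work on the concatenated text of the whole cell rather than on individual text nodes.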

2016-11-06 12:12:18 har07

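element.get_text() joins the individual strings; from the documentation:

"If you only want the text part of a document or tag, you can use the get_text() method. It returns all the text in a document or beneath a tag, as a single Unicode string"

Emphasis mine.

Use the Tag.strings generator if you want the individual strings: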

>>> list(soup.find_all('td', class_='location')[0].strings) 
[u'\n', u'\n TRACY,', u'\xa0CA\xa095304\xa0\n ']
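
On the BeautifulSoup side, a related sketch (Python 3, assuming bs4 is installed): Tag.stripped_strings yields the same pieces with surrounding whitespace removed, and get_text() accepts a separator and a strip flag. The built-in 'html.parser' is used here only so the sketch does not depend on lxml; the markup is a shortened copy of the sample from the question.

```python
from bs4 import BeautifulSoup

markup = ('<td class="location"><p>\n TRACY,<br/>'
          '&nbsp;CA&nbsp;95304&nbsp;\n</p></td>')
soup = BeautifulSoup(markup, 'html.parser')
td = soup.find('td', class_='location')

# Each text node is stripped individually; empty results are skipped.
pieces = list(td.stripped_strings)

# Same pieces, joined with a single space into one string.
joined = td.get_text(' ', strip=True)

print(pieces)
print(repr(joined))
```

Note that the inner non-breaking space between "CA" and "95304" is kept, since stripping only touches the ends of each piece.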

If you want lxml to join the text, then join the text:

r = ''.join(tree.xpath('//td[@class="location"]/p/text()'))

The string() XPath function can do the same for the <td> tag:

r = tree.xpath('string(//td[@class="location"])')

Demo:

>>> ''.join(tree.xpath('//td[@class="location"]/p/text()')) 
u'\n TRACY,\xa0CA\xa095304\xa0\n ' 
>>> tree.xpath('string(//td[@class="location"])') 
u'\n \n TRACY,\xa0CA\xa095304\xa0\n '
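
The two approaches are not always interchangeable. As a rough sketch (Python 3, assuming lxml is installed; the markup with a nested <b> is invented for illustration), /p/text() only sees text nodes that are direct children of <p>, while string() includes all descendant text. lxml.html elements also offer a text_content() method that behaves like string() here.

```python
from lxml import html

# Hypothetical markup: part of the location sits inside a nested <b>.
markup = ('<table><tr><td class="location"><p>TRACY,<br/>'
          '<b>CA</b> 95304</p></td></tr></table>')
tree = html.fromstring(markup)
td = tree.xpath('//td[@class="location"]')[0]

# text() selects only text nodes that are direct children of <p>,
# so the text inside the nested <b> is skipped.
direct = ''.join(tree.xpath('//td[@class="location"]/p/text()'))

# string() and text_content() both include text from all descendants.
via_xpath = tree.xpath('string(//td[@class="location"])')
via_method = td.text_content()

print(repr(direct))
print(repr(via_xpath))
print(repr(via_method))
```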

I'd use str.strip() on either the joined or the string() result:

>>> tree.xpath('string(//td[@class="location"])').strip() 
u'TRACY,\xa0CA\xa095304' 
>>> print tree.xpath('string(//td[@class="location"])').strip() 
TRACY, CA 95304

Or use the normalize-space() XPath function:

>>> tree.xpath('normalize-space(string(//td[@class="location"]))') 
u'TRACY,\xa0CA\xa095304\xa0'

Note that str.strip() removes the trailing non-breaking space (\xa0), while normalize-space() leaves it.
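
The reason for the difference is that XPath 1.0 counts only space, tab, carriage return and line feed as whitespace, whereas Python's strip() and split() on Unicode strings also treat U+00A0 as whitespace. A small standard-library sketch (Python 3) of the three behaviours; xpath_normalize_space and collapse_all_whitespace are hypothetical helper names, and the first one only emulates the XPath function with a regular expression.

```python
import re

# String value of the <td> from the demo above.
raw = '\n \n TRACY,\xa0CA\xa095304\xa0\n '

def xpath_normalize_space(text):
    """Emulate XPath 1.0 normalize-space(): only space, tab, CR, LF count."""
    return re.sub(r'[ \t\r\n]+', ' ', text).strip(' \t\r\n')

def collapse_all_whitespace(text):
    """Collapse every Unicode whitespace run, including \\xa0, to one space."""
    return ' '.join(text.split())

print(repr(raw.strip()))                   # trailing \xa0 is stripped
print(repr(xpath_normalize_space(raw)))    # trailing \xa0 is kept
print(repr(collapse_all_whitespace(raw)))  # every \xa0 becomes a plain space
```

If the goal is a clean "TRACY, CA 95304" with ordinary spaces, the last variant applied to the string() result is probably the simplest option.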

2016-11-06 12:09:25

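I'm looking for a single string from lxml, not for a way to split the bs result further. "Is there a better way to extract the location using lxml?" – foosion

@foosion: Ah, indeed, I read that too quickly. –

Thanks for trying – foosion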
