Scrape the text of all <a> tags under a span tag using scrapy

I am using scrapy to extract data from a web page. I want to extract the text of the anchor tags under a span tag, as shown below:

<span>.....</span> 
<span id = "size_selection_list"> 
    <a>....</a> 
    <a>....</a> 
    . 
    . 
    . 
    <a> 
</span>

I am using the following XPath logic:

t = sel.xpath('//div[starts-with(@id,"size_selection_container")]/span[2]') 
for x in t.xpath('.//a'): 
....

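This reaches the span element in question, but the <a> tags are not iterated. What is wrong here? Also, each <a> has an href that contains JavaScript. Is that the cause of the problem?

Source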

2016-11-18 Neel Shah

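Your logic works with the sample HTML you provided: http://pastebin.com/hxSZ041j. So either you are not sharing your code as-is, or the sample HTML is not what you are actually working with.

If it were me, I would use requests and BeautifulSoup4.

Note that this code is untested, but it should work.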

import requests 
from bs4 import BeautifulSoup 

# yoururlhere is a placeholder: set it to the URL of the page to scrape
r = requests.get(yoururlhere).text 
soup = BeautifulSoup(r, 'html.parser')  # lxml or another parser also works; the standard parser is used for compatibility 
span = soup.find('div', {'class': 'theclass'})  # 'theclass' is a placeholder for the container's class 
tags = span.find_all('a', href=True)  # only <a> tags that have an href attribute 
for i in tags: 
    print(i.get_text())  # the link text 
    print(i['href'])     # the link target

I hope this works; please let me know.

Source

2016-11-18 01:06:21 Will

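But I want to crawl with a spider, which is why I would prefer a scrapy-based solution.

May I ask why you are using scrapy or a spider? – Will

This is the best I can do; your HTML is incomplete.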

import lxml.html 
string = '''<span>.....</span> 
<span id = "size_selection_list"> 
    <a>....</a> 
    <a>....</a> 
    . 
    . 
    . 
    <a>....</a> 
</span>''' 

html = lxml.html.fromstring(string) 
for a in html.xpath('//span[@id="size_selection_list"]//a'): 
    print(a.tag)

Output:

a 
a 
a

Source

2016-11-18 05:29:00

This gives an error.

What error does it give?
