How to extract (href, alt) pairs with Python Scrapy

2015-04-07 76 views 1 likes

I have an HTML page (seed) from which I want to extract (href, alt) pairs:

<div class="sth1"> 
    <table cellspacing="6" width="600"> 
     <tr> 
      <td> 
       <a href="link1"><img alt="alt1" border="0" height="22" src="img1" width="92"></a> 
      </td> 
      <td> 
       <a href="link1">name1</a> 
      </td> 
      <td> 
       <a href="link2"><img alt="alt2" border="0" height="22" src="img2" width="92"></a> 
      </td> 
      <td> 
       <a href="link2">name2</a> 
      </td> 
     </tr> 
    </table> 
</div>

What I want to do is loop over all the <tr> elements and extract every (href, alt) pair with Python Scrapy. For this example, I should get:

link1, alt1 
link2, alt2

Source

2015-04-07 user706838

Answers

Here is an example from the Scrapy shell:

$ scrapy shell index.html 
In [1]: for cell in response.xpath("//div[@class='sth1']/table/tr/td"):
    ...:     href = cell.xpath("a/@href").extract()
    ...:     alt = cell.xpath("a/img/@alt").extract()
    ...:     print(href, alt)

['link1'] ['alt1']
['link1'] []
['link2'] ['alt2']
['link2'] []

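Here index.html contains the sample HTML provided in the question. Cells whose link wraps plain text instead of an image yield an empty alt list.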

Source

2015-04-07 17:10:21 alecxe

You could try combining Scrapy's built-in SelectorList with Python's zip():

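from scrapy.selector import SelectorList

# Keep only the cells whose link wraps an image
xpq = '//div[@class="sth1"]/table/tr/td[./a/img]'
cells = SelectorList(response.xpath(xpq))

# .extract() turns the matched nodes into strings before pairing them up
list(zip(cells.xpath('a/@href').extract(), cells.xpath('a/img/@alt').extract()))
=> [('link1', 'alt1'), ('link2', 'alt2')]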

Source

2015-04-08 10:33:52 Roman
