2015-04-07 76 views

How to extract (href, alt) pairs with Python Scrapy

I have an HTML page (seed) of the following form:

<div class="sth1"> 
    <table cellspacing="6" width="600"> 
     <tr> 
      <td> 
       <a href="link1"><img alt="alt1" border="0" height="22" src="img1" width="92"></a> 
      </td> 
      <td> 
       <a href="link1">name1</a> 
      </td> 
      <td> 
       <a href="link2"><img alt="alt2" border="0" height="22" src="img2" width="92"></a> 
      </td> 
      <td> 
       <a href="link2">name2</a> 
      </td> 
     </tr> 
    </table> 
</div> 

What I want to do is loop over all the <tr> elements and extract every (href, alt) pair with Python Scrapy. For this example, I should get:

link1, alt1 
link2, alt2 

Answers


Here is an example from the Scrapy shell:

$ scrapy shell ./index.html
In [1]: for cell in response.xpath("//div[@class='sth1']/table/tr/td"):
   ...:     href = cell.xpath("a/@href").extract()
   ...:     alt = cell.xpath("a/img/@alt").extract()
   ...:     print(href, alt)

['link1'] ['alt1']
['link1'] []
['link2'] ['alt2']
['link2'] []

where index.html contains the sample HTML provided in the question.


You can try combining Scrapy's built-in SelectorList with Python's zip():

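xpq = '//div[@class="sth1"]/table/tr/td[./a/img]'
# response.xpath() already returns a SelectorList, so no extra wrapping is needed
cells = response.xpath(xpq)

# extract() turns the selectors into plain strings before they are paired up
list(zip(cells.xpath('a/@href').extract(), cells.xpath('a/img/@alt').extract()))
=> [('link1', 'alt1'), ('link2', 'alt2')]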