2015-07-20 74 views

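Concatenate XPath nested text in Scrapy

I've been trying to concatenate some nested text with XPath in Scrapy. I believe Scrapy uses XPath 1.0? I've looked at a bunch of other posts, but none of them gets me quite what I want.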

Here is the relevant part of the HTML (actual page: http://adventuretime.wikia.com/wiki/List_of_episodes):

<tr> 
<td colspan="5" style="border-bottom: #BCD9E3 3px solid"> 
    Finn and Princess Bubblegum must protect the <a href="/wiki/Candy_Kingdom" title="Candy Kingdom">Candy Kingdom</a> from a horde of candy zombies they accidentally created. 
</td> 
</tr> 

<tr> 
<td colspan="5" style="border-bottom: #BCD9E3 3px solid"> 
Finn must travel to <a href="/wiki/Lumpy_Space" title="Lumpy Space">Lumpy Space</a> to find a cure that will save Jake, who was accidentally bitten by <a href="/wiki/Lumpy_Space_Princess" title="Lumpy Space Princess">Lumpy Space Princess</a> at Princess Bubblegum's annual 'Mallow Tea Ceremony.' 
</td> 
</tr> 

(much more stuff here) 

Here is the result I want to get back:

[u'Finn and Princess Bubblegum must protect the Candy Kingdom from a horde of candy zombies they accidentally 
    created.\n', u'Finn must travel to Lumpy Space to find a cure that will save Jake, who was accidentally bitten', (more stuff here)] 

I tried using the answer from "HTML XPath: Extracting text mixed in with multiple tags?":

description =sel.xpath("//table[@class='wikitable']/tr[position()>1]/td[@colspan='5']/parent::tr/td[descendant-or-self::text()]").extract() 

But that just gives me back the raw markup:

[u'<td colspan="5" style="border-bottom: #BCD9E3 3px solid">Finn and Princess Bubblegum must protect the <a href="/wiki/ 
Candy_Kingdom" title="Candy Kingdom">Candy Kingdom</a> from a horde of candy zombies they accidentally created.\n</td>', 

The string() answer didn't seem to work for me either... I get back a list with only one entry, and there should be many.
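A likely reason for the single entry: in XPath 1.0, string() applied to a node-set uses only the first node in document order, so wrapping the whole //td expression in string() collapses every row into one result. Evaluating the string value once per cell avoids that. The sketch below uses the standard library's xml.etree.ElementTree rather than Scrapy so that it runs without extra packages; the trimmed HTML snippet and the cell_text helper are assumptions made for illustration, and "".join(td.itertext()) plays the role of evaluating string(.) relative to each cell.

```python
import xml.etree.ElementTree as ET

# Trimmed, well-formed stand-in for the episode table (an assumption for this sketch).
HTML = """
<table class="wikitable">
<tr><td colspan="5">Finn must protect the <a href="/wiki/Candy_Kingdom">Candy Kingdom</a> from zombies.</td></tr>
<tr><td colspan="5">Finn must travel to <a href="/wiki/Lumpy_Space">Lumpy Space</a> to find a cure.</td></tr>
</table>
"""

def cell_text(td):
    """Concatenate every text node under one cell, in document order."""
    return "".join(td.itertext())

root = ET.fromstring(HTML.strip())

# One string per matching cell, instead of one string for the whole node-set.
descriptions = [cell_text(td) for td in root.findall(".//td[@colspan='5']")]
print(descriptions)
```

In a Scrapy spider the same idea would mean looping over the selected td nodes and running a relative expression on each one, rather than calling string() once on the full selection.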

The closest I've gotten is:

description = sel.xpath("//table[@class='wikitable']/tr[position()>1]/td[@colspan='5']//text()").extract() 

which gives me back

[u'Finn and Princess Bubblegum must protect the ', u'Candy Kingdom', u' from a horde of candy zombies they accidentally 
created.\n', u'Finn must travel to ', u'Lumpy Space', u' to find a cure that will save Jake, who was accidentally bitten', (more stuff here)] 

Does anyone have tips for concatenating with XPath?

Thanks!

Edit: the spider code, for use with a manual join():

from scrapy import Selector, Spider


class AT_Episode_Detail_Spider_2(Spider):

    name = "ep_detail_2"
    allowed_domains = ["adventuretime.wikia.com"]
    start_urls = [
        "http://adventuretime.wikia.com/wiki/List_of_episodes"
    ]

    def parse(self, response):
        sel = Selector(response)

        # Every text node under each description cell, as one flat list
        description = sel.xpath("//table[@class='wikitable']/tr[position()>1]/td[@colspan='5']//text()").extract()
        print(description)

Answers

Join the extracted text nodes:

description = " ".join(sel.xpath("//table[@class='wikitable']/tr[position()>1]/td[@colspan='5']//text()").extract()) 

Or use a Join() processor in combination with an Item Loader.


Here is some simple code that builds a list of episode descriptions, one string per row:

def parse(self, response): 
    # One joined string per description cell; footnote markers inside <sup> are skipped
    description = [" ".join(row.xpath(".//text()[not(ancestor::sup)]").extract()) 
        for row in response.xpath("//table[@class='wikitable']/tr[position()>1]/td[@colspan]")] 
    print(description) 
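
One detail that may be worth checking: the text fragments in the question already carry their own leading and trailing spaces (u'... protect the ', u' from a horde ...'), so joining them with " " would probably produce doubled spaces and keep the trailing \n. A minimal sketch of one way to tidy that, assuming the fragments have already been extracted for a single row; join_fragments is a hypothetical helper, not part of Scrapy:

```python
def join_fragments(fragments):
    """Join the text-node fragments of one row and collapse runs of whitespace."""
    # "".join keeps the spacing the fragments already contain;
    # split() followed by " ".join squeezes repeated whitespace
    # and drops the trailing newline.
    return " ".join("".join(fragments).split())

# Fragments as they appear in the question's //text() output for the first row.
row = [
    "Finn and Princess Bubblegum must protect the ",
    "Candy Kingdom",
    " from a horde of candy zombies they accidentally created.\n",
]

print(join_fragments(row))
```

Inside the list comprehension above, this could replace the " ".join(...) call, for example join_fragments(row.xpath(".//text()[not(ancestor::sup)]").extract()).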

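join() isn't quite what I'm looking for. I should have been more specific. Note that the data I want returned contains more than one string. I only want to merge each piece of text with the tags nested inside it, not merge all of the text and tags into a single string. I'll update my HTML real quick... – pyramidface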

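@pyramidface You can still solve it with join(). Beyond that, you would probably need to iterate over the rows to build the list of descriptions. Could you also post the complete spider code so that I can better understand the context? Thanks! – alecxe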

@pyramidface OK, I've updated the answer to include code that gets the list of descriptions. Is this what you were asking for? Thanks. – alecxe