2015-10-06 69 views

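How can I scrape an untagged paragraph with Scrapy?

I want to scrape the claim from the HTML below:

<FONT COLOR=#5FA505><B>Claim:</B></FONT> &nbsp; Coed makes unintentionally risqu&eacute; remark about professor's "little quizzies." 
<BR><BR> 
<CENTER><IMG SRC="/images/content-divider.gif"></CENTER> 

I have tried using

def parse_article(self, response): 
    for href in response.xpath('//font[@color="#5FA505"]/'): 

but the claim (Coed makes unintentionally...) is not actually embedded in any tag, so I have been unable to get that content. Is there a way to get the content when it is not embedded in a <p> or any other tag?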

Edit: //font[b = "Claim:"]/following-sibling::text() works, but it also grabs and displays this chunk of HTML further down the page.

<FONT COLOR=#5FA505 FACE=""><B>Origins:</B></FONT> &nbsp; Print references to the "little quizzies" tale date to 1962, but the tale itself has been around since the early 1950s. It continues to surface among college students to this day. Similar to a number of other college legends 

Answer


Assuming you know beforehand that the Claim: text is there, locate the font tag by the text of its b child and get the following text sibling:

//font[b = 'Claim:']/following-sibling::text() 

Demo from the Scrapy shell:

In [1]: "".join(map(unicode.strip, response.xpath("//font[b = 'Claim:']/following-sibling::text()").extract())) 
Out[1]: u'Coed makes unintentionally risqu\xe9 remark about professor\'s "little quizzies."' 
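
To see why the claim has to be reached as a sibling of the font tag rather than as its child, here is a minimal standard-library sketch. It rests on several assumptions: the snippet has been tidied by hand into well-formed XML (quoted attributes, closed br tags, a wrapper div, a shortened Origins sentence), because xml.etree.ElementTree cannot parse the page's real tag soup. ElementTree stores the text that follows an element's closing tag in that element's tail attribute, which plays roughly the role of the first following text sibling in the XPath above.

```python
import xml.etree.ElementTree as ET

# Hand-tidied, well-formed stand-in for the question's HTML (an assumption:
# the real page is not valid XML and would go through Scrapy's selectors).
SNIPPET = """<div>
<font color="#5FA505"><b>Claim:</b></font> Coed makes unintentionally risqu\u00e9 remark about professor's "little quizzies."
<br/><br/>
<font color="#5FA505"><b>Origins:</b></font> Print references to the "little quizzies" tale date to 1962.
</div>"""

root = ET.fromstring(SNIPPET)

# Same idea as //font[b = 'Claim:']: pick the font element whose b child reads "Claim:".
claim_font = root.find(".//font[b='Claim:']")

# The claim is not inside <font>; it is the text right after </font>, up to the next tag.
claim = claim_font.tail.strip()
print(claim)
```

Because tail stops at the next element, this sketch only ever sees the text between the closing font tag and the first br tag.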

Note that these join and strip calls should ideally be replaced by appropriate input or output processors used inside Item Loaders.
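
As a rough illustration of that split, the strip and join steps from the demo can be written as two plain Python 3 functions, one in the role of an input processor and one in the role of an output processor. This is only a sketch: the function names and the sample list are hypothetical, and in a real Item Loader Scrapy's built-in processors such as MapCompose and Join would be the usual candidates for these two roles.

```python
def strip_each(values):
    """Input-processor role: strip whitespace from every extracted text node."""
    return [value.strip() for value in values]


def join_parts(values, separator=""):
    """Output-processor role: drop empty strings, then join what is left."""
    return separator.join(value for value in values if value)


# Hypothetical list, shaped like what .extract() might return for the claim.
extracted = [
    '\xa0 Coed makes unintentionally risqu\xe9 remark about professor\'s "little quizzies." \n',
    "\n",
]

claim = join_parts(strip_each(extracted))
print(claim)
```

Under Python 3, str.strip takes the place of the unicode.strip call used in the Python 2 shell session above.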


It works and I have accepted the answer, but please take a look at my edit – Rafa