2015-10-06 55 views
0

Is there a following-sibling count for Scrapy?

I am trying to scrape the title from the following HTML code:

<FONT COLOR=#5FA505><B>Claim:</B></FONT> &nbsp; Coed makes unintentionally risqu&eacute; remark about professor's "little quizzies." 
<BR><BR> 
<CENTER><IMG SRC="/images/content-divider.gif"></CENTER> 

I am using this code:

def parse_article(self, response):
    # Select every text node that follows the <font> whose <b> child reads "Claim:"
    for node in response.xpath('//font[b = "Claim:"]/following-sibling::text()'):
        print(node.extract())

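With it I successfully pull the correct Claim: value from the HTML shown above, but it also pulls the HTML below (other code on the same page with a similar structure). I am defining my xpath() to pull in only the font tag named Claim:, so why does it also pull the Origins text below? How can I fix it? I tried to see whether I could get just the next following-sibling instead of all of them, but that did not work.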

<FONT COLOR=#5FA505 FACE=""><B>Origins:</B></FONT> &nbsp; Print references to the "little quizzies" tale date to 1962, but the tale itself has been around since the early 1950s. It continues to surface among college students to this day. Similar to a number of other college legends 
+0

'.extract()[0]' –

+0

@JohnDene My output changed, but now it is just a lot of empty space with an occasional ',' showing up. – Rafa

+1

I think this is because you are using a for loop. If I understand correctly, you only want to extract a single value? –

Answers

0

I think your XPath is missing the text() qualifier (explained here). It should be:

'//font[b/text()="Claim:"]/following-sibling::text()' 
+0

Still gives me the same output. It also pulls 'Origins'. – Rafa

0

The following-sibling axis returns all sibling nodes that come after an element. If you only want the first sibling, try this XPath expression:

//font[b = "Claim:"]/following-sibling::text()[1] 

or, depending on your specific use case:

(//font[b = "Claim:"]/following-sibling::text())[1]
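
To see why the unrestricted axis also returns the Origins text, here is a rough standard-library model of the fragment from the question. It is only a sketch and not how Scrapy evaluates XPath: the `TopLevelNodes` class and the `following_text_siblings` helper are hypothetical names made up for this illustration, and the HTML is assumed to be a flat run of siblings as posted. Both FONT tags sit under the same parent, so every text node after the Claim: tag, including the one after Origins:, is one of its following siblings.

```python
from html.parser import HTMLParser

# Fragment from the question, with the Origins text shortened.
HTML = (
    '<FONT COLOR=#5FA505><B>Claim:</B></FONT> &nbsp; Coed makes unintentionally '
    'risqu&eacute; remark about professor\'s "little quizzies." '
    '<BR><BR> <CENTER><IMG SRC="/images/content-divider.gif"></CENTER> '
    '<FONT COLOR=#5FA505 FACE=""><B>Origins:</B></FONT> &nbsp; Print references '
    'to the "little quizzies" tale date to 1962.'
)

VOID_TAGS = {"br", "img"}  # elements that never get a closing tag


class TopLevelNodes(HTMLParser):
    """Hypothetical helper: record the fragment's top-level sibling nodes."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.depth = 0   # number of elements currently open
        self.nodes = []  # [kind, text] pairs in document order

    def handle_starttag(self, tag, attrs):
        if self.depth == 0:
            self.nodes.append(["element", ""])
        if tag not in VOID_TAGS:
            self.depth += 1

    def handle_endtag(self, tag):
        if tag not in VOID_TAGS:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth == 0:
            self.nodes.append(["text", data])
        else:
            # Text nested inside the current top-level element.
            self.nodes[-1][1] += data


def following_text_siblings(nodes, label):
    """Text nodes after the element whose text equals label,
    mimicking following-sibling::text() on this flat model."""
    for i, (kind, text) in enumerate(nodes):
        if kind == "element" and text == label:
            return [t for k, t in nodes[i + 1:] if k == "text"]
    return []


parser = TopLevelNodes()
parser.feed(HTML)
parser.close()

all_following = following_text_siblings(parser.nodes, "Claim:")
first_only = all_following[:1]  # what the [1] predicate would keep

print(all_following)
print(first_only)
```

The whitespace-only entries in `all_following` are the gaps between tags, which matches the "empty space" output mentioned in the comments. In the spider itself, the equivalent would presumably be `response.xpath('//font[b = "Claim:"]/following-sibling::text()[1]').extract_first()`, with `.strip()` applied to the result if the leading non-breaking space is unwanted.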