2017-07-31

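Scraping text from a node, excluding the text of its last child

I am trying to scrape quotes from Goodreads. I only need the quote itself, not the author's name.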

Here is the HTML source.

<div class="quoteText"> 
     “Don't cry because it's over, smile because it happened.” 
    <br> ― 
    <a class="authorOrTitle" href="/author/show/61105.Dr_Seuss">Dr. Seuss</a> 
</div> 

I tried the following, but it includes the author information.

quotes = [quote.text.strip() for quote in soup.findAll('div', {'class':'quoteText'})] 

I also tried using contents[0], but that fails when the quote spans multiple lines. See the example below:

<div class="quoteText"> 
     “You've gotta dance like there's nobody watching, 
<br> 
Love like you'll never be hurt, 
<br> 
Sing like there's nobody listening, 
<br> 
And live like it's heaven on earth.” 
    <br> ― 
    <a class="authorOrTitle" href="/author/show/1744830.William_W_Purkey">William W. Purkey</a> 
</div> 
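
To make the failure concrete, here is a minimal sketch of what contents[0] returns for the markup above. It assumes the page is parsed with BeautifulSoup's built-in html.parser (the parser actually used is not stated here); the variable names are illustrative only.

```python
from bs4 import BeautifulSoup

# The multi-line quote from above, reproduced as a string for the demo.
html = """
<div class="quoteText">
     “You've gotta dance like there's nobody watching,
<br>
Love like you'll never be hurt,
<br>
Sing like there's nobody listening,
<br>
And live like it's heaven on earth.”
    <br> ―
    <a class="authorOrTitle" href="/author/show/1744830.William_W_Purkey">William W. Purkey</a>
</div>
"""

# Assumption: html.parser is used; other parsers may build a slightly different tree.
soup = BeautifulSoup(html, "html.parser")
div = soup.find("div", {"class": "quoteText"})

# Each <br> splits the quote into separate text nodes, so contents[0]
# holds only the text that precedes the first <br>.
first_node = div.contents[0].strip()
br_count = len(div.find_all("br"))

print(first_node)
print(br_count)
```

Because every <br> starts a new child node, the remaining lines of the quote sit in later entries of contents rather than in contents[0].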

Answers

1

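This one is simple. When you call quote.text.strip() you get a string that you can split on \n. In this case the string is '“Don't cry because it's over, smile because it happened.”\n ―\n Dr. Seuss', so taking the first piece gives you only the quote. Example: [quote.text.strip().split("\n")[0] for quote in soup.findAll("div", {"class":"quoteText"})]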

If you don't want the quotation marks (i.e. ” and “), you can remove them with str.replace().
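
One caveat: splitting on \n and keeping the first piece would cut the multi-line quote from the question after its first line. A possible variation, sketched below, splits on the dash that precedes the author instead and then joins the remaining lines. The helper name extract_quote and the SEPARATOR constant are made up for this illustration, and the sketch assumes every quoteText block places that dash character before the author, which may not hold for every page.

```python
from bs4 import BeautifulSoup

# Assumption: this is the dash character that precedes the author in the markup.
SEPARATOR = "―"


def extract_quote(div):
    """Return the quote text of one quoteText div, without author or curly quotes."""
    # Keep everything before the last separator, which drops the author part.
    text_before_author = div.get_text().rsplit(SEPARATOR, 1)[0]
    # Trim each line and drop the empty ones left behind by <br> and indentation.
    lines = [line.strip() for line in text_before_author.splitlines()]
    joined = " ".join(line for line in lines if line)
    # Remove the surrounding curly quotation marks.
    return joined.strip("“”")


html = """
<div class="quoteText">
     “Don't cry because it's over, smile because it happened.”
    <br> ―
    <a class="authorOrTitle" href="/author/show/61105.Dr_Seuss">Dr. Seuss</a>
</div>
<div class="quoteText">
     “You've gotta dance like there's nobody watching,
<br>
Love like you'll never be hurt,
<br>
Sing like there's nobody listening,
<br>
And live like it's heaven on earth.”
    <br> ―
    <a class="authorOrTitle" href="/author/show/1744830.William_W_Purkey">William W. Purkey</a>
</div>
"""

soup = BeautifulSoup(html, "html.parser")
quotes = [extract_quote(div) for div in soup.find_all("div", {"class": "quoteText"})]

for quote in quotes:
    print(quote)
```

Using strip("“”") here only trims the marks at the two ends, whereas replace() would also remove any curly quotes inside the quote; which one is preferable depends on the data.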

+0

Oh yes, replace. Strange that it didn't cross my mind.