2016-05-14 68 views

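How to get rid of the blank space above the text, using bs4

OK, so I'm using bs4 (BeautifulSoup) to parse a website and find the specific headlines I'm looking for. My code looks like this: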

import requests
from bs4 import BeautifulSoup
url = 'http://www.ewn.co.za/Categories/Local'
r = requests.get(url).text
soup = BeautifulSoup(r)
for i in soup.find_all(class_='article-short'):
    if i.a:
        print(i.a.text.replace('\n', '').strip())
    else:
        print(i.contents[0].strip())

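This code works, but its output starts with about 20 blank lines before the headlines from the website are printed. Is there something wrong with my code, or is there anything I can do to get rid of the blank lines?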

With the strip function you can remove whitespace from a string (https://docs.python.org/3/library/stdtypes.html#str.strip) – Querenker

Answer

It's because you have content like this:

<article class="article-short"> 
<div class="thumb"><a href="http://ewn.co.za/2016/05/14/Contralesa-against-scrapping-initiation-due-to-cold-weather"><img alt="FILE: Boys who have undergone a circumcision ceremony walk near Qunu in the Eastern Cape in 2013. Picture: AFP." height="147" src="http://ewn.co.za/cdn/-%2fmedia%2f3C37CB28056746CD95FC913757AAD41C.ashx%3fas%3d1%26h%3d147%26w%3d234%26crop%3d1;waeb9b8157b3e310df" width="234"/></a></div> 
<h6 class="h6-mega"><a href="http://ewn.co.za/2016/05/14/Contralesa-against-scrapping-initiation-due-to-cold-weather">Contralesa against scrapping initiation due to cold weather</a></h6> 
</article> 

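Here the first link contains only an image and no text, so i.a.text is an empty string and print() emits a blank line for it.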
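This can be checked without hitting the live site. The following is a minimal sketch that parses a shortened copy of the snippet above; the `html` string and its placeholder `example.com` URLs are assumptions made for illustration, not the real page. It compares what `.a` and `.h6` return for one article:

```python
from bs4 import BeautifulSoup

# Shortened copy of the markup above; the URLs are placeholders (assumption).
html = """
<article class="article-short">
<div class="thumb"><a href="http://example.com/story"><img alt="FILE" src="http://example.com/img.jpg"/></a></div>
<h6 class="h6-mega"><a href="http://example.com/story">Contralesa against scrapping initiation due to cold weather</a></h6>
</article>
"""

soup = BeautifulSoup(html, "html.parser")
article = soup.find(class_="article-short")

# .a returns the first <a> descendant: the thumbnail link, which wraps only an <img>.
first_link_text = article.a.text
# .h6 returns the headline element, whose text is the actual title.
headline_text = article.h6.text

print(repr(first_link_text))
print(repr(headline_text))
```

Printing with `repr` makes the empty string visible instead of showing up as a blank line.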

You should look for the h6 tag instead. So something like this works:

import requests
from bs4 import BeautifulSoup

url = 'http://www.ewn.co.za/Categories/Local'
r = requests.get(url).text
soup = BeautifulSoup(r, 'html.parser')  # name the parser explicitly
for i in soup.find_all(class_='article-short'):
    # Use the <h6> headline; the first <a> only wraps the thumbnail image.
    title = (i.h6.text.replace('\n', '') if i.h6 else i.contents[0]).strip()
    if title:  # skip empty results
        print(title)
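A slightly more defensive variant, as a sketch only: `extract_titles` and the `sample` markup are hypothetical names and data introduced here, not part of the original answer. It uses `get_text(strip=True)`, which strips each text fragment and drops the empty ones, so there is no need to index into `contents`:

```python
from bs4 import BeautifulSoup


def extract_titles(html):
    """Return the non-empty headline strings found in 'article-short' blocks."""
    soup = BeautifulSoup(html, "html.parser")
    titles = []
    for article in soup.find_all(class_="article-short"):
        # Prefer the <h6> headline; fall back to the article's own text.
        node = article.h6 if article.h6 is not None else article
        # strip=True trims every text fragment and discards the empty ones.
        title = node.get_text(" ", strip=True)
        if title:
            titles.append(title)
    return titles


# Made-up sample markup (assumption): one article with a thumbnail and an <h6>,
# one with plain text only, and one with nothing but an image.
sample = (
    '<article class="article-short"><div class="thumb"><a href="#">'
    '<img alt="x"/></a></div><h6><a href="#">\n  First title \n</a></h6></article>'
    '<div class="article-short">\n Second title \n</div>'
    '<div class="article-short"><a href="#"><img alt="y"/></a></div>'
)

titles = extract_titles(sample)
print(titles)
```

Collecting the titles in a list rather than printing inside the loop keeps the parsing step separate from the output, which makes it easier to test against saved HTML.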

Thanks! @aldanor it works much better now! – raid3r