
I want to extract everything with class="news_item" from a page. How do I extract all the href and src attributes inside that specific div with BeautifulSoup in Python?

The HTML looks like this:

<div class="col"> 
<div class="group"> 
<h4>News</h4> 
<div class="news_item"> 

<a href="www.link.com"> 

<h2 class="link"> 
here is a link-heading 
</h2> 
<div class="Img"> 
<img border="0" src="/image/link" /> 
</div> 
<p></p> 
</a> 
</div> 

From this, what I want to extract is:

www.link.com, "here is a link-heading", and /image/link

My code is:

def scrape_a(url): 
    news_links = soup.select("div.news_item [href]") 
    for links in news_links: 
        if news_links: 
            return 'http://www.web.com' + news_links['href'] 

def scrape_headings(url): 
    for news_headings in soup.select("h2.link"): 
        return str(news_headings.string.strip()) 

def scrape_images(url): 
    images = soup.select("div.Img[src]") 
    for image in images: 
        if images: 
            return 'http://www.web.com' + news_links['src'] 

def top_stories(): 
    r = requests.get(url) 
    soup = BeautifulSoup(r.content) 
    link = scrape_a(soup) 
    heading = scrape_headings(soup) 
    image = scrape_images(soup) 
    message = {'heading': heading, 'link': link, 'image': image} 
    print message 

The problem is that it gives me this error:

**TypeError: 'NoneType' object is not callable** 

Here is the traceback:

Traceback (most recent call last): 
    File "web_parser.py", line 40, in <module> 
    top_stories() 
    File "web_parser.py", line 32, in top_stories 
    link = scrape_a('www.link.com') 
    File "web_parser.py", line 10, in scrape_a 
    news_links = soup.select_all("div.news_item [href]") 

Please paste the stack traceback. – hjpotter92


@hjpotter92 Done, please look at the post again. – Imo


What is 'div.news_item [href]' supposed to match/find? – hjpotter92

Answers


You should grab all of the news items at once and then iterate over them. That makes it easy to organize the data into chunks you can manage (dictionaries, in this case). Try something like this:

url = "http://www.web.com" 
r = requests.get(url) 
soup = BeautifulSoup(r.text) 

messages = [] 

news_links = soup.select("div.news_item") # selects all .news_item's 
for l in news_links: 
    message = {} 

    message['heading'] = l.find("h2").text.strip() 

    link = l.find("a") 
    if link: 
     message['link'] = link['href'] 
    else: 
     continue 

    image = l.find('img') 
    if image: 
     message['image'] = "http://www.web.com{}".format(image['src']) 
    else: 
     continue 

    messages.append(message) 

print messages 

Thanks, I think this comes closest to answering my question, although I still get this error: File "web_parser.py", line 22: message['image'] = 'http://www.web.com{}".format(image_src) ^ SyntaxError: EOL while scanning string literal – Imo


Yes, that's because I opened the string with a single quote and closed it with a double quote. It should work now. – wpercy


It returns a NoneType object: Traceback (most recent call last): File "web_parser.py", line 20, in message['link'] = l.find("a")["href"] TypeError: 'NoneType' object has no attribute '__getitem__' – Imo


Most of the errors come from the fact that news_links isn't found. You aren't getting back the tag you expect.

Change this:

news_links = soup.select("div.news_item [href]") 
for links in news_links: 
    if news_links: 
        return 'http://www.web.com' + news_links['href'] 

to this, and see if it helps:

news_links = soup.find_all("div", class_="news_item") 
for links in news_links: 
    if news_links: 
        return 'http://www.web.com' + links.find("a").get('href') 

Also note that the return statement will give you something like http://www.web.comwww.link.com, which I don't think is what you want.
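
If you do want absolute URLs, one common approach (just a sketch, not part of the original answer, assuming http://www.web.com is the base site) is to join the base and the relative path with urljoin instead of concatenating strings, so the slashes are handled for you:

from urlparse import urljoin  # Python 2; on Python 3 use urllib.parse

base = 'http://www.web.com' 
# relative src from the snippet -> http://www.web.com/image/link 
print urljoin(base, '/image/link') 
# a bare host like www.link.com is still treated as a relative path 
print urljoin(base, 'www.link.com') 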


Your idea of splitting the task into separate functions is quite good - nice for reading, changing and reusing.

The error itself is almost incidental: the traceback shows select_all, which doesn't exist in BeautifulSoup and isn't even in the code you posted, along with a few other issues... Long story short, I would do it like this:

# -*- coding: utf-8 -*- 
from bs4 import BeautifulSoup 
from urlparse import urljoin 
import requests 


def news_links(url, soup): 
    links = [] 
    for text in soup.select("div.news_item"): 
     for x in text.find_all(href=True): 
      links.append(urljoin(url, x['href'])) 
    return links 


def news_headings(soup): 
    headings = [] 
    for news_headings in soup.select("h2.link"): 
     headings.append(str(news_headings.string.strip())) 
    return headings 


def news_images(url, soup): 
    sources = [] 
    for image in soup.select("img[src]"): 
     sources.append(urljoin(url, image['src'])) 
    return sources 


def top_stories(): 
    url = 'http://www.web.com/' 
    r = requests.get(url) 
    content = r.content 
    soup = BeautifulSoup(content) 
    message = {'heading': news_headings(soup), 
       'link': news_links(url, soup), 
       'image': news_images(url, soup)} 
    return message 


print top_stories() 

BeautifulSoup is robust here: if you try to find or select something that doesn't exist, it returns an empty list. It looks like you are parsing a list of items - the code above is pretty much written for that.
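
For example (a small sketch against the snippet from the question, assuming bs4), a selector that matches nothing just yields an empty list, so the loops above fall through harmlessly instead of raising:

from bs4 import BeautifulSoup 

html = ''' 
<div class="news_item"> 
 <a href="www.link.com"><h2 class="link">here is a link-heading</h2></a> 
</div> 
''' 
soup = BeautifulSoup(html) 

print soup.select("div.news_item [href]")  # one <a> tag 
print soup.select("div.Img [src]")     # nothing matches -> [] 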