
I want to extract everything with class="news_item" from a page. How do I extract all the href and src attributes inside that specific div with BeautifulSoup in Python?

The HTML looks like this:

<div class="col"> 
<div class="group"> 
<h4>News</h4> 
<div class="news_item"> 

<a href="www.link.com"> 

<h2 class="link"> 
here is a link-heading 
</h2> 
<div class="Img"> 
<img border="0" src="/image/link" /> 
</div> 
<p></p> 
</a> 
</div> 

From this, what I want to extract is:

www.link.com, "here is a link-heading", and /image/link

My code is:

def scrape_a(url): 
    news_links = soup.select("div.news_item [href]") 
    for links in news_links: 
        if news_links: 
            return 'http://www.web.com' + news_links['href'] 

def scrape_headings(url): 
    for news_headings in soup.select("h2.link"): 
        return str(news_headings.string.strip()) 

def scrape_images(url): 
    images = soup.select("div.Img[src]") 
    for image in images: 
        if images: 
            return 'http://www.web.com' + news_links['src'] 

def top_stories(): 
    r = requests.get(url) 
    soup = BeautifulSoup(r.content) 
    link = scrape_a(soup) 
    heading = scrape_headings(soup) 
    image = scrape_images(soup) 
    message = {'heading': heading, 'link': link, 'image': image} 
    print message 

The problem is that it gives me this error:

**TypeError: 'NoneType' object is not callable** 

Here is the traceback:

Traceback (most recent call last): 
    File "web_parser.py", line 40, in <module> 
    top_stories() 
    File "web_parser.py", line 32, in top_stories 
    link = scrape_a('www.link.com') 
    File "web_parser.py", line 10, in scrape_a 
    news_links = soup.select_all("div.news_item [href]") 

Please paste the stack traceback. – hjpotter92


@hjpotter92 Done, please look at the post again. – Imo


What is 'div.news_item [href]' supposed to match/find? – hjpotter92

Answers


You should grab all of the news items at once and then iterate over them. That makes it easy to organize the data into chunks you can manage (dictionaries, in this case). Try something like this:

url = "http://www.web.com" 
r = requests.get(url) 
soup = BeautifulSoup(r.text) 

messages = [] 

news_links = soup.select("div.news_item") # selects all .news_item's 
for l in news_links: 
    message = {} 

    message['heading'] = l.find("h2").text.strip() 

    link = l.find("a") 
    if link: 
     message['link'] = link['href'] 
    else: 
     continue 

    image = l.find('img') 
    if image: 
     message['image'] = "http://www.web.com{}".format(image['src']) 
    else: 
     continue 

    messages.append(message) 

print messages 

Thanks, I think this comes closest to answering my question, although I still get this error: File "web_parser.py", line 22: message['image'] = 'http://www.web.com{}".format(image_src) ^ SyntaxError: EOL while scanning string literal – Imo


Yes, that's because I opened the string with a single quote and closed it with a double quote. It should work now. – wpercy


It returns a NoneType object: Traceback (most recent call last): File "web_parser.py", line 20, in message['link'] = l.find("a")["href"] TypeError: 'NoneType' object has no attribute '__getitem__' – Imo


Most of the errors come from the fact that news_links isn't found. You aren't getting back the tag you expect.

Change this:

news_links = soup.select("div.news_item [href]") 
for links in news_links: 
    if news_links: 
        return 'http://www.web.com' + news_links['href'] 

to this, and see if it helps:

news_links = soup.find_all("div", class_="news_item") 
for links in news_links: 
    if news_links: 
        return 'http://www.web.com' + links.find("a").get('href') 

Also note that the return statement will give you something like http://www.web.comwww.link.com, which I don't think is what you want.
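
If you do want absolute URLs, one common approach (just a sketch, not part of the original answer, assuming http://www.web.com is the base site) is to join the base and the relative path with urljoin instead of concatenating strings, so the slashes are handled for you:

from urlparse import urljoin  # Python 2; on Python 3 use urllib.parse

base = 'http://www.web.com' 
# relative src from the snippet -> http://www.web.com/image/link 
print urljoin(base, '/image/link') 
# a bare host like www.link.com is still treated as a relative path 
print urljoin(base, 'www.link.com') 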


Your idea of splitting the task into separate functions is quite good - nice for reading, changing and reusing.

The error itself is almost incidental: the traceback shows select_all, which doesn't exist in BeautifulSoup and isn't even in the code you posted, along with a few other issues... Long story short, I would do it like this:

# -*- coding: utf-8 -*- 
from bs4 import BeautifulSoup 
from urlparse import urljoin 
import requests 


def news_links(url, soup): 
    links = [] 
    for text in soup.select("div.news_item"): 
     for x in text.find_all(href=True): 
      links.append(urljoin(url, x['href'])) 
    return links 


def news_headings(soup): 
    headings = [] 
    for news_headings in soup.select("h2.link"): 
     headings.append(str(news_headings.string.strip())) 
    return headings 


def news_images(url, soup): 
    sources = [] 
    for image in soup.select("img[src]"): 
     sources.append(urljoin(url, image['src'])) 
    return sources 


def top_stories(): 
    url = 'http://www.web.com/' 
    r = requests.get(url) 
    content = r.content 
    soup = BeautifulSoup(content) 
    message = {'heading': news_headings(soup), 
       'link': news_links(url, soup), 
       'image': news_images(url, soup)} 
    return message 


print top_stories() 

BeautifulSoup is robust here: if you try to find or select something that doesn't exist, it returns an empty list. It looks like you are parsing a list of items - the code above is pretty much written for that.
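
For example (a small sketch against the snippet from the question, assuming bs4), a selector that matches nothing just yields an empty list, so the loops above fall through harmlessly instead of raising:

from bs4 import BeautifulSoup 

html = ''' 
<div class="news_item"> 
 <a href="www.link.com"><h2 class="link">here is a link-heading</h2></a> 
</div> 
''' 
soup = BeautifulSoup(html) 

print soup.select("div.news_item [href]")  # one <a> tag 
print soup.select("div.Img [src]")     # nothing matches -> [] 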