How do I extract all the href and src attributes inside every div with class="news_item" on a page, using BeautifulSoup in Python?
The HTML inside the div looks like this:
<div class="col">
  <div class="group">
    <h4>News</h4>
    <div class="news_item">
      <a href="www.link.com">
        <h2 class="link">
          here is a link-heading
        </h2>
        <div class="Img">
          <img border="0" src="/image/link" />
        </div>
        <p></p>
      </a>
    </div>
  </div>
</div>
From this, what I want to extract is:
www.link.com, "here is a link-heading", and /image/link
My code is:
def scrape_a(url):
    news_links = soup.select("div.news_item [href]")
    for links in news_links:
        if news_links:
            return 'http://www.web.com' + news_links['href']

def scrape_headings(url):
    for news_headings in soup.select("h2.link"):
        return str(news_headings.string.strip())

def scrape_images(url):
    images = soup.select("div.Img[src]")
    for image in images:
        if images:
            return 'http://www.web.com' + news_links['src']

def top_stories():
    r = requests.get(url)
    soup = BeautifulSoup(r.content)
    link = scrape_a(soup)
    heading = scrape_headings(soup)
    image = scrape_images(soup)
    message = {'heading': heading, 'link': link, 'image': image}
    print message
The problem is that it gives me this error:
**TypeError: 'NoneType' object is not callable**
Here is the traceback:
Traceback (most recent call last):
File "web_parser.py", line 40, in <module>
top_stories()
File "web_parser.py", line 32, in top_stories
link = scrape_a('www.link.com')
File "web_parser.py", line 10, in scrape_a
news_links = soup.select_all("div.news_item [href]")
Please paste the stack traceback – hjpotter92
@hjpotter92 Done, please see the post again – Imo
What is 'div.news_item [href]' supposed to match/find? – hjpotter92
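For reference, here is a minimal working sketch of the intended extraction. It parses the snippet once, then pulls the href, heading, and src out of each `div.news_item` with `select_one`, instead of passing a URL to functions that expect a soup object. The inline `HTML` string and the `top_stories` signature are illustrative, not the asker's exact setup:

```python
from bs4 import BeautifulSoup

# The snippet from the question, inlined for a self-contained demo.
HTML = """
<div class="col">
  <div class="group">
    <h4>News</h4>
    <div class="news_item">
      <a href="www.link.com">
        <h2 class="link">here is a link-heading</h2>
        <div class="Img"><img border="0" src="/image/link" /></div>
        <p></p>
      </a>
    </div>
  </div>
</div>
"""

def top_stories(html):
    soup = BeautifulSoup(html, "html.parser")
    stories = []
    # Iterate over each news item; guard against missing sub-elements.
    for item in soup.select("div.news_item"):
        link = item.select_one("a[href]")          # the <a>, not its child
        heading = item.select_one("h2.link")
        image = item.select_one("div.Img img[src]") # src lives on <img>, not the div
        stories.append({
            "link": link["href"] if link else None,
            "heading": heading.get_text(strip=True) if heading else None,
            "image": image["src"] if image else None,
        })
    return stories

print(top_stories(HTML))
```

Note the two selector fixes: `div.Img[src]` matched nothing because the `src` attribute is on the `<img>` inside the div, and the original functions referenced a module-level `soup` that never existed at call time.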