How to scrape multiple pages faster and more efficiently in Python? I just wrote some code that scrapes, one by one, the page of every GSoC organization mentioned on the site.
Currently this works fine, but it is quite slow. Is there any way to make it faster? Also, please offer any other suggestions for improving this code.
from bs4 import BeautifulSoup
import requests, sys, os

f = open('GSOC-Organizations.txt', 'w')
r = requests.get("https://summerofcode.withgoogle.com/archive/2016/organizations/")
soup = BeautifulSoup(r.content, "html.parser")

# Collect the link and name of every organization card on the archive page.
a_tags = soup.find_all("a", {"class": "organization-card__link"})
title_heads = soup.find_all("h4", {"class": "organization-card__name"})
links, titles = [], []
for tag in a_tags:
    links.append("https://summerofcode.withgoogle.com" + tag.get('href'))
for title in title_heads:
    titles.append(title.getText())

# Visit each organization page and write its technology tags to the file.
for i in range(0, len(links)):
    ct = 1
    print "Currently Scraping : ",
    print titles[i]
    name = titles[i] + "\n" + "\tTechnologies: \n"
    name = name.encode('utf-8')
    f.write(str(name))
    req = requests.get(links[i])
    page = BeautifulSoup(req.content, "html.parser")
    techs = page.find_all("li", {"class": "organization__tag--technology"})
    for item in techs:
        text, ct = ("\t" + str(ct) + ".) " + item.getText() + "\n").encode('utf-8'), ct + 1
        f.write(str(text))
    newlines = ("\n\n").encode('utf-8')
    f.write(newlines)
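For reference, the bottleneck here is network I/O: the script issues one sequential request per organization page. A common way to speed this up is to fetch the pages concurrently with a thread pool and then parse them. A minimal sketch reusing the `links` list built above (the pool size of 8 is an arbitrary choice, not something from the original post):

    # Minimal sketch: fetch all organization pages concurrently.
    # multiprocessing.dummy provides a thread-backed Pool in both
    # Python 2 and Python 3; threads parallelize network-bound work well.
    from multiprocessing.dummy import Pool
    import requests

    def fetch(link):
        # One requests.get per call keeps the sketch simple; a shared
        # requests.Session with connection pooling would be faster still.
        return requests.get(link).content

    pool = Pool(8)                  # 8 worker threads; tune to taste
    pages = pool.map(fetch, links)  # results come back in the order of `links`
    pool.close()
    pool.join()

The sequential parsing and file-writing loop can then run over `pages` instead of calling `requests.get` inside the loop.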
Since this asks for code improvements, it should be posted on [Code Review](http://codereview.stackexchange.com/). The people there are very helpful. For now, both of the outer 'for' loops could really be merged into a single loop with 'zip'. And what does *quite slow* mean? For me, all 178 links were scraped in about 4 minutes. Is that too slow? – Parfait
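To illustrate the zip suggestion from that comment: instead of building two parallel lists and indexing into them, the anchor tags and headings can be iterated in lockstep (a rough sketch, not a complete replacement for the loop above):

    # Iterate over the <a> tags and <h4> headings together instead of
    # building the parallel `links` and `titles` lists first.
    for tag, title in zip(a_tags, title_heads):
        link = "https://summerofcode.withgoogle.com" + tag.get('href')
        name = title.getText()
        # ... fetch `link`, parse its technology tags, and write `name`
        # plus the numbered tag list to the file as before ...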