Optimizing my Python scraper

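This is a somewhat long-winded question, and I probably just need someone to point me in the right direction. I'm building a web scraper to pull basketball player information from the ESPN website. The URL structure is simple: each player card has a specific ID in its URL. To get the information, I'm writing a loop over IDs 1 to 6000 to scrape players from their database. My question is: is there a more efficient way to do this?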

import requests
from bs4 import BeautifulSoup

BASE = 'http://espn.go.com/nba/player/stats/_/id/'  # Base structure of the player card URL

player_id = []  # IDs of the players that were found
age = []  # Ages of those players, in the same order as player_id

def get_age(base):
    for i in range(1, 6000):  # Try every candidate player ID
        url = base + str(i) + '/'  # Build the URL for this player
        r = requests.get(url)
        soup = BeautifulSoup(r.text, 'html.parser')
        # Prior to this step, I had to print out the soup object and look through the HTML
        # in order to find the tag that contained my desired information
        age_tables = soup.find_all('ul', class_='player-metadata')  # Grab the metadata list(s)
        p = str(age_tables)  # Turn the result into a string
        # At this point I had to look at all the text in p and determine a way to capture the age info
        if 'Age: ' not in p:  # Player ID doesn't exist, so go to the next one to avoid an error
            continue
        start = p.index('Age: ') + len('Age: ')  # Position where the player's age starts
        end = p.index(')', start)  # Closing parenthesis right after the age
        player_id.append(i)  # Add the ID to the player_id list
        age.append(p[start:end])  # Add the player's age to the age list

get_age(BASE)

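Any help, however small, would be greatly appreciated, even if it just points me in the right direction rather than giving a direct solution.

Thanks, Ben

Source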

2015-06-21 mangodreamz

I would probably start at http://espn.go.com/nba/players and use the following regular expression to get the team roster URLs...

\href="(/nba/teams/roster\?team=[^"]+)">([^<]+)</a>\

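Then I would take the resulting match groups, where \1 is the last part of the URL and \2 is the team name. I would then use those URLs to scrape each team roster page, looking for player URLs...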

\href="(http://espn.go.com/nba/player/_/id/[^"]+)">([^<]+)</a>\

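Finally I would take the resulting match groups, where \1 is the URL of the player page and \2 is the player name. I would scrape each resulting URL for the information I need.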
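The two matching steps above could be sketched roughly as follows. This is only an illustration: the outer backslashes around the patterns are assumed to be delimiters and are left out, the dots in the host name are escaped, and the sample markup and helper names (`find_rosters`, `find_players`) are made up for the example rather than taken from ESPN's pages.

```python
import re

# Patterns adapted from the answer (outer backslashes assumed to be delimiters).
ROSTER_RE = re.compile(r'href="(/nba/teams/roster\?team=[^"]+)">([^<]+)</a>')
PLAYER_RE = re.compile(
    r'href="(http://espn\.go\.com/nba/player/_/id/[^"]+)">([^<]+)</a>'
)


def find_rosters(html):
    """Return (relative roster URL, team name) pairs found in html."""
    return ROSTER_RE.findall(html)


def find_players(html):
    """Return (player page URL, player name) pairs found in html."""
    return PLAYER_RE.findall(html)


# Hypothetical markup, standing in for the downloaded pages.
players_page = '<li><a href="/nba/teams/roster?team=bos">Boston Celtics</a></li>'
roster_page = (
    '<td><a href="http://espn.go.com/nba/player/_/id/1234/some-player">'
    'Some Player</a></td>'
)

print(find_rosters(players_page))
print(find_players(roster_page))
```

In a real run, each roster URL would be joined with the site root, downloaded, and passed to `find_players`, so only pages for players that actually exist get requested.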

Regex is the bomb.

Hope this helps.

Source

2015-06-21 03:37:01 CLaFarge

This is like a port scanner in network security: multithreading will speed up your program a great deal.

Source

2015-06-21 01:12:37

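Ah, I've heard of multithreading. Do you know of an easy-to-follow online tutorial? – mangodreamz

Personally, I think the documentation for the 'multiprocessing' library is a good place to start. If the documentation isn't enough for you, you can look at guides for the library. –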
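As a rough sketch of the multithreading idea, the loop from the question could be spread over a small pool of worker threads. The example below uses the standard library's `concurrent.futures` rather than `multiprocessing`, and `urllib.request` instead of `requests`, purely to stay dependency-free. The names `download`, `scrape_ages` and the `fetch` parameter are hypothetical, and the regular expression assumes the page contains text of the form "Age: 25", as the question's code does.

```python
import re
from concurrent.futures import ThreadPoolExecutor
from urllib.request import urlopen

BASE = "http://espn.go.com/nba/player/stats/_/id/"  # URL pattern from the question
AGE_RE = re.compile(r"Age:\s*(\d+)")


def download(player_id):
    """Fetch one player page; return '' if the request fails."""
    try:
        with urlopen(f"{BASE}{player_id}/", timeout=10) as resp:
            return resp.read().decode("utf-8", errors="replace")
    except OSError:  # URLError, HTTPError and timeouts all derive from OSError
        return ""


def scrape_ages(player_ids, fetch=download, workers=16):
    """Return {player_id: age} for every ID whose page shows an age."""
    player_ids = list(player_ids)
    ages = {}
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # map() runs fetch in worker threads and yields pages in input order.
        for pid, html in zip(player_ids, pool.map(fetch, player_ids)):
            match = AGE_RE.search(html)
            if match:
                ages[pid] = int(match.group(1))
    return ages
```

Calling `scrape_ages(range(1, 6000))` would then issue up to 16 requests at a time. Because the work is dominated by waiting on the network, threads are usually enough here; the worker count is a guess and is best kept modest out of courtesy to the server.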

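An approach that is not only more efficient but also more organized and scalable would be to switch to the Scrapy web-scraping framework.

The main performance problem you have comes from the "blocking" nature of your current approach. Scrapy solves this out of the box because it is based on twisted and is completely asynchronous.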

Source

2015-06-21 01:21:29 alecxe

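Thanks! I'll take a look at Scrapy! – mangodreamz

Sure, let me know if you need help making a Scrapy spider. You'd be surprised how easy it is to get started with Scrapy. – alecxe

Thanks alecxe! I might reach out. – mangodreamz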
