
Python - Increase speed of code using Pandas.append

So I'm trying to scrape news headlines from here, for an entire decade.

Here, years is a list containing:

/resources/archive/us/2007.html 
/resources/archive/us/2008.html 
/resources/archive/us/2009.html 
/resources/archive/us/2010.html 
/resources/archive/us/2011.html 
/resources/archive/us/2012.html 
/resources/archive/us/2013.html 
/resources/archive/us/2014.html 
/resources/archive/us/2015.html 
/resources/archive/us/2016.html 

What my code does is this: it opens each year's page, collects the links for all the dates, then opens each individual day page, grabs all of its .text, and appends each headline with its corresponding date as a row to the dataframe headlines.

import re

import pandas as pd
import requests
from bs4 import BeautifulSoup

headlines = pd.DataFrame(columns=["date", "headline"])

for y in years:
    # Open the archive page for one year
    yurl = "http://www.reuters.com" + str(y)
    response = requests.get(yurl, headers={'User-Agent': 'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2228.0 Safari/537.36'})
    bs = BeautifulSoup(response.content.decode('ascii', 'ignore'), 'lxml')

    # Collect the links to the individual day pages
    days = []
    links = bs.findAll('h5')
    for mon in links:
        for day in mon.next_sibling.next_sibling:
            days.append(day)
    days = [e for e in days if str(e) not in ('\n')]  # drop the bare newline strings

    for ind in days:
        # Turn a link like /resources/archive/us/20080101.html into 01-01-2008
        hlday = ind['href']
        date = re.findall(r'(?!\/)[0-9].+(?=\.)', hlday)[0]
        date = date[4:6] + '-' + date[6:] + '-' + date[:4]
        print(date.split('-')[2])  # progress indicator: prints the year

        # Fetch the day page and pull every headline on it
        yurl = "http://www.reuters.com" + str(hlday)
        response = requests.get(yurl, headers={'User-Agent': 'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2228.0 Safari/537.36'})
        if response.status_code == 404 or response.content == b'':
            print('')
        else:
            bs = BeautifulSoup(response.content.decode('ascii', 'ignore'), 'lxml')
            lines = bs.findAll('div', {'class': 'headlineMed'})
            for h in lines:
                # One append per headline -- this rebuilds the whole frame every time
                headlines = headlines.append([{"date": date, "headline": h.text}], ignore_index=True)

It takes forever to run, so instead of running the for loop over all the years I ran it for just a single year, /resources/archive/us/2008.html.

It's been 3 hours now and it's still running.

Since I'm new to Python, I don't understand what I'm doing wrong, or how I could do it better.

Could it be that pandas.append takes forever because on every call it has to read and write an ever-larger dataframe?
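A quick way to test that suspicion is to time repeated append against collecting rows in a plain list first. This is a minimal sketch with a made-up row count; note that DataFrame.append was later deprecated in pandas 1.4 and removed in 2.0, so it mirrors the pandas of 2017:

import time

import pandas as pd

N = 5000  # arbitrary row count, just for illustration

# Repeated append: every call copies the entire frame, so the total cost grows quadratically
start = time.time()
df = pd.DataFrame(columns=["date", "headline"])
for i in range(N):
    df = df.append([{"date": "01-01-2008", "headline": "x"}], ignore_index=True)
print("append:", time.time() - start)

# Collect plain dicts in a list and build the frame once at the end
start = time.time()
rows = []
for i in range(N):
    rows.append({"date": "01-01-2008", "headline": "x"})
df = pd.DataFrame(rows)
print("list + DataFrame:", time.time() - start)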


Don't append hundreds of times. Instead, keep a list of the 100 separate dataframes, then call 'pd.concat' once at the very end.

Answer


You are using this antipattern:

headlines = pd.DataFrame()
for y in years:
    for ind in days:
        headlines = headlines.append(blah)

Instead, do this:

headlines = []
for y in years:
    for ind in days:
        headlines.append(pd.DataFrame(blah))

headlines = pd.concat(headlines)
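Concretely, with a few made-up rows standing in for the scraped headlines, the pattern looks like this:

import pandas as pd

# Hypothetical rows standing in for the scraped (date, headline) pairs
scraped = [
    {"date": "01-01-2008", "headline": "Headline A"},
    {"date": "01-02-2008", "headline": "Headline B"},
    {"date": "01-03-2008", "headline": "Headline C"},
]

# Accumulate small frames in a plain Python list -- list.append is O(1)
frames = []
for row in scraped:
    frames.append(pd.DataFrame([row]))

# Pay the dataframe construction cost exactly once, at the end
headlines = pd.concat(frames, ignore_index=True)
print(headlines)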

The second potential problem is that you're making 3650 web requests. If I were running a site like that, I would add throttling to slow down scrapers like yours. You may find it better to collect the raw data once, store it on disk, and only then process it. That way you don't pay the cost of 3650 web requests every time you need to debug your program.
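One way to do that is to route every fetch through a small disk cache, so each page is downloaded at most once across all your debugging runs. get_cached and CACHE_DIR below are hypothetical names, a sketch of the idea rather than a drop-in replacement:

import os

import requests

CACHE_DIR = "reuters_cache"  # hypothetical directory for the raw pages

def get_cached(url):
    """Fetch url once; on later runs, reuse the copy stored on disk."""
    os.makedirs(CACHE_DIR, exist_ok=True)
    # Derive a filesystem-safe filename from the URL
    fname = os.path.join(CACHE_DIR, url.replace('/', '_').replace(':', '_'))
    if os.path.exists(fname):
        with open(fname, 'rb') as f:
            return f.read()
    content = requests.get(url).content
    with open(fname, 'wb') as f:
        f.write(content)
    return content

The scraper would then call get_cached(yurl) instead of requests.get(yurl), and only the very first run touches the network.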