
Python - Increase speed of code using Pandas.append

So I'm trying to scrape news headlines from here, for an entire decade.

Here, years is a list containing:

/resources/archive/us/2007.html 
/resources/archive/us/2008.html 
/resources/archive/us/2009.html 
/resources/archive/us/2010.html 
/resources/archive/us/2011.html 
/resources/archive/us/2012.html 
/resources/archive/us/2013.html 
/resources/archive/us/2014.html 
/resources/archive/us/2015.html 
/resources/archive/us/2016.html 

What my code does is this: it opens each year's page, collects the links for all the dates, then opens each individual day page, grabs all of its .text, and appends each headline with its corresponding date as a row to the dataframe headlines.

import re

import pandas as pd
import requests
from bs4 import BeautifulSoup

headlines = pd.DataFrame(columns=["date", "headline"])

for y in years:
    # Open the archive page for one year
    yurl = "http://www.reuters.com" + str(y)
    response = requests.get(yurl, headers={'User-Agent': 'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2228.0 Safari/537.36'})
    bs = BeautifulSoup(response.content.decode('ascii', 'ignore'), 'lxml')

    # Collect the links to the individual day pages
    days = []
    links = bs.findAll('h5')
    for mon in links:
        for day in mon.next_sibling.next_sibling:
            days.append(day)
    days = [e for e in days if str(e) not in ('\n')]  # drop the bare newline strings

    for ind in days:
        # Turn a link like /resources/archive/us/20080101.html into 01-01-2008
        hlday = ind['href']
        date = re.findall(r'(?!\/)[0-9].+(?=\.)', hlday)[0]
        date = date[4:6] + '-' + date[6:] + '-' + date[:4]
        print(date.split('-')[2])  # progress indicator: prints the year

        # Fetch the day page and pull every headline on it
        yurl = "http://www.reuters.com" + str(hlday)
        response = requests.get(yurl, headers={'User-Agent': 'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2228.0 Safari/537.36'})
        if response.status_code == 404 or response.content == b'':
            print('')
        else:
            bs = BeautifulSoup(response.content.decode('ascii', 'ignore'), 'lxml')
            lines = bs.findAll('div', {'class': 'headlineMed'})
            for h in lines:
                # One append per headline -- this rebuilds the whole frame every time
                headlines = headlines.append([{"date": date, "headline": h.text}], ignore_index=True)

It takes forever to run, so instead of running the for loop over all the years I ran it for just a single year, /resources/archive/us/2008.html.

It's been 3 hours now and it's still running.

Since I'm new to Python, I don't understand what I'm doing wrong, or how I could do it better.

Could it be that pandas.append takes forever because on every call it has to read and write an ever-larger dataframe?
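A quick way to test that suspicion is to time repeated append against collecting rows in a plain list first. This is a minimal sketch with a made-up row count; note that DataFrame.append was later deprecated in pandas 1.4 and removed in 2.0, so it mirrors the pandas of 2017:

import time

import pandas as pd

N = 5000  # arbitrary row count, just for illustration

# Repeated append: every call copies the entire frame, so the total cost grows quadratically
start = time.time()
df = pd.DataFrame(columns=["date", "headline"])
for i in range(N):
    df = df.append([{"date": "01-01-2008", "headline": "x"}], ignore_index=True)
print("append:", time.time() - start)

# Collect plain dicts in a list and build the frame once at the end
start = time.time()
rows = []
for i in range(N):
    rows.append({"date": "01-01-2008", "headline": "x"})
df = pd.DataFrame(rows)
print("list + DataFrame:", time.time() - start)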


Don't append hundreds of times. Instead, keep a list of the 100 separate dataframes, then call 'pd.concat' once at the very end.

Answer


You are using this antipattern:

headlines = pd.DataFrame()
for y in years:
    for ind in days:
        headlines = headlines.append(blah)

Instead, do this:

headlines = []
for y in years:
    for ind in days:
        headlines.append(pd.DataFrame(blah))

headlines = pd.concat(headlines)
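Concretely, with a few made-up rows standing in for the scraped headlines, the pattern looks like this:

import pandas as pd

# Hypothetical rows standing in for the scraped (date, headline) pairs
scraped = [
    {"date": "01-01-2008", "headline": "Headline A"},
    {"date": "01-02-2008", "headline": "Headline B"},
    {"date": "01-03-2008", "headline": "Headline C"},
]

# Accumulate small frames in a plain Python list -- list.append is O(1)
frames = []
for row in scraped:
    frames.append(pd.DataFrame([row]))

# Pay the dataframe construction cost exactly once, at the end
headlines = pd.concat(frames, ignore_index=True)
print(headlines)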

The second potential problem is that you're making 3650 web requests. If I were running a site like that, I would add throttling to slow down scrapers like yours. You may find it better to collect the raw data once, store it on disk, and only then process it. That way you don't pay the cost of 3650 web requests every time you need to debug your program.
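One way to do that is to route every fetch through a small disk cache, so each page is downloaded at most once across all your debugging runs. get_cached and CACHE_DIR below are hypothetical names, a sketch of the idea rather than a drop-in replacement:

import os

import requests

CACHE_DIR = "reuters_cache"  # hypothetical directory for the raw pages

def get_cached(url):
    """Fetch url once; on later runs, reuse the copy stored on disk."""
    os.makedirs(CACHE_DIR, exist_ok=True)
    # Derive a filesystem-safe filename from the URL
    fname = os.path.join(CACHE_DIR, url.replace('/', '_').replace(':', '_'))
    if os.path.exists(fname):
        with open(fname, 'rb') as f:
            return f.read()
    content = requests.get(url).content
    with open(fname, 'wb') as f:
        f.write(content)
    return content

The scraper would then call get_cached(yurl) instead of requests.get(yurl), and only the very first run touches the network.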