Trouble merging scraped data using pandas and numpy in Python
2017-08-15

I want to collect information from many different URLs and combine the data based on year and golfer name. Right now I am trying to write the information to csv and then match it up with pd.merge(), but I have to use a unique name for each dataframe in order to merge. I tried using a numpy array, but I am stuck on the final step of getting all of the data back out separately.
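A minimal sketch of the kind of key-based merge being attempted here, with made-up dataframe names (df_rating, df_rounds) and toy values purely for illustration:

import pandas as pd

# two hypothetical per-statistic tables, both keyed by year and player name
df_rating = pd.DataFrame({'year': ['2017', '2017'],
                          'PLAYER NAME': ['Rickie Fowler', 'Dustin Johnson'],
                          'RATING': [8.8, 8.8]})
df_rounds = pd.DataFrame({'year': ['2017', '2017'],
                          'PLAYER NAME': ['Rickie Fowler', 'Dustin Johnson'],
                          'ROUNDS': [62, 58]})

# merging on the shared keys puts each player/year on a single row
combined = df_rating.merge(df_rounds, on=['year', 'PLAYER NAME'], how='outer')
print(combined)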

import csv 
from urllib.request import urlopen 
from bs4 import BeautifulSoup 
import datetime 
import socket 
import urllib.error 
import pandas as pd 
import urllib 
import sqlalchemy 
import numpy as np 

base = 'http://www.pgatour.com/' 
inn = 'stats/stat' 
end = '.html' 
years = ['2017','2016','2015','2014','2013'] 

alpha = [] 
#all pages with links to tables 
urls = ['http://www.pgatour.com/stats.html','http://www.pgatour.com/stats/categories.ROTT_INQ.html','http://www.pgatour.com/stats/categories.RAPP_INQ.html','http://www.pgatour.com/stats/categories.RARG_INQ.html','http://www.pgatour.com/stats/categories.RPUT_INQ.html','http://www.pgatour.com/stats/categories.RSCR_INQ.html','http://www.pgatour.com/stats/categories.RSTR_INQ.html','http://www.pgatour.com/stats/categories.RMNY_INQ.html','http://www.pgatour.com/stats/categories.RPTS_INQ.html'] 
for i in urls: 
    data = urlopen(i) 
    soup = BeautifulSoup(data, "html.parser") 
    for link in soup.find_all('a'): 
     if link.has_attr('href'): 
      alpha.append(base + link['href'][17:]) #may need adjusting 
#data links 
beta = [] 
for i in alpha: 
    if inn in i: 
     beta.append(i) 
#no repeats 
gamma= [] 
for i in beta: 
    if i not in gamma: 
     gamma.append(i) 

#making list of urls with Statistic labels 
jan = [] 
for i in gamma: 
    try: 
     data = urlopen(i) 
     soup = BeautifulSoup(data, "html.parser") 
     for table in soup.find_all('section',{'class':'module-statistics-off-the-tee-details'}): 
      for j in table.find_all('h3'): 
       y=j.get_text().replace(" ","").replace("-","").replace(":","").replace(">","").replace("<","").replace(">","").replace(")","").replace("(","").replace("=","").replace("+","") 
       jan.append([i,str(y+'.csv')]) 
       print([i,str(y+'.csv')]) 
    except Exception as e: 
      print(e) 
      pass 

# practice url 
#jan = [['http://www.pgatour.com/stats/stat.02356.html', 'Last15EventsScoring.csv']] 
#grabbing data 
#write to csv 
row_sp = [] 
rows_sp =[] 
title1 = [] 
title = [] 
for i in jan: 
    try: 
     with open(i[1], 'w+') as fp: 
      writer = csv.writer(fp) 
      for y in years: 
       data = urlopen(i[0][:-4] +y+ end) 
       soup = BeautifulSoup(data, "html.parser") 
       data1 = urlopen(i[0]) 
       soup1 = BeautifulSoup(data1, "html.parser") 
       for table in soup1.find_all('table',{'id':'statsTable'}): 
        title.append('year') 
        for k in table.find_all('tr'): 
         for n in k.find_all('th'): 
          title1.append(n.get_text()) 
          for l in title1: 
           if l not in title: 
            title.append(l) 
        rows_sp.append(title) 
       for table in soup.find_all('table',{'id':'statsTable'}): 
        for h in table.find_all('tr'): 
         row_sp = [y] 
         for j in h.find_all('td'): 
          row_sp.append(j.get_text().replace(" ","").replace("\n","").replace("\xa0"," ").replace("d","")) 
         rows_sp.append(row_sp) 
         print(row_sp) 
         writer.writerows([row_sp]) 
    except Exception as e: 
     print(e) 
     pass 

from functools import reduce  # reduce is not a builtin in Python 3
dfs = [df1, df2, df3] # store dataframes in one list
df_merge = reduce(lambda left, right: pd.merge(left, right, on=['v1'], how='outer'), dfs)

Below are the URLs, the categories of statistics, and the desired format ... basically everything in between. I'm trying to get each player's data into a single row. URLs for the data below: ['http://www.pgatour.com/stats/stat.02356.html', 'http://www.pgatour.com/stats/stat.02568.html', ..., 'http://www.pgatour.com/stats/stat.111.html']

Statistic titles

LAST 15 EVENTS - SCORING, SG: APPROACH-THE-GREEN, ..., SAND SAVE PERCENTAGE 
year rankthisweek ranklastweek name   events rating rounds avg 
2017 2    3    Rickie Fowler 10  8.8  62 .614  
TOTAL SG:APP MEASURED ROUNDS .... %  # SAVES # BUNKERS TOTAL O/U PAR 
26.386   43    ....70.37 76   108   +7.00 

Where does your code use pandas? Where is the attempted merge? – Parfait


I haven't tried it yet, but it would be something like dataframes = [df1, df2, df3] # stored in one list, then df_merge = reduce(lambda left, right: pd.merge(left, right, on=['column'], how='outer'), dataframes). That's the process I'm trying to accomplish, but I can't get to the point where I can actually use it. –


Why doesn't the chained merge work? Errors? Unwanted results? Are you reading the csvs into dataframes? – Parfait

Answer


UPDATE (per comments)
This question is partly about a specific technique (pandas merge()), but it also seems like a good opportunity to walk through a useful workflow for data collection and cleaning, so I've included more detail and explanation than a strictly minimal coding solution would require.

You can basically use the same approach as my original answer to grab data from the different URL categories. I'd suggest keeping a dict of {url: data} as you iterate over your URL list, and then building cleaned dataframes from that dict.

Setting up the cleaning part involves a bit of legwork, since you need to adjust for the different columns in each URL category. I've demonstrated a manual approach below, using only a few test URLs. But if you have, say, thousands of different URL categories, you may want to think about how to collect and organize the column names programmatically. That feels beyond the scope of this question.

As long as you're sure that year and PLAYER NAME exist in every URL, the merging below should work. As before, let's assume you don't need to write to CSV, and let's hold off on any optimizations to your scraping code for now:

First, define the URL categories in urls. By URL category I mean that http://www.pgatour.com/stats/stat.02356.html will actually be used multiple times, with a series of years spliced into the URL itself, e.g. http://www.pgatour.com/stats/stat.02356.2017.html and http://www.pgatour.com/stats/stat.02356.2016.html. In this example, stat.02356.html is the URL category containing one kind of player data across multiple years.

import pandas as pd 

# test urls given by OP 
# note: each url contains >= 1 data fields not shared by the others 
urls = ['http://www.pgatour.com/stats/stat.02356.html', 
     'http://www.pgatour.com/stats/stat.02568.html', 
     'http://www.pgatour.com/stats/stat.111.html'] 

# we'll store data from each url category in this dict. 
url_data = {} 

Now iterate over urls. Inside the urls loop, this code is exactly the same as in my original answer, which in turn came from the OP - only a few variable names have been adjusted to reflect the new capture method.
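The loop below reuses years, end, urlopen, and BeautifulSoup from the question's setup; a minimal sketch of those assumed definitions, including the per-year URL pattern produced by url[:-4] + y + end:

from urllib.request import urlopen
from bs4 import BeautifulSoup

years = ['2017', '2016', '2015', '2014', '2013']  # from the question's setup
end = '.html'                                     # from the question's setup

# url[:-4] drops the trailing 'html', so a category url such as
# http://www.pgatour.com/stats/stat.02356.html becomes
# http://www.pgatour.com/stats/stat.02356.2017.html for a given year
example_url = 'http://www.pgatour.com/stats/stat.02356.html'[:-4] + years[0] + end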

for url in urls: 
    print("url: ", url) 
    url_data[url] = {"row_sp": [], 
        "rows_sp": [], 
        "title1": [], 
        "title": []} 
    try: 
     #with open(i[1], 'w+') as fp: 
      #writer = csv.writer(fp) 
     for y in years: 
      current_url = url[:-4] +y+ end 
      print("current url is: ", current_url) 
      data = urlopen(current_url) 
      soup = BeautifulSoup(data, "html.parser") 
      data1 = urlopen(url) 
      soup1 = BeautifulSoup(data1, "html.parser") 
      for table in soup1.find_all('table',{'id':'statsTable'}): 
       url_data[url]["title"].append('year') 
       for k in table.find_all('tr'): 
        for n in k.find_all('th'): 
         url_data[url]["title1"].append(n.get_text()) 
         for l in url_data[url]["title1"]: 
          if l not in url_data[url]["title"]: 
           url_data[url]["title"].append(l) 
       url_data[url]["rows_sp"].append(url_data[url]["title"]) 
      for table in soup.find_all('table',{'id':'statsTable'}): 
       for h in table.find_all('tr'): 
        url_data[url]["row_sp"] = [y] 
        for j in h.find_all('td'): 
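         # note: the .replace("d","") below also strips the letter "d" from player
         # names, e.g. "Adam Hadwin" shows up as "Aam Hawin" in the output further down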
         url_data[url]["row_sp"].append(j.get_text().replace(" ","").replace("\n","").replace("\xa0"," ").replace("d","")) 
        url_data[url]["rows_sp"].append(url_data[url]["row_sp"]) 
        #print(row_sp) 
        #writer.writerows([row_sp]) 
    except Exception as e: 
     print(e) 
     pass 

Now, for each key url in url_data, rows_sp contains the data you're interested in for that particular URL category.
Note that rows_sp is now really url_data[url]["rows_sp"] as we iterate over url_data, but the next few code blocks come from my original answer and so use the old rows_sp variable name.

# example rows_sp 
[['year', 
    'RANK THIS WEEK', 
    'RANK LAST WEEK', 
    'PLAYER NAME', 
    'EVENTS', 
    'RATING', 
    'year', 
    'year', 
    'year', 
    'year'], 
['2017'], 
['2017', '1', '1', 'Sam Burns', '1', '9.2'], 
['2017', '2', '3', 'Rickie Fowler', '10', '8.8'], 
['2017', '2', '2', 'Dustin Johnson', '10', '8.8'], 
['2017', '2', '3', 'Whee Kim', '2', '8.8'], 
['2017', '2', '3', 'Thomas Pieters', '3', '8.8'], 
... 
] 

Putting rows_sp straight into a dataframe shows that the data isn't quite in the right format yet:

pd.DataFrame(rows_sp).head() 
     0    1    2    3  4  5  6 \ 
0 year RANK THIS WEEK RANK LAST WEEK  PLAYER NAME EVENTS RATING year 
1 2017   None   None   None None None None 
2 2017    1    1  Sam Burns  1  9.2 None 
3 2017    2    3 Rickie Fowler  10  8.8 None 
4 2017    2    2 Dustin Johnson  10  8.8 None 

     7  8  9 
0 year year year 
1 None None None 
2 None None None 
3 None None None 
4 None None None 

pd.DataFrame(rows_sp).dtypes 
0 object 
1 object 
2 object 
3 object 
4 object 
5 object 
6 object 
7 object 
8 object 
9 object 
dtype: object 

With a little cleaning, we can get rows_sp into a dataframe with proper numeric dtypes:

df = pd.DataFrame(rows_sp, columns=rows_sp[0]).drop(0) 
df.columns = ["year","RANK THIS WEEK","RANK LAST WEEK", 
       "PLAYER NAME","EVENTS","RATING", 
       "year1","year2","year3","year4"] 
df.drop(["year1","year2","year3","year4"], 1, inplace=True) 
df = df.loc[df["PLAYER NAME"].notnull()] 
df = df.loc[df.year != "year"] 
num_cols = ["RANK THIS WEEK","RANK LAST WEEK","EVENTS","RATING"] 
df[num_cols] = df[num_cols].apply(pd.to_numeric) 

df.head() 
    year RANK THIS WEEK RANK LAST WEEK  PLAYER NAME EVENTS RATING 
2 2017    1    1.0  Sam Burns  1  9.2 
3 2017    2    3.0 Rickie Fowler  10  8.8 
4 2017    2    2.0 Dustin Johnson  10  8.8 
5 2017    2    3.0  Whee Kim  2  8.8 
6 2017    2    3.0 Thomas Pieters  3  8.8 

Revised cleaning
Now that we have a collection of URL categories to contend with, each with a different set of fields to clean, the section above gets a little more complicated. If you only have a few pages, it may be feasible to just eyeball the fields for each category and store them manually, like this:

cols = {'stat.02568.html':{'columns':['year', 'RANK THIS WEEK', 'RANK LAST WEEK', 
             'PLAYER NAME', 'ROUNDS', 'AVERAGE', 
             'TOTAL SG:APP', 'MEASURED ROUNDS', 
             'year1', 'year2', 'year3', 'year4'], 
          'numeric':['RANK THIS WEEK', 'RANK LAST WEEK', 'ROUNDS', 
             'AVERAGE', 'TOTAL SG:APP', 'MEASURED ROUNDS',] 
          }, 
     'stat.111.html':{'columns':['year', 'RANK THIS WEEK', 'RANK LAST WEEK', 
            'PLAYER NAME', 'ROUNDS', '%', '# SAVES', '# BUNKERS', 
            'TOTAL O/U PAR', 'year1', 'year2', 'year3', 'year4'], 
         'numeric':['RANK THIS WEEK', 'RANK LAST WEEK', 'ROUNDS', 
            '%', '# SAVES', '# BUNKERS', 'TOTAL O/U PAR'] 
         }, 
     'stat.02356.html':{'columns':['year', 'RANK THIS WEEK', 'RANK LAST WEEK', 
             'PLAYER NAME', 'EVENTS', 'RATING', 
             'year1', 'year2', 'year3', 'year4'], 
          'numeric':['RANK THIS WEEK', 'RANK LAST WEEK', 
             'EVENTS', 'RATING'] 
          } 
     } 

Then you can loop over url_data again and store the results in a dfs collection:

dfs = {} 

for url in url_data: 
    page = url.split("/")[-1] 
    colnames = cols[page]["columns"] 
    num_cols = cols[page]["numeric"] 
    rows_sp = url_data[url]["rows_sp"] 
    df = pd.DataFrame(rows_sp, columns=rows_sp[0]).drop(0) 
    df.columns = colnames 
    df.drop(["year1","year2","year3","year4"], 1, inplace=True) 
    df = df.loc[df["PLAYER NAME"].notnull()] 
    df = df.loc[df.year != "year"] 
    # tied ranks (e.g. "T9") mess up to_numeric; remove the tie indicators. 
    df["RANK THIS WEEK"] = df["RANK THIS WEEK"].str.replace("T","") 
    df["RANK LAST WEEK"] = df["RANK LAST WEEK"].str.replace("T","") 
    df[num_cols] = df[num_cols].apply(pd.to_numeric) 
    dfs[url] = df 

At this point, we're ready to merge all the different data categories on year and PLAYER NAME. (You could actually do the merging iteratively inside the cleaning loop, but I've separated it out here for demonstration purposes.)

master = pd.DataFrame() 
for url in dfs: 
    if master.empty: 
     master = dfs[url] 
    else: 
     master = master.merge(dfs[url], on=['year','PLAYER NAME']) 
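Equivalently, the chained merge sketched in the comments above could be written with functools.reduce over the cleaned dataframes; a minimal sketch assuming the dfs dict built earlier:

from functools import reduce

# fold pd.merge across every cleaned dataframe in one expression
master = reduce(lambda left, right: left.merge(right, on=['year', 'PLAYER NAME']),
                dfs.values())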

Now master contains the merged data for each player-year. Here's a view into the data using groupby():

master.groupby(["PLAYER NAME", "year"]).first().head(4) 
        RANK THIS WEEK_x RANK LAST WEEK_x EVENTS RATING \ 
PLAYER NAME year              
Aam Hawin 2015    66    66.0  7  8.2 
      2016    80    80.0  12  8.1 
      2017    72    45.0  8  8.2 
Aam Scott 2013    45    45.0  10  8.2 

        RANK THIS WEEK_y RANK LAST WEEK_y ROUNDS_x AVERAGE \ 
PLAYER NAME year               
Aam Hawin 2015    136    136  95 -0.183 
      2016    122    122  93 -0.061 
      2017    56    52  84 0.296 
Aam Scott 2013    16    16  61 0.548 

        TOTAL SG:APP MEASURED ROUNDS RANK THIS WEEK \ 
PLAYER NAME year             
Aam Hawin 2015  -14.805    81    86 
      2016  -5.285    87    39 
      2017  18.067    61    8 
Aam Scott 2013  24.125    44    57 

        RANK LAST WEEK ROUNDS_y  % # SAVES # BUNKERS \ 
PLAYER NAME year               
Aam Hawin 2015    86  95 50.96  80  157 
      2016    39  93 54.78  86  157 
      2017    6  84 61.90  91  147 
Aam Scott 2013    57  61 53.85  49   91 

        TOTAL O/U PAR 
PLAYER NAME year     
Aam Hawin 2015   47.0 
      2016   43.0 
      2017   27.0 
Aam Scott 2013   11.0 

You may want to do a bit more cleaning of the merged columns, since some data is duplicated across categories (e.g. ROUNDS_x and ROUNDS_y). From what I can tell, the duplicated field names appear to contain exactly the same information, so you could just drop the _y version of each.
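One way to do that last bit of cleanup (a sketch, assuming the _y suffixes produced by the merges above):

# drop the duplicated _y columns, keeping the _x and unsuffixed versions
master = master[[c for c in master.columns if not c.endswith('_y')]]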


Thank you, this is awesome. I'm not aggregating the data across years; I want to take the data from all the other urls and add it to the master dataframe. –


You're welcome! Does this answer provide a sufficient solution to your original question? If so, please consider accepting it by clicking the checkmark to the left of the answer. If not, where are you stuck? –


Technically no, but it has answered another question of mine. Where I was stuck was building one big dataframe out of all the information in the data. I couldn't merge the data the way I wanted, which was to put each url's data into its own df and then merge on name and year, so that each player row contains all the information from every url in a single dataframe. –