2017-10-07

HTML table has two headers when using the rvest package

I am using the following code to scrape an HTML table of AFL player data:

library(rvest) 

website <-read_html("https://afltables.com/afl/stats/teams/adelaide/2017_gbg.html") 
table <- website %>% 
      html_nodes("table") %>% 
      .[(1)] %>% 
      html_table() 

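The resulting table is shown as 34 obs. of 27 variables, but nrow(table) and ncol(table) both return NULL. Is that because the data frame has two header rows? I would like to do calculations on individual columns, but the following gives an error: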

table[,1] 
# Error in table[, 1] : incorrect number of dimensions 

Why does this produce the error, and how can I fix it?

Answers


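First, unrelated to your problem: do not use table as the name of your object, as this name is already reserved for other functionality in R. This is considered bad practice, and I have been told that it will come back and bite you somewhere down the line.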

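On to the problem: you are struggling with the data type that html_table() gives you. Called on a set of nodes, it returns a list containing a regular data.frame. nrow() and ncol() return NULL for that list because a list has no dimensions; it simply has one element, the data.frame. The same is true of table[, 1], which fails because a list cannot be indexed by row and column. By selecting the first (and only) element of the list, you get the data frame you are actually interested in. This data frame has 27 columns and 34 rows.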

library(rvest)

website <- read_html("https://afltables.com/afl/stats/teams/adelaide/2017_gbg.html")
scraped <- website %>% 
       html_nodes("table") %>% 
       .[(1)] %>% 
       html_table() %>% 
       `[[`(1) # Select the first element of the list, like scraped[[1]] 
ncol(scraped) 
# 27 
nrow(scraped) 
# 34 
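
The list-versus-data-frame distinction can also be checked without scraping anything. The following is only a minimal sketch in base R: the toy object scraped_list and its values are made up for illustration and merely stand in for what html_table() returns for a set of nodes.

```r
# Hypothetical stand-in for the scraped result: a list holding one data frame
scraped_list <- list(
  data.frame(Player = c("Atkins, Rory", "Betts, Eddie"), R1 = c(19, 18))
)

# A list has no dim attribute, so nrow() and ncol() return NULL
list_rows <- nrow(scraped_list)
list_cols <- ncol(scraped_list)

# Matrix-style indexing on a list fails; capture the error instead of stopping
err <- tryCatch(scraped_list[, 1], error = function(e) e)

# Single brackets keep the list wrapper, double brackets extract the element
still_a_list <- scraped_list[1]
first_table  <- scraped_list[[1]]

nrow(first_table)
ncol(first_table)
first_table[, 1]
```

Single brackets return a shorter list, which is why the .[(1)] step in the original pipeline is not enough on its own.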

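The first bit is misinformation and should be removed or clarified. R is smart enough to still be able to call table() even if there is a table variable that is not a function (e.g. 'table <- c(2,1,2,5,2,3,1); table(table)'). It does not get "overwritten". In general it is still bad practice and not a good idea, but not for the reason you gave. – hrbrmstr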


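Thanks for calling that out, @hrbrmstr. Out of interest, do you happen to have a source that explains how and why R is smart enough to tell the difference? If R really is that smart, then the main (only?) reason to avoid names like 'list', 'data' and 'c' seems to be to keep the programmer from getting confused, since R appears to handle it just fine.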
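
The behaviour discussed in these comments can be reproduced with a short sketch. When a name is used in function-call position, R looks for an object of mode function and skips non-function objects with the same name, which is why the call below still reaches the base function. The vector values are arbitrary and chosen only for illustration.

```r
# A numeric vector that shadows the name of the base function table()
table <- c(2, 1, 2, 5, 2, 3)

# In call position R searches for a *function* named table,
# so the numeric vector above is skipped and base::table() is used
counts <- table(table)

# Used as a value, the same name refers to the vector
vector_length <- length(table)

# Both objects are visible under the same name
has_function <- exists("table", mode = "function")
has_vector   <- exists("table", mode = "numeric")

# Clean up so later code is not confusing to read
rm(table)
```

So the name is not overwritten. The cost is mostly readability, since table(table) is hard to follow at a glance.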



library(rvest) 
#> Loading required package: xml2 

website <-read_html("https://afltables.com/afl/stats/teams/adelaide/2017_gbg.html") 

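On this website there are several tables, one for each of the links shown above the printed table on the main page. Using html_table() on the result of html_nodes("table") gets you all the tables at once, in a list.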

all_tables <- website %>% 
    html_nodes("table") %>% 
    html_table() 

str(all_tables, 1) 
#> List of 23 
#> $ :'data.frame': 34 obs. of 27 variables: 
#> $ :'data.frame': 34 obs. of 27 variables: 
#> $ :'data.frame': 34 obs. of 27 variables: 
#> $ :'data.frame': 34 obs. of 27 variables: 
#> $ :'data.frame': 34 obs. of 27 variables: 
#> $ :'data.frame': 34 obs. of 27 variables: 
#> $ :'data.frame': 34 obs. of 27 variables: 
#> $ :'data.frame': 34 obs. of 27 variables: 
#> $ :'data.frame': 34 obs. of 27 variables: 
#> $ :'data.frame': 34 obs. of 27 variables: 
#> $ :'data.frame': 34 obs. of 27 variables: 
#> $ :'data.frame': 34 obs. of 27 variables: 
#> $ :'data.frame': 34 obs. of 27 variables: 
#> $ :'data.frame': 34 obs. of 27 variables: 
#> $ :'data.frame': 34 obs. of 27 variables: 
#> $ :'data.frame': 34 obs. of 27 variables: 
#> $ :'data.frame': 34 obs. of 27 variables: 
#> $ :'data.frame': 34 obs. of 27 variables: 
#> $ :'data.frame': 34 obs. of 27 variables: 
#> $ :'data.frame': 34 obs. of 27 variables: 
#> $ :'data.frame': 34 obs. of 27 variables: 
#> $ :'data.frame': 34 obs. of 27 variables: 
#> $ :'data.frame': 34 obs. of 27 variables: 

You can then select the table you want, but the header is still not right:

head(all_tables[[1]]) 
#>   Disposals Disposals Disposals Disposals Disposals Disposals 
#> 1   Player  R1  R2  R3  R4  R5 
#> 2  Atkins, Rory  19  19  19  23  29 
#> 3 Beech, Jonathon             
#> 4  Betts, Eddie  18  13  16  22  12 
#> 5  Brown, Luke  18  12  13   9  15 
#> 6 Cameron, Charlie  23  17  16  16  13 
#> Disposals Disposals Disposals Disposals Disposals Disposals Disposals 
#> 1  R6  R7  R8  R9  R10  R11  R12 
#> 2  23  20  21  28  37  14  25 
#> 3                 15 
#> 4  16  13   9  16  14  12  11 
#> 5  17  13  20  25  16  12   
#> 6  13  14  10  18  13   8  13 
#> Disposals Disposals Disposals Disposals Disposals Disposals Disposals 
#> 1  R14  R15  R16  R17  R18  R19  R20 
#> 2  28  15  23  18  19  16  16 
#> 3  12  11             
#> 4  14  11  13  16   8     16 
#> 5  10  15  14  17  11  10  20 
#> 6  15     10  20   6   9  17 
#> Disposals Disposals Disposals Disposals Disposals Disposals Disposals 
#> 1  R21  R22  R23  QF  PF  GF  Tot 
#> 2  27  21  21  16  22  17  536 
#> 3                 38 
#> 4   7  16  12  13  13   7  318 
#> 5  17  17   9  20  10  13  353 
#> 6  13  10  10  15  19  16  334 

With some list and table manipulation using purrr and dplyr, you can clean up the tables, each of which has two header rows:

all_tables <- website %>% 
    html_nodes("table") %>% 
    # do not let rvest handle the header automatically 
    html_table(header = FALSE) 

library(purrr) 
#> 
#> Attaching package: 'purrr' 
#> The following object is masked from 'package:rvest': 
#> 
#>  pluck 
library(dplyr) # provides slice() 
all_tables_res <- all_tables %>% 
    # get the first column, first row to set the name for the list elements 
    # pluck is a purrr function acting like x[[1]][1, 1] here 
    lmap(~ set_names(.x, nm = pluck(.x, 1, 1, 1))) %>% 
    # for each table, set the second row as the header 
    # and delete the first and second rows 
    map(~ set_names(.x, nm = unlist(.x[2, ], use.names = FALSE)) %>% slice(-c(1, 2))) 
str(all_tables_res, 1) 
#> List of 23 
#> $ Disposals    :Classes 'tbl_df', 'tbl' and 'data.frame': 33 obs. of 27 variables: 
#> $ Kicks     :Classes 'tbl_df', 'tbl' and 'data.frame': 33 obs. of 27 variables: 
#> $ Marks     :Classes 'tbl_df', 'tbl' and 'data.frame': 33 obs. of 27 variables: 
#> $ Handballs    :Classes 'tbl_df', 'tbl' and 'data.frame': 33 obs. of 27 variables: 
#> $ Goals     :Classes 'tbl_df', 'tbl' and 'data.frame': 33 obs. of 27 variables: 
#> $ Behinds    :Classes 'tbl_df', 'tbl' and 'data.frame': 33 obs. of 27 variables: 
#> $ Hit Outs    :Classes 'tbl_df', 'tbl' and 'data.frame': 33 obs. of 27 variables: 
#> $ Tackles    :Classes 'tbl_df', 'tbl' and 'data.frame': 33 obs. of 27 variables: 
#> $ Rebounds    :Classes 'tbl_df', 'tbl' and 'data.frame': 33 obs. of 27 variables: 
#> $ Inside 50s    :Classes 'tbl_df', 'tbl' and 'data.frame': 33 obs. of 27 variables: 
#> $ Clearances    :Classes 'tbl_df', 'tbl' and 'data.frame': 33 obs. of 27 variables: 
#> $ Clangers    :Classes 'tbl_df', 'tbl' and 'data.frame': 33 obs. of 27 variables: 
#> $ Frees     :Classes 'tbl_df', 'tbl' and 'data.frame': 33 obs. of 27 variables: 
#> $ Frees Against   :Classes 'tbl_df', 'tbl' and 'data.frame': 33 obs. of 27 variables: 
#> $ Brownlow Votes   :Classes 'tbl_df', 'tbl' and 'data.frame': 33 obs. of 27 variables: 
#> $ Contested Possessions :Classes 'tbl_df', 'tbl' and 'data.frame': 33 obs. of 27 variables: 
#> $ Uncontested Possessions:Classes 'tbl_df', 'tbl' and 'data.frame': 33 obs. of 27 variables: 
#> $ Contested Marks  :Classes 'tbl_df', 'tbl' and 'data.frame': 33 obs. of 27 variables: 
#> $ Marks Inside 50  :Classes 'tbl_df', 'tbl' and 'data.frame': 33 obs. of 27 variables: 
#> $ One Percenters   :Classes 'tbl_df', 'tbl' and 'data.frame': 33 obs. of 27 variables: 
#> $ Bounces    :Classes 'tbl_df', 'tbl' and 'data.frame': 33 obs. of 27 variables: 
#> $ Goal Assists   :Classes 'tbl_df', 'tbl' and 'data.frame': 33 obs. of 27 variables: 
#> $ % Played    :Classes 'tbl_df', 'tbl' and 'data.frame': 33 obs. of 27 variables: 
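
For comparison, the same two-header clean-up can be sketched in base R on a single table. This is only an illustration under stated assumptions: the toy data frame raw is made up and stands in for one element of the list returned by html_table(header = FALSE), and the names raw, table_name and cleaned are hypothetical.

```r
# Hypothetical stand-in for one scraped table read with header = FALSE:
# row 1 repeats the table title, row 2 holds the real column names
raw <- data.frame(
  X1 = c("Goals", "Player", "Atkins, Rory", "Betts, Eddie"),
  X2 = c("Goals", "R1", "3", "4"),
  X3 = c("Goals", "R2", "1", "3"),
  stringsAsFactors = FALSE
)

# The first cell gives the table title, usable as the list element name
table_name <- raw[1, 1]

# Promote the second row to column names, then drop both header rows
header  <- unlist(raw[2, ], use.names = FALSE)
cleaned <- raw[-c(1, 2), , drop = FALSE]
names(cleaned) <- header
rownames(cleaned) <- NULL

cleaned
```

Wrapping these steps in a function and applying it with lapply() over the whole list would give a result comparable to the purrr pipeline, without the extra packages.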

You can now call up any table from the website by name. 

head(all_tables_res$Goals) 
#> # A tibble: 6 x 27 
#>    Player R1 R2 R3 R4 R5 R6 R7 R8 R9 
#>    <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> 
#> 1  Atkins, Rory  3  1  -  2  1  -  1  1  - 
#> 2 Beech, Jonathon              
#> 3  Betts, Eddie  4  3  3  6  3  1  3  2  3 
#> 4  Brown, Luke  -  1  -  -  1  -  -  -  - 
#> 5 Cameron, Charlie  2  1  -  1  2  2  2  -  4 
#> 6  Crouch, Brad        -  -  -  -  1 
#> # ... with 17 more variables: R10 <chr>, R11 <chr>, R12 <chr>, R14 <chr>, 
#> # R15 <chr>, R16 <chr>, R17 <chr>, R18 <chr>, R19 <chr>, R20 <chr>, 
#> # R21 <chr>, R22 <chr>, R23 <chr>, QF <chr>, PF <chr>, GF <chr>, 
#> # Tot <chr>
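
Every column in these tables is still character, so calculations on individual columns need a conversion first. Below is a minimal sketch on made-up data. It assumes that "-" means a recorded zero and that a blank cell means the player did not play; both are guesses about the site's conventions and should be checked against it. The helper to_count and the Total column are hypothetical names introduced for illustration.

```r
# Hypothetical stand-in for a cleaned table such as all_tables_res$Goals
goals <- data.frame(
  Player = c("Atkins, Rory", "Beech, Jonathon", "Betts, Eddie"),
  R1 = c("3", "", "4"),
  R2 = c("1", "", "3"),
  stringsAsFactors = FALSE
)

# Convert one character column to numeric
to_count <- function(x) {
  x <- trimws(x)
  x[x == "-"] <- "0"  # assumption: "-" is a recorded zero
  x[x == ""]  <- NA   # assumption: blank means the player did not play
  as.numeric(x)
}

# Convert every column except the player name
round_cols <- setdiff(names(goals), "Player")
goals[round_cols] <- lapply(goals[round_cols], to_count)

# Column-wise calculations now work, for example a per-player total
goals$Total <- rowSums(goals[round_cols], na.rm = TRUE)
goals
```

The real tables already contain a Tot column, so a total computed this way would mainly serve as a cross-check.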