2017-08-30

I'm trying to scrape Hockey-Reference for a Data Science 101 project, and I'm having trouble with one specific table. The page is: https://www.hockey-reference.com/boxscores/201611090BUF.html. The table I want is under "Advanced Stats Report (All Situations)". I have tried the following code, using rvest to scrape the HTML data:

library(rvest)

url <- "https://www.hockey-reference.com/boxscores/201611090BUF.html"
ret <- url %>%
    read_html() %>%
    html_nodes(xpath = '//*[contains(concat(" ", @class, " "), concat(" ", "right", " "))]') %>%
    html_text()

This code scrapes all of the data from the earlier tables, but stops before the advanced table. I also tried to be more granular with:

url <- "https://www.hockey-reference.com/boxscores/201611090BUF.html"
ret <- url %>%
    read_html() %>%
    html_nodes(xpath = '//*[@id = "OTT_adv"]//*[contains(concat(" ", @class, " "), concat(" ", "right", " "))]') %>%
    html_text()

which returns character(0). Any and all help would be appreciated. If it isn't obvious already, I'm fairly new to R. Thanks!

Answer


The information you are trying to scrape is hidden inside an HTML comment on the page. Here is a solution; it will need some work to clean up your final result:

library(rvest)

url <- "https://www.hockey-reference.com/boxscores/201611090BUF.html"

page <- read_html(url)  # parse the html

commentedNodes <- page %>%
    html_nodes('div.section_wrapper') %>%  # select the nodes that wrap comments
    html_nodes(xpath = 'comment()')        # select the comments within those nodes

# there are multiple (3) nodes containing comments;
# the 2nd was chosen via trial and error
output <- commentedNodes[2] %>%
    html_text() %>%          # return the comment's contents as text
    read_html() %>%          # parse that text as html
    html_nodes('table') %>%  # select the table nodes
    html_table()             # parse the tables and return data.frames

The output will be a list of 2 elements, one per table. Player names and stats are repeated several times, once for each available option, so you will need to clean up this data for your final purpose.
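As a starting point for that cleanup, here is a minimal sketch. It assumes `output` is the list of data.frames produced above, and that the repeated rows either duplicate the header text in the first column or are exact duplicates of earlier player rows; the column positions are assumptions, so adjust them to what the parsed table actually contains:

```r
library(dplyr)

# hypothetical cleanup of the first table in the list
adv <- output[[1]]

adv_clean <- adv %>%
    filter(.[[1]] != names(.)[1]) %>%  # drop rows that just repeat the header text
    distinct()                         # drop exact duplicate player rows
```

Inspect the result with `str(adv_clean)` or `head(adv_clean)` before going further; hockey-reference tables often have multi-row headers, so the column names produced by `html_table()` may also need renaming by hand.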