2017-08-30

I'm trying to scrape Hockey-Reference for a Data Science 101 project, and I'm having trouble with one specific table. The page is: https://www.hockey-reference.com/boxscores/201611090BUF.html. The table I want is under "Advanced Stats Report (All Situations)". I have tried the following code, using rvest to scrape the HTML data:

library(rvest)

url <- "https://www.hockey-reference.com/boxscores/201611090BUF.html"
ret <- url %>%
    read_html() %>%
    html_nodes(xpath = '//*[contains(concat(" ", @class, " "), concat(" ", "right", " "))]') %>%
    html_text()

This code scrapes all of the data from the earlier tables, but stops before the advanced table. I also tried to be more granular with:

url <- "https://www.hockey-reference.com/boxscores/201611090BUF.html"
ret <- url %>%
    read_html() %>%
    html_nodes(xpath = '//*[@id = "OTT_adv"]//*[contains(concat(" ", @class, " "), concat(" ", "right", " "))]') %>%
    html_text()

which returns character(0). Any and all help would be appreciated. If it isn't obvious already, I'm fairly new to R. Thanks!

Answer


The information you are trying to scrape is hidden inside an HTML comment on the page. Here is a solution; it will need some work to clean up your final result:

library(rvest)

url <- "https://www.hockey-reference.com/boxscores/201611090BUF.html"

page <- read_html(url)  # parse the html

commentedNodes <- page %>%
    html_nodes('div.section_wrapper') %>%  # select the nodes that wrap comments
    html_nodes(xpath = 'comment()')        # select the comments within those nodes

# there are multiple (3) nodes containing comments;
# the 2nd was chosen via trial and error
output <- commentedNodes[2] %>%
    html_text() %>%          # return the comment's contents as text
    read_html() %>%          # parse that text as html
    html_nodes('table') %>%  # select the table nodes
    html_table()             # parse the tables and return data.frames

The output will be a list of 2 elements, one per table. Player names and stats are repeated several times, once for each available option, so you will need to clean up this data for your final purpose.
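As a starting point for that cleanup, here is a minimal sketch. It assumes `output` is the list of data.frames produced above, and that the repeated rows either duplicate the header text in the first column or are exact duplicates of earlier player rows; the column positions are assumptions, so adjust them to what the parsed table actually contains:

```r
library(dplyr)

# hypothetical cleanup of the first table in the list
adv <- output[[1]]

adv_clean <- adv %>%
    filter(.[[1]] != names(.)[1]) %>%  # drop rows that just repeat the header text
    distinct()                         # drop exact duplicate player rows
```

Inspect the result with `str(adv_clean)` or `head(adv_clean)` before going further; hockey-reference tables often have multi-row headers, so the column names produced by `html_table()` may also need renaming by hand.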