2017-06-05

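Scraping a table inside an iframe with the R rvest library

I really like R's libraries for scraping websites, but I'm struggling with something new. From this page - http://www.naia.org/ViewArticle.dbml?ATCLID=205323044 - I'm trying to scrape the main table of colleges.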

Here is what my code looks like right now:

library(rvest)

NAIA_url = "http://www.naia.org/ViewArticle.dbml?ATCLID=205323044"
NAIA_page = read_html(NAIA_url) 

tables = html_table(html_nodes(NAIA_page, 'table')) 
# tables is a length-2 list, but neither element is the table I want.

# grab the correct iframe node 
iframe = html_nodes(NAIA_page, "iframe")[3] 

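But I'm stuck past this point. (1) For some reason, calling html_nodes doesn't grab the table I want. (2) I'm not sure whether I should grab the iframe and then try to scrape the table from it.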

Any help is appreciated!

+1

You should get the source of the `iframe` and grab the table from there – yeedle

Answer

1

If the embedded iframe is HTML, you can grab the iframe's source URL, read that document, and get the table you want from there.
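To see why the outer page does not contain the table, here is a small offline sketch. The HTML string, the `base_url` placeholder and the iframe paths are made up for illustration and are not taken from the NAIA site. It shows that `html_nodes()` only sees tables in the outer document, and how the iframe `src` values could be turned into absolute URLs with `xml2::url_absolute()` in case they are relative.

```r
library(rvest)

# Hypothetical stand-in for the outer page. The table of interest lives in a
# separate document that the iframe points to, so it is not in this HTML.
outer_html <- '
<html><body>
  <table id="nav"><tr><td>menu</td></tr></table>
  <iframe src="/widgets/ad.html"></iframe>
  <iframe id="schools" src="/schools/list.html"></iframe>
</body></html>'

outer_page <- read_html(outer_html)

# Only the table in the outer document is found, nothing from the iframes.
n_tables <- length(html_nodes(outer_page, "table"))

# Collect every iframe src and resolve it against the page URL
# (base_url is a placeholder, not a real address from the question).
base_url <- "http://www.example.com/page.html"
iframe_src <- html_attr(html_nodes(outer_page, "iframe"), "src")
iframe_urls <- xml2::url_absolute(iframe_src, base_url)
iframe_urls
```

Printing `iframe_urls` on the real page is also a quick way to check which position holds the document you need before hard-coding an index.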


library(rvest) 
#> Loading required package: xml2 
library(magrittr) 
"http://www.naia.org/ViewArticle.dbml?ATCLID=205323044" %>% 
    read_html() %>% 
    html_nodes("iframe") %>% 
    extract(3) %>% 
    html_attr("src") %>% 
    read_html() %>% 
    html_node("#searchResultsTable") %>% 
    html_table() %>% 
    head() 
#>         College or University  City, State 
#> 1     Central Christian College ATHLETICS  McPherson, KS 
#> 2 +     Crowley's Ridge College ATHLETICS  Paragould, AR 
#> 3      Edward Waters College ATHLETICS Jacksonville, Fl 
#> 4     Fisher College ADMISSIONS | ATHLETICS  Boston, MA 
#> 5  Georgia Gwinnett College ADMISSIONS | ATHLETICS Lawrenceville, GA 
#> 6 Lincoln Christian University ADMISSIONS | ATHLETICS  Lincoln, IL 
#> Conference Enrollment 
#> 1  A.I.I.  259 
#> 2  A.I.I.   0 
#> 3  A.I.I.  805 
#> 4  A.I.I.  600 
#> 5  A.I.I.  9,720 
#> 6  A.I.I.  1,060 
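Note that Enrollment is printed with thousands separators, which suggests the column came back as text. If you need numbers, something like the following sketch should work. The vector below just copies the six values shown above, and `enrollment_chr` / `enrollment_num` are names I made up.

```r
# Values copied from the Enrollment column printed above.
enrollment_chr <- c("259", "0", "805", "600", "9,720", "1,060")

# Drop the thousands separators, then convert to numeric.
enrollment_num <- as.numeric(gsub(",", "", enrollment_chr, fixed = TRUE))
enrollment_num
```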
+0

Wonderful, thanks a ton – Canovice