readHTMLTable in R - skipping NULL values

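I am trying to use the R function readHTMLTable to collect data from the online database at www.racingpost.com. I have a CSV file containing 30,000 unique IDs that identify individual horses. Unfortunately, a small number of these IDs cause readHTMLTable to return an error:

Error in (function (classes, fdef, mtable) : unable to find an inherited method for function ‘readHTMLTable’ for signature ‘"NULL"’

My question is: is it possible to set up a wrapper function that skips the IDs that return NULL values but carries on reading the remaining HTML tables? Reading stops at each NULL value.

What I have tried so far is this: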

ids = c(896119, 766254, 790946, 556341, 62736, 660506, 486791, 580134, 0011, 580134)

These are all valid horse IDs except 0011, which will return a NULL value. Then:

scrapescrape <- function(x) {  
    link <- paste0("http://www.racingpost.com/horses/horse_home.sd?horse_id=",x)  
    if (!is.null(readHTMLTable(link, which=2))) { 
    Frame1 <- readHTMLTable(link, which=2) 
    } 
} 

total_data = c(0) 
for (id in ids) { 
    total_data = rbind(total_data, scrapescrape(id)) 
}

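However, I think the error is raised inside the if statement, which means the function stops when it reaches the first NULL value. Any help would be greatly appreciated, many thanks.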

Source

2017-02-16 Robertlemoko

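Before reading the HTML table, you could parse the HTML first (inspect the page you get back and find a way to identify a bad result).

But you can also make sure the function returns a placeholder (NA) whenever an error is thrown, like this: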
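A minimal sketch of that "check first" idea, assuming a bad page can be recognised by having fewer tables than expected (that is a guess about the site, not something confirmed here). The helper name read_table_if_present and the inline HTML are made up for illustration so the example runs without network access. For a real page you would parse the URL instead of a string.

```r
library(XML)

# Hypothetical helper: return table number `which` from a parsed page,
# or NULL when the page contains fewer tables than that.
read_table_if_present <- function(doc, which = 2) {
  tables <- getNodeSet(doc, "//table")
  if (length(tables) < which) {
    return(NULL)
  }
  readHTMLTable(tables[[which]])
}

# Made-up pages standing in for a valid and an invalid horse ID
good_page <- htmlParse(
  "<html><body>
     <table><tr><th>a</th></tr><tr><td>1</td></tr></table>
     <table><tr><th>name</th><th>runs</th></tr><tr><td>Dobbin</td><td>3</td></tr></table>
   </body></html>",
  asText = TRUE
)
bad_page <- htmlParse("<html><body><p>No such horse</p></body></html>", asText = TRUE)

good_result <- read_table_if_present(good_page)
bad_result <- read_table_if_present(bad_page)
```

Here bad_result is NULL rather than an error, so a loop over IDs can test for it and move on.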

library(XML) 

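scrapescrape <- function(x) {
    # Build the page URL for this horse ID
    link <- paste0("http://www.racingpost.com/horses/horse_home.sd?horse_id=", x)
    # Return the second table, or NA if reading it throws an error
    tryCatch(readHTMLTable(link, which = 2), error = function(e) NA)
}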

ids <- c(896119, 766254, 790946, 556341, 62736, 660506, 486791, 580134, 0011, 580134) 

lst <- lapply(ids, scrapescrape) 

str(lst)
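If you then want a single data frame, as in the question's rbind loop, one possible follow-up is to drop the NA placeholders and stack what is left. This is only a sketch: fake_scrape is a hypothetical stand-in for scrapescrape so that it runs offline, and it assumes every successful result has the same columns.

```r
# Hypothetical stub in place of scrapescrape(): fails for one ID, and the
# tryCatch() turns that failure into NA, as in the function above.
fake_scrape <- function(id) {
  tryCatch({
    if (id == 11) stop("no table for this id")
    data.frame(horse_id = id, runs = 1L)
  }, error = function(e) NA)
}

# Note: R reads the literal 0011 as the number 11
ids <- c(896119, 766254, 0011, 580134)
lst <- lapply(ids, fake_scrape)

# Keep only the elements that are data frames, then stack them row-wise
ok <- Filter(is.data.frame, lst)
total_data <- do.call(rbind, ok)
```

Building the list first and calling rbind once also avoids growing total_data inside a loop, which gets slow with 30,000 IDs.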

Source

2017-02-16 09:46:08 Wietze314

Using rvest you can do it like this:

require(rvest) 
require(purrr) 
paste0("http://www.racingpost.com/horses/horse_home.sd?horse_id=", ids) %>% 
    map(possibly(~html_session(.) %>% 
       read_html %>% 
       html_table(fill = TRUE) %>% 
       .[[2]], 
       NULL)) %>% 
    discard(is.null)

The last line discards all the "failed" attempts. If you want to keep them, drop the last line.

Source
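To see the possibly/discard pattern on its own, here is a small offline sketch that uses log as a stand-in for the scraping step (the inputs are made up for illustration):

```r
library(purrr)

# possibly() wraps a function so that an error yields `otherwise` instead
safe_log <- possibly(log, otherwise = NULL)

inputs <- list(1, "not a number", 100)

# log() fails on the string, so the second result is NULL
results <- map(inputs, safe_log)

# Drop the failed attempt; skip this step to keep the NULL placeholders
kept <- discard(results, is.null)
```

Keeping the NULLs can be useful if you need to know which positions (and so which IDs) failed.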

2017-02-16 10:03:35 Rentrop
