Using tryCatch with web scraping

I am scraping web pages and want to handle errors with tryCatch. I have two links:

> primary_link 
[1] "https://en.wikipedia.org/wiki/Kahaani_(2012_film)" 
> secondary_link 
[1] "https://en.wikipedia.org/wiki/Kahaani"

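For the primary link I get an error:

read_html(primary_link) Error in open.connection(x, "rb") : HTTP error 404.

The secondary link, however, reads without any problem.

With tryCatch I am trying to say: if the primary link raises an error, try the secondary link instead.

This is the error handler I have tried so far: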

web_page <- tryCatch(read_html(primary_link),finally = read_html(secondary_link))

Any help would be appreciated.

2016-09-27 Rajarshi Bhadra

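If you want to go this route, I think the appropriate pattern is a second tryCatch call inside the error handler, so that it runs only when the first link fails: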

web_page <- tryCatch({
    read_html(primary_link)
}, error = function(e) {
    tryCatch({
        read_html(secondary_link)
    }, finally = {
        # cleanup for second call
    })
}, finally = {
    # cleanup for both calls
})
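As a side note on why the attempt in the question does not work: the `finally` expression is evaluated only for its side effects and its value is discarded, so the error from the primary link still propagates. The fallback has to live in the `error` handler. A minimal sketch of that idea as a reusable function follows. The helper name `read_with_fallback`, its `reader` parameter, and `fake_reader` are hypothetical and not part of the original answer; `reader` exists only so the logic can be tried without network access.

```r
# Hypothetical helper (not from the original answer): try `primary` first,
# and fall back to `secondary` only if reading `primary` signals an error.
# `reader` defaults to xml2::read_html; it is a parameter so the fallback
# logic can be exercised offline.
read_with_fallback <- function(primary, secondary, reader = xml2::read_html) {
  tryCatch(
    reader(primary),
    error = function(e) {
      message("Primary link failed: ", conditionMessage(e))
      reader(secondary)  # an error here propagates to the caller
    }
  )
}

# Offline illustration with a stand-in reader that fails for one input
fake_reader <- function(url) {
  if (grepl("2012_film", url, fixed = TRUE)) stop("HTTP error 404.")
  paste("page from", url)
}

result <- read_with_fallback(
  "https://en.wikipedia.org/wiki/Kahaani_(2012_film)",
  "https://en.wikipedia.org/wiki/Kahaani",
  reader = fake_reader
)
```

In real use you would presumably drop the `reader` argument and call `read_with_fallback(primary_link, secondary_link)`.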

2016-09-27 09:59:03

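You can also use the http_error function from the httr package to check whether a page is accessible. It returns TRUE when the request comes back with an HTTP error (a status code of 400 or above).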

primary_link <- "https://en.wikipedia.org/wiki/Kahaani_(2012_film)" 
secondary_link <- "https://en.wikipedia.org/wiki/Kahaani" 

library(httr) 
urls <- c(primary_link, secondary_link) 

sapply(urls, http_error, config(followlocation = 0L), USE.NAMES = FALSE)
###[1] TRUE FALSE
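Building on that check, one possible way to pick the link to scrape is sketched below. The helper `first_reachable`, its `is_error` parameter, and `fake_check` are hypothetical names introduced for illustration; `is_error` defaults to `httr::http_error` and is a parameter only so the selection logic can be tried offline.

```r
# Hypothetical helper (not from the original answer): return the first URL
# whose check does not report an error, or NA if none is reachable.
first_reachable <- function(urls, is_error = httr::http_error) {
  for (u in urls) {
    # Treat a failed connection (an R error) the same as an HTTP error
    failed <- tryCatch(isTRUE(is_error(u)), error = function(e) TRUE)
    if (!failed) return(u)
  }
  NA_character_
}

# Offline illustration: pretend only the "(2012_film)" URL returns 404
fake_check <- function(url) grepl("2012_film", url, fixed = TRUE)

chosen <- first_reachable(
  c("https://en.wikipedia.org/wiki/Kahaani_(2012_film)",
    "https://en.wikipedia.org/wiki/Kahaani"),
  is_error = fake_check
)
```

With the default check, the chosen URL could then be passed to `read_html`, after first testing it for NA.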

2016-09-27 10:01:43 GyD

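tryCatch() can make for some convoluted code, and there is now an alternative in the purrr package. Also, since you will undoubtedly use this code more than once, you should wrap it in a function: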

library(purrr) 
library(httr) 

primary_link <- "https://en.wikipedia.org/wiki/Kahaani_(2012_film)" 
secondary_link <- "https://en.wikipedia.org/wiki/Kahaani" 

GET_alt <- function(url_1, url_2, .verbose=TRUE) { 

    # this wraps httr::GET in exception handling code in the 
    # event the site is completely inaccessible and not just 
    # issuing 40x errors 

    sGET <- purrr::safely(GET) 

    res <- sGET(url_1) 

    # Now, check for whether it had a severe error or just 
    # didn't retrieve the content successfully and fetch 
    # the alternate URL if so 

    # `||` short-circuits, so status_code() is never called on a NULL result
    if (is.null(res$result) || (status_code(res$result) != 200)) {
        if (.verbose) message("Using alternate URL")
        res <- sGET(url_2)
    }

    # I'd do other error handling here besides just issue a 
    # warning, but I have no idea what you're doing so we'll 
    # just issue a warning 

    if (!is.null(res$result)) {
        warn_for_status(res$result)
    }

    return(res$result) 

} 

GET_alt(primary_link, secondary_link)
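For readers who have not used `purrr::safely()` before, here is a small sketch of what it returns, using `log` as a stand-in function rather than `GET` so that it runs without network access. The names `safe_log`, `ok`, and `bad` are illustrative only.

```r
library(purrr)

# safely() wraps a function so that it never throws. The wrapped function
# returns a list with two elements: `result` and `error`.
safe_log <- safely(log)

ok  <- safe_log(10)    # success: ok$result holds the value, ok$error is NULL
bad <- safe_log("a")   # failure: bad$result is NULL, bad$error is a condition

is.null(ok$error)
is.null(bad$result)
conditionMessage(bad$error)
```

This is why `GET_alt` above checks `is.null(res$result)` before looking at the status code: a NULL result means the request itself failed, not merely that the server returned an error status.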

2016-09-27 10:53:02 hrbrmstr
