2016-03-28

R: Scraping multiple URLs with a pipe chain in rvest

I have a character vector containing multiple URLs. I want to download the content from each of these URLs.

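To avoid writing out hundreds of commands, I would like to automate the process with a loop using lapply.

However, my command returns an error. Is it possible to scrape from multiple URLs?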

Current approach

Long method: works, but I would like to automate it

urls <-c("https://en.wikipedia.org/wiki/Belarus","https://en.wikipedia.org/wiki/Russia","https://en.wikipedia.org/wiki/England") 

library(rvest) 
library(httr) # required for user_agent command 

uastring <- "Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2228.0 Safari/537.36" 
session <- html_session("https://en.wikipedia.org/wiki/Main_Page", user_agent(uastring)) 
session2<-jump_to(session, "https://en.wikipedia.org/wiki/Belarus") 
session3<-jump_to(session, "https://en.wikipedia.org/wiki/Russia") 
writeBin(session2$response$content, "test1.txt") 
writeBin(session3$response$content, "test2.txt") 

Automated/looped: does not work.

urls <-c("https://en.wikipedia.org/wiki/Belarus","https://en.wikipedia.org/wiki/Russia","https://en.wikipedia.org/wiki/England") 

library(rvest) 
library(httr) # required for user_agent command 

uastring <- "Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2228.0 Safari/537.36" 
session <- html_session("https://en.wikipedia.org/wiki/Main_Page", user_agent(uastring)) 
lapply(urls, .%>% jump_to(session)) 
Error: is.session(x) is not TRUE 

Summary

I would like to automate the following two steps, jump_to() and writeBin(), as in the code below:

session2<-jump_to(session, "https://en.wikipedia.org/wiki/Belarus") 
session3<-jump_to(session, "https://en.wikipedia.org/wiki/Russia") 
writeBin(session2$response$content, "test1.txt") 
writeBin(session3$response$content, "test2.txt") 

Answer


You could do something like this:

urls <-c("https://en.wikipedia.org/wiki/Belarus","https://en.wikipedia.org/wiki/Russia","https://en.wikipedia.org/wiki/England") 
require(httr) 
require(rvest) 
uastring <- "Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2228.0 Safari/537.36" 
session <- html_session("https://en.wikipedia.org/wiki/Main_Page", user_agent(uastring)) 

outfile <- sprintf("%s.html", sub(".*/", "", urls)) 

jump_and_write <- function(x, url, out_file){ 
    tmp = jump_to(x, url) 
    writeBin(tmp$response$content, out_file) 
} 

for(i in seq_along(urls)){ 
    jump_and_write(session, urls[i], outfile[i]) 
} 
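The output file names in the code above come from the last path segment of each URL. The sketch below isolates just that step so it can be checked without any network access; the intermediate name `page_names` is only introduced here for illustration.

```r
urls <- c("https://en.wikipedia.org/wiki/Belarus",
          "https://en.wikipedia.org/wiki/Russia",
          "https://en.wikipedia.org/wiki/England")

# The greedy pattern ".*/" matches everything up to and including the
# last slash, so sub() leaves only the final path segment of each URL.
page_names <- sub(".*/", "", urls)

# sprintf() is vectorised: one file name is produced per URL.
outfile <- sprintf("%s.html", page_names)

print(outfile)
# [1] "Belarus.html" "Russia.html"  "England.html"
```

Because `outfile` has the same length and order as `urls`, indexing both with the same `i` in the loop pairs each URL with its own file.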

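Can you explain why the original approach using `lapply()` does not work? My understanding is that it loops a function over a list, which is much the same as what a `for()` loop does.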


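You are passing the arguments in the wrong order: `lapply(urls, . %>% jump_to(session))` calls `jump_to(url, session)`, but `jump_to` expects `jump_to(session, url)`. You can fix this by using `lapply(urls, . %>% jump_to(session, .))`. Have a look at ?magrittr::`%>%` (the part about the `.` placeholder). – Rentrop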
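The argument-order point can be checked without touching the network. The sketch below uses `fake_jump_to()`, a made-up stand-in that is not part of rvest and only reports which value landed in which argument; the string "main" stands in for the session object.

```r
library(magrittr)

# Hypothetical stand-in for jump_to(): session first, then URL.
fake_jump_to <- function(session, url) {
  paste0("session=", session, " url=", url)
}

urls <- c("https://en.wikipedia.org/wiki/Belarus",
          "https://en.wikipedia.org/wiki/Russia")

# Without a placeholder, the piped value becomes the FIRST argument,
# so each URL ends up in the session slot and "main" in the url slot.
wrong <- lapply(urls, . %>% fake_jump_to("main"))

# With an explicit dot, the piped value goes exactly where the dot is.
right <- lapply(urls, . %>% fake_jump_to("main", .))

print(wrong[[1]])
print(right[[1]])
```

With the real `jump_to()`, the first form hands a URL string to the session argument, which is what triggers `is.session(x) is not TRUE`.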


Thanks. Is it possible to include the final `writeBin()` command as well?
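One possible way to fold `writeBin()` into the same call is to iterate over the URLs and file names in parallel with `Map()`. This is only a sketch: `jump_and_write_all()` is a name made up here, and it assumes `session`, `urls` and `outfile` exist as in the answer above. Note also that rvest 1.0.0 and later rename `html_session()` to `session()` and `jump_to()` to `session_jump_to()`, so the call inside may need adjusting on a newer install.

```r
# Sketch: fetch each URL through an existing rvest session and write
# the raw response body to the matching output file.
jump_and_write_all <- function(session, urls, outfiles) {
  # Each URL needs exactly one output file.
  stopifnot(length(urls) == length(outfiles))
  invisible(Map(function(url, out_file) {
    tmp <- rvest::jump_to(session, url)       # navigate within the session
    writeBin(tmp$response$content, out_file)  # raw bytes of the response
    out_file                                  # return the path written
  }, urls, outfiles))
}

# Usage (requires network access and the objects from the answer above):
# jump_and_write_all(session, urls, outfile)
```

`Map()` is used rather than `lapply()` because two vectors have to be walked in step; `lapply(seq_along(urls), ...)` with indexing would work as well.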