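Programmatically scraping a response header within R
I'm trying to access the highlighted "Response Headers: Location" text in the screenshot below, using only R and its curl-based web scraping libraries. This point is easy to reach in any web browser: visit http://www.worldvaluessurvey.org/WVSDocumentationWVL.jsp, click the download for any data file, and fill out the agreement form. The download then starts automatically in the browser.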
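I believe the only way to obtain valid cookies is with library(curlconverter) (see "How to download a file behind a semi-broken javascript asp function with R"), but the answer to that question does not seem sufficient to determine the file's HTTP URL programmatically. It only downloads the zipped file once the URL is already known.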
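For comparison, session cookies can in principle be collected with httr alone instead of being copied out of a browser. The sketch below is untested against this site. `start_url` is a placeholder, and `cookies_to_vector()` and `fetch_session_cookies()` are hypothetical helper names, not part of httr.

```r
# Requires the httr package for fetch_session_cookies(); calls are namespaced.

# Hypothetical helper: turn a cookie table (columns `name` and `value`, as
# returned by httr::cookies()) into a named character vector.
cookies_to_vector <- function(cookie_df) {
  if (is.null(cookie_df) || nrow(cookie_df) == 0) return(character(0))
  stats::setNames(as.character(cookie_df$value), as.character(cookie_df$name))
}

# Visit a page and return whatever cookies the server set on the response.
# `start_url` is a placeholder; which page hands out the session cookie is
# an assumption that would need to be verified.
fetch_session_cookies <- function(start_url) {
  resp <- httr::GET(start_url)
  cookies_to_vector(httr::cookies(resp))
}
```

The resulting vector could then be passed to a later request with `httr::set_cookies(.cookies = ...)`, assuming the server accepts cookies obtained this way.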
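I've pasted some code below showing the different httr and curlconverter calls I've been playing around with, but I'm missing something here. Again, the only goal is to determine the highlighted text programmatically, entirely within R (cross-platform).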
library(curlconverter)
library(httr)
browserPOST <-
"curl 'http://www.worldvaluessurvey.org/AJDownload.jsp'
-H 'Accept:text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8'
-H 'Accept-Encoding:gzip, deflate'
-H 'Accept-Language:en-US,en;q=0.8'
-H 'Cache-Control:max-age=0'
--compressed -H 'Connection:keep-alive'
-H 'Content-Length:188'
-H 'Content-Type:application/x-www-form-urlencoded'
-H 'Cookie:ASPSESSIONIDCASQAACD=IBLGBFOAEHFILMMJJCFEOEMI; JSESSIONID=50DABDEDD0B2FC370C415B4BD1855260; __atuvc=13%7C45; __atuvs=58224f37d312c42400c'
-H 'Host:www.worldvaluessurvey.org'
-H 'Origin:http://www.worldvaluessurvey.org'
-H 'Referer:http://www.worldvaluessurvey.org/AJDownloadLicense.jsp'
-H 'Upgrade-Insecure-Requests:1'
-H 'User-Agent:Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/54.0.2840.71 Safari/537.36'"
form_data <-
  list(
    ulthost = "WVS" ,
    CMSID = "" ,
    LITITLE = "" ,
    LINOMBRE = "fas" ,
    LIEMPRESA = "asf" ,
    LIEMAIL = "asdf" ,
    LIPROJECT = "asfd" ,
    LIUSE = "1" ,
    LIPURPOSE = "asdf" ,
    LIAGREE = "1" ,
    DOID = "3996" ,
    CndWAVE = "-1" ,
    SAID = "-1" ,
    AJArchive = "WVS Data Archive" ,
    EdFunction = "" ,
    DOP = ""
  )
getDATA <- (straighten(browserPOST) %>% make_req)[[1]]()
# Replay the POST with httr; Content-Length is omitted because curl computes it from the body.
a <- VERB(verb = "POST", url = "http://www.worldvaluessurvey.org/AJDownload.jsp",
  httr::add_headers(Accept = "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8",
    `Accept-Encoding` = "gzip, deflate", `Accept-Language` = "en-US,en;q=0.8",
    `Cache-Control` = "max-age=0", Connection = "keep-alive",
    Host = "www.worldvaluessurvey.org",
    Origin = "http://www.worldvaluessurvey.org", Referer = "http://www.worldvaluessurvey.org/AJDownloadLicense.jsp",
    `Upgrade-Insecure-Requests` = "1", `User-Agent` = "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/54.0.2840.71 Safari/537.36"),
  # Cookie names must not include the "Cookie:" header prefix.
  httr::set_cookies(ASPSESSIONIDCASQAACD = "IBLGBFOAEHFILMMJJCFEOEMI",
    JSESSIONID = "50DABDEDD0B2FC370C415B4BD1855260", `__atuvc` = "13%7C45",
    `__atuvs` = "58224f37d312c42400c"),
  encode = "form", body = form_data)
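One thing that might be worth trying is to stop httr from following redirects, so that the response carrying the Location header is the one returned. This is a minimal sketch, assuming the server answers the form POST with a redirect, which I have not confirmed. `extract_location()` and `post_and_get_location()` are hypothetical helper names, and `url` and `form` stand for the endpoint and form fields used in the code above.

```r
# Requires the httr package for post_and_get_location(); calls are namespaced.

# Case-insensitive lookup of the Location header in a named list of headers.
# Returns NA_character_ when no such header is present.
extract_location <- function(headers) {
  idx <- which(tolower(names(headers)) == "location")
  if (length(idx) == 0) return(NA_character_)
  headers[[idx[1]]]
}

# POST the form without following redirects, then read the Location header.
# Whether the server replies with a redirect at this step is an assumption.
post_and_get_location <- function(url, form, ...) {
  resp <- httr::POST(
    url,
    body = form,
    encode = "form",
    httr::config(followlocation = 0L),  # keep the redirect response itself
    ...
  )
  extract_location(httr::headers(resp))
}
```

A call such as `post_and_get_location("http://www.worldvaluessurvey.org/AJDownload.jsp", form_data)` would return `NA` if the server sends no Location header, which would suggest the URL is produced elsewhere (for example inside a nested frame).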
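I added capitalization and punctuation to your question. Please consider doing this yourself in the future, since we try to maintain a good standard of quality for the tens to thousands of people who may read this at any time. –
One problem here is that the link is embedded in an iframe that is itself embedded in another iframe. Scraping those is not easy, to put it mildly. – yeedle
Voting to close as unclear, per http://stackoverflow.com/questions/40498277/programmatically-scraping-a-response-header-within-r#comment68826373_40786535 –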