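Request CDC data using RSocrata or XML in R

My goal is to obtain a time series of Legionellosis cases in the United States, from week 1 of 1996 through week 46 of 2016, from a website supported by the Centers for Disease Control and Prevention (CDC). A colleague tried to scrape only the tables that contain Legionellosis cases with the code below: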

#install.packages('rvest') 
library(rvest) 


## Code to get all URLS 

getUrls <- function(y1, y2, clist) {
  root  <- "https://wonder.cdc.gov/mmwr/mmwr_1995_2014.asp?mmwr_year="
  root1 <- "&mmwr_week="
  root2 <- "&mmwr_table=2"
  root3 <- "&request=Submit&mmwr_location="

  urls <- NULL
  for (year in y1:y2) {
    for (week in 1:53) {
      for (part in clist) {
        urls <- c(urls, paste(root, year, root1, week, root2, part, root3, sep = ""))
      }
    }
  }
  return(urls)
}

TabList<-c("A","B") ## can change to get not just 2 parts of the table but as many as needed. 

WEB <- as.data.frame(getUrls(1996,2014,TabList)) # Only applies from 1996-2014. After 2014, the root url changes. 
head(WEB) 


#Example of how to extract data from a single webpage. 

url <- 'https://wonder.cdc.gov/mmwr/mmwr_1995_2014.asp?mmwr_year=1996&mmwr_week=20&mmwr_table=2A&request=Submit&mmwr_location='

webpage <- read_html(url) 
sb_table <- html_nodes(webpage, 'table') 
sb <- html_table(sb_table, fill = TRUE)[[2]] 

# Test whether Legionellosis is in the table. grep() returns the indices of the columns in which the text is found.
# This can be used to keep only the pages you need and to select only those columns.
test <- grep("Leg", sb) 
sb <- sb[,c(1,test)] 


### This code only works if you have 3 columns for headings. Need to adapt to be more general for all tables. 
#Get Column names 
colnames(sb) <- paste(sb[2,], sb[3,], sep="_") 
colnames(sb)[1] <- "Area" 
sb <- sb[-c(1:3),] 

#Remove commas from numbers so that you can then convert columns to numerical values. Only important if numbers above 1000 
Dat <- sapply(sb, FUN= function(x) 
as.character(gsub(",", "", as.character(x), fixed = TRUE))) 

Dat<-as.data.frame(Dat, stringsAsFactors = FALSE)
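
The comma-stripping step above leaves every column as character, so a conversion to numeric is still needed afterwards. A minimal sketch of that step, run on a made-up data frame rather than on the scraped table: the column names, the values, and the helper `to_number` are assumptions for illustration, as is the choice to turn non-numeric placeholders such as "-" into NA.

```r
# Toy stand-in for the scraped table; column names and values are made up.
toy <- data.frame(
  Area     = c("UNITED STATES", "NEW ENGLAND"),
  Cum_1996 = c("1,234", "56"),
  Cum_1995 = c("987", "-"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: strip thousands separators, then convert to numeric.
# Entries that are not numbers (e.g. "-") become NA; the coercion warning
# is suppressed on the assumption that NA is acceptable for such cells.
to_number <- function(x) {
  suppressWarnings(as.numeric(gsub(",", "", x, fixed = TRUE)))
}

# Convert every column except the first (Area), which stays character.
toy[-1] <- lapply(toy[-1], to_number)

str(toy)
```

Keeping the Area column out of the conversion avoids turning the region names into NA.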

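The code is not finished, though, and I thought it might be better to use an API, because the structure and layout of the tables on the web pages change over time. That way we would not have to comb through the tables to find out when the layout changes and how to adjust the scraping code accordingly. So I tried to pull the data from the API instead.

I found two help documents from the CDC that describe how to get the data. One appears to cover data from 2014 onward and uses RSocrata; the other looks more general and uses XML-formatted requests over HTTP. The XML request over HTTP requires a database ID that I could not find. I then stumbled upon RSocrata and decided to try it, but the code snippet provided, together with the app token I set up, does not work.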

install.packages("RSocrata")

library("RSocrata")
df <- read.socrata("https://data.cdc.gov/resource/cmap-p7au?$$app_token=tdWMkm9ddsLc6QKHvBP6aCiOA")

How can I fix this? My end goal is a table of weekly Legionellosis cases by state from 1996 to 2016.

Source

2016-11-15 Meli

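I would suggest checking out this issue thread in the RSocrata GitHub repo, where they discuss a similar problem with passing tokens into the RSocrata library.

In the meantime, you can actually leave off the $$app_token parameter, and as long as you are not flooding us with requests, it will work just fine. Without an app token you can still sneak in under the throttling limits.

Source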

2016-11-15 18:07:07 chrismetcalf

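@chrismetcalf I removed the $$app_token parameter and it still does not work. I get the same error: "Error in read.socrata: text/plain not a supported data format". I went through the thread and found nothing useful for my scenario. The code I am trying to run comes directly from the Socrata page. – Meli

Ah, it looks like there is a bug in the RSocrata code snippet. The URL should have a '.csv' at the end, like 'https://data.cdc.gov/resource/cmap-p7au.csv' – chrismetcalf

@chrismetcalf The URL should have a .json at the end. That does work, but it does not provide all the data I am looking for, so I will look for an alternative. I want all the Legionellosis data from this web application: 'https://wonder.cdc.gov/mmwr/mmwrmorb.asp/' – Meli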
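
The comments above disagree on whether the resource URL should end in .csv or .json, but both agree that it needs an explicit format extension. A minimal sketch that makes the extension a parameter, so either variant can be tried: `socrata_resource_url` and `fetch_dataset` are hypothetical helpers written for this illustration, not part of RSocrata, and the fetch itself is left commented out because it needs network access and the RSocrata package.

```r
# Hypothetical helper (not part of RSocrata): build a resource URL with an
# explicit format extension such as "csv" or "json".
socrata_resource_url <- function(domain, dataset_id, ext = "csv") {
  paste0("https://", domain, "/resource/", dataset_id, ".", ext)
}

# Dataset ID taken from the question; the extension is the one the asker
# reported as working.
url <- socrata_resource_url("data.cdc.gov", "cmap-p7au", ext = "json")
print(url)

# Hypothetical wrapper: fetch only when called explicitly.
# Requires the RSocrata package and network access.
fetch_dataset <- function(url) {
  if (!requireNamespace("RSocrata", quietly = TRUE)) {
    stop("Please install the RSocrata package first.")
  }
  RSocrata::read.socrata(url)
}

# df <- fetch_dataset(url)
```

No app token is passed here, in line with the answer's remark that the request works without one as long as the request volume stays low.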
