2017-10-13 141 views

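Specify column types in read_csv by using values in a dataframe

I am trying to read in a directory of multiple csv files, each with roughly 7K+ rows and ~1800 columns. I have a data dictionary that I can read into a data frame, where each row of the dictionary identifies a variable (column) name along with its data type.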

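Looking at ?read_csv in the readr package, it is possible to specify column types. However, given that I have nearly 1800 columns to specify, I was hoping to use the information in the available data dictionary to build the column/type pairs in the format the function requires.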

An alternative, less ideal approach would be to read every column in as character and then modify them by hand as needed.

Any help you can provide on how to specify the column types would be greatly appreciated.

If it helps, here is my code to get the data dictionary and coerce it into the format I am referring to.

## Get the data dictionary 
URL = "https://collegescorecard.ed.gov/assets/CollegeScorecardDataDictionary.xlsx" 
download.file(URL, destfile="raw-data/dictionary.xlsx") 

## read in the dictionary to get the variables 
dict = read_excel("raw-data/dictionary.xlsx", sheet = "data_dictionary") 
colnames(dict) = tolower(gsub(" ", "_", colnames(dict))) 
dict = dict %>% filter(!is.na(variable_name)) 

## create a data dictionary 
## https://stackoverflow.com/questions/46738968/specify-column-types-in-read-csv-by-using-values-in-a-dataframe/46742411#46742411 
dict <- dict %>% mutate(variable_type = case_when(api_data_type == "integer" ~ "i", 
                api_data_type == "autocomplete" ~ "c", #assumption that this is a string 
                api_data_type == "string" ~ "c", 
                api_data_type == "float" ~ "d")) 

This returns:

> ## read in the dictionary to get the variables 
> dict = read_excel("raw-data/dictionary.xlsx", sheet = "data_dictionary") 
> colnames(dict) = tolower(gsub(" ", "_", colnames(dict))) 
> dict = dict %>% filter(!is.na(variable_name)) 
> dict <- dict %>% mutate(variable_type = case_when(api_data_type == "integer" ~ "i", 
+             api_data_type == "autocomplete" ~ "c", #assumption that this is a string 
+             api_data_type == "string" ~ "c", 
+             api_data_type == "float" ~ "d")) 
Error: object 'api_data_type' not found 

And my sessionInfo:

> sessionInfo() 
R version 3.3.1 (2016-06-21) 
Platform: x86_64-apple-darwin13.4.0 (64-bit) 
Running under: OS X 10.11.6 (El Capitan) 

locale: 
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8 

attached base packages: 
[1] stats  graphics grDevices utils  datasets methods base  

other attached packages: 
[1] stringr_1.2.0 readxl_0.1.1 readr_1.1.0 dplyr_0.5.0 

loaded via a namespace (and not attached): 
[1] rjson_0.2.15 lazyeval_0.2.0 magrittr_1.5 R6_2.2.2  assertthat_0.1 hms_0.2  DBI_0.7  tools_3.3.1 
[9] tibble_1.2  yaml_2.1.14 Rcpp_0.12.11 stringi_1.1.5 jsonlite_1.5 

I will post a "fully" reproducible solution shortly. – Jas


Maybe you have to upgrade your dplyr version. I have v0.7.4 – Jas
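A quick way to act on this suggestion is to compare the installed dplyr version against the one reported as working in this thread (0.7.4). This is only a minimal sketch: the variable names `known_good` and `installed` are made up for illustration, and the comparison says nothing about which exact release changed the behavior.

```r
# Version reported as working in this thread (assumption: taken from the comment)
known_good <- package_version("0.7.4")

# Version of dplyr installed in the current library
installed <- packageVersion("dplyr")

if (installed < known_good) {
  message("dplyr ", as.character(installed), " is older than ",
          as.character(known_good),
          "; consider install.packages(\"dplyr\") and restarting R")
} else {
  message("dplyr ", as.character(installed), " is at least ",
          as.character(known_good))
}
```

The sessionInfo in the question lists dplyr_0.5.0, so this check would report the older version on that machine.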

Answer


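You can use a combination of mutate and case_when to map the api_data_type column onto readr's compact string representation, in which each column type is represented by a single letter: c = character, i = integer, n = number, d = double, l = logical, etc. The resulting character vector can then be used for the col_types argument of read_csv.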

## Load libraries 
library(dplyr) 
library(readxl) 

## Get the data dictionary 
URL = "https://collegescorecard.ed.gov/assets/CollegeScorecardDataDictionary.xlsx" 
download.file(URL, destfile="raw-data/dictionary.xlsx") 

## read in the dictionary to get the variables 
dict = read_excel("raw-data/dictionary.xlsx", sheet = "data_dictionary") 
colnames(dict) = tolower(gsub(" ", "_", colnames(dict))) 
dict = dict %>% filter(!is.na(variable_name)) 

unique(dict$api_data_type) 
#> [1] "integer"  "autocomplete" "string"  "float" 

dict <- dict %>% mutate(variable_type = case_when(api_data_type == "integer" ~ "i", 
                api_data_type == "autocomplete" ~ "c", #assumption that this is a string 
                api_data_type == "string" ~ "c", 
                api_data_type == "float" ~ "d" 
               ) 
         ) 
cnames <- dict %>% select(variable_name) %>% pull 
head(cnames) 
#> [1] "UNITID" "OPEID" "OPEID6" "INSTNM" "CITY" "STABBR" 
ctypes <- dict %>% select(variable_type) %>% pull 
head(ctypes) 
#> [1] "i" "i" "i" "c" "c" "c" 
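To show the last step, here is a minimal sketch of passing the type codes to read_csv. It assumes the columns in the file appear in the same order as the rows of the dictionary. The miniature dictionary `dict_small`, the temporary CSV and its values are all made up for illustration and stand in for the real `dict` and data files.

```r
library(readr)

# Hypothetical miniature dictionary standing in for the real `dict`
dict_small <- data.frame(
  variable_name = c("UNITID", "INSTNM", "ADM_RATE"),
  variable_type = c("i", "c", "d"),
  stringsAsFactors = FALSE
)

# Made-up CSV written to a temporary file so the example is self-contained
csv_path <- tempfile(fileext = ".csv")
writeLines(c("UNITID,INSTNM,ADM_RATE",
             "1,College A,0.5",
             "2,College B,0.25"), csv_path)

# Collapse the per-column codes into one compact string, here "icd";
# the codes are matched to the file's columns by position
compact_types <- paste(dict_small$variable_type, collapse = "")

dat <- read_csv(csv_path, col_types = compact_types)
```

Because the compact string is positional, it only works when the dictionary and the file agree on both the number and the order of columns.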

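See the update above. I wanted to extend the code you arrived at when giving me your suggestion. See the error, but I didn't know about 'case_when', so +100 for this use case – Btibert3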


No problem, try this fully reproducible example again. Remember to restart your session before running the code. – Jas


I ran into issues with columns not matching the data dictionary, but this was very helpful. Much appreciated – Btibert3
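One possible way around the mismatch mentioned in this comment is to match types by column name rather than by position. The sketch below is not from the original answer. It keeps only the dictionary entries whose names occur in the file's header and lets every other column fall back to character. The dictionary `dict_small`, the column names `NOT_IN_FILE` and `EXTRA`, and the CSV contents are hypothetical.

```r
library(readr)

# Hypothetical dictionary: "NOT_IN_FILE" has no matching column in the CSV
dict_small <- data.frame(
  variable_name = c("UNITID", "INSTNM", "NOT_IN_FILE"),
  variable_type = c("i", "c", "d"),
  stringsAsFactors = FALSE
)

# Made-up CSV: "EXTRA" is a column the dictionary does not describe
csv_path <- tempfile(fileext = ".csv")
writeLines(c("UNITID,INSTNM,EXTRA",
             "1,College A,x",
             "2,College B,y"), csv_path)

# Read only the header row to learn which columns the file really has
header <- names(read_csv(csv_path, n_max = 0,
                         col_types = cols(.default = col_character())))

# Keep the dictionary entries whose names appear in the file
keep <- dict_small$variable_name %in% header
type_list <- as.list(setNames(dict_small$variable_type[keep],
                              dict_small$variable_name[keep]))

# Name-based specification; columns without an entry are read as character
spec <- do.call(cols, c(type_list, list(.default = col_character())))

dat <- read_csv(csv_path, col_types = spec)
```

With a name-based specification the order of the dictionary rows no longer matters, at the cost of one extra header read per file.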
