Find/replace or map using a lookup table in R

I have a data frame in which numeric values represent ethnicity (UK census data).

# create example data 
id = c(1, 2, 3, 4, 5, 6, 7, 8, 9) 
ethnicode = c(0, 1, 2, 3, 4, 5, 6, 7, 8) 
df = data.frame(id, ethnicode)

I can do a mapping (or find/replace) to create a column (or modify the existing one) containing human-readable values:

library(plyr) # provides mapvalues()
# map values one-to-one from numeric to string 
df$ethnicity <- mapvalues(df$ethnicode, 
          from = c(8, 7, 6, 5, 4, 3, 2, 1, 0), 
          to = c("Other", "Black", "Asian", "Mixed", 
            "WhiteOther", "WhiteIrish", "WhiteUK", 
            "WhiteTotal", "All"))

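Of everything I've tried, this seems to be the fastest (about 20 seconds for 9 million rows, versus more than a minute for some methods).

What I can't seem to find (or understand from what I've read) is how to reference a lookup table.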

# create lookup table 
ethnicode = c(8, 7, 6, 5, 4, 3, 2, 1, 0) 
ethnicity = c("Other", "Black", "Asian", "Mixed", "WhiteOther", 
       "WhiteIrish", "WhiteUK", "WhiteTotal", "All") 
lookup = data.frame(ethnicode, ethnicity)

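The crux is that if I want to change the human-readable strings, or do anything else along the way, I'd rather do it once in the lookup table than have to do it in several places across several scripts... If I could also do it more efficiently (under 20 seconds for 9 million rows), that would be good too.

I also want to be able to verify easily that "8" still equals "Other" (or whatever the equivalent is), that "0" still equals "All", and so on, which is visually harder with the method above.

Thanks in advance.

Source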

2016-08-17 Alan Duval

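You can use a named vector for this. However, you need to convert ethnicode to character, because a named vector is only indexed by name when the index is a character vector; numeric codes would be treated as positions.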

df = data.frame(
    id = c(1, 2, 3, 4, 5, 6, 7, 8, 9), 
    ethnicode = as.character(c(0, 1, 2, 3, 4, 5, 6, 7, 8)), 
    stringsAsFactors=FALSE 
) 

# create lookup table 
ethnicode = c(8, 7, 6, 5, 4, 3, 2, 1, 0) 
ethnicity = c("Other", "Black", "Asian", "Mixed", "WhiteOther", 
      "WhiteIrish", "WhiteUK", "WhiteTotal", "All") 
lookup = setNames(ethnicity, as.character(ethnicode))

Then you can do

df <- transform(df, ethnicity=lookup[ethnicode], stringsAsFactors=FALSE)

and you're done.

To handle 9 million rows, I suggest using a database such as SQLite or MonetDB. For SQLite, the following code may help:
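If the codes have to stay numeric (for example because the data frame already exists from a CSV import), a base R alternative is to look them up by position with match(). This is only a minimal sketch, not the method from the answer above; the data frame and column names are taken from the question, and I have not timed it on 9 million rows.

```r
# Lookup table kept in one place (it could live in its own script and be source()'d).
lookup <- data.frame(
  ethnicode = c(8, 7, 6, 5, 4, 3, 2, 1, 0),
  ethnicity = c("Other", "Black", "Asian", "Mixed", "WhiteOther",
                "WhiteIrish", "WhiteUK", "WhiteTotal", "All"),
  stringsAsFactors = FALSE
)

# Example data with numeric codes, as in the question.
df <- data.frame(id = 1:9, ethnicode = c(0, 1, 2, 3, 4, 5, 6, 7, 8))

# match() gives, for each code in df, its row position in the lookup table;
# indexing the label column by those positions performs the mapping.
df$ethnicity <- lookup$ethnicity[match(df$ethnicode, lookup$ethnicode)]
```

Codes that do not appear in the lookup table come back as NA, which makes missing mappings easy to spot with anyNA(df$ethnicity).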

library(RSQLite) 

dbname <- "big_data_mapping.db" # db to create 
csvname <- "data/big_data_mapping.csv" # large dataset 

ethn_codes = data.frame(
    ethnicode= c(8, 7, 6, 5, 4, 3, 2, 1, 0), 
    ethnicity= c("Other", "Black", "Asian", "Mixed", "WhiteOther", "WhiteIrish", "WhiteUK", "WhiteTotal", "All") 
) 

# build db 
con <- dbConnect(SQLite(), dbname) 
dbWriteTable(con, name="main", value=csvname, overwrite=TRUE) 
dbWriteTable(con, name="ethn_codes", ethn_codes, overwrite=TRUE) 

# join the tables 
dat <- dbGetQuery(con, "SELECT main.id, ethn_codes.ethnicity FROM main JOIN ethn_codes ON main.ethnicode=ethn_codes.ethnicode") 

# finish 
dbDisconnect(con) 
#file.remove(dbname)
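To see what the SQL join returns without setting up a database, the same inner join can be sketched in base R with merge() on a small sample. The four-row `main` data frame here is made-up illustration data standing in for the large CSV; this is only meant to show the shape of the result, not to replace the database for 9 million rows.

```r
# Small made-up stand-in for the "main" table, plus the same code table.
main <- data.frame(id = 1:4, ethnicode = c(0, 8, 3, 8))
ethn_codes <- data.frame(
  ethnicode = c(8, 7, 6, 5, 4, 3, 2, 1, 0),
  ethnicity = c("Other", "Black", "Asian", "Mixed", "WhiteOther",
                "WhiteIrish", "WhiteUK", "WhiteTotal", "All"),
  stringsAsFactors = FALSE
)

# Inner join on ethnicode, like the SQL JOIN; merge() sorts by the key by default.
joined <- merge(main, ethn_codes, by = "ethnicode")

# Keep the same two columns as the SELECT and restore the original id order.
dat <- joined[order(joined$id), c("id", "ethnicity")]
```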

MonetDB is said to be better suited to the kind of tasks you usually do in R, so it is definitely worth a look.

Source

2016-08-17 15:51:40

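Thanks for this, @Karsten W. The problem I'm having is that the suggested solutions seem to treat the question as if it were about the example data frames, which I could re-create with modifications, whereas mine are pre-existing data frames (created by importing CSVs). I tried applying "as.character" and "stringsAsFactors" to the relevant columns in the respective data frames so that I could use "transform", but coercing to "character" takes a few seconds, while "transform" just hangs.

Edited the answer for the big-data aspect.

Cool, thanks. I'll have a bit of a play and see how I go.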
