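Optimized recoding in base R

I am recoding a variable based on some rather long strings, exemplified here by the strings A, B, C, D, E and G. Is there a way to do this recoding in base R without repeating the reference to df$foo 12 times? Maybe there is a smarter, faster approach I could explore? Is this really the smartest way to do it in R?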

df <- data.frame(
    foo = 1000:1010, 
    bar = letters[1:11]) 
df 
    foo bar 
1 1000 a 
2 1001 b 
3 1002 c 
4 1003 d 
5 1004 e 
6 1005 f 
7 1006 g 
8 1007 h 
9 1008 i 
10 1009 j 
11 1010 k 

A <- c(1002) 
B <- c(1007, 1008) 
C <- c(1001, 1003) 
D <- c(1004, 1006) 
E <- c(1000, 1005) 
G <- c(1010, 1009) 

df$foo[df$foo %in% A] <- 1 
df$foo[df$foo %in% B] <- 2 
df$foo[df$foo %in% C] <- 3 
df$foo[df$foo %in% D] <- 4 
df$foo[df$foo %in% E] <- 5 
df$foo[df$foo %in% G] <- 7 
df 
    foo bar 
1 5 a 
2 3 b 
3 1 c 
4 3 d 
5 4 e 
6 5 f 
7 4 g 
8 2 h 
9 2 i 
10 7 j 
11 7 k

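Update 2013-03-11 05:28:06Z,

I have rewritten the five solutions as functions so that I could compare them with the microbenchmark package. The result is that Tyler Rinker's and flodel's solutions are the fastest (see the results below), although this question is not only about speed. I am also looking for concise and smart solutions. Out of curiosity I also added a solution using the Recode function from the car package. Please feel free to let me know if I could have rewritten the solutions in a more optimal way, or if the microbenchmark package is not the best way to compare these functions.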

df <- data.frame(
    foo = sample(1000:1010, 1e5+22, replace = TRUE), 
    bar = rep(letters, 3847)) 
str(df) 

A <- c(1002) 
B <- c(1007, 1008) 
C <- c(1001, 1003) 
D <- c(1004, 1006) 
E <- c(1000, 1005) 
G <- c(1010, 1009) 

# juba's solution 
juba <- function(df,foo) within(df, {foo[foo %in% A] <- 1; foo[foo %in% B] <- 2;foo[foo %in% C] <- 3;foo[foo %in% D] <- 4;foo[foo %in% E] <- 5;foo[foo %in% G] <- 7}) 
# Arun's solution 
Arun <- function(df,x) factor(df[,x], levels=c(A,B,C,D,E,G), labels=c(1, rep(c(2:5, 7), each=2))) 
# flodel's solution 
flodel <- function(df,x) rep(c(1, 2, 3, 4, 5, 7), sapply(list(A, B, C, D, E, G), length))[match(df[,x], unlist(list(A, B, C, D, E, G)))] 
# Tyler Rinker's solution 
TylerRinker <- function(df,x) data.frame(vals = unlist(list(A = c(1002),B = c(1007, 1008),C = c(1001, 1003),D = c(1004, 1006),E = c(1000, 1005), G = c(1010, 1009))), labs = c(1, rep(c(2:5, 7), each=2)))[match(df[,x], unlist(list(A = c(1002),B = c(1007, 1008),C = c(1001, 1003),D = c(1004, 1006),E = c(1000, 1005), G = c(1010, 1009)))), 2] 
# agstudy's solution 
agstudy <- function(df,foo) merge(df,data.frame(foo=unlist(list(A, B, C, D, E, G)), val =rep((1:7)[-6],rapply(list(A, B, C, D, E, G), length)))) 
# Recode from the car package (requires library(car) to be loaded)
ReINcar <- function(df,x) Recode(df[,x], "A='A'; B='B'; C='C'; D='D'; E='E'; G='G'") 

# install.packages("microbenchmark", dependencies = TRUE) 
require(microbenchmark) 

# run test 
res <- microbenchmark(juba(df, foo), Arun(df, 1), flodel(df, 1), TylerRinker(df,1) ,agstudy(df, foo), ReINcar(df, 1), times = 25) 
There were 15 warnings (use warnings() to see them) # warning due to x's solution 

## Print results: 
print(res)

Numbers,

Unit: milliseconds
               expr        min         lq     median         uq        max neval
      juba(df, foo)  37.944355  39.521603  41.987174  46.385974  79.559750    25
        Arun(df, 1)  23.833334  24.115776  24.648842  26.987431  55.466448    25
      flodel(df, 1)   3.586179   3.637024   3.956814   6.468735  28.404166    25
 TylerRinker(df, 1)   3.919563   4.115994   4.529926   5.532688   8.508956    25
   agstudy(df, foo) 301.487732 324.641734 334.801005 352.753496 415.421212    25
     ReINcar(df, 1)  73.655566  77.903088  81.745037 101.038791 125.158208    25


### Plot results: 
boxplot(res)

Box plot,

[Figure: Box Plot of microbenchmark results]
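Before timing candidate solutions it can be worth confirming that they return the same result. The following is a minimal base R sketch under the same set definitions as in the question. The helper names `recode_in` and `recode_match` are made up for this illustration and are not the benchmarked functions themselves.

```r
# Group definitions and target codes, as in the question
sets <- list(A = 1002, B = c(1007, 1008), C = c(1001, 1003),
             D = c(1004, 1006), E = c(1000, 1005), G = c(1010, 1009))
codes <- c(1, 2, 3, 4, 5, 7)

# Hypothetical helper: one %in% assignment per group
recode_in <- function(x) {
  out <- rep(NA_real_, length(x))
  for (i in seq_along(sets)) out[x %in% sets[[i]]] <- codes[i]
  out
}

# Hypothetical helper: a single match() against a flattened lookup
recode_match <- function(x) {
  rep(codes, lengths(sets))[match(x, unlist(sets))]
}

set.seed(1)
x <- sample(1000:1010, 1e5, replace = TRUE)

# Both approaches should agree before their speed is compared
same_result <- identical(recode_in(x), recode_match(x))

# Rough single-run timings; microbenchmark repeats runs for steadier numbers
t_in <- system.time(recode_in(x))
t_match <- system.time(recode_match(x))
```

A single `system.time()` call is noisy, so it only gives a rough impression. Repeated runs, as microbenchmark does, are the more reliable comparison.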

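Source

2013-03-09 Eric Fail

A and B have duplicated values. Is that right? – Arun 2013-03-09 23:23:48

@Arun, no. That was a typo on my part. I have updated my question. Thanks! – 2013-03-09 23:29:25

You could also take a look at the 'recode' functions in the 'memisc' and 'car' packages. – juba 2013-03-09 23:44:03

Here is a general (scalable) approach that is also very fast: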

sets <- list(A, B, C, D, E, G) 
vals <- c(1, 2, 3, 4, 5, 7) 

keys <- unlist(sets) 
values <- rep(vals, sapply(sets, length)) 
df$foo <- values[match(df$foo, keys)]
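The lookup above can be wrapped in a small function. This is only a sketch: the name `recode_by_sets` and the choice to leave unmatched values unchanged are assumptions made for illustration and are not part of the answer itself (plain `match()` indexing would return `NA` for values that appear in none of the sets).

```r
# Hypothetical wrapper around the keys/values lookup
recode_by_sets <- function(x, sets, vals) {
  stopifnot(length(sets) == length(vals))
  keys <- unlist(sets, use.names = FALSE)   # all old values, flattened
  values <- rep(vals, lengths(sets))        # new code repeated once per old value
  idx <- match(x, keys)                     # position of each x in keys, NA if absent
  out <- values[idx]
  unmatched <- is.na(idx)
  out[unmatched] <- x[unmatched]            # assumption: keep unmatched values as they are
  out
}

sets <- list(1002, c(1007, 1008), c(1001, 1003),
             c(1004, 1006), c(1000, 1005), c(1010, 1009))
vals <- c(1, 2, 3, 4, 5, 7)

# 2000 belongs to no set and is therefore passed through unchanged
recoded <- recode_by_sets(c(1000:1010, 2000), sets, vals)
```

Because the whole recode is a single vectorized `match()`, adding more sets only lengthens the lookup vectors and adds no further passes over the data.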

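Source

2013-03-10 00:09:42 flodel

I have updated my question with a rough benchmark in which I compare the speed of your solution with the four other solutions. Please feel free to correct the function I used to represent your solution in the test. – 2013-03-11 08:10:41

Using within can save you some keystrokes: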

df <- within(df, 
     {foo[foo %in% A] <- 1; 
     foo[foo %in% B] <- 2; 
     foo[foo %in% C] <- 3; 
     foo[foo %in% D] <- 4; 
     foo[foo %in% E] <- 5; 
     foo[foo %in% G] <- 7})

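Source

2013-03-09 23:17:19 juba

Thank you for your response. I like that you managed to remove 12 * 'df$', but it is still a bit repetitive with regard to 'foo'. Could you explain the use of ';'? – 2013-03-09 23:25:51

@EricFail 'within' takes an expression as its second argument. Here we want to execute several statements, so I pass them to 'within' enclosed in '{}' and separated by ';'. This is the same way you can put several statements on one line in R. – juba 2013-03-09 23:29:46

It might be an OS issue, but on my machine (*NIX) the code works fine without the ';'. I mean with your statements on separate lines. – 2013-03-09 23:35:52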
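The point about ';' can be checked directly. The sketch below, which uses only two of the sets to stay short, suggests that inside the braces a newline separates statements just as ';' does, so this is R syntax rather than anything specific to an operating system.

```r
A <- c(1002)
B <- c(1007, 1008)
df <- data.frame(foo = 1000:1010, bar = letters[1:11])

# Statements separated by ';' on one line
with_semicolons <- within(df, {foo[foo %in% A] <- 1; foo[foo %in% B] <- 2})

# The same statements separated by newlines
with_newlines <- within(df, {
  foo[foo %in% A] <- 1
  foo[foo %in% B] <- 2
})

# Both forms produce the same data frame
same <- identical(with_semicolons, with_newlines)
```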

You could also do this: (edited)

> df$foo <- factor(df$foo, levels=c(A,B,C,D,E,G), labels=c(1, rep(c(2:5, 7), each=2))) 

# Warning message: 
# In `levels<-`(`*tmp*`, value = if (nl == nL) as.character(labels) else paste0(labels, : 
# duplicated levels will not be allowed in factors anymore 

# foo bar 
# 1 5 a 
# 2 3 b 
# 3 1 c 
# 4 3 d 
# 5 4 e 
# 6 5 f 
# 7 4 g 
# 8 2 h 
# 9 2 i 
# 10 7 j 
# 11 7 k

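Source

2013-03-09 23:26:00 Arun

Thanks for replying to my question. I updated my question and now I get a different error (about the length of the labels). I find your solution interesting! – 2013-03-09 23:32:40

I have edited the solution for the new data. The warning seems to persist because the levels do not have unique labels. It still returns the correct result, though. – Arun 2013-03-09 23:35:54

Thanks, though I must admit the warning makes me a bit nervous. – 2013-03-09 23:38:05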
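Two things are worth keeping in mind with the factor approach: the result is a factor rather than a number, and the warning depends on the R version. The following is a minimal sketch that assumes R 3.4.0 or later, where the documentation of `factor()` states that duplicated labels may be used to map several values of `x` onto the same level. The conversion back to numeric is an addition for illustration and not part of the answer above.

```r
A <- c(1002)
B <- c(1007, 1008)
C <- c(1001, 1003)
D <- c(1004, 1006)
E <- c(1000, 1005)
G <- c(1010, 1009)
foo <- 1000:1010

# Each old value is a level; repeated labels collapse several levels into one
f <- factor(foo, levels = c(A, B, C, D, E, G),
            labels = c(1, rep(c(2:5, 7), each = 2)))

# A factor stores integer codes plus labels, so go through character
# to recover the label values as numbers
recoded <- as.numeric(as.character(f))
```

Calling `as.numeric(f)` directly would return the internal level codes (1 to 6) instead of the labels, which would turn the intended 7 into a 6.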

My approach (doing away with the separate A, B, C ... objects altogether, though I see flodel's is very similar).

keyL <- list(
    A = c(1002), 
    B = c(1007, 1008), 
    C = c(1001, 1003), 
    D = c(1004, 1006), 
    E = c(1000, 1005), 
    G = c(1010, 1009) 
) 

key <- data.frame(vals = unlist(keyL), labs = c(1, rep(c(2:5, 7), each=2))) 

df$foo2 <- key[match(df$foo, key$vals), 2]

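I don't like overwriting the old column, so I created a new one. I would also store the key as a named list.

Source

2013-03-10 00:12:31

I have updated my question with a rough benchmark in which I compare the speed of your solution with the four other solutions. Please feel free to correct the function I used to represent your solution in the test. – 2013-03-11 08:11:04

Another option is to use merge, which is very similar to @flodel's and @Tyler's approaches: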

sets <- list(A, B, C, D, E, G) 
df.code = data.frame(foo=unlist(sets), 
        val =rep((1:7)[-6],rapply(sets, length))) 
> merge(df,df.code) 
    foo bar val 
1 1000 a 5 
2 1001 b 3 
3 1002 c 1 
4 1003 d 3 
5 1004 e 4 
6 1005 f 5 
7 1006 g 4 
8 1007 h 2 
9 1008 i 2 
10 1009 j 7 
11 1010 k 7
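By default `merge()` sorts the result on the key column and drops rows that have no match in the lookup table. If the original row order and unmatched rows matter, one possible variation is sketched below. The helper column `row_id` and the example value 9999 are made up for this illustration.

```r
sets <- list(A = 1002, B = c(1007, 1008), C = c(1001, 1003),
             D = c(1004, 1006), E = c(1000, 1005), G = c(1010, 1009))
df.code <- data.frame(foo = unlist(sets),
                      val = rep(c(1, 2, 3, 4, 5, 7), lengths(sets)))

# Unsorted input with one value (9999) that belongs to no set
df <- data.frame(foo = c(1005, 1002, 9999, 1010), bar = letters[1:4])

df$row_id <- seq_len(nrow(df))                       # remember the original order
out <- merge(df, df.code, by = "foo", all.x = TRUE)  # all.x keeps unmatched rows as NA
out <- out[order(out$row_id), ]                      # restore the original order
out$row_id <- NULL
rownames(out) <- NULL
```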

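Source

2013-03-10 03:29:49 agstudy

I have updated my question with a rough benchmark in which I compare the speed of your solution with the four other solutions. Please feel free to correct the function I used to represent your solution in the test. – 2013-03-11 08:11:45

I think this does what you want, although it uses a slightly different format. It may well be the fastest approach.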

library(data.table) 

## Create the sample data: 
dt <- data.table(foo=sample(1000:1010, 1e5+22, replace = TRUE), bar=rep(letters, 3847), key="foo") 

## Create the table that maps the old value of foo to the new one: 
dt.recode<-data.table(foo_old=1000:1010, foo_new=c(5L, 3L, 1L, 3L, 4L, 5L, 4L, 2L, 2L, 7L, 7L), key="foo_old") 

## Show the result of the join/merge between the original and recoded table: 
## (not necessary if you only want to update the original table) 
dt[dt.recode] 
##  foo bar foo_new 
## 1: 1000 a  5 
## 2: 1001 b  3 
## 3: 1002 c  1 
## 4: 1003 d  3 
## 5: 1004 e  4 
## 6: 1005 f  5 
## 7: 1006 g  4 
## 8: 1007 h  2 
## 9: 1008 i  2 
## 10: 1009 j  7 
## 11: 1010 k  7 

## Same as above, but updates the value of foo in the original table: 
dt[dt.recode,foo:=foo_new][] 
##  foo bar 
## 1: 5 a 
## 2: 3 b 
## 3: 1 c 
## 4: 3 d 
## 5: 4 e 
## 6: 5 f 
## 7: 4 g 
## 8: 2 h 
## 9: 2 i 
## 10: 7 j 
## 11: 7 k

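If you prefer not to create the data table from scratch, here is how to convert an existing data frame to a data table (and add the key, which is needed for the join later):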

dt <- as.data.table(df) 
setkey(dt,foo)

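I don't know how you would want to time this approach, but assuming dt and dt.recode already exist and are keyed, running the single line that updates the table shows 0 elapsed time on my system. Also, if your A, B, C, D, E, G groups have any intrinsic meaning, I would add them as a column to your original table. You could then join on that field, and dt.recode would only need 6 rows (assuming you have 6 groups).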
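The idea of a group column with a six-row recode table can be sketched as follows. To stay free of package dependencies this sketch uses base R rather than data.table, and the names `group`, `group_codes` and `foo_new` are assumptions made for illustration.

```r
df <- data.frame(foo = 1000:1010, bar = letters[1:11])
sets <- list(A = 1002, B = c(1007, 1008), C = c(1001, 1003),
             D = c(1004, 1006), E = c(1000, 1005), G = c(1010, 1009))

# Step 1: store the group each value belongs to as its own column
group_of_key <- rep(names(sets), lengths(sets))
df$group <- group_of_key[match(df$foo, unlist(sets))]

# Step 2: the recode table needs only one row per group
group_codes <- data.frame(group = names(sets), code = c(1, 2, 3, 4, 5, 7))
df$foo_new <- group_codes$code[match(df$group, group_codes$group)]
```

Keeping the group as a column means a later change to the codes only touches the six-row table and not the definition of the sets.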

Source

2013-06-22 23:07:14 dnlbrky

Thanks for your reply, but I have already gone with [flodel's answer](http://stackoverflow.com/a/15317417/1305688), which he posted on March 10. In any case, I appreciate your input! – 2013-06-26 09:32:59
