2017-04-20
5

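How do I do group matching in R?

Suppose I have the following data.frame, where treat == 1 means that the id received treatment and prob is the calculated probability that treat == 1.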

set.seed(1) 
df <- data.frame(id = 1:10, treat = sample(0:1, 10, replace = T)) 
df$prob <- ifelse(df$treat, rnorm(10, .8, .1), rnorm(10, .4, .4)) 
df 
   id treat      prob
1   1     0 0.3820266
2   2     0 0.3935239
3   3     1 0.8738325
4   4     1 0.8575781
5   5     0 0.6375605
6   6     1 0.9511781
7   7     1 0.8389843
8   8     1 0.7378759
9   9     1 0.5785300
10 10     0 0.6479303
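A note on reproducibility: R 3.6.0 changed the default algorithm behind sample(), so on a current R version set.seed(1) may no longer give the treat values printed above (calling RNGkind(sample.kind = "Rounding") first should restore the old behaviour). A minimal sketch that sidesteps the RNG by rebuilding the same data.frame from the printed values, with nothing added beyond typing the numbers in by hand:

```r
# Rebuild the example data from the printed output, independent of the RNG
df <- data.frame(
  id    = 1:10,
  treat = c(0L, 0L, 1L, 1L, 0L, 1L, 1L, 1L, 1L, 0L),
  prob  = c(0.3820266, 0.3935239, 0.8738325, 0.8575781, 0.6375605,
            0.9511781, 0.8389843, 0.7378759, 0.5785300, 0.6479303)
)
df
```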

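To minimize selection bias, I now want to create pseudo treatment and control groups based on the values of treat and prob:

  • When any id with treat == 1 is within 0.1 prob of any id with treat == 0, I want its group value to be "treated".

  • When any id with treat == 0 is within 0.1 prob of any id with treat == 1, I want its group value to be "control".

Below is an example of the result I would like.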

df$group <- c(NA, NA, NA, NA, 'control', NA, NA, 'treated', 'treated', 'control') 
df 
   id treat      prob   group
1   1     0 0.3820266    <NA>
2   2     0 0.3935239    <NA>
3   3     1 0.8738325    <NA>
4   4     1 0.8575781    <NA>
5   5     0 0.6375605 control
6   6     1 0.9511781    <NA>
7   7     1 0.8389843    <NA>
8   8     1 0.7378759 treated
9   9     1 0.5785300 treated
10 10     0 0.6479303 control

How can I do this? In the example above the matching is done with replacement, but a solution without replacement would also be welcome.

Answers

2

I think this problem is well suited to cut in base R. Here is how you can do it in a vectorized way:

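f <- function(r) { 
     # r: logical row selector; bin the selected probs against the other group's +/- 0.1 bounds
     x <- cut(df[r,]$prob, breaks = c(df[!r,]$prob-0.1, df[!r,]$prob+0.1)) 
     # return the ids whose prob fell into a bin (cut() gives NA outside all breaks)
     df[r,][!is.na(x),]$id 
} 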

ones <- df$treat==1 
df$group <- NA 

df[df$id %in% f(ones),]$group <- "treated" 
df[df$id %in% f(!ones),]$group <- "control" 

> df 

#    id treat      prob   group
# 1   1     0 0.3820266    <NA>
# 2   2     0 0.3935239    <NA>
# 3   3     1 0.8738325    <NA>
# 4   4     1 0.8575781    <NA>
# 5   5     0 0.6375605 control
# 6   6     1 0.9511781    <NA>
# 7   7     1 0.8389843    <NA>
# 8   8     1 0.7378759 treated
# 9   9     1 0.5785300 treated
# 10 10     0 0.6479303 control
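One caveat with the cut() approach, as far as I can tell: cut() sorts its breaks, so any prob lying between the smallest lower bound and the largest upper bound lands in some bin, even when it sits in a gap between two ±0.1 windows, and cut() stops with an error if two break values coincide. On the example data this makes no difference. A minimal sketch of a strictly pairwise check with outer(), where df is rebuilt from the printed values and is_treated and near are names made up for this illustration:

```r
# Example data typed in from the printed output in the question
df <- data.frame(
  id    = 1:10,
  treat = c(0L, 0L, 1L, 1L, 0L, 1L, 1L, 1L, 1L, 0L),
  prob  = c(0.3820266, 0.3935239, 0.8738325, 0.8575781, 0.6375605,
            0.9511781, 0.8389843, 0.7378759, 0.5785300, 0.6479303)
)

is_treated <- df$treat == 1

# near[i, j] is TRUE when treated row i and control row j differ by less than 0.1
near <- abs(outer(df$prob[is_treated], df$prob[!is_treated], "-")) < 0.1

df$group <- NA_character_
# a treated id qualifies if its row has at least one TRUE, a control id if its column does
df$group[is_treated][rowSums(near) > 0]  <- "treated"
df$group[!is_treated][colSums(near) > 0] <- "control"
df
```

Because every treated/control pair is compared directly, this is matching with replacement in the sense of the question: one control can justify several treated ids and vice versa.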
+1

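I accepted this answer because it is a complete solution that uses base R functions and performs the matching neatly with a predefined function and conditions. – lillemets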

1

Perhaps not the most elegant, but it seems to work for me:

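library(dplyr)

# one group per id, so prob is a single value inside mutate(); NA_character_ gives a real missing value
df %>% group_by(id,treat) %>% mutate(group2 = ifelse(treat==1, 
               ifelse(any(abs(prob-df$prob[df$treat==0])<0.1),"treated",NA_character_), # treat==1
               ifelse(any(abs(prob-df$prob[df$treat==1])<0.1),"control",NA_character_))) # treat==0 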
1

Is this what you want?

#Base R: 

# loop over the treated probs only; sapply() avoids apply() coercing df to a character matrix
sapply(df$prob[df$treat == 1], function(p){ 
    ifelse(any(df[df$treat == 0, 'prob'] -.1 < p & p < df[df$treat == 0, 'prob'] +.1), 'treated', NA) 
}) 

You can reverse the $treat conditions to cover the control group, and append the resulting variables to df.
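A minimal sketch of that last step, reversing the condition for the controls and writing both results into one column. It assumes df holds the values printed in the question; has_match and treated are hypothetical names used only here:

```r
# Example data typed in from the printed output in the question
df <- data.frame(
  id    = 1:10,
  treat = c(0L, 0L, 1L, 1L, 0L, 1L, 1L, 1L, 1L, 0L),
  prob  = c(0.3820266, 0.3935239, 0.8738325, 0.8575781, 0.6375605,
            0.9511781, 0.8389843, 0.7378759, 0.5785300, 0.6479303)
)

# TRUE when p lies within 0.1 of at least one value in others
has_match <- function(p, others) any(abs(p - others) < 0.1)

treated <- df$treat == 1
df$group <- NA_character_

# treated ids are checked against the control probs ...
df$group[treated] <- ifelse(
  sapply(df$prob[treated], has_match, others = df$prob[!treated]),
  "treated", NA)

# ... and control ids against the treated probs (the reversed condition)
df$group[!treated] <- ifelse(
  sapply(df$prob[!treated], has_match, others = df$prob[treated]),
  "control", NA)
df
```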

4

You can try

foo <- function(x){ 
    ctrl_prob  <- x$prob[x$treat == 0]   # all control probs (not just their range)
    treat_prob <- x$prob[x$treat == 1]   # all treated probs
    tmp <- sapply(1:nrow(x), function(y, z){ 
        if(z$treat[y] == 1){ 
            ifelse(any(abs(z$prob[y] - ctrl_prob) <= 0.1), "treated", NA) 
        }else{ 
            ifelse(any(abs(z$prob[y] - treat_prob) <= 0.1), "control", NA) 
        }}, x) 
    cbind(x, group = tmp) 
} 

foo(df)  
   id treat      prob   group
1   1     0 0.3820266    <NA>
2   2     0 0.3935239    <NA>
3   3     1 0.8738325    <NA>
4   4     1 0.8575781    <NA>
5   5     0 0.6375605 control
6   6     1 0.9511781    <NA>
7   7     1 0.8389843    <NA>
8   8     1 0.7378759 treated
9   9     1 0.5785300 treated
10 10     0 0.6479303 control
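The question also asks about matching without replacement. A minimal greedy sketch, assuming each treated id is paired with its nearest still-unused control inside the 0.1 window. Greedy pairing depends on the processing order and is not guaranteed to find the largest possible set of pairs. The match_id column and the controls_left variable are names invented for this illustration, and df is rebuilt from the values printed in the question:

```r
# Example data typed in from the printed output in the question
df <- data.frame(
  id    = 1:10,
  treat = c(0L, 0L, 1L, 1L, 0L, 1L, 1L, 1L, 1L, 0L),
  prob  = c(0.3820266, 0.3935239, 0.8738325, 0.8575781, 0.6375605,
            0.9511781, 0.8389843, 0.7378759, 0.5785300, 0.6479303)
)

df$group    <- NA_character_
df$match_id <- NA_integer_                 # id of the matched partner, if any

controls_left <- which(df$treat == 0)      # row indices of controls not yet used

for (i in which(df$treat == 1)) {
  if (length(controls_left) == 0) break    # nothing left to match against
  d <- abs(df$prob[i] - df$prob[controls_left])
  j <- which.min(d)                        # nearest remaining control
  if (d[j] < 0.1) {
    k <- controls_left[j]
    df$group[i]    <- "treated"
    df$group[k]    <- "control"
    df$match_id[i] <- df$id[k]
    df$match_id[k] <- df$id[i]
    controls_left  <- controls_left[-j]    # each control is used at most once
  }
}
df
```

On this data the loop pairs id 8 with id 10 and id 9 with id 5, so the group column is the same as with replacement; with more overlapping candidates the two approaches can differ.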