2012-07-31 43 views
0

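Collapsing duplicated rows with different values in several columns, using R

My data frame has rows with the same ID but different values for test year and age. I want to collapse the duplicated rows and create new columns for the differing values.

I'm new to R and have been struggling with this.

Here is the data frame: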

 
> df
   id project testyr1 testyr2 age1 age2
1 16S      AS    2008      NA   29   NA
2 32S      AS    2004      NA   30   NA
3 37S      AS      NA    2011   NA   36
4 50S      AS    2004      NA   23   NA
5 50S      AS    1998      NA   16   NA
6 55S      AS    2007      NA   28   NA

testyr1 should hold the earliest year and testyr2 the most recent year. age1 should be the younger age and age2 the older age.

The output should be:

 
     id project testyr1 testyr2 age1 age2 
1  16S  AS 2008  NA  29  NA 
2  32S  AS 2004  NA  30  NA 
3  37S  AS NA  2011  NA  36 
4  50S  AS 1998  2004  16  23 
6  55S  AS 2007  NA  28  NA 

I tried writing a loop, but I don't know how to finish it:

df.undup <- c() 
for (i in 1:nrow(df)){ 
    if i == i+1  
    df$testyr1 != NA { 

    testyr2 = max(testyr1) 
    testyr1 = min(testyr1) 
    nage2 = max(nage1) 
    nage1 = min(nage1) 
    } 
else{ 
    testyr2 = max(testyr2) 
    testyr1 = min(testyr2) 
    nage2 = max(nage2) 
    nage1 = min(nage2) 
    } 
} 

Any help would be greatly appreciated.

Can you have only two duplicates? – nico 2012-07-31 20:30:09
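
One way to answer that is to count the rows per id. The sketch below rebuilds the sample data by hand (the column types are an assumption) and uses only base R:

```r
# Sample data from the question; numeric/character types are assumed
df <- data.frame(
  id      = c("16S", "32S", "37S", "50S", "50S", "55S"),
  project = "AS",
  testyr1 = c(2008, 2004, NA, 2004, 1998, 2007),
  testyr2 = c(NA, NA, 2011, NA, NA, NA),
  age1    = c(29, 30, NA, 23, 16, 28),
  age2    = c(NA, NA, 36, NA, NA, NA),
  stringsAsFactors = FALSE
)

rows_per_id <- table(df$id)                     # number of rows for each id
max_rows    <- max(rows_per_id)                 # size of the largest group
dup_ids     <- names(rows_per_id)[rows_per_id > 1]  # ids that occur more than once
```

If max_rows is larger than 2 on the real data, two year columns and two age columns can only keep the earliest and latest values of each group.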

Answers

3
library(plyr) 

data <- read.csv(textConnection("id,project,testyr1,testyr2,age1,age2 
16S,AS,2008,NA,29,NA 
32S,AS,2004,NA,30,NA 
37S,AS,NA,2011,NA,36 
50S,AS,2004,NA,23,NA 
50S,AS,1998,NA,16,NA 
55S,AS,2007,NA,28,NA")) 


# One output row per id: minimum of the first columns, maximum of the second
new_data <- ddply(data, .(id), function(x) {
    data.frame(id = unique(x$id), project = unique(x$project),
               testyr1 = min(x$testyr1), testyr2 = max(x$testyr2),
               age1 = min(x$age1), age2 = max(x$age2))
})

> new_data
   id project testyr1 testyr2 age1 age2
1 16S      AS    2008      NA   29   NA
2 32S      AS    2004      NA   30   NA
3 37S      AS      NA    2011   NA   36
4 50S      AS    1998      NA   16   NA
5 55S      AS    2007      NA   28   NA

# But your result example suggests you want the lowest 
# of testyr to be in testyr1 and the highest of the combined 
# testyrs to be in testyr2. Same logic for ages. 
# If so, the one below should work: 

new_data <- ddply(data, .(id), function(x) {
    if (nrow(x) > 1) {
        # Pool both columns so min/max see every non-missing value for this id
        years <- c(x$testyr1, x$testyr2)
        ages <- c(x$age1, x$age2)
        data.frame(id = unique(x$id), project = unique(x$project),
                   testyr1 = min(years, na.rm = TRUE), testyr2 = max(years, na.rm = TRUE),
                   age1 = min(ages, na.rm = TRUE), age2 = max(ages, na.rm = TRUE))
    } else {
        # Ids with a single row are returned unchanged
        data.frame(id = unique(x$id), project = unique(x$project),
                   testyr1 = x$testyr1, testyr2 = x$testyr2,
                   age1 = x$age1, age2 = x$age2)
    }
})

> new_data 
    id project testyr1 testyr2 age1 age2 
1 16S  AS 2008  NA 29 NA 
2 32S  AS 2004  NA 30 NA 
3 37S  AS  NA 2011 NA 36 
4 50S  AS 1998 2004 16 23 
5 55S  AS 2007  NA 28 NA 
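
For readers without plyr, the same pooling idea can be sketched in base R with split() and lapply(). The helper names safe_min, safe_max and collapse_one are made up for this illustration. The helpers return NA when a group has no non-missing values, because min(..., na.rm = TRUE) on an all-NA vector returns Inf with a warning. Treat this as a sketch on the sample data, not a drop-in replacement:

```r
data <- read.csv(text = "id,project,testyr1,testyr2,age1,age2
16S,AS,2008,NA,29,NA
32S,AS,2004,NA,30,NA
37S,AS,NA,2011,NA,36
50S,AS,2004,NA,23,NA
50S,AS,1998,NA,16,NA
55S,AS,2007,NA,28,NA", stringsAsFactors = FALSE)

# Hypothetical helpers: NA instead of Inf/-Inf when every value is missing
safe_min <- function(v) if (all(is.na(v))) NA else min(v, na.rm = TRUE)
safe_max <- function(v) if (all(is.na(v))) NA else max(v, na.rm = TRUE)

# Collapse the rows of one id into a single row
collapse_one <- function(x) {
  if (nrow(x) == 1) return(x)            # single rows are kept unchanged
  years <- c(x$testyr1, x$testyr2)       # pool both year columns
  ages  <- c(x$age1, x$age2)             # pool both age columns
  data.frame(id = x$id[1], project = x$project[1],
             testyr1 = safe_min(years), testyr2 = safe_max(years),
             age1 = safe_min(ages), age2 = safe_max(ages),
             stringsAsFactors = FALSE)
}

collapsed <- do.call(rbind, lapply(split(data, data$id), collapse_one))
rownames(collapsed) <- NULL              # drop the id-based row names
```

Like the ddply version, this takes project from the first row of each id, so it assumes one project per id.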

Hi Maiasaura, thanks for your code. However, I got an error message: Error in data.frame(id = unique(CADD.age.bd.u$id), project = unique(CADD.age.bd.u$project), : arguments imply differing number of rows: 1, 0 – user1566478 2012-07-31 22:27:10

Does that mean you have multiple projects per ID? If so, change the 'ddply' call to 'ddply(data, .(id, project), ...)' so that it splits by that combination. – Maiasaura 2012-07-31 23:29:26
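
A note on reading that error: the "1, 0" suggests that one argument to data.frame() had length zero. One possible cause, and this is only a guess because the asker's real data isn't shown, is a column that doesn't exist under that name, since `$` returns NULL for a missing column. Several projects per id would more likely produce extra rows than this error, because data.frame() recycles length-one arguments. The sketch below reproduces the length-zero case with a hypothetical column named 'Project' instead of 'project':

```r
# Hypothetical data: the column is called 'Project', not 'project'
x <- data.frame(id = "50S", Project = "AS", stringsAsFactors = FALSE)

missing_col <- x$project   # NULL, because no column has that exact name

# data.frame() cannot combine a length-1 argument with a length-0 one
msg <- tryCatch(
  {
    data.frame(id = unique(x$id), project = unique(x$project))
    "no error"
  },
  error = function(e) conditionMessage(e)
)

# Quick check before calling ddply: are all expected columns present?
needed  <- c("id", "project")
present <- all(needed %in% names(x))
```

Here present is FALSE, which points at the column names rather than at the grouping.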

0

I really doubt this is the most efficient way to do this, but my brain isn't functioning at the moment.

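# ids that occur more than once
temp = names(which(table(df$id) > 1))
temp1 = vector("list", length(temp))
for (i in seq_along(temp)) {    # seq_along() is safe when there are no duplicates
    rows = df[df$id == temp[i], ]
    # id and project come from the first row; min/max use the first year/age columns
    temp1[[i]] = data.frame(rows[1, 1:2],
        testyr1 = min(rows$testyr1),
        testyr2 = max(rows$testyr1),
        age1 = min(rows$age1),
        age2 = max(rows$age1))
}

# unduplicated rows first, then the collapsed rows
rbind(df[!df$id %in% temp, ], do.call(rbind, temp1))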
# id project testyr1 testyr2 age1 age2 
# 1 16S  AS 2008  NA 29 NA 
# 2 32S  AS 2004  NA 30 NA 
# 3 37S  AS  NA 2011 NA 36 
# 6 55S  AS 2007  NA 28 NA 
# 4 50S  AS 1998 2004 16 23 

### rm(i, temp, temp1) ### Cleanup the workspace 

Thanks for the script, mrdwab! But I got the error message: Error in rbind(deparse.level, ...) : numbers of columns of arguments do not match – user1566478 2012-07-31 22:58:54

I removed some extra columns and it worked. Thank you very much for your help! – user1566478 2012-07-31 23:50:41
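
A small sketch of why dropping the extra columns helps: rbind() on data frames needs the same columns on both sides, and the collapsed rows built above only have six. The 'site' column below is a hypothetical stand-in for whatever extra columns the real data had:

```r
full <- data.frame(
  id      = c("16S", "50S"),
  project = "AS",
  testyr1 = c(2008, 1998),
  testyr2 = c(NA, 2004),
  age1    = c(29, 16),
  age2    = c(NA, 23),
  site    = c("A", "B"),          # hypothetical extra column
  stringsAsFactors = FALSE
)

keep <- c("id", "project", "testyr1", "testyr2", "age1", "age2")
collapsed <- full[2, keep]        # stands in for a collapsed six-column row

# Seven columns against six: rbind() stops with an error
mismatch <- tryCatch(
  {
    rbind(full[1, ], collapsed)
    "no error"
  },
  error = function(e) conditionMessage(e)
)

# Restricting both sides to the same columns makes the bind work
ok <- rbind(full[1, keep], collapsed)
```

Selecting by name with keep, rather than by position as in 1:2, also keeps working if the column order changes.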

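@user1566478, if this or one of the other suggestions worked for you, be sure to upvote them or mark one as accepted, to help keep the "unanswered" queue in the R tag tidy. – A5C1D2H2I1M1N2O1R2T1 2012-12-06 10:15:37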