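Collapse duplicate rows with different values in several columns using R

In my data frame there are rows that share the same ID but have different values for test year and age. I want to collapse the duplicate rows and create new columns for the differing values.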

I am new to R and have been struggling with this.

Here is the data frame:

 
> df
   id project testyr1 testyr2 age1 age2
1 16S      AS    2008      NA   29   NA
2 32S      AS    2004      NA   30   NA
3 37S      AS      NA    2011   NA   36
4 50S      AS    2004      NA   23   NA
5 50S      AS    1998      NA   16   NA
6 55S      AS    2007      NA   28   NA

testyr1 should hold the earliest year and testyr2 the most recent year. age1 should be the younger age and age2 the older age.

The output should be:

 
   id project testyr1 testyr2 age1 age2
1 16S      AS    2008      NA   29   NA
2 32S      AS    2004      NA   30   NA
3 37S      AS      NA    2011   NA   36
4 50S      AS    1998    2004   16   23
6 55S      AS    2007      NA   28   NA

I tried to write a loop, but I don't know how to finish it:

df.undup <- c()
for (i in 1:nrow(df)){ 
    if i == i+1  
    df$testyr1 != NA { 

    testyr2 = max(testyr1) 
    testyr1 = min(testyr1) 
    nage2 = max(nage1) 
    nage1 = min(nage1) 
    } 
else{ 
    testyr2 = max(testyr2) 
    testyr1 = min(testyr2) 
    nage2 = max(nage2) 
    nage1 = min(nage2) 
    } 
}

Any help would be greatly appreciated.


2012-07-31 user1566478

Can there only ever be two duplicates? – nico 2012-07-31 20:30:09

library(plyr)

# Rebuild the sample data from the question
data <- read.csv(textConnection("id,project,testyr1,testyr2,age1,age2
16S,AS,2008,NA,29,NA
32S,AS,2004,NA,30,NA
37S,AS,NA,2011,NA,36
50S,AS,2004,NA,23,NA
50S,AS,1998,NA,16,NA
55S,AS,2007,NA,28,NA"))

# One row per id: smallest testyr1/age1 and largest testyr2/age2
new_data <- ddply(data, .(id), function(x) {
    data.frame(id = unique(x$id), project = unique(x$project),
               testyr1 = min(x$testyr1), testyr2 = max(x$testyr2),
               age1 = min(x$age1), age2 = max(x$age2))
})

> new_data
   id project testyr1 testyr2 age1 age2
1 16S      AS    2008      NA   29   NA
2 32S      AS    2004      NA   30   NA
3 37S      AS      NA    2011   NA   36
4 50S      AS    1998      NA   16   NA
5 55S      AS    2007      NA   28   NA

# But your result example suggests you want the lowest 
# of testyr to be in testyr1 and the highest of the combined 
# testyrs to be in testyr2. Same logic for ages. 
# If so, the one below should work: 

new_data <- ddply(data, .(id), function(x) {
    if (nrow(x) > 1) {
        # Duplicated id: pool both year columns and both age columns
        years <- c(x$testyr1, x$testyr2)
        ages <- c(x$age1, x$age2)
        data.frame(id = unique(x$id), project = unique(x$project),
                   testyr1 = min(years, na.rm = TRUE), testyr2 = max(years, na.rm = TRUE),
                   age1 = min(ages, na.rm = TRUE), age2 = max(ages, na.rm = TRUE))
    } else {
        # Single row: keep the values as they are
        data.frame(id = unique(x$id), project = unique(x$project),
                   testyr1 = x$testyr1, testyr2 = x$testyr2,
                   age1 = x$age1, age2 = x$age2)
    }
})

> new_data 
    id project testyr1 testyr2 age1 age2 
1 16S  AS 2008  NA 29 NA 
2 32S  AS 2004  NA 30 NA 
3 37S  AS  NA 2011 NA 36 
4 50S  AS 1998 2004 16 23 
5 55S  AS 2007  NA 28 NA


2012-07-31 20:41:27 Maiasaura
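The effect of the na.rm argument in the second version can be seen on the 50S values alone. This is only a small illustration in base R; the vector names `yr1`, `yr2` and `years` are made up for the example.

```r
# Values of the two 50S rows from the question
yr1 <- c(2004L, 1998L)
yr2 <- c(NA_integer_, NA_integer_)

# Without na.rm, a column that is entirely NA stays NA
max(yr2)                    # NA

# Pooling both columns and dropping NAs gives the earliest and latest year
years <- c(yr1, yr2)
earliest <- min(years, na.rm = TRUE)   # 1998
latest   <- max(years, na.rm = TRUE)   # 2004
```

Note that pooling the two columns of a single, non-duplicated row such as 16S would put 2008 into both testyr1 and testyr2, which is presumably why the answer passes single rows through unchanged.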

Hi Maiasaura, thanks for your code. However, I got an error message: Error in data.frame(id = unique(CADD.age.bd.u$id), project = unique(CADD.age.bd.u$project), : arguments imply differing number of rows: 1, 0 – user1566478 2012-07-31 22:27:10

That suggests there are multiple projects per ID. If so, change the 'ddply' call to 'ddply(data, .(id, project)' so that it splits on that combination. – Maiasaura 2012-07-31 23:29:26
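Splitting on the id and project combination, as the comment above suggests, can also be sketched without plyr. This is a minimal base R version, assuming the six columns shown in the question; the helper name `collapse_group` and the variables `groups` and `collapsed` are hypothetical, and single-row groups are passed through unchanged as in the answer.

```r
# Sample data from the question
data <- read.csv(text = "id,project,testyr1,testyr2,age1,age2
16S,AS,2008,NA,29,NA
32S,AS,2004,NA,30,NA
37S,AS,NA,2011,NA,36
50S,AS,2004,NA,23,NA
50S,AS,1998,NA,16,NA
55S,AS,2007,NA,28,NA", stringsAsFactors = FALSE)

# Hypothetical helper: collapse one (id, project) group to a single row
collapse_group <- function(x) {
  if (nrow(x) == 1) return(x)            # nothing to collapse
  years <- c(x$testyr1, x$testyr2)       # pool both year columns
  ages  <- c(x$age1, x$age2)             # pool both age columns
  data.frame(id = x$id[1], project = x$project[1],
             testyr1 = min(years, na.rm = TRUE),
             testyr2 = max(years, na.rm = TRUE),
             age1 = min(ages, na.rm = TRUE),
             age2 = max(ages, na.rm = TRUE))
}

# Split on the id/project combination, dropping combinations that do not occur
groups <- split(data, list(data$id, data$project), drop = TRUE)
collapsed <- do.call(rbind, lapply(groups, collapse_group))
rownames(collapsed) <- NULL
collapsed
```

Because the minimum and maximum are taken over every row in a group, the same sketch also copes with an ID that appears more than twice; only the two extremes are kept. A multi-row group whose years or ages are all NA would produce Inf with a warning, which this sketch does not handle.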

I really doubt that this is the most efficient way to do this, but my brain isn't functioning at the moment.

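# IDs that occur more than once
temp = names(which(table(df$id) > 1))
temp1 = vector("list")
for (i in seq_along(temp)) {
    # All rows for this ID
    temp1[[i]] = df[df$id == temp[i], ]
    # Keep id and project; take min and max from testyr1 and age1,
    # which is where the duplicated rows hold their values
    temp1[[i]] = data.frame(temp1[[i]][1, 1:2],
        testyr1 = min(temp1[[i]]$testyr1),
        testyr2 = max(temp1[[i]]$testyr1),
        age1 = min(temp1[[i]]$age1),
        age2 = max(temp1[[i]]$age1))
}

# Non-duplicated rows first, then the collapsed rows
rbind(df[!(df$id %in% temp), ], do.call(rbind, temp1))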
# id project testyr1 testyr2 age1 age2 
# 1 16S  AS 2008  NA 29 NA 
# 2 32S  AS 2004  NA 30 NA 
# 3 37S  AS  NA 2011 NA 36 
# 6 55S  AS 2007  NA 28 NA 
# 4 50S  AS 1998 2004 16 23 

### rm(i, temp, temp1) ### Cleanup the workspace


2012-07-31 20:46:08 A5C1D2H2I1M1N2O1R2T1

Thanks for the script, mrdwab! But I got an error message: Error in rbind(deparse.level, ...) : numbers of columns of arguments do not match – user1566478 2012-07-31 22:58:54

I removed some extra columns and it worked. Thank you very much for your help! – user1566478 2012-07-31 23:50:41
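The column-count error is what rbind reports when the data frames it receives have different numbers of columns. The loop in the answer rebuilds each collapsed row with exactly six columns, so any additional columns in df cause a mismatch. A minimal sketch of the fix described in the comment above, assuming the six column names from the question and an extra column called `notes` (a made-up name for illustration):

```r
# Toy data frame with one extra column ("notes" is hypothetical)
df <- data.frame(id = c("16S", "50S", "50S"),
                 project = "AS",
                 testyr1 = c(2008, 2004, 1998),
                 testyr2 = NA_real_,
                 age1 = c(29, 23, 16),
                 age2 = NA_real_,
                 notes = c("a", "b", "c"),
                 stringsAsFactors = FALSE)

# Keep only the six columns the collapsing code expects, in that order
keep <- c("id", "project", "testyr1", "testyr2", "age1", "age2")
df <- df[, keep]

# A collapsed six-column row now binds without a column-count error
collapsed <- data.frame(id = "50S", project = "AS",
                        testyr1 = 1998, testyr2 = 2004,
                        age1 = 16, age2 = 23,
                        stringsAsFactors = FALSE)
result <- rbind(df[df$id != "50S", ], collapsed)
result
```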

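@user1566478, if this or any of the other suggestions worked for you, be sure to upvote them or mark one as accepted, to help keep the "unanswered" queue in the R tag tidy. – A5C1D2H2I1M1N2O1R2T1 2012-12-06 10:15:37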
