2013-02-18 76 views
4

Adding columns of a data frame based on their column names

Example: I have a data frame:

> a = data.frame(T_a_1=c(1,2,3,4,5),T_a_2=c(2,3,4,5,6),T_b_1=c(3,4,5,6,7),T_c_1=c(4,5,6,7,8),length=c(1,2,3,4,5)) 
> a  
  T_a_1 T_a_2 T_b_1 T_c_1 length
1     1     2     3     4      1
2     2     3     4     5      2
3     3     4     5     6      3
4     4     5     6     7      4
5     5     6     7     8      5

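I want to add columns together (or do some other operation, such as (column1 + column2)/length) based on the column names. For example, T_a is the name shared by the first and second columns (T_a_1 and T_a_2), so I want to add those two.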

Answers

3

I would use grep to match the column names against a pattern. Here are some examples:

> a = data.frame(T_a_1=c(1,2,3,4,5), 
+    T_a_2=c(2,3,4,5,6), 
+    T_b_1=c(3,4,5,6,7), 
+    T_c_1=c(4,5,6,7,8), 
+    length=c(1,2,3,4,5)) 
> 
> # display only columns that match T_a 
> a[,grep('T_a', colnames(a))] 
  T_a_1 T_a_2
1     1     2
2     2     3
3     3     4
4     4     5
5     5     6
> 
> # sum 
> sum(a[,grep('T_a', colnames(a))]) 
[1] 35 
> 
> #rowsum 
> rowSums(a[,grep('T_a', colnames(a))]) 
[1] 3 5 7 9 11 
> 
> # your example: (column1 + column2)/length
> rowSums(a[,grep('T_a', colnames(a))])/a$length 
[1] 3.000000 2.500000 2.333333 2.250000 2.200000 
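
One caveat with the examples above: the pattern 'T_a' is unanchored, and when a pattern matches only a single column, a[, idx] drops to a plain vector and rowSums() raises an error. A minimal sketch of a guard, assuming the group name is always followed by an underscore; sum_by_prefix and its len_col argument are hypothetical names, not part of the answer:

```r
a <- data.frame(T_a_1 = c(1, 2, 3, 4, 5),
                T_a_2 = c(2, 3, 4, 5, 6),
                T_b_1 = c(3, 4, 5, 6, 7),
                T_c_1 = c(4, 5, 6, 7, 8),
                length = c(1, 2, 3, 4, 5))

# Hypothetical helper (an assumption, not from the answer): sum the columns
# whose names start with `prefix` followed by "_", then divide by `len_col`.
sum_by_prefix <- function(df, prefix, len_col = "length") {
  # "^" anchors the match at the start of the column name
  idx <- grep(paste0("^", prefix, "_"), colnames(df))
  # drop = FALSE keeps a data frame even when only one column matches
  rowSums(df[, idx, drop = FALSE]) / df[[len_col]]
}

res_a <- sum_by_prefix(a, "T_a")  # two matching columns
res_b <- sum_by_prefix(a, "T_b")  # a single matching column
```

The same call works for T_c, which also has only one column; without drop = FALSE both single-column cases would fail.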

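Update:

From the comments below, I understand that you want to sum the columns that share a common prefix, group by group, and divide each sum by the length column. The following code is an inelegant solution to that problem: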

> a = data.frame(ES51_223_1=c(1,2,3,4,5), 
+    ES51_312_1=c(2,3,4,5,6), 
+    ES52_223_2=c(3,4,5,6,7), 
+    ES52_312_2=c(4,5,6,7,8), 
+    ES53_223_3=c(1,2,3,4,5), 
+    length=c(1,2,3,4,5)) 
> 
> # get the unique prefixes 
> prefixes = unique(unlist(lapply(colnames(subset(a, select=-length)), function(x) { strsplit(x, '_')[[1]][[1]]}))) 
> 
> f = function(prefix) { 
+ return (rowSums(subset(a, select=grep(prefix, colnames(a))))/a$length) 
+ } 
> m = matrix(unlist(lapply(prefixes, f)), nrow=nrow(a)) 
> colnames(m) = prefixes 
> m 
         ES51     ES52 ES53
[1,] 3.000000 7.000000    1
[2,] 2.500000 4.500000    1
[3,] 2.333333 3.666667    1
[4,] 2.250000 3.250000    1
[5,] 2.200000 3.000000    1

m is the resulting matrix, with the result for each prefix in its own column.

Thanks. But I don't know the pattern in advance. I only know there will be a "*_*". I was trying to use lapply with strsplit, but I don't know what I'm doing – user1631306 2013-02-18 20:26:25

@user1631306, what **exactly** is the format of your column names? – Arun 2013-02-18 20:28:42

They are "ES51_223_1 ES51_312_1 ES52_223_2 ES52_312_2 ES53_223_3". So I would take the first part, before the "_" – user1631306 2013-02-18 20:30:00
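
For the lapply/strsplit attempt mentioned in the comments, a minimal sketch of pulling out the part before the first "_" might look like this. The variable names cols, parts, prefixes and prefixes_alt are illustrative only:

```r
cols <- c("ES51_223_1", "ES51_312_1", "ES52_223_2", "ES52_312_2", "ES53_223_3")

# strsplit() returns a list with one character vector per column name
parts <- strsplit(cols, "_", fixed = TRUE)

# take the first piece of each name; vapply() checks that every
# result is a single string
prefixes <- vapply(parts, function(p) p[1], character(1))

# equivalent one-liner: delete everything from the first "_" onwards
prefixes_alt <- sub("_.*$", "", cols)

groups <- unique(prefixes)
```

Either form gives one prefix per column; unique() then reduces them to the distinct group names.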

2

How about this?

# data 
df <- structure(list(ES51_223_1 = 1:5, ES51_312_1 = 2:6, ES52_223_2 = 3:7, 
     ES52_312_2 = 4:8, ES53_223_3 = 1:5, length = 1:5), 
     .Names = c("ES51_223_1", "ES51_312_1", "ES52_223_2", "ES52_312_2", 
     "ES53_223_3", "length"), row.names = c(NA, -5L), class = "data.frame") 

# create indices from factor levels (shortcut) 
ids <- gsub("_.*$", "", setdiff(names(df), "length")) 
ids <- factor(as.numeric(factor(ids))) 
ids
# [1] 1 1 2 2 3 
# Levels: 1 2 3 

# use the levels to fetch columns and sum them 
o <- sapply(as.numeric(levels(ids)), function(x) { 
    rowSums(df[which(ids == x)])/df$length 
}) 

o
#          [,1]     [,2] [,3]
# [1,] 3.000000 7.000000    1
# [2,] 2.500000 4.500000    1
# [3,] 2.333333 3.666667    1
# [4,] 2.250000 3.250000    1
# [5,] 2.200000 3.000000    1
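
The result above loses the prefix names, since the columns are labelled only by position. A possible variant, sketched here under the assumption that the prefix is everything before the first "_", groups the column names with split() so that the output columns carry the prefixes. The names value_cols, groups and o_named are illustrative:

```r
df <- data.frame(ES51_223_1 = 1:5, ES51_312_1 = 2:6, ES52_223_2 = 3:7,
                 ES52_312_2 = 4:8, ES53_223_3 = 1:5, length = 1:5)

# every column except the divisor
value_cols <- setdiff(names(df), "length")

# named list: prefix -> column names that share that prefix
groups <- split(value_cols, sub("_.*$", "", value_cols))

# one result column per prefix; df[cols] stays a data frame even
# when a group has a single column, so rowSums() is safe
o_named <- sapply(groups, function(cols) rowSums(df[cols]) / df$length)
```

Because sapply() keeps the names of the list it iterates over, o_named has columns ES51, ES52 and ES53, and cbind(df, o_named) would attach them to the original data frame.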