2017-02-27

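Subset a matrix by a list and apply a function

I have a symmetric matrix that I need to subset by columns according to a list, applying a function to each subset. How can I speed up or improve the process?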

My current code is something like this:

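funs <- function(x, y, data) {
    if (all(colnames(data) %in% x) & all(colnames(data) %in% y)) {
        # every column name of data is in both x and y: index by name
        mean(data[x, y])
    } else if (any(colnames(data) %in% x) & any(colnames(data) %in% y)) {
        # partial overlap: keep only the rows/columns named in x and y
        mean(data[colnames(data) %in% x, colnames(data) %in% y])
    } else {
        # no overlap at all
        NA
    }
}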

vfuns <- Vectorize(funs, vectorize.args = c("x", "y")) 

outer(l, l, vfuns, data = mat)
#            2  9 10 15 16 18
# 2  0.2277186 NA NA NA NA NA
# 9         NA NA NA NA NA NA
# 10        NA NA NA NA NA NA
# 15        NA NA NA NA NA NA
# 16        NA NA NA NA NA NA
# 18        NA NA NA NA NA NA

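In an earlier version I computed the matrix for each combination, but that approach ended up computing some comparisons twice (or more) and was quite slow. With the current approach I still compute each comparison twice, since funs("2", "9", data = mat) == funs("9", "2", data = mat), but no more than that. Things I have thought of to improve performance: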

  • "Tell" outer that the result is symmetric: how?
  • Convert the list to an environment to speed up lookups (Error: attempt to replicate an object of type 'environment')
  • Parallelize outer?
  • ??

The list:

l <- structure(list(`2` = c("109582", "114608", "140837", "140877", 
"1474228", "1474244", "162582", "194315", "194840", "76002", 
"76005"), `9` = c("1430728", "156580", "156582", "211859"), `10` = c("1430728", 
"156580", "156582", "211859"), `15` = c("1430728", "209776", 
"209931", "71291"), `16` = c("379716", "379724", "74160"), `18` = c("112310", 
"112315", "112316", "888590", "916853")), .Names = c("2", "9", 
"10", "15", "16", "18")) 

The matrix:

mat <- structure(c(1, 0.305084745762712, 0.0728051391862955, 0.151950718685832, 
0.035778175313059, 0.128755364806867, 0.157080523601745, 0.127659574468085, 
0.0452173913043478, 0.591549295774648, 0.32089552238806, 0.305084745762712, 
1, 0.102040816326531, 0.186440677966102, 0.0421052631578947, 
0.127272727272727, 0.0306691449814126, 0.0232558139534884, 0.00970873786407767, 
0.6, 0.970059880239521, 0.0728051391862955, 0.102040816326531, 
1, 0.62962962962963, 0.0317460317460317, 0.0225563909774436, 
0.00383141762452107, 0.00546448087431694, 0.0140845070422535, 
0.0970873786407767, 0.0970873786407767, 0.151950718685832, 0.186440677966102, 
0.62962962962963, 1, 0.0273972602739726, 0.041958041958042, 0.00759013282732448, 
0.00518134715025907, 0., 0.150442477876106, 0.178861788617886, 
0.035778175313059, 0.0421052631578947, 0.0317460317460317, 0.0273972602739726, 
1, 0.608938547486033, 0.0284403669724771, 0.0131004366812227, 
0.00854700854700855, 0.0402684563758389, 0.041025641025641, 0.128755364806867, 
0.127272727272727, 0.0225563909774436, 0.041958041958042, 0.608938547486033, 
1, 0.0491379310344828, 0.0133779264214047, 0.0053475935828877, 
0.10958904109589, 0.13134328358209, 0.157080523601745, 0.0306691449814126, 
0.00383141762452107, 0.00759013282732448, 0.0284403669724771, 
0.0491379310344828, 1, 0.288429752066116, 0.11384335154827, 0.111504424778761, 
0.0333796940194715, 0.127659574468085, 0.0232558139534884, 0.00546448087431694, 
0.00518134715025907, 0.0131004366812227, 0.0133779264214047, 
0.288429752066116, 1, 0.527426160337553, 0.0780669144981413, 
0.0229885057471264, 0.0452173913043478, 0.00970873786407767, 
0.0140845070422535, 0., 0.00854700854700855, 
0.0053475935828877, 0.11384335154827, 0.527426160337553, 1, 0.0636942675159236, 
0.00947867298578199, 0.591549295774648, 0.6, 0.0970873786407767, 
0.150442477876106, 0.0402684563758389, 0.10958904109589, 0.111504424778761, 
0.0780669144981413, 0.0636942675159236, 1, 0.625454545454545, 
0.32089552238806, 0.970059880239521, 0.0970873786407767, 0.178861788617886, 
0.041025641025641, 0.13134328358209, 0.0333796940194715, 0.0229885057471264, 
0.00947867298578199, 0.625454545454545, 1), .Dim = c(11L, 11L 
), .Dimnames = list(c("109582", "114608", "140837", "140877", 
"1474228", "1474244", "162582", "194315", "194840", "76002", 
"76005"), c("109582", "114608", "140837", "140877", "1474228", 
"1474244", "162582", "194315", "194840", "76002", "76005"))) 

Answer


Maybe I misunderstand your question, but I think that where you have x %in% colnames(mat) you actually want colnames(mat) %in% x:

x <- l[[1]][c(1,3,5)] # x is the 1st, 3rd, 5th entry 
x %in% colnames(mat) 
# [1] TRUE TRUE TRUE # returns a vector length 3 
# indexing mat by x %in% colnames(mat) returns the full matrix, because c(TRUE, TRUE, TRUE) is recycled up to the dimensions of mat
mat[x %in% colnames(mat), x %in% colnames(mat)] 

colnames(mat) %in% x 
# [1] TRUE FALSE TRUE FALSE TRUE FALSE FALSE FALSE FALSE FALSE FALSE 
# returns TRUE only for 1st, 3rd, 5th element which is what we want 
mat[colnames(mat) %in% x, colnames(mat) %in% x] # 3 x 3 matrix 

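With colnames(mat) %in% x you no longer need the if statement in your funs function: mat[colnames(mat) %in% x, colnames(mat) %in% x] ensures that only the rows and columns flagged TRUE are returned.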
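A minimal sketch of what dropping the if statement could look like. The name funs2 and the toy matrix are illustrative assumptions, not part of the original post, and the sketch assumes the row and column names of the matrix are identical, as they are for the symmetric mat above:

```r
# Hypothetical simplified version of funs: no if/else, only logical indexing.
# Assumes rownames(data) and colnames(data) are identical (symmetric matrix).
funs2 <- function(x, y, data) {
  rows <- colnames(data) %in% x
  cols <- colnames(data) %in% y
  mean(data[rows, cols])
}

# Toy symmetric matrix with made-up names, for illustration only.
toy <- matrix(c(1.0, 0.2, 0.4,
                0.2, 1.0, 0.6,
                0.4, 0.6, 1.0),
              nrow = 3, dimnames = list(c("a", "b", "c"), c("a", "b", "c")))

funs2(c("a", "b"), c("c", "zzz"), toy)  # mean of toy[c("a", "b"), "c"]; "zzz" is ignored
funs2("zzz", "c", toy)                  # nothing matches: the mean of an empty selection is NaN
```

Names that do not occur in the matrix are simply ignored, so the "all" and "any" branches of the original function collapse into a single expression.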

# compare 
x <- l[[1]] # all cases 
mat[colnames(mat) %in% x, colnames(mat) %in% x] 
x <- l[[1]][c(1,3,5)] # any cases 
mat[colnames(mat) %in% x, colnames(mat) %in% x] 
x <- c(l[[1]][c(1,3,5)], "2", "22", "222") # any cases 
mat[colnames(mat) %in% x, colnames(mat) %in% x] 
x <- c("2", "22", "222") # none 
mat[colnames(mat) %in% x, colnames(mat) %in% x] # empty matrix 

Now you can simply use sapply, either directly or inside your funs function, to compute over the matrix subsets:

sapply(l, function(x) mean(mat[colnames(mat) %in% x, colnames(mat) %in% x])) 
sapply(l, function(x) mean(mat[colnames(mat) %in% x, colnames(mat) %in% x], na.rm=TRUE)) # also consider na.rm parameter if needed 

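The output is NaN wherever the subset is an empty matrix (the mean of an empty matrix is NaN), but you can simply replace every NaN with NA afterwards.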
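One possible way to do that replacement, shown on a made-up result vector standing in for the sapply output (the values below are illustrative, not computed from mat):

```r
# Toy result vector standing in for the sapply() output (values are made up).
res <- c(`2` = 0.23, `9` = NaN, `10` = NaN)

# is.nan() is TRUE only for NaN, so existing values are left untouched.
res[is.nan(res)] <- NA
res
```

The same indexing works unchanged on a matrix of results, since is.nan() returns a logical matrix of the same shape.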

Edit: extended to the requested pairwise comparison

# matrix form 
sapply(l, function(x) sapply(l, function (y) mean(mat[colnames(mat) %in% x, colnames(mat) %in% y]))) 

# list form 
lapply(l, function(x) sapply(l, function (y) mean(mat[colnames(mat) %in% x, colnames(mat) %in% y]))) 
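Neither sapply nor outer knows that the result is symmetric, so both evaluate every pair twice. A rough sketch of one way to exploit the symmetry is to fill only the upper triangle and mirror it, which needs n(n+1)/2 calls instead of n^2. The helper names pairwise_sym and block_mean, and the toy data, are assumptions made for illustration; the approach is only valid when f(x, y) equals f(y, x), as it does for the mean of a block of a symmetric matrix:

```r
# Hypothetical helper: evaluate f only for i <= j and mirror the value,
# assuming f(x, y, data) == f(y, x, data) for every pair.
pairwise_sym <- function(lst, data, f) {
  n <- length(lst)
  out <- matrix(NA_real_, n, n, dimnames = list(names(lst), names(lst)))
  for (i in seq_len(n)) {
    for (j in i:n) {
      out[i, j] <- f(lst[[i]], lst[[j]], data)
      out[j, i] <- out[i, j]
    }
  }
  out
}

# Mean of the block selected by two sets of names (same indexing as above).
block_mean <- function(x, y, data) {
  mean(data[colnames(data) %in% x, colnames(data) %in% y])
}

# Toy symmetric matrix and groups with made-up names, for illustration only.
toy <- matrix(c(1.0, 0.2, 0.4,
                0.2, 1.0, 0.6,
                0.4, 0.6, 1.0),
              nrow = 3, dimnames = list(c("a", "b", "c"), c("a", "b", "c")))
groups <- list(g1 = c("a", "b"), g2 = "c", g3 = "zzz")

res <- pairwise_sym(groups, toy, block_mean)
res[is.nan(res)] <- NA   # groups with no matching names give NaN
res
```

Whether this is actually faster than outer on real data would need to be measured; the explicit loop halves the number of calls to f but adds some interpreter overhead.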
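
While this was certainly an error in my question, the question itself is about doing the pairwise comparison of all the elements – Llopis

See the edited answer, where I extend this to pairwise comparison. You also asked for improvements, and I pointed out that the indexing you originally wrote, 'x %in% colnames(mat)', is incorrect and should be 'colnames(mat) %in% x', and that the if statement is not necessary: you get similar results, with NaN instead of NA, by simply using 'funs <- function(x, y, data) mean(data[colnames(data) %in% x, colnames(data) %in% y])'. – Djork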

outer already does what I want; how would using two nested sapply calls be faster or better? – Llopis