2017-03-05 34 views
3

Correlation matrix from a data.table

Suppose I have the following data.table:

library(data.table)
set.seed(1)
TDT <- data.table(Group = c(rep("A",40),rep("B",60)), 
         Id = c(rep(1,20),rep(2,20),rep(3,20),rep(4,20),rep(5,20)), 
         Time = rep(seq(as.Date("2010-01-03"), length=20, by="1 month") - 1,5), 
         norm = round(runif(100)/10,2), 
         x1 = sample(100,100), 
         x2 = round(rnorm(100,0.75,0.3),2), 
         x3 = round(rnorm(100,0.75,0.3),2), 
         x4 = round(rnorm(100,0.75,0.3),2), 
         x5 = round(rnorm(100,0.75,0.3),2)) 

How can I compute the correlations among x1, x2, x3, x4 and x5 by Time?

This:

TDT[,x:= list(cor(TDT[,5:9])), by = Time] 

does not work.

How can this be done in data.table?

+0

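Your data does not have multiple observations for each combination of Id and Time, which is what you would need to compute correlations. Try `TDT[Id == 1 & Time == "2010-01-02"]`, or any other combination of Id and Time. Each has only one row. –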

+0

@RoseHartman Sorry, I meant by Time only. – user3032689

Answers

1

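You were so close with your attempt! What you were missing is an extra list(). Note also that j has to refer to the group's own columns: TDT[,5:9] inside j is always the whole table, so every group would get the same matrix.

This works:

TDT[, x := list(list(cor(cbind(x1, x2, x3, x4, x5)))), by = Time]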

And TDT$x returns:

[[1]]
            x1          x2          x3         x4          x5
x1  1.00000000  0.72185099  0.07368766 -0.7031890 -0.36895449
x2  0.72185099  1.00000000  0.68058833 -0.7393130  0.05066973
x3  0.07368766  0.68058833  1.00000000 -0.5021462  0.10645894
x4 -0.70318896 -0.73931299 -0.50214616  1.0000000  0.11671020
x5 -0.36895449  0.05066973  0.10645894  0.1167102  1.00000000

[[2]]
           x1         x2          x3          x4         x5
x1  1.0000000 -0.1011948 -0.85191422 -0.15571603  0.4855237
x2 -0.1011948  1.0000000  0.56691559 -0.44002621 -0.6699172
x3 -0.8519142  0.5669156  1.00000000  0.02189754 -0.6168013
x4 -0.1557160 -0.4400262  0.02189754  1.00000000  0.2236542
x5  0.4855237 -0.6699172 -0.61680132  0.22365419  1.0000000

[...] 

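The extra list() is needed because of how data.table parses the second argument (j) of the DT[i, j] syntax: the outer list is read as the set of columns, so a value that should sit in a single cell of a list column has to be wrapped in a list of its own. This has been discussed in depth elsewhere on Stack Overflow, and I invite you to read this most excellent answer.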
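To make the role of the two list() levels concrete, here is a minimal sketch on a hypothetical toy table (DT, g, u, v and m are made-up names, not part of the question). It assumes only that data.table is installed.

```r
library(data.table)

# Hypothetical toy table (not the TDT from the question)
DT <- data.table(g = rep(c("a", "b"), each = 3),
                 u = c(1, 2, 3, 4, 5, 6),
                 v = c(2, 1, 4, 3, 6, 5))

# Outer list(): the columns being assigned (here just one, m).
# Inner list(): makes the whole matrix a single element of a list column,
# which is then repeated for every row of the group.
DT[, m := list(list(cor(cbind(u, v)))), by = g]

class(DT$m)      # "list": m is a list column
dim(DT$m[[1]])   # every element is a complete 2 x 2 matrix
```

With only one list() level, the matrix itself would be taken as the column's values, and its length generally does not match the group size.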

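As a side note, it seems better to replace the outermost call to list() with .() to make the intent clearer. I also like to reference the columns explicitly through .SD and .SDcols. With the same result, you could rewrite your code as: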

TDT[, x := .(list(cor(.SD))), by = Time, .SDcols = 5:9] 
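If repeating the matrix on every row of a group is not wanted, a possible variant is to drop := so that the call returns a new table with one row per Time. The sketch below rebuilds a trimmed copy of the question's table (only Time and x1 to x3, so the random numbers will not match the output shown above); res is a hypothetical name.

```r
library(data.table)

set.seed(1)
# Same layout as the question's table, trimmed to the columns used here
TDT <- data.table(
  Time = rep(seq(as.Date("2010-01-03"), length.out = 20, by = "1 month") - 1, 5),
  x1 = sample(100, 100),
  x2 = round(rnorm(100, 0.75, 0.3), 2),
  x3 = round(rnorm(100, 0.75, 0.3), 2)
)

# Without :=, j builds a new table: one row and one matrix per Time
res <- TDT[, .(x = list(cor(.SD))), by = Time, .SDcols = c("x1", "x2", "x3")]

nrow(res)     # one row per distinct Time
res$x[[1]]    # correlation matrix for the first Time value
```

Naming the columns in .SDcols rather than using positions such as 5:9 keeps the call valid if columns are later added or reordered.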
1

You may find the corrr package useful for this. Combined with dplyr commands, it makes it easy to get correlation matrices for different groups.

library(data.table) # not necessary unless you want the data in this format for other reasons 
library(dplyr) 
library(corrr) 

Get the correlation matrix for each Id:

> TDT %>% 
+ group_by(Id) %>% 
+ do({ 
+  correlate(select(., x1:x5)) 
+  }) 
Source: local data frame [25 x 7] 
Groups: Id [5] 

     Id rowname   x1   x2   x3   x4   x5 
    <dbl> <chr>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl> 
1  1  x1   NA -0.246252411 -0.24589380 -0.181120555 0.14781414 
2  1  x2 -0.24625241   NA 0.32098291 -0.175603686 -0.08863810 
3  1  x3 -0.24589380 0.320982911   NA 0.161336670 0.07934436 
4  1  x4 -0.18112056 -0.175603686 0.16133667   NA -0.19662700 
5  1  x5 0.14781414 -0.088638098 0.07934436 -0.196627000   NA 
6  2  x1   NA 0.075760735 0.41276725 0.425032505 0.37178993 
7  2  x2 0.07576074   NA 0.07747543 -0.004202306 -0.08086958 
8  2  x3 0.41276725 0.077475426   NA 0.248151847 0.07619264 
9  2  x4 0.42503251 -0.004202306 0.24815185   NA 0.37647798 
10  2  x5 0.37178993 -0.080869584 0.07619264 0.376477979   NA 
# ... with 15 more rows 

Get the correlation matrix for each Time:

> TDT %>% 
+ group_by(Time) %>% 
+ do({ 
+  correlate(select(., x1:x5)) 
+ }) 
Source: local data frame [100 x 7] 
Groups: Time [20] 

     Time rowname   x1   x2   x3   x4   x5 
     <date> <chr>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl> 
1 2010-01-02  x1   NA -0.66584960 -0.58788152 0.92540707 0.37316217 
2 2010-01-02  x2 -0.66584960   NA -0.06102424 -0.69292534 0.19440850 
3 2010-01-02  x3 -0.58788152 -0.06102424   NA -0.54623949 -0.78714932 
4 2010-01-02  x4 0.92540707 -0.69292534 -0.54623949   NA 0.53697784 
5 2010-01-02  x5 0.37316217 0.19440850 -0.78714932 0.53697784   NA 
6 2010-02-02  x1   NA -0.10444724 -0.62424401 0.30109335 0.04834057 
7 2010-02-02  x2 -0.10444724   NA -0.12010431 0.08966978 -0.68762698 
8 2010-02-02  x3 -0.62424401 -0.12010431   NA -0.92782037 0.52099983 
9 2010-02-02  x4 0.30109335 0.08966978 -0.92782037   NA -0.58214861 
10 2010-02-02  x5 0.04834057 -0.68762698 0.52099983 -0.58214861   NA 
# ... with 90 more rows 
+0

Very nice, ty. Do you also know how to do this in `data.table`? – user3032689

+0

No, I don't, sorry :) –

1

split by Time, then run cor on each subgroup:

lapply(split(TDT, TDT$Time), function(a) cor(a[,5:9])) 

#OR 

lapply(split(TDT[,5:9], TDT$Time), cor) 
+0

Thanks, that works too, but it doesn't use `data.table` syntax. – user3032689

+0

@user3032689, `TDT[, 5:9][, cor(.SD), by = TDT$Time]`? –

+1

Oh, that works, but it seems odd to me that you can split by `Time` when it is no longer contained in `TDT[, 5:9]`. – user3032689
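On the point raised in the last comment: `by` is not restricted to column names, it can also take a vector with one value per row, which is presumably why `TDT$Time` is accepted there. A sketch that keeps the grouping inside a single call instead, and returns a long table similar in shape to the corrr output, could look like this. It uses a trimmed copy of the question's table (so the numbers will differ), and res is a hypothetical name.

```r
library(data.table)

set.seed(1)
# Trimmed copy of the question's table: Time plus three numeric columns
TDT <- data.table(
  Time = rep(seq(as.Date("2010-01-03"), length.out = 20, by = "1 month") - 1, 5),
  x1 = sample(100, 100),
  x2 = round(rnorm(100, 0.75, 0.3), 2),
  x3 = round(rnorm(100, 0.75, 0.3), 2)
)

# Convert each group's correlation matrix to a small table; keep.rownames = TRUE
# stores the matrix row names in a column called "rn"
res <- TDT[, as.data.table(cor(.SD), keep.rownames = TRUE),
           by = Time, .SDcols = c("x1", "x2", "x3")]

head(res)   # columns: Time, rn, x1, x2, x3; three rows per Time
```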