2013-04-21 84 views
2

How to avoid the optimization warning in data.table

I have the following code:

> dt <- data.table(a=c(rep(3,5),rep(4,5)),b=1:10,c=11:20,d=21:30,key="a") 
> dt 
    a b c d 
1: 3 1 11 21 
2: 3 2 12 22 
3: 3 3 13 23 
4: 3 4 14 24 
5: 3 5 15 25 
6: 4 6 16 26 
7: 4 7 17 27 
8: 4 8 18 28 
9: 4 9 19 29 
10: 4 10 20 30 
> dt[,lapply(.SD,sum),by="a"] 
Finding groups (bysameorder=TRUE) ... done in 0secs. bysameorder=TRUE and o__ is length 0 
Optimized j from 'lapply(.SD, sum)' to 'list(sum(b), sum(c), sum(d))' 
Starting dogroups ... done dogroups in 0 secs 
    a b c d 
1: 3 15 65 115 
2: 4 40 90 140 
> dt[,c(count=.N,lapply(.SD,sum)),by="a"] 
Finding groups (bysameorder=TRUE) ... done in 0secs. bysameorder=TRUE and o__ is length 0 
Optimization is on but j left unchanged as 'c(count = .N, lapply(.SD, sum))' 
Starting dogroups ... The result of j is a named list. It's very inefficient to create the same names over and over again for each group. When j=list(...), any names are detected, removed and put back after grouping has completed, for efficiency. Using j=transform(), for example, prevents that speedup (consider changing to :=). This message may be upgraded to warning in future. 
done dogroups in 0 secs 
    a count b c d 
1: 3  5 15 65 115 
2: 4  5 40 90 140 

How do I avoid the scary "very inefficient" warning?

I could add a count column before aggregating:

> dt$count <- 1 
> dt 
    a b c d count 
1: 3 1 11 21  1 
2: 3 2 12 22  1 
3: 3 3 13 23  1 
4: 3 4 14 24  1 
5: 3 5 15 25  1 
6: 4 6 16 26  1 
7: 4 7 17 27  1 
8: 4 8 18 28  1 
9: 4 9 19 29  1 
10: 4 10 20 30  1 
> dt[,lapply(.SD,sum),by="a"] 
Finding groups (bysameorder=TRUE) ... done in 0secs. bysameorder=TRUE and o__ is length 0 
Optimized j from 'lapply(.SD, sum)' to 'list(sum(b), sum(c), sum(d), sum(count))' 
Starting dogroups ... done dogroups in 0 secs 
    a b c d count 
1: 3 15 65 115  5 
2: 4 40 90 140  5 

But this does not look very elegant...

+1

Do you want to "suppress" the warning, or to do things efficiently? – Arun 2013-04-21 15:27:15

+1

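I never said "suppress". I said "avoid", which means I want to do the right thing and make my code work correctly and efficiently, so that no warning is needed. – sds 2013-04-21 16:23:56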

+0

Clearly, I was not sure whether you wanted to avoid "seeing" the warning or avoid "having" it. – Arun 2013-04-21 16:37:52

Answers

2

One way I can think of is to assign count by reference:

dt.out <- dt[, lapply(.SD,sum), by = a] 
dt.out[, count := dt[, .N, by=a][, N]] 
# alternatively: count := table(dt$a) 

# a b c d count 
# 1: 3 15 65 115  5 
# 2: 4 40 90 140  5 

Edit 1: I still think this is just a message and not a warning. But if you still want to avoid it, just do:

dt.out[, count := as.numeric(dt[, .N, by=a][, N])] 

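Edit 2: Very interesting. Doing the equivalent with the multiple-assignment form of := does not produce the same message (the verbose output reports a direct plonk with no copy).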

dt.out[, `:=`(count = dt[, .N, by=a][, N])] 
# Detected that j uses these columns: a 
# Finding groups (bysameorder=TRUE) ... done in 0.001secs. bysameorder=TRUE and o__ is length 0 
# Detected that j uses these columns: <none> 
# Optimization is on but j left unchanged as '.N' 
# Starting dogroups ... done dogroups in 0 secs 
# Detected that j uses these columns: N 
# Assigning to all 2 rows 
# Direct plonk of unnamed RHS, no copy. 

+0

This produces a warning: "RHS for item 1 has been duplicated. Either NAMED vector or recycled list RHS." – sds 2013-04-21 16:41:12

+0

How do you say this is a warning? It does not mention anything about inefficiency... it is just a message. Anyway, I have made an edit so that you do not get this message. – Arun 2013-04-21 17:14:22

+0

I think you may find 'dt[, .N, by = a][["N"]]' more efficient, since it avoids the overhead of calling '[.data.table' when you are simply subsetting. – mnel 2013-04-21 23:48:28
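
The suggestion in the comment above can be sketched as follows. This is only an illustration under stated assumptions: the intermediate name counts is hypothetical, and the assignment relies on both grouped results listing the groups of a in the same order, which holds here because both come from the same keyed table.

```r
library(data.table)

dt <- data.table(a = c(rep(3, 5), rep(4, 5)), b = 1:10, c = 11:20, d = 21:30, key = "a")

# Per-group sums, one row per value of a
dt.out <- dt[, lapply(.SD, sum), by = a]

# Per-group row counts (hypothetical intermediate, kept for readability)
counts <- dt[, .N, by = a]

# [["N"]] pulls the column out as a plain vector, so `[.data.table`
# is not called a second time just to extract one column
dt.out[, count := counts[["N"]]]

print(dt.out)
```

Whether the difference is measurable depends on the data; for a table this small it is unlikely to matter.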

2

This solution removes the message about named elements. But you have to put the names back afterwards.

require(data.table) 
options(datatable.verbose = TRUE) 

dt <- data.table(a=c(rep(3,5),rep(4,5)),b=1:10,c=11:20,d=21:30,key="a") 

dt[, c(.N, unname(lapply(.SD, sum))), by = "a"] 

Output

> dt[, c(.N, unname(lapply(.SD, sum))), by = "a"] 
Finding groups (bysameorder=TRUE) ... done in 0secs. bysameorder=TRUE and o__ is length 0 
Optimization is on but j left unchanged as 'c(.N, unname(lapply(.SD, sum)))' 
Starting dogroups ... done dogroups in 0.001 secs 
    a V1 V2 V3 V4 
1: 3 5 15 65 115 
2: 4 5 40 90 140 
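
Putting the names back can be done in place with setnames(). The following is a minimal sketch, assuming the result is stored in a variable (dt.out, a name chosen here for illustration) and that the count column should be called count:

```r
library(data.table)

dt <- data.table(a = c(rep(3, 5), rep(4, 5)), b = 1:10, c = 11:20, d = 21:30, key = "a")

# Unnamed j: the result columns come back as V1..V4
dt.out <- dt[, c(.N, unname(lapply(.SD, sum))), by = "a"]

# Restore the names by reference: the grouping column first, then the
# count, then the summed columns in their original order
setnames(dt.out, c("a", "count", setdiff(names(dt), "a")))

print(dt.out)
```

Because setnames() modifies dt.out by reference, no copy of the result is made.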

+0

Nice (better) option. With '.N' placed last, setting the names with setnames(dt.out, c(names(dt), "N")) becomes slightly simpler. – Arun 2013-04-21 17:45:19

+0

*Significantly* slower: 'Starting dogroups ... done dogroups in 0.277 secs' vs 'Starting dogroups ... done dogroups in 2.929 secs' – sds 2013-04-21 17:53:25

+0

@sds, it is not clear which two solutions you are comparing. – djhurio 2013-04-21 18:01:31