How to avoid the optimization warning in data.table

I have the following code:

> dt <- data.table(a=c(rep(3,5),rep(4,5)),b=1:10,c=11:20,d=21:30,key="a") 
> dt 
    a b c d 
1: 3 1 11 21 
2: 3 2 12 22 
3: 3 3 13 23 
4: 3 4 14 24 
5: 3 5 15 25 
6: 4 6 16 26 
7: 4 7 17 27 
8: 4 8 18 28 
9: 4 9 19 29 
10: 4 10 20 30 
> dt[,lapply(.SD,sum),by="a"] 
Finding groups (bysameorder=TRUE) ... done in 0secs. bysameorder=TRUE and o__ is length 0 
Optimized j from 'lapply(.SD, sum)' to 'list(sum(b), sum(c), sum(d))' 
Starting dogroups ... done dogroups in 0 secs 
    a b c d 
1: 3 15 65 115 
2: 4 40 90 140 
> dt[,c(count=.N,lapply(.SD,sum)),by="a"] 
Finding groups (bysameorder=TRUE) ... done in 0secs. bysameorder=TRUE and o__ is length 0 
Optimization is on but j left unchanged as 'c(count = .N, lapply(.SD, sum))' 
Starting dogroups ... The result of j is a named list. It's very inefficient to create the same names over and over again for each group. When j=list(...), any names are detected, removed and put back after grouping has completed, for efficiency. Using j=transform(), for example, prevents that speedup (consider changing to :=). This message may be upgraded to warning in future. 
done dogroups in 0 secs 
    a count b c d 
1: 3  5 15 65 115 
2: 4  5 40 90 140

How can I avoid the scary "very inefficient" warning?

I could add a count column before aggregating:

> dt$count <- 1 
> dt 
    a b c d count 
1: 3 1 11 21  1 
2: 3 2 12 22  1 
3: 3 3 13 23  1 
4: 3 4 14 24  1 
5: 3 5 15 25  1 
6: 4 6 16 26  1 
7: 4 7 17 27  1 
8: 4 8 18 28  1 
9: 4 9 19 29  1 
10: 4 10 20 30  1 
> dt[,lapply(.SD,sum),by="a"] 
Finding groups (bysameorder=TRUE) ... done in 0secs. bysameorder=TRUE and o__ is length 0 
Optimized j from 'lapply(.SD, sum)' to 'list(sum(b), sum(c), sum(d), sum(count))' 
Starting dogroups ... done dogroups in 0 secs 
    a b c d count 
1: 3 15 65 115  5 
2: 4 40 90 140  5

But that does not look very elegant...

Asked 2013-04-21 by sds

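Do you want to "suppress" the warning, or to do things efficiently? – Arun 2013-04-21 15:27:15

I never said "suppress". I said "avoid", which means I want to do the right thing and make my code work correctly and efficiently, so that no warning is needed. – sds 2013-04-21 16:23:56

Clearly, I was not sure whether you wanted to "avoid" "seeing" the warning or "avoid" "having" the warning. – Arun 2013-04-21 16:37:52

One way I can think of is to assign count by reference: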

dt.out <- dt[, lapply(.SD,sum), by = a] 
dt.out[, count := dt[, .N, by=a][, N]] 
# alternatively: count := table(dt$a) 

# a b c d count 
# 1: 3 15 65 115  5 
# 2: 4 40 90 140  5

Edit 1: I still think this is just a message, not a warning. But if you still want to avoid it, just do:

dt.out[, count := as.numeric(dt[, .N, by=a][, N])]

Edit 2: Very interesting. Doing the equivalent with the multiple-assignment form of := does not produce the same message.

dt.out[, `:=`(count = dt[, .N, by=a][, N])] 
# Detected that j uses these columns: a 
# Finding groups (bysameorder=TRUE) ... done in 0.001secs. bysameorder=TRUE and o__ is length 0 
# Detected that j uses these columns: <none> 
# Optimization is on but j left unchanged as '.N' 
# Starting dogroups ... done dogroups in 0 secs 
# Detected that j uses these columns: N 
# Assigning to all 2 rows 
# Direct plonk of unnamed RHS, no copy.
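
As a self-contained sketch of the by-reference approach above: the `counts` helper and the alignment check are illustrative additions, and the sketch assumes both grouped results come back in the same group order (which should hold here because dt is keyed on a).

```r
library(data.table)

dt <- data.table(a = c(rep(3, 5), rep(4, 5)), b = 1:10, c = 11:20, d = 21:30, key = "a")

# Aggregate the value columns per group; j is a plain lapply(.SD, sum).
dt.out <- dt[, lapply(.SD, sum), by = a]

# Compute the group sizes separately ('counts' is a hypothetical helper name).
counts <- dt[, .N, by = a]

# Guard: both results must list the groups in the same order.
stopifnot(identical(counts$a, dt.out$a))

# Add the count column by reference; no named list is built per group.
dt.out[, count := counts$N]

print(dt.out)
```

The check costs little and makes the row alignment explicit instead of relying on it silently.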

Answered 2013-04-21 15:23:44 by Arun

This produces a warning: "RHS for item 1 has been duplicated. Either NAMED vector or recycled list RHS." – sds 2013-04-21 16:41:12

How can you say this is a warning? It does not mention anything about inefficiency... it is just a message. Anyway, I have made an edit so that you do not get this message. – Arun 2013-04-21 17:14:22

I think you may find 'dt[, .N, by = a][["N"]]' more efficient, because when simply subsetting it avoids the overhead of calling '[.data.table'. – mnel 2013-04-21 23:48:28
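
The two extraction forms mentioned in the comment above can be compared in a small sketch. It only checks that both return the same vector; it makes no claim about timing, and the variable names are illustrative.

```r
library(data.table)

dt <- data.table(a = c(rep(3, 5), rep(4, 5)), b = 1:10, key = "a")

# Column extraction through a second `[.data.table` call.
n_via_dt <- dt[, .N, by = a][, N]

# Column extraction with list-style `[[`, which skips `[.data.table`.
n_via_list <- dt[, .N, by = a][["N"]]

# Both forms yield the same integer vector of group sizes.
stopifnot(identical(n_via_dt, n_via_list))
print(n_via_list)
```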

This solution removes the message about named elements. But you have to put the names back afterwards.

require(data.table) 
options(datatable.verbose = TRUE) 

dt <- data.table(a=c(rep(3,5),rep(4,5)),b=1:10,c=11:20,d=21:30,key="a") 

dt[, c(.N, unname(lapply(.SD, sum))), by = "a"]

Output

> dt[, c(.N, unname(lapply(.SD, sum))), by = "a"] 
Finding groups (bysameorder=TRUE) ... done in 0secs. bysameorder=TRUE and o__ is length 0 
Optimization is on but j left unchanged as 'c(.N, unname(lapply(.SD, sum)))' 
Starting dogroups ... done dogroups in 0.001 secs 
    a V1 V2 V3 V4 
1: 3 5 15 65 115 
2: 4 5 40 90 140

Answered 2013-04-21 17:33:01 by djhurio

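Good (better) option. If you put '.N' last, setting the names afterwards becomes slightly simpler: setnames(dt.out, c(names(dt), "N")). – Arun 2013-04-21 17:45:19

*Significantly* slower: 'Starting dogroups ... done dogroups in 0.277 secs' vs 'Starting dogroups ... done dogroups in 2.929 secs' – sds 2013-04-21 17:53:25

@sds, it is not clear which two solutions you are comparing. – djhurio 2013-04-21 18:01:31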
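
Combining the unnamed j from the answer with the setnames suggestion in the comment above might look like the sketch below. It assumes '.N' is placed last, so the original column names followed by "N" match the result columns one to one.

```r
library(data.table)

dt <- data.table(a = c(rep(3, 5), rep(4, 5)), b = 1:10, c = 11:20, d = 21:30, key = "a")

# Unnamed j: the three sums come first and .N comes last,
# so the result has the grouping column a plus four unnamed value columns.
dt.out <- dt[, c(unname(lapply(.SD, sum)), .N), by = "a"]

# Put the names back by reference: a, b, c, d from dt, then "N" for the count.
setnames(dt.out, c(names(dt), "N"))

print(dt.out)
```

Passing a full vector of names to setnames replaces every column name at once, so the sketch does not depend on what the unnamed columns were called.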
