2015-12-21 37 views

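Concise R data.table syntax for the modal value (most frequent) by group

What is an efficient and elegant data.table syntax for finding the most common category for each id? I keep a boolean vector indicating the NA positions (for other purposes).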

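library(data.table)
# Recycle the categories to 14 rows explicitly; recent data.table versions may refuse to recycle a length-3 vector
dt = data.table(id = rep(1:2, 7), category = rep(c("x", "y", NA), length.out = 14))
print(dt)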

In this toy example, ignoring NA, x is the most common category for id==1 and y is the most common category for id==2.

Answers


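If you want to ignore the NAs, you have to exclude them first with !is.na(category), group by id and category (by = .(id, category)), and then create a frequency variable with .N:

dt[!is.na(category), .N, by = .(id, category)]

which gives:
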
   id category N
1:  1        x 3
2:  2        y 3
3:  2        x 2
4:  1        y 2

Ordering the result by id gives you a clearer picture:

dt[!is.na(category), .N, by = .(id, category)][order(id)] 

which results in:

   id category N
1:  1        x 3
2:  1        y 2
3:  2        y 3
4:  2        x 2

If you only want the row with the top result for each id:

dt[!is.na(category), .N, by = .(id, category)][order(id, -N), head(.SD,1), by = id] 

or:

dt[!is.na(category), .N, by = .(id, category)][, .SD[which.max(N)], by = id] 

Both give:

   id category N
1:  1        x 3
2:  2        y 3
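
One thing the two variants above do not show is what happens on a tie: which.max() returns only the first maximum, so a second, equally frequent category is silently dropped. The sketch below is an illustration only; the `tied` table and its counts are made up for the example and are not part of the question's data. Comparing N against the group maximum keeps every tied category.

```r
library(data.table)

# Hypothetical count table (not from the question): id 1 has two
# categories tied at N = 2, id 2 has a single category.
tied <- data.table(id = c(1L, 1L, 2L),
                   category = c("a", "b", "a"),
                   N = c(2L, 2L, 1L))

# which.max() picks only the first maximum per id, so "b" is lost for id 1.
first_mode <- tied[, .SD[which.max(N)], by = id]

# Keeping rows where N equals the group maximum returns all tied modes.
all_modes <- tied[, .SD[N == max(N)], by = id]

print(first_mode)
print(all_modes)
```

Which behaviour is preferable depends on whether exactly one row per id is needed downstream.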

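Note that doing this can drop groups that contain only NAs. You may want to join them back in, or in this case join them in with:

dt[!is.na(category)][, .N, by = .(id, category)][order(-N)][.(unique(dt$id)), on = .(id), .SD[1L], by = id]

or just give the non-NA values sorting priority:

dt[, .N, by = .(id, category)][order(is.na(category), -N), .SD[1L], by = id]

– Frank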
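
The effect described in this comment can be seen with a small sketch. The table `dt2` is hypothetical (the question's toy data has no all-NA group); it adds an id 3 whose categories are all NA, and compares filtering NAs first with sorting NAs last.

```r
library(data.table)

# Hypothetical data (an assumption for illustration): id 3 has only NAs.
dt2 <- data.table(id = c(1L, 1L, 1L, 2L, 2L, 3L, 3L),
                  category = c("x", "x", "y", "y", NA, NA, NA))

# Filtering NAs before counting removes id 3 from the result entirely.
filtered <- dt2[!is.na(category), .N, by = .(id, category)][
  , .SD[which.max(N)], by = id]

# Counting everything and sorting NAs last keeps id 3; its reported
# category is NA because no non-NA value exists for that id.
kept <- dt2[, .N, by = .(id, category)][
  order(is.na(category), -N), .SD[1L], by = id]

print(filtered)
print(kept)
```

With this input, `filtered` has rows for ids 1 and 2 only, while `kept` has one row per id, with NA as the category for id 3.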