R data.table: subgroup weighted percent of group

I have a data.table like:

library(data.table) 
widgets <- data.table(serial_no=1:100, 
         color=rep_len(c("red","green","blue","black"),length.out=100), 
         style=rep_len(c("round","pointy","flat"),length.out=100), 
         weight=rep_len(1:5,length.out=100))

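Although I don't know if it's the most data.table way to do it, I can use table and length to calculate frequencies by group and subgroup in a single step - for example, to answer the question "What percent of red widgets are round?"

EDIT: this code does not give the right answer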

# example A 
widgets[, list(style = unique(style), 
       style_pct_of_color_by_count = 
       as.numeric(table(style)/length(style))), by=color] 

# color style style_pct_of_color_by_count 
# 1: red round      0.32 
# 2: red pointy      0.32 
# 3: red flat      0.36 
# 4: green pointy      0.32 
# ...
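One plausible reason the output above is mislabeled (a reading of the code, not something the poster states): unique(style) returns styles in order of first appearance, while table(style) sorts them alphabetically, so labels and fractions get paired in different orders. A minimal sketch, assuming the widgets table defined earlier; the names red and fixed are illustrative only:

```r
library(data.table)
widgets <- data.table(serial_no = 1:100,
         color = rep_len(c("red","green","blue","black"), length.out = 100),
         style = rep_len(c("round","pointy","flat"), length.out = 100),
         weight = rep_len(1:5, length.out = 100))

red <- widgets[color == "red"]
unique(red$style)        # order of first appearance
names(table(red$style))  # alphabetical order

# Take the labels from the table itself so they line up with the counts
fixed <- widgets[, {
    tab <- table(style)
    list(style = names(tab), frac = as.numeric(tab) / .N)
}, by = color]
fixed[color == "red"]
```

With the labels taken from names(tab), the red/round fraction should agree with the direct check quoted in the comments (0.36).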

But I can't use that approach to answer a question like "By weight, what percent of red widgets are round?" I can only come up with a two-step approach:

# example B 
widgets[, list(cs_weight = sum(weight)), by = list(color, style)][
    , list(style, style_pct_of_color_by_weight = cs_weight / sum(cs_weight)), by = color]

# color style style_pct_of_color_by_weight 
# 1: red round     0.3466667 
# 2: red pointy     0.3466667 
# 3: red flat     0.3066667 
# 4: green pointy     0.3333333 
# ...

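I'm looking for a single-step approach to B, and to A if it can be improved, with an explanation that deepens my understanding of data.table syntax for by-group operations. Please note that this question is different from "Weighted sum of variables by groups with data.table" because mine involves subgroups and avoiding multiple steps. TYVM.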


2015-06-19 C8H10N4O2

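Looking at the response from @Frank below, I see that my attempt was not only awkward but incorrect - for example, I checked 'widgets[, sum(style=="round" & color=="red")/sum(color=="red")] # 0.36' – C8H10N4O2

This is almost a single step: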

# A 
widgets[,{ 
    totwt = .N 
    .SD[,.(frac=.N/totwt),by=style] 
},by=color] 
# color style frac
# 1: red round 0.36 
# 2: red pointy 0.32 
# 3: red flat 0.32 
# 4: green pointy 0.36 
# 5: green flat 0.32 
# 6: green round 0.32 
# 7: blue flat 0.36 
# 8: blue round 0.32 
# 9: blue pointy 0.32 
# 10: black round 0.36 
# 11: black pointy 0.32 
# 12: black flat 0.32 

# B 
widgets[,{ 
    totwt = sum(weight) 
    .SD[,.(frac=sum(weight)/totwt),by=style] 
},by=color] 
# color style  frac 
# 1: red round 0.3466667 
# 2: red pointy 0.3466667 
# 3: red flat 0.3066667 
# 4: green pointy 0.3333333 
# 5: green flat 0.3200000 
# 6: green round 0.3466667 
# 7: blue flat 0.3866667 
# 8: blue round 0.2933333 
# 9: blue pointy 0.3200000 
# 10: black round 0.3733333 
# 11: black pointy 0.3333333 
# 12: black flat 0.2933333

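How it works: construct your denominator for the top-level group (color) before going to the finer group (color with style) to tabulate.

Alternatives. If the styles repeat within each color and this is only for display purposes, try a table: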
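As a spot check of that idea for a single cell, the denominator and numerator can be computed separately. This is only a sketch, assuming the widgets table from the question; red_total and red_round are illustrative names:

```r
library(data.table)
widgets <- data.table(serial_no = 1:100,
         color = rep_len(c("red","green","blue","black"), length.out = 100),
         style = rep_len(c("round","pointy","flat"), length.out = 100),
         weight = rep_len(1:5, length.out = 100))

# Denominator: total weight of the top-level group (red)
red_total <- widgets[color == "red", sum(weight)]

# Numerator: weight of the finer group (red and round)
red_round <- widgets[color == "red" & style == "round", sum(weight)]

red_round / red_total
```

The ratio should match the red/round row in the output of B (0.3466667).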

# A 
widgets[, 
    prop.table(table(color,style),1) 
] 
#  style 
# color flat pointy round 
# black 0.32 0.32 0.36 
# blue 0.36 0.32 0.32 
# green 0.32 0.36 0.32 
# red 0.32 0.32 0.36 

# B 
widgets[,rep(1L,sum(weight)),by=.(color,style)][, 
    prop.table(table(color,style),1) 
] 

#  style 
# color  flat pointy  round 
# black 0.2933333 0.3333333 0.3733333 
# blue 0.3866667 0.3200000 0.2933333 
# green 0.3200000 0.3333333 0.3466667 
# red 0.3066667 0.3466667 0.3466667

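For B, this expands the data so that there is one observation for each unit of weight. With large data, such an expansion would be a bad idea (since it costs so much memory). Also, weight must be an integer; otherwise, its sum will be silently truncated to one (e.g., try rep(1,2.5) # [1] 1 1).
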
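To illustrate the truncation, and one way to avoid the expansion entirely, here is a small sketch. It swaps in base R's xtabs, which sums weight within each cell directly; this is a different technique from the rep-based expansion, not part of the answer's method, and wt_tab is an illustrative name:

```r
library(data.table)
widgets <- data.table(serial_no = 1:100,
         color = rep_len(c("red","green","blue","black"), length.out = 100),
         style = rep_len(c("round","pointy","flat"), length.out = 100),
         weight = rep_len(1:5, length.out = 100))

# A non-integer `times` is truncated without warning
length(rep(1L, 2.5))

# Weighted row proportions without expanding the data:
# xtabs sums weight per (color, style) cell, prop.table(, 1) divides by row totals
wt_tab <- prop.table(xtabs(weight ~ color + style, data = widgets), 1)
wt_tab
```

Because the weights are summed rather than used as repeat counts, this should also work when weight is not an integer.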

2015-06-19 17:56:04 Frank

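This is what I do, but I'm also interested in seeing a better way. – Frank

Thanks @Frank - it will take me a while to dig into the dot notation and embedded assignment, but this is a great approach. – C8H10N4O2

Your first version can be rewritten without the temp variable as follows: 'widgets[, .(frac = .SD[, .N, by=style]$N/.N), by=color]' – Arun

It may be a good idea to use dplyr: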

library(dplyr)

# share of each style within color, by count
df <- widgets %>% 
    group_by(color, style) %>% 
    summarise(count = n()) %>% 
    mutate(freq = count/sum(count)) 

# share of each style within color, by weight
df2 <- widgets %>% 
    group_by(color, style) %>% 
    summarise(count_w = sum(weight)) %>% 
    mutate(freq = count_w/sum(count_w))


2015-06-19 18:53:27 drsh1

Thanks @drsh1, I understand that 'dplyr' is intuitive and useful here. Specifically, my question is how to do this using 'data.table' syntax. – C8H10N4O2
