Why do group_by and group_by_ give different answers when summarizing by two variables?

In the following example, I want to create a summary statistic by two variables. When I do it with dplyr::group_by, I get the correct answer; when I do it with dplyr::group_by_, it summarizes one level more than I want it to.

library(dplyr) 
set.seed(919) 
df <- data.frame(
    a = c(1, 1, 1, 2, 2, 2), 
    b = c(3, 3, 4, 4, 5, 5), 
    x = runif(6) 
) 

# Gives correct answer 
df %>% 
    group_by(a, b) %>% 
    summarize(total = sum(x)) 

# Source: local data frame [4 x 3] 
# Groups: a [?] 
# 
#  a  b  total 
# <dbl> <dbl>  <dbl> 
# 1  1  3 1.5214746 
# 2  1  4 0.7150204 
# 3  2  4 0.1234555 
# 4  2  5 0.8208454 

# Wrong answer -- too many levels summarized 
df %>% 
    group_by_(c("a", "b")) %>% 
    summarize(total = sum(x)) 
# # A tibble: 2 × 2 
#  a  total 
# <dbl>  <dbl> 
# 1  1 2.2364950 
# 2  2 0.9443009

What is going on here?

Source

2016-11-08 Jake Fisher

This might help: http://stackoverflow.com/questions/28667059/dplyr-whats-the-difference-between-group-by-and-group-by-functions – wbrugato

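Thanks @wbrugato. I did see that. It explains how the inputs to the two functions differ (quoted vs. unquoted names), but it doesn't explain why the functions give different outputs from what looks like the same input (but let me know if I missed something!). –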

You need group_by_(.dots = c("a", "b")) or group_by_("a", "b"). – Psidom

If you want to use a vector of variable names, you can pass it to the .dots argument:

df %>% 
     group_by_(.dots = c("a", "b")) %>% 
     summarize(total = sum(x)) 

#Source: local data frame [4 x 3] 
#Groups: a [?] 

#  a  b  total 
# <dbl> <dbl>  <dbl> 
#1  1  3 1.5214746 
#2  1  4 0.7150204 
#3  2  4 0.1234555 
#4  2  5 0.8208454
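
As a side note, group_by_ was deprecated in later dplyr releases. A minimal sketch of the same grouping in newer dplyr follows, assuming dplyr 1.0.0 or later (where across() and all_of() are available). The name group_vars is only illustrative and does not appear in the question.

```r
library(dplyr)

set.seed(919)
df <- data.frame(
    a = c(1, 1, 1, 2, 2, 2),
    b = c(3, 3, 4, 4, 5, 5),
    x = runif(6)
)

# Hypothetical name: a character vector holding the grouping columns.
group_vars <- c("a", "b")

# all_of() selects the columns named in the vector; across() hands them
# to group_by() as grouping variables.
result <- df %>%
    group_by(across(all_of(group_vars))) %>%
    summarize(total = sum(x), .groups = "drop")

result
```

This should produce one row per (a, b) combination, like the .dots version above.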

Or you can call group_by_ the same way you would call the NSE version, passing each name as a separate (quoted) argument:

df %>% 
    group_by_("a", "b") %>% 
    summarize(total = sum(x)) 

#Source: local data frame [4 x 3] 
#Groups: a [?] 

#  a  b  total 
# <dbl> <dbl>  <dbl> 
#1  1  3 1.5214746 
#2  1  4 0.7150204 
#3  2  4 0.1234555 
#4  2  5 0.8208454
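
As for why the original call grouped only by a, here is one likely explanation. It rests on an assumption that the thread does not confirm: that group_by_ turns each string argument into an expression by parsing it and keeping only the first parsed expression. If so, c("a", "b") arrives as a single argument and only "a" survives. The parsing step itself can be reproduced in base R:

```r
# Base R only: this mimics the assumed string-to-expression step,
# it does not call dplyr itself.
vars <- c("a", "b")

# parse() returns one expression per element of the character vector ...
parsed <- parse(text = vars)
length(parsed)

# ... so keeping only the first expression silently drops "b".
first_only <- parsed[[1]]
first_only

# Handling each name separately (as with group_by_("a", "b") or .dots)
# yields one expression per name.
each <- lapply(vars, function(v) parse(text = v)[[1]])
each
```

Under that assumption, passing the vector through .dots, or passing the names as separate arguments, makes each name its own grouping expression.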

Source

2016-11-08 20:07:45 Psidom
