2016-12-29

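I am exploring different ways to wrap an aggregation function (though it could really be any type of function) using data.table (a dplyr example is also provided), and I am wondering about best practices for functional programming / metaprogramming / computing on the language in R with data.table, with respect to: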

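  • Performance (does the implementation matter with respect to the potential optimizations data.table can apply?)
  • Readability (is there a commonly agreed standard, e.g. in most packages that use data.table?)
  • Ease of generalization (do the metaprogramming approaches differ in how "generalizable" they are?)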

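The basic application is to aggregate a table flexibly, i.e. to parameterize the variables to aggregate, the dimensions to aggregate by, the resulting variable names of both, and the aggregation function. I have implemented (nearly) the same function in three data.table ways and one dplyr way: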

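  1. fn_dt_agg1 (here I could not figure out how to parameterize the aggregation function)
  2. fn_dt_agg2 (inspired by @jangorecki's answer, which he calls "computing on the language")
  3. fn_dt_agg3 (inspired by @Arun's answer, which seems to be another approach to metaprogramming)
  4. fn_df_agg1 (my humble attempt at the same in dplyr)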

library(data.table) 
library(dplyr) 

Data

# One million rows: numeric metrics in 1..100, dimensions drawn from "j", "k", "l"
n_size <- 1e6
sample_metrics <- sample(seq(from = 1, to = 100, by = 1), n_size, replace = TRUE)
sample_dimensions <- sample(letters[10:12], n_size, replace = TRUE)
df <-
    data.frame(
        a = sample_metrics,
        b = sample_metrics,
        c = sample_dimensions,
        d = sample_dimensions,
        x = sample_metrics,
        y = sample_dimensions,
        stringsAsFactors = FALSE)

dt <- as.data.table(df)

Implementation

1. fn_dt_agg1

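fn_dt_agg1 <-
    function(dt, metric, metric_name, dimension, dimension_name) {

    # The aggregation function (sum) is hard-coded here.
    # dimension_name is accepted but not used by this implementation.
    temp <- dt[, setNames(lapply(.SD, function(x) {sum(x, na.rm = TRUE)}),
                          metric_name),
               keyby = dimension, .SDcols = metric]
    temp[]
}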

res_dt1 <- 
    fn_dt_agg1(
    dt = dt, metric = c("a", "b"), metric_name = c("a", "b"), 
    dimension = c("c", "d"), dimension_name = c("c", "d")) 
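
One possible way to parameterize the aggregation function in the style of fn_dt_agg1, without any metaprogramming, is to pass the function object itself. This is only a sketch: `fn_dt_agg_fun`, its `agg_fun` argument and the `toy` table are hypothetical names that do not appear in the original post, and it assumes the chosen function accepts an `na.rm` argument.

```r
library(data.table)

# Hypothetical variant of fn_dt_agg1: agg_fun is a function object, not a string.
fn_dt_agg_fun <- function(dt, metric, metric_name, dimension, agg_fun = sum) {
  # Extra arguments to lapply() are forwarded to agg_fun for every column.
  dt[, setNames(lapply(.SD, agg_fun, na.rm = TRUE), metric_name),
     keyby = dimension, .SDcols = metric]
}

# Small illustrative table (not the benchmark data from the question).
toy <- data.table(g = c("j", "j", "k", "k"), a = c(1, 2, 3, NA))

res_sum <- fn_dt_agg_fun(toy, metric = "a", metric_name = "total", dimension = "g")
res_max <- fn_dt_agg_fun(toy, "a", "largest", "g", agg_fun = max)
```

Because the function is looked up by R's normal scoping rules, no string has to be parsed or turned into a call.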

2. fn_dt_agg2

fn_dt_agg2 <-
    function(dt, metric, metric_name, dimension, dimension_name,
             agg_type) {

    # Build the call .(name = agg_type(var, na.rm = TRUE), ...) as a language object
    j_call = as.call(c(
        as.name("."),
        sapply(setNames(metric, metric_name),
               function(var) as.call(list(as.name(agg_type),
                                          as.name(var), na.rm = TRUE)),
               simplify = FALSE)
    ))

    # Evaluate the constructed call in j
    dt[, eval(j_call), keyby = dimension][]
}

res_dt2 <- 
    fn_dt_agg2(
    dt = dt, metric = c("a", "b"), metric_name = c("a", "b"), 
    dimension = c("c", "d"), dimension_name = c("c", "d"), 
    agg_type = c("sum")) 

all.equal(res_dt1, res_dt2) 
#TRUE 
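
To make the "computing on the language" step in fn_dt_agg2 more tangible, the following base R sketch builds the same kind of call outside of data.table and inspects it. The output names `total_a` and `total_b` and the tiny input list are illustrative assumptions; `.` is bound to `list` by hand here only because data.table is not involved.

```r
metric <- c("a", "b")
metric_name <- c("total_a", "total_b")
agg_type <- "sum"

# Construct .(total_a = sum(a, na.rm = TRUE), total_b = sum(b, na.rm = TRUE))
j_call <- as.call(c(
  as.name("."),
  lapply(setNames(metric, metric_name), function(var) {
    as.call(list(as.name(agg_type), as.name(var), na.rm = TRUE))
  })
))

print(j_call)  # shows the unevaluated call

# Evaluate against a plain list, supplying `.` as an alias for list()
result <- eval(j_call, list(a = c(1, 2, NA), b = c(10, 20, 30), `.` = list))
```

The call is an ordinary R object until `eval()` runs it, which is why it can be assembled from strings such as `agg_type` without pasting and parsing source text.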

3. fn_dt_agg3

fn_dt_agg3 <-
    function(dt, metric, metric_name, dimension, dimension_name, agg_type) {

    # Build the source text of a wrapper function, then parse and evaluate it
    e <- eval(parse(text = paste0("function(x) {",
                                  agg_type, "(", "x, na.rm = TRUE)}")))

    temp <- dt[, setNames(lapply(.SD, e),
                          metric_name),
               keyby = dimension, .SDcols = metric]
    temp[]
}

res_dt3 <- 
    fn_dt_agg3(
    dt = dt, metric = c("a", "b"), metric_name = c("a", "b"), 
    dimension = c("c", "d"), dimension_name = c("c", "d"), 
    agg_type = "sum") 

all.equal(res_dt1, res_dt3) 
#TRUE 

4. fn_df_agg1

fn_df_agg1 <-
    function(df, metric, metric_name, dimension, dimension_name, agg_type) {

    all_vars <- c(dimension, metric)
    all_vars_new <- c(dimension_name, metric_name)
    dots_group <- lapply(dimension, as.name)

    # Same eval(parse()) wrapper as in fn_dt_agg3
    e <- eval(parse(text = paste0("function(x) {",
                                  agg_type, "(", "x, na.rm = TRUE)}")))

    # Note: the underscore verbs and funs() used below have since been
    # deprecated in dplyr.
    df %>%
        select_(.dots = all_vars) %>%
        group_by_(.dots = dots_group) %>%
        summarise_each_(funs(e), metric) %>%
        rename_(.dots = setNames(all_vars, all_vars_new))
}

res_df1 <- 
    fn_df_agg1(
    df = df, metric = c("a", "b"), metric_name = c("a", "b"), 
    dimension = c("c", "d"), dimension_name = c("c", "d"), 
    agg_type = "sum") 

all.equal(res_dt1, as.data.table(res_df1)) 
#"Datasets has different keys. 'target': c, d. 'current' has no key." 
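
The message above comes from the key attribute rather than from the values: keyby sets a key on the data.table result, while the converted dplyr result has none. A small sketch of how the two could be made comparable, using hypothetical toy tables `keyed` and `unkeyed` instead of the real results:

```r
library(data.table)

# Two tables with identical contents; only one of them carries a key.
keyed <- data.table(c = c("j", "k"), a = c(1, 2), key = "c")
unkeyed <- data.table(c = c("j", "k"), a = c(1, 2))

before <- all.equal(keyed, unkeyed)  # reports the key mismatch

# Give the second table the same key, then compare again.
setkeyv(unkeyed, key(keyed))
after <- all.equal(keyed, unkeyed)
```

Applied to the results above, the same idea would mean setting the key c("c", "d") on as.data.table(res_df1) before comparing it with res_dt1.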

Benchmark

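Out of curiosity, and for my future self and other interested readers, I ran a benchmark of all four implementations, which may already shed some light on the performance question (I am not a benchmarking expert, so please excuse me if I have not applied commonly accepted best practices). I expected fn_dt_agg1 to be the fastest since it has one parameter fewer (the aggregation function), but that does not seem to have a sizable impact. I was also surprised that the dplyr function is relatively slow, but this may be due to a poor design choice on my part.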

library(microbenchmark) 
bench_res <- 
    microbenchmark(
    fn_dt_agg1 = 
     fn_dt_agg1(
     dt = dt, metric = c("a", "b"), 
     metric_name = c("a", "b"), 
     dimension = c("c", "d"), 
     dimension_name = c("c", "d")), 
    fn_dt_agg2 = 
     fn_dt_agg2(
     dt = dt, metric = c("a", "b"), 
     metric_name = c("a", "b"), 
     dimension = c("c", "d"), 
     dimension_name = c("c", "d"), 
     agg_type = c("sum")), 
    fn_dt_agg3 = 
     fn_dt_agg3(
     dt = dt, metric = c("a", "b"), 
     metric_name = c("a", "b"), 
     dimension = c("c", "d"), 
     dimension_name = c("c", "d"), 
     agg_type = c("sum")), 
    fn_df_agg1 = 
     fn_df_agg1(
     df = df, metric = c("a", "b"), metric_name = c("a", "b"), 
     dimension = c("c", "d"), dimension_name = c("c", "d"), 
     agg_type = "sum"), 
    times = 100L) 

bench_res 

# Unit: milliseconds
#        expr      min       lq     mean   median       uq       max neval
#  fn_dt_agg1 28.96324 30.49507 35.60988 32.62860 37.43578 140.32975   100
#  fn_dt_agg2 27.51993 28.41329 31.80023 28.93523 33.17064  84.56375   100
#  fn_dt_agg3 25.46765 26.04711 30.11860 26.64817 30.28980 153.09715   100
#  fn_df_agg1 88.33516 90.23776 97.84826 94.28843 97.97154 172.87838   100

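Additional resources

Re: agg2, "which he calls 'computing on the language'": that is not my term but that of the official R Language Definition, which you linked at the bottom. – jangorecki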

Answer

I don't recommend eval(parse()). You can achieve the same as in approach three without it:

fn_dt_agg4 <-
    function(dt, metric, metric_name, dimension, dimension_name, agg_type) {

    # Look the function up by name instead of parsing generated source text
    e <- function(x) getFunction(agg_type)(x, na.rm = TRUE)

    temp <- dt[, setNames(lapply(.SD, e),
                          metric_name),
               keyby = dimension, .SDcols = metric]
    temp[]
}

This also avoids some security risks.
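
To illustrate the kind of risk meant here, the sketch below compares the two ways of turning agg_type into a function when the string is not a plain function name. The helper names `make_agg_parse` and `make_agg_lookup` and the `hostile` string are hypothetical and only serve the demonstration.

```r
# Approach three: paste the string into source text and evaluate it.
make_agg_parse <- function(agg_type) {
  eval(parse(text = paste0("function(x) {", agg_type, "(x, na.rm = TRUE)}")))
}

# Approach four: look the name up as a function.
make_agg_lookup <- function(agg_type) {
  f <- methods::getFunction(agg_type)
  function(x) f(x, na.rm = TRUE)
}

# A string that smuggles an extra statement in front of the function name.
hostile <- "message('arbitrary code ran'); sum"

unsafe <- make_agg_parse(hostile)
unsafe_value <- unsafe(c(1, 2, NA))  # the injected message() call runs as well

# The lookup finds no function with that name and signals an error instead.
lookup_result <- tryCatch(make_agg_lookup(hostile),
                          error = function(e) "rejected")
```

With eval(parse()), whatever text reaches agg_type becomes code; with a lookup, it can only ever name an existing function.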

PS: You can see information about the optimizations data.table applies by setting options("datatable.verbose" = TRUE).
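
A minimal sketch of how that option might be used to compare a direct call to sum with a wrapper function. The table `dt_small` and the wrapper `e` are illustrative assumptions, and the exact wording of the verbose messages depends on the data.table version, so none is reproduced here.

```r
library(data.table)

dt_small <- data.table(g = c("j", "j", "k"), a = c(1, 2, 3))
e <- function(x) sum(x, na.rm = TRUE)

# Switch verbose mode on, remembering the previous setting.
old <- options(datatable.verbose = TRUE)

# Direct call: the log should indicate whether j was optimized.
direct <- dt_small[, .(a = sum(a, na.rm = TRUE)), keyby = g]

# Wrapped call: compare the log with the one above.
wrapped <- dt_small[, .(a = e(a)), keyby = g]

# Restore the previous setting.
options(old)
```

Both queries return the same values; only the log printed for each reveals how data.table executed them.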

Is there an important difference between 'getFunction' and 'match.fun'? – Axeman

Nice. I didn't know about 'getFunction' and have not seen it anywhere else so far. But why is 'eval(parse())' not recommended? I have seen it in other answers by @Matt Dowle [here](http://stackoverflow.com/questions/10675182/in-r-data-table-how-do-i-pass-variable-parameters-to-an-expression) and @Arun [here](http://stackoverflow.com/questions/26883859/using-eval-in-data-table?rq=1) – Triamus

@Axeman I'm not sure. The latter also accepts input other than a character string. – Roland
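
As a small illustration of that last point, a wrapper built on match.fun can take either a function name or a function object. The helper name `make_agg` is hypothetical.

```r
# match.fun() returns a function object unchanged and looks up a name otherwise.
make_agg <- function(agg_type) {
  f <- match.fun(agg_type)
  function(x) f(x, na.rm = TRUE)
}

by_name <- make_agg("sum")    # character input
by_object <- make_agg(max)    # function object input

sum_value <- by_name(c(1, 2, NA))
max_value <- by_object(c(1, 5, NA))
```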