2017-03-02

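Caching results from the dplyr database backend

I'm using dplyr's database backend on an AWS Redshift database. Because some queries take forever to return, I'd like to cache them. I know the underlying data won't change, so if the query doesn't change, the result set won't change either.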

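The approach I've used elsewhere to achieve this is:

  • Hash the query string
  • Save the query result to a file {hash}.rds
  • On the next run of the script, if the hash hasn't changed, read the result from disk; otherwise rerun the query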
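
The steps above can be sketched roughly as follows. This is a minimal base R sketch rather than the exact code I run: `hash_string`, `cached_query`, `run_query` and `cache_dir` are hypothetical names, and MD5 via `tools::md5sum` is just one possible choice of hash.

```r
# Hypothetical helper: MD5 of a string, using only base R.
# tools::md5sum() hashes files, so the string is written to a temp file first.
hash_string <- function(x) {
  tmp <- tempfile()
  on.exit(unlink(tmp))
  writeBin(charToRaw(paste(x, collapse = "\n")), tmp)
  unname(tools::md5sum(tmp))
}

# Hypothetical helper: return the cached result for `sql` if {hash}.rds exists,
# otherwise call `run_query(sql)` and store its result under that name.
cached_query <- function(sql, run_query, cache_dir = "cache") {
  dir.create(cache_dir, showWarnings = FALSE, recursive = TRUE)
  path <- file.path(cache_dir, paste0(hash_string(sql), ".rds"))
  if (file.exists(path)) {
    return(readRDS(path))  # hash unchanged: read from disk
  }
  result <- run_query(sql) # hash is new: hit the database
  saveRDS(result, path)
  result
}
```

Here `run_query` would be whatever function sends the SQL to the database and returns a data frame, for example a thin wrapper around `DBI::dbGetQuery()` with the connection already bound.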

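I've been trying the same approach with dplyr. Unfortunately, the SQL query string that dplyr generates changes even when the operations stay the same: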

df %>% 
    select(week, person_id) %>% 
    group_by(person_id) %>% 
    mutate(weeks_active = n()) %>% 
    arrange(weeks_active) %>% 
    dplyr::sql_render() 

generates

<SQL> SELECT * 
FROM (SELECT "week", "person_id", COUNT(*) OVER (PARTITION BY "person_id") AS "weeks_active" 
FROM (SELECT "week" AS "week", "person_id" AS "person_id" 
FROM "fct_person_week") "zznunjjdwe") "ltyyfmiahu" 
ORDER BY "weeks_active" 

on the first run, and

<SQL> SELECT * 
FROM (SELECT "week", "person_id", COUNT(*) OVER (PARTITION BY "person_id") AS "weeks_active" 
FROM (SELECT "week" AS "week", "person_id" AS "person_id" 
FROM "fct_person_week") "stxupavckd") "oaknuxjexc" 
ORDER BY "weeks_active" 

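on the second run. Is there a way to keep the table aliases fixed? Will the rest of the query stay consistent across multiple runs? Or should I look into a different caching approach?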
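
One workaround that might be enough for hashing purposes is to replace the random aliases with a fixed placeholder before computing the hash. The sketch below assumes, based only on the two outputs shown above, that the generated aliases are always ten lowercase letters in double quotes directly after a closing parenthesis; `normalize_aliases` is a hypothetical helper, and the normalized string is meant as a cache key only, not as SQL to execute.

```r
# Hypothetical helper: collapse dplyr's random subquery aliases to one fixed
# name so that the same pipeline yields the same string on every run.
# Assumption: aliases look like `) "abcdefghij"` (ten lowercase letters).
normalize_aliases <- function(sql) {
  gsub('\\) "[a-z]{10}"', ') "alias"', as.character(sql))
}

# The two renderings from the question, differing only in their aliases.
run1 <- paste(
  'SELECT *',
  'FROM (SELECT "week", "person_id", COUNT(*) OVER (PARTITION BY "person_id") AS "weeks_active"',
  'FROM (SELECT "week" AS "week", "person_id" AS "person_id"',
  'FROM "fct_person_week") "stxupavckd") "oaknuxjexc"',
  'ORDER BY "weeks_active"',
  sep = "\n"
)
run2 <- sub("oaknuxjexc", "ltyyfmiahu", sub("stxupavckd", "zznunjjdwe", run1))

# After normalization both runs produce the same cache key input.
identical(normalize_aliases(run1), normalize_aliases(run2))
```

The real table name is untouched because it follows `FROM` rather than a closing parenthesis, but a pattern like this is fragile and would need checking against whatever dplyr version is installed.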

Could you set some kind of seed for the hash key? – sconfluentus

Answer

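You may be able to use compute() to create a temporary table. Another option is to take the generated SQL and turn it into a view, so that R developers can simply refer to it by table name.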