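Caching results from dplyr database backend

I'm using dplyr's database backend against an AWS Redshift database. Some of the queries take forever to return, so I'd like to cache them. I know the underlying data will not change, so if the query hasn't changed, the result set won't change either.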
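The approach I've taken elsewhere to achieve this is:
- hash the query string
- save the result of the query to a file named {hash}.rds
- on the next run of the script, if the hash hasn't changed, read the result from disk; otherwise re-run the query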
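
The steps above could be sketched roughly as follows. This is only a minimal illustration, not the code I actually run: `hash_query()` and `cached_query()` are hypothetical helper names, the `cache` directory is an assumption, and the hash is computed with `tools::md5sum()` on a temporary file so that nothing beyond base R is needed.

```r
# Hypothetical helper: MD5 hash of a query string, using only base R / tools.
hash_query <- function(sql) {
  tmp <- tempfile()
  on.exit(unlink(tmp))
  writeLines(as.character(sql), tmp)
  unname(tools::md5sum(tmp))
}

# Hypothetical helper: return the cached result for `sql` if {hash}.rds exists,
# otherwise call `run_query(sql)` and save its result to {hash}.rds.
cached_query <- function(sql, run_query, cache_dir = "cache") {
  dir.create(cache_dir, showWarnings = FALSE, recursive = TRUE)
  path <- file.path(cache_dir, paste0(hash_query(sql), ".rds"))
  if (file.exists(path)) {
    return(readRDS(path))
  }
  result <- run_query(sql)
  saveRDS(result, path)
  result
}
```

With a DBI connection `con` (assumed here), a call might look like `cached_query(query, function(q) DBI::dbGetQuery(con, q))`.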
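I've been trying the same approach with dplyr. Unfortunately, the SQL query string that dplyr generates changes even when the operations stay the same: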
df %>%
select(week, person_id) %>%
group_by(person_id) %>%
mutate(weeks_active = n()) %>%
arrange(weeks_active) %>%
dplyr::sql_render()
generates
<SQL> SELECT *
FROM (SELECT "week", "person_id", COUNT(*) OVER (PARTITION BY "person_id") AS "weeks_active"
FROM (SELECT "week" AS "week", "person_id" AS "person_id"
FROM "fct_person_week") "zznunjjdwe") "ltyyfmiahu"
ORDER BY "weeks_active"
on the first run and
<SQL> SELECT *
FROM (SELECT "week", "person_id", COUNT(*) OVER (PARTITION BY "person_id") AS "weeks_active"
FROM (SELECT "week" AS "week", "person_id" AS "person_id"
FROM "fct_person_week") "stxupavckd") "oaknuxjexc"
ORDER BY "weeks_active"
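on the second run. Is there a way to keep the table aliases fixed? Will the rest of the generated query stay consistent across multiple runs? Or should I be looking at other caching approaches?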
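
One workaround I can imagine, sketched here under a loud assumption, is to normalize the aliases before hashing instead of trying to fix them. `normalize_aliases()` is a hypothetical helper, and it assumes that every generated alias is exactly 10 lowercase letters in double quotes directly after a closing parenthesis, as in the two outputs above. I don't know whether that pattern is guaranteed, and a hand-written alias of the same shape would be rewritten too.

```r
# Hypothetical helper: replace generated subquery aliases with stable
# placeholders so that two renderings of the same pipeline compare equal.
# Assumption: an alias is 10 lowercase letters in double quotes after ") ".
normalize_aliases <- function(sql) {
  sql <- as.character(sql)
  # Find quoted 10-letter names that directly follow a closing parenthesis.
  found <- regmatches(sql, gregexpr('(?<=\\) )"[a-z]{10}"', sql, perl = TRUE))[[1]]
  aliases <- unique(found)
  # Number the aliases in order of first appearance and substitute them.
  for (i in seq_along(aliases)) {
    sql <- gsub(aliases[i], sprintf('"alias_%d"', i), sql, fixed = TRUE)
  }
  sql
}
```

The normalized string would only be used as the cache key; the original query would still be the one sent to the database.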
Could you set some kind of seed for the hash key? – sconfluentus