2013-11-26 65 views
5

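Calculating cumulative distinct entities with Redshift SQL

I am trying to get a cumulative count of distinct objects over a time series in Redshift. The straightforward approach would be COUNT(DISTINCT myfield) OVER (ORDER BY timefield DESC ROWS UNBOUNDED PRECEDING), but Redshift returns a "Window definition is not supported" error.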

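For example, the code below tries to find the cumulative number of distinct users for each week, from the first week until now. However, I get a "Window function not supported" error.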

SELECT user_time.weeks_ago, 
     COUNT(distinct user_time.user_id) OVER 
      (ORDER BY weeks_ago desc ROWS UNBOUNDED PRECEDING) as count 
FROM (SELECT FLOOR(EXTRACT(DAY FROM sysdate - ev.time)/7) AS weeks_ago, 
       ev.user_id as user_id 
     FROM events as ev 
     WHERE ev.action='some_user_action') as user_time 

The goal is to build a cumulative time series of the unique users who have performed an action. Any ideas on how to do this?

Answers

3

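Figured out the answer. The trick turned out to be a set of nested subqueries. The innermost one computes the week of each user's first action. The middle subquery counts those first actions per time period, and the outer query computes the cumulative sum over the time series: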

(SELECT engaged_per_week.week as week, 
     SUM(engaged_per_week.total) over (order by engaged_per_week.week DESC ROWS UNBOUNDED PRECEDING) as total 
FROM 
    -- COUNT OF FIRST TIME ENGAGEMENTS PER WEEK 
    (SELECT engaged.first_week AS week, 
      count(engaged.first_week) AS total 
    FROM 
     -- WEEK OF FIRST ENGAGEMENT FOR EACH USER 
     (SELECT MAX(FLOOR(EXTRACT(DAY FROM sysdate - ev.time)/7)) as first_week 
     FROM  events ev 
     WHERE ev.name='some_user_action' 
     GROUP BY ev.user_id) AS engaged 

    GROUP BY week) as engaged_per_week 
ORDER BY week DESC) as cumulative_engaged 
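
To check the logic outside Redshift, here is a minimal Python sketch of the same three steps. The sample events are hypothetical and not from the original post. It assumes, as in the query above, that a larger weeks_ago value means an earlier week.

```python
from collections import Counter
from itertools import accumulate

# Hypothetical sample events as (weeks_ago, user_id) pairs; not from the original post.
events = [(3, "a"), (3, "b"), (2, "a"), (2, "c"), (1, "b"), (0, "d"), (0, "a")]

# Inner query: each user's first week of activity is their largest weeks_ago value.
first_week = {}
for weeks_ago, user_id in events:
    first_week[user_id] = max(first_week.get(user_id, weeks_ago), weeks_ago)

# Middle query: number of first-time users per week.
new_users_per_week = Counter(first_week.values())

# Outer query: running total, ordered from the oldest week to the newest.
weeks = sorted(new_users_per_week, reverse=True)
running = list(accumulate(new_users_per_week[w] for w in weeks))
cumulative = dict(zip(weeks, running))


def distinct_users_up_to(week):
    """Direct cumulative distinct count, used only as a cross-check."""
    return len({u for w, u in events if w >= week})


for w in weeks:
    assert cumulative[w] == distinct_users_up_to(w)
print(cumulative)
```

Note that a week with no first-time users (week 1 in this sample) does not appear in the result, which matches the behaviour of the grouped query.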
1

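Here is how to apply this to the example cited here. I have also added another row that duplicates 'table' for '2015-01-01' to show how the count handles distinct values.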

The author of that example got the solution wrong, but I am just using his example data.

create table public.test 
(
    "date" date, 
    item varchar(8), 
    measure int 
);

insert into public.test 
    values 
     ('2015-01-01', 'table', 12), 
     ('2015-01-01', 'table', 120), 
     ('2015-01-01', 'chair', 51), 
     ('2015-01-01', 'lamp', 8), 
     ('2015-01-02', 'table', 17), 
     ('2015-01-02', 'chair', 72), 
     ('2015-01-02', 'lamp', 23), 
     ('2015-01-02', 'bed',  1), 
     ('2015-01-02', 'dresser', 2), 
     ('2015-01-03', 'bed',  1); 

WITH x AS (
    SELECT 
     *, 
     DENSE_RANK() 
     OVER (PARTITION BY date 
     ORDER BY item) AS dense_rank 
    FROM public.test 
) 
SELECT 
    "date", 
    item, 
    measure, 
    max(dense_rank) 
    OVER (PARTITION BY "date") 
FROM x 
ORDER BY 1; 

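The subquery gives you the dense rank of each item within each date. The main query then takes the maximum dense rank per date, which is the distinct count of items for that date.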

You need DENSE_RANK rather than plain RANK to count distinct values, because RANK skips numbers after ties.
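
As a rough illustration of why the maximum dense rank equals the distinct count, here is a small Python sketch. The rows mirror the (date, item) pairs inserted into public.test above. The helper names dense_ranks and plain_ranks, and the final sample list, are hypothetical and only for demonstration.

```python
from collections import defaultdict

# (date, item) pairs mirroring the rows inserted into public.test above.
rows = [
    ("2015-01-01", "table"), ("2015-01-01", "table"), ("2015-01-01", "chair"),
    ("2015-01-01", "lamp"), ("2015-01-02", "table"), ("2015-01-02", "chair"),
    ("2015-01-02", "lamp"), ("2015-01-02", "bed"), ("2015-01-02", "dresser"),
    ("2015-01-03", "bed"),
]


def dense_ranks(items):
    """Like DENSE_RANK() OVER (ORDER BY item): ties share a rank, no ranks skipped."""
    order = {item: i + 1 for i, item in enumerate(sorted(set(items)))}
    return [order[item] for item in items]


def plain_ranks(items):
    """Like RANK() OVER (ORDER BY item): ties share a rank, later ranks skip ahead."""
    return [1 + sum(other < item for other in items) for item in items]


# PARTITION BY "date"
items_by_date = defaultdict(list)
for date, item in rows:
    items_by_date[date].append(item)

# max(dense_rank) OVER (PARTITION BY "date")
distinct_per_date = {d: max(dense_ranks(items)) for d, items in items_by_date.items()}
print(distinct_per_date)

# Hypothetical partition (not in the original data) whose duplicate sorts first.
sample = ["chair", "chair", "lamp", "table"]
print(max(dense_ranks(sample)), max(plain_ranks(sample)))
```

For the hypothetical sample, the maximum dense rank is 3 (the number of distinct items), while the maximum plain rank is 4, because the tie on 'chair' pushes the later ranks up.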

+0

I also found that the same linked example doesn't work. But this helped. Thanks. – systemjack

+0

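What do you do when you don't want to return every row with 'select *'? I have a case where I want to count distinct customers over a one-month interval, but when I order by 'customer_id' within the partition, the result set contains every rank value, even though I only want the maximum for that month. – Merlin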

2

You should use DENSE_RANK instead of COUNT(DISTINCT):

DENSE_RANK() OVER(PARTITION BY weeks_ago ORDER BY user_time.user_id)