2016-08-22 41 views

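Counting new unique values in a growing time window

I have a large table of users (identified by a guid), some associated values, and a timestamp for when each row was inserted. A user may be associated with many rows in this table.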

guid | <other columns> | insertdate 

For each month, I want to count how many unique new users were inserted. This is easy to do by hand:

select count(distinct guid) 
from table 
where insertdate >= '20060201' and insertdate < '20060301' 
and guid not in (select guid from table where 
         insertdate >= '20060101' and insertdate < '20060201') 

How can this be done in SQL for each successive month?

I thought of using a ranking function to explicitly associate each guid with a month:

select guid 
,dense_rank() over (order by datepart(YYYY, insertdate), 
    datepart(m, insertdate)) as MonthRank 
from table 

and then iterating over each rank value:

declare @no_times int 
declare @counter int = 1 
select @no_times = count(distinct concat(datepart(year, insertdate), 
    datepart(month, insertdate))) from table 
while @no_times > 1 
begin 
    -- #ranked is assumed to hold the output of the ranking query above 
    select count(distinct guid), @counter + 1 
    from #ranked 
    where guid not in (select guid from #ranked where MonthRank = @counter) 
    and MonthRank = @counter + 1 
    set @counter += 1 
    set @no_times -= 1 
end 

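I know this strategy is probably the wrong way to go about it.

Ideally, I would like a result set like this:

MonthRank | NoNewUsers 

I would be very grateful if a SQL wizard could point me in the right direction.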

You could just group by, couldn't you? 'select count(distinct guid) as [count], datepart(mm, insertdate) as [Month] from table group by datepart(mm, insertdate)' – scsimon

Answers

0
SELECT 
    DATEPART(year,t.insertdate) AS YearNum 
    ,DATEPART(mm,t.insertdate) as MonthNum 
    ,COUNT(DISTINCT t.guid) AS NoNewUsers 
    ,DENSE_RANK() OVER (ORDER BY COUNT(DISTINCT t.guid) DESC) AS MonthRank 
FROM 
    table t 
    LEFT JOIN table t2 
    ON t.guid = t2.guid 
    AND t.insertdate > t2.insertdate 
WHERE 
    t2.guid IS NULL 
GROUP BY 
    DATEPART(year,t.insertdate) 
    ,DATEPART(mm,t.insertdate) 

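Use a left join to check whether the guid ever appeared in the table with an earlier insert date. If it did not, count it with an aggregate as you normally would. If you want to add a rank to see which month had the most new users, you can use your DENSE_RANK() function, but because you are already grouping the way you want, you don't need a partition clause.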
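To see the left-join idea on concrete rows, here is a minimal sketch that runs the same anti-join pattern against an in-memory SQLite database from Python. It is not SQL Server: the `events` table name and the sample rows are made up for illustration, and `strftime` stands in for `DATEPART`.

```python
import sqlite3

# Hypothetical sample data: (guid, insertdate). Not from the original thread.
rows = [
    ("a", "2006-01-05"),
    ("a", "2006-02-10"),
    ("b", "2006-02-03"),
    ("b", "2006-02-20"),
    ("d", "2006-02-11"),
    ("c", "2006-03-15"),
    ("a", "2006-03-01"),
]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (guid TEXT, insertdate TEXT)")
conn.executemany("INSERT INTO events VALUES (?, ?)", rows)

# Self left join: t2 matches any earlier row for the same guid.
# A row of t with no earlier match (t2.guid IS NULL) is that guid's first appearance.
query = """
SELECT strftime('%Y', t.insertdate) AS YearNum,
       strftime('%m', t.insertdate) AS MonthNum,
       COUNT(DISTINCT t.guid)       AS NoNewUsers
FROM events t
LEFT JOIN events t2
       ON t.guid = t2.guid
      AND t.insertdate > t2.insertdate
WHERE t2.guid IS NULL
GROUP BY strftime('%Y', t.insertdate), strftime('%m', t.insertdate)
ORDER BY 1, 2
"""
new_users_by_month = conn.execute(query).fetchall()
print(new_users_by_month)
```

With this sample, guid a is counted only in January even though it has rows in February and March, because those later rows each have an earlier match.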

0

If you want the first time each guid was entered, then your query doesn't quite work. You can get the first time with two aggregations:

select year(first_insertdate), month(first_insertdate), count(*) 
from (select t.guid, min(insertdate) as first_insertdate 
     from t 
     group by t.guid 
    ) t 
group by year(first_insertdate), month(first_insertdate) 
order by year(first_insertdate), month(first_insertdate); 

If you are looking to count guids each time they come back after skipping a month, then you can use lag():

select year(insertdate), month(insertdate), count(*) 
from (select t.*, 
      lag(insertdate) over (partition by guid order by insertdate) as prev_insertdate 
     from t 
    ) t 
where prev_insertdate is null or 
     datediff(month, prev_insertdate, insertdate) >= 2 
group by year(insertdate), month(insertdate) 
order by year(insertdate), month(insertdate); 
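
The lag() variant can be tried on a small data set as well. The sketch below is an approximation in SQLite (window functions need SQLite 3.25 or later), run from Python. SQLite has no `DATEDIFF`, so the month difference is computed as a difference of `year * 12 + month` values, which is assumed here to match `datediff(month, ...)`. The `events` table and its rows are hypothetical.

```python
import sqlite3

# Hypothetical sample data: (guid, insertdate). Not from the original thread.
rows = [
    ("a", "2006-01-05"), ("a", "2006-02-10"), ("a", "2006-05-01"),
    ("b", "2006-02-03"), ("b", "2006-02-20"), ("b", "2006-03-02"),
    ("c", "2006-03-15"), ("c", "2006-05-20"),
]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (guid TEXT, insertdate TEXT)")
conn.executemany("INSERT INTO events VALUES (?, ?)", rows)

# For each row, look up the same guid's previous insert date with LAG().
# Keep rows that are a first appearance (no previous row) or that follow
# a gap of at least two calendar months.
query = """
WITH ordered AS (
    SELECT guid,
           CAST(strftime('%Y', insertdate) AS INTEGER) AS y,
           CAST(strftime('%m', insertdate) AS INTEGER) AS m,
           LAG(insertdate) OVER (PARTITION BY guid ORDER BY insertdate)
               AS prev_insertdate
    FROM events
)
SELECT y, m, COUNT(*)
FROM ordered
WHERE prev_insertdate IS NULL
   OR (y * 12 + m)
      - (CAST(strftime('%Y', prev_insertdate) AS INTEGER) * 12
         + CAST(strftime('%m', prev_insertdate) AS INTEGER)) >= 2
GROUP BY y, m
ORDER BY y, m
"""
returning_or_new = conn.execute(query).fetchall()
print(returning_or_new)
```

In this sample, May counts both a (returning after February) and c (returning after March), while b is counted only once because it never skips a month.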
0

I solved it with an awful while loop, and then a friend helped me solve it much more efficiently in another way.

Loop version:

--ranked by month 
select t.guid 
,concat(datepart(year, t.InsertDate), datepart(month, 
t.InsertDate)) MonthRankName 
,dense_rank() over (order by datepart(YYYY, t.InsertDate), 
datepart(m, t.InsertDate)) as MonthRank 
into #ranked 
from table t; 

--iterate 
declare @counter int = 1 
declare @no_times int 
select @no_times = count(distinct concat(datepart(year, t.InsertDate), 
    datepart(month, t.InsertDate))) from table t; 
select count(distinct r.guid) as NewUnique, r.Monthrank into #results 
    from #ranked r 
    where r.MonthRank = 1 group by r.MonthRank; 
while @no_times > 1 
begin 
insert into #results 
select count(distinct rt.guid) as NewUnique, @counter + 1 as MonthRank 
from #ranked rt 
where rt.guid not in 
(
select rt2.guid from #ranked rt2 
where rt2.MonthRank = @counter 
) 
and rt.MonthRank = @counter + 1 
set @counter = @counter+1 
set @no_times = @no_times-1 
end 

select * from #results r 

This turned out to run very slowly (as you would expect).

What turned out to be about 10 times faster was this approach:

select t.guid, 
cast (concat(datepart(year, min(t.InsertDate)), 
case when datepart(month, min(t.InsertDate)) < 10 then 
'0'+cast(datepart(month, min(t.InsertDate)) as varchar(10)) 
else cast (datepart(month, min(t.InsertDate)) as varchar(10)) end 
) as int) as MonthRankName 

into #NewUnique 
from table t 
group by t.guid; 

select count(1) as NewUniques, t.MonthRankName from #NewUnique t 
group by t.MonthRankName 
order by t.MonthRankName 

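Simply determine the first month in which each guid appears, then count how many of those fall in each month. The rest is a bit of hackery to get the YearMonth nicely formatted (this seemed more efficient than format([date], 'yyyyMM'), though that needs more experimentation).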
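
The two-step idea (first month per guid, then a count per month) can be checked on a handful of rows. The sketch below uses SQLite from Python rather than SQL Server. The `events` table and the sample rows are assumptions for illustration, and `strftime('%Y%m', ...)` replaces the manual zero-padding.

```python
import sqlite3

# Hypothetical sample data: (guid, InsertDate). Not from the original thread.
rows = [
    ("a", "2006-01-05"),
    ("a", "2006-02-10"),
    ("b", "2006-02-03"),
    ("b", "2006-02-20"),
    ("d", "2006-02-11"),
    ("c", "2006-03-15"),
    ("a", "2006-03-01"),
]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (guid TEXT, InsertDate TEXT)")
conn.executemany("INSERT INTO events VALUES (?, ?)", rows)

# Step 1 (inner query): one row per guid, holding its first month as YYYYMM.
# Step 2 (outer query): count how many guids have each first month.
query = """
SELECT first_month AS MonthRankName, COUNT(*) AS NewUniques
FROM (
    SELECT guid,
           CAST(strftime('%Y%m', MIN(InsertDate)) AS INTEGER) AS first_month
    FROM events
    GROUP BY guid
) AS first_seen
GROUP BY first_month
ORDER BY first_month
"""
new_uniques = conn.execute(query).fetchall()
print(new_uniques)
```

An integer YearMonth could also be built arithmetically, as year * 100 + month, which avoids string padding altogether. Whether that is any faster in SQL Server has not been tested here.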