Counting new unique values in a growing time window

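I have a large table of users (identified by guid), some associated values, and a timestamp recording when each row was inserted. A user may be associated with many rows in this table.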

guid | <other columns> | insertdate

For each month, I want to count how many unique new users were inserted. This is easy to do by hand:

select count(distinct guid) 
from table 
where insertdate >= '20060201' and insertdate < '20060301' 
and guid not in (select guid from table where 
         insertdate >= '20060101' and insertdate < '20060201')

How can this be done in SQL for each successive month?

I thought of using a rank function to tag each guid explicitly with a month:

select guid 
,dense_rank() over (order by datepart(YYYY, insertdate), 
    datepart(m, insertdate)) as MonthRank 
from table

and then iterating over each rank value:

declare @no_times int 
declare @counter int = 1 
set @no_times = select count(distinct concat(datepart(year, t.TransactionDateTime), 
    datepart(month, t.TransactionDateTime))) from table 
while @no_times > 0 do 
(
select count(*), @counter 
where guid not in (select guid from table where rank = @counter) 
and rank = @int + 1 
@counter += 1 
@no_times -= 1 
union all 
) 
end

I know this strategy is probably the wrong way to go about things.

Ideally, I would like a result set like this:

MonthRank | NoNewUsers

I would be very grateful if a SQL wizard could point me in the right direction.

2016-08-22 titangroan

Couldn't you just group by? select count(distinct guid) as [count], datepart(mm, insertdate) as [Month] from table group by datepart(mm, insertdate) – scsimon

SELECT 
    DATEPART(year,t.insertdate) AS YearNum 
    ,DATEPART(mm,t.insertdate) as MonthNum 
    ,COUNT(DISTINCT t.guid) AS NoNewUsers 
    ,DENSE_RANK() OVER (ORDER BY COUNT(DISTINCT t.guid) DESC) AS MonthRank 
FROM 
    table t 
    LEFT JOIN table t2 
    ON t.guid = t2.guid 
    AND t.insertdate > t2.insertdate 
WHERE 
    t2.guid IS NULL 
GROUP BY 
    DATEPART(year,t.insertdate) 
    ,DATEPART(mm,t.insertdate)

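Use a left join to check whether the guid ever existed in the table with an earlier insert date. If it did not, the row is that user's first appearance, and you count it with an aggregate as you normally would. If you want to add a rank to see which month had the most new users, you can use your DENSE_RANK() function, but because you are already grouping the way you want, you do not need a PARTITION BY clause.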
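To make the left-join filter concrete, here is a minimal sketch on made-up data. The table variable @t and its rows are hypothetical stand-ins for the real table; only the join pattern comes from the answer above.

```sql
-- Hypothetical sample data; @t stands in for the real table.
DECLARE @t TABLE (guid varchar(10), insertdate datetime);
INSERT INTO @t (guid, insertdate) VALUES
    ('a', '20060105'),
    ('a', '20060210'),  -- repeat row for 'a', so not new in February
    ('b', '20060115'),
    ('c', '20060203'),
    ('c', '20060220'),
    ('d', '20060301');

SELECT
    YEAR(t.insertdate)     AS YearNum,
    MONTH(t.insertdate)    AS MonthNum,
    COUNT(DISTINCT t.guid) AS NoNewUsers
FROM @t AS t
LEFT JOIN @t AS t2
    ON  t2.guid = t.guid
    AND t2.insertdate < t.insertdate   -- any earlier row for the same guid
WHERE t2.guid IS NULL                  -- keep only rows with no earlier row
GROUP BY YEAR(t.insertdate), MONTH(t.insertdate)
ORDER BY YearNum, MonthNum;
-- Result: (2006, 1, 2), (2006, 2, 1), (2006, 3, 1)
```

Only the earliest row per guid survives the WHERE clause, so each user is counted once, in the month of their first insert.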

2016-08-22 20:21:01 Matt

If you want the first time each guid was entered, then your query doesn't quite work. You can get the first time using two aggregations:

select year(first_insertdate), month(first_insertdate), count(*) 
from (select t.guid, min(insertdate) as first_insertdate 
     from t 
     group by t.guid 
    ) t 
group by year(first_insertdate), month(first_insertdate) 
order by year(first_insertdate), month(first_insertdate);

If instead you want a guid to count again each time it reappears after skipping a month, then you can use lag():

select year(insertdate), month(insertdate), count(*) 
from (select t.*, 
      lag(insertdate) over (partition by guid order by insertdate) as prev_insertdate 
     from t 
    ) t 
where prev_insertdate is null or 
     datediff(month, prev_insertdate, insertdate) >= 2 
group by year(insertdate), month(insertdate) 
order by year(insertdate), month(insertdate);
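A small sketch of how the lag() filter behaves, assuming SQL Server 2012 or later (where LAG is available). The table variable @t and its rows are hypothetical. Note that DATEDIFF(month, ...) counts month boundaries crossed rather than elapsed days, so 31 January to 1 March gives 2.

```sql
-- Hypothetical sample data; @t stands in for the real table.
DECLARE @t TABLE (guid varchar(10), insertdate datetime);
INSERT INTO @t (guid, insertdate) VALUES
    ('a', '20060105'),  -- first row for 'a': counted in January
    ('a', '20060210'),  -- previous row is one month back: not counted
    ('a', '20060512'),  -- previous row is three months back: counted in May
    ('b', '20060131'),  -- first row for 'b': counted in January
    ('b', '20060301');  -- crosses two month boundaries: counted in March

SELECT YEAR(x.insertdate)  AS YearNum,
       MONTH(x.insertdate) AS MonthNum,
       COUNT(*)            AS NoNewUsers
FROM (
    SELECT t.guid,
           t.insertdate,
           LAG(t.insertdate) OVER (PARTITION BY t.guid
                                   ORDER BY t.insertdate) AS prev_insertdate
    FROM @t AS t
) AS x
WHERE x.prev_insertdate IS NULL
   OR DATEDIFF(month, x.prev_insertdate, x.insertdate) >= 2
GROUP BY YEAR(x.insertdate), MONTH(x.insertdate)
ORDER BY YearNum, MonthNum;
-- Result: (2006, 1, 2), (2006, 3, 1), (2006, 5, 1)
```

Whether a gap of one skipped calendar month is the right definition of "new again" depends on the reporting need; the threshold of 2 is simply the one used in the query above.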

2016-08-22 20:26:08

I solved it with a terrible while loop, and then a friend helped me solve it in another, more efficient way.

Loop version:

--ranked by month 
select t.guid 
,concat(datepart(year, t.InsertDate), datepart(month, 
t.InsertDate)) MonthRankName 
,dense_rank() over (order by datepart(YYYY, t.InsertDate), 
datepart(m, t.InsertDate)) as MonthRank 
into #ranked 
from table t; 

--iterate over the month ranks 
declare @counter int = 1 
declare @no_times int 
select @no_times = count(distinct concat(datepart(year, t.InsertDate), 
    datepart(month, t.InsertDate))) from table t; 
--seed the results with the first month 
select count(distinct r.guid) as NewUnique, r.MonthRank into #results 
    from #ranked r 
    where r.MonthRank = 1 group by r.MonthRank; 
while @no_times > 1 
begin 
insert into #results 
select count(distinct rt.guid) as NewUnique, @counter + 1 as MonthRank 
from #ranked rt 
--note: this only excludes guids seen in the immediately preceding month 
where rt.guid not in 
(
select rt2.guid from #ranked rt2 
where rt2.MonthRank = @counter 
) 
and rt.MonthRank = @counter + 1 
set @counter = @counter+1 
set @no_times = @no_times-1 
end 

select * from #results r

This turned out to run very slowly (as you would expect).

What turned out to be about 10 times faster was this approach:

select t.guid, 
cast (concat(datepart(year, min(t.InsertDate)), 
case when datepart(month, min(t.InsertDate)) < 10 then 
'0'+cast(datepart(month, min(t.InsertDate)) as varchar(10)) 
else cast (datepart(month, min(t.InsertDate)) as varchar(10)) end 
) as int) as MonthRankName 

into #NewUnique 
from table t 
group by t.guid; 

select count(1) as NewUniques, t.MonthRankName from #NewUnique t 
group by t.MonthRankName 
order by t.MonthRankName

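Simply determine the first month in which each guid appears, then count how many guids fall in each month. There is also a bit of a hack to get the year-month nicely formatted (this seems to be more efficient than format([date], 'yyyyMM'), but it needs more experimentation).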
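As one possible alternative to the string concatenation hack, the year-month key can be built with plain integer arithmetic. This is only a sketch on hypothetical data (the table variable @t and its rows are made up), and its speed relative to the CONCAT or FORMAT versions has not been measured here.

```sql
-- Hypothetical sample data; @t stands in for the real table.
DECLARE @t TABLE (guid varchar(10), InsertDate datetime);
INSERT INTO @t (guid, InsertDate) VALUES
    ('a', '20051230'),
    ('a', '20060105'),  -- 'a' first appeared in December 2005
    ('b', '20060115'),
    ('c', '20061103'),
    ('c', '20061220'),
    ('d', '20060120');

SELECT f.YearMonth, COUNT(*) AS NewUniques
FROM (
    -- One row per guid, keyed by the month of its first insert, e.g. 200601.
    SELECT t.guid,
           YEAR(MIN(t.InsertDate)) * 100 + MONTH(MIN(t.InsertDate)) AS YearMonth
    FROM @t AS t
    GROUP BY t.guid
) AS f
GROUP BY f.YearMonth
ORDER BY f.YearMonth;
-- Result: (200512, 1), (200601, 2), (200611, 1)
```

Because the key is an int with the year in the high digits, ordering by it sorts chronologically without any zero-padding logic.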

2016-08-24 20:47:54 titangroan
