2017-03-22 52 views

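BigQuery: How to sample a running total over time

I have a BigQuery table that records each purchase of an item from a store. It contains an ItemID and a timestamp. I am interested in the running total of purchases for each item. I have this query, which generates the running total: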

SELECT ItemID,timestamp,count(*) 
OVER 
    (PARTITION BY ItemID 
    ORDER BY timestamp ASC, ItemID) AS runningtotal 
from 
(
    SELECT * FROM [mydb.purchases] 
) 
ORDER BY timestamp 

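This table has hundreds of thousands of rows. What I want to do now is take a period of time (for example, one week) and get 100 samples of the running total for each ItemID within that week, so that I can plot a graph without too many data points. I don't know how to do this. I can get 100 samples by filtering with something like "WHERE rownumber % (rowcount/100) = 0", but how can I do that for every ItemID in the table? Do I need to run a separate subquery for each ItemID and then union the results? Thanks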


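Important on SO: you can mark an answer as accepted by using the tick to the left of the posted answer, below the voting arrows. See http://meta.stackexchange.com/questions/5234/how-does-accepting-an-answer-work#5235 for why this matters. Voting on answers is important too, so vote up the answers that are helpful. You can also check what to do when someone answers your question: http://stackoverflow.com/help/someone-answers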

Answers


With standard SQL, you can use the LIMIT clause inside the ARRAY_AGG function to first collect a sample of 100 timestamps per ItemID:

#standardSQL 
-- Note: COUNT(*) runs over the sampled rows only, so running_total 
-- never exceeds 100 for any ItemID. 
SELECT ItemID, timestamp, COUNT(*) 
OVER (PARTITION BY ItemID ORDER BY timestamp ASC) AS running_total 
FROM (
SELECT ItemID, ARRAY_AGG(timestamp LIMIT 100) timestamps 
FROM `mydb.purchases` 
GROUP BY ItemID) t, t.timestamps timestamp 
ORDER BY timestamp 

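ARRAY_AGG with LIMIT but no ORDER BY returns an arbitrary 100 elements. If that is not good enough, you can shuffle the timestamps with RAND() to take a random sample: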

#standardSQL 
-- GROUP BY ItemID makes ARRAY_AGG build one array of samples per item. 
SELECT ItemID, timestamp, COUNT(*) 
OVER (PARTITION BY ItemID ORDER BY timestamp ASC) AS running_total 
FROM (
SELECT ItemID, ARRAY_AGG(timestamp ORDER BY RAND() LIMIT 100) timestamps 
FROM `mydb.purchases` 
GROUP BY ItemID) t, t.timestamps timestamp 
ORDER BY timestamp 
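
One thing to watch with sampling the timestamps first: the COUNT(*) window then only sees the sampled rows, so it counts samples rather than purchases. If the value you want to plot is the running total over all purchases, a possible variation is to compute the running total first and then sample (timestamp, running_total) pairs. This is only a sketch, not tested against your data; the names samples, s and running_total are made up for illustration:

```sql
#standardSQL
-- Sketch only: compute the running total over all rows first, then keep
-- up to 100 random (timestamp, running_total) pairs per ItemID.
SELECT t.ItemID, s.timestamp, s.running_total
FROM (
  SELECT
    ItemID,
    ARRAY_AGG(
      STRUCT(timestamp AS timestamp, running_total AS running_total)
      ORDER BY RAND() LIMIT 100) AS samples
  FROM (
    -- Running total over the complete purchase history of each item.
    SELECT
      ItemID,
      timestamp,
      COUNT(*) OVER (PARTITION BY ItemID ORDER BY timestamp ASC) AS running_total
    FROM `mydb.purchases`
  )
  GROUP BY ItemID
) AS t, UNNEST(t.samples) AS s
ORDER BY s.timestamp
```

Because the pairs are picked at random, the points will not be evenly spaced in time, which may or may not matter for a plot.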

The query below does exactly what you described in terms of sampling.
I left out the part that selects a week's worth of data, because it is trivial.

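#standardSQL 
SELECT 
    ItemID, 
    timestamp, 
    runningtotal 
FROM (
    SELECT 
        ItemID, 
        timestamp, 
        -- running total of purchases per item 
        COUNT(1) OVER (PARTITION BY ItemID ORDER BY timestamp ASC) AS runningtotal, 
        -- position of the row within its item, and the item's total row count 
        ROW_NUMBER() OVER (PARTITION BY ItemID ORDER BY timestamp ASC) AS rownumber, 
        COUNT(1) OVER (PARTITION BY ItemID) AS rowcount 
    FROM `mydb.purchases` 
) 
-- Keep every Nth row, N = rowcount/100. GREATEST(1, ...) avoids a 
-- division by zero for items with fewer than 50 rows, where the CAST gives 0. 
WHERE MOD(rownumber, GREATEST(1, CAST(rowcount/100 AS INT64))) = 0 
-- ORDER BY ItemID, timestamp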
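
For completeness, here is one way the week filter could be combined with the sampling. Treat it as a sketch under assumptions: the week boundaries are made-up values, the timestamp column is assumed to be of type TIMESTAMP, and the names totals and in_week are only for illustration. The running total is computed before the filter so that it does not restart at the beginning of the week, while rownumber and rowcount are computed after it so that the samples are spread over the week only. The GREATEST(1, ...) keeps the step from becoming 0 for items with few rows in that week.

```sql
#standardSQL
WITH totals AS (
  -- Running total over the full history of each item.
  SELECT
    ItemID,
    timestamp,
    COUNT(1) OVER (PARTITION BY ItemID ORDER BY timestamp ASC) AS runningtotal
  FROM `mydb.purchases`
),
in_week AS (
  -- Keep one week of rows (hypothetical dates) and number them per item.
  -- WHERE is applied before the window functions of this SELECT.
  SELECT
    ItemID,
    timestamp,
    runningtotal,
    ROW_NUMBER() OVER (PARTITION BY ItemID ORDER BY timestamp ASC) AS rownumber,
    COUNT(1) OVER (PARTITION BY ItemID) AS rowcount
  FROM totals
  WHERE timestamp >= TIMESTAMP '2017-03-13 00:00:00+00'
    AND timestamp < TIMESTAMP '2017-03-20 00:00:00+00'
)
SELECT ItemID, timestamp, runningtotal
FROM in_week
-- Keep every Nth row of each item, with N of at least 1.
WHERE MOD(rownumber, GREATEST(1, CAST(rowcount/100 AS INT64))) = 0
ORDER BY ItemID, timestamp
```

If you would rather have the count start from zero at the beginning of the week, move the WHERE clause into the totals subquery instead.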