2015-09-27 133 views
0

考虑下表(简本):BIGQUERY选择不重复的记录

id int, 
amount decimal, 
transaction_no, 
location_id int, 
created_at datetime 

上面的架构用于存储POS收据餐厅。现在,此表有时包含同一日期的收据,相同transaction_no at same location_id

在这种情况下,我想要做的就是最后收到的是为了通过created_at递减 LOCATION_ID & transaction_no。

在MySQL中,我使用下面的查询这让我最后一个(max(created_at)收到一个LOCATION_ID & transaction_no:BigQuery中的

SELECT id, amount, transaction_no, location_id, created_at 
FROM receipts r JOIN 
(SELECT transaction_no, max(created_at) AS maxca 
    FROM receipts r 
    GROUP BY transaction_no 
) t 
ON r.transaction_no = t.transaction_no AND r.created_at = t.maxca 
group by location_id; 

但是当我运行相同的,我得到以下错误:

Query Failed Error: Shuffle reached broadcast limit for table __I0 (broadcasted at least 150393576 bytes). Consider using partitioned joins instead of broadcast joins . Job ID: circular-gist-812:job_A_CfsSKJICuRs07j7LHVbkqcpSg

任何想法如何使上述查询在BigQuery中工作?

回答

2
SELECT id, amount, transaction_no, location_id, created_at 
FROM (
    SELECT 
    id, amount, transaction_no, location_id, created_at, 
    ROW_NUMBER() OVER(PARTITION BY transaction_no, location_id 
        ORDER BY created_at DESC) as last 
    FROM your_dataset.your_table 
) 
WHERE last = 1 
+0

这给了我扁平化的输出。有没有一种方法可以保持嵌套结构?我已经在[bigquery web ui](https://www.evernote.com/l/ACRfvDG5ERpJPLBZ9ooZpOOIb_WKFvhEs6wB/image.png)中取消选中了Flatten结果选项 – CuriousMind