Check whether data already exists before inserting into a BigQuery table (using Python)

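I'm setting up a daily cron job that appends a row to a BigQuery table (using Python), but duplicate data is being inserted. I've searched online and I know there is a way to remove duplicate data manually, but I'd like to see whether I can avoid the duplication in the first place.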

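Is there a way to check the BigQuery table first to see whether a record already exists, so that I can avoid inserting duplicate data? Thanks.

Code snippet: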

import logging
import uuid

import webapp2
from googleapiclient import discovery
from oauth2client.client import GoogleCredentials

PROJECT_ID = 'foo'
DATASET_ID = 'bar'
TABLE_ID = 'foo_bar_table'


class UpdateTableHandler(webapp2.RequestHandler):
    def get(self):
        credentials = GoogleCredentials.get_application_default()
        service = discovery.build('bigquery', 'v2', credentials=credentials)

        try:
            # Stuff is a model class defined elsewhere in the app (not shown).
            the_fruits = Stuff.query(Stuff.fruitTotal >= 5).filter(Stuff.fruitColor == 'orange').fetch()

            for fruit in the_fruits:
                # some code here

                # Build the row to stream into BigQuery.
                basket = dict()
                basket['id'] = fruit.fruitId
                basket['Total'] = fruit.fruitTotal
                basket['PrimaryVitamin'] = fruit.fruitVitamin
                basket['SafeRaw'] = fruit.fruitEdibleRaw
                basket['Color'] = fruit.fruitColor
                basket['Country'] = fruit.fruitCountry

                body = {
                    'rows': [
                        {
                            'json': basket,
                            # A random insertId differs on every run, so it
                            # never matches a row inserted by an earlier run.
                            'insertId': str(uuid.uuid4())
                        }
                    ]
                }

                response = service.tabledata().insertAll(projectId=PROJECT_ID,
                                                         datasetId=DATASET_ID,
                                                         tableId=TABLE_ID,
                                                         body=body).execute(num_retries=5)
                logging.info(response)

        except Exception as e:
            logging.error(e)


app = webapp2.WSGIApplication([
    ('/update_table', UpdateTableHandler),
], debug=True)


2016-10-04 fragilewindows

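It seems the search would be expensive, unless the data is from the last 24 hours, in which case you could search only that partition.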

The only way to test whether the data already exists is to run a query.
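A minimal sketch of such a check, reusing the `googleapiclient` service object from the question. It assumes the `id` column is a STRING that uniquely identifies a record and that the query finishes within the default timeout; the helper names `build_exists_query` and `record_exists` are hypothetical.

```python
def build_exists_query(project_id, dataset_id, table_id, record_id):
    """Build a jobs.query request body that counts rows with the given id."""
    sql = (
        'SELECT COUNT(*) AS n '
        'FROM `{}.{}.{}` '
        'WHERE id = @record_id'
    ).format(project_id, dataset_id, table_id)
    return {
        'query': sql,
        'useLegacySql': False,
        'parameterMode': 'NAMED',
        'queryParameters': [{
            'name': 'record_id',
            'parameterType': {'type': 'STRING'},
            'parameterValue': {'value': record_id},
        }],
    }


def record_exists(service, project_id, dataset_id, table_id, record_id):
    """Return True if at least one row with this id is already in the table.

    `service` is the object returned by discovery.build('bigquery', 'v2', ...).
    """
    body = build_exists_query(project_id, dataset_id, table_id, record_id)
    response = service.jobs().query(projectId=project_id, body=body).execute()
    if not response.get('jobComplete'):
        # The query did not finish within the request timeout.
        raise RuntimeError('query did not complete in time')
    # One row, one column; the API returns cell values as strings.
    return int(response['rows'][0]['f'][0]['v']) > 0
```

In the handler this would be called as `record_exists(service, PROJECT_ID, DATASET_ID, TABLE_ID, fruit.fruitId)` before `insertAll`. Note that every call runs a query against the table, which is the cost discussed in this answer.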

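If the table holds a lot of data, that query can be expensive, so in most cases we suggest that you go ahead and insert the duplicates, then merge the duplicates afterwards.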
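One way to do the later merge is a query job that keeps a single row per key and overwrites the table with the result. The sketch below only builds the request body for `jobs().insert`. It assumes `id` is the deduplication key and that any one of the duplicate rows may be kept; the helper name `build_dedup_job` is hypothetical.

```python
def build_dedup_job(project_id, dataset_id, table_id, key_column='id'):
    """Build a jobs.insert body that rewrites the table without duplicates."""
    table = '`{}.{}.{}`'.format(project_id, dataset_id, table_id)
    # Number the rows within each key and keep only the first one.
    sql = (
        'SELECT * EXCEPT(row_num) FROM ('
        'SELECT *, ROW_NUMBER() OVER (PARTITION BY {key}) AS row_num '
        'FROM {table}) '
        'WHERE row_num = 1'
    ).format(key=key_column, table=table)
    return {
        'configuration': {
            'query': {
                'query': sql,
                'useLegacySql': False,
                # Write the result back over the source table.
                'destinationTable': {
                    'projectId': project_id,
                    'datasetId': dataset_id,
                    'tableId': table_id,
                },
                'writeDisposition': 'WRITE_TRUNCATE',
            }
        }
    }
```

The job would be started with `service.jobs().insert(projectId=PROJECT_ID, body=build_dedup_job(PROJECT_ID, DATASET_ID, TABLE_ID)).execute()`. Pointing `destinationTable` at a separate table instead is a more cautious variant, since the original rows are then left untouched.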

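As Zig Mandel suggested in the comments, you can query a single date partition if you know the date on which you expect to see the record, but that can still be expensive compared with inserting the duplicates and removing them afterwards.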
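A sketch of the partition-restricted variant, assuming the table is partitioned by ingestion day (so the `_PARTITIONTIME` pseudo column is available) and that `id` is a STRING key. The helper name `build_partition_exists_query` is hypothetical; the returned body is meant for `service.jobs().query(projectId=..., body=...)`.

```python
import datetime


def build_partition_exists_query(project_id, dataset_id, table_id,
                                 record_id, day):
    """Build a jobs.query body that looks for an id in one daily partition."""
    sql = (
        'SELECT COUNT(*) AS n '
        'FROM `{}.{}.{}` '
        # Restricting on _PARTITIONTIME limits the scan to a single partition.
        'WHERE _PARTITIONTIME = TIMESTAMP(@day) '
        'AND id = @record_id'
    ).format(project_id, dataset_id, table_id)
    return {
        'query': sql,
        'useLegacySql': False,
        'parameterMode': 'NAMED',
        'queryParameters': [
            {
                'name': 'day',
                'parameterType': {'type': 'DATE'},
                'parameterValue': {'value': day.isoformat()},
            },
            {
                'name': 'record_id',
                'parameterType': {'type': 'STRING'},
                'parameterValue': {'value': record_id},
            },
        ],
    }


# Example: look for record 'abc' in the partition for 4 October 2016.
example_body = build_partition_exists_query(
    'foo', 'bar', 'foo_bar_table', 'abc', datetime.date(2016, 10, 4))
```

The count comes back in the same shape as for any other query response, so the only difference from an unrestricted check is the amount of data scanned.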


2016-10-04 16:28:25
