2017-06-22

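I want to iterate over a large dataset using a queryset iterator. Django provides iterator() for this, but it hits the database on every iteration. I found the code block below for iterating in chunks (Django Queryset iterator for ordered querysets):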

import gc

def queryset_iterator(queryset, chunksize=1000):
    '''
    Iterate over a Django Queryset ordered by the primary key.

    This method loads a maximum of chunksize (default: 1000) rows in its
    memory at the same time, while Django normally would load all rows in its
    memory. Using the iterator() method only causes it to not preload all the
    classes.

    Note that the implementation of the iterator
    does not support ordered query sets.
    '''
    pk = 0
    # Highest primary key at the moment iteration starts.
    last_pk = queryset.order_by('-pk').values_list('pk', flat=True).first()
    if last_pk is not None:
        queryset = queryset.order_by('pk')
        while pk < last_pk:
            # Fetch the next chunk of rows after the last pk seen.
            for row in queryset.filter(pk__gt=pk)[:chunksize]:
                pk = row.pk
                yield row
            gc.collect()

This works for unordered querysets. Is there any solution or workaround for doing this on an ordered queryset?

Answer


Here is mine, with ordering support.

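By the way, the iterator you are using has an endless loop when items of the queryset are modified while it runs: deleted or added, even a single item.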
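
To make the endless loop concrete, here is a minimal sketch of one way it can happen, as far as I can tell: the row that held last_pk is deleted while the loop is running, so pk can never reach it. This uses plain integers instead of a real queryset. The names iterate_by_pk, fetch_after and max_rounds are hypothetical and exist only for this demo; max_rounds is a guard added so the example stops instead of hanging.

```python
def iterate_by_pk(fetch_after, last_pk, chunksize=2, max_rounds=10):
    """Same loop shape as the pk-based iterator, over plain integers.

    fetch_after(pk, n) is a hypothetical stand-in for
    queryset.filter(pk__gt=pk)[:n]. max_rounds is a safety guard that
    the original code does not have.
    """
    pk = 0
    rounds = 0
    while pk < last_pk:
        rounds += 1
        if rounds > max_rounds:
            raise RuntimeError("no progress: pk never reaches last_pk")
        for row_pk in fetch_after(pk, chunksize):
            pk = row_pk
            yield row_pk


table = [1, 2, 3, 4, 5]  # primary keys currently in the "table"


def fetch_after(pk, n):
    # Ascending pks greater than pk, at most n of them.
    return [p for p in sorted(table) if p > pk][:n]


seen = []
stalled = False
try:
    for row_pk in iterate_by_pk(fetch_after, last_pk=max(table)):
        seen.append(row_pk)
        if row_pk == 2:
            table.remove(5)  # the row holding last_pk is deleted mid-run
except RuntimeError:
    stalled = True

print(seen, stalled)
```

After pk reaches 4, every further chunk is empty, so pk stays below the stale last_pk of 5 and only the guard ends the loop.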

And the iterator below has no need for last_pk:

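import gc

def queryset_iterator(queryset, chunksize=10000, key=None):
    # Accept a single field name or a list of field names; default to 'pk'.
    # (On Python 2, test against basestring instead of str.)
    key = [key] if isinstance(key, str) else (key or ['pk'])
    counter = 0        # total rows yielded so far
    count = chunksize  # rows yielded by the latest chunk
    # A chunk shorter than chunksize means the end was reached.
    while count == chunksize:
        offset = counter - counter % chunksize
        count = 0
        for item in queryset.all().order_by(*key)[offset:offset + chunksize]:
            count += 1
            yield item
        counter += count
        gc.collect()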
No useless queries.
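
A rough usage sketch, in case it helps. It runs without Django by using FakeQuerySet, a hypothetical list-backed stand-in that supports only all(), order_by() and slicing. With a real project you would pass an actual queryset instead. The iterator is repeated here (with str in place of basestring) only so the example is self-contained.

```python
import gc


def queryset_iterator(queryset, chunksize=10000, key=None):
    # Same logic as the iterator above, repeated so this sketch runs alone.
    key = [key] if isinstance(key, str) else (key or ['pk'])
    counter = 0
    count = chunksize
    while count == chunksize:
        offset = counter - counter % chunksize
        count = 0
        for item in queryset.all().order_by(*key)[offset:offset + chunksize]:
            count += 1
            yield item
        counter += count
        gc.collect()


class FakeQuerySet:
    """Hypothetical list-backed stand-in for a queryset; NOT a Django class."""

    def __init__(self, rows):
        self.rows = list(rows)

    def all(self):
        return FakeQuerySet(self.rows)

    def order_by(self, *keys):
        rows = list(self.rows)
        # Apply keys right to left; Python's sort is stable, so the
        # leftmost key ends up as the primary sort key.
        for k in reversed(keys):
            rows.sort(key=lambda r: r[k.lstrip('-')], reverse=k.startswith('-'))
        return FakeQuerySet(rows)

    def __getitem__(self, item):
        return self.rows[item]


rows = [{'pk': i, 'score': s} for i, s in enumerate([30, 10, 20, 50, 40], start=1)]
qs = FakeQuerySet(rows)

# Descending by score, fetched two rows per chunk.
result = [r['score'] for r in queryset_iterator(qs, chunksize=2, key='-score')]
print(result)
```

One thing to keep in mind: each chunk is a separate offset query, so if the sort key is not unique, the database may order ties differently between chunks. Passing something like key=['-score', 'pk'] adds pk as a tie-breaker.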