从itertools构建一个大的numpy数组产品

-1

我想从itertools.product的结果构建一个numpy数组。我的第一种方法是简单的：从itertools构建一个大的numpy数组产品

from itertools import product 
import numpy as np 

max_init = 6 
init_values = range(1, max_init + 1) 
repetitions = 12 

result = np.array(list(product(init_values, repeat=repetitions)))

此代码可以很好地用于“小” repetitions（如< = 4），但与“大”的值（> = 12）它完全猪存储器和崩溃。我认为构建这个列表就是吃所有RAM的东西，所以我搜索了如何直接使用数组。我发现Numpy equivalent of itertools.product和Using numpy to build an array of all combinations of two arrays。

所以，我测试了以下选择：

备选＃1：

results = np.empty((max_init**repetitions, repetitions)) 
for i, row in enumerate(product(init_values, repeat=repetitions)): 
    result[:][i] = row

备选＃2：

init_values_args = [init_values] * repetitions 
results = np.array(np.meshgrid(*init_values_args)).T.reshape(-1, repetitions)

备选＃3：

results = np.indices([sides] * num_dice).reshape(num_dice, -1).T + 1

＃1是非常缓慢的。我没有足够的耐心让它完成（在2017年MacBook Pro上处理几分钟后）。＃2 and ＃3直到python解释器崩溃，像最初的方法一样，吃掉所有的内存。

之后，我认为我可以以不同的方式表达相同的信息，这对我仍然有用：一个dict其中键将是所有可能的（排序的）组合，并且值将是这些组合。所以，我想：

替代＃4：

from collections import Counter 

def sorted_product(iterable, repeat=1): 
    for el in product(iterable, repeat=repeat): 
     yield tuple(sorted(el)) 

def count_product(repetitions=1, max_init=6): 
    init_values = range(1, max_init + 1) 
    sp = sorted_product(init_values, repeat=repetitions) 
    counted_sp = Counter(sp) 
    return np.array(list(counted_sp.values())), \ 
     np.array(list(counted_sp.keys())) 

cnt, values = count(repetitions=repetitions, max_init=max_init)

而行counted_sp = Counter(sp)，从而触发得到发电机的所有值，也就是我的需要太慢（也花了好几分钟，我取消了）。

是否有另一种方式来生成相同的数据（或包含相同信息的不同数据结构），但没有提到的太慢或使用太多内存的缺点？ PS：我测试了以上所有的实现对我的测试与一个小的repetitions，并通过了所有的测试，所以他们给出一致的结果。

我希望编辑这个问题是扩展它的最好方法。否则，让我知道，我会在我应该的地方编辑帖子。

在阅读下面的前两个答案并思考它之后，我同意我从错误的角度接近问题。我不应该用“暴力”的方法来使用概率，而应该使用它。

我的意图是，稍后，对于每个组合： - 计算有多少个值低于阈值X. - 计算有多少值等于或超过阈值X并低于阈值Y. - 计算有多少值超过阈值Y. 并对具有相同计数的组合进行分组。

作为说明性实例：如果我辊6下侧的12个骰子，什么是具有M个骰子具有值< 3，N骰子具有值> = 3和< 4，和P骰子具有值的概率> 5，所有可能的M，N和P组合？

所以，我认为我会在几天内完成这个问题，而我将采用这种新方法。感谢您的所有反馈和您的时间！

来源

2017-10-15 Sergi

可能想看看这个：https://stackoverflow.com/a/11146645/4909087 –

为什么你想这样做？假设你可以成功创建np.array，你会怎么做？ –

list(product(range(1,7), repeats=12))所做的数字元组是6 ** 12,2176,782,336。无论是对于大多数计算机来说可能太大的列表或数组。

In [119]: len(list(product(range(1,7),repeat=12))) 
.... 
MemoryError:

试图让该大小直接的数组：

In [129]: A = np.ones((6**12,12),int) 
--------------------------------------------------------------------------- 
ValueError        Traceback (most recent call last) 
<ipython-input-129-e833a9e859e0> in <module>() 
----> 1 A = np.ones((6**12,12),int) 

/usr/local/lib/python3.5/dist-packages/numpy/core/numeric.py in ones(shape, dtype, order) 
    190 
    191  """ 
--> 192  a = empty(shape, dtype, order) 
    193  multiarray.copyto(a, 1, casting='unsafe') 
    194  return a 

ValueError: Maximum allowed dimension exceeded

阵列存储器的大小，每个项目

In [130]: 4*12*6**12 
Out[130]: 104,485,552,128

100GB≠4个字节

为什么你需要生成7个数字的2B组合？

因此，与您的柜台从36减少项目

的

In [138]: sp = sorted_product(range(1,7), 2) 
In [139]: counter=Counter(sp) 
In [140]: counter 
Out[140]: 
Counter({(1, 1): 1, 
     (1, 2): 2, 
     (1, 3): 2, 
     (1, 4): 2, 
     (1, 5): 2, 
     (1, 6): 2, 
     (2, 2): 1, 
     (2, 3): 2, 
     (2, 4): 2, 
     (2, 5): 2, 
     (2, 6): 2, 
     (3, 3): 1, 
     (3, 4): 2, 
     (3, 5): 2, 
     (3, 6): 2, 
     (4, 4): 1, 
     (4, 5): 2, 
     (4, 6): 2, 
     (5, 5): 1, 
     (5, 6): 2, 
     (6, 6): 1})

数21（2重复）。把它推广到更多的重复（组合？排列？）应该不难，它仍然会推动时间和/或内存边界。使用mgrid

一个变种上meshgrid：

In [175]: n=7; A=np.mgrid[[slice(1,7)]*n].reshape(n,-1).T 
In [176]: A.shape 
Out[176]: (279936, 7) 
In [177]: B=np.array(list(product(range(1,7),repeat=7))) 
In [178]: B.shape 
Out[178]: (279936, 7) 
In [179]: A[:10] 
Out[179]: 
array([[1, 1, 1, 1, 1, 1, 1], 
     [1, 1, 1, 1, 1, 1, 2], 
     [1, 1, 1, 1, 1, 1, 3], 
     [1, 1, 1, 1, 1, 1, 4], 
     [1, 1, 1, 1, 1, 1, 5], 
     [1, 1, 1, 1, 1, 1, 6], 
     [1, 1, 1, 1, 1, 2, 1], 
     [1, 1, 1, 1, 1, 2, 2], 
     [1, 1, 1, 1, 1, 2, 3], 
     [1, 1, 1, 1, 1, 2, 4]]) 
In [180]: B[:10] 
Out[180]: 
array([[1, 1, 1, 1, 1, 1, 1], 
     [1, 1, 1, 1, 1, 1, 2], 
     [1, 1, 1, 1, 1, 1, 3], 
     [1, 1, 1, 1, 1, 1, 4], 
     [1, 1, 1, 1, 1, 1, 5], 
     [1, 1, 1, 1, 1, 1, 6], 
     [1, 1, 1, 1, 1, 2, 1], 
     [1, 1, 1, 1, 1, 2, 2], 
     [1, 1, 1, 1, 1, 2, 3], 
     [1, 1, 1, 1, 1, 2, 4]]) 
In [181]: np.allclose(A,B)

mgrid是相当快一点：

In [182]: timeit B=np.array(list(product(range(1,7),repeat=7))) 
317 ms ± 3.58 ms per loop (mean ± std. dev. of 7 runs, 1 loop each) 
In [183]: timeit A=np.mgrid[[slice(1,7)]*n].reshape(n,-1).T 
13.9 ms ± 242 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

但是，是的，它会具有相同的总内存使用和限制。

当n = 10，

ValueError: array is too big; `arr.size * arr.dtype.itemsize` is larger than the maximum possible size.

来源

2017-10-15 23:14:03 hpaulj

正确的答案是：不。无论您想要对所有这些组合进行什么操作，都要调整您的方法，以便一次生成一个并立即使用它们，而不必存储它们，或者更好地找到一种方法在不检查每个组合的情况下完成工作。您当前的解决方案适用于玩具问题，但不适用于较大的参数。解释你在做什么，也许这里有人可以提供帮助。

来源

2017-10-15 23:52:02 alexis

从itertools构建一个大的numpy数组产品

回答

相关问题