Scikit-learn Minibatch KMeans：稀疏与稠密矩阵聚类

我刚刚写了一段代码，我很难理解，任何帮助都会非常感激。问题是：为什么在一个稀疏矩阵上进行聚类需要更多的时间，更多的内存和更多的行为，而不是聚集在密集格式的相同矩阵上？Scikit-learn Minibatch KMeans：稀疏与稠密矩阵聚类

这是代码。它只是在做下面的密集和稀疏矩阵：

创建一个100K×500矩阵
飞度MinibatchKMeans估计在矩阵（我们不关心结果）
显示适合估算器的时间

在两个基准测试之间，手动存储器被垃圾收集（以确保我们重新开始）。

#!.env/bin/python 

# -*- coding: utf8 -*- 

import time 
import gc 
import numpy as np 
from scipy.sparse import csr_matrix 
from sklearn.cluster import MiniBatchKMeans 
from memory_profiler import profile 


@profile 
def bench_dense(): 
    print ">>>>> Dense Matrix Clustering" 
    # create a random dense matrix 
    dense_matrix = np.random.random((
     100000, # 100K 'fake' documents 
     500 # 500 dimensions 
    )) 
    s = time.time() 
    km = MiniBatchKMeans(
     n_clusters=20, init='k-means++', batch_size=100, n_init=10, verbose=1) 
    km.fit_predict(dense_matrix) # cluster the points 
    print "Clustered dense matrix in: %.3fs" % (time.time() - s) 


@profile 
def bench_sparse(): 
    print ">>>>>> Sparse Matrix Clustering" 
    # convert the dense matrix in sparse format 
    sparse_matrix = csr_matrix(np.random.random((
     100000, # 100K 'fake' documents 
     500 # 500 dimensions 
    ))) 
    s = time.time() 
    km = MiniBatchKMeans(
     n_clusters=20, init='k-means++', batch_size=100, n_init=10, verbose=1) 
    km.fit_predict(sparse_matrix) 
    print "Clustered sparse matrix in: %.3fs" % (time.time() - s) 



if __name__ == '__main__': 
    np.random.seed(42) 
    bench_dense() 
    gc.collect() 
    np.random.seed(42) 
    bench_sparse()

当运行这段代码几次（使K均值算法确保随机性不是我发现的原因），我有一些惊喜：

为什么花〜当使用矩阵的密集表示时，聚类算法收敛40倍的迭代次数？
为什么需要两倍的时间来使用稀疏表示与密集表示进行收敛，尽管执行的迭代次数少得多？
最后，我猜测在稀疏版本的基准测试期间分配更多内存的原因是因为矩阵（随机创建的）不包含任何0，这使得稀疏格式的内存效率更低。我对吗？

这里是基准的输出：提前

>>>>> Dense Matrix Clustering 
Init 1/10 with method: k-means++ 
Inertia for init 1/10: 11546.570096 
[...] 
Init 10/10 with method: k-means++ 
Inertia for init 10/10: 11554.093346 
Minibatch iteration 1/100000: mean batch inertia: 42.160602, ewa inertia: 42.160602 
Minibatch iteration 2/100000: mean batch inertia: 41.914472, ewa inertia: 42.160110 
[...] 
Minibatch iteration 977/100000: mean batch inertia: 41.750966, ewa inertia: 41.581670 
Minibatch iteration 978/100000: mean batch inertia: 41.719181, ewa inertia: 41.581945 
Converged (lack of improvement in inertia) at iteration 978/100000 
Computing label assignment and total inertia 
Clustered dense matrix in: 7.363s 
Filename: experiments/dense_sparse_bench.py 

Line # Mem usage Increment Line Contents 
================================================ 
    13  33.2 MiB  0.0 MiB @profile 
    14        def bench_dense(): 
    15         # create a random dense matrix 
    16  33.2 MiB  0.0 MiB  dense_matrix = np.random.random((
    17          100000, # 100K 'fake' documents 
    18 241.2 MiB 208.0 MiB   500 # 500 dimensions 
    19        )) 
    20 241.3 MiB  0.1 MiB  s = time.time() 
    21 241.3 MiB  0.0 MiB  km = MiniBatchKMeans(
    22 241.4 MiB  0.2 MiB   n_clusters=20, init='k-means++', batch_size=100, n_init=10, verbose=1) 
    23 405.0 MiB 163.6 MiB  km.fit_predict(dense_matrix) # cluster the points 
    24 405.0 MiB  0.0 MiB  print "Clustered dense matrix in: %.3fs" % (time.time() - s) 

>>>>> Sparse Matrix Clustering 
Init 1/10 with method: k-means++ 
Inertia for init 1/10: 11618.817774 
[...] 
Init 10/10 with method: k-means++ 
Inertia for init 10/10: 11609.579624 
Minibatch iteration 1/100000: mean batch inertia: 42.105951, ewa inertia: 42.105951 
Minibatch iteration 2/100000: mean batch inertia: 42.375899, ewa inertia: 42.106491 
[...] 
Minibatch iteration 21/100000: mean batch inertia: 41.912611, ewa inertia: 42.258551 
Minibatch iteration 22/100000: mean batch inertia: 41.662418, ewa inertia: 42.257358 
Converged (lack of improvement in inertia) at iteration 22/100000 
Computing label assignment and total inertia 
Clustered sparse matrix in: 14.243s 
Filename: experiments/dense_sparse_bench.py 

Line # Mem usage Increment Line Contents 
================================================ 
    27  38.5 MiB  0.0 MiB @profile 
    28        def bench_sparse(): 
    29         # convert the dense matrix in sparse format 
    30  38.5 MiB  0.0 MiB  sparse_matrix = csr_matrix(np.random.random((
    31          100000, # 100K 'fake' documents 
    32 271.0 MiB 232.5 MiB   500 # 500 dimensions 
    33        ))) 
    34 271.1 MiB  0.1 MiB  s = time.time() 
    35 271.1 MiB  0.0 MiB  km = MiniBatchKMeans(
    36 271.2 MiB  0.1 MiB   n_clusters=20, init='k-means++', batch_size=100, n_init=10, verbose=1) 
    37 598.5 MiB 327.3 MiB  km.fit_predict(sparse_matrix) 
    38 598.5 MiB  0.0 MiB  print "Clustered sparse matrix in: %.3fs" % (time.time() - s)

谢谢！

来源

2015-07-10 Romain G

我会首先使用*相同*随机矩阵重新运行此基准。如果不是通过传递相同的矩阵，那么通过将种子设置在每个函数的顶部。 – Andreus

好点 - 完成。我很确定结果会是一样的，但是我得到的结果差异很大，不能仅仅由矩阵生成的随机性来解释（记住有500 * 100K，即50M的随机数为每一步生成......）顺便说一下，我不想使用_exact_相同的矩阵，因为我想释放两个步骤之间矩阵使用的内存。 –

因此，如果我使用常规（非MiniBatch）KMeans，收敛迭代和性能是相同的，密度为812秒，稀疏为3654秒，这对您的矩阵有意义（您的稀疏矩阵不稀疏，的稀疏编码实质上增加了运行时间）。 – Andreus

你......发现了一个错误。

您可以使用您的平台，sklearn版本等对此进行评论，以便我可以向sklearn开发人员报告此问题？这是一个错误assign_rows_csr

我已经收紧你的脚本（在MiniBatchKMeans构造函数中分配random_state以确保“相同”结果），然后开始挖掘。第一次群集重新分配后，您的结果发生分歧。因此，我修改了k_means_.py函数来吐出一些变量。在this line，我添加了这些打印语句“如果n_reassign”循环中：

 print "to_reassign",to_reassign 
     print "np.where(to_reassign)",np.where(to_reassign) 
     print "new_centers", new_centers 
     print "centers", centers[:,0] 
     assert False

然后，我改变冗长0，得到这个输出：

>>>>> Dense Matrix Clustering 
b 
to_reassign [False False False False False False False False False False True False 
False True False False True False True False] 
np.where(to_reassign) (array([10, 13, 16, 18], dtype=int64),) 
new_centers [11 24 33 72] 
centers [ 0.51612664 0.48724141 0.50478939 0.46328761 0.41928756 0.50768023 
    0.48635517 0.48744328 0.59401064 0.55509388 0.33723042 0.37875769 
    0.5366691 0.71604087 0.36911868 0.4626776 0.37506238 0.60670616 
    0.21136754 0.54321791] 

>>>>>> Sparse Matrix Clustering 
a 
to_reassign [False False False False False False False False False False True False 
False True False False True False True False] 
np.where(to_reassign) (array([10, 13, 16, 18], dtype=int64),) 
new_centers [11 24 33 72] 
centers [ 0.   0.   0.   0.   0.   0.   0. 
    0.   0.   0.   0.33723042 0.   0. 
    0.71604087 0.   0.   0.37506238 0.   0.21136754 
    0.  ]

我的脚本的修改版本：

import time 
import gc 
import numpy as np 
from scipy.sparse import csr_matrix 
from sklearn.cluster import MiniBatchKMeans as MiniBatchKMeans 
#from memory_profiler import profile 


#@profile 
def bench_dense(a_random_matrix): 
    print ">>>>> Dense Matrix Clustering" 
    # create a random dense matrix 
    dense_matrix = a_random_matrix.copy() 
    s = time.time() 
    km = MiniBatchKMeans(
     n_clusters=20, init='k-means++', 
     batch_size=100, 
     n_init=10, verbose=0, 
     random_state=37,) 
    km.fit_predict(dense_matrix) # cluster the points 
    print "Clustered dense matrix in: %.3fs" % (time.time() - s) 


#@profile 
def bench_sparse(a_random_matrix): 
    print ">>>>>> Sparse Matrix Clustering" 
    # convert the dense matrix in sparse format 
    sparse_matrix = csr_matrix(a_random_matrix.copy()) 
    assert np.all((sparse_matrix == a_random_matrix).sum()) 
    s = time.time() 
    km = MiniBatchKMeans(
     n_clusters=20, init='k-means++', 
     batch_size=100, 
     n_init=10, verbose=0, 
     random_state=37,) 
    km.fit_predict(sparse_matrix) 
    print "Clustered sparse matrix in: %.3fs" % (time.time() - s) 



if __name__ == '__main__': 
    a_random_matrix = np.random.random((
     100000, # 100K 'fake' documents 
     500 # 500 dimensions 
    )) 
    try: 
     np.random.seed(42) 
     bench_dense(a_random_matrix) 
    except AssertionError, e: 
     print e 
    gc.collect() 
    try: 
     np.random.seed(42) 
     bench_sparse(a_random_matrix) 
    except AssertionError, e: 
     print e

来源

2015-07-16 13:55:17 Andreus

我在Win 7 64位，蟒蛇2.1.0 64位，Python 2.7，sklearn 0.15.2，numpy 1.9.0 – Andreus

我打开了[Github上的问题]（https ：//github.com/scikit-learn/scikit-learn/issues/4985） – Andreus

我正在运行Mac OS X 10.9.5 64位，Python 2.7，scikit-learn 0.15.0，numpy 1.8.0，scipy 0.14.0。当我有时间承诺一点时，我会试着调查一下这个问题。 –

Scikit-learn Minibatch KMeans：稀疏与稠密矩阵聚类

回答

相关问题