我刚刚写了一段代码,我很难理解,任何帮助都会非常感激。问题是:为什么在一个稀疏矩阵上进行聚类需要更多的时间,更多的内存和更多的行为,而不是聚集在密集格式的相同矩阵上?Scikit-learn Minibatch KMeans:稀疏与稠密矩阵聚类
这是代码。它只是在做下面的密集和稀疏矩阵:
- 创建一个100K×500矩阵
- 飞度MinibatchKMeans估计在矩阵(我们不关心结果)
- 显示适合估算器的时间
在两个基准测试之间,手动存储器被垃圾收集(以确保我们重新开始)。
#!.env/bin/python
# -*- coding: utf8 -*-
import time
import gc
import numpy as np
from scipy.sparse import csr_matrix
from sklearn.cluster import MiniBatchKMeans
from memory_profiler import profile
@profile
def bench_dense():
print ">>>>> Dense Matrix Clustering"
# create a random dense matrix
dense_matrix = np.random.random((
100000, # 100K 'fake' documents
500 # 500 dimensions
))
s = time.time()
km = MiniBatchKMeans(
n_clusters=20, init='k-means++', batch_size=100, n_init=10, verbose=1)
km.fit_predict(dense_matrix) # cluster the points
print "Clustered dense matrix in: %.3fs" % (time.time() - s)
@profile
def bench_sparse():
print ">>>>>> Sparse Matrix Clustering"
# convert the dense matrix in sparse format
sparse_matrix = csr_matrix(np.random.random((
100000, # 100K 'fake' documents
500 # 500 dimensions
)))
s = time.time()
km = MiniBatchKMeans(
n_clusters=20, init='k-means++', batch_size=100, n_init=10, verbose=1)
km.fit_predict(sparse_matrix)
print "Clustered sparse matrix in: %.3fs" % (time.time() - s)
if __name__ == '__main__':
np.random.seed(42)
bench_dense()
gc.collect()
np.random.seed(42)
bench_sparse()
当运行这段代码几次(使K均值算法确保随机性不是我发现的原因),我有一些惊喜:
- 为什么花〜当使用矩阵的密集表示时,聚类算法收敛40倍的迭代次数?
- 为什么需要两倍的时间来使用稀疏表示与密集表示进行收敛,尽管执行的迭代次数少得多?
- 最后,我猜测在稀疏版本的基准测试期间分配更多内存的原因是因为矩阵(随机创建的)不包含任何0,这使得稀疏格式的内存效率更低。我对吗?
这里是基准的输出:提前
>>>>> Dense Matrix Clustering
Init 1/10 with method: k-means++
Inertia for init 1/10: 11546.570096
[...]
Init 10/10 with method: k-means++
Inertia for init 10/10: 11554.093346
Minibatch iteration 1/100000: mean batch inertia: 42.160602, ewa inertia: 42.160602
Minibatch iteration 2/100000: mean batch inertia: 41.914472, ewa inertia: 42.160110
[...]
Minibatch iteration 977/100000: mean batch inertia: 41.750966, ewa inertia: 41.581670
Minibatch iteration 978/100000: mean batch inertia: 41.719181, ewa inertia: 41.581945
Converged (lack of improvement in inertia) at iteration 978/100000
Computing label assignment and total inertia
Clustered dense matrix in: 7.363s
Filename: experiments/dense_sparse_bench.py
Line # Mem usage Increment Line Contents
================================================
13 33.2 MiB 0.0 MiB @profile
14 def bench_dense():
15 # create a random dense matrix
16 33.2 MiB 0.0 MiB dense_matrix = np.random.random((
17 100000, # 100K 'fake' documents
18 241.2 MiB 208.0 MiB 500 # 500 dimensions
19 ))
20 241.3 MiB 0.1 MiB s = time.time()
21 241.3 MiB 0.0 MiB km = MiniBatchKMeans(
22 241.4 MiB 0.2 MiB n_clusters=20, init='k-means++', batch_size=100, n_init=10, verbose=1)
23 405.0 MiB 163.6 MiB km.fit_predict(dense_matrix) # cluster the points
24 405.0 MiB 0.0 MiB print "Clustered dense matrix in: %.3fs" % (time.time() - s)
>>>>> Sparse Matrix Clustering
Init 1/10 with method: k-means++
Inertia for init 1/10: 11618.817774
[...]
Init 10/10 with method: k-means++
Inertia for init 10/10: 11609.579624
Minibatch iteration 1/100000: mean batch inertia: 42.105951, ewa inertia: 42.105951
Minibatch iteration 2/100000: mean batch inertia: 42.375899, ewa inertia: 42.106491
[...]
Minibatch iteration 21/100000: mean batch inertia: 41.912611, ewa inertia: 42.258551
Minibatch iteration 22/100000: mean batch inertia: 41.662418, ewa inertia: 42.257358
Converged (lack of improvement in inertia) at iteration 22/100000
Computing label assignment and total inertia
Clustered sparse matrix in: 14.243s
Filename: experiments/dense_sparse_bench.py
Line # Mem usage Increment Line Contents
================================================
27 38.5 MiB 0.0 MiB @profile
28 def bench_sparse():
29 # convert the dense matrix in sparse format
30 38.5 MiB 0.0 MiB sparse_matrix = csr_matrix(np.random.random((
31 100000, # 100K 'fake' documents
32 271.0 MiB 232.5 MiB 500 # 500 dimensions
33 )))
34 271.1 MiB 0.1 MiB s = time.time()
35 271.1 MiB 0.0 MiB km = MiniBatchKMeans(
36 271.2 MiB 0.1 MiB n_clusters=20, init='k-means++', batch_size=100, n_init=10, verbose=1)
37 598.5 MiB 327.3 MiB km.fit_predict(sparse_matrix)
38 598.5 MiB 0.0 MiB print "Clustered sparse matrix in: %.3fs" % (time.time() - s)
谢谢!
我会首先使用*相同*随机矩阵重新运行此基准。如果不是通过传递相同的矩阵,那么通过将种子设置在每个函数的顶部。 – Andreus
好点 - 完成。我很确定结果会是一样的,但是我得到的结果差异很大,不能仅仅由矩阵生成的随机性来解释(记住有500 * 100K,即50M的随机数为每一步生成......) 顺便说一下,我不想使用_exact_相同的矩阵,因为我想释放两个步骤之间矩阵使用的内存。 –
因此,如果我使用常规(非MiniBatch)KMeans,收敛迭代和性能是相同的,密度为812秒,稀疏为3654秒,这对您的矩阵有意义(您的稀疏矩阵不稀疏,的稀疏编码实质上增加了运行时间)。 – Andreus