I want to iterate over the rows of a CSR matrix and divide each element by the row sum, similar to this post: numpy + scipy, CSR matrix row-wise operation.
My problem is that I'm dealing with a large matrix: (96582, 350138),
and when I apply the operation from the linked post it blows up my memory, because the returned matrix is dense.
So here is my first attempt:
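For reference, one way to row-normalize without ever materializing a dense array (a sketch, not from the linked post; the small matrix stands in for the real `counts`) is to left-multiply by a sparse diagonal of reciprocal row sums:

```python
import numpy as np
from scipy import sparse

# Small stand-in for the real (96582, 350138) `counts` matrix.
counts = sparse.csr_matrix(np.array([[1., 3.], [0., 2.], [4., 4.]]))

# Reciprocal row sums; empty rows get 0 so they stay all-zero.
row_sums = np.asarray(counts.sum(axis=1)).ravel()
inv = np.divide(1.0, row_sums, out=np.zeros_like(row_sums), where=row_sums != 0)

# D @ counts scales row i by inv[i]; the result stays sparse throughout.
normalized = sparse.diags(inv) @ counts
```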
for row in counts:
    row = row / row.sum()
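The loop above has no effect because iterating over a CSR matrix yields new one-row matrices, and rebinding the loop variable `row` never touches `counts`. A memory-friendly in-place alternative (a sketch using a small stand-in matrix) is to scale `counts.data` directly, expanding the row sums to one value per stored nonzero:

```python
import numpy as np
from scipy import sparse

# Small stand-in for the real (96582, 350138) `counts` matrix.
counts = sparse.csr_matrix(np.array([[1., 3.], [0., 2.], [4., 4.]]))

# Row sums as a flat array; replace 0 with 1 to avoid division by zero.
row_sums = np.asarray(counts.sum(axis=1)).ravel()
row_sums[row_sums == 0] = 1.0

# np.diff(counts.indptr) is the number of stored values per row, so
# np.repeat aligns one row sum with each entry of counts.data.
# This divides every nonzero of row i by row_sums[i], in place.
counts.data /= np.repeat(row_sums, np.diff(counts.indptr))
```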
Unfortunately, this doesn't affect the matrix at all, so I came up with a second idea: create a new CSR matrix and concatenate the rows onto it using vstack:
from scipy import sparse
import time

start_time = curr_time = time.time()
mtx = sparse.csr_matrix((0, counts.shape[1]))
for i, row in enumerate(counts):
    prob_row = row / row.sum()
    mtx = sparse.vstack([mtx, prob_row])
    if i % 1000 == 0:
        delta_time = time.time() - curr_time
        total_time = time.time() - start_time
        curr_time = time.time()
        print('step: %i, total time: %i, delta_time: %i' % (i, total_time, delta_time))
This works, but after some iterations it gets slower and slower:
step: 0, total time: 0, delta_time: 0
step: 1000, total time: 1, delta_time: 1
step: 2000, total time: 5, delta_time: 4
step: 3000, total time: 12, delta_time: 6
step: 4000, total time: 23, delta_time: 11
step: 5000, total time: 38, delta_time: 14
step: 6000, total time: 55, delta_time: 17
step: 7000, total time: 88, delta_time: 32
step: 8000, total time: 136, delta_time: 47
step: 9000, total time: 190, delta_time: 53
step: 10000, total time: 250, delta_time: 59
step: 11000, total time: 315, delta_time: 65
step: 12000, total time: 386, delta_time: 70
step: 13000, total time: 462, delta_time: 76
step: 14000, total time: 543, delta_time: 81
step: 15000, total time: 630, delta_time: 86
step: 16000, total time: 722, delta_time: 92
step: 17000, total time: 820, delta_time: 97
Any suggestions? Any idea why vstack gets slower and slower?
See https://stackoverflow.com/a/45339754 and https://stackoverflow.com/q/44080315 – hpaulj
As with dense arrays, repeated concatenation in a loop is slow. It is faster to accumulate the results in a list and do one `vstack` at the end. – hpaulj
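hpaulj's suggestion can be sketched like this (using a small stand-in for `counts`): each `vstack` in the loop above copies the entire accumulated matrix, which is why the per-step time grows; collecting the rows in a plain Python list and concatenating once avoids that quadratic copying:

```python
import numpy as np
from scipy import sparse

# Small stand-in for the real (96582, 350138) `counts` matrix.
counts = sparse.csr_matrix(np.array([[1., 3.], [0., 2.], [4., 4.]]))

rows = []
for row in counts:                 # each `row` is a 1 x n_cols CSR matrix
    s = row.sum()
    rows.append(row / s if s != 0 else row)  # leave empty rows untouched

# Single concatenation at the end instead of one per iteration.
mtx = sparse.vstack(rows, format='csr')
```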