2017-08-31

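R - Updating columns in a very large sparse matrix

I need to update certain columns in a sparse matrix, but the operation takes so long that it never finishes.

I have a sparse matrix with just under 3M rows and roughly 1500 columns. I also have a data frame with the same number of rows, but only 10 columns. I want to update certain column indices in the matrix with the values from the data.frame.

I have no problem doing this with a regular matrix, but with a sparse matrix it takes forever, even for a single column.

Below is the code I am using. What needs to change for it to run efficiently?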

library(Matrix) 

x <- Matrix(0, nrow = 2678748, ncol = 1559, sparse = TRUE) 
df <- data.frame(replicate(5,sample(0:1,2678748,rep = TRUE))) 

var_nums <- sample(1:1559,size = 5) 

for (i in 1:5){ 
    x[,var_nums[i]] <- df[,i] 
} 

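Answers

I was able to get it to finish in under 1 second using the Matrix::cBind function, by eliminating the for loop. (On recent Matrix releases, where cBind is deprecated, plain cbind should be the drop-in replacement.)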

library(Matrix) 

x <- Matrix(0, nrow = 2678748, ncol = 1559, sparse = TRUE) 
df <- data.frame(replicate(5,sample(0:1,2678748,rep = TRUE))) 

var_nums <- sample(1:1559,size = 5) 

t <- Sys.time() 
x   <- x[,-var_nums] 
x   <- Matrix::cBind(x, Matrix::as.matrix(df)) 
Sys.time()-t 
Time difference of 0.541054 secs 

With order preserved (still under 1 second!):

library(Matrix) 

x <- Matrix(0, nrow = 2678748, ncol = 1559, sparse = TRUE) 
df <- data.frame(replicate(5,sample(0:1,2678748,rep = TRUE))) 

colnames(x) <- paste("col", 1:ncol(x)) 
col.order <- colnames(x) 

cols <- sample(colnames(x),size = 5) 
colnames(df) <- cols 

t <- Sys.time() 
x   <- x[,-which(colnames(x) %in% cols)] 
x   <- Matrix::cBind(x, Matrix::as.matrix(df)) 
x   <- x[,col.order] 
Sys.time()-t 
Time difference of 0.550012 secs 

# Proof that order is preserved: 
identical(colnames(x), col.order) 

TRUE
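
For reference, the drop-and-rebind idea above can be wrapped in a small helper. This is only a minimal sketch, not the answer's exact code: `replace_cols_cbind` is a hypothetical name, it uses base `cbind` (assuming R >= 3.2.0 with Matrix loaded, where `cbind` handles sparse matrices) instead of `cBind`, and it restores the column order by integer position, so `x` does not need column names.

```r
library(Matrix)

# Hypothetical helper: replace columns `cols` of the sparse matrix `x`
# with the columns of `df`, keeping the original column order.
replace_cols_cbind <- function(x, df, cols) {
  stopifnot(nrow(x) == nrow(df),
            length(cols) == ncol(df),
            !anyDuplicated(cols))
  kept <- setdiff(seq_len(ncol(x)), cols)            # columns that stay as they are
  new_cols <- Matrix(unname(as.matrix(df)), sparse = TRUE)
  combined <- cbind(x[, kept, drop = FALSE], new_cols)
  # Column k of `combined` holds original column c(kept, cols)[k];
  # order() gives the permutation that puts them back in position 1..ncol(x).
  combined[, order(c(kept, cols)), drop = FALSE]
}

# Small usage example with made-up data
x_small  <- sparseMatrix(i = c(1, 3, 4), j = c(1, 2, 4), x = c(5, 6, 7),
                         dims = c(5, 4))
df_small <- data.frame(a = c(0, 1, 0, 1, 0), b = c(1, 0, 0, 0, 1))
res <- replace_cols_cbind(x_small, df_small, cols = c(3, 2))
```

Working with positions rather than names avoids the `paste("col", ...)` bookkeeping, at the cost of requiring `cols` to be valid, unique column indices.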

You can use the i, j, x (triplet) notation of sparseMatrix:

library(Matrix) 

# data 
set.seed(1) 
# Changed the dim size to fit in my laptop memory 
nc=10 
nr=100 
n=5 

df <- data.frame(replicate(n,sample(0:1,nr,rep = TRUE))) 
var_nums <- sample(1:nc,size = n) 

#Yours  
x <- Matrix(0, nrow = nr, ncol = nc, sparse = TRUE) 
for (i in 1:n){ 
    x[,var_nums[i]] <- df[,i] 
} 

# new version 
i = ((which(df==1)-1) %% nr) +1 
j = rep(var_nums, times=colSums(df)) 
y = sparseMatrix(i=i, j=j, x=1, dims=c(nrow(df), nc)) 

all.equal(x, y, check.attributes=FALSE) 
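
The index arithmetic in the new version may be easier to follow on a tiny input. The sketch below uses made-up names (`df_small`, `lin`, `row_idx`, `col_idx`) and simply shows that `which(df == 1)` returns column-major linear positions, which the modulo step converts back to row numbers, while `rep(..., times = colSums(df))` supplies the matching column numbers.

```r
# Tiny illustrative data frame (values chosen by hand)
df_small <- data.frame(a = c(0, 1, 1), b = c(1, 0, 1))
nr_small <- nrow(df_small)

lin <- which(df_small == 1)                 # linear positions, counted column by column
row_idx <- ((lin - 1) %% nr_small) + 1      # row of each non-zero entry
col_idx <- rep(seq_len(ncol(df_small)), times = colSums(df_small))  # column of each entry

# Cross-check against which(..., arr.ind = TRUE), which returns row/col directly
rc <- which(df_small == 1, arr.ind = TRUE)
```

In the answer's code the column numbers come from `var_nums` instead of `seq_len(ncol(df))`, which is what maps each data frame column onto its target column in the matrix.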

Comparing speed:

f1 <- function(){  
    for (i in 1:n){ 
     x[,var_nums[i]] <- df[,i] 
    } 
    x 
} 

f2 <- function(){ 
    i = ((which(df==1)-1) %% nr) +1 
    j = rep(var_nums, times=colSums(df)) 
    y = sparseMatrix(i=i, j=j, x=1, dims=c(nrow(df), nc)) 
    y 
} 

microbenchmark::microbenchmark(f1(), f2()) 

Unit: milliseconds
 expr      min       lq     mean   median       uq       max neval cld
 f1() 4.594229 4.694205 5.010071 4.770475 4.891649 12.666554   100   b
 f2() 1.274745 1.298663 1.464237 1.329534 1.392146  7.153076   100  a

Trying larger dimensions:

nc=100 
nr=10000 
n=50 
set.seed(1) 
df <- data.frame(replicate(n,sample(0:1,nr,rep = TRUE))) 
var_nums <- sample(1:nc,size = n) 
x <- Matrix(0, nrow = nr, ncol = nc, sparse = TRUE) 

all.equal(f1(), f2(), check.attributes=FALSE) 

microbenchmark::microbenchmark(f1(), f2(), times=1) 
Unit: milliseconds
 expr         min          lq        mean      median          uq         max neval
 f1() 21605.60251 21605.60251 21605.60251 21605.60251 21605.60251 21605.60251     1
 f2()    60.87275    60.87275    60.87275    60.87275    60.87275    60.87275     1
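
Note that `f2` builds the result from scratch, which matches `f1` here because `x` starts out all zero. If `x` already held non-zero entries in other columns, one possible extension is to merge the existing triplets with the new ones. The sketch below is written under that assumption: `replace_cols_triplet` is a hypothetical helper name, and the code assumes a general numeric sparse matrix (such as a dgCMatrix).

```r
library(Matrix)

# Hypothetical helper: replace columns `cols` of a numeric sparse matrix `x`
# with the columns of `df`, keeping the entries stored in all other columns.
replace_cols_triplet <- function(x, df, cols) {
  stopifnot(nrow(x) == nrow(df), length(cols) == ncol(df))
  new_vals <- as.matrix(df)
  nz <- which(new_vals != 0, arr.ind = TRUE)   # row/col of non-zero values in df
  old <- as(x, "TsparseMatrix")                # triplet form; @i and @j are 0-based
  keep <- !((old@j + 1L) %in% cols)            # entries outside the replaced columns
  sparseMatrix(
    i = c(old@i[keep] + 1L, nz[, "row"]),
    j = c(old@j[keep] + 1L, cols[nz[, "col"]]),
    x = c(old@x[keep], new_vals[nz]),
    dims = dim(x)
  )
}

# Small usage example with made-up data
x_small  <- sparseMatrix(i = c(1, 3, 4), j = c(1, 2, 4), x = c(5, 6, 7),
                         dims = c(5, 4))
df_small <- data.frame(a = c(0, 1, 0, 1, 0), b = c(1, 0, 0, 0, 1))
res <- replace_cols_triplet(x_small, df_small, cols = c(2, 3))
```

Because the old entries in the replaced columns are filtered out before the new triplets are appended, no (i, j) pair appears twice, so `sparseMatrix` has nothing to sum.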

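This is slightly cumbersome, but you can bind the needed columns together like this:

# Assumes var_nums is sorted in increasing order, with no adjacent indices
# and none equal to 1 or Nc
Nc = NCOL(x) 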

    Matrix(cbind(
    x[, 1:(var_nums[1]-1)], 
    df[, 1], 
    x[, (var_nums[1]+1):(var_nums[2]-1)], 
    df[, 2], 
    x[, (var_nums[2]+1):(var_nums[3]-1)], 
    df[, 3], 
    x[, (var_nums[3]+1):(var_nums[4]-1)], 
    df[, 4], 
    x[, (var_nums[4]+1):(var_nums[5]-1)], 
    df[, 5], 
    x[, (var_nums[5]+1):Nc]), 
    sparse = TRUE) 

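With only 5 columns of df to insert, this is not too bad. If df has more columns, or a varying number of them, a different syntax would probably be more appropriate. In any case, binding columns is relatively fast.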