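R: count how many letters match from the beginning of strings

I am writing a larger ML script to detect synonyms and abbreviations in a database. One of the measures of word similarity is how many of the first letters of two strings match. So, I have 2 vectors: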

v1 <- c("rejtan", "reiki","rejon") 
v2 <- c("rejtan", "rejtan", "beiki")

and I would like to get a result like this (the percentage of letters matching from the beginning of a word):

        rejtan     reiki rejon
rejtan       1 0.3333333   0.5
rejtan       1 0.3333333   0.5
beiki        0 0.0000000   0.0

I came up with this function:

count.first.character.matches <- function(vec1, vec2) {
    sapply(X = vec1, FUN = function(x) {
        sapply(X = vec2, FUN = function(y) {
            ny <- nchar(y)
            nx <- nchar(x)
            # note: as written, this condition selects the LONGER of the two lengths
            shorter_length <- ifelse(nx > ny, nx, ny)
            # count how many prefixes of length 1..shorter_length are identical
            matches <- sum(sapply(1:shorter_length, FUN = function(i, x, y) { substr(x, 1, i) == substr(y, 1, i) }, x, y))
            matches / shorter_length
        })
    })
}

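My question is: how can I improve the performance of this function? I have 65K pairs of vectors, each with 700-1K words, so I end up computing this measure a lot; according to Rprof it takes approx. 25% of the time. Thanks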

Source: 2015-10-13, mhnatiuk

Take a look at the 'stringdist' package.

Your 'shorter_length' should be set to 'ifelse(nx
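For reference, the effect of the length condition can be checked on one pair from the question. This is only an illustrative sketch: the names `longer` and `shorter` are introduced here and do not appear in the original posts.

```r
# Lengths of two words from the question's example.
nx <- nchar("reiki")   # 5
ny <- nchar("rejtan")  # 6

longer  <- ifelse(nx > ny, nx, ny)  # what the question's condition computes
shorter <- ifelse(nx < ny, nx, ny)  # what the variable name suggests

# "reiki" and "rejtan" share the two-letter prefix "re".
2 / longer   # 0.3333333, the value shown in the question's table
2 / shorter  # 0.4
```

So the table in the question corresponds to dividing by the longer length, while the later answers compare against a version that divides by the shorter one.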

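Keeping your approach as it is, you can change a few things to make it more efficient.

1) nchar is a function which, unlike length, has to compute the number of characters of its argument rather than just fetch an attribute. You compute nchar of each element of "v2" for every element of "v1", and also nchar of each element of "v1" for every element of "v2". You could move nchar(x) outside the second sapply or, even better, take advantage of the fact that nchar is vectorized and compute everything once, before any sapply. In particular, with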

x = replicate(1e3, paste0(sample(letters, sample(4:10, 1), TRUE), collapse = "")) 
y = replicate(1e3, paste0(sample(letters, sample(4:10, 1), TRUE), collapse = ""))

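computing the lengths once, up front,

system.time({ 
    nx = nchar(x) 
    ny = nchar(y) 
}) 
#user system elapsed 
# 0  0  0

is far cheaper than recomputing them inside the nested sapply calls: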

system.time({ 
sapply(x, function(X) 
      sapply(y, function(Y) { 
         nX = nchar(X) 
         nY = nchar(Y) 
         })) 
}) 
#user system elapsed 
#8.08 0.00 8.27
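As a small sketch of the same idea on the question's data (the name `len_mat` is an assumption made for this example), the lengths can be computed once per vector and the per-pair denominators built in one step with `outer`:

```r
v1 <- c("rejtan", "reiki", "rejon")
v2 <- c("rejtan", "rejtan", "beiki")

nc1 <- nchar(v1)  # one vectorized call per input vector
nc2 <- nchar(v2)

# Pairwise denominators: rows follow v2, columns follow v1.
# pmax gives the longer length of each pair (as in the question's code);
# swapping in pmin would give the shorter one.
len_mat <- outer(nc2, nc1, pmax)
len_mat
```

Inside a loop, `len_mat[j, i]` (or simply `nc1[[i]]` and `nc2[[j]]`) can then be looked up instead of calling `nchar` again.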

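2) substring is vectorized, so the third sapply can be avoided. (Also, when checking individual characters of strings, strsplit may be faster and, being vectorized itself, can be computed outside any loop.)

3) An if/else block is faster than ifelse when comparing 'length == 1' vectors. This is entirely minor, of course, but inside two nested sapply calls it adds extra computation time for no reason: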
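To make point 2 concrete, here is a minimal sketch on a single pair of words. The variable names are chosen for this example only.

```r
x <- "reiki"
y <- "rejtan"
ind <- seq_len(max(nchar(x), nchar(y)))

# substring() recycles its arguments, so one call returns every prefix.
px <- substring(x, 1, ind)  # "r" "re" "rei" "reik" "reiki" "reiki"
py <- substring(y, 1, ind)  # "r" "re" "rej" "rejt" "rejta" "rejtan"
n_prefix <- sum(px == py)   # number of identical prefixes

# Character-by-character alternative with strsplit():
cx <- strsplit(x, "", fixed = TRUE)[[1]]
cy <- strsplit(y, "", fixed = TRUE)[[1]]
k <- seq_len(min(length(cx), length(cy)))
same <- cx[k] == cy[k]
# which.min() on a logical vector returns the position of the first FALSE
n_chars <- if (all(same)) length(k) else which.min(same) - 1
```

Both counts come out as 2 here, since the two words share only the prefix "re".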

microbenchmark::microbenchmark(replicate(1e4, if(2 < 3 && 5 > 3) 1 else 0), 
           replicate(1e4, ifelse(2 < 3 && 5 > 3, 1, 0))) 
#Unit: milliseconds 
#           expr  min  lq median  uq  max neval 
# replicate(10000, if (2 < 3 && 5 > 3) 1 else 0) 14.22543 14.85759 15.09545 15.78781 56.84884 100 
# replicate(10000, ifelse(2 < 3 && 5 > 3, 1, 0)) 29.77642 31.44824 36.20305 37.85782 65.72473 100
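For a scalar comparison like the one in the question, the three spellings below give the same value. This is only an illustration with made-up lengths; no timings are claimed for it.

```r
nx <- 5L
ny <- 6L

len_a <- ifelse(nx > ny, nx, ny)  # vectorized machinery for a scalar test
len_b <- if (nx > ny) nx else ny  # plain control flow
len_c <- max(nx, ny)              # same value, arguably the clearest
```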

So, taking these into account:

OP2 = function(v1, v2) 
{ 
    nc1 = nchar(v1) 
    nc2 = nchar(v2) 
    sv2 = seq_along(v2) 

    sapply(seq_along(v1), 
      function(i) { 
       sapply(sv2, 
        function(j) { 
         len = if(nc1[[i]] > nc2[[j]]) nc1[[i]] else nc2[[j]] 
         ind = seq_len(len) 
         sum(substring(v1[[i]], 1, ind) == substring(v2[[j]], 1, ind))/len 
        }) 
      }) 
}

And comparing with yours:

set.seed(007)   
v1b = replicate(1e2, paste0(sample(letters[1:5], sample(4:10, 1), TRUE), collapse = "")) 
v2b = replicate(1e2, paste0(sample(letters[1:5], sample(4:10, 1), TRUE), collapse = "")) 

sum(count.first.character.matches(v1b, v2b) != OP2(v1b, v2b)) 
#[1] 0 
microbenchmark::microbenchmark(count.first.character.matches(v1b, v2b), OP2(v1b, v2b), times = 20) 
#Unit: milliseconds 
#            expr  min  lq median  uq  max neval 
# count.first.character.matches(v1b, v2b) 932.2840 949.3697 969.6321 985.2237 1081.2882 20 
#       OP2(v1b, v2b) 161.7503 185.1102 192.3019 197.5060 213.6818 20

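Another idea, departing from your approach, could be (after changing "OP2" to use the shortest length):

ff = function(x, y) 
{ 
    # split every word into single characters once, outside any loop
    sx = strsplit(x, "", fixed = TRUE) 
    sy = strsplit(y, "", fixed = TRUE) 
    array(mapply(function(X, Y) { 
        slen = seq_len(min(length(X), length(Y))) 
        wh = X[slen] == Y[slen] 
        # which.min(wh) is the position of the first mismatch
        if(all(wh)) return(1) else (which.min(wh) - 1)/length(slen) 
       }, 
       rep(sx, each = length(sy)), sy), 
      # rows follow y, columns follow x, so the dims must be (length(y), length(x));
      # the original c(length(x), length(y)) only works when both lengths are equal
      c(length(y), length(x)), list(y, x)) 
} 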
sum(ff(v1b, v2b) != OP2(v1b, v2b)) 
#[1] 0 
microbenchmark::microbenchmark(ff(v1b, v2b), OP2(v1b, v2b), times = 20) 
#Unit: milliseconds 
#   expr  min  lq median  uq  max neval 
# ff(v1b, v2b) 72.72661 80.43703 85.85113 89.16066 110.5722 20 
# OP2(v1b, v2b) 165.13991 168.15051 176.01596 182.11389 213.9557 20
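A further variation, which is not part of the answers here and is offered only as a sketch, is to loop over character positions instead of over pairs, so that each step is a single vectorized comparison across the whole matrix. The function name `prefix_share` is hypothetical; the sketch assumes non-empty input vectors without empty strings and divides by the shorter length, as `ff` does. It has not been benchmarked against the functions above.

```r
prefix_share <- function(v1, v2) {
  n1 <- nchar(v1)
  n2 <- nchar(v2)
  denom <- outer(n2, n1, pmin)                   # shorter length of each pair
  still <- matrix(TRUE, length(v2), length(v1))  # is the prefix unbroken so far?
  count <- matrix(0, length(v2), length(v1))     # matched prefix length so far
  for (k in seq_len(min(max(n1), max(n2)))) {
    c1 <- substr(v1, k, k)  # k-th character, "" past the end of a word
    c2 <- substr(v2, k, k)
    # equal k-th characters, ignoring positions past the end of either word
    # (c2 is recycled down the rows, which follow v2)
    eq <- outer(c2, c1, "==") & (c2 != "")
    still <- still & eq
    count <- count + still
  }
  res <- count / denom
  dimnames(res) <- list(v2, v1)
  res
}

v1 <- c("rejtan", "reiki", "rejon")
v2 <- c("rejtan", "rejtan", "beiki")
res <- prefix_share(v1, v2)
res
```

The loop runs at most as many times as the longest word has characters, regardless of how many pairs there are, at the cost of holding a few full-size matrices in memory.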

Source: 2015-10-14 08:12:07

Hey, this really helped me a lot, thanks! – mhnatiuk

How about this, using strsplit:

count.first.character.matches2 <- function(vec1,vec2) { 
    sapply(X = vec1 , FUN= function(x) { 
    sapply(X = vec2, FUN = function(y) { 
    ny <- nchar(y) 
    nx <- nchar(x) 
    shorter_length <- ifelse(nx < ny, nx, ny) 
    ind <- strsplit(x, "")[[1]][1 : shorter_length] == strsplit(y, "")[[1]][1 : shorter_length] 
    if(sum(ind) == shorter_length) return(1) else { 
     matches <- min(which(!ind)) - 1 
     matches/shorter_length 
     } 
    }) 
})}

A quick test (with shorter_length <- ifelse(nx < ny, nx, ny) in your function):

v11 <- rep(v1, 100) 
v22 <- rep(v2, 100) 

system.time(test1 <- count.first.character.matches(v11, v22)) 
# user system elapsed 
# 12.20 0.02 12.29 

system.time(test2 <- count.first.character.matches2(v11, v22)) 
# user system elapsed 
# 3.86 0.00 3.96 

all.equal(test1, test2) 
# [1] TRUE

A bit messy, but much faster.

Source: 2015-10-13 17:41:00
