Can I vectorize this function further?

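I'm fairly new to R and to matrix-based scripting languages in general. I've written this function to return the index of every row whose content is similar to the content of some other row. It is a primitive form of spam reduction that I'm developing. Can I vectorize this function further?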

if (!require("RecordLinkage")) install.packages("RecordLinkage") 

library("RecordLinkage") 

# Takes a column of strings, returns a vector of indices
check_similarity <- function(x) {
    threshold <- 0.8
    values <- NULL
    for (i in seq_along(x)) {
        # compare x[i] against every other element;
        # note that which() reports positions within x[-i], not within x
        values <- c(values, which(jarowinkler(x[i], x[-i]) > threshold))
    }
    return(values)
}

Is there a way to write this that avoids the explicit for loop?

Source

2017-02-14 user2228313

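@akrun Updated, cheers. – user2228313

@Db No, I'm comparing each row against all the other rows: x[i] against x[-i]. – user2228313

Maybe try this: `m = as.matrix(sapply(x, jarowinkler, x)) > threshold; diag(m) = 0; which(rowSums(m) > 0)`. I have no reproducible data to test it with, but I think it works. – dww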

We can use sapply to simplify the code.

# some test data
x = c('hello', 'hollow', 'cat', 'turtle', 'bottle', 'xxx')
threshold = 0.8  # same cutoff as in the question's function

# create an x by x matrix specifying which strings are alike 
m = sapply(x, jarowinkler, x) > threshold 

# set diagonal to FALSE: we're not interested in strings being identical to themselves 
diag(m) = FALSE 

# And find index positions of all strings that are similar to at least one other string 
which(rowSums(m) > 0) 
# [1] 1 2 4 5

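That is, this returns the index positions of 'hello', 'hollow', 'turtle' and 'bottle', each of which is similar to at least one other string.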
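If it helps, the steps above could be wrapped in a function along these lines. This is only a sketch: the name `similar_indices` and the `sim` argument are made up for illustration and are not part of RecordLinkage. `sim` is assumed to be a vectorised similarity function that returns one score per element of its second argument, as `jarowinkler` does. Unlike `which()` applied to `x[-i]`, the result here refers to positions in `x` itself.

```r
# Hypothetical helper: indices of strings similar to at least one other string.
# `sim` defaults to RecordLinkage::jarowinkler; the default is only looked up
# if no other similarity function is supplied.
similar_indices <- function(x, threshold = 0.8,
                            sim = RecordLinkage::jarowinkler) {
  n <- length(x)
  if (n < 2) return(integer(0))  # nothing to compare against

  # n x n logical matrix: column j compares x[j] with every element of x
  m <- vapply(x, function(s) sim(s, x) > threshold, logical(n))

  diag(m) <- FALSE               # ignore each string's match with itself
  unname(which(rowSums(m) > 0))
}

# Usage with the answer's test data (only runs if RecordLinkage is installed)
if (requireNamespace("RecordLinkage", quietly = TRUE)) {
  x <- c('hello', 'hollow', 'cat', 'turtle', 'bottle', 'xxx')
  print(similar_indices(x))
}
```

Passing the threshold as an argument also avoids relying on a global `threshold` variable.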

If you prefer, you can use colSums instead of rowSums to get a named vector, but this can get messy if the strings are long:

which(colSums(m) > 0) 
#  hello hollow turtle bottle
#      1      2      4      5
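For spam reduction it may also be useful to see which strings match each other, not just that a match exists. The sketch below is one possible way to list the pairs with `outer` and `which(..., arr.ind = TRUE)`. The name `similar_pairs` and the `sim` argument are hypothetical, and the code assumes the similarity is symmetric, so only pairs with i < j are kept.

```r
# Hypothetical helper: list each pair (i, j), i < j, whose similarity exceeds
# the threshold. `sim` must be vectorised over two equal-length vectors;
# it defaults to RecordLinkage::jarowinkler, looked up only when needed.
similar_pairs <- function(x, threshold = 0.8,
                          sim = RecordLinkage::jarowinkler) {
  # m[i, j] is TRUE when x[i] and x[j] are more similar than the threshold
  m <- outer(x, x, sim) > threshold
  m[is.na(m)] <- FALSE

  # drop the diagonal and the lower triangle so each pair appears once
  m[lower.tri(m, diag = TRUE)] <- FALSE

  idx <- which(m, arr.ind = TRUE)  # two-column matrix of (row, col) positions
  i <- unname(idx[, 1])
  j <- unname(idx[, 2])
  data.frame(i = i, j = j, first = x[i], second = x[j],
             stringsAsFactors = FALSE)
}

# Usage with the answer's test data (only runs if RecordLinkage is installed)
if (requireNamespace("RecordLinkage", quietly = TRUE)) {
  x <- c('hello', 'hollow', 'cat', 'turtle', 'bottle', 'xxx')
  print(similar_pairs(x))
}
```

Like the sapply version, this still computes all n^2 comparisons, so it trades memory for the absence of an explicit loop.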

Source

2017-02-14 22:50:18 dww
