2016-09-21

I have many text sentences that I want to compare, but here is a Little Red Riding Hood example. How do I count the common words and store the results in a matrix?

text1 <- "Once upon a time" 
text2 <- "there was a dear little girl" 
text3 <- "who was loved by everyone who looked at her" 

I would like to count the common words like this:

text1_split <- unlist(strsplit(text1, " ")) 
text2_split <- unlist(strsplit(text2, " ")) 
text3_split <- unlist(strsplit(text3, " ")) 

length(intersect(text1_split, text2_split)) 
length(intersect(text2_split, text3_split)) 

texts <- c("text1","text2","text3") 
data <- data.frame(texts) 
data[, texts] <- NA 
rownames(data) <- texts 
data <- data[,-1] 

data[1,1] <- length(intersect(text1_split, text1_split)) 
data[1,2] <- length(intersect(text1_split, text2_split)) 
data[1,3] <- length(intersect(text1_split, text3_split)) 

The result of my matrix looks like this:

      text1 text2 text3 
text1     4     1     0 
text2    NA    NA    NA 
text3    NA    NA    NA 

Is there a way to complete the matrix efficiently? I have more than 100 sentences to compare. This is similar to, but not the same as, this post: Count common words in two strings in R

Answers


Try this:

CommonWordsMatrixOld <- function(vList) { 
  v <- lapply(vList, tolower) 
  do.call(rbind, lapply(v, function(x) { 
    xSplit <- strsplit(x, " ")[[1]] 
    do.call(c, lapply(v, function(y) length(intersect(xSplit, strsplit(y, " ")[[1]])))) 
  })) 
} 

myText <- list(text1, text2, text3) 

Calling it, we have:

CommonWordsMatrixOld(myText) 
     [,1] [,2] [,3] 
[1,]    4    1    0 
[2,]    1    6    1 
[3,]    0    1    8 

It is decently fast for the data sizes the OP is asking about. The data was obtained here:

testWords <- read.csv("4000-most-common-english-words-csv.csv", stringsAsFactors = FALSE) 

set.seed(1111) 
myTestText <- lapply(1:100, function(x) { 
  paste(testWords[sample(1000:1020, sample(30, 1), replace = TRUE), ], collapse = " ") 
}) 

myTestText[[15]] 
[1] "access restaurant video opinion video eventually fresh eventually 
reform credit publish judge Senate publish fresh restaurant publish 
version Senate critical release recall relation version" 

system.time(test1 <- CommonWordsMatrixOld(myTestText)) 
   user  system elapsed 
  0.625   0.009   0.646 

Here is the output:

test1[1:10,1:10] 
      [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10] 
 [1,]    9    3    5    1    3    4    4    2    2     1 
 [2,]    3    5    3    1    1    3    3    0    0     1 
 [3,]    5    3   12    0    3    8    4    3    2     1 
 [4,]    1    1    0    1    0    0    1    0    0     0 
 [5,]    3    1    3    0    4    2    1    1    1     0 
 [6,]    4    3    8    0    2   13    7    4    1     1 
 [7,]    4    3    4    1    1    7   10    4    1     1 
 [8,]    2    0    3    0    1    4    4    7    3     0 
 [9,]    2    0    2    0    1    1    1    3    4     0 
[10,]    1    1    1    0    0    1    1    0    0     2 

Update

Here is a much faster algorithm that cuts out many unnecessary operations and takes advantage of lower.tri, while remaining very general.

CommonWordsMatrixNew <- function(vList) { 
  v <- lapply(vList, function(x) tolower(strsplit(x, " ")[[1]])) 
  s <- length(v) 
  m <- do.call(rbind, lapply(1L:s, function(x) { 
    c(rep(0L, (x - 1L)), do.call(c, lapply(x:s, function(y) length(intersect(v[[x]], v[[y]]))))) 
  })) 
  m[lower.tri(m)] <- t(m)[lower.tri(m)] 
  m 
} 

To give you an idea of the performance gains, here are some benchmarks. (It should be noted that the OP's solution does not split the vectors, so it is not a true comparison.) The new algorithm is almost twice as fast as the OP's solution.

microbenchmark(New=CommonWordsMatrixNew(myTestText), 
       Old=CommonWordsMatrixOld(myTestText), 
       Pach=CommonWordsMatrixPach(PreSplit1), times = 10) 
Unit: milliseconds 
 expr       min        lq      mean    median        uq      max neval 
  New  78.64434  79.07127  86.10754  79.72828  81.39679 137.0695    10 
  Old 321.49031 323.89835 326.61801 325.75221 328.50877 335.3306    10 
 Pach 138.34742 143.00504 145.35147 145.17376 148.34699 151.5535    10 

all(identical(CommonWordsMatrixNew(myTestText), CommonWordsMatrixOld(myTestText)), 
    identical(CommonWordsMatrixNew(myTestText), CommonWordsMatrixPach(PreSplit1))) 
[1] TRUE 

The new algorithm reduces the number of calls to strsplit by n^2 - n (e.g., in the example above, strsplit gets called 10000 times in the original algorithm and only 100 times in the updated version). Additionally, since the resulting matrix is symmetric, there is no need to compute the interaction between each pair of sentences more than once, hence x = 1:s and y = x:s in the lapply functions. The number of computations for these loops drops from n^2 to the nth triangle number = n*(n+1)/2 (e.g., in our example, from 10000 to 5050). After that, we rely on the power of indexing in R, which is generally much faster than manual manipulation.
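The mirroring step can be illustrated on its own with a tiny matrix (a minimal sketch; the values are just the counts from the three-sentence example above):

```r
# Fill only the upper triangle (including the diagonal), then mirror it
# into the lower triangle -- the same trick CommonWordsMatrixNew uses.
m <- matrix(0L, 3, 3)
m[upper.tri(m, diag = TRUE)] <- c(4L, 1L, 6L, 0L, 1L, 8L)  # column-major order
m[lower.tri(m)] <- t(m)[lower.tri(m)]
m
#      [,1] [,2] [,3]
# [1,]    4    1    0
# [2,]    1    6    1
# [3,]    0    1    8
```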


Thank you very much. Below I have posted something based on your reply that seems to be faster. – pachamaltese


@pachamaltese, the first algorithm was doing unnecessary computations. I modified my original algorithm to remove many operations. Also, the algorithms above are still general (i.e., they do not depend on pre-split vectors). By the way, nice question. –


Good point! I had already pre-split with strsplit before calling the function itself. – pachamaltese


I found that splitting beforehand improves the speed, so that

CommonWordsMatrix <- function(vList) { 
  v <- lapply(vList, tolower) 
  do.call(rbind, lapply(v, function(x) { 
    do.call(c, lapply(v, function(y) length(intersect(x, y)))) 
  })) 
} 

is a good option to go with (x and y being pre-split vectors of words).
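As a minimal end-to-end sketch, pre-splitting with the three sentences from the question looks like this (the function definition is repeated here only so the snippet runs on its own):

```r
CommonWordsMatrix <- function(vList) {
  v <- lapply(vList, tolower)
  do.call(rbind, lapply(v, function(x) {
    do.call(c, lapply(v, function(y) length(intersect(x, y))))
  }))
}

# Split each sentence exactly once, outside the function,
# then pass the list of word vectors.
myText <- list("Once upon a time",
               "there was a dear little girl",
               "who was loved by everyone who looked at her")
PreSplit <- lapply(myText, function(x) strsplit(x, " ")[[1]])
CommonWordsMatrix(PreSplit)
#      [,1] [,2] [,3]
# [1,]    4    1    0
# [2,]    1    6    1
# [3,]    0    1    8
```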