2014-09-19 49 views
3

Count common words in two strings

I have two strings:

a <- "Roy lives in Japan and travels to Africa" 
b <- "Roy travels Africa with this wife" 

I want to get the number of words common to both strings.

The answer should be 3:

  • “Roy”

  • “travels”

  • “Africa”

are the common words.

This is what I have tried:

stra <- as.data.frame(t(read.table(textConnection(a), sep = " "))) 
strb <- as.data.frame(t(read.table(textConnection(b), sep = " "))) 

Take the unique words to avoid double counting:

stra_unique <- as.data.frame(unique(stra$V1))
strb_unique <- as.data.frame(unique(strb$V1))
colnames(stra_unique) <- c("V1")
colnames(strb_unique) <- c("V1")

common_words <- length(merge(stra_unique, strb_unique, by = "V1")$V1)

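I need this for two datasets of 2000 and 1200 strings. The total number of string pairs I have to evaluate is 2000 × 1200. Is there a fast way to do this without using loops?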

+0

I'm not really suggesting this either, but using your "stra" and "strb", you could probably just do 'merge(stra, strb)'... – A5C1D2H2I1M1N2O1R2T1 2014-09-19 11:03:08

Answers

4

Perhaps use intersect together with str_extract_all from stringr. For multiple strings, you can put them in a list or a vector:

library(stringr)
vec1 <- c(a, b)
Reduce(`intersect`, str_extract_all(vec1, "\\w+"))
#[1] "Roy"  "travels" "Africa"

For a faster option, consider stringi:

library(stringi) 
Reduce(`intersect`,stri_extract_all_regex(vec1,"\\w+")) 
#[1] "Roy"  "travels" "Africa" 

To get the count:

length(Reduce(`intersect`,stri_extract_all_regex(vec1,"\\w+"))) 
#[1] 3 

Or use base R:

Reduce(`intersect`, regmatches(vec1, gregexpr("\\w+", vec1)))
#[1] "Roy"  "travels" "Africa"
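For the 2000 × 1200 case in the question, one possible way to avoid an explicit loop over pairs is to build a word-incidence matrix for each set and multiply them. The following is only a sketch in base R under assumptions: the inputs set_a and set_b, and the helper incidence(), are hypothetical names that do not appear in the question, and words are matched case-sensitively with "\\w+" as above.

```r
# Hypothetical inputs standing in for the two datasets of strings
set_a <- c("Roy lives in Japan and travels to Africa",
           "Bob travels to Japan")
set_b <- c("Roy travels Africa with this wife",
           "Japan is far from Africa",
           "Nothing shared here")

# Tokenise each string once
words_a <- regmatches(set_a, gregexpr("\\w+", set_a))
words_b <- regmatches(set_b, gregexpr("\\w+", set_b))

# Shared vocabulary across both sets
vocab <- unique(unlist(c(words_a, words_b)))

# Hypothetical helper: one 0/1 column per string, one row per vocabulary word
incidence <- function(words) {
  vapply(words, function(w) as.numeric(vocab %in% w), numeric(length(vocab)))
}

# counts[i, j] = number of distinct words shared by set_a[i] and set_b[j]
counts <- crossprod(incidence(words_a), incidence(words_b))
counts
```

Because each column is a 0/1 indicator, repeated words within a string are counted once. For a large vocabulary the dense matrices may get big, in which case a sparse matrix representation might be worth considering.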
6

You can use strsplit and intersect from the base library:

> a <- "Roy lives in Japan and travels to Africa" 
> b <- "Roy travels Africa with this wife" 
> a_split <- unlist(strsplit(a, split=" ")) 
> b_split <- unlist(strsplit(b, split=" ")) 
> length(intersect(a_split, b_split)) 
[1] 3 
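Note that splitting on a single space keeps punctuation and case, so "Roy." and "Roy" would not match. If that matters for your data, a rough sketch of normalising first could look like this. The helper tokenize() is a hypothetical name, not a base function, and stripping all punctuation is an assumption that may not suit every dataset.

```r
# Hypothetical helper: lower-case, drop punctuation, split on runs of whitespace
tokenize <- function(x) {
  x <- tolower(gsub("[[:punct:]]", "", x))
  unique(strsplit(trimws(x), "[[:space:]]+")[[1]])
}

a <- "Roy lives in Japan and travels to Africa"
c_str <- "Bob also travels Africa for trips but lives in the US unlike Roy."

# Plain split: "Roy." in c_str does not match "Roy" in a
plain_count <- length(intersect(strsplit(a, " ")[[1]], strsplit(c_str, " ")[[1]]))

# Normalised split: "roy" now matches in both strings
normalized_count <- length(intersect(tokenize(a), tokenize(c_str)))

c(plain = plain_count, normalized = normalized_count)
```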
+0

Worked for me! – Crt 2016-03-01 21:24:28

2

This approach generalizes to n vectors:

a <- "Roy lives in Japan and travels to Africa" 
b <- "Roy travels Africa with this wife" 
c <- "Bob also travels Africa for trips but lives in the US unlike Roy." 

library(stringi);library(qdapTools) 
X <- stri_extract_all_words(list(a, b, c)) 
X <- mtabulate(X) > 0 
Y <- colSums(X) == nrow(X); names(Y)[Y] 

[1] "Africa" "Roy"  "travels"
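If you would rather not depend on qdapTools, the same tabulate-and-filter idea can probably be sketched in base R. This is an assumption-laden variant, not the code above: it tokenises with "\\w+" instead of stri_extract_all_words, and the variable names are made up for illustration.

```r
strings <- c("Roy lives in Japan and travels to Africa",
             "Roy travels Africa with this wife",
             "Bob also travels Africa for trips but lives in the US unlike Roy.")

# Distinct words per string, so each string contributes at most 1 per word
tokens <- lapply(regmatches(strings, gregexpr("\\w+", strings)), unique)

# Count in how many strings each word occurs
freq <- table(unlist(tokens))

# Keep the words that occur in every string
common <- names(freq)[as.vector(freq == length(strings))]
common
```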