Count common words in two strings

I have two strings:

a <- "Roy lives in Japan and travels to Africa" 
b <- "Roy travels Africa with this wife"

I want to get the number of words that these strings have in common.

The answer should be 3:

"Roy"
"travels"
"Africa"

are the common words.

This is what I tried:

stra <- as.data.frame(t(read.table(textConnection(a), sep = " "))) 
strb <- as.data.frame(t(read.table(textConnection(b), sep = " ")))

Take the unique words to avoid counting duplicates:

stra_unique <-as.data.frame(unique(stra$V1)) 
strb_unique <- as.data.frame(unique(strb$V1)) 
colnames(stra_unique) <- c("V1") 
colnames(strb_unique) <- c("V1") 

common_words <-length(merge(stra_unique,strb_unique, by = "V1")$V1)

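I need this for datasets of 2000 and 1200 strings, so the total number of string pairs I have to evaluate is 2000 x 1200. Is there a fast way to do this without using loops?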

Source

2014-09-19 Jaimik Jain

I'm not really recommending this either, but with your "stra" and "strb", you could probably just do `merge(stra, strb)`... – A5C1D2H2I1M1N2O1R2T1 2014-09-19 11:03:08

Perhaps use intersect and str_extract_all. For multiple strings, you can put them in a list or a vector:

library(stringr)  # provides str_extract_all
vec1 <- c(a,b) 
Reduce(`intersect`,str_extract_all(vec1, "\\w+")) 
#[1] "Roy"  "travels" "Africa"

For a faster option, consider stringi:

library(stringi) 
Reduce(`intersect`,stri_extract_all_regex(vec1,"\\w+")) 
#[1] "Roy"  "travels" "Africa"

To get the count:

length(Reduce(`intersect`,stri_extract_all_regex(vec1,"\\w+"))) 
#[1] 3

Or using base R:

Reduce(`intersect`,regmatches(vec1,gregexpr("\\w+", vec1))) 
#[1] "Roy"  "travels" "Africa"

Source

2014-09-19 09:25:44 akrun

You can use strsplit and intersect from base R:

> a <- "Roy lives in Japan and travels to Africa" 
> b <- "Roy travels Africa with this wife" 
> a_split <- unlist(strsplit(a, split=" ")) 
> b_split <- unlist(strsplit(b, split=" ")) 
> length(intersect(a_split, b_split)) 
[1] 3
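
For the 2000 x 1200 case from the question, one possible sketch (not a benchmarked solution) is to turn each string into a row of a binary word-incidence matrix and let a matrix product count the shared words for every pair at once. The vectors `xs` and `ys` and the helpers `split_unique` and `incidence` are hypothetical names made up for this illustration. The per-string steps still loop implicitly through lapply/vapply, but there is no explicit loop over pairs.

```r
# Hypothetical stand-ins for the two datasets of strings
xs <- c("Roy lives in Japan and travels to Africa",
        "Bob lives in the US")
ys <- c("Roy travels Africa with this wife",
        "Bob travels in the US",
        "nothing shared")

# Split each string on spaces and keep each word once per string
split_unique <- function(s) lapply(strsplit(s, " ", fixed = TRUE), unique)
tokens_x <- split_unique(xs)
tokens_y <- split_unique(ys)

# Vocabulary of every word seen in either dataset
vocab <- unique(c(unlist(tokens_x), unlist(tokens_y)))

# One row per string, one column per vocabulary word (1 = word present)
incidence <- function(tokens) {
  m <- vapply(tokens, function(w) as.integer(vocab %in% w),
              integer(length(vocab)))
  t(m)
}
mx <- incidence(tokens_x)
my <- incidence(tokens_y)

# counts[i, j] is the number of distinct words shared by xs[i] and ys[j]
counts <- tcrossprod(mx, my)
counts
```

With real data the incidence matrices can become wide; a sparse matrix representation may be worth considering in that case, though that is outside this sketch.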

Source

2014-09-19 09:30:47

Worked for me! – Crt 2016-03-01 21:24:28

This approach generalizes to n vectors:

a <- "Roy lives in Japan and travels to Africa" 
b <- "Roy travels Africa with this wife" 
c <- "Bob also travels Africa for trips but lives in the US unlike Roy." 

library(stringi);library(qdapTools) 
X <- stri_extract_all_words(list(a, b, c)) 
X <- mtabulate(X) > 0 
Y <- colSums(X) == nrow(X); names(Y)[Y] 

[1] "Africa" "Roy"  "travels"
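
If qdapTools is not available, the same idea can be sketched in base R, assuming words are matched with the "\\w+" pattern used in the earlier answer: count in how many strings each word occurs and keep the words that occur in all of them. The names `strings`, `words`, `tab`, and `common` are chosen here for illustration only.

```r
strings <- c("Roy lives in Japan and travels to Africa",
             "Roy travels Africa with this wife",
             "Bob also travels Africa for trips but lives in the US unlike Roy.")

# Extract the words of each string and keep each word once per string
words <- lapply(regmatches(strings, gregexpr("\\w+", strings)), unique)

# Number of strings in which each word occurs
tab <- table(unlist(words))

# Words that occur in every string
common <- names(tab)[tab == length(words)]
common
length(common)
```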

Source

2016-01-29 13:09:06
