2017-07-13

Count only the first instance of a keyword, without repeat counts, in R

I have a list of keywords:

library(stringr) 
words <- as.character(c("decomposed", "no diagnosis","decomposition","autolysed","maggots", "poor body", "poor","not suitable", "not possible")) 

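I want to match these keywords against a text column of a data frame (df$text) and store the number of times each keyword occurs in a separate data.frame (matchdf):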

matchdf<- data.frame(Keywords=words) 
m_match<-sapply(1:length(words), function(x) sum(str_count(tolower(df$text),words[[x]]))) 
matchdf$matchs<-m_match 

However, I noticed that this method counts every occurrence of a keyword within a field. For example:

"The sample was too decomposed to perform an analysis. The decomposed sample indicated that this animal was dead for a long time" 

This returns a count of 2. However, I only want to count the first instance of "decomposed" within a field.

I thought there would be a way to count only the first instance with str_count, but there doesn't seem to be one.


Don't you want 'str_detect', then? – CPak

Answer

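stringr is not required for this example; grepl from base R is sufficient. That said, you can use str_detect instead of grepl if you prefer the package's function (as @CPak pointed out in the comments).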

library(stringr) 

words <- c("decomposed", "no diagnosis","decomposition","autolysed","maggots", 
      "poor body", "poor","not suitable", "not possible") 

df <- data.frame(text = "The sample was too decomposed to perform an analysis. The decomposed sample indicated that this animal was dead for a long time") 

matchdf <- data.frame(Keywords = words, stringsAsFactors = FALSE) 

# Base R grepl 
matchdf$matches1 <- sapply(1:length(words), function(x) as.numeric(grepl(words[x], tolower(df$text)))) 

# Stringr function 
matchdf$matches2 <- sapply(1:length(words), function(x) as.numeric(str_detect(tolower(df$text),words[[x]]))) 

matchdf 

Result

       Keywords matches1 matches2
1    decomposed        1        1
2  no diagnosis        0        0
3 decomposition        0        0
4     autolysed        0        0
5       maggots        0        0
6     poor body        0        0
7          poor        0        0
8  not suitable        0        0
9  not possible        0        0
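
The example above has a single row in df, so each grepl call returns exactly one value per keyword. With several rows, grepl returns one logical per row, so the per-row results have to be summed to get one count per keyword. Below is a minimal sketch of that extension, using base R only. The second and third rows of df are made up for illustration, and fixed = TRUE is an added assumption that the keywords should be matched as literal strings rather than as regular expressions.

```r
# Hypothetical three-row data frame; only the first row comes from the question.
df <- data.frame(
  text = c(
    "The sample was too decomposed to perform an analysis. The decomposed sample indicated that this animal was dead for a long time",
    "Carcass autolysed, no diagnosis possible",
    "Poor body condition, decomposed"
  ),
  stringsAsFactors = FALSE
)

words <- c("decomposed", "no diagnosis", "decomposition", "autolysed", "maggots",
           "poor body", "poor", "not suitable", "not possible")

matchdf <- data.frame(Keywords = words, stringsAsFactors = FALSE)

# grepl() yields one TRUE/FALSE per row of df, so sum() counts the rows that
# contain the keyword at least once, however often it repeats within a row.
matchdf$matches <- vapply(
  words,
  function(w) sum(grepl(w, tolower(df$text), fixed = TRUE)),
  integer(1),
  USE.NAMES = FALSE
)

matchdf
```

In this sketch "decomposed" is counted twice: once for the first row, where it appears two times, and once for the third row.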