2017-04-17 78 views

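R text mining: how to identify words preceding a keyword

I am doing text mining in R and would like to determine whether certain words precede my focal keyword by three or fewer words. For example, my focal keyword is "compatibility", and I want to know whether the word "limited" appears three or fewer words before it. Specifically, I want a frequency count of how many times the following combinations appear in a text (X = any other word):

  • limited compatibility
  • limited X compatibility
  • limited X X compatibility

Any suggestions are welcome. Thanks.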

Answers


Here is a way to find skip n-grams using tidytext:

library(tidyverse) 
library(tidytext) 

x <- 'I am working on text mining using R, I would like to identify if some words precede my focal keyword by three or fewer words. For instance, my focal keyword is compatibility and I wanted to know if the word limited precedes my keyword by three or fewer words. Thus, I wanted to get frequency count in a text regarding how many times the following combination appears (X=any other word): 

limited compatibility 
limited X compatibility 
limited X X compatibility 

Any suggestions are welcome. Thanks.' 

tibble(x = x) %>%   # data_frame() is deprecated in favour of tibble()
    unnest_tokens(line, x, 'lines') %>% 
    mutate(line_number = row_number()) %>% 
    unnest_tokens(ngram, line, 'skip_ngrams', n = 2, k = 2) %>% 
    filter(grepl('limited', ngram), grepl('compatibility', ngram)) 
#> # A tibble: 3 × 2 
#> line_number     ngram 
#>   <int>     <chr> 
#> 1   2 limited compatibility 
#> 2   3 limited compatibility 
#> 3   4 limited compatibility 

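Here is an approach using base R and regular expressions.
With all = TRUE, grepRaw returns the position of every match of the regex pattern. The length of that result is the number of matches.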

d <- c(" 
Limited compatibility Lorem ipsum dolor sit amet, consectetur adipiscing elit. Nulla maximus lobortis 
tellus quis egestas. Donec non dignissim urna. Praesent at commodo ligula. 
Cras laoreet limited compatibility interdum mi nec euismod. Ut interdum odio non sem luctus iaculis. Mauris id sapien limited X XXXX compatibility accumsan, imperdiet justo non,limited compatibility egestas felis. Morbi commodo lectus limited X compatibility scelerisque limited XXX compatibility est bibendum, vel varius tellus vulputate. Aenean dictum accumsan limited X compatibility neque limited X X compatibility sed dictum. Vivamus finibus lacus sit amet iaculis molestie. Fusce enim limited X compatibility sapien, iaculis quis leo non, pellentesque lobortis arcu. Proin commodo limited X XXX XXXXX compatibility velit placerat venenatis mattis. Limited compatibility Curabitur et laoreet ipsum. Limited compatibility 
") 

> length(grepRaw("Limited compatibility", d, ignore.case = TRUE, all = TRUE)) 
[1] 5 
> length(grepRaw("limited \\w+ compatibility", d, ignore.case = TRUE, all = TRUE)) 
[1] 4 
> length(grepRaw("limited (\\w+ ){2}compatibility", d, ignore.case = TRUE, all = TRUE)) 
[1] 2 
> length(grepRaw("limited (\\w+ ){3}compatibility", d, ignore.case = TRUE, all = TRUE)) 
[1] 1 
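
As a rough sketch of how these counts could be gathered in one go: the sample string `s` and the variable names below are made up for illustration, and the pattern assumes words are separated by single spaces, as in the examples above. A bounded quantifier `{0,2}` covers "zero to two words in between", and `strrep()` builds one pattern per exact gap.

```r
# Illustrative input, not the data from the answer above
s <- "Limited compatibility, limited X compatibility, limited X X compatibility, limited a b c compatibility"

# Start offsets of every match with at most two words in between
pos <- grepRaw("limited (\\w+ ){0,2}compatibility", s,
               ignore.case = TRUE, all = TRUE)
n_total <- length(pos)

# Separate counts for exactly 0, 1 and 2 intervening words
n_by_gap <- sapply(0:2, function(k) {
  pattern <- paste0("limited ", strrep("\\w+ ", k), "compatibility")
  length(grepRaw(pattern, s, ignore.case = TRUE, all = TRUE))
})
names(n_by_gap) <- paste0("gap_", 0:2)

print(n_total)
print(n_by_gap)
```

The last phrase in `s` has three words in between, so it should not be counted by either pattern.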

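The following regex matches a span that starts at one "limited X compatibility" and runs into the next "limited X X compatibility", which is not the intended behavior:

> length(grepRaw("limited (\\w+ ){6}compatibility", d, ignore.case = TRUE, all = TRUE)) 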
[1] 1 
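
To see what such a pattern actually matched, one option is to extract the matched text instead of only counting it. This is a small sketch on a made-up string `s` (not the data above), using `gregexpr()` and `regmatches()` from base R:

```r
# Illustrative input: two separate phrases joined by one filler word
s <- "limited X compatibility neque limited X X compatibility"

# Pull out the text of each match rather than its position
m <- regmatches(
  s,
  gregexpr("limited (\\w+ ){6}compatibility", s, ignore.case = TRUE)
)[[1]]

print(m)
```

Here the single match spans both phrases, because the first "compatibility" and the second "limited" are themselves counted as two of the six intervening words.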

It may therefore be safer to put each "limited X X compatibility" pattern on its own line:

d <- gsub("Limited", "\nLimited", d, ignore.case = TRUE) 
d <- gsub("compatibility", "compatibility\n", d, ignore.case = TRUE) 
# writeLines(d) 

The result is now correct:

> length(grepRaw("limited (\\w+ ){6}compatibility", d, ignore.case = TRUE, all = TRUE)) 
[1] 0
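
An alternative that sidesteps the overlapping-match problem altogether is to work on word positions instead of regex matches. The sketch below is only one possible take: `count_gaps` is a hypothetical helper, not part of any package, and it makes two assumptions worth checking against your data. Every run of non-alphanumeric characters is treated as a word boundary (so sentence breaks are ignored), and each occurrence of the first word is paired with the nearest following occurrence of the second word.

```r
# Hypothetical helper: count how often `first` precedes `second`
# with exactly 0, 1, ..., max_gap other words in between.
count_gaps <- function(text, first, second, max_gap = 2) {
  # Lower-case word tokens; punctuation and whitespace act as separators
  words <- tolower(unlist(strsplit(text, "[^[:alnum:]]+")))
  words <- words[nzchar(words)]

  first_pos  <- which(words == first)
  second_pos <- which(words == second)

  # For each `first`, number of words before the nearest following `second`
  gaps <- vapply(first_pos, function(i) {
    nxt <- second_pos[second_pos > i]
    if (length(nxt) == 0) NA_integer_ else as.integer(nxt[1] - i - 1L)
  }, integer(1))

  keep <- !is.na(gaps) & gaps <= max_gap
  table(factor(gaps[keep], levels = 0:max_gap))
}

# Illustrative input, not the data from the answer above
txt <- "limited compatibility, limited X compatibility and limited X X compatibility; limited a b c compatibility"
res <- count_gaps(txt, "limited", "compatibility")
print(res)
```

The table has one cell per gap size, so the three combinations from the question can be read off directly; the phrase with three words in between is excluded by `max_gap = 2`.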