2017-08-13 75 views

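I want to use tidytext to work with bigrams and trigrams at the same time. What code can I use to extract both two-word and three-word sequences?

Here is the code that produces only bigrams: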

library(dplyr)  # provides the %>% pipe used below
library(tidytext)
library(janeaustenr)

austen_bigrams <- austen_books() %>% 
    unnest_tokens(bigram, text, token = "ngrams", n = 2) 

austen_bigrams 

Answer


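If you look at ?unnest_tokens, it says that ... is for arguments passed on to the tokenizer. For n-grams that tokenizer is tokenizers::tokenize_ngrams, and its help page documents an n_min argument, so you can do: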

library(magrittr) 
library(tidytext) 
library(janeaustenr) 

austen_bigrams <- austen_books() %>% 
    head(1000) %>% # otherwise this will get very large 
    unnest_tokens(bigram, text, token = "ngrams", n = 3, n_min = 2) 

austen_bigrams 
#> # A tibble: 19,801 x 2
#>                   book                bigram
#>                 <fctr>                 <chr>
#>  1 Sense & Sensibility             sense and
#>  2 Sense & Sensibility sense and sensibility
#>  3 Sense & Sensibility       and sensibility
#>  4 Sense & Sensibility    and sensibility by
#>  5 Sense & Sensibility        sensibility by
#>  6 Sense & Sensibility   sensibility by jane
#>  7 Sense & Sensibility               by jane
#>  8 Sense & Sensibility        by jane austen
#>  9 Sense & Sensibility           jane austen
#> 10 Sense & Sensibility      jane austen 1811
#> # ... with 19,791 more rows
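
Because n = 3 together with n_min = 2 puts bigrams and trigrams into the same column, you may want to tell them apart afterwards. Here is a minimal sketch of one way to do that, assuming the n-grams are joined by single spaces (the tokenizer's default delimiter). The toy input `docs` and the column names `ngram` and `n_words` are made up for illustration and are not part of the original question.

```r
library(dplyr)
library(tidytext)

# Hypothetical toy input so the example runs quickly
docs <- tibble(id = 1, text = "the quick brown fox")

ngrams <- docs %>%
  # n = 3 with n_min = 2 yields every 2-word and 3-word sequence
  unnest_tokens(ngram, text, token = "ngrams", n = 3, n_min = 2) %>%
  # count the words in each n-gram by splitting on the space delimiter
  mutate(n_words = lengths(strsplit(ngram, " ", fixed = TRUE)))

bigrams  <- filter(ngrams, n_words == 2)
trigrams <- filter(ngrams, n_words == 3)
```

The same filter should work on the Jane Austen output above if you add the `mutate()` step to that pipeline, using `bigram` as the column name.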