Document-term matrix in R: bigram tokenizer not working

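I am trying to build two document-term matrices for one corpus, one using unigrams and one using bigrams. However, the bigram matrix currently comes out identical to the unigram matrix, and I don't know why. I also tried the ngram package as the tokenizer, but that does not work either. The code: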

library(tm)     # Corpus, DirSource, DocumentTermMatrix, stopwords, inspect
library(RWeka)  # NGramTokenizer, Weka_control

# Read every file under "data", including subdirectories
docs <- Corpus(DirSource("data", recursive = TRUE))

# Get the document term matrices
BigramTokenizer <- function(x) NGramTokenizer(x, Weka_control(min = 2, max = 2))
dtm_unigram <- DocumentTermMatrix(docs, control = list(tokenize = "words",
    removePunctuation = TRUE,
    stopwords = stopwords("english"),
    stemming = TRUE))
dtm_bigram <- DocumentTermMatrix(docs, control = list(tokenize = BigramTokenizer,
    removePunctuation = TRUE,
    stopwords = stopwords("english"),
    stemming = TRUE))

# Both matrices show the same (unigram) terms
inspect(dtm_unigram)
inspect(dtm_bigram)

With the ngram package, the tokenizer I tried was ngram(x, n = 2). How can I fix the bigram tokenization?

Source

2017-03-05 filaments

I have this problem too, so if you find an answer, please let me know.

Sorry for the late reply, but I got this working by using VCorpus instead of Corpus. – filaments

The tokenizer option does not seem to be applied to a Corpus (SimpleCorpus). Using VCorpus instead fixes the problem.
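A minimal sketch of that fix follows. The two in-memory texts are hypothetical stand-ins for the files under "data", and the tokenizer here is built on ngrams() from the NLP package rather than RWeka's NGramTokenizer, purely so the example runs without Java. The RWeka-based BigramTokenizer from the question should plug in the same way once the corpus is a VCorpus.

```r
library(NLP)  # ngrams(), words()
library(tm)   # VCorpus, VectorSource, DocumentTermMatrix, Terms

# Hypothetical sample documents, standing in for DirSource("data", recursive = TRUE)
texts <- c("the quick brown fox", "the quick red fox")

# VCorpus instead of Corpus, so that the custom tokenizer is actually used
docs <- VCorpus(VectorSource(texts))

# Bigram tokenizer: split the document into words, form consecutive pairs,
# and join each pair with a single space
BigramTokenizer <- function(x) {
  unlist(lapply(ngrams(words(x), 2), paste, collapse = " "), use.names = FALSE)
}

dtm_bigram <- DocumentTermMatrix(docs, control = list(tokenize = BigramTokenizer))

# Every term should now be a two-word phrase
bigram_terms <- Terms(dtm_bigram)
print(bigram_terms)
```

As far as I can tell, this matches the tm documentation for TermDocumentMatrix, which says that a SimpleCorpus is processed with a built-in tokenizer and takes no custom functions as option arguments. With file input, the equivalent change would be docs <- VCorpus(DirSource("data", recursive = TRUE)).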

Source

2017-03-28 18:30:48 filaments

Why 'VCorpus' over 'Corpus'? There is another related SO question [here](https://stackoverflow.com/questions/42757183/creating-n-grams-with-tm-rweka-works-with-vcorpus-but-not-corpus), but it doesn't seem to have a satisfactory explanation either. – hongsy
