2017-10-19

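TermDocumentMatrix in R - only 1-grams created

I have just started using the tm package in R and cannot seem to get past one problem. My tokenizer functions appear to work correctly: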

uniTokenizer <- function(x) NGramTokenizer(x, Weka_control(min=1, max=1)) 
biTokenizer <- function(x) NGramTokenizer(x, Weka_control(min=2, max=2)) 
triTokenizer <- function(x) NGramTokenizer(x, Weka_control(min=3, max=3)) 

uniTDM <- TermDocumentMatrix(corpus, control=list(tokenize = uniTokenizer)) 
biTDM <- TermDocumentMatrix(corpus, control=list(tokenize = biTokenizer)) 
triTDM <- TermDocumentMatrix(corpus, control=list(tokenize = triTokenizer)) 

When I try to pull 2-grams from biTDM, only 1-grams come out...

findFreqTerms(biTDM, 50) 

[1] "after" "and"  "most" "the"  "were" "years" "love" 
[8] "you"  "all"  "also" "been" "did"  "from" "get"  
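
One quick way to confirm that the result really contains only single words is to count the words in each returned term. This is a minimal base-R sketch; the `terms` vector is copied by hand from the output above, and `words_per_term` is just an illustrative name:

```r
# Terms copied from the findFreqTerms() output above
terms <- c("after", "and", "most", "the", "were", "years", "love",
           "you", "all", "also", "been", "did", "from", "get")

# Count the space-separated words in each term
words_per_term <- lengths(strsplit(terms, " ", fixed = TRUE))

# A genuine bigram matrix would show only the value 2 here
print(table(words_per_term))
```

With a real matrix, the same check could be run on `Terms(biTDM)` instead of a hand-copied vector.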

Meanwhile, the 2-gram tokenizer itself appears to be intact:

x <- biTokenizer(corpus) 
head(x) 

[1] "c in"    "in the"   "the years"  
[4] "years thereafter" "thereafter most" "most of"  

Including a [minimal reproducible example](https://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example) in your question will increase your chances of getting an answer. – jsb

Answer

I can only guess at what the problem is here: NGramTokenizer needs a VCorpus object, not a Corpus object.

library(tm) 
library(RWeka) 

# some dummy text 
text <- c("Lorem ipsum dolor sit amet, consetetur sadipscing elitr", 
      "sed diam nonumy eirmod tempor invidunt ut labore et dolore magna aliquyam erat", 
      "sed diam voluptua. At vero eos et accusam et justo duo dolores et ea rebum", 
      "Stet clita kasd gubergren, no sea takimata sanctus est Lorem ipsum dolor sit amet") 

# create a VCorpus 
corpus <- VCorpus(VectorSource(text)) 


biTokenizer <- function(x) NGramTokenizer(x, Weka_control(min=2, max=2)) 


biTDM <- TermDocumentMatrix(corpus, control=list(tokenize = biTokenizer)) 

print(biTDM$dimnames$Terms) 

[1] "accusam et"   "aliquyam erat"   "amet consetetur"  "at vero"    "clita kasd"   "consetetur sadipscing" "diam nonumy"   "diam voluptua"   "dolor sit"    "dolore magna"   
[11] "dolores et"   "duo dolores"   "ea rebum"    "eirmod tempor"   "eos et"    "est lorem"    "et accusam"   "et dolore"    "et ea"     "et justo"    
[21] "gubergren no"   "invidunt ut"   "ipsum dolor"   "justo duo"    "kasd gubergren"  "labore et"    "lorem ipsum"   "magna aliquyam"  "no sea"    "nonumy eirmod"   
[31] "sadipscing elitr"  "sanctus est"   "sea takimata"   "sed diam"    "sit amet"    "stet clita"   "takimata sanctus"  "tempor invidunt"  "ut labore"    "vero eos"    
[41] "voluptua at"
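
To check which kind of corpus you actually have, it can help to look at its class before building the matrix. The sketch below rests on an assumption about your tm version: in recent releases, `Corpus()` on a `VectorSource` may return a `SimpleCorpus`, and the `TermDocumentMatrix` documentation says that custom functions are not used for that class. To avoid depending on RWeka and Java, it swaps `NGramTokenizer` for a tm/NLP-only bigram tokenizer (`bigramTokenizer`, an illustrative name) built on `ngrams()` and `words()`. The two sample sentences are made up.

```r
library(tm)

# Made-up sample text, for illustration only
text <- c("the quick brown fox", "the lazy dog sleeps")

# Depending on the tm version, this may print "SimpleCorpus" first
print(class(Corpus(VectorSource(text))))

# VCorpus() always builds a volatile corpus of full text documents
vcorpus <- VCorpus(VectorSource(text))
print(class(vcorpus))

# Bigram tokenizer using only tm/NLP (a stand-in for NGramTokenizer)
bigramTokenizer <- function(x) {
  unlist(lapply(ngrams(words(x), 2), paste, collapse = " "),
         use.names = FALSE)
}

tdm <- TermDocumentMatrix(vcorpus, control = list(tokenize = bigramTokenizer))

# Every term should now consist of exactly two words
words_per_term <- lengths(strsplit(Terms(tdm), " ", fixed = TRUE))
print(table(words_per_term))
```

If the first `class()` call reports a `SimpleCorpus`, rebuilding the corpus with `VCorpus()` as in the answer above is probably the simplest fix.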