2017-02-09 100 views

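Question: Keep only specific terms in a document-term matrix (R)

How can I keep only the bigram "no wonderful" in the term-document matrix, or more generally only a list of bigrams (Terms) that I want to keep?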

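I want to apply this to a very large document-term matrix. I tried converting the term matrix to a regular matrix, but the resulting vector size exceeds 1000 Gb.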

Code:

dd <- data.frame(
  id = 10:13,
  text = c("No wonderful, then, that ever",
           "So that in many cases such a ",
           "But there were still other and",
           "Not even at the rationale"),
  stringsAsFactors = FALSE)

library(tm)
library(RWeka)

myReader <- readTabular(mapping = list(content = "text", id = "id"))

# create a VCorpus
tm <- VCorpus(DataframeSource(dd), readerControl = list(reader = myReader))

# n-gram tokenizer (bigrams only)
Tokenizer <- function(x) NGramTokenizer(x, Weka_control(min = 2, max = 2))

# create a term-document matrix using Tokenizer
dtm <- TermDocumentMatrix(tm, control = list(tokenize = Tokenizer))
inspect(dtm)

Output:

                Docs
Terms            10 11 12 13
  at the          0  0  0  1
  but there       0  0  1  0
  cases such      0  1  0  0
  even at         0  0  0  1
  in many         0  1  0  0
  many cases      0  1  0  0
  no wonderful    1  0  0  0
  not even        0  0  0  1
  other and       0  0  1  0
  so that         0  1  0  0
  still other     0  0  1  0
  such a          0  1  0  0
  that ever       1  0  0  0
  that in         0  1  0  0
  the rationale   0  0  0  1
  then that       1  0  0  0
  there were      0  0  1  0
  were still      0  0  1  0
  wonderful then  1  0  0  0

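Answer

I had assumed this would be more complicated because the object is a DTM, but it can be subset by term name like an ordinary matrix.

Problem solved: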

d_sel <- dtm[c('no wonderful', 'there were'), ]
inspect(d_sel)

                Docs
Terms            10 11 12 13
  no wonderful    1  0  0  0
  there were      0  0  1  0
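
When the list of wanted bigrams comes from elsewhere, some of them may not occur in the matrix, and indexing by a missing name can fail. A minimal sketch of a more defensive version follows. It assumes the tm and slam packages are installed, and it builds a tiny term-document matrix by hand (the terms, counts, and the "not present" entry are illustrative, not the asker's data) so that it runs without RWeka. The result stays sparse, so no dense matrix is ever created.

```r
library(tm)
library(slam)

# Hand-built sparse term-document matrix (illustrative data only).
terms <- c("at the", "no wonderful", "that ever", "there were")
docs  <- c("10", "11", "12", "13")
stm <- simple_triplet_matrix(
  i = c(1, 2, 3, 4),          # row (term) index of each non-zero cell
  j = c(4, 1, 1, 3),          # column (document) index of each non-zero cell
  v = c(1, 1, 1, 1),          # counts
  nrow = 4, ncol = 4,
  dimnames = list(Terms = terms, Docs = docs))
tdm <- as.TermDocumentMatrix(stm, weighting = weightTf)

# Bigrams we would like to keep; the last one does not occur in tdm.
wanted <- c("no wonderful", "there were", "not present")

# Keep only the wanted terms that actually exist, then subset by name.
keep    <- intersect(wanted, Terms(tdm))
tdm_sel <- tdm[keep, ]

inspect(tdm_sel)
```

If the wanted terms are known before the matrix is built, the `dictionary` entry of the `control` list passed to `TermDocumentMatrix` is another option, since it restricts the tabulated terms to the given character vector.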