Keep only specific terms in a document-term matrix in R

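Question: How can I keep only the bigram "no wonderful" in the document-term matrix, or only a list of bigrams (Terms) that I want to keep?

I want to apply this to a very large document-term matrix. I tried converting the term matrix to an ordinary matrix, but the vector size exceeds 1000 Gb.

Code: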

dd <- data.frame(
  id = 10:13,
  text = c("No wonderful, then, that ever",
           "So that in many cases such a ",
           "But there were still other and",
           "Not even at the rationale"),
  stringsAsFactors = FALSE)

library(tm)
library(RWeka)

# map the data frame columns onto document content and id
myReader <- readTabular(mapping = list(content = "text", id = "id"))

# create a VCorpus
tm <- VCorpus(DataframeSource(dd), readerControl = list(reader = myReader))

# bigram tokenizer
Tokenizer <- function(x) NGramTokenizer(x, Weka_control(min = 2, max = 2))

# create a term-document matrix using Tokenizer
dtm <- TermDocumentMatrix(tm, control = list(tokenize = Tokenizer))
inspect(dtm)

Output:

                 Docs
Terms            10 11 12 13
  at the          0  0  0  1
  but there       0  0  1  0
  cases such      0  1  0  0
  even at         0  0  0  1
  in many         0  1  0  0
  many cases      0  1  0  0
  no wonderful    1  0  0  0
  not even        0  0  0  1
  other and       0  0  1  0
  so that         0  1  0  0
  still other     0  0  1  0
  such a          0  1  0  0
  that ever       1  0  0  0
  that in         0  1  0  0
  the rationale   0  0  0  1
  then that       1  0  0  0
  there were      0  0  1  0
  were still      0  0  1  0
  wonderful then  1  0  0  0

Source

2017-02-09 BEMR

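I had assumed this would be more complicated because it is a DTM.

Problem solved, the matrix can be subset directly by term name: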

d_sel <- dtm[c('no wonderful', 'there were'), ]
inspect(d_sel)

                 Docs
Terms            10 11 12 13
  no wonderful    1  0  0  0
  there were      0  0  1  0
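
For a longer list of bigrams, some of which may not occur in the matrix at all, the same row subsetting can be guarded with intersect(). The following is only a minimal sketch under a few assumptions: the corpus is a toy stand-in for the data above (already lower-case and without punctuation), the bigram tokenizer is built from ngrams() and words() rather than RWeka purely to avoid the Java dependency, and the names texts, corpus, bigram_tokenizer, wanted, keep and tdm_sel are made up for illustration.

```r
library(tm)  # also attaches NLP, which provides ngrams() and words()

# Toy stand-in for the question's corpus (assumption: pre-cleaned text)
texts <- c("no wonderful then that ever",
           "so that in many cases such a",
           "but there were still other and",
           "not even at the rationale")
corpus <- VCorpus(VectorSource(texts))

# Bigram tokenizer without RWeka (hypothetical helper for this sketch)
bigram_tokenizer <- function(x) {
  unlist(lapply(ngrams(words(x), 2L), paste, collapse = " "),
         use.names = FALSE)
}

tdm <- TermDocumentMatrix(corpus, control = list(tokenize = bigram_tokenizer))

# Hypothetical wish list; "not present" does not occur in the corpus
wanted <- c("no wonderful", "there were", "not present")

# Keep only the wanted terms that actually exist in the matrix,
# so that the subset does not fail on a missing term name
keep <- intersect(wanted, Terms(tdm))
tdm_sel <- tdm[keep, ]

inspect(tdm_sel)
```

The subset is taken on the sparse representation, so no dense matrix has to be allocated along the way, which is what matters for the very large matrix mentioned in the question.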

Source

2017-02-09 18:19:15 BEMR
