2016-08-02

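Error when creating a TermDocumentMatrix in the tm package

I'm new to the tm package and hit a roadblock when trying to apply the TermDocumentMatrix function.

I use the code below, which runs fine until the function call fails: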

myCorpus <- Corpus(VectorSource(posts$message)) 
myCorpus <- tm_map(myCorpus, content_transformer(tolower)) 
myCorpus <- tm_map(myCorpus, removePunctuation) 
myCorpus <- tm_map(myCorpus, removeNumbers) 

removeURL <- function(x) gsub("http[[:alnum:]]*", "", x) 

myCorpus <- tm_map(myCorpus, removeURL) 

myStopwords <- c(stopwords("english")) 
myCorpus <- tm_map(myCorpus, removeWords, myStopwords) 

myCorpusCopy <- myCorpus 
myCorpus <- tm_map(myCorpus, stemDocument) 

On inspection, the document list looks the way it should:

> for(i in 1:5) { 
+ cat(paste("[[", i, "]] ", sep ="")) 
+ writeLines(myCorpus[[i]]) 
+ } 
[[1]] syntel recruitment drive week freshers newregistrationlink passout graduates 
qualification graduatebebtechmcamemtech 
syntel registration link 
limited referrals available 
comment emailids reference future job upd 
[[2]] dont miss opportunity get placed one best mnc companies world ebay freshers week january 
qualification graduate can apply 
ebay registration link 
comment emailids fast beacuse referrals left 
[[3]] recent passouts  eligible apply wipro go updated link lastday reference drive jan apply link fresher referral 
apply link 
go link apply asap 
[[4]] robertbosch recruitment drive week freshers newregistrationlink passout graduates 
qualification graduatebebtechmcamemtech 
robertbosch registration link 
limited referrals available 
comment emailids reference future job upd 
[[5]] mega job openings year 
mphasis recruitment freshers january 
qualification btech bsc bca graduates mca mba mtech post graduates 
mphasis registration link 
comment emailids comment box reference future job updates emailbox  

However, the problem appears after running stem completion against the corpus copy.

myCorpus <- tm_map(myCorpus, stemCompletion, 
        dictionary = myCorpusCopy, lazy = TRUE) 
> tdm <- TermDocumentMatrix(myCorpus, control = list(wordLengths = c(1, Inf))) 
Error in UseMethod("meta", x) : 
    no applicable method for 'meta' applied to an object of class "try-error" 
In addition: Warning messages: 
1: In mclapply(x$content[i], function(d) tm_reduce(d, x$lazy$maps)) : 
    all scheduled cores encountered errors in user code 
2: In mclapply(unname(content(x)), termFreq, control) : 
    all scheduled cores encountered errors in user code 

Any ideas on how to fix this?

Answers

1

I think you have to call this again before using TermDocumentMatrix:

myCorpus <- Corpus(VectorSource(myCorpus)) 

so the last block of your code would be:

myCorpus <- tm_map(myCorpus, stemCompletion, dictionary = myCorpusCopy) 
myCorpus <- Corpus(VectorSource(myCorpus)) 
tdm <- TermDocumentMatrix(myCorpus, control = list(wordLengths = c(1, Inf))) 

If no errors occurred up to the point where the documents are stemmed, the instructions above should solve your problem.

0

Otherwise, you can first try:

myCorpus <- tm_map(myCorpus, PlainTextDocument) 

before you call

myCorpus <- Corpus(VectorSource(myCorpus))
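
A possible reason the PlainTextDocument step helps, offered as a sketch rather than a confirmed diagnosis: in tm 0.6 and later, a plain function passed to tm_map (such as removeURL in the question) can return bare character strings, so later steps that expect document objects may fail. Wrapping custom functions in content_transformer keeps the document class intact. The sample texts and the choice of VCorpus below are assumptions for illustration only; stemming is left out because stemDocument additionally needs the SnowballC package.

```r
library(tm)

# Hypothetical sample posts standing in for posts$message
docs <- c("Syntel recruitment drive 2016, register at http://example.com/reg",
          "eBay freshers week: apply fast!")

corpus <- VCorpus(VectorSource(docs))

# Built-in character functions need content_transformer() to keep the document class
corpus <- tm_map(corpus, content_transformer(tolower))
corpus <- tm_map(corpus, removePunctuation)
corpus <- tm_map(corpus, removeNumbers)

# Custom function from the question, wrapped so it returns documents, not raw strings
removeURL <- function(x) gsub("http[[:alnum:]]*", "", x)
corpus <- tm_map(corpus, content_transformer(removeURL))

corpus <- tm_map(corpus, removeWords, stopwords("english"))
corpus <- tm_map(corpus, stripWhitespace)

# Every element is still a text document, so the matrix can be built
tdm <- TermDocumentMatrix(corpus, control = list(wordLengths = c(1, Inf)))
inspect(tdm)
```

If this runs cleanly on a small sample but the full corpus still fails, applying each step to a single document (for example myCorpus[[1]]) without lazy = TRUE may help surface the underlying error that mclapply hides.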