2017-04-25

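Looping through a tm corpus without losing the corpus structure

I have a tm corpus of documents and a list of words. I want to run a for loop over the corpus so that the loop sequentially removes each word in the list from the corpus.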

Some reproducible data:

library(tm) 
m <- cbind(c("Apple blue two","Pear yellow five","Banana yellow two"), 
      c(1, 2, 3)) 
tm_corpus <- Corpus(VectorSource(m[,1])) 
words <- as.list(c("Apple", "yellow", "two")) 

tm_corpus is now a corpus object of 3 documents:

<<SimpleCorpus>> 
Metadata: corpus specific: 1, document level (indexed): 0 
Content: documents: 3 

words is a list of 3 words:

[[1]] 
[1] "Apple" 

[[2]] 
[1] "yellow" 

[[3]] 
[1] "two" 

I have tried three different loops. The first is:

tm_corpusClean <- tm_corpus 
for (i in seq_along(tm_corpusClean)) { 
    for (u in seq_along(words)) { 
    tm_corpusClean[i] <- tm_map(tm_corpusClean[i], removeWords, words[[u]]) 
    } 
} 

which returns the following error, together with seven warnings (numbered 1-7):

Error in x$dmeta[i, , drop = FALSE] : incorrect number of dimensions 
In addition: Warning messages: 
1: In tm_corpusClean[i] <- tm_map(tm_corpusClean[i], removeWords,     
words[[u]]) : 
    number of items to replace is not a multiple of replacement length 
2: In tm_corpusClean[i] <- tm_map(tm_corpusClean[i], removeWords,   
words[[u]]) : 
    number of items to replace is not a multiple of replacement length 
[...] 

The second is:

tm_corpusClean <- tm_corpus 
for (i in seq_along(words)) { 
    for (u in seq_along(tm_corpusClean)) { 
    tm_corpusClean[u] <- tm_map(tm_corpusClean[u], removeWords, words[[i]]) 
    } 
} 

which returns the error:

Error in x$dmeta[i, , drop = FALSE] : incorrect number of dimensions 

The last loop is:

tm_corpusClean <- tm_corpus 
for (i in seq_along(words)) { 
    tm_corpusClean <- tm_map(tm_corpusClean, removeWords, words[[i]]) 
} 

This does return an object named tm_corpusClean, but that object only gives back the first document rather than all three original ones:

inspect(tm_corpusClean[[1]]) 

<<PlainTextDocument>> 
Metadata: 7 
Content: chars: 6 

blue 

Where am I going wrong?

Answer

Before we get to sequential removal, first test whether tm_map works on your example:

obj1 <- tm_map(tm_corpus, removeWords, unlist(words)) 
sapply(obj1, `[`, "content") 

$`1.content` 
[1] " blue " 

$`2.content` 
[1] "Pear five" 

$`3.content` 
[1] "Banana " 

Next, use lapply to remove one word at a time, in sequence, i.e. "Apple", "yellow", "two":

obj2 <- lapply(words, function(word) tm_map(tm_corpus, removeWords, word)) 
sapply(obj2, function(x) sapply(x, `[`, "content")) 

          [,1]                [,2]             [,3]
1.content " blue two"         "Apple blue two" "Apple blue "
2.content "Pear yellow five"  "Pear five"      "Pear yellow five"
3.content "Banana yellow two" "Banana two"     "Banana yellow "

Note that the resulting corpora sit in a nested list, which is why two sapply calls are needed to view the contents.
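If the goal is a single corpus with each word removed in turn, rather than one corpus per word as lapply produces, a minimal sketch is to apply tm_map to the whole corpus inside the loop and then check the result with length() and content() instead of inspecting only element [[1]]. This assumes a tm version in which Corpus() returns a SimpleCorpus, as in the question's printed output; the names docs, n_docs and cleaned are illustrative only.

```r
library(tm)

# Same three documents as in the question
docs <- c("Apple blue two", "Pear yellow five", "Banana yellow two")
tm_corpus <- Corpus(VectorSource(docs))
words <- list("Apple", "yellow", "two")

# Remove one word per iteration from the whole corpus,
# never from a one-document slice such as tm_corpus[i]
tm_corpusClean <- tm_corpus
for (word in words) {
  tm_corpusClean <- tm_map(tm_corpusClean, removeWords, word)
}

# Check that the structure survived: number of documents and their text
n_docs <- length(tm_corpusClean)
cleaned <- content(tm_corpusClean)  # assumption: character vector for a SimpleCorpus
print(n_docs)
print(cleaned)
```

Assigning tm_map's result back to the full corpus sidesteps the subassignment tm_corpusClean[i] <- ... that the first two loops in the question rely on, which is where their errors are reported.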

Hi Adam, thanks for your answer. Your code works, but it gives me NAs instead of the output you show: 'obj1 <- tm_map(tm_corpus, removeWords, unlist(words)) sapply(obj1, `[`, "content")' '[1] NA NA NA' 'obj2 <- lapply(words, function(word) tm_map(tm_corpus, removeWords, word)) sapply(obj2, function(x) sapply(x, `[`, "content"))' '[,1] [,2] [,3] [1,] NA NA NA [2,] NA NA NA [3,] NA NA NA' Sorry, I could not figure out how to add line breaks. – Rnout

For 'obj1 <- tm_map(tm_corpus, removeWords, unlist(words))', if you check 'obj1[[1]]$content', what do you get? –

'obj1[[1]]$content' does return '[1] "blue"', so the NAs only appear after running 'sapply(obj1, `[`, "content")', which gives '[1] NA NA NA'. But it does seem to work on the corpus itself. :) – Rnout
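One possible explanation for the NAs (an assumption, not something confirmed in this thread): when sapply iterates over a SimpleCorpus, each element may arrive as a plain character string rather than a document object, and indexing a string with "content" yields NA. The sketch below extracts the text with as.character instead, which should behave the same for a SimpleCorpus and a VCorpus; the names docs, simple, vcorp, texts_simple and texts_vcorp are illustrative only.

```r
library(tm)

docs <- c("Apple blue two", "Pear yellow five", "Banana yellow two")
words <- c("Apple", "yellow", "two")

# Same transformation on both corpus types
simple <- tm_map(Corpus(VectorSource(docs)), removeWords, words)   # SimpleCorpus in recent tm
vcorp  <- tm_map(VCorpus(VectorSource(docs)), removeWords, words)  # list of PlainTextDocument

# as.character returns the text whether an element is already a string
# or a PlainTextDocument, so no "content" indexing is needed
texts_simple <- unname(sapply(simple, as.character))
texts_vcorp  <- unname(sapply(vcorp, as.character))

print(texts_simple)
print(texts_vcorp)

# Indexing a bare string by name is what gives NA
print("Apple blue two"["content"])
```

If the two printed vectors match, the removal itself worked and only the viewing step differed between corpus types.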