Is it possible to supply a custom stopword list to the RTextTools package?

With the tm package, I can do this as follows:

c0 <- Corpus(VectorSource(text)) 
c0 <- tm_map(c0, removeWords, c(stopwords("english"),mystopwords))

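Here mystopwords is a vector of the additional stop words I want to remove.

However, I can't find an equivalent way to do this with the RTextTools package. For example: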

dtm <- create_matrix(text,language="english", 
      removePunctuation=T, 
      stripWhitespace=T, 
      toLower=T, 
      removeStopwords=T, #no clear way to specify a custom list here! 
      stemWords=T)

Is it possible to do this? I really like the RTextTools interface, and it would be a pity to have to go back to tm.

Source

2013-10-08 user2175594

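There are three (or possibly more) solutions to your problem:

First, use the tm package just for removing the words. Both packages work with the same objects, so you can use tm only for the word removal and RTextTools for everything else. If you look inside the create_matrix function, you will see that it uses tm functions as well.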
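One way to sketch this first option is to strip the custom words from the raw text with tm's removeWords() before handing it to create_matrix. The sample text and the contents of mystopwords below are hypothetical, and the create_matrix call is guarded in case RTextTools is not installed.

```r
library(tm)  # provides removeWords()

# Hypothetical example data; replace with your own text and stop words.
text <- c("The quick brown fox jumps over the lazy dog",
          "A quick note about foxes and dogs")
mystopwords <- c("quick", "note")

# removeWords() is case-sensitive, so lower-case the text first.
text_clean <- removeWords(tolower(text), mystopwords)

# Pass the pre-cleaned text to RTextTools (only if it is installed).
if (requireNamespace("RTextTools", quietly = TRUE)) {
  dtm <- RTextTools::create_matrix(text_clean, language = "english",
                                   removePunctuation = TRUE,
                                   stripWhitespace = TRUE,
                                   toLower = TRUE,
                                   removeStopwords = TRUE,
                                   stemWords = TRUE)
  print(colnames(dtm))
}
```

Because the custom words are removed before stemming, the entries in mystopwords should be written as they appear in the unstemmed text.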

Second, modify the create_matrix function. For example, add an input parameter such as own_stopwords=NULL and add the following lines:

# existing line 
corpus <- Corpus(VectorSource(trainingColumn), 
        readerControl = list(language = language)) 
# after that add this new line 
if(!is.null(own_stopwords)) corpus <- tm_map(corpus, removeWords, 
              words=as.character(own_stopwords))

Third, write your own function, something like this:

# excluder function: drop the columns of dtm whose term is in own_stw
remove_my_stopwords <- function(own_stw, dtm) {
    # positions of the terms to keep; stop words absent from dtm are ignored,
    # so a list with no matches leaves the matrix unchanged
    keep <- which(!(colnames(dtm) %in% own_stw))
    return(dtm[, keep])
}

Let's see if it works:

# let's test it 
data(NYTimes) 
data <- NYTimes[sample(1:3100, size=10,replace=FALSE),] 
matrix <- create_matrix(cbind(data["Title"], data["Subject"])) 

head(colnames(matrix), 5) 
# [1] "109"   "200th"  "abc"   "amid"  "anniversary" 


# let's treat the first few terms above as our "own" stopwords 
ostw <- head(colnames(matrix), 5) 

matrix2<-remove_my_stopwords(own_stw=ostw, dtm=matrix) 

# check if they are still there 
sapply(ostw, function(x, words) any(x==words), words=colnames(matrix2)) 
#109  200th   abc  amid anniversary 
#FALSE  FALSE  FALSE  FALSE  FALSE

HTH

Source

2013-10-08 11:31:21 holzben

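Thanks! This works perfectly. However, since the RTextTools package lacks some functionality here (or at least lacks a simple way to do this), would you still recommend using it, or would you stick with the tm package? – user2175594

I think it depends on your matrix and your stopword vector. In general I would go with solution three, but if the matrix and the stopword vector are very large you may run into memory problems. In that case I would go with solution two: add the lines, name the function e.g. 'create_matrix2', put it in a file and source it. You can then use 'create_matrix2' just like the old function, but with the new feature. – holzben

You can add your own stop words to the same list. For example: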

c0 <- tm_map(c0, removeWords, c(stopwords("english"),"mystopwords"))
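A minimal runnable sketch of this approach, assuming mystopwords is a character vector of your own words (the sample text and word list here are hypothetical). Note that writing it in quotes, as "mystopwords", would pass that literal word rather than the contents of the vector.

```r
library(tm)

# Hypothetical example data and custom stop word list.
text <- c("the cat sat on the mat", "a dog chased the cat")
mystopwords <- c("cat", "mat")

# Build a corpus and remove both the English and the custom stop words.
c0 <- VCorpus(VectorSource(text))
c0 <- tm_map(c0, removeWords, c(stopwords("english"), mystopwords))

# Extract the cleaned documents as plain character strings.
cleaned <- sapply(c0, as.character)
print(cleaned)
```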

Source

2016-08-18 20:32:08 Apu
