How to remove unnecessary information from topic modeling (LDA)
Hello, I want to build a topic model. My data has the following structure.
1. Doesn't taste good to me.
2. Most delicious ramen I have ever had. Spicy and tasty. Great price too.
3. I have this on my subscription, my family loves this version. The taste is great by itself or when we add the vegetables and.or meats.
4. The noodle is ok, but I had better ones.
5. some day's this is lunch and or dinner on second case
6. Really good ramen!
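I cleaned the reviews and ran topic modeling on them. But as you can see, the output contains entries such as "", "26.6564810276031" and "character(0)".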
     [,1]     [,2]                [,3]            [,4]
[1,] "cabbag" ")."                "="             "side"
[2,] "gonna"  "26.6564810276031," ""              "day,"
[3,] "broth"  "figur"             "character(0)," "ok."
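These entries do not show up when I only look at the word frequencies, but they do appear once I run the topic model.
What am I doing wrong? How can I fix it?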
library(tm)
library(XML)
library(SnowballC)
crudeCorp<-VCorpus(VectorSource(readLines(file.choose())))
crudeCorp <- tm_map(crudeCorp, stripWhitespace)
crudeCorp<-tm_map(crudeCorp, content_transformer(tolower))
# remove stopwords from corpus
crudeCorp<-tm_map(crudeCorp, removeWords, stopwords("english"))
myStopwords <- c(stopwords("english"),"noth","two","first","lot", "because", "can", "will","go","also","get","since","way","even","just","now","will","give","gave","got","one","make","even","much","come","take","without","goes","along","alot","alone")
myStopwords <- setdiff(myStopwords, c("will","can"))
crudeCorp <- tm_map(crudeCorp, removeWords, myStopwords)
crudeCorp<-tm_map(crudeCorp,removeNumbers)
crudeCorp <- tm_map(crudeCorp, content_transformer(function(x)
  gsub(x, pattern = "bought", replacement = "buy")))
crudeCorp <- tm_map(crudeCorp, content_transformer(function(x)
  gsub(x, pattern = "broke", replacement = "break")))
crudeCorp <- tm_map(crudeCorp, content_transformer(function(x)
  gsub(x, pattern = "products", replacement = "product")))
crudeCorp <- tm_map(crudeCorp, content_transformer(function(x)
  gsub(x, pattern = "made", replacement = "make")))
crudeCorp <- tm_map(crudeCorp, stemDocument)
library(reshape)
library(ScottKnott)
library(lda)
### Faster Way of doing LDA
corpusLDA <- lexicalize(crudeCorp)
## K: number of topics; vocab = corpusLDA$vocab (the vocabulary)
ldaModel <- lda.collapsed.gibbs.sampler(corpusLDA$documents, K = 7,
                                        vocab = corpusLDA$vocab, burnin = 9999,
                                        num.iterations = 1000, alpha = 1, eta = 0.1)
top.words <- top.topic.words(ldaModel$topics, 10, by.score=TRUE)
print(top.words)
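A possible explanation, offered as a sketch rather than a confirmed diagnosis: lexicalize() from the lda package works on a plain character vector, while a VCorpus is a list of documents that also carry metadata (author, timestamp and so on). When that list is coerced to text, the metadata is written out along with the content, which would account for tokens like "character(0)," and a fractional seconds value. Two further details in the code above may matter: punctuation is never removed (hence ")." and "ok."), and stripWhitespace runs before the removals, so the gaps left by removed words can turn into "" tokens. The example below assumes one review per document; the names reviews, corp and docs are hypothetical stand-ins, with reviews replacing readLines(file.choose()).

```r
library(tm)
library(lda)

# Hypothetical stand-in for readLines(file.choose()): one review per element.
reviews <- c("Doesn't taste good to me.",
             "Most delicious ramen I have ever had. Spicy and tasty. Great price too.",
             "Really good ramen!")

corp <- VCorpus(VectorSource(reviews))
corp <- tm_map(corp, content_transformer(tolower))
corp <- tm_map(corp, removeWords, stopwords("english"))
corp <- tm_map(corp, removePunctuation)  # drops ".", ",", "!" and similar
corp <- tm_map(corp, removeNumbers)
corp <- tm_map(corp, stripWhitespace)    # run last so gaps left by removals collapse

# Keep only the text of each document; the metadata is left behind.
docs <- vapply(corp, function(d) paste(content(d), collapse = " "), character(1))
docs <- trimws(unname(docs))             # no leading/trailing blanks, so no "" tokens

corpusLDA <- lexicalize(docs)
print(corpusLDA$vocab)
```

The custom stop words, the gsub() replacements and stemDocument from the question can be slotted in before the stripWhitespace step; the key change is passing docs, not the corpus object itself, to lexicalize().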
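I'm a beginner and did not fully understand your answer. I had already removed abbreviations and numbers before building the LDA model. Do I need to do it again afterwards? I'd like to provide more details. – yome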