2014-11-03

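How do I convert a single-column R data frame into a tm corpus so that each row becomes a document?

I want to use the findAssocs function from the tm package, but it only works when the corpus contains more than one document. What I have instead is a single-column data frame in which each row holds the text of a tweet. Is it possible to convert it into a corpus that treats each row as a separate document? At the moment I get: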

VCorpus (documents: 1, metadata (corpus/indexed): 0/0) 
TermDocumentMatrix (terms: 71, documents: 1) 

My data has 10 rows, and I would like it to be converted to:

VCorpus (documents: 10, metadata (corpus/indexed): 0/0) 
TermDocumentMatrix (terms: 71, documents: 10) 

Answer

I suggest reading the tm vignette before going any further. The answer to your specific question is below.

Create some sample data:

txt <- strsplit("I wanted to use the findAssocs of the tm package. but it works only when there are more than one documents in the corpus. I have a data frame table which has one column and each row has a tweet text. Is it possible to convert the into a corpus which takes each row as a new document?", split=" ")[[1]] 
data <- data.frame(text=txt, stringsAsFactors=FALSE) 
data[1:5, ] 

Import your data into a "Source", turn the "Source" into a "Corpus", and then build a TDM from the "Corpus":

library(tm) 
tdm <- TermDocumentMatrix(Corpus(DataframeSource(data))) 

show(tdm) 
#A term-document matrix (35 terms, 58 documents) 
# 
#Non-/sparse entries: 43/1987 
#Sparsity   : 98% 
#Maximal term length: 10 
#Weighting   : term frequency (tf) 

str(tdm) 
#List of 6 
# $ i  : int [1:43] 32 31 28 12 28 21 3 35 20 33 ... 
# $ j  : int [1:43] 2 4 5 6 8 10 11 13 14 15 ... 
# $ v  : num [1:43] 1 1 1 1 1 1 1 1 1 1 ... 
# $ nrow : int 35 
# $ ncol : int 58 
# $ dimnames:List of 2 
# ..$ Terms: chr [1:35] "and" "are" "but" "column" ... 
# ..$ Docs : chr [1:58] "1" "2" "3" "4" ... 
# - attr(*, "class")= chr [1:2] "TermDocumentMatrix" "simple_triplet_matrix" 
# - attr(*, "Weighting")= chr [1:2] "term frequency" "tf"
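
As a possible alternative, here is a minimal sketch that goes through VectorSource instead of DataframeSource. As far as I know, more recent tm releases expect a data frame passed to DataframeSource to have doc_id and text columns, so a single text column may be rejected there, while VectorSource simply treats each element of a character vector as one document. The tweets data frame and the query term "package" are made up for illustration, and the 0.5 correlation cutoff is arbitrary.

```r
library(tm)

# Hypothetical sample data: one tweet per row in a single text column.
tweets <- data.frame(
  text = c("I love the tm package",
           "the tm package builds a corpus",
           "each tweet becomes a document",
           "love this corpus of tweets"),
  stringsAsFactors = FALSE
)

# VectorSource treats every element of the character vector as one document.
corpus <- VCorpus(VectorSource(tweets$text))
tdm <- TermDocumentMatrix(corpus)

length(corpus)  # number of documents, one per row of the data frame
dim(tdm)        # terms x documents

# Terms whose correlation with "package" reaches the (arbitrary) 0.5 cutoff.
findAssocs(tdm, "package", corlimit = 0.5)
```

With real data you would pass your own text column in place of tweets$text. The number of documents reported for the corpus should then match the number of rows in the data frame.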