tm DocumentTermMatrix gives unexpected results given a corpus

Maybe I'm misunderstanding how tm::DocumentTermMatrix works. I have a corpus which, after preprocessing, looks like this:
head(Description.text, 3)
[1] "azi sanitar local to1 presid osp martin presid ospedalier martin tofan torin tel possibil raggiung ospedal segu bus tram"
[2] "torin croll controsoffitt repart pediatr martin mag cartell compars sest pian ospedal martin torin ospedal tofan sol due anno riapertur"
[3] "ospedal martin croll controsoffitt repart pediatr mag ospedal martin croll controsoffitt repart pediatr distacc intonac avven nott mattin"
I process it with:
Description.text.features <- DocumentTermMatrix(Corpus(VectorSource(Description.text)), list(
    bounds = list(local = c(3, Inf)),
    tokenize = 'scan'
))
When I inspect the first row of the DTM, I get this:
inspect(Description.text.features[1,])
<<DocumentTermMatrix (documents: 1, terms: 887)>>
Non-/sparse entries: 0/887
Sparsity           : 100%
Maximal term length: 15
Weighting          : term frequency (tf)
Sample             :
    Terms
Docs banc camill mar martin ospedal presid san sanitar torin vittor
   1    0      0   0      0       0      0   0       0     0      0
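The terms shown do not correspond to the first document in the corpus Description.text: for example, banc and camill are not in the first document, and there is a zero for terms such as martin and presid, which are.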
Furthermore, if I run:
Description.text.features[1,] %>% as.matrix() %>% sum
I get zero, which shows that the first document has no terms with a frequency > 0!
What is going on here?
Thanks
UPDATE
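I wrote my own "corpus to DTM" function, and it gives very different results. Besides the document-term weights being very different from those of tm::DocumentTermMatrix (mine are what I would expect given the corpus), my function keeps many more terms than the tm function (~3000 versus ~800 for tm).
Here is my function: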
corpus.to.DTM <- function(corpus, min.doc.freq = 3, minlength = 3, weight.fun = weightTfIdf) {
    library(dplyr)
    library(magrittr)
    library(stringr)  # for str_length()
    library(tm)
    library(parallel)

    # Vocabulary: terms that occur in at least min.doc.freq documents
    # and are at least minlength characters long
    lvls <- mclapply(corpus, function(doc) words(doc) %>% unique, mc.cores = 8) %>%
        unlist %>%
        table %>%
        data.frame %>%
        set_colnames(c('term', 'freq')) %>%
        mutate(term = as.character(term), lengths = str_length(term)) %>%
        filter(freq >= min.doc.freq & lengths >= minlength) %>%
        use_series(term)

    # One row of raw term counts per document, restricted to the vocabulary
    dtm <- mclapply(corpus, function(doc) factor(words(doc), levels = lvls) %>% table %>% as.vector, mc.cores = 8) %>%
        do.call(what = 'rbind') %>%
        set_colnames(lvls)

    # Apply the weighting passed in weight.fun and return a plain data frame
    as.DocumentTermMatrix(dtm, weighting = weight.fun) %>%
        as.matrix() %>%
        as.data.frame()
}
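
For reference, here is a minimal base-R sketch of the two counts that are easy to mix up here: how often a term occurs within one document, and in how many documents it occurs. corpus.to.DTM filters on the second one (min.doc.freq). The tm documentation describes a `local` bound on the first and a `global` bound on the second. The sketch does not call tm at all, it uses only the three sample documents shown above, and the names docs, tokens, within_doc_freq and doc_freq are hypothetical.

```r
# The three preprocessed sample documents from the top of the post
docs <- c(
  "azi sanitar local to1 presid osp martin presid ospedalier martin tofan torin tel possibil raggiung ospedal segu bus tram",
  "torin croll controsoffitt repart pediatr martin mag cartell compars sest pian ospedal martin torin ospedal tofan sol due anno riapertur",
  "ospedal martin croll controsoffitt repart pediatr mag ospedal martin croll controsoffitt repart pediatr distacc intonac avven nott mattin"
)

# Split on single spaces (an assumption: the text is already cleaned)
tokens <- strsplit(docs, " ", fixed = TRUE)

# Count 1: occurrences of each term inside each single document
within_doc_freq <- lapply(tokens, function(tok) {
  tab <- table(tok)
  setNames(as.vector(tab), names(tab))
})

# Count 2: number of documents in which each term appears at least once
tab <- table(unlist(lapply(tokens, unique)))
doc_freq <- setNames(as.vector(tab), names(tab))

# Largest within-document count over the three samples
max_within <- max(unlist(within_doc_freq))

# Terms that appear in at least 3 documents
terms_in_3_docs <- sort(names(doc_freq)[doc_freq >= 3])

print(max_within)
print(terms_in_3_docs)
```

On these three documents no term occurs three times within a single document, while martin and ospedal each appear in all three documents, so a lower bound of 3 selects nothing under the first count and two terms under the second.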
Thanks for the suggestion! I'll take a look at that package! But my question is specifically about what is going wrong with tm! – Bakaburg