2017-07-28

tm DocumentTermMatrix gives unexpected results given a corpus

Maybe I am misunderstanding how tm::DocumentTermMatrix works. I have a corpus that looks like this after preprocessing:

head(Description.text, 3) 
[1] "azi sanitar local to1 presid osp martin presid ospedalier martin tofan torin tel possibil raggiung ospedal segu bus tram"      
[2] "torin croll controsoffitt repart pediatr martin mag cartell compars sest pian ospedal martin torin ospedal tofan sol due anno riapertur"  
[3] "ospedal martin croll controsoffitt repart pediatr mag ospedal martin croll controsoffitt repart pediatr distacc intonac avven nott mattin" 

I pass it through this step:

Description.text.features <- DocumentTermMatrix(Corpus(VectorSource(Description.text)), list(
    bounds = list(local = c(3, Inf)), 
    tokenize = 'scan' 
)) 

When I inspect the first row of the DTM, I get this:

inspect(Description.text.features[1,]) 
<<DocumentTermMatrix (documents: 1, terms: 887)>> 
Non-/sparse entries: 0/887 
Sparsity   : 100% 
Maximal term length: 15 
Weighting   : term frequency (tf) 
Sample    : 
    Terms 
Docs banc camill mar martin ospedal presid san sanitar torin vittor 
    1 0  0 0  0  0  0 0  0  0  0 

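These terms do not correspond to the first document in the corpus Description.text: for example, banc and camill are not in the first document, while martin and presid, which are in it, show a count of zero.

Moreover, if I run: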

Description.text.features[1,] %>% as.matrix() %>% sum 

I get zero, which means that no term in the first document has a frequency greater than zero!

What is going on here?

Thanks

UPDATE

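I wrote my own "corpus to DTM" function, and it gives very different results. Besides the document-term weights being very different from those of tm::DocumentTermMatrix (mine are what I would expect given the corpus), my function yields many more terms than the tm function (about 3000 versus about 800 for tm).

Here is my function: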

corpus.to.DTM <- function(corpus, min.doc.freq = 3, minlength = 3, weight.fun = weightTfIdf) {
    library(dplyr)
    library(magrittr)
    library(tm)
    library(parallel)

    # Keep terms found in at least min.doc.freq documents and at least minlength characters long
    lvls <- mclapply(corpus, function(doc) words(doc) %>% unique, mc.cores = 8) %>%
     unlist %>%
     table %>%
     data.frame %>%
     set_colnames(c('term', 'freq')) %>%
     mutate(lengths = nchar(as.character(term))) %>%  # base R; str_length() needs stringr, which is not loaded here
     filter(freq >= min.doc.freq & lengths >= minlength) %>%
     use_series(term) %>%
     as.character

    # One row of raw term counts per document, restricted to the kept terms
    dtm <- mclapply(corpus, function(doc) factor(words(doc), levels = lvls) %>% table %>% as.vector, mc.cores = 8) %>%
     do.call(what = 'rbind') %>%
     set_colnames(lvls)

    # Use the weighting passed as weight.fun (it was hard-coded to weightTfIdf before)
    as.DocumentTermMatrix(dtm, weighting = weight.fun) %>%
     as.matrix() %>%
     as.data.frame()
}

1 Answer

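Here is an alternative solution that uses quanteda instead of tm. You may even find that the relative simplicity of quanteda, together with its speed and features, is reason enough to use it for the rest of your analysis!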

description.text <- 
    c("azi sanitar local to1 presid osp martin presid ospedalier martin tofan torin tel possibil raggiung ospedal segu bus tram", 
    "torin croll controsoffitt repart pediatr martin mag cartell compars sest pian ospedal martin torin ospedal tofan sol due anno riapertur", 
    "ospedal martin croll controsoffitt repart pediatr mag ospedal martin croll controsoffitt repart pediatr distacc intonac avven nott mattin") 

require(quanteda) 
require(magrittr) 

qdfm <- dfm(description.text) 
head(qdfm, nfeat = 10) 
# Document-feature matrix of: 3 documents, 35 features (56.2% sparse). 
# (showing first 3 documents and first 10 features) 
#  features 
# docs azi sanitar local to1 presid osp martin ospedalier tofan torin 
# text1 1  1  1 1  2 1  2   1  1  1 
# text2 0  0  0 0  0 0  2   0  1  2 
# text3 0  0  0 0  0 0  2   0  0  0 

qdfm2 <- qdfm %>% dfm_trim(min_count = 3, min_docfreq = 3) 
qdfm2 
# Document-feature matrix of: 3 documents, 2 features (0% sparse). 
# (showing first 3 documents and first 2 features) 
#  features 
# docs martin ospedal 
# text1  2  1 
# text2  2  2 
# text3  2  2 

Converting back to tm:

convert(qdfm2, to = "tm") 
# <<DocumentTermMatrix (documents: 3, terms: 2)>> 
# Non-/sparse entries: 6/0 
# Sparsity   : 0% 
# Maximal term length: 7 
# Weighting   : term frequency (tf) 

In your example, you used tf-idf weighting. That is also easy in quanteda:

dfm_weight(qdfm, "tfidf") %>% head 
# Document-feature matrix of: 3 documents, 35 features (56.2% sparse). 
# (showing first 3 documents and first 6 features) 
#   features 
# docs   azi sanitar  local  to1 presid  osp 
# text1 0.4771213 0.4771213 0.4771213 0.4771213 0.9542425 0.4771213 
# text2 0.0000000 0.0000000 0.0000000 0.0000000 0.0000000 0.0000000 
# text3 0.0000000 0.0000000 0.0000000 0.0000000 0.0000000 0.0000000 
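
The weights printed above can be checked by hand. The following is a minimal sketch in base R, assuming the weight is the raw term count multiplied by log10(number of documents / document frequency), which is consistent with the values shown. The helper name `tfidf_weight` is made up for this illustration and is not part of quanteda or tm.

```r
# Hypothetical helper (not a quanteda or tm function):
# raw count times log10 of the inverse document frequency.
tfidf_weight <- function(tf, df, n_docs) {
  tf * log10(n_docs / df)
}

n_docs <- 3

# "azi" occurs once in text1 and in no other document.
w_azi <- tfidf_weight(tf = 1, df = 1, n_docs = n_docs)

# "presid" occurs twice in text1 and in no other document.
w_presid <- tfidf_weight(tf = 2, df = 1, n_docs = n_docs)

# "martin" occurs in all three documents, so its idf is log10(1) = 0.
w_martin <- tfidf_weight(tf = 2, df = 3, n_docs = n_docs)

print(round(c(azi = w_azi, presid = w_presid, martin = w_martin), 7))
```

Under this assumption, a term that appears in every document gets a weight of zero no matter how often it occurs, which is worth keeping in mind when comparing tf-idf matrices against raw counts.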

Thanks for the suggestion! I will take a look at this package! But my question was specifically about what went wrong with tm! – Bakaburg
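
On the tm side, one likely explanation is the `bounds` option. According to the tm documentation, `bounds$local` discards, within each document, the terms that occur fewer times in that document than the lower bound, whereas `bounds$global` filters on the number of documents that contain a term. With `local = c(3, Inf)`, a document in which no term occurs three times becomes an all-zero row, which would match the output in the question. The terms listed by `inspect()` are columns of the whole matrix, so they need not occur in the selected row. The sketch below reproduces both filters in base R on the three example documents. The variable names are illustrative, and the tm call at the end is a suggestion rather than a verified fix; it runs only if tm is installed.

```r
docs <- c(
  "azi sanitar local to1 presid osp martin presid ospedalier martin tofan torin tel possibil raggiung ospedal segu bus tram",
  "torin croll controsoffitt repart pediatr martin mag cartell compars sest pian ospedal martin torin ospedal tofan sol due anno riapertur",
  "ospedal martin croll controsoffitt repart pediatr mag ospedal martin croll controsoffitt repart pediatr distacc intonac avven nott mattin"
)

# Split each document on single spaces (the texts are already preprocessed).
tokens <- strsplit(docs, " ", fixed = TRUE)

# "Local" filter: keep a term in a document only if it occurs
# at least 3 times in that same document.
term_counts <- lapply(tokens, table)
local_kept <- lapply(term_counts, function(tab) names(tab)[tab >= 3])

# "Global" filter: keep a term only if it occurs in at least 3 documents.
doc_freq <- table(unlist(lapply(tokens, unique)))
global_kept <- names(doc_freq)[doc_freq >= 3]

print(lengths(local_kept))  # number of terms surviving the local filter, per document
print(global_kept)          # terms surviving the global filter

# Assumption: the intended filter was on document frequency, so the
# tm call would use bounds$global instead of bounds$local.
if (requireNamespace("tm", quietly = TRUE)) {
  dtm_global <- tm::DocumentTermMatrix(
    tm::Corpus(tm::VectorSource(docs)),
    control = list(bounds = list(global = c(3, Inf)))
  )
  print(tm::Terms(dtm_global))
}
```

In these three documents no term occurs three times within a single document, so the local filter keeps nothing, while the document-frequency filter keeps only the terms shared by all three documents.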
