Text preprocessing and topic modeling using the text2vec package

I have a large number of documents and want to do topic modeling on them using text2vec and LDA (Gibbs sampling).

The steps I need are (in order):

1. Remove numbers and symbols from the text

library(stringr) 
docs$text <- stringr::str_replace_all(docs$text,"[^[:alpha:]]", " ") 
docs$text <- stringr::str_replace_all(docs$text,"\\s+", " ")
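
As a quick check of what these two replacements do, here is a minimal sketch on two made-up strings. It uses base R `gsub()` with the same patterns so that it runs without extra packages; the sample strings and the final `trimws()` call are assumptions added for illustration and are not part of the code above.

```r
# Toy input (hypothetical strings, only for illustration)
x <- c("Order #123 shipped, 2 items!", "don't panic 24/7")

# Same patterns as above: replace every non-letter with a space ...
clean <- gsub("[^[:alpha:]]", " ", x)
# ... then collapse runs of whitespace into a single space
clean <- gsub("\\s+", " ", clean)
# Added step: drop the leading/trailing space left behind by removed characters
clean <- trimws(clean)

print(clean)
```

Note that apostrophes are removed as well, so "don't" becomes the two tokens "don" and "t"; whether that matters depends on the stop word list used in the next step.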

2. Remove stop words

library(text2vec)   
library(tm) 

stopwords <- c(tm::stopwords("english"),custom_stopwords) 

prep_fun <- tolower 
tok_fun <- word_tokenizer 
tokens <- docs$text%>% 
     prep_fun %>% 
     tok_fun 
it <- itoken(tokens, 
      ids = docs$id, 
      progressbar = FALSE) 

v <- create_vocabulary(it, stopwords = stopwords) %>% 
    prune_vocabulary(term_count_min = 10) 

vectorizer <- vocab_vectorizer(v)

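3. Replace synonyms with the main term

I have an Excel file in which the first column holds the main word and the synonyms are listed in the second, third, ... columns. I want to replace every synonym with its main word (column 1). Each term can have a different number of synonyms. Below is an example of code that does this with the "tm" package (but I am interested in a version for the text2vec package):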

replaceSynonyms <- content_transformer(function(x, syn=NULL) 
     {Reduce(function(a,b) { 
     gsub(paste0("\\b(", paste(b$syns, collapse="|"),")\\b"), b$word,  a, perl = TRUE)}, syn, x) }) 

l <- lapply(as.data.frame(t(Synonyms), stringsAsFactors = FALSE), # 
      function(x) { 
      x <- unname(x) 
      list(word = x[1], syns = x[-1]) 
      }) 
names(l) <- paste0("list", Synonyms[, 1]) 
list2env(l, envir = .GlobalEnv) 

synonyms <- list()   
for (i in 1:length(names(l))) synonyms[i] = l[i] 

MyCorpus <- tm_map(MyCorpus, replaceSynonyms, synonyms)

4. Convert to a document-term matrix

dtm <- create_dtm(it, vectorizer)

5. Apply an LDA model to the document-term matrix

doc_topic_prior <- 0.1 # can be chosen based on data? 
lda_model <- LDA$new(n_topics = 10, 
      doc_topic_prior = doc_topic_prior, topic_word_prior = 0.01) 
doc_topic_distr <- lda_model$fit_transform(dtm, n_iter = 1000, convergence_tol = 0.01, check_convergence_every_n = 10)

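MyCorpus in step 3 is a corpus obtained with the "tm" package. Steps 2 and 3 do not work together, because the output of step 2 is a vocabulary while the input of step 3 is a "tm" corpus.

My first question is how I can do all of these steps with the text2vec package (and compatible packages), since I have found it very efficient; thanks to Dmitriy Selivanov.

Second: how do I set optimal values for the LDA parameters in step 5? Is it possible to set them automatically based on the data?

Thanks to Manuel Bickel for the corrections to my post.

Thanks, Sam

Source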

2017-10-20 Sam S

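Answer updated in response to your comment:

First question: the synonym replacement problem has already been answered here: Replace words in text2vec efficiently. Please check the answer by count. Patterns and replacements may be ngrams (multi-word phrases). Note that the second answer, by Dmitriy Selivanov, uses word_tokenizer() and, in the form presented, does not cover the case of ngram replacement.

Is there a reason you cannot replace synonyms before removing stop words? Usually that order should not cause problems; or do you have an example in which switching the order produces significantly different results? If you really want to replace synonyms after stop word removal, I guess that, when using text2vec, you would have to apply the changes to the dtm. If you do so, you need to allow ngrams in your dtm up to the length of the longest phrase in your synonym table. I have included a workaround in the code below as one option. Note that allowing higher-order ngrams in the dtm produces noise, which may or may not affect your downstream tasks (you can probably prune most of the noise in the vocabulary step). Replacing ngrams earlier, in the text itself, therefore seems to be the better solution.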
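
If the order "stop words first, synonyms second" has to be kept, one option that avoids touching the dtm is to do both steps on the tokens before they reach `itoken()`. The following is only a sketch in base R under stated assumptions: the documents, the stop word list, and the synonym table are made up, whitespace tokenization stands in for `word_tokenizer`, and the synonym phrases are assumed to contain no regex metacharacters.

```r
# Hypothetical toy data, for illustration only
docs <- c("the coffee is hot",
          "the coffee is boiling like the lava",
          "coffee that is frozen")
stop_words <- c("the", "is", "that")
synonyms <- data.frame(mainword = c("warm", "cold", "warm"),
                       synonym  = c("hot", "frozen", "boiling like lava"),
                       stringsAsFactors = FALSE)

# 1) tokenize on whitespace and drop stop words
tokens <- strsplit(tolower(docs), "\\s+")
tokens <- lapply(tokens, function(x) x[!x %in% stop_words])

# 2) glue the remaining tokens back together so multi-word phrases can match
texts <- vapply(tokens, paste, character(1), collapse = " ")

# 3) replace synonyms, longest phrase first, matching whole words only
synonyms <- synonyms[order(-nchar(synonyms$synonym)), ]
for (i in seq_len(nrow(synonyms))) {
  pattern <- paste0("\\b", synonyms$synonym[i], "\\b")
  texts <- gsub(pattern, synonyms$mainword[i], texts, perl = TRUE)
}

# 4) split again; a list of token vectors like this can be passed to itoken()
tokens <- strsplit(texts, " ", fixed = TRUE)
print(tokens)
```

In the second document the phrase only matches because "the" has already been removed, which is the situation described in the comments. The cost is that every document is pasted and split once more, so it is worth timing on the real corpus.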

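Second question: you might check the textmineR package, which helps you choose the best number of topics, or the answers to this question: Topic models: cross validation with loglikelihood or perplexity. Regarding the priors, I have not yet figured out in detail how the different packages handle these values, e.g. text2vec (WarpLDA algorithm), lda (collapsed Gibbs sampling algorithm and others), or topicmodels ("standard" Gibbs sampling and the variational expectation-maximization algorithm). As a starting point, you could have a look at the detailed documentation of topicmodels; section 2.2 "Estimation" explains how to estimate the alpha and beta parameters defined in section 2.1 "Model specification".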
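
To make the perplexity criterion mentioned above concrete, here is a minimal base R sketch of the formula, perplexity = exp(-sum(n_dw * log p(w|d)) / sum(n_dw)), where p(w|d) is obtained from the document-topic and topic-word distributions. The function name `perplexity_dense` and the toy matrices are assumptions made for illustration; text2vec exports its own `perplexity()` for sparse matrices, and this dense version is only meant to show what is being computed.

```r
# Perplexity of a topic model on a document-term matrix (dense toy version).
#   dtm   : documents x terms matrix of counts
#   theta : documents x topics matrix, each row sums to 1
#   phi   : topics x terms matrix, each row sums to 1
perplexity_dense <- function(dtm, theta, phi) {
  p <- theta %*% phi                 # p(term | document)
  seen <- dtm > 0                    # only observed terms contribute
  log_lik <- sum(dtm[seen] * log(p[seen]))
  exp(-log_lik / sum(dtm))
}

# Toy check: 3 documents, 4 terms, 2 topics, everything uniform
dtm <- matrix(c(2, 0, 1, 0,
                0, 3, 0, 1,
                1, 1, 1, 1), nrow = 3, byrow = TRUE)
theta <- matrix(1 / 2, nrow = 3, ncol = 2)
phi   <- matrix(1 / 4, nrow = 2, ncol = 4)

pp <- perplexity_dense(dtm, theta, phi)
print(pp)  # a model that is uniform over 4 terms has perplexity 4
```

One way to use this is to fit models for a few candidate values of n_topics (and of the two priors) on a training part of the dtm, compute perplexity on held-out documents, and keep the setting with the lowest value. Lower perplexity does not always mean more interpretable topics, so it is better treated as a guide than as a rule.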

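For learning purposes, please note that your code produced errors at two points, which I have revised: (1) you need to use the correct name for the stop words in create_vocabulary(), i.e. stopwords instead of stop_words, since that is the name you defined; (2) you do not need vocabulary = ... in your LDA model definition; maybe you are using an older version of text2vec?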

library(text2vec) 
library(reshape2) 
library(stringi) 

#function proposed by @count 
#applies gsub once per pattern/replacement pair, in the order given 
mgsub <- function(pattern,replacement,x) { 
    if (length(pattern) != length(replacement)){ 
    stop("pattern and replacement must have the same length") 
    } 
    for (v in seq_along(pattern)) { 
    x <- gsub(pattern[v],replacement[v],x, perl = TRUE) 
    } 
    return(x) 
} 

docs <- c("the coffee is warm", 
      "the coffee is cold", 
      "the coffee is hot", 
      "the coffee is boiling like lava", 
      "the coffee is frozen", 
      "the coffee is perfect", 
      "the coffee is warm almost hot" 
) 

synonyms <- data.frame(mainword = c("warm", "cold") 
         ,syn1 = c("hot", "frozen") 
         ,syn2 = c("boiling like lava", "") 
         ,stringsAsFactors = FALSE) 

synonyms[synonyms == ""] <- NA 

synonyms <- reshape2::melt(synonyms 
          ,id.vars = "mainword" 
          ,value.name = "synonym" 
          ,na.rm = TRUE) 

synonyms <- synonyms[, c("mainword", "synonym")] 


prep_fun <- tolower 
tok_fun <- word_tokenizer 
tokens <- docs %>% 
    #here is where you might replace synonyms directly in the docs 
    #{ mgsub(synonyms[,"synonym"], synonyms[,"mainword"], .) } %>% 
    prep_fun %>% 
    tok_fun 
it <- itoken(tokens, 
      progressbar = FALSE) 

v <- create_vocabulary(it, 
         sep_ngram = "_", 
         ngram = c(ngram_min = 1L 
           #allow for ngrams in dtm 
           #a phrase of n words contains n - 1 spaces, hence the + 1L 
           ,ngram_max = max(stri_count_fixed(unlist(synonyms), " ")) + 1L 
           ) 
) 

vectorizer <- vocab_vectorizer(v) 
dtm <- create_dtm(it, vectorizer) 

#ngrams in dtm 
colnames(dtm) 

#ensure that ngrams in synonym replacement table have the same format as ngrams in dtm 
synonyms <- apply(synonyms, 2, function(x) gsub(" ", "_", x)) 

colnames(dtm) <- mgsub(synonyms[,"synonym"], synonyms[,"mainword"], colnames(dtm)) 


#only zeros/ones in dtm since none of the docs specified in my example 
#contains duplicate terms 
dim(dtm) 
#7 35 
max(dtm) 
#1 

#workaround to aggregate colnames in dtm 
#I think there is no function `colsum` that allows grouping 
#therefore, a workaround based on rowsum 
#not elegant because you have to transpose two times, 
#convert to matrix and reconvert to sparse matrix 
dtm <- 
    Matrix::Matrix(
    t(
     rowsum(t(as.matrix(dtm)), group = colnames(dtm)) 
    ) 
    , sparse = TRUE) 


#synonyms in columns replaced 
dim(dtm) 
#7 28 
max(dtm) 
#2
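
The dense round trip in the workaround above can become expensive for a large dtm. A possible alternative, sketched here under the assumption that the dtm is a sparse matrix from the Matrix package, is to multiply it by a sparse indicator matrix that maps each original column to its (possibly duplicated) column name. The toy `dtm` below is made up and stands in for the real one; only functions from the Matrix package are used.

```r
library(Matrix)

# Toy sparse dtm with a duplicated column name, as produced by renaming
# synonym columns to their main word (hypothetical data)
dtm <- sparseMatrix(i = c(1, 1, 2, 3, 3),
                    j = c(1, 2, 3, 1, 3),
                    x = 1,
                    dims = c(3, 3),
                    dimnames = list(NULL, c("warm", "warm", "cold")))

# Indicator matrix: one row per original column, one column per unique name
groups <- factor(colnames(dtm))
map <- sparseMatrix(i = seq_along(groups),
                    j = as.integer(groups),
                    x = 1,
                    dims = c(ncol(dtm), nlevels(groups)),
                    dimnames = list(NULL, levels(groups)))

# Columns that share a name are summed; the result stays sparse
agg <- dtm %*% map
print(agg)
```

The columns of the result follow the sorted factor levels rather than the original column order, which should not matter for LDA but is worth keeping in mind if other objects are indexed by column position.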

Source

2017-10-20 12:51:54

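Thank you very much for your answer. My data actually contains a large number of spelling errors and abbreviations, including different abbreviations of the same word. The main word is always a single word, but a synonym can be a phrase such as "hot water". I need to remove stop words first (step 2 in my question) and then replace the multiple synonyms with the main word. How can I do these two steps in that order, i.e. remove stop words first and then replace synonyms? I did all of this with the "tm" and "topicmodels" packages, but they are very slow and I want to switch to text2vec.

I realized that part of your question has already been answered elsewhere. I have updated my answer accordingly and included a link to that answer.

Thanks for the update, Manuel. Removing some stop words before building ngrams makes it easier for me to focus on the important ngrams/phrases. For example, several different phrasings of "return to work" are all replaced with one main word. I have a lot of phrases of this type.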
