21

I am trying to find code that can actually find the most frequently occurring two- and three-word phrases with the R text mining (tm) package (or perhaps with another package, I don't know). I have been trying to use a tokenizer, but don't seem to have any luck. How do I search for 2- and 3-word phrases with the R tm package?

If you have dealt with something similar in the past, could you post code that has been tested and actually works? Thank you very much!

+0

Ordered phrases, that is? Or co-occurrences? – 2012-01-17 18:34:02

+0

Both would be useful. Thanks! – appletree 2012-01-17 20:06:01

Answers

3

I created this for a different purpose of my own, but I think it may work for your needs too:

#User Defined Functions 

# Remove leading/trailing whitespace 
Trim <- function(x) gsub("^\\s+|\\s+$", "", x) 

# Split on whitespace and on sentence punctuation 
breaker <- function(x) unlist(strsplit(x, "[[:space:]]|(?=[.!?*-])", perl = TRUE)) 

# Lower-case and strip punctuation (and optionally digits and apostrophes) 
strip <- function(x, digit.remove = TRUE, apostrophe.remove = FALSE){ 
    strp <- function(x, digit.remove, apostrophe.remove){ 
        x2 <- Trim(tolower(gsub(".*?($|'|[^[:punct:]]).*?", "\\1", as.character(x)))) 
        x2 <- if(apostrophe.remove) gsub("'", "", x2) else x2 
        ifelse(digit.remove == TRUE, gsub("[[:digit:]]", "", x2), x2) 
    } 
    unlist(lapply(x, function(x) Trim(strp(x = x, digit.remove = digit.remove, 
        apostrophe.remove = apostrophe.remove)))) 
} 

# Drop empty strings 
unblanker <- function(x) subset(x, nchar(x) > 0) 

#Fake Text Data 
x <- "I like green eggs and ham. They are delicious. They taste so yummy. I'm talking about ham and eggs of course" 

#The code using base R to do what you want 
breaker(x) 
strip(x) 
words <- unblanker(breaker(strip(x))) 
textDF <- as.data.frame(table(words)) 
textDF$characters <- sapply(as.character(textDF$words), nchar) 
textDF2 <- textDF[order(-textDF$characters, textDF$Freq), ] 
rownames(textDF2) <- 1:nrow(textDF2) 
textDF2 
subset(textDF2, characters %in% 2:3) 
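
To get counts of actual two- and three-word phrases from the same words vector, adjacent words can be pasted together and tabulated. A minimal base-R sketch (the ngram_table helper below is hypothetical, not part of the original answer):

# Hypothetical helper: paste each run of n adjacent words into a phrase and count it. 
# Note: sentence boundaries are ignored here, since punctuation was stripped above. 
ngram_table <- function(words, n) { 
    if (length(words) < n) return(table(character(0))) 
    phrases <- vapply(seq_len(length(words) - n + 1), 
        function(i) paste(words[i:(i + n - 1)], collapse = " "), 
        character(1)) 
    sort(table(phrases), decreasing = TRUE) 
} 

ngram_table(words, 2)  # most frequent two-word phrases 
ngram_table(words, 3)  # most frequent three-word phrases 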
+0

Hi @Tyler-Rinker, I know it has been a few years now, but when testing your code I get this error: 'Error in FUN(c("", "", ... : could not find function "Trim"' – jessi 2015-02-24 18:32:25

+0

Added 'Trim' in case that helps – 2015-02-24 18:46:14

+0

Ha, @Tyler_Rinker. I had a function called 'trim', but I didn't realize that this was what it was looking for. Thanks! – jessi 2015-02-26 15:40:21

11

You can pass a custom tokenizing function to tm's DocumentTermMatrix function, so if you have the tau package installed it is fairly simple.

library(tm); library(tau) 

# Tokenizer built on tau::textcnt; returns the n-gram strings found in x 
tokenize_ngrams <- function(x, n = 3) 
    return(rownames(as.data.frame(unclass(textcnt(x, method = "string", n = n))))) 

texts <- c("This is the first document.", "This is the second file.", "This is the third text.") 
corpus <- Corpus(VectorSource(texts)) 
matrix <- DocumentTermMatrix(corpus, control = list(tokenize = tokenize_ngrams)) 

The n in the tokenize_ngrams function is the number of words per phrase. This functionality is also implemented in the RTextTools package, which simplifies things further.

library(RTextTools) 
texts <- c("This is the first document.", "This is the second file.", "This is the third text.") 
matrix <- create_matrix(texts,ngramLength=3) 

This returns an object of class DocumentTermMatrix for use with package tm.
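
To actually list the most frequent two- and three-word phrases from either matrix, the term frequencies can be summed across documents and sorted; a minimal sketch (assuming the matrix object created above):

# Sum the frequency of each phrase across all documents and sort in descending order 
freqs <- sort(colSums(as.matrix(matrix)), decreasing = TRUE) 
head(freqs, 10)  # the ten most frequent phrases 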

+4

I realize this is a rather old thread, but has anyone tried this recently? In my hands, the first method gives the following error: 'matrix <- DocumentTermMatrix(corpus, control = list(tokenize = tokenize_ngrams)): Error in simple_triplet_matrix(i = i, j = j, v = as.numeric(v), nrow = length(allTerms), : 'i, j, v' different lengths. In addition, warning messages: in simple_triplet_matrix(i = i, j = j, v = as.numeric(v), nrow = length(allTerms), : errors encountered in user code; NAs introduced by coercion.' – NumerousHats 2015-01-23 01:12:00

+2

I get the same error as @MAndrecPhD when trying the 'library(RTextTools)' example. – jessi 2015-02-24 18:25:23

+0

I have the same problem. I've seen some people suggest that the SnowballC package would solve it, but it doesn't for me. Any suggestions? – Marius 2015-07-26 11:43:19

7

This is part 5 of the FAQ of the tm package:

5. Can I use bigrams instead of single tokens in a term-document matrix?

Yes. RWeka provides a tokenizer for arbitrary n-grams which can be directly passed on to the term-document matrix constructor. E.g.:

library("RWeka") 
    library("tm") 

    data("crude") 

    BigramTokenizer <- function(x) NGramTokenizer(x, Weka_control(min = 2, max = 2)) 
    tdm <- TermDocumentMatrix(crude, control = list(tokenize = BigramTokenizer)) 

    inspect(tdm[340:345,1:10]) 
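
If RWeka is not an option, the same bigram tokenizer can be written with the ngrams() and words() helpers from the NLP package that is attached along with tm (this is what the newer version of the FAQ, linked in the comment below, recommends); a minimal sketch:

library("tm")   # NLP, which provides ngrams() and words(), is attached with tm 

data("crude") 

# Paste every pair of adjacent words into a bigram string 
BigramTokenizer <- function(x) 
    unlist(lapply(ngrams(words(x), 2), paste, collapse = " "), use.names = FALSE) 

tdm <- TermDocumentMatrix(crude, control = list(tokenize = BigramTokenizer)) 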
+1

This did the trick for me. In fact, the current version of the FAQ has a solution that doesn't require RWeka: http://tm.r-forge.r-project.org/faq.html#Bigrams – Tripartio 2016-04-30 15:05:58

1

I ran into a similar problem using the tm and ngram packages. After debugging mclapply, I saw that it was failing on some documents with the error below:

input 'x' has nwords=1 and n=2; must have nwords >= n 

The problem was with documents of fewer than 2 words, so I added a filter to remove documents with a low word count:

myCorpus.3 <- tm_filter(myCorpus.2, function (x) { 
     length(unlist(strsplit(stringr::str_trim(x$content), '[[:blank:]]+'))) > 1 
    }) 

My tokenizer function then looks like:

bigramTokenizer <- function(x) { 
    x <- as.character(x) 

    # Find words 
    one.list <- c() 
    tryCatch({ 
        one.gram <- ngram::ngram(x, n = 1) 
        one.list <- ngram::get.ngrams(one.gram) 
    }, 
    error = function(cond) { warning(cond) }) 

    # Find 2-grams 
    two.list <- c() 
    tryCatch({ 
        two.gram <- ngram::ngram(x, n = 2) 
        two.list <- ngram::get.ngrams(two.gram) 
    }, 
    error = function(cond) { warning(cond) }) 

    res <- unlist(c(one.list, two.list)) 
    res[res != ''] 
} 

Then you can test the function with:

dtmTest <- lapply(myCorpus.3, bigramTokenizer) 

And finally:

dtm <- DocumentTermMatrix(myCorpus.3, control = list(tokenize = bigramTokenizer)) 
0

Try this code.

library(tm) 
library(SnowballC) 
library(class) 
library(wordcloud) 

keywords <- read.csv(file.choose(), header = TRUE, na.strings = c("NA", "-", "?")) 
keywords_doc <- Corpus(VectorSource(keywords$"use your column that you need")) 

# Clean the corpus: drop numbers, lower-case, collapse whitespace, drop punctuation, stem 
keywords_doc <- tm_map(keywords_doc, removeNumbers) 
keywords_doc <- tm_map(keywords_doc, content_transformer(tolower)) 
keywords_doc <- tm_map(keywords_doc, stripWhitespace) 
keywords_doc <- tm_map(keywords_doc, removePunctuation) 
keywords_doc <- tm_map(keywords_doc, PlainTextDocument) 
keywords_doc <- tm_map(keywords_doc, stemDocument) 

This is the bigram or trigram section you can use:

BigramTokenizer <- function(x) 
unlist(lapply(ngrams(words(x), 2), paste, collapse = " "), use.names = FALSE) 
# create the term-document matrix 
keywords_matrix <- TermDocumentMatrix(keywords_doc, control = list(tokenize = BigramTokenizer)) 

# remove sparse terms 
keywords_naremoval <- removeSparseTerms(keywords_matrix, 0.95) 

# Frequency of the words appearing 
keyword.freq <- rowSums(as.matrix(keywords_naremoval)) 
subsetkeyword.freq <-subset(keyword.freq, keyword.freq >=20) 
frequentKeywordSubsetDF <- data.frame(term = names(subsetkeyword.freq), freq = subsetkeyword.freq) 

# Sorting of the words 
frequentKeywordDF <- data.frame(term = names(keyword.freq), freq = keyword.freq) 
frequentKeywordSubsetDF <- frequentKeywordSubsetDF[with(frequentKeywordSubsetDF, order(-frequentKeywordSubsetDF$freq)), ] 
frequentKeywordDF <- frequentKeywordDF[with(frequentKeywordDF, order(-frequentKeywordDF$freq)), ] 

# Plot the words as a word cloud 
wordcloud(frequentKeywordDF$term, freq=frequentKeywordDF$freq, random.order = FALSE, rot.per=0.35, scale=c(5,0.5), min.freq = 30, colors = brewer.pal(8,"Dark2")) 

Hope this helps. This is the complete code you can use.
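
Since the question also asks about three-word phrases, here is a minimal sketch of the trigram variant of the tokenizer above (same ngrams() and words() helpers, just n = 3; the TrigramTokenizer and keywords_matrix3 names are illustrative, not part of the original answer):

TrigramTokenizer <- function(x) 
    unlist(lapply(ngrams(words(x), 3), paste, collapse = " "), use.names = FALSE) 
keywords_matrix3 <- TermDocumentMatrix(keywords_doc, control = list(tokenize = TrigramTokenizer)) 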

+0

I tried all of the solutions, but none of them worked with my data. I don't know why. Whatever value I pass to the ngrams function (2, 3, 4, etc.), the result is always 1-grams (i.e. single words) – 2017-08-23 17:26:57

1

Try the tidytext package.

library(dplyr) 
library(tidytext) 
library(janeaustenr) 
library(tidyr) 

Suppose I have a data frame CommentData that contains a comment column, and I want to find pairs of words that occur together. Then try:

bigram_filtered <- CommentData %>% 
    unnest_tokens(bigram, Comment, token= "ngrams", n=2) %>% 
    separate(bigram, c("word1","word2"), sep=" ") %>% 
    filter(!word1 %in% stop_words$word, 
     !word2 %in% stop_words$word) %>% 
    count(word1, word2, sort=TRUE) 

The code above creates the tokens, removes stop words that don't help the analysis (such as "in", "a", "to", etc.), and then counts how often each phrase occurs. After that, you would use the unite function to merge the individual words back together and record their occurrences.

bigrams_united <- bigram_filtered %>% 
    unite(bigram, word1, word2, sep=" ") 
bigrams_united 
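
The same pattern extends to three-word phrases by setting n = 3 and separating into three columns; a minimal sketch (assuming the same CommentData data frame and Comment column as above):

trigram_counts <- CommentData %>% 
    unnest_tokens(trigram, Comment, token = "ngrams", n = 3) %>% 
    separate(trigram, c("word1", "word2", "word3"), sep = " ") %>% 
    filter(!word1 %in% stop_words$word, 
        !word2 %in% stop_words$word, 
        !word3 %in% stop_words$word) %>% 
    count(word1, word2, word3, sort = TRUE) 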
2

The corpus library has a function called term_stats that does what you want:

library(corpus) 
corpus <- gutenberg_corpus(55) # Project Gutenberg #55, _The Wizard of Oz_ 
text_filter(corpus)$drop_punct <- TRUE # ignore punctuation 
term_stats(corpus, ngrams = 2:3) 
##    term             count support 
## 1  of the             336       1 
## 2  the scarecrow      208       1 
## 3  to the             185       1 
## 4  and the            166       1 
## 5  said the           152       1 
## 6  in the             147       1 
## 7  the lion           141       1 
## 8  the tin            123       1 
## 9  the tin woodman    114       1 
## 10 tin woodman        114       1 
## 11 i am                84       1 
## 12 it was              69       1 
## 13 in a                64       1 
## 14 the great           63       1 
## 15 the wicked          61       1 
## 16 wicked witch        60       1 
## 17 at the              59       1 
## 18 the little          59       1 
## 19 the wicked witch    58       1 
## 20 back to             57       1 
## ⋮  (52511 rows total) 

Here, count is the number of appearances and support is the number of documents containing the term.
