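Finding 2- and 3-word phrases using the R tm package

I am trying to find code that actually works for finding the most frequently used two- and three-word phrases with the R text mining package tm (maybe there is another package for this that I do not know of). I have been trying to use the tokenizer, but have had no luck.

If you have dealt with a similar situation in the past, could you post code that is tested and actually works? Thank you very much!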
I wrote this myself for a different purpose, but I think it may fit your needs too:
# User-defined functions
Trim <- function(x) gsub("^\\s+|\\s+$", "", x)
breaker <- function(x) unlist(strsplit(x, "[[:space:]]|(?=[.!?*-])", perl = TRUE))
strip <- function(x, digit.remove = TRUE, apostrophe.remove = FALSE) {
  strp <- function(x, digit.remove, apostrophe.remove) {
    x2 <- Trim(tolower(gsub(".*?($|'|[^[:punct:]]).*?", "\\1", as.character(x))))
    x2 <- if (apostrophe.remove) gsub("'", "", x2) else x2
    ifelse(digit.remove == TRUE, gsub("[[:digit:]]", "", x2), x2)
  }
  unlist(lapply(x, function(x) Trim(strp(x = x, digit.remove = digit.remove,
                                         apostrophe.remove = apostrophe.remove))))
}
unblanker <- function(x) subset(x, nchar(x) > 0)
# Fake text data
x <- "I like green eggs and ham. They are delicious. They taste so yummy. I'm talking about ham and eggs of course"
# The code, using base R, to do what you want
breaker(x)
strip(x)
words <- unblanker(breaker(strip(x)))
textDF <- as.data.frame(table(words))
textDF$characters <- sapply(as.character(textDF$words), nchar)
textDF2 <- textDF[order(-textDF$characters, textDF$Freq), ]
rownames(textDF2) <- 1:nrow(textDF2)
textDF2
# Note: this keeps words that are 2 or 3 characters long
subset(textDF2, characters %in% 2:3)
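The answer above tabulates single words and filters them by character length. For counting two- and three-word phrases in base R, a minimal sketch could look like the following. The helper name `make_ngrams` is hypothetical and not part of the answer, and the sketch assumes that phrases may span sentence boundaries, since punctuation is simply treated as a separator.

```r
# Hypothetical helper (not from the answer above): split text into lower-case
# words and paste every run of n consecutive words into one phrase.
make_ngrams <- function(text, n = 2) {
  # Letters, digits and apostrophes form words; anything else is a separator.
  words <- unlist(strsplit(tolower(text), "[^a-z0-9']+"))
  words <- words[nchar(words) > 0]
  if (length(words) < n) return(character(0))
  starts <- seq_len(length(words) - n + 1)
  vapply(starts, function(i) paste(words[i:(i + n - 1)], collapse = " "),
         character(1))
}

x <- "I like green eggs and ham. They are delicious. They taste so yummy. I'm talking about ham and eggs of course"

# Frequency tables, most frequent phrases first
bigram_freq  <- sort(table(make_ngrams(x, 2)), decreasing = TRUE)
trigram_freq <- sort(table(make_ngrams(x, 3)), decreasing = TRUE)
head(bigram_freq)
head(trigram_freq)
```

To keep phrases inside sentences, one option would be to split the text on sentence-ending punctuation first and apply the helper to each sentence separately.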
You can pass a custom tokenizing function to tm's DocumentTermMatrix function, so if you have the package tau installed it is fairly straightforward.
library(tm); library(tau);
tokenize_ngrams <- function(x, n=3) return(rownames(as.data.frame(unclass(textcnt(x,method="string",n=n)))))
texts <- c("This is the first document.", "This is the second file.", "This is the third text.")
corpus <- Corpus(VectorSource(texts))
matrix <- DocumentTermMatrix(corpus,control=list(tokenize=tokenize_ngrams))
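where n in the tokenize_ngrams function is the number of words per phrase. This feature is also implemented in the package RTextTools, which simplifies things further.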
library(RTextTools)
texts <- c("This is the first document.", "This is the second file.", "This is the third text.")
matrix <- create_matrix(texts,ngramLength=3)
This returns an object of class DocumentTermMatrix for use with the package tm.
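To make it concrete what a document-term matrix of trigrams contains, here is a rough base-R sketch that builds one by hand for the same three texts. It does not use tm, tau, or RTextTools, the helper `word_ngrams` is a hypothetical name, and the result is a plain integer matrix rather than a DocumentTermMatrix object.

```r
texts <- c("This is the first document.",
           "This is the second file.",
           "This is the third text.")

# Hypothetical helper: word n-grams of a single string, base R only.
word_ngrams <- function(text, n = 3) {
  w <- unlist(strsplit(tolower(text), "[^a-z0-9']+"))
  w <- w[nchar(w) > 0]
  if (length(w) < n) return(character(0))
  vapply(seq_len(length(w) - n + 1),
         function(i) paste(w[i:(i + n - 1)], collapse = " "), character(1))
}

grams <- lapply(texts, word_ngrams, n = 3)
terms <- sort(unique(unlist(grams)))

# Rows are documents, columns are trigrams, cells are counts.
dtm <- t(vapply(grams,
                function(g) as.integer(table(factor(g, levels = terms))),
                integer(length(terms))))
dimnames(dtm) <- list(Docs = as.character(seq_along(texts)), Terms = terms)
dtm
colSums(dtm)  # total frequency of each trigram across all documents
```

Here "this is the" is the only trigram shared by all three texts, so it is the only column whose total exceeds 1.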
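I realize this is a fairly old thread, but has anyone tried this recently? In my hands, the first method gives the following error: `> matrix <- DocumentTermMatrix(corpus, control = list(tokenize = tokenize_ngrams))` `Error in simple_triplet_matrix(i = i, j = j, v = as.numeric(v), nrow = length(allTerms), : 'i, j, v' different lengths`, along with warning messages about errors encountered in user code and, in `simple_triplet_matrix(i = i, j = j, v = as.numeric(v), nrow = length(allTerms), :`, NAs introduced by coercion. – NumerousHats 2015-01-23 01:12:00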
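I get the same error when trying the `library(RTextTools)` example, @MAndrecPhD. – jessi 2015-02-24 18:25:23
I have the same problem. I have seen some people suggest that the SnowballC package solves it, but it does not work for me. Any suggestions? – Marius 2015-07-26 11:43:19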
5. Can I use bigrams instead of single tokens in a term-document matrix?
Yes. RWeka provides a tokenizer for arbitrary n-grams which can be directly passed on to the term-document matrix constructor. E.g.:
library("RWeka")
library("tm")
data("crude")
BigramTokenizer <- function(x) NGramTokenizer(x, Weka_control(min = 2, max = 2))
tdm <- TermDocumentMatrix(crude, control = list(tokenize = BigramTokenizer))
inspect(tdm[340:345,1:10])
This did the trick for me. In fact, the current version of the FAQ has a solution that does not require RWeka: http://tm.r-forge.r-project.org/faq.html#Bigrams – Tripartio 2016-04-30 15:05:58
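I had a similar problem using the tm and ngram packages. After debugging mclapply, I saw that it failed with the following error:
input 'x' has nwords=1 and n=2; must have nwords >= n
The problem was documents with fewer than 2 words, so I added a filter to remove documents with a low word count: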
myCorpus.3 <- tm_filter(myCorpus.2, function (x) {
length(unlist(strsplit(stringr::str_trim(x$content), '[[:blank:]]+'))) > 1
})
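The same word-count check can be tried outside tm on a plain character vector. This is only a sketch with made-up example documents; `word_count` is a hypothetical helper, and `n` stands for the n-gram size the tokenizer will be asked for.

```r
docs <- c("single", "two words", "   ", "this one has five words")

# Count whitespace-separated words after trimming leading/trailing blanks.
word_count <- function(s) {
  s <- trimws(s)
  if (!nzchar(s)) return(0L)
  length(strsplit(s, "[[:blank:]]+")[[1]])
}

n <- 2  # n-gram size; documents with fewer than n words cannot yield an n-gram
counts <- vapply(docs, word_count, integer(1), USE.NAMES = FALSE)
kept <- docs[counts >= n]
kept
```

Empty and one-word documents are dropped, which is the case that triggers the `nwords >= n` error quoted above.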
Then my tokenizer function looks like this:
bigramTokenizer <- function(x) {
  x <- as.character(x)
  # Find words (1-grams); on error, warn and keep the empty default
  one.list <- c()
  tryCatch({
    one.gram <- ngram::ngram(x, n = 1)
    one.list <- ngram::get.ngrams(one.gram)
  },
  error = function(cond) { warning(cond) })
  # Find 2-grams; on error, warn and keep the empty default
  two.list <- c()
  tryCatch({
    two.gram <- ngram::ngram(x, n = 2)
    two.list <- ngram::get.ngrams(two.gram)
  },
  error = function(cond) { warning(cond) })
  # Combine both lists and drop empty strings
  res <- unlist(c(one.list, two.list))
  res[res != '']
}
Then you can test the function with:
dtmTest <- lapply(myCorpus.3, bigramTokenizer)
Finally:
dtm <- DocumentTermMatrix(myCorpus.3, control = list(tokenize = bigramTokenizer))
Try this code.
library(tm)
library(SnowballC)
library(class)
library(wordcloud)
keywords <- read.csv(file.choose(), header = TRUE, na.strings=c("NA","-","?"))
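# VCorpus rather than Corpus: recent tm versions build a SimpleCorpus from a
# VectorSource, and custom tokenizers are ignored for a SimpleCorpus
keywords_doc <- VCorpus(VectorSource(keywords$"use your column that you need"))
keywords_doc <- tm_map(keywords_doc, removeNumbers)
keywords_doc <- tm_map(keywords_doc, content_transformer(tolower))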
keywords_doc <- tm_map(keywords_doc, stripWhitespace)
keywords_doc <- tm_map(keywords_doc, removePunctuation)
keywords_doc <- tm_map(keywords_doc, PlainTextDocument)
keywords_doc <- tm_map(keywords_doc, stemDocument)
This is the bigram or trigram section; you can use
BigramTokenizer <- function(x)
unlist(lapply(ngrams(words(x), 2), paste, collapse = " "), use.names = FALSE)
# creating of document matrix
keywords_matrix <- TermDocumentMatrix(keywords_doc, control = list(tokenize = BigramTokenizer))
# remove sparse terms
keywords_naremoval <- removeSparseTerms(keywords_matrix, 0.95)
# Frequency of the words appearing
keyword.freq <- rowSums(as.matrix(keywords_naremoval))
subsetkeyword.freq <-subset(keyword.freq, keyword.freq >=20)
frequentKeywordSubsetDF <- data.frame(term = names(subsetkeyword.freq), freq = subsetkeyword.freq)
# Sorting of the words
frequentKeywordDF <- data.frame(term = names(keyword.freq), freq = keyword.freq)
frequentKeywordSubsetDF <- frequentKeywordSubsetDF[with(frequentKeywordSubsetDF, order(-frequentKeywordSubsetDF$freq)), ]
frequentKeywordDF <- frequentKeywordDF[with(frequentKeywordDF, order(-frequentKeywordDF$freq)), ]
# Printing of the words
wordcloud(frequentKeywordDF$term, freq=frequentKeywordDF$freq, random.order = FALSE, rot.per=0.35, scale=c(5,0.5), min.freq = 30, colors = brewer.pal(8,"Dark2"))
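Hope this helps. This is the complete code you can use.
I tried all of the solutions, but none of them works with my data, and I do not know why. No matter what value I pass to the ngrams function (2, 3, 4, etc.), the result is always 1-grams (i.e., single words). – 2017-08-23 17:26:57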
Try the tidytext package
library(dplyr)
library(tidytext)
library(janeaustenr)
library(tidyr)
Suppose I have a data frame CommentData that contains a Comment column, and I want to find pairs of words that occur together. Then try
bigram_filtered <- CommentData %>%
unnest_tokens(bigram, Comment, token= "ngrams", n=2) %>%
separate(bigram, c("word1","word2"), sep=" ") %>%
filter(!word1 %in% stop_words$word,
!word2 %in% stop_words$word) %>%
count(word1, word2, sort=TRUE)
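The code above creates the tokens, then removes stop words that do not help the analysis (such as the, an, to, etc.), and then counts how many times each word pair occurs. You can then use the unite function to combine the individual words back into a single bigram column alongside its count.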
bigrams_united <- bigram_filtered %>%
unite(bigram, word1, word2, sep=" ")
bigrams_united
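For readers without the tidyverse installed, the same steps (form bigrams, drop pairs containing a stop word, count) can be approximated in base R. This is a sketch under assumptions: the two example comments are made up, and `stop_list` is a tiny illustrative list, not tidytext's `stop_words` data.

```r
comments <- c("the service was really slow",
              "really slow service at the counter")
# Tiny illustrative stop-word list (an assumption, not tidytext's stop_words)
stop_list <- c("the", "was", "at", "a", "of", "to")

# Build adjacent word pairs per comment, before any stop-word removal.
pairs <- do.call(rbind, lapply(comments, function(txt) {
  w <- strsplit(tolower(txt), "[^a-z']+")[[1]]
  w <- w[nzchar(w)]
  if (length(w) < 2) return(NULL)
  data.frame(word1 = head(w, -1), word2 = tail(w, -1),
             stringsAsFactors = FALSE)
}))

# Drop any pair in which either word is a stop word, then count.
keep <- !(pairs$word1 %in% stop_list) & !(pairs$word2 %in% stop_list)
bigram <- paste(pairs$word1[keep], pairs$word2[keep])
bigram_counts <- sort(table(bigram), decreasing = TRUE)
bigram_counts
```

As in the tidytext pipeline, the pairs are formed first and filtered afterwards, so removing a stop word never glues together two words that were not adjacent in the original text.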
The corpus package has a function called term_stats that does what you want:
library(corpus)
corpus <- gutenberg_corpus(55) # Project Gutenberg #55, _The Wizard of Oz_
text_filter(corpus)$drop_punct <- TRUE # ignore punctuation
term_stats(corpus, ngrams = 2:3)
## term count support
## 1 of the 336 1
## 2 the scarecrow 208 1
## 3 to the 185 1
## 4 and the 166 1
## 5 said the 152 1
## 6 in the 147 1
## 7 the lion 141 1
## 8 the tin 123 1
## 9 the tin woodman 114 1
## 10 tin woodman 114 1
## 11 i am 84 1
## 12 it was 69 1
## 13 in a 64 1
## 14 the great 63 1
## 15 the wicked 61 1
## 16 wicked witch 60 1
## 17 at the 59 1
## 18 the little 59 1
## 19 the wicked witch 58 1
## 20 back to 57 1
## ⋮ (52511 rows total)
Here, count is the number of appearances and support is the number of documents containing the term.
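The difference between the two columns shows up only when there are several documents. In the output above, support is 1 throughout, presumably because the corpus holds a single book. The base-R sketch below computes both figures for bigrams over three made-up documents; it does not use the corpus package, and `bigrams_of` is a hypothetical helper.

```r
docs <- c("the tin woodman and the tin woodman",
          "the tin woodman",
          "the scarecrow")

# Hypothetical helper: adjacent word pairs of one document.
bigrams_of <- function(text) {
  w <- strsplit(tolower(text), "[^a-z']+")[[1]]
  w <- w[nzchar(w)]
  if (length(w) < 2) return(character(0))
  paste(head(w, -1), tail(w, -1))
}

per_doc <- lapply(docs, bigrams_of)

# count: every occurrence; support: each document counted at most once.
count   <- table(unlist(per_doc))
support <- table(unlist(lapply(per_doc, unique)))

stats <- data.frame(term    = names(count),
                    count   = as.integer(count),
                    support = as.integer(support[names(count)]),
                    stringsAsFactors = FALSE)
stats[order(-stats$count), ]
```

"the tin" occurs three times in total but in only two documents, so its count is 3 and its support is 2.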
Ordered phrases, that is? Or co-occurrences? – 2012-01-17 18:34:02
Both would be useful. Thank you! – appletree 2012-01-17 20:06:01