21

I am trying to find code that can actually find the most frequently occurring two- and three-word phrases with the R text mining (tm) package (or perhaps with another package, I don't know). I have been trying to use a tokenizer, but don't seem to have any luck. How do I search for 2- and 3-word phrases with the R tm package?

If you have dealt with something similar in the past, could you post code that has been tested and actually works? Thank you very much!

+0

Ordered phrases, that is? Or co-occurrences? – 2012-01-17 18:34:02

+0

Both would be useful. Thanks! – appletree 2012-01-17 20:06:01

Answers

3

I created this for a different purpose of my own, but I think it may work for your needs too:

#User Defined Functions 

# Remove leading/trailing whitespace 
Trim <- function(x) gsub("^\\s+|\\s+$", "", x) 

# Split on whitespace and on sentence punctuation 
breaker <- function(x) unlist(strsplit(x, "[[:space:]]|(?=[.!?*-])", perl = TRUE)) 

# Lower-case and strip punctuation (and optionally digits and apostrophes) 
strip <- function(x, digit.remove = TRUE, apostrophe.remove = FALSE){ 
    strp <- function(x, digit.remove, apostrophe.remove){ 
        x2 <- Trim(tolower(gsub(".*?($|'|[^[:punct:]]).*?", "\\1", as.character(x)))) 
        x2 <- if(apostrophe.remove) gsub("'", "", x2) else x2 
        ifelse(digit.remove == TRUE, gsub("[[:digit:]]", "", x2), x2) 
    } 
    unlist(lapply(x, function(x) Trim(strp(x = x, digit.remove = digit.remove, 
        apostrophe.remove = apostrophe.remove)))) 
} 

# Drop empty strings 
unblanker <- function(x) subset(x, nchar(x) > 0) 

#Fake Text Data 
x <- "I like green eggs and ham. They are delicious. They taste so yummy. I'm talking about ham and eggs of course" 

#The code using base R to do what you want 
breaker(x) 
strip(x) 
words <- unblanker(breaker(strip(x))) 
textDF <- as.data.frame(table(words)) 
textDF$characters <- sapply(as.character(textDF$words), nchar) 
textDF2 <- textDF[order(-textDF$characters, textDF$Freq), ] 
rownames(textDF2) <- 1:nrow(textDF2) 
textDF2 
subset(textDF2, characters %in% 2:3) 
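
To get counts of actual two- and three-word phrases from the same words vector, adjacent words can be pasted together and tabulated. A minimal base-R sketch (the ngram_table helper below is hypothetical, not part of the original answer):

# Hypothetical helper: paste each run of n adjacent words into a phrase and count it. 
# Note: sentence boundaries are ignored here, since punctuation was stripped above. 
ngram_table <- function(words, n) { 
    if (length(words) < n) return(table(character(0))) 
    phrases <- vapply(seq_len(length(words) - n + 1), 
        function(i) paste(words[i:(i + n - 1)], collapse = " "), 
        character(1)) 
    sort(table(phrases), decreasing = TRUE) 
} 

ngram_table(words, 2)  # most frequent two-word phrases 
ngram_table(words, 3)  # most frequent three-word phrases 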
+0

Hi @Tyler-Rinker, I know it has been a few years now, but when testing your code I get this error: 'Error in FUN(c("", "", ... : could not find function "Trim"' – jessi 2015-02-24 18:32:25

+0

Added 'Trim' in case that helps – 2015-02-24 18:46:14

+0

Ha, @Tyler_Rinker. I had a function called 'trim', but I didn't realize that this was what it was looking for. Thanks! – jessi 2015-02-26 15:40:21

11

You can pass a custom tokenizing function to tm's DocumentTermMatrix function, so if you have the tau package installed it is fairly simple.

library(tm); library(tau) 

# Tokenizer built on tau::textcnt; returns the n-gram strings found in x 
tokenize_ngrams <- function(x, n = 3) 
    return(rownames(as.data.frame(unclass(textcnt(x, method = "string", n = n))))) 

texts <- c("This is the first document.", "This is the second file.", "This is the third text.") 
corpus <- Corpus(VectorSource(texts)) 
matrix <- DocumentTermMatrix(corpus, control = list(tokenize = tokenize_ngrams)) 

The n in the tokenize_ngrams function is the number of words per phrase. This functionality is also implemented in the RTextTools package, which simplifies things further.

library(RTextTools) 
texts <- c("This is the first document.", "This is the second file.", "This is the third text.") 
matrix <- create_matrix(texts,ngramLength=3) 

This returns an object of class DocumentTermMatrix for use with package tm.
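
To actually list the most frequent two- and three-word phrases from either matrix, the term frequencies can be summed across documents and sorted; a minimal sketch (assuming the matrix object created above):

# Sum the frequency of each phrase across all documents and sort in descending order 
freqs <- sort(colSums(as.matrix(matrix)), decreasing = TRUE) 
head(freqs, 10)  # the ten most frequent phrases 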

+4

I realize this is a rather old thread, but has anyone tried this recently? In my hands, the first method gives the following error: 'matrix <- DocumentTermMatrix(corpus, control = list(tokenize = tokenize_ngrams)): Error in simple_triplet_matrix(i = i, j = j, v = as.numeric(v), nrow = length(allTerms), : 'i, j, v' different lengths. In addition, warning messages: in simple_triplet_matrix(i = i, j = j, v = as.numeric(v), nrow = length(allTerms), : errors encountered in user code; NAs introduced by coercion.' – NumerousHats 2015-01-23 01:12:00

+2

I get the same error as @MAndrecPhD when trying the 'library(RTextTools)' example. – jessi 2015-02-24 18:25:23

+0

I have the same problem. I've seen some people suggest that the SnowballC package would solve it, but it doesn't for me. Any suggestions? – Marius 2015-07-26 11:43:19

7

This is part 5 of the FAQ of the tm package:

5. Can I use bigrams instead of single tokens in a term-document matrix?

Yes. RWeka provides a tokenizer for arbitrary n-grams which can be directly passed on to the term-document matrix constructor. E.g.:

library("RWeka") 
    library("tm") 

    data("crude") 

    BigramTokenizer <- function(x) NGramTokenizer(x, Weka_control(min = 2, max = 2)) 
    tdm <- TermDocumentMatrix(crude, control = list(tokenize = BigramTokenizer)) 

    inspect(tdm[340:345,1:10]) 
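
If RWeka is not an option, the same bigram tokenizer can be written with the ngrams() and words() helpers from the NLP package that is attached along with tm (this is what the newer version of the FAQ, linked in the comment below, recommends); a minimal sketch:

library("tm")   # NLP, which provides ngrams() and words(), is attached with tm 

data("crude") 

# Paste every pair of adjacent words into a bigram string 
BigramTokenizer <- function(x) 
    unlist(lapply(ngrams(words(x), 2), paste, collapse = " "), use.names = FALSE) 

tdm <- TermDocumentMatrix(crude, control = list(tokenize = BigramTokenizer)) 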
+1

This did the trick for me. In fact, the current version of the FAQ has a solution that doesn't require RWeka: http://tm.r-forge.r-project.org/faq.html#Bigrams – Tripartio 2016-04-30 15:05:58

1

I ran into a similar problem using the tm and ngram packages. After debugging mclapply, I saw that it was failing on some documents with the error below:

input 'x' has nwords=1 and n=2; must have nwords >= n 

The problem was with documents of fewer than 2 words, so I added a filter to remove documents with a low word count:

myCorpus.3 <- tm_filter(myCorpus.2, function (x) { 
     length(unlist(strsplit(stringr::str_trim(x$content), '[[:blank:]]+'))) > 1 
    }) 

My tokenizer function then looks like:

bigramTokenizer <- function(x) { 
    x <- as.character(x) 

    # Find words 
    one.list <- c() 
    tryCatch({ 
        one.gram <- ngram::ngram(x, n = 1) 
        one.list <- ngram::get.ngrams(one.gram) 
    }, 
    error = function(cond) { warning(cond) }) 

    # Find 2-grams 
    two.list <- c() 
    tryCatch({ 
        two.gram <- ngram::ngram(x, n = 2) 
        two.list <- ngram::get.ngrams(two.gram) 
    }, 
    error = function(cond) { warning(cond) }) 

    res <- unlist(c(one.list, two.list)) 
    res[res != ''] 
} 

Then you can test the function with:

dtmTest <- lapply(myCorpus.3, bigramTokenizer) 

And finally:

dtm <- DocumentTermMatrix(myCorpus.3, control = list(tokenize = bigramTokenizer)) 
0

Try this code.

library(tm) 
library(SnowballC) 
library(class) 
library(wordcloud) 

keywords <- read.csv(file.choose(), header = TRUE, na.strings = c("NA", "-", "?")) 
keywords_doc <- Corpus(VectorSource(keywords$"use your column that you need")) 

# Clean the corpus: drop numbers, lower-case, collapse whitespace, drop punctuation, stem 
keywords_doc <- tm_map(keywords_doc, removeNumbers) 
keywords_doc <- tm_map(keywords_doc, content_transformer(tolower)) 
keywords_doc <- tm_map(keywords_doc, stripWhitespace) 
keywords_doc <- tm_map(keywords_doc, removePunctuation) 
keywords_doc <- tm_map(keywords_doc, PlainTextDocument) 
keywords_doc <- tm_map(keywords_doc, stemDocument) 

This is the bigram or trigram section you can use:

BigramTokenizer <- function(x) 
unlist(lapply(ngrams(words(x), 2), paste, collapse = " "), use.names = FALSE) 
# create the term-document matrix 
keywords_matrix <- TermDocumentMatrix(keywords_doc, control = list(tokenize = BigramTokenizer)) 

# remove sparse terms 
keywords_naremoval <- removeSparseTerms(keywords_matrix, 0.95) 

# Frequency of the words appearing 
keyword.freq <- rowSums(as.matrix(keywords_naremoval)) 
subsetkeyword.freq <-subset(keyword.freq, keyword.freq >=20) 
frequentKeywordSubsetDF <- data.frame(term = names(subsetkeyword.freq), freq = subsetkeyword.freq) 

# Sorting of the words 
frequentKeywordDF <- data.frame(term = names(keyword.freq), freq = keyword.freq) 
frequentKeywordSubsetDF <- frequentKeywordSubsetDF[with(frequentKeywordSubsetDF, order(-frequentKeywordSubsetDF$freq)), ] 
frequentKeywordDF <- frequentKeywordDF[with(frequentKeywordDF, order(-frequentKeywordDF$freq)), ] 

# Plot the words as a word cloud 
wordcloud(frequentKeywordDF$term, freq=frequentKeywordDF$freq, random.order = FALSE, rot.per=0.35, scale=c(5,0.5), min.freq = 30, colors = brewer.pal(8,"Dark2")) 

Hope this helps. This is the complete code you can use.
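
Since the question also asks about three-word phrases, here is a minimal sketch of the trigram variant of the tokenizer above (same ngrams() and words() helpers, just n = 3; the TrigramTokenizer and keywords_matrix3 names are illustrative, not part of the original answer):

TrigramTokenizer <- function(x) 
    unlist(lapply(ngrams(words(x), 3), paste, collapse = " "), use.names = FALSE) 
keywords_matrix3 <- TermDocumentMatrix(keywords_doc, control = list(tokenize = TrigramTokenizer)) 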

+0

I tried all of the solutions, but none of them worked with my data. I don't know why. Whatever value I pass to the ngrams function (2, 3, 4, etc.), the result is always 1-grams (i.e. single words) – 2017-08-23 17:26:57

1

Try the tidytext package.

library(dplyr) 
library(tidytext) 
library(janeaustenr) 
library(tidyr) 

Suppose I have a data frame CommentData that contains a comment column, and I want to find pairs of words that occur together. Then try:

bigram_filtered <- CommentData %>% 
    unnest_tokens(bigram, Comment, token= "ngrams", n=2) %>% 
    separate(bigram, c("word1","word2"), sep=" ") %>% 
    filter(!word1 %in% stop_words$word, 
     !word2 %in% stop_words$word) %>% 
    count(word1, word2, sort=TRUE) 

The code above creates the tokens, removes stop words that don't help the analysis (such as "in", "a", "to", etc.), and then counts how often each phrase occurs. After that, you would use the unite function to merge the individual words back together and record their occurrences.

bigrams_united <- bigram_filtered %>% 
    unite(bigram, word1, word2, sep=" ") 
bigrams_united 
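
The same pattern extends to three-word phrases by setting n = 3 and separating into three columns; a minimal sketch (assuming the same CommentData data frame and Comment column as above):

trigram_counts <- CommentData %>% 
    unnest_tokens(trigram, Comment, token = "ngrams", n = 3) %>% 
    separate(trigram, c("word1", "word2", "word3"), sep = " ") %>% 
    filter(!word1 %in% stop_words$word, 
        !word2 %in% stop_words$word, 
        !word3 %in% stop_words$word) %>% 
    count(word1, word2, word3, sort = TRUE) 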
2

The corpus library has a function called term_stats that does what you want:

library(corpus) 
corpus <- gutenberg_corpus(55) # Project Gutenberg #55, _The Wizard of Oz_ 
text_filter(corpus)$drop_punct <- TRUE # ignore punctuation 
term_stats(corpus, ngrams = 2:3) 
##    term             count support 
## 1  of the             336       1 
## 2  the scarecrow      208       1 
## 3  to the             185       1 
## 4  and the            166       1 
## 5  said the           152       1 
## 6  in the             147       1 
## 7  the lion           141       1 
## 8  the tin            123       1 
## 9  the tin woodman    114       1 
## 10 tin woodman        114       1 
## 11 i am                84       1 
## 12 it was              69       1 
## 13 in a                64       1 
## 14 the great           63       1 
## 15 the wicked          61       1 
## 16 wicked witch        60       1 
## 17 at the              59       1 
## 18 the little          59       1 
## 19 the wicked witch    58       1 
## 20 back to             57       1 
## ⋮  (52511 rows total) 

Here, count is the number of appearances and support is the number of documents containing the term.
