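Regular expression error message - "Out of memory"
I have been playing around with sentiment analysis in R and keep running into an error raised when running the gsub function. The positive and negative word lists were taken from here.
After some Googling, I found one mention of this error on the R-help list, but nowhere else. Has anyone run into this problem? What exactly is going on? Is there a workaround?
I have run similar code on strings in the past (using gsub and the stringr package), and this is the first time I have encountered this type of error. I also tried to reproduce the error by writing a similar script against a different set of strings, and that worked fine.
Here is the error message: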
> pos_match <- str_c(vpos, collapse = "|")
> neg_match <- str_c(vneg, collapse = "|")
> dat$positive <- as.numeric(str_detect(dat$Comment, pos_match))
> dat$negative <- as.numeric(str_detect(dat$Comment, neg_match))
Error: invalid regular expression, reason 'Out of memory'
Here is the whole script:
## SET WORKING DIRECTORY AND IMPORT PACKAGES:
setwd("~/Desktop/R_Tricks")
require(tm); require(stringr); require(lubridate); library(RTextTools)
# IMPORT DATA:
d1 <- read.csv("Video_Comments.csv", stringsAsFactors=FALSE, sep=",", fileEncoding="ISO_8859-2")
pos <- read.csv("positive-words.csv", stringsAsFactors=FALSE, header=TRUE, fileEncoding="ISO_8859-2")
neg <- read.csv("negative-words.csv", stringsAsFactors=FALSE, header=TRUE, fileEncoding="ISO_8859-2")
vpos = as.vector(pos[,1]); vneg = as.vector(neg[,1])
head(vpos); head(vneg)
colnames(d1); nrow(d1); ncol(d1)
str(d1); head(d1)
table(d1$Likes); table(d1$Replies)
nrow(vpos); nrow(vneg)
length(vpos); length(vneg)
is.atomic(vpos); is.atomic(vneg)
# SELECT DATA:
dat = data.frame(Comment=c(d1$Comment))
head(dat)
# CLEAN DATA - COMMENTS:
dat$Comment = gsub('[[:punct:]]', '', dat$Comment)
dat$Comment = gsub('[[:cntrl:]]', '', dat$Comment)
dat$Comment = gsub('\\d+', '', dat$Comment)
dat$Comment = tolower(dat$Comment)
head(dat)
# CLEAN DATA - CLASSIFICATIONS:
vpos = gsub('[[:punct:]]', '', vpos); vneg = gsub('[[:punct:]]', '', vneg)
vpos = gsub('[[:cntrl:]]', '', vpos); vneg = gsub('[[:cntrl:]]', '', vneg)
vpos = gsub('\\d+', '', vpos); vneg = gsub('\\d+', '', vneg)
vpos = tolower(vpos); vneg = tolower(vneg)
head(vpos); head(vneg)
# MATCH WORDS WITH FACEBOOK COMMENTS:
pos_match <- str_c(vpos, collapse = "|")
neg_match <- str_c(vneg, collapse = "|")
dat$positive <- as.numeric(str_detect(dat$Comment, pos_match))
dat$negative <- as.numeric(str_detect(dat$Comment, neg_match))
Edit:
I have also received another error message:
> dat$negative <- as.numeric(str_detect(dat$Comment, neg_match))
Error: invalid regular expression 'faced|faces|abnormal|abolish|abominable|abominably|abominate|abomination|abort|aborted|
Edit 2:
Data to reproduce the error:
dat = c("Hey guys I am Aliza Lomez...18 y.o. I need your likes please like my page and find love quotes, beauty tips and much more.Please like my page you will never regret thank u all\u0083 <3 <3 <3...",
"Alexandra Saturn", "And that's what makes a Subaru a Subaru", "Missouri in a battleground....; meanwhile in southern California....", "What the Frisbee", "very cool !!!!", "Get a life",
"Try that with my GT!!!", "Did he make any money?", "Wo! WO! BSMITH THROWING DISCS WITH SUBARUS?!?! THIS IS SO AWESOME! SHOULD OF USED AN STI THO")
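Are you creating a match with ~6000 "OR" operators ("|")? `pos_match <- str_c(vpos, collapse = "|")` – zx8754 2014-10-06 14:01:40
I am not answering your question because it is not reproducible, but you may want to look at the [`polarity` function](http://trinker.github.io/qdap_dev/polarity.html) in the `qdap` package. You may be reinventing something that has already been done. – 2014-10-06 14:03:05
The `tm` package also has a `tm.plugin.sentiment` [plugin/package](https://r-forge.r-project.org/R/?group_id=1048), which should work a bit better than building a giant regular expression. – hrbrmstr 2014-10-06 14:05:27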
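Following up on the comments about the size of the pattern, here is a minimal base-R sketch of two ways to avoid compiling one giant alternation. The word lists and comments are small hypothetical stand-ins for `vpos`, `vneg` and `dat$Comment`, and `clean`, `has_word` and `detect_chunked` are illustrative helper names, not functions from any package. Whether either variant avoids the "Out of memory" error on the real data has not been verified here.

```r
# Hypothetical stand-ins for the real word lists and comments.
vpos <- c("awesome", "cool", "love")
vneg <- c("regret", "abnormal", "abort")
comments <- c("very cool !!!!", "Get a life",
              "you will never regret", "THIS IS SO AWESOME")

# Same cleaning steps as in the question's script.
clean <- function(x) {
  x <- gsub("[[:punct:]]", "", x)
  x <- gsub("[[:cntrl:]]", "", x)
  x <- gsub("\\d+", "", x)
  tolower(x)
}

# Option 1: no regex over the word list at all.
# Split each cleaned comment into words and test set membership.
# Note: this matches whole words only, unlike str_detect on an alternation.
has_word <- function(text, words) {
  tokens <- strsplit(clean(text), "[[:space:]]+")
  vapply(tokens, function(tok) any(tok %in% words), logical(1))
}

# Option 2: keep substring matching, but build several small patterns
# instead of one pattern with thousands of "|" branches.
# chunk_size = 500 is an arbitrary assumption, not a documented limit.
detect_chunked <- function(text, words, chunk_size = 500) {
  words <- words[nzchar(words)]  # drop entries emptied by the cleaning step
  chunks <- split(words, ceiling(seq_along(words) / chunk_size))
  hits <- rep(FALSE, length(text))
  for (chunk in chunks) {
    hits <- hits | grepl(paste(chunk, collapse = "|"), text)
  }
  hits
}

positive <- as.numeric(has_word(comments, vpos))
negative <- as.numeric(has_word(comments, vneg))
positive_chunked <- as.numeric(detect_chunked(clean(comments), vpos, chunk_size = 2))
```

The two options are not equivalent: the token approach would not flag "glove" for the word "love", while the chunked regex would, so the choice depends on which behaviour is wanted for the sentiment counts.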