2017-02-23

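Error in a data frame column in R

I am trying to write a function in R, but I have a problem with the positive.ponderate.polarity column of the subDF data frame: the values are not correct. I think the problem comes from these lines: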

EDIT2:

if(any(unlist(strsplit(as.character(context), " ")) %in% booster_words)) 
       { 
        subDF$positive.ponderate.polarity <- subDF$positive.polarity * 3 
       } 
       else 
       { 
        subDF$positive.ponderate.polarity <- subDF$positive.polarity/3 
       } 

       # calculate the total polarity of the sentence and store in the vector 
       polarity[i] <- sum(subDF$positive.ponderate.polarity) - sum(subDF$negative.polarity) 
    } 

Can you help me?

Thanks

### function to calculate the polarity of sentences 

calcPolarity <- function(sentiment_DF,sentences){ 
    booster_words <- c("more","enough", "a lot", "as") 
    # separate each sentence in words using regular expression 
    # (it returns a list with the words of each sentence) 
    sentencesSplitInWords <- regmatches(sentences,gregexpr("[[:word:]]+",sentences,perl=TRUE)) 

    # pre-allocate the polarity result vector with size = number of sentences 
    polarity <- rep.int(0,length(sentencesSplitInWords)) 

    for(i in 1:length(polarity)){ 
     # get the i-th sentence words 
     wordsOfASentence <- sentencesSplitInWords[[i]] 

     # get the rows of sentiment_DF corresponding to the words in the sentence using match 
     # N.B. if a word occurs twice, there will be two equal rows 
     # (but I think it's correct since in this way you count its polarity twice) 
     subDF <- sentiment_DF[match(wordsOfASentence,sentiment_DF$word,nomatch = 0),] 

     # extract a context of 3 words before the word in the dataframe 

     context <- stringr::str_extract(sentences, "([^\\s]+\\s){3}subDF$word(\\s[^\\s]+){3}") 
     # check there is a words of the context in the booster_words list 
     if(any(unlist(strsplit(as.character(context), " ")) %in% booster_words)) 
       { 
        subDF$positive.ponderate.polarity <- 1.12 
       } 
       else 
       { 
        subDF$positive.ponderate.polarity <- 14 
       } 

       # calculate the total polarity of the sentence and store in the vector 
       polarity[i] <- sum(subDF$positive.ponderate.polarity) - sum(subDF$negative.polarity) 
    } 
    return(polarity) 
} 

Usage:

sentiment_DF <- data.frame(word=c('interesting','boring','pretty'), 
          positive.polarity=c(1,0,1), 
          negative.polarity=c(0,1,0)) 
sentences <- c("The course was interesting, but the professor was so boring.", 
       "stackoverflow is an interesting place with interesting people!") 
result <- calcPolarity(sentiment_DF,sentences) 

Edit

I expect a result data frame like this:

word         positive.polarity  negative.polarity  positive.ponderate.polarity
interesting  1                  0                  1.12
boring       0                  1                  14

because I expect a polarity of (1.12 + 14) - 1 = 15.12 - 1 = 14.12 for the first sentence.

Answers

1

A new answer, since this one is a complete solution:

calcPolarity <- function(sentiment_DF,sentences){ 
    booster_words <- c("more","enough", "a lot", "as", "so") 

    # pre-allocate the polarity result vector with size = number of sentences 
    polarity <- rep.int(0,length(sentences)) 

    # loop per sentence 
    for(i in 1:length(polarity)){ 
    sentence <- sentences[i] 

    # separate each sentence in words using regular expression 
    wordsOfASentence <- unlist(regmatches(sentence,gregexpr("[[:word:]]+",sentence,perl=TRUE))) 

    # get the rows of sentiment_DF corresponding to the words in the sentence using match 
    # N.B. if a word occurs twice, there will be two equal rows 
    # (but I think it's correct since in this way you count its polarity twice) 
    subDF <- sentiment_DF[match(wordsOfASentence,sentiment_DF$word,nomatch = 0),] 


    # Find (number) of matching word. 
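    # Find the sentence words that occur in sentiment_DF. Duplicates are kept on purpose (e.g. "interesting" twice).
    # as.character() makes this work whether sentiment_DF$word is a factor or a character column
    # (data.frame() no longer creates factors by default since R 4.0, so levels() would return NULL there).
    wordOfInterest <- wordsOfASentence[wordsOfASentence %in% as.character(sentiment_DF$word)]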
    regexOfInterest <- paste0("([^\\s]+\\s){0,3}", wordOfInterest, "(\\s[^\\s]+){0,3}") 

    # extract a context of 3 words before the word in the dataframe 
    context <- stringr::str_extract(sentence, regexOfInterest) 
    names(context) <- wordOfInterest # Helps in forloop 

    contextValue <- function(context){ 
     ifelse(any(unlist(strsplit(context, " ")) %in% booster_words), 1.12, 14) 
    } 
    subDF$positive.ponderate.polarity <- sapply(context, contextValue) 

    # Debug option 
    print(subDF) 

    # calculate the total polarity of the sentence and store in the vector 
    polarity[i] <- sum(subDF$positive.ponderate.polarity) - sum(subDF$negative.polarity) 

    } 
    return(polarity) 
} 

sentiment_DF <- data.frame(word=c('interesting','boring','pretty'), 
          positive.polarity=c(1,0,1), 
          negative.polarity=c(0,1,0)) 
sentences <- c("The course was interesting, but the professor was so boring.", 
       "stackoverflow is an interesting place with interesting people!") 
result <- calcPolarity(sentiment_DF,sentences) 

This now prints the table discussed above. Comment out the debug option (print(subDF)) once you no longer need it.

> result 
[1] 14.12 28.00 
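
The regexOfInterest in the function above uses {0,3} where the question used {3}. A minimal sketch of why that matters, assuming base R's regexpr/regmatches as a stand-in for stringr::str_extract so that it runs without extra packages (the helper first_match and the sample sentence are made up for illustration): with {3}, a word that has fewer than three words in front of it gets no context at all.

```r
# Hypothetical example sentence: only one word precedes "interesting"
sentence <- "An interesting place."

strict  <- "([^\\s]+\\s){3}interesting"    # exactly three words before
relaxed <- "([^\\s]+\\s){0,3}interesting"  # up to three words before

# Return the first match of `pattern` in `text`, or NA when there is none
first_match <- function(pattern, text) {
  m <- regexpr(pattern, text, perl = TRUE)
  if (m == -1) NA_character_ else regmatches(text, m)
}

strict_result  <- first_match(strict, sentence)   # no match, so NA
relaxed_result <- first_match(relaxed, sentence)  # the word plus what precedes it

print(strict_result)
print(relaxed_result)
```

An NA context never contains a booster word, so with {3} such a word would silently fall into the "no booster" branch.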

OK, thank you Jeremy, I will test it tomorrow :) – Poisson

I wish you success with it. :-) – Jeremy

Thank you very much Jeremy, and sorry for keeping you busy with my problem. Cheers – Poisson

1

What values do you expect? I copied your example and got:

> result 

[1] 27 28 

A guess out of the blue: I find it odd that subDF$positive.ponderate.polarity <- 14 is so high compared to 1.12. Did you mean 1.4?

EDIT1
Something goes wrong in this line:

context <- stringr::str_extract(sentences, "([^\\s]+\\s){3}subDF$word(\\s[^\\s]+){3}")

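In general, R takes the subDF$word inside the regular expression literally, because it is part of a quoted string. Try paste0("([^\\s]+\\s){3}",subDF$word,"(\\s[^\\s]+){3}") instead, which yields a vector of patterns (of length 2 here).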
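
A minimal sketch of that point, again assuming base R's regexpr/regmatches in place of stringr::str_extract so it runs without extra packages (the names extract_first, words, literal_pattern and built_patterns are made up for illustration). Inside quotes, subDF$word is only a run of characters, and the $ is read as the regex end-of-string anchor, so the pattern can never match; paste0 inserts the actual words and returns one pattern per word.

```r
sentence <- "The course was interesting, but the professor was so boring."
words <- c("interesting", "boring")   # stand-in for subDF$word

# Return the first match of `pattern` in `text`, or NA when there is none
extract_first <- function(pattern, text) {
  m <- regexpr(pattern, text, perl = TRUE)
  if (m == -1) NA_character_ else regmatches(text, m)
}

# The variable name is part of the string, so it is never substituted
literal_pattern <- "([^\\s]+\\s){3}subDF$word"
literal_result <- extract_first(literal_pattern, sentence)

# paste0 is vectorised: it builds one pattern per word
built_patterns <- paste0("([^\\s]+\\s){3}", words)
built_result <- vapply(built_patterns, extract_first, character(1),
                       text = sentence, USE.NAMES = FALSE)

print(literal_result)
print(built_result)
```

The literal pattern yields NA, while each built pattern returns the word together with the three words in front of it.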

While debugging that expression, I wondered what the last part, (\\s[^\\s]+){3}, is supposed to do. You only want the three preceding words, right?

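EDIT2: You have two vectors: (a) the list of regular expressions to match and (b) the sentences themselves. EDIT1 solves problem (a). Use lapply to solve problem (b).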

# extract a context of 3 words before the word in the dataframe 
contexter <- function(sentence){ 
    stringr::str_extract(sentence, paste0("([^\\s]+\\s){3}",subDF$word)) 
} 
context <- lapply(sentences, contexter) 

EDIT3: This is work in progress, but it should get you closer to what you want:

# Add a booster word occurring in sentences at all 
booster_words <- c("more","enough", "a lot", "as", "so") 

# extract a context of 3 words before the word in the dataframe 
contexter <- function(sentence){ 
    context <- stringr::str_extract(sentence, paste0("([^\\s]+\\s){3}",subDF$word)) 

    # check there is a words of the context in the booster_words list 
    if(any(unlist(strsplit(context, " ")) %in% booster_words)) 
    { 
    subDF$positive.ponderate.polarity <- 1.12 
    } 
    else 
    { 
    subDF$positive.ponderate.polarity <- 14 
    } 

    return(subDF) 
} 

polarity <- lapply(sentences, contexter) 

This returns:

> polarity 
[[1]] 
         word positive.polarity negative.polarity positive.ponderate.polarity 
1 interesting                 1                 0                        1.12 
2      boring                 0                 1                        1.12 

[[2]] 
         word positive.polarity negative.polarity positive.ponderate.polarity 
1 interesting                 1                 0                       14.00 
2      boring                 0                 1                       14.00 
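
To get from this list of data frames to one polarity score per sentence, as the last step of the question's function does, something like the following sketch could be used. It rebuilds the two data frames from the printed output instead of rerunning contexter; make_df, polarity_list and score are names made up for illustration.

```r
# Rebuild the two data frames printed above (values copied from that output)
make_df <- function(ponderate) {
  data.frame(word = c("interesting", "boring"),
             positive.polarity = c(1, 0),
             negative.polarity = c(0, 1),
             positive.ponderate.polarity = ponderate)
}
polarity_list <- list(make_df(1.12), make_df(14))

# One score per sentence: weighted positives minus negatives
score <- function(df) {
  sum(df$positive.ponderate.polarity) - sum(df$negative.polarity)
}
scores <- vapply(polarity_list, score, numeric(1))
print(scores)
```

The scores still differ from the expected 14.12 for the first sentence, because this intermediate version reuses the same subDF for every sentence and assigns a single value to all of its rows rather than one value per word.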