Counting POS tags by column

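I am trying to count the part-of-speech (POS) tags in every row and summarise them.

So far I have two outputs:

1) The/DT question/NN was/VBD ,/, what/WP are/VBP you/PRP going/VBG to/TO cut/VB ?/.
2) ("DT", "NN", "VBD", ",", "WP", "VBP", "PRP", "VBG", "TO", "VB", ".")

For this particular example, the desired output is: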

      DT NN VBD WP VBP PRP VBG TO VB
1 doc  1  1   1  1   1   1   1  1  1

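However, since I want to build this for a whole column of a data frame, I also want to see 0 values in the columns that correspond to POS tags not used in a given sentence.

Example: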

1 doc = "The/DT question/NN was/VBD ,/, what/WP are/VBP you/PRP going/VBG to/TO cut/VB ?/."

2 doc = "Response/NN ?/."

Output:

      DT NN VBD WP VBP PRP VBG TO VB
1 doc  1  1   1  1   1   1   1  1  1
2 doc  0  1   0  0   0   0   0  0  0

What I am doing now:

library(stringr)
# Split into sentences on newline characters

s <- unlist(lapply(df$sentence, function(x) str_split(x, "\n")))

library(NLP)
library(openNLP)

# Tag one string; returns the tagged text and the vector of POS tags
tagPOS <- function(x, ...) {
  s <- as.String(x)
  word_token_annotator <- Maxent_Word_Token_Annotator()
  a2 <- Annotation(1L, "sentence", 1L, nchar(s))
  a2 <- annotate(s, word_token_annotator, a2)
  a3 <- annotate(s, Maxent_POS_Tag_Annotator(), a2)
  a3w <- a3[a3$type == "word"]
  POStags <- unlist(lapply(a3w$features, `[[`, "POS"))
  POStagged <- paste(sprintf("%s/%s", s[a3w], POStags), collapse = " ")
  list(POStagged = POStagged, POStags = POStags)
}

result <- lapply(s, tagPOS)
result <- as.data.frame(do.call(rbind, result))

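This is how I got the output described at the beginning.

I have tried to count the occurrences like this: occurrence <- as.data.frame(table(unlist(result$POStags)))

But that counts the occurrences over the whole data frame. I need to add new columns to the existing data frame and count the occurrences found in the first column, row by row.

Can anyone help me? :(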


2017-05-08 ZverArt

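Well done on adding the desired output and what you have tried so far, but could you also provide a sample of your 'df'? –

You might also look at 'tm::TermDocumentMatrix', using your POS tags in place of the actual words in the documents to build the matrix. –

I had the same idea about tm. I will try it later today. Thanks! About df: 'The question was, what are you going to cut? It is completely out of control. I support clean coal technology.' This is what I have in df – ZverArt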

Relatively pain-free:

Dummy data

library(tm)
df <- data.frame(ID = c("doc1", "doc2"),
                 tags = c("NN",
                          paste("DT", "NN", "VBD", ",", "WP", "VBP", "PRP", "VBG", "TO", "VB", ".")),
                 stringsAsFactors = FALSE)  # keep the tags as character strings

Make the corpus and the DocumentTermMatrix:

corpus <- Corpus(VectorSource(df$tags)) 
#default minimum wordlength is 3, so make sure you change this 
dtm <- DocumentTermMatrix(corpus, control= list(wordLengths=c(1,Inf))) 

#see what you've done 
inspect(dtm) 

<<DocumentTermMatrix (documents: 2, terms: 9)>> 
Non-/sparse entries: 10/8 
Sparsity   : 44% 
Maximal term length: 3 
Weighting   : term frequency (tf) 
Sample    : 
    Terms 
Docs dt nn prp to vb vbd vbg vbp wp 
    1 0 1 0 0 0 0 0 0 0 
    2 1 1 1 1 1 1 1 1 1

ETA: if you don't like working with a DTM, you can coerce it to a data frame:

as.data.frame(as.matrix(dtm)) 

    nn dt prp to vb vbd vbg vbp wp 
1 1 0 0 0 0 0 0 0 0 
2 1 1 1 1 1 1 1 1 1

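ETA2: Corpus builds a corpus from the df$tags column only, and VectorSource assumes that each row of the data is one document, so the order of the rows in the data frame df and the order of the documents in the DocumentTermMatrix are the same. That means I can cbind df$ID onto the output data frame. I do this with dplyr because I think it gives the most readable code (read %>% as "and then"):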

library(dplyr)
# bind_cols() (not bind_col) expects data frames, so pass the ID column as one
result <- as.data.frame(as.matrix(dtm)) %>%
  bind_cols(df["ID"])
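For comparison, the same zero-filled table can be sketched in base R without tm. This is only a sketch under assumptions: `pos_tags` is a hypothetical named list holding one character vector of tags per document (the shape of `result$POStags` in the question), and punctuation tags such as "," and "." are dropped by a simple pattern. The trick is that `table()` on a factor with fixed levels reports 0 for tags that are absent.

```r
# Hypothetical input: one character vector of POS tags per document
pos_tags <- list(
  doc1 = c("DT", "NN", "VBD", ",", "WP", "VBP", "PRP", "VBG", "TO", "VB", "."),
  doc2 = c("NN", ".")
)

# Every tag seen in any document, keeping only letter-based tags
# (this drops punctuation tags such as "," and ".")
all_tags <- unique(unlist(pos_tags))
all_tags <- all_tags[grepl("^[A-Z$]+$", all_tags)]

# Count tags against the full set of levels so missing tags show up as 0
count_tags <- function(tags) as.integer(table(factor(tags, levels = all_tags)))
counts <- t(vapply(pos_tags, count_tags, integer(length(all_tags))))
colnames(counts) <- all_tags

# One row per document, one column per tag
tag_counts <- data.frame(ID = names(pos_tags), counts, check.names = FALSE)
```

Unlike the tm route, this keeps the tag names in upper case and in order of first appearance.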


2017-05-08 11:26:22

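Super! It works! Thank you very much! – ZverArt

That is in case you are now thinking "what am I supposed to do with a 'simple_triplet_matrix'"! –

There is also a control option that can be set so that the POS tag names are not converted to lowercase. –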
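The option meant here is presumably `tolower` in the `control` list that tm passes on to `termFreq`. A minimal sketch, reusing the dummy tags from the answer (`dtm_upper` is a name made up for illustration):

```r
library(tm)

# Same dummy tags as in the answer above
tags <- c("NN", "DT NN VBD , WP VBP PRP VBG TO VB .")
corpus <- Corpus(VectorSource(tags))

# tolower = FALSE keeps "NN", "VBD", ... instead of "nn", "vbd", ...
dtm_upper <- DocumentTermMatrix(
  corpus,
  control = list(wordLengths = c(1, Inf), tolower = FALSE)
)

inspect(dtm_upper)
```

Depending on the tm version and tokenizer, the punctuation tags "," and "." may or may not show up as terms, so check the column names before relying on them.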
