Here's one approach, based on a data set of first names:
install.packages("gender")
library(gender)
install_genderdata_package()
sets <- data(package = "genderdata")$results[,"Item"]
data(list = sets, package = "genderdata")
stopwords <- unique(kantrowitz$name)
texts <- as.data.frame(rbind(
  'This text stuff if quite interesting',
  'Where are all the names said Susan',
  'Bob wondered what happened to all the proper nouns'
))
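removeWords <- function(txt, words, n = 30000L) {
  # Cumulative pattern length: each word plus one "|" separator after the first.
  l <- cumsum(nchar(words) + c(0, rep(1, length(words) - 1)))
  # Split the words into batches of roughly n characters so no single regex gets too long.
  # include.lowest = TRUE keeps a one-character first word from falling outside the first bin.
  groups <- cut(l, breaks = seq(1, ceiling(tail(l, 1) / n) * n + 1, by = n), include.lowest = TRUE)
  # Build one word-bounded alternation per batch, then apply each case-insensitively.
  regexes <- sapply(split(words, groups), function(words) sprintf("(*UCP)\\b(%s)\\b", paste(sort(words, decreasing = TRUE), collapse = "|")))
  for (regex in regexes) txt <- gsub(regex, "", txt, perl = TRUE, ignore.case = TRUE)
  return(txt)
}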
removeWords(texts[,1], stopwords)
# [1] "This text stuff if quite interesting"
# [2] "Where are all the names said "
# [3] " wondered what happened to all the proper nouns"
It may need some tweaking for your particular data set.
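One tweak that may be worth trying is sketched below. Because the function matches with `ignore.case = TRUE`, it will also strip ordinary lower-case words that happen to coincide with a listed name. A case-sensitive match avoids that, at the cost of missing names typed in lower case. The short list `nms` and the helper `remove_names` are hypothetical stand-ins for illustration; they are not part of the gender package.

```r
# Hypothetical short name list standing in for unique(kantrowitz$name).
nms <- c("Susan", "Bob", "Will", "Mark")

# Illustrative helper (an assumption, not from any package): remove
# whole-word, case-sensitive matches, then tidy up leftover whitespace.
remove_names <- function(txt, names) {
  regex <- sprintf("\\b(%s)\\b", paste(names, collapse = "|"))
  out <- gsub(regex, "", txt, perl = TRUE)  # gsub is case-sensitive by default
  trimws(gsub("\\s+", " ", out))            # collapse runs of spaces, trim the ends
}

x <- c("Where are all the names said Susan",
       "Bob will mark the proper nouns")
res <- remove_names(x, nms)
print(res)
```

Here "Susan" and "Bob" are removed, while the lower-case "will" and "mark" survive even though "Will" and "Mark" are in the list.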
Another approach could be based on part-of-speech tagging.
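A real tagger needs an external package and a language model, which is not shown here. As a crude stand-in that only illustrates the idea of picking out proper-noun candidates from the text itself rather than from a name list, the sketch below drops capitalized words that do not start a string. This is a capitalization heuristic, not part-of-speech tagging, and the helper `drop_capitalized` is hypothetical.

```r
# Illustrative helper (an assumption, not from any package).
drop_capitalized <- function(txt) {
  # Match whitespace followed by an ASCII-capitalized word. Requiring the
  # preceding whitespace means the first word of each string is never touched.
  out <- gsub("\\s+[A-Z][A-Za-z]*\\b", "", txt, perl = TRUE)
  trimws(out)
}

x <- c("Where are all the names said Susan",
       "Bob wondered what happened to all the proper nouns",
       "Then Susan asked Bob about it")
res <- drop_capitalized(x)
print(res)
```

The second string keeps "Bob" because the name is sentence-initial, and the heuristic would also delete capitalized non-names such as "I" or "Monday". Those gaps are the reason a tagger or a name list is usually the better tool.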
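Unless you have a fixed set of names, this may not be straightforward. You can certainly find lists of common American names online and add them to your stop-word dictionary, but you will never catch every name. –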
For this example: `nms <- c('Susan', 'Bob'); gsub(paste0(nms, collapse = '|'), '', texts$text)` (as @Hack-R said: you need a fixed set of names). – Jaap
Try looking into _named entity extraction_ / _named entity recognition_, which is a fairly broad field. – user2314737