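An efficient way to remove all proper names from a corpus

Working in R, I'm trying to find an efficient way to search through a file of texts and remove or replace all instances of proper names (e.g., Thomas). I assume something already exists to do this, but I have been unable to locate it.

So, in this example, "Susan" and "Bob" would be removed. This is a simplified example; in reality I want this to apply to hundreds of documents, and therefore to a fairly large list of names.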

texts <- as.data.frame(rbind(
    'This text stuff if quite interesting',
    'Where are all the names said Susan',
    'Bob wondered what happened to all the proper nouns'
))
names(texts)[1] <- "text"

Source

2017-01-01 Andrew B

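Unless you have a fixed set of names, this may not be trivial. You can certainly find lists of common American names online and add them to your stopword dictionary, but you will never catch every name. –

For this example: nms <- c('Susan', 'Bob'); gsub(paste0(nms, collapse = '|'), '', texts$text) (as @Hack-R said: you need a fixed set of names). – Jaap

Try looking for _named entity extraction_ / _named entity recognition_; it's a fairly broad field. – user2314737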
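
The one-liner in Jaap's comment matches the names anywhere in the string, including inside longer words. A minimal sketch of the difference, assuming the same fixed name list and using a made-up test sentence; wrapping the alternation in \b word boundaries is an addition here, not part of the comment:

```r
nms <- c("Susan", "Bob")

# Pattern as in the comment: matches the names anywhere, even inside other words
plain <- paste0(nms, collapse = "|")

# Same alternation wrapped in word boundaries, so only whole words match
bounded <- paste0("\\b(", paste0(nms, collapse = "|"), ")\\b")

plain_result   <- gsub(plain,   "", "Bobby met Susan", perl = TRUE)
bounded_result <- gsub(bounded, "", "Bobby met Susan", perl = TRUE)

plain_result    # "by met "
bounded_result  # "Bobby met "
```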

Here's one approach based on a data set of first names:

install.packages("gender") 
library(gender) 
install_genderdata_package() 

sets <- data(package = "genderdata")$results[,"Item"] 
data(list = sets, package = "genderdata") 
stopwords <- unique(kantrowitz$name) 

texts <- as.data.frame(rbind(
    'This text stuff if quite interesting',
    'Where are all the names said Susan',
    'Bob wondered what happened to all the proper nouns'
))

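# Remove every entry of `words` from `txt` as a whole word, ignoring case.
# The word list is split into chunks of about n characters so that no single regex grows too large.
removeWords <- function(txt, words, n = 30000L) {
    # running length of the "|"-joined pattern after each word
    l <- cumsum(nchar(words) + c(0, rep(1, length(words) - 1)))
    # assign each word to a chunk of about n characters
    groups <- cut(l, breaks = seq(1, ceiling(tail(l, 1) / n) * n + 1, by = n))
    # one alternation regex per chunk; (*UCP) makes \b Unicode-aware under PCRE
    regexes <- sapply(split(words, groups), function(words) sprintf("(*UCP)\\b(%s)\\b", paste(sort(words, decreasing = TRUE), collapse = "|")))
    for (regex in regexes) txt <- gsub(regex, "", txt, perl = TRUE, ignore.case = TRUE)
    return(txt)
}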
removeWords(texts[,1], stopwords) 
# [1] "This text stuff if quite interesting"   
# [2] "Where are all the names said "     
# [3] " wondered what happened to all the proper nouns"

It may need some tuning for your specific data set.
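
One kind of tuning that may matter: with ignore.case = TRUE, a name such as "Mark" or "Will" also deletes the ordinary words "mark" and "will", and each removed name leaves stray whitespace behind (visible in the output above). The following is only a sketch of a case-sensitive variant with whitespace cleanup. The helper name remove_names_cs, the sample sentences and the short name list are hypothetical, and it skips the chunking that removeWords uses for very long lists:

```r
# Hypothetical helper, not part of the answer's code
remove_names_cs <- function(txt, names) {
  pattern <- sprintf("\\b(%s)\\b", paste(names, collapse = "|"))
  out <- gsub(pattern, "", txt, perl = TRUE)   # case-sensitive: "Mark" but not "mark"
  out <- gsub("\\s+", " ", out, perl = TRUE)   # collapse runs of whitespace
  trimws(out)                                  # drop leading/trailing whitespace
}

cleaned <- remove_names_cs(
  c("Mark will mark the exam", "Where are all the names said Susan"),
  c("Mark", "Will", "Susan")
)
cleaned
# [1] "will mark the exam"           "Where are all the names said"
```

The trade-off is that names written in lower case or all caps are no longer removed, so which variant fits depends on how the corpus is capitalised.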

Another approach could be based on part-of-speech tagging.

Source
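
The answer does not name a tagger, so the following is only a sketch of what a part-of-speech route could look like, assuming the udpipe package and its pretrained English model. The helper names find_proper_nouns and drop_tokens are hypothetical. How reliably names are tagged PROPN depends on the model and on the text, so treat the result as a candidate list to review rather than a definitive answer:

```r
# Assumes the udpipe package is installed: install.packages("udpipe")

# Hypothetical helper: tokens that the model tags as proper nouns (UPOS "PROPN")
find_proper_nouns <- function(txt, model) {
  ann <- as.data.frame(udpipe::udpipe_annotate(model, x = txt))
  unique(ann$token[ann$upos == "PROPN"])
}

# Hypothetical helper: delete the given tokens where they occur as whole words
drop_tokens <- function(txt, tokens) {
  if (length(tokens) == 0) return(txt)
  # \Q...\E makes PCRE treat each token literally (e.g. a "." inside a token)
  pattern <- sprintf("\\b(%s)\\b", paste0("\\Q", tokens, "\\E", collapse = "|"))
  gsub(pattern, "", txt, perl = TRUE)
}

# Usage (downloads an English model on first run, so left commented out):
# info  <- udpipe::udpipe_download_model(language = "english")
# model <- udpipe::udpipe_load_model(file = info$file_model)
# txt   <- as.character(texts[, 1])
# drop_tokens(txt, find_proper_nouns(txt, model))
```

Unlike a fixed name list, this can pick up names that appear in no dictionary, at the cost of also removing other proper nouns such as places or brands.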

2017-01-01 17:12:25 lukeA
