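How to extract sentences containing specific person names using R

I am using R to extract sentences containing specific person names from text. Here is a sample paragraph: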

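Opposed as a reformer at Tübingen, he accepted a call to the University of Wittenberg by Martin Luther, recommended by his great-uncle Johann Reuchlin. Melanchthon became professor of the Greek language in Wittenberg at the age of 21. He studied the Scripture, especially of Paul, and Evangelical doctrine. He was present at the disputation of Leipzig (1519) as a spectator, but participated by his comments. Johann Eck having attacked his views, Melanchthon replied based on the authority of Scripture in his Defensio contra Johannem Eckium.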

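This short paragraph contains several person names, such as Johann Reuchlin, Melanchthon and Johann Eck. With the help of the openNLP package, three of the names (Martin Luther, Paul and Melanchthon) are correctly extracted and recognized. I have two questions: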

How can I extract the sentences containing these names?

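Since the output of the named entity recognizer is not very promising, suppose I wrap each name in "[[ ]]", such as [[Johann Reuchlin]], [[Melanchthon]]. How can I then extract the sentences containing these name expressions [[A]], [[B]] ...?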

2015-07-21 hui

You can do this with `strsplit` and `grep`. First I made an object `para` holding your paragraph.

toMatch <- c("Martin Luther", "Paul", "Melanchthon") 

unlist(strsplit(para,split="\\."))[grep(paste(toMatch, collapse="|"),unlist(strsplit(para,split="\\.")))] 


> unlist(strsplit(para,split="\\."))[grep(paste(toMatch, collapse="|"),unlist(strsplit(para,split="\\.")))] 
[1] "Opposed as a reformer at Tübingen, he accepted a call to the University of Wittenberg by Martin Luther, recommended by his great-uncle Johann Reuchlin" 
[2] " Melanchthon became professor of the Greek language in Wittenberg at the age of 21"                  
[3] " He studied the Scripture, especially of Paul, and Evangelical doctrine"                    
[4] " Johann Eck having attacked his views, Melanchthon replied based on the authority of Scripture in his Defensio contra Johannem Eckium"

Or, a little cleaner:

sentences<-unlist(strsplit(para,split="\\.")) 
sentences[grep(paste(toMatch, collapse="|"),sentences)]
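For anyone who wants to run the snippets above end to end, here is a minimal self-contained sketch. It assumes `para` holds the sample paragraph from the question as a single string. The lookbehind split pattern is my own variation, not part of the code above: it splits on the whitespace after sentence-ending punctuation, so each sentence keeps its period and has no leading space.

```r
# Sample paragraph from the question, stored as one string (assumed setup).
para <- paste(
  "Opposed as a reformer at Tübingen, he accepted a call to the University of Wittenberg by Martin Luther, recommended by his great-uncle Johann Reuchlin.",
  "Melanchthon became professor of the Greek language in Wittenberg at the age of 21.",
  "He studied the Scripture, especially of Paul, and Evangelical doctrine.",
  "He was present at the disputation of Leipzig (1519) as a spectator, but participated by his comments.",
  "Johann Eck having attacked his views, Melanchthon replied based on the authority of Scripture in his Defensio contra Johannem Eckium."
)

toMatch <- c("Martin Luther", "Paul", "Melanchthon")

# Split on whitespace that follows sentence-ending punctuation.
# The lookbehind keeps the punctuation attached to its sentence.
sentences <- unlist(strsplit(para, split = "(?<=[.!?])\\s+", perl = TRUE))

# Keep every sentence that mentions at least one of the names.
hits <- sentences[grepl(paste(toMatch, collapse = "|"), sentences)]
print(hits)
```

Note that any punctuation-based split will still break on abbreviations such as "Dr. Luther", so treat it as a rough sentence splitter only.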

If you want the sentences for each person as a separate return:

toMatch <- c("Martin Luther", "Paul", "Melanchthon") 
sentences<-unlist(strsplit(para,split="\\.")) 
foo<-function(Match){sentences[grep(Match,sentences)]} 
lapply(toMatch,foo) 

[[1]] 
[1] "Opposed as a reformer at Tübingen, he accepted a call to the University of Wittenberg by Martin Luther, recommended by his great-uncle Johann Reuchlin" 

[[2]] 
[1] " He studied the Scripture, especially of Paul, and Evangelical doctrine" 

[[3]] 
[1] " Melanchthon became professor of the Greek language in Wittenberg at the age of 21"             
[2] " Johann Eck having attacked his views, Melanchthon replied based on the authority of Scripture in his Defensio contra Johannem Eckium"

Edit 3: To include each person's name in the result, do something simple like:

foo<-function(Match){c(Match,sentences[grep(Match,sentences)])}

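Edit 4:

If you want to find sentences that contain several people/places/things (words), just add one more pattern that combines them, such as: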

toMatch <- c("Martin Luther", "Paul", "Melanchthon","(?=.*Melanchthon)(?=.*Scripture)")

and set `perl` to `TRUE`:

foo<-function(Match){c(Match,sentences[grep(Match,sentences,perl = TRUE)])} 


> lapply(toMatch,foo) 
[[1]] 
[1] "Martin Luther"                                   
[2] "Opposed as a reformer at Tübingen, he accepted a call to the University of Wittenberg by Martin Luther, recommended by his great-uncle Johann Reuchlin" 

[[2]] 
[1] "Paul"                 
[2] " He studied the Scripture, especially of Paul, and Evangelical doctrine" 

[[3]] 
[1] "Melanchthon"                               
[2] " Melanchthon became professor of the Greek language in Wittenberg at the age of 21"             
[3] " Johann Eck having attacked his views, Melanchthon replied based on the authority of Scripture in his Defensio contra Johannem Eckium" 

[[4]] 
[1] "(?=.*Melanchthon)(?=.*Scripture)"                          
[2] " Johann Eck having attacked his views, Melanchthon replied based on the authority of Scripture in his Defensio contra Johannem Eckium"
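The lookahead pattern works, but it gets hard to read as the number of terms grows. As a hedged alternative (my own sketch, not part of the code above), the hypothetical helper `containsAll` below checks each term with a fixed-string `grepl` and combines the results with `&`, so terms containing regex metacharacters need no escaping. It assumes `sentences` is a character vector with one sentence per element; a short stand-in is defined here so the block runs on its own.

```r
# Stand-in for the `sentences` vector (assumption: one sentence per element).
sentences <- c(
  "He studied the Scripture, especially of Paul, and Evangelical doctrine",
  "Melanchthon became professor of the Greek language in Wittenberg at the age of 21",
  "Johann Eck having attacked his views, Melanchthon replied based on the authority of Scripture in his Defensio contra Johannem Eckium"
)

# Hypothetical helper: TRUE for each sentence that contains every term.
containsAll <- function(terms, x) {
  # One logical vector per term, matched literally (no regex interpretation).
  perTerm <- lapply(terms, function(term) grepl(term, x, fixed = TRUE))
  # A sentence qualifies only if all per-term vectors are TRUE for it.
  Reduce(`&`, perTerm)
}

both <- sentences[containsAll(c("Melanchthon", "Scripture"), sentences)]
print(both)
```

Only the last stand-in sentence mentions both "Melanchthon" and "Scripture", so it is the only one returned.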

Edit 5: To answer your other question:

Given:

sentenceR<-"Opposed as a reformer at [[Tübingen]], he accepted a call to the University of [[Wittenberg]] by [[Martin Luther]], recommended by his great-uncle [[Johann Reuchlin]]" 

gsub("\\[\\[|\\]\\]", "", regmatches(sentenceR, gregexpr("\\[\\[.*?\\]\\]", sentenceR))[[1]])

will give you the words inside the double brackets.

> gsub("\\[\\[|\\]\\]", "", regmatches(sentenceR, gregexpr("\\[\\[.*?\\]\\]", sentenceR))[[1]]) 
[1] "Tübingen"  "Wittenberg"  "Martin Luther" "Johann Reuchlin"
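To tie this back to the second question, the same `regmatches`/`gregexpr` call can be applied to every sentence at once, and the result used to look up the sentences that mention a given tagged name. This is only a sketch: `tagged` is a made-up two-sentence example in the `[[...]]` markup described in the question, and `sentencesFor` is a hypothetical helper name.

```r
# Assumed input: sentences already split, with names wrapped in [[ ]].
tagged <- c(
  "[[Melanchthon]] became professor of the Greek language in [[Wittenberg]] at the age of 21",
  "[[Johann Eck]] having attacked his views, [[Melanchthon]] replied based on the authority of Scripture"
)

# For each sentence, pull out the [[...]] spans and strip the brackets.
namesPerSentence <- lapply(
  regmatches(tagged, gregexpr("\\[\\[.*?\\]\\]", tagged)),
  function(m) gsub("\\[\\[|\\]\\]", "", m)
)

# Hypothetical helper: all sentences whose tag list includes `name`.
sentencesFor <- function(name) {
  tagged[vapply(namesPerSentence, function(n) name %in% n, logical(1))]
}

print(sentencesFor("Melanchthon"))  # both sentences
print(sentencesFor("Johann Eck"))   # second sentence only
```

Because the lookup compares whole tags rather than substrings, a tag such as [[Johann Eck]] is not confused with [[Johann Reuchlin]].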


2015-07-21 10:56:12

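Many thx, but I noticed that the first and the fourth sentences each contain two person names. If I add names such as "Johann Eck" or "Johann Reuchlin" to "toMatch" and run the code above, I still get four sentences as output. My new question is: how can I get the sentences for each person (with overlaps)? – hui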

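I don't quite understand. Are you asking for a) only the sentences that contain all of the names, or b) a separate return for each individual name (the sentences with Martin Luther in them, then all the sentences with Paul in them, etc.)? –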

@hui let me know if the new code answers your question –

Here is a simpler approach using two packages, quanteda and stringi:

sents <- unlist(quanteda::tokenize(txt, what = "sentence")) 
namesToExtract <- c("Martin Luther", "Paul", "Melanchthon") 
namesFound <- unlist(stringi::stri_extract_all_regex(sents, paste(namesToExtract, collapse = "|"))) 
sentList <- split(sents, list(namesFound)) 

sentList[["Melanchthon"]] 
## [1] "Melanchthon became professor of the Greek language in Wittenberg at the age of 21."             
## [2] "Johann Eck having attacked his views, Melanchthon replied based on the authority of Scripture in his Defensio contra Johannem Eckium." 

sentList 
## $`Martin Luther` 
## [1] "Opposed as a reformer at Tübingen, he accepted a call to the University of Wittenberg by Martin Luther, recommended by his great-uncle Johann Reuchlin." 
## 
## $Melanchthon 
## [1] "Melanchthon became professor of the Greek language in Wittenberg at the age of 21."             
## [2] "Johann Eck having attacked his views, Melanchthon replied based on the authority of Scripture in his Defensio contra Johannem Eckium." 
## 
## $Paul 
## [1] "He studied the Scripture, especially of Paul, and Evangelical doctrine."
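One caveat with splitting on `namesFound`: it relies on each sentence yielding exactly one match, so a sentence that mentions two of the names would throw the grouping off. A minimal base-R sketch that allows such overlaps is shown below. It is my own variation rather than part of this answer, and `sents` here is a shortened stand-in for the sentence vector.

```r
# Shortened stand-in for the sentence vector (assumption: one sentence per element).
sents <- c(
  "He accepted a call to the University of Wittenberg by Martin Luther, recommended by his great-uncle Johann Reuchlin.",
  "Melanchthon became professor of the Greek language in Wittenberg at the age of 21.",
  "Johann Eck having attacked his views, Melanchthon replied based on the authority of Scripture."
)
namesToExtract <- c("Martin Luther", "Johann Reuchlin", "Johann Eck", "Melanchthon")

# Named list: each name maps to every sentence that mentions it,
# so a sentence with two names appears under both.
sentList <- lapply(
  setNames(namesToExtract, namesToExtract),
  function(nm) sents[grepl(nm, sents, fixed = TRUE)]
)

print(sentList[["Melanchthon"]])
print(sentList[["Johann Eck"]])
```

With this layout the first stand-in sentence is listed under both "Martin Luther" and "Johann Reuchlin", which is the overlapping behaviour asked about in the comments.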


2015-07-22 02:25:48

Many thx. I have not used these two packages before, but they seem very handy in this case :) – hui
