2015-07-21 63 views
6

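How to extract sentences containing specific person names using R

I am using R to extract sentences that contain specific person names from text. Here is a sample paragraph: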

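Opposed as a reformer at Tübingen, he accepted a call to the University of Wittenberg by Martin Luther, recommended by his great-uncle Johann Reuchlin. Melanchthon became professor of the Greek language in Wittenberg at the age of 21. He studied the Scripture, especially of Paul, and Evangelical doctrine. He was present at the disputation of Leipzig (1519) as a spectator, but participated by his comments. Johann Eck having attacked his views, Melanchthon replied based on the authority of Scripture in his Defensio contra Johannem Eckium.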

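This short paragraph contains several person names, such as Johann Reuchlin, Melanchthon, and Johann Eck. With the help of the openNLP package, three person names (Martin Luther, Paul, and Melanchthon) are correctly extracted and recognized. I have two questions: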

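  1. How can I extract the sentences that contain these names?
  2. Since the output of the named entity recognizer is not that promising, if I wrap each name in "[[ ]]", as in [[Johann Reuchlin]], [[Melanchthon]], and so on, how can I extract the sentences that contain these name expressions [[A]], [[B]], ...?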

Answers

6
This uses `strsplit` and `grep`. First I made an object `para` that holds your paragraph.

toMatch <- c("Martin Luther", "Paul", "Melanchthon") 

unlist(strsplit(para,split="\\."))[grep(paste(toMatch, collapse="|"),unlist(strsplit(para,split="\\.")))] 


> unlist(strsplit(para,split="\\."))[grep(paste(toMatch, collapse="|"),unlist(strsplit(para,split="\\.")))] 
[1] "Opposed as a reformer at Tübingen, he accepted a call to the University of Wittenberg by Martin Luther, recommended by his great-uncle Johann Reuchlin" 
[2] " Melanchthon became professor of the Greek language in Wittenberg at the age of 21"                  
[3] " He studied the Scripture, especially of Paul, and Evangelical doctrine"                    
[4] " Johann Eck having attacked his views, Melanchthon replied based on the authority of Scripture in his Defensio contra Johannem Eckium"  

Or, a little cleaner:

sentences<-unlist(strsplit(para,split="\\.")) 
sentences[grep(paste(toMatch, collapse="|"),sentences)] 
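
Splitting on `"\\."` drops the final period and leaves a leading space on every sentence but the first. As a minimal sketch, assuming each sentence ends in a period followed by whitespace, a lookbehind split keeps the periods and removes the leading spaces. The shortened `para` below is only an excerpt used for illustration, and abbreviations such as "Dr." would still be split incorrectly.

```r
# Excerpt of the sample paragraph (assumption: sentences end with ". ")
para <- paste(
  "Melanchthon became professor of the Greek language in Wittenberg at the age of 21.",
  "He studied the Scripture, especially of Paul, and Evangelical doctrine.",
  "He was present at the disputation of Leipzig (1519) as a spectator, but participated by his comments.",
  "Johann Eck having attacked his views, Melanchthon replied based on the authority of Scripture in his Defensio contra Johannem Eckium."
)

# Split on whitespace that follows a period; the lookbehind keeps the period
sentences <- unlist(strsplit(para, "(?<=\\.)\\s+", perl = TRUE))

toMatch <- c("Paul", "Melanchthon")

# Keep only the sentences that mention at least one of the names
hits <- sentences[grepl(paste(toMatch, collapse = "|"), sentences)]
print(hits)
```

The sentence about Leipzig mentions neither name, so it is the only one left out of `hits`.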

If you want the sentences for each person as separate returns:

toMatch <- c("Martin Luther", "Paul", "Melanchthon") 
sentences<-unlist(strsplit(para,split="\\.")) 
foo<-function(Match){sentences[grep(Match,sentences)]} 
lapply(toMatch,foo) 

[[1]] 
[1] "Opposed as a reformer at Tübingen, he accepted a call to the University of Wittenberg by Martin Luther, recommended by his great-uncle Johann Reuchlin" 

[[2]] 
[1] " He studied the Scripture, especially of Paul, and Evangelical doctrine" 

[[3]] 
[1] " Melanchthon became professor of the Greek language in Wittenberg at the age of 21"             
[2] " Johann Eck having attacked his views, Melanchthon replied based on the authority of Scripture in his Defensio contra Johannem Eckium" 
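
When several names occur in the same sentence, it can help to label each list element with the name it belongs to, so that a sentence shared by two people visibly appears under both. A minimal sketch, assuming the sentences are already split; the `sentences` vector here is a hand-typed subset of the paragraph, and `byName` is a name chosen for illustration.

```r
# Hand-typed subset of the split sentences (assumption: already split and trimmed)
sentences <- c(
  "Melanchthon became professor of the Greek language in Wittenberg at the age of 21",
  "He studied the Scripture, especially of Paul, and Evangelical doctrine",
  "Johann Eck having attacked his views, Melanchthon replied based on the authority of Scripture in his Defensio contra Johannem Eckium"
)

toMatch <- c("Paul", "Melanchthon", "Johann Eck")

# sapply() over a character vector names the result by its elements;
# simplify = FALSE keeps one list element per name.
# fixed = TRUE treats each name as a literal string, not a regex.
byName <- sapply(
  toMatch,
  function(nm) sentences[grepl(nm, sentences, fixed = TRUE)],
  simplify = FALSE
)

print(byName)
```

The last sentence mentions both Melanchthon and Johann Eck, so it is returned under both names.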

Edit 3: To add each person's name to the output, do something simple such as:

foo<-function(Match){c(Match,sentences[grep(Match,sentences)])} 

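Edit 4:

If you want to find sentences that contain several people/places/things (words) at once, just add a lookahead pattern that combines them, such as: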

toMatch <- c("Martin Luther", "Paul", "Melanchthon","(?=.*Melanchthon)(?=.*Scripture)") 

and set `perl` to `TRUE`:

foo<-function(Match){c(Match,sentences[grep(Match,sentences,perl = TRUE)])} 


> lapply(toMatch,foo) 
[[1]] 
[1] "Martin Luther"                                   
[2] "Opposed as a reformer at Tübingen, he accepted a call to the University of Wittenberg by Martin Luther, recommended by his great-uncle Johann Reuchlin" 

[[2]] 
[1] "Paul"                 
[2] " He studied the Scripture, especially of Paul, and Evangelical doctrine" 

[[3]] 
[1] "Melanchthon"                               
[2] " Melanchthon became professor of the Greek language in Wittenberg at the age of 21"             
[3] " Johann Eck having attacked his views, Melanchthon replied based on the authority of Scripture in his Defensio contra Johannem Eckium" 

[[4]] 
[1] "(?=.*Melanchthon)(?=.*Scripture)"                          
[2] " Johann Eck having attacked his views, Melanchthon replied based on the authority of Scripture in his Defensio contra Johannem Eckium" 
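
If the lookahead syntax feels opaque, the same "all of these words" test can be written by combining one `grepl` call per word. This is only a sketch of an alternative, not a replacement for the pattern above; `containsAll` is a hypothetical helper name, and the `sentences` vector is a hand-typed subset of the paragraph.

```r
# Hand-typed subset of the split sentences (assumption: already split and trimmed)
sentences <- c(
  "Melanchthon became professor of the Greek language in Wittenberg at the age of 21",
  "He studied the Scripture, especially of Paul, and Evangelical doctrine",
  "Johann Eck having attacked his views, Melanchthon replied based on the authority of Scripture in his Defensio contra Johannem Eckium"
)

# Hypothetical helper: TRUE for each element of x that contains every word
containsAll <- function(words, x) {
  Reduce(`&`, lapply(words, function(w) grepl(w, x, fixed = TRUE)))
}

both <- sentences[containsAll(c("Melanchthon", "Scripture"), sentences)]

# Same selection with the lookahead pattern, for comparison
viaLookahead <- sentences[grep("(?=.*Melanchthon)(?=.*Scripture)", sentences, perl = TRUE)]

print(both)
```

Because the words are matched with `fixed = TRUE`, names containing regex metacharacters need no escaping in this variant.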

Edit 5: Answering your other question:

Given:

sentenceR<-"Opposed as a reformer at [[Tübingen]], he accepted a call to the University of [[Wittenberg]] by [[Martin Luther]], recommended by his great-uncle [[Johann Reuchlin]]" 

gsub("\\[\\[|\\]\\]", "", regmatches(sentenceR, gregexpr("\\[\\[.*?\\]\\]", sentenceR))[[1]]) 

will give you the words inside the double brackets.

> gsub("\\[\\[|\\]\\]", "", regmatches(sentenceR, gregexpr("\\[\\[.*?\\]\\]", sentenceR))[[1]]) 
[1] "Tübingen"  "Wittenberg"  "Martin Luther" "Johann Reuchlin" 
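
To go from the tags back to sentences, one option is to extract the tags from every sentence and then group the sentences by tag. A minimal sketch, assuming the text has already been split into sentences and tagged by hand; the `tagged` vector and the names `tags`, `pairs`, and `byTag` are made up for illustration.

```r
# Hypothetical hand-tagged sentences (assumption: already split)
tagged <- c(
  "[[Melanchthon]] became professor of the Greek language in [[Wittenberg]] at the age of 21",
  "He studied the Scripture, especially of Paul, and Evangelical doctrine",
  "[[Johann Eck]] having attacked his views, [[Melanchthon]] replied based on the authority of Scripture in his Defensio contra Johannem Eckium"
)

# All [[...]] expressions per sentence, with the brackets stripped;
# sentences without tags yield character(0)
tags <- regmatches(tagged, gregexpr("\\[\\[.*?\\]\\]", tagged, perl = TRUE))
tags <- lapply(tags, function(m) gsub("\\[\\[|\\]\\]", "", m))

# Sentences that contain at least one tag
taggedSentences <- tagged[lengths(tags) > 0]

# One row per (tag, sentence) pair, then group the sentences by tag
pairs <- data.frame(
  tag = unlist(tags),
  sentence = rep(tagged, lengths(tags)),
  stringsAsFactors = FALSE
)
byTag <- split(pairs$sentence, pairs$tag)

print(byTag[["Melanchthon"]])
```

A sentence that carries two tags shows up under both of them, which is the overlapping behaviour asked about in the comments.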

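Many thanks, but I noticed that the first and fourth sentences each contain two person names. If I add names such as "Johann Eck" or "Johann Reuchlin" to `toMatch` and run the code above, I still get four sentences as output. My new question is: how can I get the sentences for each person (with overlaps)? – hui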


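I don't quite understand. Are you asking for a) only the sentences that contain all of the names, or b) a separate return for each individual name (the sentences with Martin Luther in them, then all the sentences with Paul in them, and so on)? –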


@hui Let me know if the new code answers your question –

2

Here is a simpler approach using two packages, quanteda and stringi:

sents <- unlist(quanteda::tokenize(txt, what = "sentence")) 
namesToExtract <- c("Martin Luther", "Paul", "Melanchthon") 
namesFound <- unlist(stringi::stri_extract_all_regex(sents, paste(namesToExtract, collapse = "|"))) 
sentList <- split(sents, list(namesFound)) 

sentList[["Melanchthon"]] 
## [1] "Melanchthon became professor of the Greek language in Wittenberg at the age of 21."             
## [2] "Johann Eck having attacked his views, Melanchthon replied based on the authority of Scripture in his Defensio contra Johannem Eckium." 

sentList 
## $`Martin Luther` 
## [1] "Opposed as a reformer at Tübingen, he accepted a call to the University of Wittenberg by Martin Luther, recommended by his great-uncle Johann Reuchlin." 
## 
## $Melanchthon 
## [1] "Melanchthon became professor of the Greek language in Wittenberg at the age of 21."             
## [2] "Johann Eck having attacked his views, Melanchthon replied based on the authority of Scripture in his Defensio contra Johannem Eckium." 
## 
## $Paul 
## [1] "He studied the Scripture, especially of Paul, and Evangelical doctrine." 

Many thanks. I have not used these two packages before, but they seem very handy in this case :) – hui