2015-10-13 92 views

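Find names in a txt file

I have a long text in a txt file (T1.txt). I want to find all the (English) names in the file, together with the 2 words preceding and the 2 words following each name. For example, given the following text: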

"Hello world!, my name is Mr. A.B. Morgan (in short) and it is nice to meet you." 
Orange Silver paid 100$ for his gift. 
I'll call Dina H. in two hours. 

I would like to get the following data frame:

> df1
     Before          Name     After
1   name is  A. B. Morgan  in short
2           Orange Silver paid 100$
3 I'll call       Dina H.    in two

This probably needs more detail. Do you have a list of all the names, initials, and salutations, or can other names occur as well? – akrun

Thanks @akrun. The names can follow any naming convention, with or without initials, and I want to support both. They all start with a capital letter. – Avi

It could be difficult if people have names such as Orange, Silver, etc. – akrun

Answer

This is not perfect, and it is not very pretty, but it is a start:

text1 <- c("Hello world!, my name is Mr. A.B. Morgan (in short) and it is nice to meet you.")
text2 <- c("Orange Silver paid 100$ for his gift.")
text3 <- c("I'll call Dina H. in two hours.")

library(stringr)

find_names_and_BA <- function(x) {
  # Capitalized tokens, skipping the first character so that a
  # sentence-initial capital is not picked up as a name
  matches <- str_extract_all(str_sub(x, 2), "[A-Z]\\S+")[[1]]

  # Fall back to the full string when fewer than two tokens were found
  if (length(matches) < 2) {
    matches <- str_extract_all(x, "[A-Z]\\S+")[[1]]
  }

  # Locate the joined name in the text (name_match is used as a regex)
  name_match <- paste(matches, collapse = " ")
  beg_of_match <- str_locate(x, name_match)[1]
  end_of_match <- str_locate(x, name_match)[2]

  # Words up to the first character of the name and from its last character on
  start_words <- str_extract_all(str_sub(x, 1, beg_of_match), "\\w+")[[1]]
  end_words <- str_extract_all(str_sub(x, end_of_match), "\\w+")[[1]]

  # The word nearest the name on each side is a fragment of the name: drop it
  before <- paste(tail(start_words, 3)[1:2], collapse = " ")
  after <- paste(head(end_words, 3)[2:3], collapse = " ")

  data.frame(Before = before, Name = name_match, After = after)
}

dplyr::bind_rows(find_names_and_BA(text1),
                 find_names_and_BA(text2),
                 find_names_and_BA(text3))

# Source: local data frame [3 x 3]
#
#    Before            Name     After
#     (chr)           (chr)     (chr)
# 1 name is Mr. A.B. Morgan  in short
# 2    O NA   Orange Silver  paid 100
# 3 ll call         Dina H. two hours
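
The rough edges in that output ("O NA", "ll call", "two hours") come from working on character positions. One possible way around them, offered only as a sketch, is to split the text on whitespace and treat a run of consecutive capitalized tokens as a name. The function name find_names_context and its parameters n_context and min_run are hypothetical, introduced here for illustration. It assumes that a name is at least two capitalized tokens in a row, so a lone sentence-initial word such as "Hello" or "I'll" is skipped, but a single-word name would be missed too. Salutations such as "Mr." stay part of the name, as in the output above.

```r
# Hypothetical helper: token-based name finder using base R only.
# Assumption: a "name" is a run of >= min_run consecutive capitalized tokens.
find_names_context <- function(x, n_context = 2, min_run = 2) {
  tokens <- strsplit(trimws(x), "\\s+")[[1]]

  # Strip leading brackets/quotes and trailing punctuation other than periods,
  # so that initials such as "H." keep their dot
  bare <- gsub("^[(\"']+|[)\"',;:!?]+$", "", tokens)

  # Find maximal runs of capitalized tokens
  is_cap <- grepl("^[A-Z]", bare)
  runs   <- rle(is_cap)
  ends   <- cumsum(runs$lengths)
  starts <- ends - runs$lengths + 1
  keep   <- which(runs$values & runs$lengths >= min_run)

  # Context words additionally lose any trailing period
  words <- sub("\\.+$", "", bare)

  rows <- lapply(keep, function(i) {
    s <- starts[i]
    e <- ends[i]
    before_idx <- seq_len(s - 1)
    after_idx  <- seq_len(length(tokens) - e) + e
    data.frame(
      Before = paste(tail(words[before_idx], n_context), collapse = " "),
      Name   = paste(bare[s:e], collapse = " "),
      After  = paste(head(words[after_idx], n_context), collapse = " "),
      stringsAsFactors = FALSE
    )
  })

  if (length(rows) == 0) {
    return(data.frame(Before = character(0), Name = character(0),
                      After = character(0), stringsAsFactors = FALSE))
  }
  do.call(rbind, rows)
}

texts <- c(
  "Hello world!, my name is Mr. A.B. Morgan (in short) and it is nice to meet you.",
  "Orange Silver paid 100$ for his gift.",
  "I'll call Dina H. in two hours."
)
# For the file in the question one could instead use
# texts <- readLines("T1.txt"), assuming one sentence per line.

result <- do.call(rbind, lapply(texts, find_names_context))
print(result)
```

On the three sample sentences this should give "name is" / "Mr. A.B. Morgan" / "in short", an empty Before with "Orange Silver" / "paid 100$", and "I'll call" / "Dina H." / "in two". It is still a heuristic: akrun's point about names like Orange or Silver applies, and any two capitalized words in a row will be reported as a name.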