2016-11-01
-1

Separating different name combinations into first and last name using dplyr, tidyr, and regex

Sample data frame:

name <- c("Smith John Michael","Smith, John Michael","Smith John, Michael","Smith-John Michael","Smith-John, Michael") 
df <- data.frame(name) 

df 
       name 
1 Smith John Michael 
2 Smith, John Michael 
3 Smith John, Michael 
4 Smith-John Michael 
5 Smith-John, Michael 

I need to achieve the following output:

    name first.name last.name 
1 Smith John Michael  John  Smith 
2 Smith, John Michael  John  Smith 
3 Smith John, Michael Michael Smith John 
4 Smith-John Michael Michael Smith-John 
5 Smith-John, Michael Michael Smith-John 

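The rules are as follows: if there is a comma in the string, everything before it is the last name, and the first word after the comma is the first name. If there is no comma, the first word is the last name and the second word is the first name. A hyphenated word counts as one word. I would prefer to do this with dplyr and regex, but I will take any solution. Thanks for your help.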


See http://stackoverflow.com/questions/7069076/split-column-at-delimiter-in-data-frame –

Answers

1

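You can get the result you want with strsplit, switching between splitting on "," and on " " depending on whether name contains a comma. Here we define two functions to make the presentation clearer. You could just as well inline the code instead of wrapping it in functions.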

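get.last.name <- function(name) { 
    name <- as.character(name)  # strsplit() needs character input, not a factor
    # Split on "," if a comma is present, otherwise on " ", and keep the first piece
    unlist(lapply(ifelse(grepl(",",name),strsplit(name,","),strsplit(name," ")),`[[`,1))
} 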

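The result of strsplit is a list. lapply(..., `[[`, 1) loops over this list and extracts the first element of each list element, which is the last name.

get.first.name <- function(name) { 
    name <- as.character(name)  # strsplit() needs character input, not a factor
    # Same split as above, but keep the second piece, which holds the first name
    d <- lapply(ifelse(grepl(",",name),strsplit(name,","),strsplit(name," ")),`[[`,2) 
    # Drop a leading space, split on " ", and keep the first word
    unlist(lapply(strsplit(gsub("^ ","",d), " "),`[[`,1))
} 

This function is similar, except that we extract the second element of each list element returned by strsplit, which contains the first name. We then use gsub to remove any leading space and split again on " ", so that the first element of each result is the first name.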
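To make the intermediate steps concrete, here is a small sketch that runs them on two of the sample names, one with a comma and one without. The variable names (x, pieces, last, second, first) are only for illustration and are not part of the functions above.

```r
# Two of the sample names: one with a comma, one without
x <- c("Smith, John Michael", "Smith-John Michael")

# Split on "," where a comma is present, otherwise on " "
pieces <- ifelse(grepl(",", x), strsplit(x, ","), strsplit(x, " "))

# First piece of each list element: the last name
last <- unlist(lapply(pieces, `[[`, 1))

# Second piece: in the comma case it still has a leading space and the middle name
second <- unlist(lapply(pieces, `[[`, 2))

# Trim the leading space, split on " ", and keep the first word: the first name
first <- unlist(lapply(strsplit(gsub("^ ", "", second), " "), `[[`, 1))

print(last)    # "Smith" "Smith-John"
print(second)  # " John Michael" "Michael"
print(first)   # "John" "Michael"
```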

Putting it all together with dplyr:

library(dplyr) 
res <- df %>% mutate(first.name=get.first.name(name), 
        last.name=get.last.name(name)) 

The result is as expected:

print(res) 
##     name first.name last.name 
## 1 Smith John Michael  John  Smith 
## 2 Smith, John Michael  John  Smith 
## 3 Smith John, Michael Michael Smith John 
## 4 Smith-John Michael Michael Smith-John 
## 5 Smith-John, Michael Michael Smith-John 

Data:

df <- structure(list(name = c("Smith John Michael", "Smith, John Michael", 
"Smith John, Michael", "Smith-John Michael", "Smith-John, Michael" 
)), .Names = "name", row.names = c(NA, -5L), class = "data.frame") 
##     name 
##1 Smith John Michael 
##2 Smith, John Michael 
##3 Smith John, Michael 
##4 Smith-John Michael 
##5 Smith-John, Michael 
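Since the question mentions regex, the same rules can probably also be written with sub() instead of strsplit. This is only a sketch under the rules stated in the question. The helper names regex.last.name and regex.first.name are hypothetical and chosen here for illustration.

```r
# Hypothetical helpers (illustrative names): the question's rules written with sub()
regex.last.name <- function(name) {
    name <- as.character(name)
    # With a comma: everything before it. Without: everything before the first space.
    ifelse(grepl(",", name), sub(",.*$", "", name), sub(" .*$", "", name))
}

regex.first.name <- function(name) {
    name <- as.character(name)
    # Remove the last-name part (up to the comma, or the first word) ...
    rest <- ifelse(grepl(",", name),
                   sub("^[^,]*,\\s*", "", name),
                   sub("^\\S+\\s+", "", name))
    # ... then keep only the first remaining word
    sub("\\s.*$", "", rest)
}

df <- data.frame(name = c("Smith John Michael", "Smith, John Michael",
                          "Smith John, Michael", "Smith-John Michael",
                          "Smith-John, Michael"),
                 stringsAsFactors = FALSE)
df$first.name <- regex.first.name(df$name)
df$last.name  <- regex.last.name(df$name)
print(df)
```

Both helpers return plain character vectors, so they should also work inside mutate(first.name = regex.first.name(name), last.name = regex.last.name(name)).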

Thank you. That works well – Eric

0

I don't know that this is any better than aichao's answer, but I gave it a go anyway. It gives the correct output.

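library(dplyr)
library(tidyr)

# Names with a comma: the last name is everything before the comma
df1 <- df %>% 
    filter(grepl(",",name)) %>% 
    separate(name, c("last.name","first.middle.name"), sep = ",", remove=FALSE) %>% 
    mutate(first.middle.name = trimws(first.middle.name)) %>% 
    # fill = "right" avoids a warning when there is no middle name
    separate(first.middle.name, c("first.name","middle.name"), sep=" ", extra="drop", fill="right", remove=TRUE) %>% 
    select(-middle.name) 

# Names without a comma: the first word is the last name, the second the first name
df2 <- df %>% 
    filter(!grepl(",",name)) %>% 
    # extra = "drop" discards the middle name without a warning
    separate(name, c("last.name","first.name"), sep = " ", extra="drop", remove=FALSE) 

# Note: rows with a comma now come first, so the original row order is not kept
df <- rbind(df1,df2) 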
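If keeping the original row order matters, one possible variation is to normalise every name to the comma form first and then run a single pipeline, which avoids the filter and rbind steps. This is only a sketch. The column names key, rest, and other, and the result name res2, are made up for illustration.

```r
library(dplyr)
library(tidyr)

df <- data.frame(name = c("Smith John Michael", "Smith, John Michael",
                          "Smith John, Michael", "Smith-John Michael",
                          "Smith-John, Michael"),
                 stringsAsFactors = FALSE)

res2 <- df %>%
    # No comma: turn the first space into a comma so every row has the same shape
    mutate(key = ifelse(grepl(",", name), name, sub(" ", ",", name))) %>%
    # Everything before the comma is the last name; keep the remainder in one piece
    separate(key, c("last.name", "rest"), sep = ",\\s*", extra = "merge") %>%
    # The first word of the remainder is the first name; anything after it is unused
    separate(rest, c("first.name", "other"), sep = " ", extra = "merge", fill = "right") %>%
    select(name, first.name, last.name)

print(res2)
```

Because no rows are filtered out and re-bound, the rows stay in the order of the input data frame.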