2017-09-14

Compare column strings with the next row

My data is:

Name House Street Apt City Postal Phone 
Bob Joe  954 BLUE DRIVE NA A PLACE Z5K4N2 999-495-6544 
Smith Jane 555 BLUE DRIVE NA A PLACE Z5K4N5 999-435-6172 
Smith Jane 555 BLUE DRIVE NA A PLACE Z5K4N5 999-450-6763 

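I want to compare Name between consecutive rows (the data is dynamic and sorted by House). If the names are equal and the House numbers are equal, I want to concatenate the two corresponding phone numbers and delete the row that was merged away.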

So afterwards it looks like this:

Name House Street  Apt City Postal Phone 
Bob Joe  954 BLUE DRIVE NA A PLACE Z5K4N2 999-495-6544 
Smith Jane 555 BLUE DRIVE NA A PLACE Z5K4N5 999-435-6172 OR 999-450-6763  

I tried:

for(x in 1:nrow(data)) { 

    if(data$Name[x] == data$Name[x+1]) { 
    data$NameDupes <- data$Name[x] } 
} 

and then afterwards using aggregate:

aggregate(Phone ~ Name + Street + City + Postal + Apt + House, data = df, paste, collapse = " OR ") 

and after that, joining the result back onto my original df.

Open to ideas.

Thanks.

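Your `aggregate` code fails because the Apt variable contains NA, so it does not work well as a grouping variable (the formula method of `aggregate` drops rows containing NA by default). To fix this, change those values to "none", 0, or some other value. For example, `df$Apt[is.na(df$Apt)] <- ""`; your last line of code will then combine rows 2 and 3 of your example. – lmo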
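The fix suggested in the comment above can be sketched as follows. This is a minimal example, assuming the question's sample data is loaded into a data frame named `df` with `read.table` (as in the answers below), so that `Apt` is `NA` in every row; the result name `res` is only illustrative.

```r
# Sample data from the question; Apt is read as NA for every row.
df <- read.table(text = "Name House Street Apt City Postal Phone
'Bob Joe' 954 'BLUE DRIVE' NA 'A PLACE' Z5K4N2 '999-495-6544'
'Smith Jane' 555 'BLUE DRIVE' NA 'A PLACE' Z5K4N5 '999-435-6172'
'Smith Jane' 555 'BLUE DRIVE' NA 'A PLACE' Z5K4N5 '999-450-6763'",
header = TRUE, stringsAsFactors = FALSE)

# Replace missing apartment values with an empty string so that
# aggregate() keeps these rows and can group on Apt.
df$Apt[is.na(df$Apt)] <- ""

# Collapse the phone numbers of rows that agree on every other column.
res <- aggregate(Phone ~ Name + Street + City + Postal + Apt + House,
                 data = df, FUN = paste, collapse = " OR ")
res
```

Note that the placeholder changes `Apt` from a logical to a character column, and that `aggregate` returns the columns in the order given in the formula, so they may need reordering afterwards.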

Answers

A solution using dplyr.

library(dplyr) 

dt2 <- dt %>% 
    group_by(House, Street, Apt, City, Postal) %>% 
    summarise(Name = first(Name), Phone = paste(Phone, collapse = " OR ")) %>% 
    ungroup() %>% 
    arrange(desc(House)) %>% 
    select(colnames(dt)) 
dt2 
# A tibble: 2 x 7 
        Name House     Street   Apt    City Postal                        Phone 
       <chr> <int>      <chr> <lgl>   <chr>  <chr>                        <chr> 
1    Bob Joe   954 BLUE DRIVE    NA A PLACE Z5K4N2                 999-495-6544 
2 Smith Jane   555 BLUE DRIVE    NA A PLACE Z5K4N5 999-435-6172 OR 999-450-6763 

DATA

dt <- read.table(text = "Name House Street Apt City Postal Phone 
'Bob Joe'  954 'BLUE DRIVE' NA 'A PLACE' Z5K4N2 '999-495-6544' 
'Smith Jane' 555 'BLUE DRIVE' NA 'A PLACE' Z5K4N5 '999-435-6172' 
'Smith Jane' 555 'BLUE DRIVE' NA 'A PLACE' Z5K4N5 '999-450-6763'", 
header = TRUE, stringsAsFactors = FALSE) 

A different answer from @ycw's, using data.table (because I am a fan of the package).

Using the data

dt <- read.table(text = "Name House Street Apt City Postal Phone 
'Bob Joe'  954 'BLUE DRIVE' NA 'A PLACE' Z5K4N2 '999-495-6544' 
'Smith Jane' 555 'BLUE DRIVE' NA 'A PLACE' Z5K4N5 '999-435-6172' 
'Smith Jane' 555 'BLUE DRIVE' NA 'A PLACE' Z5K4N5 '999-450-6763'", 
header = TRUE, stringsAsFactors = FALSE) 

we can run a neat one-liner

library(data.table) 
dt = as.data.table(dt) 
dt[,.(Phone = paste(Phone,collapse = " OR ")),by = .(Name,House,Street,Apt,City,Postal)] 

Output

         Name House     Street Apt    City Postal                        Phone 
1:    Bob Joe   954 BLUE DRIVE  NA A PLACE Z5K4N2                 999-495-6544 
2: Smith Jane   555 BLUE DRIVE  NA A PLACE Z5K4N5 999-435-6172 OR 999-450-6763 
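You could start with `fread`, which returns a data.table from the outset. Or, if you do want to convert, `setDT` is preferable to `as.data.table` because it converts by reference, meaning no copy is made. In fact, you can just write `setDT(dt)[` and do the conversion on the same line as the rest of the code. – lmo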

@lmo Sure, but the question is not about converting to a data.table as efficiently as possible. – zwep
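For reference, the `setDT` variant that lmo describes might look like the sketch below. It assumes the same sample data as in the answer and that the data.table package is installed; the result name `res` is only illustrative.

```r
library(data.table)

# Sample data from the answer, read as a plain data.frame.
dt <- read.table(text = "Name House Street Apt City Postal Phone
'Bob Joe' 954 'BLUE DRIVE' NA 'A PLACE' Z5K4N2 '999-495-6544'
'Smith Jane' 555 'BLUE DRIVE' NA 'A PLACE' Z5K4N5 '999-435-6172'
'Smith Jane' 555 'BLUE DRIVE' NA 'A PLACE' Z5K4N5 '999-450-6763'",
header = TRUE, stringsAsFactors = FALSE)

# setDT() converts dt to a data.table by reference (no copy is made),
# and the aggregation is chained onto it in the same expression.
res <- setDT(dt)[, .(Phone = paste(Phone, collapse = " OR ")),
                 by = .(Name, House, Street, Apt, City, Postal)]
res
```

Because the conversion happens by reference, `dt` itself is a data.table after this line, which differs from `as.data.table(dt)`, where the original object is left unchanged unless the result is reassigned.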