2017-09-01 82 views

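R: merge 2 data frames and apply reference data to all rows matching a level

I have two data frames: one ("grny") that is mostly a reference but has some data in the "yield" column that I'm after, and another ("txie") that holds yield data with a few values missing. I want to merge them so that all cells in rows sharing a common value in "site" are complete.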

The one that is mostly year-by-year data:

txie <- data.frame(site  = c(rep("smithfield", 2), rep("belleville", 3)),
                   yield = c(rnorm(4, mean = 8), NA),
                   year  = c(1999:2000, 1992:1994),
                   prim  = c(rep("nt", 2), rep(NA, 3)))

The one that is mostly reference, with yield data for some years:

grny <- data.frame(site  = c("smithfield", "belleville", rep("nashua", 3)),
                   yield = c(rep(NA, 2), rnorm(3, mean = 9)),
                   year  = c(rep(NA, 2), 1990:1992),
                   prim  = c(NA, "nt", sample(c("nt", "ct"), 3, replace = TRUE)),
                   lat   = c(rnorm(2, mean = 45, sd = 10), rep(49.1, 3)))

What I want:

        site    yield year prim  lib      lat
1 smithfield 7.009178 1999   nt 1109 43.61828
2 smithfield 8.472677 2000   nt 1109 43.61828
3 belleville 8.857462 1992   nt  122 74.08792
4 belleville 7.368488 1993   nt  122 74.08792
5 belleville       NA 1994   nt  122 74.08792
6     nashua 7.494519 1990   nt  554 49.10000
8     nashua 8.696066 1991   ct  554 49.10000
9     nashua 8.051670 1992   nt  554 49.10000

What I've tried:

library(plyr)        # provides rbind.fill()
library(data.table)  # provides setDT() and setkey()
rbind.fill(txie, grny) # this appends rows to the correct columns but leaves NAs everywhere, because it doesn't know I want data missing in grny filled in when it is available in txie
Reduce(function(x, y) merge(txie, grny, by = "site", all.y = TRUE), list(txie, grny)) # this merges by rows but creates new variables from x and y
merge(x = txie, y = grny, by = "site", all = TRUE) # this does the same as the above (new variables from each of x and y, ending in .x or .y)
merge(x = txie, y = grny, by = "site", all.x = TRUE) # this is similar to the above but merges based on the x df (new variables from each of x and y, ending in .x or .y)
setkey(setDT(grny), site)[txie] # this gives a similar result to the all.x line

For example, merging with an outer join I end up with:

        site  yield.x year.x prim.x  yield.y year.y prim.y      lat
1 belleville 6.766628   1992   <NA>       NA     NA     nt 34.92136
2 belleville 6.845789   1993   <NA>       NA     NA     nt 34.92136
3 belleville       NA   1994   <NA>       NA     NA     nt 34.92136
4 smithfield 8.841339   1999     nt       NA     NA   <NA> 49.81872
5 smithfield 7.313310   2000     nt       NA     NA   <NA> 49.81872
6     nashua       NA     NA   <NA> 9.173229   1990     ct 49.10000
7     nashua       NA     NA   <NA> 9.196018   1991     nt 49.10000
8     nashua       NA     NA   <NA> 7.336645   1992     ct 49.10000

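Stipulation: I want to keep the NAs that are already in the "yield" column (e.g. belleville in 1994). Any answers, or can someone point me to an example of this kind of merge (where the data already sit in one or more shared columns, rather than each df bringing in new columns apart from the "by" variable)?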

Thanks!


Am I wrong in saying that you should merge not on site alone, but on the combination site x year?


The example may be confusing, but no, it can be kept simple: site alone is enough, because I won't be adding multiple years for the same site. – Anomie

Answer


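Using the dplyr package, you can do a full_join and then apply the coalesce function to each pair of columns (yield.x vs yield.y, prim.x vs prim.y, and so on) to pick up the non-NA value.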

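library(dplyr)
full_join(txie, grny, by = "site") %>%
  # for each pair, take the .x value unless it is NA, then fall back to .y
  mutate(year  = coalesce(year.x, year.y),
         yield = coalesce(yield.x, yield.y),
         prim  = coalesce(prim.x, prim.y)) %>%
  # drop the suffixed columns once they have been combined
  select(-c(year.x, year.y, yield.x, yield.y, prim.x, prim.y))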

        site      lat year     yield prim
1 smithfield 59.71994 1999  7.920844   nt
2 smithfield 59.71994 2000 10.122713   nt
3 belleville 34.93358 1992  8.622351   nt
4 belleville 34.93358 1993  7.360470   nt
5 belleville 34.93358 1994        NA   nt
6     nashua 49.10000 1990  9.083390   ct
7     nashua 49.10000 1991  8.073866   nt
8     nashua 49.10000 1992  8.725625   nt
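For readers without dplyr, the same join-then-coalesce idea can probably be sketched in base R. This is only an illustration under assumptions: the data values below are fixed, made-up stand-ins for the random ones in the question, and `first_non_na` is a hypothetical helper name, not a function from any package.

```r
# Deterministic toy versions of the two data frames (values are illustrative only)
txie <- data.frame(
  site  = c("smithfield", "smithfield", "belleville", "belleville", "belleville"),
  yield = c(7.0, 8.5, 8.9, 7.4, NA),
  year  = c(1999L, 2000L, 1992L, 1993L, 1994L),
  prim  = c("nt", "nt", NA, NA, NA),
  stringsAsFactors = FALSE
)
grny <- data.frame(
  site  = c("smithfield", "belleville", "nashua", "nashua", "nashua"),
  yield = c(NA, NA, 9.2, 9.1, 7.3),
  year  = c(NA, NA, 1990L, 1991L, 1992L),
  prim  = c(NA, "nt", "ct", "nt", "ct"),
  lat   = c(43.6, 74.1, 49.1, 49.1, 49.1),
  stringsAsFactors = FALSE
)

# Hypothetical helper: keep x where it is not NA, otherwise fall back to y
first_non_na <- function(x, y) ifelse(is.na(x), y, x)

# Full outer join on site; shared non-key columns get .x / .y suffixes
merged <- merge(txie, grny, by = "site", all = TRUE)

# Collapse each .x / .y pair into a single column, then drop the pair
for (col in c("yield", "year", "prim")) {
  x_name <- paste0(col, ".x")
  y_name <- paste0(col, ".y")
  merged[[col]] <- first_non_na(merged[[x_name]], merged[[y_name]])
  merged[[x_name]] <- NULL
  merged[[y_name]] <- NULL
}

merged
```

A value that is NA in both frames (belleville 1994 here) stays NA, which matches the stipulation in the question.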

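Thanks! This works. Just as a reference for other newbies like me (it seems obvious and simple now): I had to make sure that all vectors with the same name were the same type in both dfs. – Anomie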
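The type check mentioned in that comment could look roughly like the following. It is a sketch on made-up toy frames `a` and `b` (not the question's data), assuming the mismatch is a factor in one frame against a character vector in the other. Whether coalesce accepts such a mix may depend on the dplyr version, so aligning the types first is the cautious route.

```r
# Toy frames: prim is a factor in a but a character vector in b
a <- data.frame(site = c("s1", "s2"), prim = factor(c("nt", NA)),
                stringsAsFactors = FALSE)
b <- data.frame(site = c("s1", "s2"), prim = c(NA, "ct"),
                stringsAsFactors = FALSE)

# Compare the first class of every column name the two frames share
shared <- intersect(names(a), names(b))
classes <- data.frame(
  column = shared,
  a = vapply(shared, function(n) class(a[[n]])[1], character(1)),
  b = vapply(shared, function(n) class(b[[n]])[1], character(1)),
  stringsAsFactors = FALSE
)
mismatched <- classes$column[classes$a != classes$b]

# Convert any factor among the mismatched columns to character in both frames
for (n in mismatched) {
  if (is.factor(a[[n]])) a[[n]] <- as.character(a[[n]])
  if (is.factor(b[[n]])) b[[n]] <- as.character(b[[n]])
}
```

Other kinds of mismatch (for example integer against character) would need their own conversion; this sketch handles only the factor case.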
