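R: merge 2 data frames and apply reference data to all rows matching one level
I have two data frames: one ("grny") that is mostly a reference but has some of the data I'm after in its "yield" column, and another ("txie") that has a small amount of yield data but is missing reference data. I want to merge them so that every cell is filled in for rows that share a value in "site".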
The one that holds mostly year-by-year data is:
txie <- data.frame(site = c(rep("smithfield", 2), rep("belleville", 3)),
                   yield = c(rnorm(4, mean = 8), NA),
                   year = c(1999:2000, 1992:1994),
                   prim = c(rep("nt", 2), rep(NA, 3)))
The one that is mostly reference, with a few years of yield data, is:
grny <- data.frame(site = c("smithfield", "belleville", rep("nashua", 3)),
                   yield = c(rep(NA, 2), rnorm(3, mean = 9)),
                   year = c(rep(NA, 2), 1990:1992),
                   prim = c(NA, "nt", sample(c("nt", "ct"), 3, replace = TRUE)),
                   lat = c(rnorm(2, mean = 45, sd = 10), rep(49.1, 3)))
What I want:
        site    yield year prim  lib      lat
1 smithfield 7.009178 1999   nt 1109 43.61828
2 smithfield 8.472677 2000   nt 1109 43.61828
3 belleville 8.857462 1992   nt  122 74.08792
4 belleville 7.368488 1993   nt  122 74.08792
5 belleville       NA 1994   nt  122 74.08792
6     nashua 7.494519 1990   nt  554 49.10000
8     nashua 8.696066 1991   ct  554 49.10000
9     nashua 8.051670 1992   nt  554 49.10000
What I've tried:
library(plyr)        # for rbind.fill
library(data.table)  # for setDT and setkey
rbind.fill(txie, grny)  # appends rows under the correct columns, but leaves NAs everywhere because it doesn't know I want data missing in grny filled in when it is available in txie
Reduce(function(x, y) merge(x, y, by = "site", all.y = TRUE), list(txie, grny))  # merges by rows, but creates new variables from x and y
merge(x = txie, y = grny, by = "site", all = TRUE)  # does the same as the above (new variables from each of x and y, ending in .x or .y)
merge(x = txie, y = grny, by = "site", all.x = TRUE)  # similar to the above, but merges based on the x df (again with variables ending in .x or .y)
setkey(setDT(grny), site)[txie]  # gives a similar result to the all.x line
For example, with the outer-join merge I end up with:
        site  yield.x year.x prim.x  yield.y year.y prim.y      lat
1 belleville 6.766628   1992   <NA>       NA     NA     nt 34.92136
2 belleville 6.845789   1993   <NA>       NA     NA     nt 34.92136
3 belleville       NA   1994   <NA>       NA     NA     nt 34.92136
4 smithfield 8.841339   1999     nt       NA     NA   <NA> 49.81872
5 smithfield 7.313310   2000     nt       NA     NA   <NA> 49.81872
6     nashua       NA     NA   <NA> 9.173229   1990     ct 49.10000
7     nashua       NA     NA   <NA> 9.196018   1991     nt 49.10000
8     nashua       NA     NA   <NA> 7.336645   1992     ct 49.10000
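The `.x`/`.y` columns come from the `suffixes` argument of `merge()` (default `c(".x", ".y")`): any column present in both data frames but not listed in `by` is kept twice and renamed. A minimal illustration on toy data; the frames `a` and `b` are made up for this sketch, and the final collapsing step only makes sense when rows pair up one-to-one, which is not the case for txie and grny:

```r
# Toy data (hypothetical): both frames share 'site' and 'yield'
a <- data.frame(site = c("s1", "s2"), yield = c(1.5, NA))
b <- data.frame(site = c("s1", "s2"), yield = c(NA, 2.5), lat = c(40, 50))

# Merging by 'site' only: 'yield' exists in both frames, so it is duplicated
m <- merge(a, b, by = "site", all = TRUE)
merged_names <- names(m)  # site, yield.x, yield.y, lat

# Collapse the pair: take yield.x and fall back to yield.y where it is NA
m$yield <- ifelse(is.na(m$yield.x), m$yield.y, m$yield.x)
m <- m[, c("site", "yield", "lat")]
```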
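Stipulation: I want to keep the NAs that are already in the "yield" column (e.g. belleville in 1994). Does anyone have an answer, or can someone point me to an example of this kind of merge, where the data already sit in one or more shared columns that you are not merging by, rather than each df bringing in new columns besides the "by" variable?
Thanks!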
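One possible way to get this shape, offered as a sketch rather than a definitive answer: stack the rows that carry a year, then look up the site-level values with `match()` instead of `merge()`. It assumes that reference-only rows in grny are exactly those with an NA year, that each site has a single `lat`, and that a missing `prim` can be taken from any other row for the same site. The names `obs`, `lat_ref` and `prim_ref` are made up here, `set.seed()` and `stringsAsFactors = FALSE` are additions, and the `lib` column from the desired output is left out because it is not in the example data.

```r
set.seed(1)  # added only so the sketch is reproducible
txie <- data.frame(site = c(rep("smithfield", 2), rep("belleville", 3)),
                   yield = c(rnorm(4, mean = 8), NA),
                   year = c(1999:2000, 1992:1994),
                   prim = c(rep("nt", 2), rep(NA, 3)),
                   stringsAsFactors = FALSE)
grny <- data.frame(site = c("smithfield", "belleville", rep("nashua", 3)),
                   yield = c(rep(NA, 2), rnorm(3, mean = 9)),
                   year = c(rep(NA, 2), 1990:1992),
                   prim = c(NA, "nt", sample(c("nt", "ct"), 3, replace = TRUE)),
                   lat = c(rnorm(2, mean = 45, sd = 10), rep(49.1, 3)),
                   stringsAsFactors = FALSE)

# 1. Stack the observation rows: all of txie plus the grny rows that have a year
obs <- rbind(txie, grny[!is.na(grny$year), names(txie)])
rownames(obs) <- NULL

# 2. Site-level lookup for 'prim': first non-NA value seen for each site
prim_ref <- rbind(txie[, c("site", "prim")], grny[, c("site", "prim")])
prim_ref <- prim_ref[!is.na(prim_ref$prim), ]
prim_ref <- prim_ref[!duplicated(prim_ref$site), ]

# 3. Fill only the missing 'prim' cells; existing values are left alone
idx <- is.na(obs$prim)
obs$prim[idx] <- prim_ref$prim[match(obs$site[idx], prim_ref$site)]

# 4. Attach 'lat' by site (one value per site assumed)
lat_ref <- unique(grny[, c("site", "lat")])
obs$lat <- lat_ref$lat[match(obs$site, lat_ref$site)]

obs
```

Because `yield` is never touched after the `rbind()`, the NA for belleville in 1994 is preserved.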
Am I wrong to say that you should merge not on site alone, but on the site x year combination? –
The example may be confusing, but no, it can be kept simple: site alone is enough, because I won't be adding multiple years for the same site – Anomie