2015-11-06 99 views

How can I compute the mean of a variable between two dates? A reproducible data frame is below.

year <- c(1996,1996,1996,1996,1996,1996,1996,1996,1996,1996,1996,1996, 
     1996,1996,1996,1996,1996,1996,1996,1996,1996,1996,1996,1996, 
     1997,1997,1997,1997,1997,1997,1997,1997,1997,1997,1997,1997, 
     1997,1997,1997,1997,1997,1997,1997,1997,1997,1997,1997,1997) 
month <- c("JAN","FEB","MAR","APR","MAY","JUN","JUL","AUG","SEP","OCT","NOV","DEC") 
station <- c("A","A","A","A","A","A","A","A","A","A","A","A", 
     "B","B","B","B","B","B","B","B","B","B","B","B") 

concentration <- as.numeric(round(runif(48,20,40),1)) 

df <- data.frame(year,month,station,concentration) 


id <- c(1,2,3,4) 
station1996 <- c("A","A","B","B") 
station1997 <- c("B","A","A","B") 
start <- c("06/01/1996","07/01/1996","07/01/1996","08/01/1996") 
end <- c("04/01/1997","04/01/1997","04/01/1997","05/01/1997") 

participant <- data.frame(id,station1996,station1997,start,end) 
participant$start <- as.Date(participant$start, format = "%m/%d/%Y") 
participant$end <- as.Date(participant$end, format = "%m/%d/%Y") 

So I have two datasets, as follows:

df
   year month station concentration
1  1996   JAN       A          24.4
2  1996   FEB       A          37.0
3  1996   MAR       A          39.5
4  1996   APR       A          28.0
...
45 1997   SEP       B          37.7
46 1997   OCT       B          35.2
47 1997   NOV       B          26.8
48 1997   DEC       B          40.0

participant
  id station1996 station1997      start        end
1  1           A           B 1996-06-01 1997-04-01
2  2           A           A 1996-07-01 1997-04-01
3  3           B           A 1996-07-01 1997-04-01
4  4           B           B 1996-08-01 1997-05-01

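For each id, I want to compute the mean concentration between the start and end dates (by month and year). Note that a participant's station may change from one year to the next.

For example, for id = 1 I want the mean concentration from June 1996 to April 1997. This should be based on the concentrations at station A from June 1996 to December 1996 and at station B from January 1997 to April 1997.

Can anyone help?

Many thanks.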

+1

Step 1: convert 'start' and 'end' to 'Date' or 'POSIXct' format, and combine 'year' and 'month' into a new column in the same format. – MichaelChirico

+0

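You could also convert them to strings of the form "1997-10". Then you can do something like 'mean(concentration[date >= start & date <= end])' –

+0

'library(zoo); as.yearmon(participant$start)' etc... could also be very handy in this case, if you don't want to deal with the slightly clunky POSIXct format. – thelatemail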
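
The approach suggested in the comments above can be sketched for a single participant as follows. This is only an illustration under stated assumptions: it uses base R Date values instead of zoo, it replaces the random concentrations with the fixed values 1:48 so the result can be checked by hand, and the names month_num, station_by_year, in_window, at_station and mean_id1 are made up for the example.

```r
# Rebuild the question's df with fixed values so the result is checkable
# (the original uses runif(); 1:48 is an illustrative stand-in).
df <- data.frame(
  year          = rep(c(1996, 1997), each = 24),
  month         = rep(toupper(month.abb), times = 4),
  station       = rep(rep(c("A", "B"), each = 12), times = 2),
  concentration = as.numeric(1:48),
  stringsAsFactors = FALSE
)

# Month-start Date for every row; matching against month.abb avoids
# locale-dependent parsing of "JAN", "FEB", ...
month_num <- match(df$month, toupper(month.abb))
df$date <- as.Date(sprintf("%d-%02d-01", as.integer(df$year), month_num))

# Participant id = 1: station A in 1996, station B in 1997
start <- as.Date("1996-06-01")
end   <- as.Date("1997-04-01")
station_by_year <- c("1996" = "A", "1997" = "B")

# Keep rows inside the date window AND at the station valid for that year
in_window  <- df$date >= start & df$date <= end
at_station <- df$station == station_by_year[as.character(df$year)]

mean_id1 <- mean(df$concentration[in_window & at_station])
mean_id1
```

With the fixed values, the selection covers 7 months at station A (June to December 1996) and 4 months at station B (January to April 1997), so the mean is taken over 11 rows.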

Answers

1

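Here is a data.table solution. The basic idea is to list every month in the start-to-end range as a yearmon, for each id, and then use that as an index into the concentration table df. It is a bit convoluted, so hopefully someone will come along and show you a simpler way.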

library(data.table) 
library(zoo)   # for as.yearmon(...) 
setDT(df)    # convert to data.table 
setDT(participant) 
df[, yrmon:= as.yearmon(paste(year,month,sep="-"), format="%Y-%B")] # add year-month column 
p.melt <- reshape(participant, varying=2:3, direction="long", sep="", timevar="year") 
x <- participant[, .(date=seq(start,end,by="month")), by=id] 
x[, c("year","yrmon"):=.(year(date),as.yearmon(date))]   # add year and year-month 
x[p.melt, station:=station, on=c("id","year")]     # add station 
x[df, conc:= concentration, on=c("yrmon","station"), nomatch=0] # add concentration 
setorder(x,id) # not necessary, but makes it easier to interpret x 
result <- x[, .(mean.conc=mean(conc)), by=id]     # mean(conc) by id 
result 
#    id mean.conc
# 1:  1  28.61818
# 2:  2  28.56000
# 3:  3  28.44000
# 4:  4  29.60000

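So, first we convert everything to data.tables. Then we add a yrmon column to df for indexing later. Next we create p.melt by reshaping participant into long format, so that the station is in one column and the year indicator (1996 or 1997) is in a separate column. Then we create a temporary table x with a monthly sequence of dates for each id, and add year and yrmon for each date. We then merge p.melt into x on id and year to pick up the station, and merge x with df on yrmon and station to pick up the matching concentration. Finally we aggregate conc in x by id using mean(...).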
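
For comparison, the same steps can be sketched in base R without data.table or zoo. This is a rough equivalent rather than the answer's method. It assumes only the two years 1996 and 1997 (the ifelse() on year is hard-coded for that), uses fixed concentrations 1:48 instead of runif() so the output is reproducible, and the names months_long and merged are made up for the example.

```r
# Question's data, with fixed concentrations (illustrative stand-in for runif())
df <- data.frame(
  year          = rep(c(1996, 1997), each = 24),
  month         = rep(toupper(month.abb), times = 4),
  station       = rep(rep(c("A", "B"), each = 12), times = 2),
  concentration = as.numeric(1:48),
  stringsAsFactors = FALSE
)
participant <- data.frame(
  id          = 1:4,
  station1996 = c("A", "A", "B", "B"),
  station1997 = c("B", "A", "A", "B"),
  start = as.Date(c("1996-06-01", "1996-07-01", "1996-07-01", "1996-08-01")),
  end   = as.Date(c("1997-04-01", "1997-04-01", "1997-04-01", "1997-05-01")),
  stringsAsFactors = FALSE
)

# One row per participant per month in [start, end], with the station
# that applies in that year (assumption: only 1996 and 1997 occur)
months_long <- do.call(rbind, lapply(seq_len(nrow(participant)), function(i) {
  d  <- seq(participant$start[i], participant$end[i], by = "month")
  yr <- as.numeric(format(d, "%Y"))
  data.frame(
    id      = participant$id[i],
    year    = yr,
    month   = toupper(month.abb)[as.integer(format(d, "%m"))],
    station = ifelse(yr == 1996,
                     participant$station1996[i],
                     participant$station1997[i]),
    stringsAsFactors = FALSE
  )
}))

# Look up the concentration for each (year, month, station), then average by id
merged <- merge(months_long, df, by = c("year", "month", "station"))
result <- aggregate(concentration ~ id, data = merged, FUN = mean)
result
```

The merge() call plays the role of the two data.table joins, and aggregate() replaces the final grouped mean. With real data, the runif() concentrations from the question would go in place of 1:48.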
