2016-03-02

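How to make a nested for loop more efficient and use it with apply

I am trying to convert a nested for loop to use apply, hoping that this will make it much faster (from what I have read it should, although that is not always the case). The main data frame being looped over has about 150K rows, so the loop is very time-consuming.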

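I have written a for loop in R that checks whether a date.time in df1 lies between the two date.times (t_in and t_out) in df2 and, if the code in df1 matches the Code in df2, pastes the Location from df2 into df1.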

Here is a subset of the data as a sample:

df1<-structure(list(date.time = structure(c(1455922438, 1455922445, 
1455922449, 1455922457, 1455922459, 1455922461), class = c("POSIXct", 
"POSIXt"), tzone = ""), code = c(32221, 32222, 32221, 32222, 
32222, 32221)), .Names = c("date.time", "code"), row.names = 50000:50005, class = "data.frame") 

df2<-structure(list(Location = 11:12, Code = 32221:32222, t_in = structure(c(1455699600, 
1455699600), class = c("POSIXct", "POSIXt"), tzone = ""), t_out = structure(c(1456401600, 
1456401600), class = c("POSIXct", "POSIXt"), tzone = "")), .Names = c("Location", 
"Code", "t_in", "t_out"), class = "data.frame", row.names = 11:12) 

The for loop, which works correctly but takes a long time:

for (i in 1:nrow(df1)[1]){ 
    for (j in 1:nrow(df2)){ 
    ifelse(df1$code[i] == df2$Code[j] 
      & df1$date.time [i] < df2$t_out [j] 
      & df1$date.time [i] > df2$t_in [j], 
      df1$Location [i] <- df2$Location [j], 
      NA) 
    } 
} 

I got part of the way there with this:

ids <- as.numeric(df2$Location) 
f <- function(x){ 
    a <- ids[ (df2$t_in < x) & (x < df2$t_out) ] 
    if (length(a) == 0) NA else a 
} 

df1$Location <- lapply(df1$date.time, f) 

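This returns two numbers for each row, because the date.time in df1 lies between t_in and t_out for both rows of df2. That is why the code in the two data frames also has to match before the Location is pasted.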

Any pointers would be greatly appreciated.


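It would be ideal if you could provide a slightly better test case, one in which some i,j combinations do not satisfy the time condition. In any case, please report back the system.time results. –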


You could try the sqldf package to see whether converting the data frames into a local database and then running a query on it improves the speed. – fhlgood

Answers


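The data.table package has overlapping range joins that can do this very quickly. The function you are looking for is foverlaps. Below is an example that does a little cleanup first and then uses foverlaps: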

library(data.table) 

dt1 <- data.table(df1) 
dt2 <- data.table(df2) 

## need to create a range in dt 1 to find overlaps on 
dt1[,start:=date.time] 
dt1[,end:=date.time] 

## clean up names to match each other 
setnames(dt2,c("Location","Code","start","end")) 
setnames(dt1,c("code"),c("Code")) 

setkey(dt1,Code,start,end) 
setkey(dt2,Code,start,end) 

## use foverlaps with the additional matching variable Code 
out <- foverlaps(dt1,dt2,type="any", 
       by.x=c("Code","start","end"), 
       by.y=c("Code","start","end")) 

## date.time is carried over from dt1 unchanged, so no rename is needed; 
## just select the same subset of columns 
out <- out[,.(date.time,Code,Location)] 

This gives the output:

> out
             date.time  Code Location
1: 2016-02-19 14:53:58 32221       11
2: 2016-02-19 14:54:09 32221       11
3: 2016-02-19 14:54:21 32221       11
4: 2016-02-19 14:54:05 32222       12
5: 2016-02-19 14:54:17 32222       12
6: 2016-02-19 14:54:19 32222       12
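
One caveat, as far as I know: foverlaps treats its intervals as closed, so a date.time exactly equal to t_in or t_out would count as a match, whereas the loop in the question uses strict < and >. As an alternative sketch (not the foverlaps approach above), newer data.table versions also support non-equi joins in `on`, which can express the strict comparisons directly. The tables `x` and `y` and all their values below are hypothetical toy data, with plain numbers standing in for the POSIXct times:

```r
library(data.table)

# Hypothetical toy data; plain numbers stand in for the POSIXct times.
x <- data.table(time = c(5, 15, 25, 10), code = c(1, 1, 2, 1))
y <- data.table(Location = c(11L, 12L), Code = c(1, 2),
                t_in = c(0, 20), t_out = c(10, 30))

# Update join: for every row of y, find the rows of x with the same code
# whose time lies strictly inside (t_in, t_out), and copy Location across.
# Rows of x without any match are left as NA.
x[y, Location := i.Location,
  on = .(code == Code, time > t_in, time < t_out)]
print(x)
```

In this toy data the second row falls outside its interval and the fourth row sits exactly on t_out, so neither receives a Location.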

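Probably faster than my triple "outer" approach, but I hope the OP will report the "system.time" results of both approaches alongside his double for-loop approach. –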


I tried to build a "loop-less" version that does not depend on any for or apply. See whether it is any faster:

trans <- which(outer(X=df1$code, Y=df2$Code,'==') & 
       outer(df1$date.time , df2$t_in, ">") & 
       outer(df1$date.time, df2$t_out , "<") , arr.ind=TRUE) 
df1$Location [ trans[,1] ] <- df2$Location [ trans[,2] ] 
df1 
#------ 
                date.time  code Location
50000 2016-02-19 14:53:58 32221       11
50001 2016-02-19 14:54:05 32222       12
50002 2016-02-19 14:54:09 32221       11
50003 2016-02-19 14:54:17 32222       12
50004 2016-02-19 14:54:19 32222       12
50005 2016-02-19 14:54:21 32221       11

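The three calls to outer each build an i-by-j logical matrix that is TRUE where one of the three separate conditions is met. AND-ing them together gives the cells where all three hold jointly, and which(. , arr.ind=TRUE) then returns a matrix with the i values in the first column and the j values in the second, so an ordinary [<- assignment with the corresponding vectors can be used.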
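
To make the mechanism concrete, here is a minimal base R sketch of the same three outer calls on a tiny data set. The data frames `x` and `y` and their values are made up for illustration, with plain numbers standing in for the POSIXct times, and they are chosen so that some rows match no interval (the case the comment on the question asked about):

```r
# Hypothetical toy data; plain numbers stand in for the POSIXct times.
x <- data.frame(time = c(5, 15, 25, 5), code = c(1, 1, 2, 2))
y <- data.frame(Location = c(11, 12), Code = c(1, 2),
                t_in = c(0, 20), t_out = c(10, 30))

# Each outer() call yields an nrow(x) by nrow(y) logical matrix;
# AND-ing them keeps only the cells where all three conditions hold.
hit <- outer(x$code, y$Code, "==") &
       outer(x$time, y$t_in, ">") &
       outer(x$time, y$t_out, "<")

# Row/column indices of the TRUE cells: column 1 indexes x, column 2 indexes y.
idx <- which(hit, arr.ind = TRUE)

# Start from NA so that rows without a match stay NA.
x$Location <- NA_real_
x$Location[idx[, 1]] <- y$Location[idx[, 2]]
print(x)
```

Two things are worth checking on real data. If a row of x matched several rows of y, its index would appear more than once in idx and the last assignment would win. The logical matrices also have nrow(x) times nrow(y) cells each, so memory use grows with both tables.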