2017-04-17
1

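How to build an efficient lookup loop in R

I have a dataset containing the dates on which a person left a network. A person can leave the network more than once, because they may rejoin it after leaving. The following code reproduces the scenario.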

library(data.table) 
Leaving_Date<- data.table(Id= c(1,2,3,4,3,5),Date = as.Date(
c("2017-01-01","2017-02-03","2017-01-01","2017-03-10","2017-02-09","2017-02-05"))) 

(Ids repeat in this table because a person can leave the network several times, provided they rejoined it in between.)

> Leaving_Date 
    Id  Date 
1: 1 2017-01-01 
2: 2 2017-02-03 
3: 3 2017-01-01 
4: 4 2017-03-10 
5: 3 2017-02-09 
6: 5 2017-02-05 

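I have another dataset giving the dates on which a particular person was followed up. A follow-up can happen before or after they left the network. The following code reproduces that scenario.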

FOLLOWUPs <- data.table(Id = c(1,2,3,2,2,3,3,4,1,5), 
         Date =as.Date(c("2016-10-01","2017-02-04", 
         "2017-01-17","2017-02-23", "2017-03-03", 
         "2017-02-10","2017-02-11","2017-01-01", 
         "2017-01-15","2017-01-01"))) 


> FOLLOWUPs 
    Id  Date 
1: 1 2016-10-01 
2: 2 2017-02-04 
3: 3 2017-01-17 
4: 2 2017-02-23 
5: 2 2017-03-03 
6: 3 2017-02-10 
7: 3 2017-02-11 
8: 4 2017-01-01 
9: 1 2017-01-15 
10: 5 2017-01-01 

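Now, for each row in Leaving_Date, I want to look up when that person was followed up and create three columns (SevenDay, FourteenDay, ThirtyDay) that indicate with 0s and 1s whether a follow-up (if there was any) fell within each time period. I am using the following code: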

SEVENDAY_FOLLOWUP <- vector() 
FOURTEEN_FOLLOWUP <- vector() 
THIRTYDAY_FOLLOWUP <- vector() 
for(i in 1:nrow(Leaving_Date)){ 
    sub_data <- FOLLOWUPs[Id== Leaving_Date[i,1]] 
    if(nrow(sub_data[Date > Leaving_Date[i,Date] & 
        Date < (Leaving_Date[i,Date]+7)])== 0){ 
    SEVENDAY_FOLLOWUP <- rbind(SEVENDAY_FOLLOWUP,0) 
    } 
    else{ 
    SEVENDAY_FOLLOWUP <- rbind(SEVENDAY_FOLLOWUP,1) 
    } 

    if(nrow(sub_data[Date > Leaving_Date[i,Date] & 
        Date < (Leaving_Date[i,Date]+14)])== 0){ 
    FOURTEEN_FOLLOWUP <- rbind(FOURTEEN_FOLLOWUP,0) 
    } 
    else{ 
    FOURTEEN_FOLLOWUP <- rbind(FOURTEEN_FOLLOWUP,1) 
    } 

    if(nrow(sub_data[Date > Leaving_Date[i,Date] & 
        Date < (Leaving_Date[i,Date]+30)])== 0){ 
    THIRTYDAY_FOLLOWUP <- rbind(THIRTYDAY_FOLLOWUP,0) 
    } 
    else{ 
    THIRTYDAY_FOLLOWUP <- rbind(THIRTYDAY_FOLLOWUP,1) 
    } 
}    


Leaving_Date$SEVENDAY <- as.vector(SEVENDAY_FOLLOWUP) 
Leaving_Date$FOURTEENDAY <- as.vector(FOURTEEN_FOLLOWUP) 
Leaving_Date$THIRTYDAY <- as.vector(THIRTYDAY_FOLLOWUP) 

Final data

> Leaving_Date 
    Id  Date SEVENDAY FOURTEENDAY THIRTYDAY 
1: 1 2017-01-01  0   0   1 
2: 2 2017-02-03  1   1   1 
3: 3 2017-01-01  0   0   1 
4: 4 2017-03-10  0   0   0 
5: 3 2017-02-09  1   1   1 
6: 5 2017-02-05  0   0   0 

This code is very inefficient: I need to run it on 100K observations and it takes a lot of time. Is there a more efficient way to do this?

+1

You might want to read the second circle of the [R Inferno](http://www.burns-stat.com/pages/Tutor/R_inferno.pdf) – shayaa

+1

@Frank I edited it. –

Answers

4

Use a non-equi join:

setorder(FOLLOWUPs, Id, Date) 
Leaving_Date[, n := 
    FOLLOWUPs[.SD, on=.(Id, Date > Date), mult = "first", x.Date - i.Date] 
] 

    Id  Date  n 
1: 1 2017-01-01 14 days 
2: 2 2017-02-03 1 days 
3: 3 2017-01-01 16 days 
4: 4 2017-03-10 NA days 
5: 3 2017-02-09 1 days 
6: 5 2017-02-05 NA days 

Switching from Date to IDate might make this about twice as fast. See ?IDate.
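A minimal sketch of that switch, assuming the sample tables from the question. The columns are converted in place with `as.IDate()` before the same join is run; wrapping the difference in `as.integer()` is my own choice here so that `n` is a plain day count regardless of which class the subtraction returns. Any speed-up is something to measure on your own data rather than take from this sketch.

```r
library(data.table)

# Sample data from the question.
Leaving_Date <- data.table(
  Id   = c(1, 2, 3, 4, 3, 5),
  Date = as.Date(c("2017-01-01", "2017-02-03", "2017-01-01",
                   "2017-03-10", "2017-02-09", "2017-02-05"))
)
FOLLOWUPs <- data.table(
  Id   = c(1, 2, 3, 2, 2, 3, 3, 4, 1, 5),
  Date = as.Date(c("2016-10-01", "2017-02-04", "2017-01-17", "2017-02-23",
                   "2017-03-03", "2017-02-10", "2017-02-11", "2017-01-01",
                   "2017-01-15", "2017-01-01"))
)

# Convert both Date columns to data.table's integer-backed IDate, by reference.
Leaving_Date[, Date := as.IDate(Date)]
FOLLOWUPs[, Date := as.IDate(Date)]

# Same non-equi join as above: the first follow-up strictly after each leaving date.
setorder(FOLLOWUPs, Id, Date)
Leaving_Date[, n :=
  FOLLOWUPs[.SD, on = .(Id, Date > Date), mult = "first",
            as.integer(x.Date - i.Date)]
]
```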


I think it is best to stop here, but n can be compared against 7, 14 and 30 if necessary, like

Leaving_Date[, bin := c(7, 14, 30)[ findInterval(n, c(0, 7, 14, 30)) ]] 

    Id  Date  n bin 
1: 1 2017-01-01 14 days 30 
2: 2 2017-02-03 1 days 7 
3: 3 2017-01-01 16 days 30 
4: 4 2017-03-10 NA days NA 
5: 3 2017-02-09 1 days 7 
6: 5 2017-02-05 NA days NA 
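If the three 0/1 columns from the question are needed instead of a single bin, they can be derived from `n`. The sketch below assumes the `n` values shown above (entered here as plain numbers) and uses a hypothetical helper, `flag()`, that is not part of data.table. It treats a window of w days as `n < w`, which mirrors the strict `Date < Leaving_Date + w` test in the question's loop, and maps NA (no later follow-up) to 0.

```r
library(data.table)

# Leaving dates with n = days until the first later follow-up, as shown above.
Leaving_Date <- data.table(
  Id   = c(1, 2, 3, 4, 3, 5),
  Date = as.Date(c("2017-01-01", "2017-02-03", "2017-01-01",
                   "2017-03-10", "2017-02-09", "2017-02-05")),
  n    = c(14, 1, 16, NA, 1, NA)
)

# Hypothetical helper: 1 if a follow-up happened fewer than w days after leaving,
# 0 otherwise (including when there is no later follow-up at all).
# as.numeric() lets it accept either plain numbers or a difftime column.
flag <- function(n, w) as.integer(!is.na(n) & as.numeric(n) < w)

Leaving_Date[, `:=`(
  SEVENDAY    = flag(n, 7),
  FOURTEENDAY = flag(n, 14),
  THIRTYDAY   = flag(n, 30)
)]
```

Because only the first later follow-up is kept, one comparison per window is enough: if the earliest follow-up is outside a window, every later one is too.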

Side note: please don't name your tables like this.

0

I think this is what you are looking for, using dplyr.

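It performs an 'inner join' by Id, which produces every combination of dates from the two data frames for a given Id. It then computes the date differences, groups by Id and leaving date, and checks whether any value falls within the range for each of your three categories.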

library(dplyr) 

Leaving_Date2 <- Leaving_Date %>% inner_join(FOLLOWUPs %>% rename(FU_Date=Date)) %>% 
    mutate(datediff=as.numeric(FU_Date-Date)) %>% group_by(Id,Date) %>% 
    summarise(SEVENDAY=as.numeric(any(datediff %in% 0:6)), 
      FOURTEENDAY=as.numeric(any(datediff %in% 0:13)), 
      THIRTYDAY=as.numeric(any(datediff %in% 0:29))) 
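Two details of the pipeline above may matter on real data, so here is a hedged variant rather than a correction. It assumes the question's sample data, swaps `inner_join` for `left_join` so that a leaving row whose Id has no follow-ups at all is kept (and scored 0) instead of dropped, and uses ranges starting at 1 to mirror the strict `Date > Leaving_Date` test in the question's loop. Both changes are my assumptions about the intended behaviour.

```r
library(dplyr)

# Sample data from the question, as plain data frames.
Leaving_Date <- data.frame(
  Id   = c(1, 2, 3, 4, 3, 5),
  Date = as.Date(c("2017-01-01", "2017-02-03", "2017-01-01",
                   "2017-03-10", "2017-02-09", "2017-02-05"))
)
FOLLOWUPs <- data.frame(
  Id   = c(1, 2, 3, 2, 2, 3, 3, 4, 1, 5),
  Date = as.Date(c("2016-10-01", "2017-02-04", "2017-01-17", "2017-02-23",
                   "2017-03-03", "2017-02-10", "2017-02-11", "2017-01-01",
                   "2017-01-15", "2017-01-01"))
)

# Recent dplyr versions may warn that this is a many-to-many join;
# that is expected here, since Ids repeat in both tables.
Leaving_Date2 <- Leaving_Date %>%
  left_join(FOLLOWUPs %>% rename(FU_Date = Date), by = "Id") %>%
  mutate(datediff = as.numeric(FU_Date - Date)) %>%
  group_by(Id, Date) %>%
  # NA differences (no follow-up for that Id) never match a range, so they give 0.
  summarise(SEVENDAY    = as.integer(any(datediff %in% 1:6)),
            FOURTEENDAY = as.integer(any(datediff %in% 1:13)),
            THIRTYDAY   = as.integer(any(datediff %in% 1:29))) %>%
  ungroup()
```

Note that the result is ordered by Id and Date, not by the original row order of Leaving_Date.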
+0

If you change the ranges in your 'datediff %in%' statements to run from '0:n', your final result matches the desired result. –

+0

Ah, now I see what you are trying to do! Edited above. Thanks for spotting that! –

0

We can do this as a query instead of a loop. First, I tidied up your data.tables, because the variable names confused me.

To make the comparison step easier, we first precompute the follow-up date limits for the 7-, 14- and 30-day thresholds.

library(dplyr) 

dt_leaving_neat = Leaving_Date %>% 
    mutate(.id = 1:n()) %>% 
    mutate(limit_07 = Date + 7) %>% 
    mutate(limit_14 = Date + 14) %>% 
    mutate(limit_30 = Date + 30) %>% 
    rename(id = .id, id_person = Id, leaving_date = Date) 

dt_follow_neat = FOLLOWUPs %>% 
    select(id_person = Id, followed_up_date = Date) 

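The actual operation is just a single query. It is written in dplyr for readability, but if speed is your main concern you could translate it to data.table. I recommend running the pipeline one step at a time to make sure you understand what is happening.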

dt_followed_up = dt_leaving_neat %>% 
    tidyr::gather(follow_up, limit_date, limit_07:limit_30) %>% 
    left_join(dt_follow_neat, by = "id_person") %>% 
    mutate(followed_up = (followed_up_date > leaving_date) & (followed_up_date < limit_date)) %>% 
    select(id, id_person, leaving_date, follow_up, followed_up) %>% 
    filter(followed_up == TRUE) %>% 
    unique() %>% 
    tidyr::spread(follow_up, followed_up, fill = 0) %>% 
    select(id, id_person, leaving_date, limit_07, limit_14, limit_30) 

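The idea is to join the leaving dates to the follow-up dates and check whether each follow-up date is within the threshold (and also after the leaving date, since presumably you can't follow up before leaving).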
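As a worked instance of that check, take Id 1 from the sample data: left on 2017-01-01, followed up on 2017-01-15. The sketch below applies the same two strict comparisons to the three limits in base R; the variable names simply echo the column names used in this answer.

```r
# One leaving date and one follow-up date for the same person (Id 1 in the sample data).
leaving_date     <- as.Date("2017-01-01")
followed_up_date <- as.Date("2017-01-15")

# Limits for the 7-, 14- and 30-day windows.
limit_date <- leaving_date + c(7, 14, 30)

# Same test as in the pipeline: strictly after leaving and strictly before the limit.
followed_up <- (followed_up_date > leaving_date) & (followed_up_date < limit_date)

# Label the three results with the column names used in this answer.
followed_up <- setNames(as.integer(followed_up),
                        c("limit_07", "limit_14", "limit_30"))
```

The follow-up falls exactly 14 days after leaving, so it counts only for the 30-day window: both comparisons are strict, and a follow-up on day 14 is not "within 14 days" under this definition.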

Then do some final clean-up to return the format you wanted. You can also use select or rename to change the column names.

dt_result = dt_leaving_neat %>% 
    select(id, id_person, leaving_date) %>% 
    left_join(dt_followed_up, by = c("id", "id_person", "leaving_date")) 

dt_result[is.na(dt_result)] = 0 

Result

> dt_result 
    id id_person leaving_date limit_07 limit_14 limit_30 
1 1   1 2017-01-01  0  0  1 
2 2   2 2017-02-03  1  1  1 
3 3   3 2017-01-01  0  0  1 
4 4   4 2017-03-10  0  0  0 
5 5   3 2017-02-09  1  1  1 
6 6   5 2017-02-05  0  0  0 

And, following on from Andrew's answer, the equivalent one-line data.table solution is