2017-09-01
0

Rolling cumulative sum with lag is not limited to the lag range

I have a large data frame with data in the following structure:

name  date val1 val2 
1  A 2017-01-01 0 2 
2  A 2017-01-02 1 1 
3  A 2017-01-03 1 0 
4  A 2017-01-04 0 3 
5  A 2017-01-05 1 1 
6  A 2017-01-06 0 0 
7  B 2017-01-01 0 0 
8  B 2017-01-02 0 3 
9  B 2017-01-03 1 2 
10 B 2017-01-04 1 1 
11 B 2017-01-05 0 0 
12 B 2017-01-06 1 0 
13 C 2017-01-01 0 2 
14 C 2017-01-02 0 1 
15 C 2017-01-03 1 2 
16 C 2017-01-04 0 0 
17 C 2017-01-05 0 0 
18 C 2017-01-06 1 3 

For every date within each name group, I now want to calculate the cumsum() of val1 over the last 2 occurrences and of val2 over the last 3 occurrences.

I tried this with the following code (based on this answer: https://stackoverflow.com/a/27649238/1162278; it includes the creation of the sample data set):

library(dplyr) 
library(data.table) 

dates <- seq(as.Date('2017-01-01'), as.Date('2017-01-06'), by = '1 day') 

# all name/date combinations, joined with the sample values
d <- CJ(
    name = c('A', 'B', 'C'),
    date = dates
) %>%
    left_join(
        data.frame(
            name = c(rep('A', 6), rep('B', 6), rep('C', 6)),
            date = c(rep(dates, 3)),
            val1 = c(0,1,1,0,1,0,0,0,1,1,0,1,0,0,1,0,0,1),
            val2 = c(2,1,0,3,1,0,0,3,2,1,0,0,2,1,2,0,0,3)
        ),
        by = c('name', 'date')
    )


d %>% 
    group_by(name) %>% 
    mutate(
    val1_l2 = dplyr::lag(cumsum(val1), k=2), 
    val2_l3 = dplyr::lag(cumsum(val2), k=3) 
) 

This produces:

name  date val1 val2 val1_l2 val2_l3 
    <chr>  <date> <dbl> <dbl> <dbl> <dbl> 
1  A 2017-01-01  0  2  NA  NA 
2  A 2017-01-02  1  1  0  2 
3  A 2017-01-03  1  0  1  3 
4  A 2017-01-04  0  3  2  3 
5  A 2017-01-05  1  1  2  6 
6  A 2017-01-06  0  0  3  7 
7  B 2017-01-01  0  0  NA  NA 
8  B 2017-01-02  0  3  0  0 
9  B 2017-01-03  1  2  0  3 
10  B 2017-01-04  1  1  1  5 
11  B 2017-01-05  0  0  2  6 
12  B 2017-01-06  1  0  2  6 
13  C 2017-01-01  0  2  NA  NA 
14  C 2017-01-02  0  1  0  2 
15  C 2017-01-03  1  2  0  3 
16  C 2017-01-04  0  0  1  5 
17  C 2017-01-05  0  0  1  5 
18  C 2017-01-06  1  3  1  5 

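However, it seems that cumsum() is always calculated over all previous records within the name group, rather than over the rolling ranges k=2 and k=3 for val1 and val2, respectively.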

Example:

Row Variable Calculated Expected 
    5 val1_l2  2   1 
    5 val2_l3  6   4 
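
The mismatch in the table above can be reproduced for val1 in group A with base R alone. This is only a sketch: the names val1_A, running, shifted and window2 are made up for illustration, and the manual shift stands in for a lag of one row. Note that the offset argument of dplyr::lag() is called n, not k, and the output shown in the question is consistent with a shift of one row, so k=2 appears to have had no effect there.

```r
# val1 for group A, copied from the sample data
val1_A <- c(0, 1, 1, 0, 1, 0)

# Running total over all rows so far
running <- cumsum(val1_A)

# Shift the running total down by one row (base R stand-in for a lag of 1)
shifted <- c(NA, head(running, -1))

# Sum of the current and the previous row only (a window of 2)
window2 <- val1_A + c(0, head(val1_A, -1))

print(shifted)  # NA 0 1 2 2 3, the val1_l2 column shown for group A
print(window2)  # 0 1 2 1 1 1
```

Row 5 is 2 in the shifted running total and 1 in the two-row window, which are the "Calculated" and "Expected" values for val1_l2.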

What am I doing wrong?

+1

This is not clear to me – Sotos

+0

Shouldn't 'val2_l3' in row 5 be 4 (3 + 0 + 1) according to your logic, rather than 4? – count

+0

Indeed it should, apologies and thanks for pointing it out. I have corrected it in the question. –

Answers

0

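We probably don't need lag here. We can replace all values except those in the last two or three rows with 0 and then use cumsum. Here is an example. Note that d2 is the final output. n():(n() - 1) and n():(n() - 2) refer to the last two and the last three rows of each group. ifelse(row_number() %in% ...) checks whether the row number is one of those last two or three rows.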

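d2 <- d %>%
    group_by(name) %>%
    # keep val1/val2 only in the last two/three rows of each group, 0 elsewhere
    mutate(val1_l2 = ifelse(row_number() %in% n():(n() - 1), val1, 0),
           val2_l3 = ifelse(row_number() %in% n():(n() - 2), val2, 0)) %>%
    # cumulative sum over the values that were kept
    mutate(val1_l2 = cumsum(val1_l2),
           val2_l3 = cumsum(val2_l3))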

d2 
# A tibble: 18 x 6 
# Groups: name [3] 
    name  date val1 val2 val1_l2 val2_l3 
    <chr>  <date> <dbl> <dbl> <dbl> <dbl> 
1  A 2017-01-01  0  2  0  0 
2  A 2017-01-02  1  1  0  0 
3  A 2017-01-03  1  0  0  0 
4  A 2017-01-04  0  3  0  3 
5  A 2017-01-05  1  1  1  4 
6  A 2017-01-06  0  0  1  4 
7  B 2017-01-01  0  0  0  0 
8  B 2017-01-02  0  3  0  0 
9  B 2017-01-03  1  2  0  0 
10  B 2017-01-04  1  1  0  1 
11  B 2017-01-05  0  0  0  1 
12  B 2017-01-06  1  0  1  1 
13  C 2017-01-01  0  2  0  0 
14  C 2017-01-02  0  1  0  0 
15  C 2017-01-03  1  2  0  0 
16  C 2017-01-04  0  0  0  0 
17  C 2017-01-05  0  0  0  0 
18  C 2017-01-06  1  3  1  3 
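
The code above keeps values only in the last two or three rows of each group. If a window sum is wanted for every date instead, one possible alternative is to subtract the running total from k rows earlier from the current running total. The following is a minimal sketch in base R, not a definitive implementation: roll_sum is a hypothetical helper introduced here, and it assumes the rows are already sorted by date within each name and that there are no missing values.

```r
# Hypothetical helper: sum of the current value and the k - 1 values before it
roll_sum <- function(x, k) {
  running <- cumsum(x)
  # running total k rows earlier; 0 where fewer than k rows precede
  earlier <- c(rep(0, min(k, length(x))), head(running, -k))
  running - earlier
}

# Sample data from the question, rebuilt as a plain data.frame
d <- data.frame(
  name = rep(c("A", "B", "C"), each = 6),
  date = rep(seq(as.Date("2017-01-01"), as.Date("2017-01-06"), by = "1 day"), 3),
  val1 = c(0,1,1,0,1,0,0,0,1,1,0,1,0,0,1,0,0,1),
  val2 = c(2,1,0,3,1,0,0,3,2,1,0,0,2,1,2,0,0,3)
)

# ave() applies the helper separately within each name group
d$val1_l2 <- ave(d$val1, d$name, FUN = function(x) roll_sum(x, 2))
d$val2_l3 <- ave(d$val2, d$name, FUN = function(x) roll_sum(x, 3))

print(d[d$name == "A", ])
```

For row 5 of group A this gives 1 for val1_l2 and 4 for val2_l3, the expected values from the question. Inside the dplyr pipeline the same helper could be called after group_by(name) as mutate(val1_l2 = roll_sum(val1, 2), val2_l3 = roll_sum(val2, 3)).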

Data

library(dplyr) 
library(data.table) 

dates <- seq(as.Date('2017-01-01'), as.Date('2017-01-06'), by = '1 day') 

# all name/date combinations, joined with the sample values
d <- CJ(
    name = c('A', 'B', 'C'),
    date = dates
) %>%
    left_join(
        data.frame(
            name = c(rep('A', 6), rep('B', 6), rep('C', 6)),
            date = c(rep(dates, 3)),
            val1 = c(0,1,1,0,1,0,0,0,1,1,0,1,0,0,1,0,0,1),
            val2 = c(2,1,0,3,1,0,0,3,2,1,0,0,2,1,2,0,0,3)
        ),
        by = c('name', 'date')
    )