Group rows up to the current row in R data.table

I have a dataset that looks like this:

library(data.table) 

set.seed(10) 

n_rows <- 50 

data <- data.table(id = 1:n_rows,
                   timestamp = Sys.Date() + as.difftime(1:n_rows, units = "days"),
                   subject = sample(letters[1:4], n_rows, replace = TRUE),
                   response = sample(3, n_rows, replace = TRUE)
                   )

head(data, 10) 

    id  timestamp subject response
 1:  1 2016-05-17       c        2
 2:  2 2016-05-18       b        3
 3:  3 2016-05-19       b        1
 4:  4 2016-05-20       c        2
 5:  5 2016-05-21       a        1
 6:  6 2016-05-22       a        2
 7:  7 2016-05-23       b        2
 8:  8 2016-05-24       b        2
 9:  9 2016-05-25       c        2
10: 10 2016-05-26       b        2

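I need to do some group-by operations to count, for each subject, how many times each response has occurred so far.

The following group-by produces the nth_test column.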

new_vars <- data[, .(id, timestamp, nth_test = 1:.N, response), by=.(subject)] 

    subject id  timestamp nth_test response
 1:       c  1 2016-05-17        1        2
 2:       c  4 2016-05-20        2        2
 3:       c  9 2016-05-25        3        2
 4:       c 11 2016-05-27        4        1
 5:       c 12 2016-05-28        5        1
 6:       c 14 2016-05-30        6        2
 7:       c 22 2016-06-07        7        2
 8:       c 26 2016-06-11        8        2
 9:       c 31 2016-06-16        9        3
10:       c 36 2016-06-21       10        1

But I don't know how to produce the columns resp_1, resp_2 and resp_3 as shown below.

    subject id  timestamp nth_test response resp_1 resp_2 resp_3
 1:       c  1 2016-05-17        1        2      0      1      0
 2:       c  4 2016-05-20        2        2      0      2      0
 3:       c  9 2016-05-25        3        2      0      3      0
 4:       c 11 2016-05-27        4        1      1      3      0
 5:       c 12 2016-05-28        5        1      2      3      0
 6:       c 14 2016-05-30        6        2      2      4      0
 7:       c 22 2016-06-07        7        2      2      5      0
 8:       c 26 2016-06-11        8        2      2      6      0
 9:       c 31 2016-06-16        9        3      2      6      1
10:       c 36 2016-06-21       10        1      3      6      1

Cheers


2016-05-16 efbbrown

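How is your data ordered? These column values depend on the order of your data. You can do something like 'resp_i := cumsum(response == i)' – Psidom

@Psidom That is exactly what I needed, thanks. – efbbrown

We can try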
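Spelled out for the three response values, the suggestion in the comment above might look like the following. This is only a minimal sketch on a small made-up table (the name toy and its values are assumptions, not the question's data). It sorts explicitly first, because the running counts depend on row order.

```r
library(data.table)

# Hypothetical toy data (not the question's random sample)
toy <- data.table(
  id       = 1:6,
  subject  = c("a", "b", "a", "a", "b", "a"),
  response = c(2L, 1L, 2L, 1L, 3L, 3L)
)

# Make the row order explicit before taking running counts
setorder(toy, subject, id)

# One running count per response value, computed within each subject
toy[, `:=`(
  resp_1 = cumsum(response == 1),
  resp_2 = cumsum(response == 2),
  resp_3 = cumsum(response == 3)
), by = subject]

toy[subject == "a"]
```

Each cumsum() counts how many rows so far, within the subject, carry that response value, so the count only increases on rows where the response matches.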

Un1 <- unique(sort(data$response)) 
data[, c("nth_test", paste("resp", Un1, sep="_")) := c(list(1:.N), 
     lapply(Un1, function(x) cumsum(x==response))) , .(subject)] 
data[order(subject, timestamp)][subject=="c"] 
#     id  timestamp subject response nth_test resp_1 resp_2 resp_3
#  1:  1 2016-05-17       c        2        1      0      1      0
#  2:  4 2016-05-20       c        2        2      0      2      0
#  3:  9 2016-05-25       c        2        3      0      3      0
#  4: 11 2016-05-27       c        1        4      1      3      0
#  5: 12 2016-05-28       c        1        5      2      3      0
#  6: 14 2016-05-30       c        2        6      2      4      0
#  7: 22 2016-06-07       c        2        7      2      5      0
#  8: 26 2016-06-11       c        2        8      2      6      0
#  9: 31 2016-06-16       c        3        9      2      6      1
# 10: 36 2016-06-21       c        1       10      3      6      1
# 11: 39 2016-06-24       c        1       11      4      6      1
# 12: 40 2016-06-25       c        1       12      5      6      1
# 13: 44 2016-06-29       c        2       13      5      7      1


2016-05-16 03:05:15 akrun

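Thanks, a nice and elegant solution. – efbbrown

Nice answer, but what is the point of ordering on 'subject' if you subset on it afterwards? Surely it is better to subset first and then order by 'timestamp'. – jangorecki

@jangorecki You're right. I was just showing the expected output from the OP's post. – akrun

I wanted to see what this would look like if done in data.table in long format with cummax/cumsum (perhaps more efficient in some configurations):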
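To illustrate the point raised in the comments, here is a small sketch comparing the two orders of operations. The table toy and its values are made up for illustration and are not the question's data.

```r
library(data.table)

# Hypothetical toy data; timestamps deliberately out of order
toy <- data.table(
  subject   = c("c", "a", "c", "b", "c"),
  timestamp = as.Date("2016-05-17") + c(4, 0, 1, 2, 3),
  response  = c(1L, 2L, 2L, 3L, 2L)
)

# Order the whole table, then subset (as in the answer)
res_order_first <- toy[order(subject, timestamp)][subject == "c"]

# Subset first, then order only the remaining rows by timestamp
res_subset_first <- toy[subject == "c"][order(timestamp)]

# The rows come out in the same order either way
identical(res_order_first$timestamp, res_subset_first$timestamp)
identical(res_order_first$response, res_subset_first$response)
```

Subsetting first means only one subject's rows are sorted, which should matter more as the table grows.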

data[order(subject, timestamp)
     ][, rCnt := 1:.N, .(subject, response)
     ][, responseStr := sprintf('%s_%s', 'resp', response)
     ][, dcast(.SD, id + timestamp + subject + response ~ responseStr, value.var='rCnt', fill=0)
     ][, melt(.SD, id.vars=c('id', 'timestamp', 'subject', 'response'))
     ][order(subject, timestamp)
     ][, value := cummax(value), .(subject, variable)
     ][, nth_test := 1:.N, .(subject, variable)
     ][, dcast(.SD, id + timestamp + subject + response + nth_test ~ variable, value.var='value')
     ][order(subject, timestamp)
     ][subject == 'c'
     ]

    id  timestamp subject response nth_test resp_1 resp_2 resp_3
 1:  1 2016-05-17       c        2        1      0      1      0
 2:  4 2016-05-20       c        2        2      0      2      0
 3:  9 2016-05-25       c        2        3      0      3      0
 4: 11 2016-05-27       c        1        4      1      3      0
 5: 12 2016-05-28       c        1        5      2      3      0
 6: 14 2016-05-30       c        2        6      2      4      0
 7: 22 2016-06-07       c        2        7      2      5      0
 8: 26 2016-06-11       c        2        8      2      6      0
 9: 31 2016-06-16       c        3        9      2      6      1
10: 36 2016-06-21       c        1       10      3      6      1
11: 39 2016-06-24       c        1       11      4      6      1
12: 40 2016-06-25       c        1       12      5      6      1
13: 44 2016-06-29       c        2       13      5      7      1
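The reason cummax() can stand in for cumsum() in the chain above can be sketched on a single vector. After the first dcast with fill=0, each resp_k column holds the running count on rows where the response equals k and 0 everywhere else; cummax() then carries the last count forward. The vector below is made up for illustration, and the names sparse, carried and direct are assumptions.

```r
# Hypothetical responses for one subject, already in time order
response <- c(2L, 2L, 1L, 2L, 3L, 1L)
k <- 2L

# Running count on rows where response == k, 0 elsewhere
# (mimics what rCnt plus fill = 0 leaves in the resp_2 column)
sparse <- ifelse(response == k, cumsum(response == k), 0L)

# Carry the most recent count forward over the zero-filled rows
carried <- cummax(sparse)

# Direct running count, as in the cumsum-based answer
direct <- cumsum(response == k)

identical(carried, direct)
```

This works because the counts never decrease, so the running maximum of the sparse column is always the latest count seen.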


2016-05-17 01:38:09
