2016-07-09 51 views
2

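Simplify a data frame by year and calculate percent change

I have two questions. What resources would you recommend reading to improve data-manipulation skills? I have been working with larger and larger datasets and struggling to adapt. I feel like I'm hitting a brick wall and don't know where to look (many online resources get too complex without building up the foundations).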

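As an example, here is the problem I'm trying to solve. I have a DF with millions of rows that I want to simplify so I can analyze a trend. A small sample is below. I want to isolate each ID and take the minimum Value for each year. (Some IDs have years that other IDs don't.) After simplifying the data, I want to add a percent-change column. Since this is a time series spanning 20+ years, I can ignore the months: comparing one year's minimum with the next year's minimum should give a reasonable percent change.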

Thanks!

Input:

structure(list(ID = structure(c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 
1L, 1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 
2L, 2L), .Label = c("a", "b"), class = "factor"), Date = structure(c(1L, 
2L, 3L, 4L, 5L, 6L, 10L, 12L, 14L, 7L, 8L, 9L, 11L, 13L, 5L, 
6L, 10L, 12L, 14L, 7L, 8L, 9L, 11L, 13L, 15L, 16L), .Label = c("2/21/2009", 
"2/22/2009", "2/23/2009", "2/24/2009", "2/25/2009", "2/26/2009", 
"3/2/2011", "3/3/2011", "3/4/2011", "3/5/2010", "3/5/2011", "3/6/2010", 
"3/6/2011", "3/7/2010", "3/7/2011", "3/8/2011"), class = "factor"), 
    Year = c(2009L, 2009L, 2009L, 2009L, 2009L, 2009L, 2010L, 
    2010L, 2010L, 2011L, 2011L, 2011L, 2011L, 2011L, 2009L, 2009L, 
    2010L, 2010L, 2010L, 2011L, 2011L, 2011L, 2011L, 2011L, 2011L, 
    2011L), Value = c(10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 
    20, 21, 22, 5, 6, 7, 8, 8, 9, 10, 11, 12, 15, 23, 25, 27)), .Names = c("ID", 
"Date", "Year", "Value"), class = "data.frame", row.names = c(NA, 
-26L)) 

Expected output:

structure(list(ID = structure(c(1L, 1L, 1L, 2L, 2L, 2L), .Label = c("a", 
"b"), class = "factor"), Date = structure(c(1L, 4L, 5L, 2L, 4L, 
3L), .Label = c("2/21/2009", "2/25/2009", "3/2/2011", "3/5/2010", 
"3/6/2011"), class = "factor"), Year = c(2009L, 2010L, 2011L, 
2009L, 2010L, 2011L), Value = c(10, 16, 5, 6, 8, 10), Percent.Increase = c(NA, 
0.6, -0.6875, NA, 0.333333333, 0.25)), .Names = c("ID", "Date", 
"Year", "Value", "Percent.Increase"), class = "data.frame", row.names = c(NA, 
-6L)) 
+1

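As for what to read, the data.table vignettes (https://github.com/Rdatatable/data.table/wiki/Getting-started) are a good start. For guidance on how to think about organizing data, I'd recommend Hadley's paper (https://www.jstatsoft.org/article/view/v059i10), even though it doesn't use data.table syntax. – Frank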

Answers

3

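After grouping by 'ID' and 'Year', we slice the row with the min 'Value' within each group. Then, grouping by 'ID' alone, we create 'Percent.Increase' by subtracting the lag of 'Value' from 'Value' and dividing by the lag of 'Value'.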

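library(dplyr)

res <- df1 %>% 
     group_by(ID, Year) %>%          # one group per ID-year pair
     slice(which.min(Value)) %>%     # keep the row holding that group's minimum Value
     group_by(ID) %>%                # regroup so lag() never crosses IDs
     mutate(Percent.Increase = (Value - lag(Value)) / lag(Value)) 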
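
As a cross-check of the steps described above, here is a minimal base-R sketch (not the dplyr code from the answer). It rebuilds only the ID, Year, and Value columns of the question's sample data, takes each ID-year minimum with aggregate(), and uses ave() to fetch the previous year's minimum within each ID. The names mins and prev are chosen for illustration, and the Date column is dropped for brevity, so this is not a drop-in replacement.

```r
# Sample data from the question, without the Date column (an assumption made for brevity)
df1 <- data.frame(
  ID    = rep(c("a", "b"), times = c(14, 12)),
  Year  = rep(c(2009, 2010, 2011, 2009, 2010, 2011), times = c(6, 3, 5, 2, 3, 7)),
  Value = c(10:22, 5, 6, 7, 8, 8, 9, 10, 11, 12, 15, 23, 25, 27)
)

# Step 1: minimum Value per ID-year pair
mins <- aggregate(Value ~ ID + Year, data = df1, FUN = min)

# Sort so that years run in order within each ID
mins <- mins[order(mins$ID, mins$Year), ]

# Step 2: previous year's minimum within each ID (NA for the first year of an ID)
prev <- ave(mins$Value, mins$ID, FUN = function(v) c(NA, head(v, -1)))

# Step 3: percent change relative to the previous year's minimum
mins$Percent.Increase <- (mins$Value - prev) / prev

print(mins)
```

For ID "a" the minima are 10, 16, and 5, so the changes are (16 - 10) / 10 = 0.6 and (5 - 16) / 16 = -0.6875, which agrees with the expected output in the question.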
+0

It's crazy how easily you did that. Thanks @akrun! Is there a particular resource or method you'd recommend for learning dplyr? – sammyramz

+1

@sammyramz I think the best way to learn is to practice, make mistakes, and learn from them, and of course to read the official documentation. – akrun

2

Until a HAVING clause is implemented in data.table, this seems to be a very efficient approach:

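# Inner call: .I gives row numbers in dt, so V1 holds the row of each ID-Year minimum
dt[dt[, .I[which.min(Value)], keyby = .(ID, Year)]$V1 
    ][, Percent_Increase := { 
     tmp <- shift(Value)   # previous year's minimum within the same ID
     (Value-tmp)/tmp 
    }, by = .(ID)] 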
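
To make the `.I` idiom easier to follow, here is a small sketch on a made-up five-row table (the data and the names idx, rows, and Pct are illustrative and not from the thread). It assumes the data.table package is installed. The inner call returns, for each ID-Year group, the row number in the full table where Value is smallest. Indexing with those row numbers then keeps whole rows, including columns that were not part of the grouping.

```r
library(data.table)

# Toy data, invented for illustration
dt <- data.table(
  ID    = c("a", "a", "a", "b", "b"),
  Year  = c(2009L, 2009L, 2010L, 2009L, 2009L),
  Value = c(3, 1, 7, 5, 4)
)

# .I holds row numbers of dt, so this yields one row number per group;
# keyby also sorts the result by ID and Year
idx <- dt[, .I[which.min(Value)], keyby = .(ID, Year)]
print(idx$V1)   # row numbers 2, 3, 5

# Subset the original table to those rows
rows <- dt[idx$V1]

# Percent change against the previous row within each ID
rows[, Pct := {
  tmp <- shift(Value)   # lagged Value; NA for the first row of each ID
  (Value - tmp) / tmp
}, by = ID]
print(rows)
```

Here ID "a" goes from a minimum of 1 in 2009 to 7 in 2010, so Pct is (7 - 1) / 1 = 6, while ID "b" has a single year and gets NA.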

Checking timings on 5e7 rows:

library(dplyr) 
library(data.table) 
N = 5e7 
set.seed(1) 
df = data.frame(ID = sample(2L, N, TRUE), 
       Date = sample(16L, N, TRUE), 
       Year = sample(2009:2011, N, TRUE), 
       Value = sample(N/10, N, TRUE)) 
dt = as.data.table(df) 
system.time(
    res <- df %>% 
     group_by(ID, Year) %>% 
     slice(which.min(Value)) %>% 
     group_by(ID) %>% 
     mutate(Percent_Increase = (Value-lag(Value))/lag(Value))  
) 
# user system elapsed 
# 1.676 2.176 3.847 
system.time(
    r <- dt[dt[, .I[which.min(Value)],, .(ID, Year)]$V1, 
      ][, Percent_Increase := { 
       tmp <- shift(Value) 
       (Value-tmp)/tmp 
      }, .(ID)] 
) 
# user system elapsed 
# 0.940 0.460 1.334 
all.equal(r, as.data.table(res), ignore.col.order = TRUE, check.attributes = FALSE, ignore.row.order = TRUE) 
#[1] TRUE