2015-10-20 37 views

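Create a % change column in a long-format data set

I am trying to find the most efficient way to compute the period-over-period percentage change in a long-format data set. Here is an example of the format: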

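library(reshape2) # melt() is assumed to come from reshape2; the post does not say
set.seed(1234)
df <- data.frame(Date=2001:2010, CompanyA=rnorm(10,0,1), CompanyB=rnorm(10,1,2), CompanyC=rnorm(10,-1,2))
longdf <- melt(df, id.vars="Date") # wide to long: one row per (Date, company)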

The resulting table looks like this:

Date variable  value 
1 2001 CompanyA -1.20706575 
2 2002 CompanyA 0.27742924 
3 2003 CompanyA 1.08444118 
4 2004 CompanyA -2.34569770 
5 2005 CompanyA 0.42912469 
6 2006 CompanyA 0.50605589 
7 2007 CompanyA -0.57473996 
8 2008 CompanyA -0.54663186 
9 2009 CompanyA -0.56445200 
10 2010 CompanyA -0.89003783 
11 2001 CompanyB 0.04561460 
12 2002 CompanyB -0.99677289 
13 2003 CompanyB -0.55250779 
14 2004 CompanyB 1.12891763 
15 2005 CompanyB 2.91898812 
16 2006 CompanyB 0.77942901 
17 2007 CompanyB -0.02201901 
18 2008 CompanyB -0.82239083 
19 2009 CompanyB -0.67434336 
20 2010 CompanyB 5.83167036 
21 2001 CompanyC -0.73182356 
22 2002 CompanyC -1.98137179 
23 2003 CompanyC -1.88109574 
24 2004 CompanyC -0.08082112 
25 2005 CompanyC -2.38744049 
26 2006 CompanyC -3.89640982 
27 2007 CompanyC 0.14951144 
28 2008 CompanyC -3.04731145 
29 2009 CompanyC -1.03027660 
30 2010 CompanyC -2.87189720 

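What I want is to add a fourth column showing the % change in each company's score from one period to the next.

I can create this column with the following code: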

for (c in unique(longdf$variable)) {
    for (y in unique(longdf$Date)[-1]) {
        cur  <- longdf[longdf$variable == c & longdf$Date == y, "value"]     # this period
        prev <- longdf[longdf$variable == c & longdf$Date == y - 1, "value"] # previous period
        longdf$change[longdf$variable == c & longdf$Date == y] <- (cur - prev) / abs(prev)
    }
}
longdf 

The resulting table looks like this:

Date variable  value  change 
1 2001 CompanyA -1.20706575   NA 
2 2002 CompanyA 0.27742924 1.22983772 
3 2003 CompanyA 1.08444118 2.90889283 
4 2004 CompanyA -2.34569770 -3.16304743 
5 2005 CompanyA 0.42912469 1.18294117 
6 2006 CompanyA 0.50605589 0.17927471 
7 2007 CompanyA -0.57473996 -2.13572427 
8 2008 CompanyA -0.54663186 0.04890578 
9 2009 CompanyA -0.56445200 -0.03259990 
10 2010 CompanyA -0.89003783 -0.57681757 
11 2001 CompanyB 0.04561460   NA 
12 2002 CompanyB -0.99677289 -22.85205787 
13 2003 CompanyB -0.55250779 0.44570343 
14 2004 CompanyB 1.12891763 3.04326103 
15 2005 CompanyB 2.91898812 1.58565198 
16 2006 CompanyB 0.77942901 -0.73297972 
17 2007 CompanyB -0.02201901 -1.02825018 
18 2008 CompanyB -0.82239083 -36.34912573 
19 2009 CompanyB -0.67434336 0.18002082 
20 2010 CompanyB 5.83167036 9.64792433 
21 2001 CompanyC -0.73182356   NA 
22 2002 CompanyC -1.98137179 -1.70744467 
23 2003 CompanyC -1.88109574 0.05060941 
24 2004 CompanyC -0.08082112 0.95703509 
25 2005 CompanyC -2.38744049 -28.53981030 
26 2006 CompanyC -3.89640982 -0.63204479 
27 2007 CompanyC 0.14951144 1.03837159 
28 2008 CompanyC -3.04731145 -21.38179426 
29 2009 CompanyC -1.03027660 0.66190637 
30 2010 CompanyC -2.87189720 -1.78750114 

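The problem with the code above is that it seems very inefficient: it rescans the whole data frame for every company and year. The data frame I am actually working with will have millions of rows. Is there a more efficient way to create a % change column for long-format data?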

Answers


With dplyr, you can group by the variable factor and then use the lag function:

library(dplyr) 

longdf %>% 
    group_by(variable) %>% 
    mutate(change = value/lag(value) - 1) 

# Source: local data frame [30 x 4] 
# Groups: variable [3] 
# 
#  Date variable  value  change 
# (int) (fctr)  (dbl)  (dbl) 
# 1 2001 CompanyA -1.2070657   NA 
# 2 2002 CompanyA 0.2774292 -1.22983772 
# 3 2003 CompanyA 1.0844412 2.90889283 
# 4 2004 CompanyA -2.3456977 -3.16304743 
# 5 2005 CompanyA 0.4291247 -1.18294117 
# 6 2006 CompanyA 0.5060559 0.17927471 
# 7 2007 CompanyA -0.5747400 -2.13572427 
# 8 2008 CompanyA -0.5466319 -0.04890578 
# 9 2009 CompanyA -0.5644520 0.03259990 
# 10 2010 CompanyA -0.8900378 0.57681757 
# .. ...  ...  ...   ... 
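
Note that this output differs in sign from the loop's result wherever the previous value is negative (for example CompanyA in 2002: -1.2298 here versus 1.2298 in the question), because value/lag(value) - 1 divides by the signed previous value while the loop divides by its absolute value. If the loop's definition is the one wanted, a minimal sketch along the same lines might look like this. The small data frame is only a stand-in for longdf built from the first rows of the question's table, and the arrange() step is an added precaution in case the rows are not already sorted by Date within each company.

```r
library(dplyr)

# Stand-in for longdf: first three CompanyA and CompanyB rows from the question
longdf <- data.frame(
  Date     = c(2001, 2002, 2003, 2001, 2002, 2003),
  variable = factor(rep(c("CompanyA", "CompanyB"), each = 3)),
  value    = c(-1.20706575, 0.27742924, 1.08444118,
                0.04561460, -0.99677289, -0.55250779)
)

result <- longdf %>%
  group_by(variable) %>%
  arrange(Date, .by_group = TRUE) %>%                        # lag() relies on row order
  mutate(change = (value - lag(value)) / abs(lag(value))) %>% # same formula as the loop
  ungroup()

print(result)
```

The first row of each company is NA because there is no previous period to compare against.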

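Thanks very much for this. I found a function called Delt in the "quantmod" package that seems to do something similar. I will give both a try and see which is faster. Thanks again!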
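
For the speed comparison mentioned in the comment, a third candidate worth timing on large data might be data.table, which is not part of this thread and is offered here only as an assumed alternative. The sketch below builds a hypothetical data set (the sizes n_companies and years are made up for illustration), computes the change with the question's formula using shift(), and wraps the step in system.time() so it can be compared against the other approaches on the same machine.

```r
library(data.table)

set.seed(1234)
n_companies <- 1000      # hypothetical size, for illustration only
years <- 1901:2000       # hypothetical period range

dt <- data.table(
  Date     = rep(years, times = n_companies),
  variable = rep(paste0("Company", seq_len(n_companies)), each = length(years)),
  value    = rnorm(n_companies * length(years))
)

# shift() lags by row position, so sort by company and date first
setorder(dt, variable, Date)

timing <- system.time(
  dt[, change := (value - shift(value)) / abs(shift(value)), by = variable]
)
print(timing)
```

The := operator adds the column by reference rather than copying the table, which is one reason this style is often chosen for data with millions of rows. Actual timings depend on the data and machine, so none are quoted here.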