2016-02-13

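Reweighting indices by aggregating and summing in R

I have a ton of price data indexed by state, date, and UPC (product code). I want to aggregate away the UPC and combine the prices with a weighted average. I'll do my best to explain it, but you may just want to read the code below.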

Each observation in the data set consists of: UPC, date, state, price, and weight. I want to aggregate away the UPC index like this:

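Take all data points with the same date and state, multiply each price by its weight, and sum the results. This creates a weighted average, which I call priceIndex. However, for some date & state combinations the weights do not add up to 1. So I want to create two additional columns. The first is the sum of the weights for each date & state combination. The second is a reweighted average: that is, if the two original weights were .5 and .3, change them to .5/(.5+.3) = .625 and .3/(.5+.3) = .375, and then recompute the weighted average as another price index.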
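As a minimal sketch of that reweighting step in base R: the weights .5 and .3 come from the description above, while the two prices (10 and 20) and the variable names are made up purely for illustration.

```r
# Hypothetical prices for two products; the weights are the .5 and .3 from the text.
p <- c(10, 20)
w <- c(0.5, 0.3)

# Rescale the weights so that they sum to 1: .625 and .375.
w_norm <- w / sum(w)

# Weighted average using the rescaled weights.
reweighted <- sum(p * w_norm)

# Dividing the raw weighted sum by the total weight gives the same number,
# and base R's weighted.mean() normalizes the weights in the same way.
all.equal(reweighted, sum(p * w) / sum(w))
all.equal(reweighted, weighted.mean(p, w))
```

So the reweighted index is simply the raw weighted sum divided by the total weight of the group.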

This is what I mean:

upc=c(1153801013,1153801013,1153801013,1153801013,1153801013,1153801013,2105900750,2105900750,2105900750,2105900750,2105900750,2173300001,2173300001,2173300001,2173300001) 
date=c(200601,200602,200603,200603,200601,200602,200601,200602,200603,200601,200602,200601,200602,200603,200601) 
price=c(26,28,27,27,23,24,85,84,79.5,81,78,24,19,98,47) 
state=c(1,1,1,2,2,2,1,1,2,2,2,1,1,1,2) 
weight=c(.3,.2,.6,.4,.4,.5,.5,.5,.45,.15,.5,.2,.15,.3,.45) 

# This is what I have: 
data <- data.frame(upc,date,state,price,weight) 
data 

# These are a few of the weighted calculations: 
# .3*26+85*.5+24*.2 = 55.1 
# 28*.2+84*.5+19*.15 = 50.45 
# 27*.6+98*.3 = 45.6 
# Etc. etc. 

# Here is the reweighted calculation for date=200602 & state==1: 
# 28*(.2/.85)+84*(.5/.85)+19*(.15/.85) = 59.35294 
# Or, equivalently: 
# (28*.2+84*.5+19*.15)/.85 = 59.35294 

# This is what I want: 
date=c(200601,200602,200603,200601,200602,200603) 
state=c(1,1,1,2,2,2) 
priceIndex=c(55.1,50.45,45.6,42.5,51,46.575) 
totalWeight=c(1,.85,.9,1,1,.85) 
reweightedIndex=c(55.1,59.35294,50.66667,42.5,51,54.79412) 
index <- data.frame(date,state,priceIndex,totalWeight,reweightedIndex) 
index 

Also, not that it should matter, but there are 35 states, 150 UPCs, and 84 dates in the data set, so there are a lot of observations.

Thanks a lot.

Answer


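We can use one of the group-by summarise operations. With data.table, we convert the 'data.frame' to a 'data.table' (setDT(data)), group by 'date' and 'state', compute the sum of the product of 'price' and 'weight' and sum(weight) as temporary variables, and then build the 3 variables from them in a list.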

library(data.table) 
setDT(data)[, {
    tmp1 <- sum(price * weight)   # weighted sum of prices within the group
    tmp2 <- sum(weight)           # total weight within the group
    list(priceIndex = tmp1, totalWeight = tmp2,
         reweightedIndex = tmp1 / tmp2)
}, by = .(date, state)]
# date state priceIndex totalWeight reweightedIndex 
#1: 200601  1  55.100  1.00  55.10000 
#2: 200602  1  50.450  0.85  59.35294 
#3: 200603  1  45.600  0.90  50.66667 
#4: 200603  2  46.575  0.85  54.79412 
#5: 200601  2  42.500  1.00  42.50000 
#6: 200602  2  51.000  1.00  51.00000 

Or, with dplyr, we can use summarise to create the 3 columns after grouping by 'date' and 'state'.

library(dplyr) 
data %>% 
    group_by(date, state) %>% 
    summarise(priceIndex = sum(price*weight), 
      totalWeight = sum(weight), 
      reweightedIndex = priceIndex/totalWeight) 
# date state priceIndex totalWeight reweightedIndex 
# (dbl) (dbl)  (dbl)  (dbl)   (dbl) 
#1 200601  1  55.100  1.00  55.10000 
#2 200601  2  42.500  1.00  42.50000 
#3 200602  1  50.450  0.85  59.35294 
#4 200602  2  51.000  1.00  51.00000 
#5 200603  1  45.600  0.90  50.66667 
#6 200603  2  46.575  0.85  54.79412 
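For comparison, the same grouped sums can be sketched in base R without extra packages. This is not part of the original answer; it assumes the question's data, recreated here so the block runs on its own, and the helper column `wprice` is a name introduced only for this illustration.

```r
# Recreate the question's data so the sketch runs on its own.
data <- data.frame(
  upc    = c(rep(1153801013, 6), rep(2105900750, 5), rep(2173300001, 4)),
  date   = c(200601, 200602, 200603, 200603, 200601, 200602, 200601, 200602,
             200603, 200601, 200602, 200601, 200602, 200603, 200601),
  state  = c(1, 1, 1, 2, 2, 2, 1, 1, 2, 2, 2, 1, 1, 1, 2),
  price  = c(26, 28, 27, 27, 23, 24, 85, 84, 79.5, 81, 78, 24, 19, 98, 47),
  weight = c(.3, .2, .6, .4, .4, .5, .5, .5, .45, .15, .5, .2, .15, .3, .45)
)

# Weighted price per row (hypothetical helper column).
data$wprice <- data$price * data$weight

# Sum the weighted prices and the weights within each date/state group.
index <- aggregate(cbind(wprice, weight) ~ date + state, data = data, FUN = sum)
names(index)[names(index) == "wprice"] <- "priceIndex"
names(index)[names(index) == "weight"] <- "totalWeight"

# Reweighted index: the weighted sum divided by the total weight of the group.
index$reweightedIndex <- index$priceIndex / index$totalWeight
index
```

With 35 states, 150 UPCs, and 84 dates, the data.table version is likely to be the fastest of the three, but for data of that size any of them should be workable.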

For the dplyr one, I only get a single row when I run it? – ejn


@ejn You can use 'dplyr::summarise' (in case you have also loaded 'plyr'). – akrun
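A likely explanation for the single row, assuming plyr was attached after dplyr: plyr also exports a function named summarise, which then masks the dplyr one and does not respect dplyr's grouping. Calling the function through its namespace sidesteps the masking. The sketch below uses a small made-up data frame (`toy`, with a grouping column `g`) rather than the question's data.

```r
library(dplyr)

# Hypothetical toy data, not the question's data set.
toy <- data.frame(g      = c("a", "a", "b"),
                  price  = c(10, 20, 30),
                  weight = c(0.5, 0.3, 1))

# The explicit dplyr:: prefix makes sure the grouped summarise is used,
# regardless of which package was attached last.
res <- toy %>%
  group_by(g) %>%
  dplyr::summarise(priceIndex      = sum(price * weight),
                   totalWeight     = sum(weight),
                   reweightedIndex = priceIndex / totalWeight)

nrow(res)  # one row per group
```

Another option, if both packages are needed, is to attach plyr before dplyr so that the dplyr versions end up on top.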