2016-02-29 62 views
0

Incorrect use of plyr when calculating values above the 95th percentile

My data is structured as follows:

Individ <- data.frame(Participant = c("Bill", "Bill", "Bill", "Bill", "Bill", "Bill", "Bill", "Bill", "Bill", "Bill", "Bill", "Bill", 
             "Harry", "Harry", "Harry", "Harry","Harry", "Harry", "Harry", "Harry", "Paul", "Paul", "Paul", "Paul"), 
         Time = c(1, 2, 3, 4, 1, 2, 3, 4, 1, 2, 3, 4, 1, 2, 3, 4, 1, 2, 3, 4, 1, 2, 3, 4), 
         Condition = c("Placebo", "Placebo", "Placebo", "Placebo", "Expr", "Expr", "Expr", "Expr", "Expr", "Expr", "Expr", "Expr", 
            "Placebo", "Placebo", "Placebo", "Placebo", "Expr", "Expr", "Expr", "Expr", "Expr", "Expr", "Expr", "Expr"), 
         Power = c(400, 250, 180, 500, 300, 450, 600, 512, 300, 500, 450, 200, 402, 210, 130, 520, 310, 451, 608, 582, 390, 570, NA, NA)) 

Using dplyr, I apply rolling means (over windows of 2 to 4 seconds) with the following code:

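library(dplyr)  # %>%, group_by, transmute, bind_cols
library(zoo)    # rollapply

for (summaryFunction in c("mean")) {
  for (i in seq(2, 4, by = 1)) {
    # Rolling summary of Power within each participant, window width i
    tempColumn <- Individ %>%
      group_by(Participant) %>%
      transmute(rollapply(Power,
                          width = i,
                          FUN = summaryFunction,
                          align = "right",
                          fill = NA,
                          na.rm = TRUE))
    # Column 1 is the grouping variable; column 2 holds the rolling values
    colnames(tempColumn)[2] <- paste("Rolling", summaryFunction, as.character(i), sep = ".")
    Individ <- bind_cols(Individ, tempColumn[2])
  }
}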

I now want to find the top 5% of Power for each Participant within each rolling-mean column. To calculate this, I use:

Output = ddply(Individ, .(Participant, Condition), summarise, 
      TwoSec <- Rolling.mean.2 > quantile(Rolling.mean.2 , 0.95, na.rm = TRUE)) 

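However, I end up with a column of TRUE/FALSE values. What I'm after instead is the actual values in the top 5%. How can I do this? Is there also an easier way to loop over each rolling-mean column and find the top 5% of each rolling mean by Participant and Condition?

Thanks!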

+0

Does this help? http://stackoverflow.com/questions/19608618/r-percentile-calculations-on-subsets-of-data – 2016-02-29 05:54:09

+0

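Yes, it is helpful, thank you for the link. However, how can I get all occurrences for each participant that are above the 95% quantile? I'm not interested in the other quantiles. – user2716568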

+0

If I understand your question correctly, with 'dplyr' you can get it with 'df %>% group_by(Participant) %>% filter(between(Power, ,1,na.rm = TRUE)))' – alistaire

Answers

1

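It's good that you already have your table of rolling means; that makes calculating the quantiles easier.

Step 1: Group by Participant, Condition, Location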

Individ <- data.frame(Participant = c("Bill", "Bill", "Bill", "Bill", "Bill", "Bill", "Bill", "Bill", "Bill", "Bill", "Bill", "Bill", 
             "Harry", "Harry", "Harry", "Harry","Harry", "Harry", "Harry", "Harry", "Paul", "Paul", "Paul", "Paul"), 
         Time = c(1, 2, 3, 4, 1, 2, 3, 4, 1, 2, 3, 4, 1, 2, 3, 4, 1, 2, 3, 4, 1, 2, 3, 4), 
         Condition = c("Placebo", "Placebo", "Placebo", "Placebo", "Expr", "Expr", "Expr", "Expr", "Expr", "Expr", "Expr", "Expr", 
            "Placebo", "Placebo", "Placebo", "Placebo", "Expr", "Expr", "Expr", "Expr", "Expr", "Expr", "Expr", "Expr"), 
         Location = c("Home", "Home", "Home", "Home", "Away", "Away", "Away", "Away", "Home", "Home", "Home", "Home", 
            "Home", "Home", "Home", "Home", "Away", "Away", "Away", "Away", "Home", "Home", "Home", "Home"), 
         Power = c(400, 250, 180, 500, 300, 450, 600, 512, 300, 500, 450, 200, 402, 210, 130, 520, 310, 451, 608, 582, 390, 570, NA, NA)) 


library(dplyr) 
library(zoo) 
for (summaryFunction in c("mean")) {
  for (i in seq(2, 4, by = 1)) {
    tempColumn <- Individ %>%
      group_by(Participant) %>%
      transmute(rollapply(Power,
                          width = i,
                          FUN = summaryFunction,
                          align = "right",
                          fill = NA,
                          na.rm = TRUE))
    colnames(tempColumn)[2] <- paste("Rolling", summaryFunction, as.character(i), sep = ".")
    Individ <- bind_cols(Individ, tempColumn[2])
  }
}


Individ 


    Participant Time Condition Location Power Rolling.mean.2 Rolling.mean.3 Rolling.mean.4 
     (fctr) (dbl) (fctr) (fctr) (dbl)   (dbl)   (dbl)   (dbl) 
1   Bill  1 Placebo  Home 400    NA    NA    NA 
2   Bill  2 Placebo  Home 250   325    NA    NA 
3   Bill  3 Placebo  Home 180   215  276.6667    NA 
4   Bill  4 Placebo  Home 500   340  310.0000   332.5 
5   Bill  1  Expr  Away 300   400  326.6667   307.5 
6   Bill  2  Expr  Away 450   375  416.6667   357.5 
7   Bill  3  Expr  Away 600   525  450.0000   462.5 
8   Bill  4  Expr  Away 512   556  520.6667   465.5 
9   Bill  1  Expr  Home 300   406  470.6667   465.5 
10  Bill  2  Expr  Home 500   400  437.3333   478.0 
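
As a quick sanity check on the first rows of that table, here is a minimal sketch of what rollapply does to Bill's first four Power readings on their own. The names bill_power, roll2 and roll3 are made up for this illustration and are not part of the code above.

```r
library(zoo)

# Bill's first four Power readings from the Individ data above
bill_power <- c(400, 250, 180, 500)

# Right-aligned windows: each result averages the current reading and the
# ones just before it; positions without a full window are filled with NA
roll2 <- rollapply(bill_power, width = 2, FUN = mean, align = "right", fill = NA)
roll3 <- rollapply(bill_power, width = 3, FUN = mean, align = "right", fill = NA)

print(roll2)
print(roll3)
```

roll2 and roll3 should line up with the Rolling.mean.2 and Rolling.mean.3 values in rows 1 to 4 of the printed table.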

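That gives you all 7 or 8 columns (this dataset includes Location, so it also covers your other question). With the new Individ dataset in place, here is what I did to solve your problem. I'm 100% sure there is a cleaner and more efficient way to do this, but the logic is all here and it should output nicely.

Step 2: Get the quantiles by group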

library(plyr) 
Individ[is.na(Individ)]<- 0 
Top_percentiles <- ddply(Individ, 
         c("Participant", "Condition", "Location"), 
         summarise, 
         Power2 = quantile(Rolling.mean.2, .95), 
         Power3 = quantile(Rolling.mean.3, .95), 
         Power4 = quantile(Rolling.mean.4, .95) 
         ) 

Top_percentiles 

    Participant Condition Location Power2 Power3 Power4 
1  Bill  Expr  Away 551.350 510.0667 465.050 
2  Bill  Expr  Home 464.650 465.6667 476.125 
3  Bill Placebo  Home 337.750 305.0000 282.625 
4  Harry  Expr  Away 585.175 533.4000 485.425 
5  Harry Placebo  Home 322.150 280.7667 268.175 
6  Paul  Expr  Home 556.500 556.5000 408.000 

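These are the top-5% thresholds for each group and the corresponding rolling mean.

Now the only thing left to do is to find the observations in the dataset that are above each threshold.

Step 3: Match the rolling-mean threshold columns to the original dataset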
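
One thing worth checking in step 2 is the effect of replacing NA with 0 before taking the quantiles. The sketch below uses only the Bill / Placebo / Home values of Rolling.mean.4 from the table above; the variable names are invented for this example, and whether dropping the NA windows is preferable is an assumption that depends on the analysis.

```r
# Bill / Placebo / Home values of Rolling.mean.4 from the table above
# (the first three windows are incomplete, hence NA)
rm4 <- c(NA, NA, NA, 332.5)

# What step 2 does: replace NA with 0, then take the quantile
zero_filled <- rm4
zero_filled[is.na(zero_filled)] <- 0
threshold_zero_filled <- unname(quantile(zero_filled, 0.95))

# Alternative (an assumption about what is wanted): ignore the NA windows
threshold_na_removed <- unname(quantile(rm4, 0.95, na.rm = TRUE))

print(threshold_zero_filled)
print(threshold_na_removed)
```

The zero-filled threshold reproduces the 282.625 shown for Power4 in the Bill / Placebo / Home row, while dropping the NA windows gives 332.5, so the zeros pull the threshold down for groups that start with incomplete windows.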

Something like this is roughly what I was fiddling around with.

Individ$Power2 <- Top_percentiles$Power2[match(Individ$Participant, Top_percentiles$Participant) && 
             match(Individ$Condition, Top_percentiles$Condition) && 
             match(Individ$Location, Top_percentiles$Location)] 

Individ$Power3 <- Top_percentiles$Power3[match(Individ$Participant, Top_percentiles$Participant) && 
              match(Individ$Condition, Top_percentiles$Condition) && 
              match(Individ$Location, Top_percentiles$Location)] 

Individ$Power4 <- Top_percentiles$Power4[match(Individ$Participant, Top_percentiles$Participant) && 
              match(Individ$Condition, Top_percentiles$Condition) && 
              match(Individ$Location, Top_percentiles$Location)] 


Individ 


    Participant Time Condition Location Power Rolling.mean.2 Rolling.mean.3 Rolling.mean.4 Power2 Power3 
     (fctr) (dbl) (fctr) (fctr) (dbl)   (dbl)   (dbl)   (dbl) (dbl) (dbl) 
1   Bill  1 Placebo  Home 400    0   0.0000   0.0 551.350 510.0667 
2   Bill  2 Placebo  Home 250   325   0.0000   0.0 464.650 465.6667 
3   Bill  3 Placebo  Home 180   215  276.6667   0.0 337.750 305.0000 
4   Bill  4 Placebo  Home 500   340  310.0000   332.5 585.175 533.4000 
5   Bill  1  Expr  Away 300   400  326.6667   307.5 322.150 280.7667 
6   Bill  2  Expr  Away 450   375  416.6667   357.5 556.500 556.5000 
7   Bill  3  Expr  Away 600   525  450.0000   462.5 551.350 510.0667 
8   Bill  4  Expr  Away 512   556  520.6667   465.5 464.650 465.6667 
9   Bill  1  Expr  Home 300   406  470.6667   465.5 337.750 305.0000 
10  Bill  2  Expr  Home 500   400  437.3333   478.0 585.175 533.4000 

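My idea was to match the quantile columns onto the Individ dataset.

Step 4: Filter the dataset

This should get you what you're after.

Option 1: Three separate datasets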
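
A caveat on the matching code in step 3: `&&` is not vectorised. It only works on single values (and recent versions of R raise an error when given longer inputs), so the index collapses to a single TRUE and the six thresholds get recycled down the rows instead of being matched by group. That is consistent with the repeating pattern in the Power2 column above. A possible alternative, sketched here on a small made-up subset, is to join the thresholds back on all three grouping columns with dplyr's left_join. The names obs, thresholds and matched are illustrative only.

```r
library(dplyr)

# Toy subset of rolling means (values taken from the tables above)
obs <- data.frame(
  Participant    = c("Bill", "Bill", "Bill", "Harry", "Harry", "Harry"),
  Condition      = c("Expr", "Expr", "Placebo", "Expr", "Expr", "Expr"),
  Location       = c("Away", "Away", "Home", "Away", "Away", "Away"),
  Rolling.mean.2 = c(375, 525, 325, 415, 529.5, 595),
  stringsAsFactors = FALSE
)

# One threshold per Participant / Condition / Location group.
# dplyr::summarise is called explicitly in case plyr is loaded and masks it.
thresholds <- obs %>%
  group_by(Participant, Condition, Location) %>%
  dplyr::summarise(Power2 = unname(quantile(Rolling.mean.2, 0.95, na.rm = TRUE)),
                   .groups = "drop")

# Attach each row's own group threshold; the row count stays the same as obs
matched <- left_join(obs, thresholds,
                     by = c("Participant", "Condition", "Location"))

print(matched)
```

Because the join is keyed on the grouping columns, every row receives the threshold of its own group regardless of how many rows the data has, which should also avoid length-mismatch errors on larger data frames.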

top_percentile_2sec <- Individ %>% filter(Rolling.mean.2 >= Power2) 
top_percentile_3sec <- Individ %>% filter(Rolling.mean.3 >= Power3) 
top_percentile_4sec <- Individ %>% filter(Rolling.mean.4 >= Power4) 

Option 2: One large combined dataset

top_percentile_all_times <- Individ %>% filter(Rolling.mean.2 >= Power2 | Rolling.mean.3 >= Power3 | Rolling.mean.4 >= Power4) 


top_percentile_all_times 

Participant Time Condition Location Power Rolling.mean.2 Rolling.mean.3 Rolling.mean.4 Power2 Power3 
     (fctr) (dbl) (fctr) (fctr) (dbl)   (dbl)   (dbl)   (dbl) (dbl) (dbl) 
1  Bill  1  Expr  Away 300   400.0  326.6667   307.50 322.15 280.7667 
2  Bill  4  Expr  Away 512   556.0  520.6667   465.50 464.65 465.6667 
3  Bill  1  Expr  Home 300   406.0  470.6667   465.50 337.75 305.0000 
4  Bill  3  Expr  Home 450   475.0  416.6667   440.50 322.15 280.7667 
5  Harry  1  Expr  Away 310   415.0  320.0000   292.50 322.15 280.7667 
6  Harry  3  Expr  Away 608   529.5  456.3333   472.25 551.35 510.0667 
7  Harry  4  Expr  Away 582   595.0  547.0000   487.75 464.65 465.6667 
8  Paul  3  Expr  Home  0   570.0  480.0000   0.00 322.15 280.7667 
9  Paul  4  Expr  Home  0   0.0  570.0000   480.00 556.50 556.5000 
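
For the part of the question about looping over every rolling-mean column, a more compact route may be to skip the separate threshold table and filter inside the groups directly. This is only a sketch on toy data: it assumes dplyr 1.0.4 or later (for if_any), groups by Participant only, and uses the invented names rolled and top_rows.

```r
library(dplyr)

# Toy data: two participants, two rolling-mean columns (NA = incomplete window)
rolled <- data.frame(
  Participant    = rep(c("Bill", "Harry"), each = 4),
  Rolling.mean.2 = c(NA, 325, 215, 340, NA, 306, 170, 325),
  Rolling.mean.3 = c(NA, NA, 830 / 3, 310, NA, NA, 742 / 3, 860 / 3)
)

# Keep rows where ANY rolling-mean column reaches that column's
# 95th percentile within the participant's own group
top_rows <- rolled %>%
  group_by(Participant) %>%
  filter(if_any(starts_with("Rolling.mean"),
                ~ .x >= quantile(.x, 0.95, na.rm = TRUE))) %>%
  ungroup()

print(top_rows)
```

Adding Condition or Location to group_by would give thresholds per combination, and swapping if_any for if_all would keep only rows that are in the top 5% for every window width.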

Here is a link that helped me a great deal.

how to calculate 95th percentile of values with grouping variable in R or Excel

Does this also solve the problem from your other post?

+0

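Thank you for taking the time to work out an answer to my question - I really appreciate it! When I run your code on a larger data frame (988,841 obs), step 3 returns the following error: 'Error in `$<-.data.frame`(`*tmp*`, "Power1", value = c(1.8886312245, : replacement has 11 rows, data has 988841' – user2716568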

+0

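If it doesn't give away any confidential information, could you provide a larger dummy dataset? It's hard for me to diagnose that error unless I can see what happens at each step. Posting screenshots of what happens after each step would help me or others solve this. It could also be a syntax error or typo on your part or mine, so please treat this with caution. – InfiniteFlashChess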

+0

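Unfortunately I can't provide the actual dataset because it is confidential. I managed to get past the problem above, but only by matching on "Name" and not on "Location"; with that change, your code gives me exactly what I'm after. In fact, I split the data by location from the start, which is a bit cumbersome but works well for the analysis (the end goal is to compare locations). Thank you very much for your help and support, I really appreciate it! – user2716568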