2016-11-30 51 views
1
FAMILY<- c('FAMILYA', 'FAMILYA', 'FAMILYA', 'FAMILYA', 'FAMILYA', 'FAMILYB', 'FAMILYB', 'FAMILYB', 'FAMILYB', 'FAMILYB', 'FAMILYC', 'FAMILYC', 'FAMILYC', 'FAMILYC', 'FAMILYC') 

CHILDREN<-c('JAKE', 'PETE', 'JASON', 'KEVIN', 'ALFRED','DALE', 'STEVE', 'MELISSA', 'DAN', 'THOMAS', 'CAIT', 'BRANDON', 'DEAN', 'ADAM', 'KELSEY') 

CHANGE<-c(1000, -1000, 2000, 3000, 5000, 100, 300, 1234, -1022, -1111, -1112, 1000, 1002, 2131, 1231) 

df1<-data.frame(FAMILY, CHILDREN, CHANGE) 

df1 

    FAMILY CHILDREN CHANGE
1  FAMILYA     JAKE   1000
2  FAMILYA     PETE  -1000
3  FAMILYA    JASON   2000
4  FAMILYA    KEVIN   3000
5  FAMILYA   ALFRED   5000
6  FAMILYB     DALE    100
7  FAMILYB    STEVE    300
8  FAMILYB  MELISSA   1234
9  FAMILYB      DAN  -1022
10 FAMILYB   THOMAS  -1111
11 FAMILYC     CAIT  -1112
12 FAMILYC  BRANDON   1000
13 FAMILYC     DEAN   1002
14 FAMILYC     ADAM   2131
15 FAMILYC   KELSEY   1231

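I want to transform this data frame so that it has 4 new columns: the first two showing 1) the child with the largest value and 2) the child with the second-largest value, and the last two showing 3) the child with the smallest value and 4) the child with the second-smallest value. In other words, the data frame should be reshaped down to the top 2 values at each end.

I would also like each child's change to appear next to that child.

The final format should look like this: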

FAMILY   TOTAL CHANGE  INCREASE #1    INCREASE #2   DECREASE #1    DECREASE #2
FAMILYA  10000         ALFRED: 5000   KEVIN: 3000   PETE: -1000    JAKE: 1000
FAMILYB  -499          MELISSA: 1234  STEVE: 300    THOMAS: -1111  DAN: -1022
FAMILYC  4252          ADAM: 2131     KELSEY: 1231  CAIT: -1112    BRANDON: 1000

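If you think it would be easier to put each value in a separate column next to its child, that works too; it is the concept that I need help executing.

Any help would be great, thanks!

Answers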

2
library(dplyr) 
library(tidyr) 

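# Helper: returns the y-th largest distinct value of x when y > 0,
# or the second-smallest distinct value when y < 0
myfun <- function(x, y) {
    u <- sort(unique(x), decreasing = TRUE)
    if (y < 0) {
        u[length(u) - 1]   # index into the de-duplicated vector u, not x
    } else {
        u[y]
    }
}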

df2 <- df1 %>% group_by(FAMILY) %>% 
     summarise(a1=CHILDREN[which(CHANGE == max(CHANGE))] , a2 = max(CHANGE), 
       b2 = myfun(CHANGE, 2)   , b1=CHILDREN[which(CHANGE == b2)] , 
       c1=CHILDREN[which(CHANGE == min(CHANGE))] , c2 = min(CHANGE), 
       d2 = myfun(CHANGE,-2)   , d1=CHILDREN[which(CHANGE == d2)]) 
#df2
#    FAMILY      a1    a2    b2     b1     c1    c2    d2      d1
#    <fctr>  <fctr> <dbl> <dbl> <fctr> <fctr> <dbl> <dbl>  <fctr>
# 1 FAMILYA  ALFRED  5000  3000  KEVIN   PETE -1000  1000    JAKE
# 2 FAMILYB MELISSA  1234   300  STEVE THOMAS -1111 -1022     DAN
# 3 FAMILYC    ADAM  2131  1231 KELSEY   CAIT -1112  1000 BRANDON

# a little clumsy here... would welcome a suggestion for a more efficient way of uniting
# (the column positions shift left by one after each unite)
df3 <- unite(df2, "A1", 2, 3, sep = ":")   # a1:a2
df4 <- unite(df3, "B1", 4, 3, sep = ":")   # b1:b2
df5 <- unite(df4, "C1", 4, 5, sep = ":")   # c1:c2
df6 <- unite(df5, "D1", 6, 5, sep = ":")   # d1:d2, named D1 to avoid a duplicate column name

#df6
#    FAMILY           A1          B1            C1           D1
#    <fctr>        <chr>       <chr>         <chr>        <chr>
# 1 FAMILYA  ALFRED:5000  KEVIN:3000    PETE:-1000    JAKE:1000
# 2 FAMILYB MELISSA:1234   STEVE:300  THOMAS:-1111    DAN:-1022
# 3 FAMILYC    ADAM:2131 KELSEY:1231    CAIT:-1112 BRANDON:1000

Note: I forgot to include the TOTAL_CHANGE column. Add TOTAL_CHANGE = sum(CHANGE) to summarise() and add +1 to the column indices in the unite() calls.
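The answer above asks for a less clumsy way to do the uniting. One possible sketch, not the answerer's method, sorts each family by CHANGE first and builds the labels with paste() inside summarise(), so no unite() calls or column indices are needed. The output column names (TOTAL_CHANGE, INCREASE_1, and so on) are assumptions chosen to mirror the layout requested in the question, and the sketch assumes every family has at least two children.

```r
library(dplyr)

# Same data as in the question, rebuilt here so the sketch is self-contained
df1 <- data.frame(
  FAMILY   = rep(c("FAMILYA", "FAMILYB", "FAMILYC"), each = 5),
  CHILDREN = c("JAKE", "PETE", "JASON", "KEVIN", "ALFRED",
               "DALE", "STEVE", "MELISSA", "DAN", "THOMAS",
               "CAIT", "BRANDON", "DEAN", "ADAM", "KELSEY"),
  CHANGE   = c(1000, -1000, 2000, 3000, 5000, 100, 300, 1234, -1022, -1111,
               -1112, 1000, 1002, 2131, 1231),
  stringsAsFactors = FALSE
)

res <- df1 %>%
  group_by(FAMILY) %>%
  # Sort within each family, largest change first
  arrange(desc(CHANGE), .by_group = TRUE) %>%
  summarise(
    TOTAL_CHANGE = sum(CHANGE),
    # Rows 1 and 2 are the two largest changes after sorting
    INCREASE_1 = paste(CHILDREN[1], CHANGE[1], sep = ":"),
    INCREASE_2 = paste(CHILDREN[2], CHANGE[2], sep = ":"),
    # The last and second-to-last rows are the two smallest changes
    DECREASE_1 = paste(CHILDREN[n()], CHANGE[n()], sep = ":"),
    DECREASE_2 = paste(CHILDREN[n() - 1], CHANGE[n() - 1], sep = ":")
  )

res
```

Unlike the which(CHANGE == ...) lookups, positional indexing returns exactly one child per slot even when two children share the same change, at the cost of picking one of the tied children arbitrarily.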

+0

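Thanks for the feedback, I really like that you are using dplyr for this. If you compare against the example above, though, I don't think the values are correct.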

+0

Oh my! How did I miss that! I'm sorry... fixing it!

+0

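I think the max-1 and min-1 were being taken from the whole list rather than within each group, which caused the error. And thanks for noticing!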

1

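Here is an approach that uses a custom function and do (from dplyr) to apply it to each family group. The custom function also uses dplyr.

First, the custom function orders the labelled changes from largest to smallest. It then returns the total change (the sum) together with the first two and the last two entries in that order. The result has to be returned as a data.frame for it to work properly with do.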

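library(dplyr)  # provides %>% and do

myFamFunction <- function(CHILDREN, CHANGE){ 
    # Label each child with its change, ordered from largest to smallest change
    toOut <- 
    paste(CHILDREN, CHANGE, sep = ": ")[order(CHANGE, decreasing = TRUE)] 

    # Total first, then the top two and bottom two labels, as a one-row data.frame
    # (c() coerces the total to character alongside the labels)
    c(sum(CHANGE) 
    , head(toOut, 2) 
    , tail(toOut, 2)) %>% 
    rbind() %>% 
    data.frame(stringsAsFactors = FALSE) %>% 
    setNames(c("Total Change" 
       , "Biggest Change" 
       , "Second Biggest Change" 
       , "Second Smallest Change" 
       , "Smallest Change")) 
} 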

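Note that this may throw an error if a family has fewer than 2 children (and with fewer than 4, the result is already questionable). If your real data are more complex, telling us what you want to happen in those cases would make it possible to guard against these edge cases.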
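As one illustration of guarding those edge cases, the sketch below pads any missing position with NA instead of failing. The name mySafeFamFunction and the choice of NA as the filler are assumptions, not part of the answer. It also keeps the total numeric rather than coercing it to character.

```r
# Hypothetical variant of myFamFunction that tolerates small families
mySafeFamFunction <- function(CHILDREN, CHANGE) {
  # Label each child with its change, ordered from largest to smallest change
  labelled <- paste(CHILDREN, CHANGE, sep = ": ")[order(CHANGE, decreasing = TRUE)]
  n <- length(labelled)

  # Return the label at position i, or NA when that position does not exist
  pick <- function(i) if (i >= 1 && i <= n) labelled[i] else NA_character_

  data.frame(
    `Total Change`           = sum(CHANGE),
    `Biggest Change`         = pick(1),
    `Second Biggest Change`  = pick(2),
    `Second Smallest Change` = pick(n - 1),
    `Smallest Change`        = pick(n),
    check.names = FALSE,       # keep the spaces in the column names
    stringsAsFactors = FALSE
  )
}

# A family with a single child: the "second" slots come back as NA
mySafeFamFunction("SOLO", 42)
```

With fewer than 4 children the same child still appears in more than one column, so whether NA, a repeated entry, or dropping the family is the right behaviour depends on the real data.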

Then just group_by, pass the columns you want into the function, and voilà:

df1 %>% 
    group_by(FAMILY) %>% 
    do(myFamFunction(.$CHILDREN, .$CHANGE)) 

Returns:

FAMILY `Total Change` `Biggest Change` `Second Biggest Change` `Second Smallest Change` `Smallest Change` 
    <fctr>   <chr>   <chr>     <chr>     <chr>    <chr> 
1 FAMILYA   10000  ALFRED: 5000    KEVIN: 3000    JAKE: 1000  PETE: -1000 
2 FAMILYB   -499 MELISSA: 1234    STEVE: 300    DAN: -1022  THOMAS: -1111 
3 FAMILYC   4252  ADAM: 2131   KELSEY: 1231   BRANDON: 1000  CAIT: -1112