2016-10-03

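Reshape a long data frame to a wide data frame with the merge function, grouped by two factors

I want to reshape a long data frame into a wide data frame, using two grouping variables (resp and company) and three numeric response variables (Quality, Amount, Sense). I tried the dcast function, but I could not get it to group by two variables. Can anyone help?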

# Current long data frame: two grouping variables (resp & company), three numerical response variables (Quality, Amount, Sense)
resp <- c(1325851107,1325851108,1325851109,1325851107,1325851108,1325851109,1325851107,1325851108,1325851109,1325851107,1325851108,1325851109) 
company <- c("Dark.nl","Dark.nl","Dark.nl","Dark.nl","Dark.nl","Dark.nl","Manual.nl","Manual.nl","Manual.nl","Dark.nl","Dark.nl","Dark.nl") 
question <- c("Quality","Quality","Quality","Amount","Amount","Amount","Quality","Quality","Quality","Sense","Sense","Sense") 
score <- c(4,1,2,6,8,10,5,5,7,4,6,7) 
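# Free-text answers; values reconstructed from the output printed in the answers
answer <- c("Didn't like","Was ok","Sure","Maybe","Fine","Not bad","Fine","No, thank you","Why not","Nice","Ok","Yuk")
current <- data.frame(resp,company,question,score,answer); current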

#Desired wide dataframe 
resp2 <- c(1325851107,1325851107,1325851108,1325851108,1325851109,1325851109) 
company2 <- c("Dark.nl","Manual.nl","Dark.nl","Manual.nl","Dark.nl","Manual.nl") 
Quality <- c(4,5,1,5,2,7) 
Amount <- c(6,NA,8,NA,10,NA) 
Sense <- c(4,NA,6,NA,7,NA) 
desired <- data.frame(resp2,company2,Quality,Amount,Sense); desired 

#Using dcast function to reshape 
library("reshape2") 
dcast(current, resp + company ~ question, value.var="score") 

The merge approach suggested by Parfait works. Here is the script that did the trick (thanks, Parfait ;)).

cols2keep <- c("resp", "company", "score") 
df <- merge(current[current$question=='Quality', cols2keep], #merge two dataframes 
     current[current$question=='Amount', cols2keep], 
     by=c("resp", "company"), all=TRUE) 

df <- merge(df, current[current$question=='Sense', cols2keep], # merge third response variable into the new data frame
     by=c("resp", "company"), all=TRUE) 
colnames(df) <- c("resp","company","quality","amount","sense") 

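This solution works, but my real data set has 53 response variables, so this approach is very time-consuming. I tried Parfait's iterative approach, but I get the following error.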

dfList <- lapply(unique(current$question), function(i){ 
temp <- setNames(current[current$question==i, c("resp", "company", "score")], 
       c("resp", "company", paste0(i))) 
}) 

finaldf <- Reduce(function(...) merge(..., y=c("resp", "company"), all=T), dfList) 
Error in fix.by(by.x, x) : 
'by' must specify one or more columns as numbers, names or logical 

I am fairly new to R and cannot work out what I did wrong. I am happy with the current solution, but if there is a more efficient one, I am open to it.

Answers

1

Consider merging filtered subsets:

cols2keep <- c("resp", "company", "score", "answer") 

df <- merge(current[current$question=='Quality', cols2keep], 
      current[current$question=='Amount', cols2keep], 
      by=c("resp", "company"), all=TRUE) 

colnames(df) <- c("resp", "company", "quality", "quality_a", "amount", "amount_a")  
df 

#         resp   company quality     quality_a amount amount_a
# 1 1325851107   Dark.nl       4   Didn't like      6    Maybe
# 2 1325851107 Manual.nl       5          Fine     NA     <NA>
# 3 1325851108   Dark.nl       1        Was ok      8     Fine
# 4 1325851108 Manual.nl       5 No, thank you     NA     <NA>
# 5 1325851109   Dark.nl       2          Sure     10  Not bad
# 6 1325851109 Manual.nl       7       Why not     NA     <NA>
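
The manual colnames() step is needed because both subsets carry columns named score and answer, which merge() disambiguates with .x/.y suffixes by default. As a minimal sketch (the toy data frames quality and amount are illustrative, not the poster's data), the suffixes argument of merge() can label them in the same call:

```r
# Toy subsets that share the key columns and the non-key column names
quality <- data.frame(resp = c(1, 2, 3),
                      company = c("A", "A", "B"),
                      score = c(4, 1, 5),
                      answer = c("ok", "bad", "fine"),
                      stringsAsFactors = FALSE)
amount <- data.frame(resp = c(1, 2),
                     company = c("A", "A"),
                     score = c(6, 8),
                     answer = c("maybe", "fine"),
                     stringsAsFactors = FALSE)

# Full outer join on both keys; clashing column names get the given suffixes
wide <- merge(quality, amount, by = c("resp", "company"), all = TRUE,
              suffixes = c("_quality", "_amount"))
names(wide)
```

Suffixes are only applied to names that clash, so a third merge onto this result would again need distinct names for its score and answer columns.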

For more groups, such as Sense, continue merging filtered subsets:

df <- merge(df, 
      current[current$question=='Sense',c("resp", "company", "score", "answer")], 
      by=c("resp", "company"), all=TRUE) 

colnames(df) <- c("resp", "company", "quality", "quality_a", "amount", "amount_a", 
        "sense", "sense_a") 
df 
#         resp   company quality     quality_a amount amount_a sense sense_a
# 1 1325851107   Dark.nl       4   Didn't like      6    Maybe     4    Nice
# 2 1325851107 Manual.nl       5          Fine     NA     <NA>    NA    <NA>
# 3 1325851108   Dark.nl       1        Was ok      8     Fine     6      Ok
# 4 1325851108 Manual.nl       5 No, thank you     NA     <NA>    NA    <NA>
# 5 1325851109   Dark.nl       2          Sure     10  Not bad     7     Yuk
# 6 1325851109 Manual.nl       7       Why not     NA     <NA>    NA    <NA>

Alternatively, to merge iteratively across all levels of question, consider the following:

dfList <- lapply(unique(current$question), function(i){ 
    temp <- setNames(current[current$question==i, c("resp", "company", "score", "answer")], 
       c("resp", "company", paste0(i), paste0(i, "_a"))) 
}) 

finaldf <- Reduce(function(...) merge(..., by=c("resp", "company"), all=TRUE), dfList) 
finaldf 
#         resp   company Quality     Quality_a Amount Amount_a Sense Sense_a
# 1 1325851107   Dark.nl       4   Didn't like      6    Maybe     4    Nice
# 2 1325851107 Manual.nl       5          Fine     NA     <NA>    NA    <NA>
# 3 1325851108   Dark.nl       1        Was ok      8     Fine     6      Ok
# 4 1325851108 Manual.nl       5 No, thank you     NA     <NA>    NA    <NA>
# 5 1325851109   Dark.nl       2          Sure     10  Not bad     7     Yuk
# 6 1325851109 Manual.nl       7       Why not     NA     <NA>    NA    <NA>
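
The fix.by error quoted in the question is what merge() raises when the key columns are passed as y= instead of by=: the second data frame then lands in the by slot, which is not a valid column specification. Below is a minimal sketch of the iterative merge with by=, restricted to the score column and run on a copy of the example data without the answer column (the names long, pieces and wide are illustrative):

```r
# Copy of the example data in long form, without the answer column
long <- data.frame(
  resp = rep(c(1325851107, 1325851108, 1325851109), times = 4),
  company = rep(c("Dark.nl", "Dark.nl", "Manual.nl", "Dark.nl"), each = 3),
  question = rep(c("Quality", "Amount", "Quality", "Sense"), each = 3),
  score = c(4, 1, 2, 6, 8, 10, 5, 5, 7, 4, 6, 7),
  stringsAsFactors = FALSE
)

# One data frame per question, with the score column renamed to the question
pieces <- lapply(unique(long$question), function(q) {
  setNames(long[long$question == q, c("resp", "company", "score")],
           c("resp", "company", q))
})

# Fold the list with a full outer join on both keys (note by=, not y=)
wide <- Reduce(function(x, y) merge(x, y, by = c("resp", "company"), all = TRUE),
               pieces)
wide
```

Because the loop runs over unique(long$question), the same code should carry over to a data set with 53 response variables without further changes.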

Thank you very much, Parfait. The script is easy to use and produces the data frame I had in mind. – SHW

Good to hear! Glad to help. Please accept the answer to confirm the solution. Happy coding! – Parfait

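I now run into trouble when one of my grouping variables (company) has more than two levels (see the additional code I added to the original post: #Grouping variable with more than two levels, including "Senses"). I get this error: Error in fix.by(by.x, x) : 'by' must specify one or more columns as numbers, names or logical. Any idea what is going wrong here? – SHW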

0

Another option is tidyr, the successor to reshape2:

library(tidyverse) 

current %>% group_by(resp, company) %>% 
    # join answer and score into a single column to be spread to wide form 
    unite(answer_score, answer, score) %>% 
    spread(question, answer_score) %>% 
    # separate joined columns 
    separate(Amount, c('amount', 'amount_a'), sep = '_', convert = TRUE) %>% 
    separate(Quality, into = c('quality', 'quality_a'), sep = '_', convert = TRUE) 

## Source: local data frame [6 x 6]
## Groups: resp, company [6]
##
##         resp   company  amount amount_a       quality quality_a
## *      <dbl>    <fctr>   <chr>    <int>         <chr>     <int>
## 1 1325851107   Dark.nl   Maybe        6   Didn't like         4
## 2 1325851107 Manual.nl    <NA>       NA          Fine         5
## 3 1325851108   Dark.nl    Fine        8        Was ok         1
## 4 1325851108 Manual.nl    <NA>       NA No, thank you         5
## 5 1325851109   Dark.nl Not bad       10          Sure         2
## 6 1325851109 Manual.nl    <NA>       NA       Why not         7

Instead of using unite you could use nest, but spreading list columns currently produces NULLs instead of NAs, which takes a little extra wrangling:

current %>% group_by(resp, company, question) %>% 
    nest() %>% 
    spread(question, data) %>% 
    # insert NAs with purrr::`%||%` so Amount will spread nicely 
    mutate(Amount = map(Amount, ~.x %||% data_frame(score = NA, answer = NA))) %>% 
    unnest(.sep = '_') 

## # A tibble: 6 × 6
##         resp   company Amount_score Amount_answer Quality_score Quality_answer
##        <dbl>    <fctr>        <dbl>        <fctr>         <dbl>         <fctr>
## 1 1325851107   Dark.nl            6         Maybe             4    Didn't like
## 2 1325851107 Manual.nl           NA            NA             5           Fine
## 3 1325851108   Dark.nl            8          Fine             1         Was ok
## 4 1325851108 Manual.nl           NA            NA             5  No, thank you
## 5 1325851109   Dark.nl           10       Not bad             2           Sure
## 6 1325851109 Manual.nl           NA            NA             7        Why not
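
In tidyr 1.0.0 and later, spread() is superseded by pivot_wider(), which accepts several value columns at once, so the unite/separate or nest/unnest detour should not be needed. Below is a minimal sketch, assuming a recent tidyr is installed and using a cut-down copy of the example data (long and wide are illustrative names):

```r
library(tidyr)

# Cut-down copy of the example data: two respondents, two questions
long <- data.frame(
  resp = rep(c(1325851107, 1325851108), times = 3),
  company = rep(c("Dark.nl", "Dark.nl", "Manual.nl"), each = 2),
  question = rep(c("Quality", "Amount", "Quality"), each = 2),
  score = c(4, 1, 6, 8, 5, 5),
  answer = c("Didn't like", "Was ok", "Maybe", "Fine", "Fine", "No, thank you"),
  stringsAsFactors = FALSE
)

# resp and company stay as identifying columns; each question becomes
# a pair of columns named <value column>_<question>
wide <- pivot_wider(long,
                    names_from = question,
                    values_from = c(score, answer))
wide
```

Combinations that never occur in the long data, such as Amount for Manual.nl, come out as NA without any extra handling.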

Thanks for your answer, alistaire, the first option did the trick! – SHW

Just asking: I was wondering why you use the %>% symbol. The script works, but I am not sure why :) – SHW

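'%>%' is the _pipe_ from the magrittr package, which is now used by many packages (including tidyr and dplyr), especially those associated with the _tidyverse_, where piping is a central principle. The basic idea is to make code easier to read by avoiding nested function calls, lots of intermediate variables, or overwriting the same variable, and by letting the code read in the order it is executed. [Here is a better explanation.](http://r4ds.had.co.nz/pipes.html) – alistaire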
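
To make the readability point concrete, here is a small sketch comparing a nested call with its piped form. It uses the native |> pipe that base R has shipped since version 4.1, so it runs without extra packages; for simple calls like these, %>% from magrittr is written the same way. The variable names are illustrative.

```r
scores <- c(4, 1, 2, 6, 8, 10)

# Nested form: read from the inside out
nested <- round(sqrt(mean(scores)), 2)

# Piped form: read top to bottom, in the order the steps run
piped <- scores |>
  mean() |>
  sqrt() |>
  round(2)

# Both forms compute the same value
identical(nested, piped)
```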