2017-10-16

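Combining rows in R

I am working with a long-format data frame in R and have run into a problem. My data frame is actually made up of two smaller data frames. I then rescaled the timeline from months to years so that both share a common timeline.

However, the problem I now face is that I sometimes have two rows with the same time value (one row per questionnaire), whereas I want only one row per time value. (I have attached a picture of the problem, which may be more insightful than my explanation.) Note that at this point I still want the data frame in long format; I only want to get rid of the "extra rows".

Can anyone tell me how to do this?

The dput of the head of the data is attached, where nomem_encr = ID, timeline.compressed = time, sel01-04 = part of the first questionnaire, and close_num and gener_sat = part of the second questionnaire.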

`

structure(list(nomem_encr = c(800009L, 800009L, 800009L, 800012L, 
800015L, 800015L), timeline.compressed = c(79, 79, 95, 79, 28, 
28), sel01 = c(NA, 6L, NA, NA, NA, 7L), sel02 = c(NA, 6L, NA, 
NA, NA, 7L), sel03 = c(NA, 3L, NA, NA, NA, 5L), sel04 = c(NA, 
6L, NA, NA, NA, 6L), close_num = c(1, NA, 0.2, 1, 0.8, NA), gener_sat = c(7L, 
NA, 7L, 8L, 7L, NA)), .Names = c("nomem_encr", "timeline.compressed", 
"sel01", "sel02", "sel03", "sel04", "close_num", "gener_sat"), class = "data.frame", row.names = c(NA, 
6L)) 

`

https://i.stack.imgur.com/3p038.png


Could you also provide sample data? Use head to create a subset and dput to show us how to reproduce it. – Olivia


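In reply to your first comment: I'm afraid I don't fully understand your remark. I assume that in each row either the X variables or the Y variables are answered. However, sometimes two rows have the same time value, i.e. the X and Y variables were answered at the same time. What I want is to combine those rows into a single row in which both the X and Y variables are answered. – Elisabeth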


How do we know which rows you need to trim? – jaySf

Answers

0

Using the reshape2 and dplyr packages

Load the libraries and the data:

library(reshape2) 
library(dplyr) 

x <- structure(
    list(
    nomem_encr = c(800009L, 800009L, 800009L, 800012L, 800015L, 800015L), 
    timeline.compressed = c(79, 79, 95, 79, 28, 28), 
    sel01 = c(NA, 6L, NA, NA, NA, 7L), 
    sel02 = c(NA, 6L, NA, NA, NA, 7L), 
    sel03 = c(NA, 3L, NA, NA, NA, 5L), 
    sel04 = c(NA, 6L, NA, NA, NA, 6L), 
    close_num = c(1, NA, 0.2, 1, 0.8, NA), 
    gener_sat = c(7L, NA, 7L, 8L, 7L, NA) 
), 
    .Names = c(
    "nomem_encr", "timeline.compressed", 
    "sel01", "sel02", "sel03", "sel04", "close_num", "gener_sat" 
), 
    class = "data.frame", 
    row.names = c(NA, 6L) 
) 
x 

This is what your data looks like:

  nomem_encr timeline.compressed sel01 sel02 sel03 sel04 close_num gener_sat
1     800009                  79    NA    NA    NA    NA       1.0         7
2     800009                  79     6     6     3     6        NA        NA
3     800009                  95    NA    NA    NA    NA       0.2         7
4     800012                  79    NA    NA    NA    NA       1.0         8
5     800015                  28    NA    NA    NA    NA       0.8         7
6     800015                  28     7     7     5     6        NA        NA

Now we melt the data into long format:

melt(data = x, id.vars = c("nomem_encr", "timeline.compressed")) %>% 
head(15) 

Output:

   nomem_encr timeline.compressed variable value
1      800009                  79    sel01    NA
2      800009                  79    sel01     6
3      800009                  95    sel01    NA
4      800012                  79    sel01    NA
5      800015                  28    sel01    NA
6      800015                  28    sel01     7
7      800009                  79    sel02    NA
8      800009                  79    sel02     6
9      800009                  95    sel02    NA
10     800012                  79    sel02    NA
11     800015                  28    sel02    NA
12     800015                  28    sel02     7
13     800009                  79    sel03    NA
14     800009                  79    sel03     3
15     800009                  95    sel03    NA

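If we cast the melted data frame without specifying an aggregation function, the default behavior is to count how many entries we have for each item: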

melt(data = x, id.vars = c("nomem_encr", "timeline.compressed")) %>% 
    dcast(
    formula = nomem_encr + timeline.compressed ~ variable 
) 

Output:

Aggregation function missing: defaulting to length
  nomem_encr timeline.compressed sel01 sel02 sel03 sel04 close_num gener_sat
1     800009                  79     2     2     2     2         2         2
2     800009                  95     1     1     1     1         1         1
3     800012                  79     1     1     1     1         1         1
4     800015                  28     2     2     2     2         2         2

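There are 2 entries for the item identified by 800009 79 (using nomem_encr and timeline.compressed as the identifying variables).

We can change the default behavior to something else, such as sum: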

melt(data = x, id.vars = c("nomem_encr", "timeline.compressed")) %>% 
    dcast(
    formula = nomem_encr + timeline.compressed ~ variable, 
    fun.aggregate = function(xs) sum(xs, na.rm = TRUE) 
) 

Output:

  nomem_encr timeline.compressed sel01 sel02 sel03 sel04 close_num gener_sat
1     800009                  79     6     6     3     6       1.0         7
2     800009                  95     0     0     0     0       0.2         7
3     800012                  79     0     0     0     0       1.0         8
4     800015                  28     7     7     5     6       0.8         7
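
Note that sum(xs, na.rm = TRUE) returns 0 for a group in which every value is NA, so an unanswered questionnaire becomes indistinguishable from an answer of 0. A minimal sketch of one way around this, assuming you would rather keep NA in that case: the helper name sum_or_na is made up for this illustration and is not part of reshape2.

```r
library(reshape2)

# Same sample data as in the question.
x <- data.frame(
  nomem_encr = c(800009L, 800009L, 800009L, 800012L, 800015L, 800015L),
  timeline.compressed = c(79, 79, 95, 79, 28, 28),
  sel01 = c(NA, 6L, NA, NA, NA, 7L),
  sel02 = c(NA, 6L, NA, NA, NA, 7L),
  sel03 = c(NA, 3L, NA, NA, NA, 5L),
  sel04 = c(NA, 6L, NA, NA, NA, 6L),
  close_num = c(1, NA, 0.2, 1, 0.8, NA),
  gener_sat = c(7L, NA, 7L, 8L, 7L, NA)
)

# Hypothetical helper: sum the non-missing values, but return NA
# (not 0) when the group holds no non-missing value at all.
# NA_real_ keeps the return type numeric in every branch.
sum_or_na <- function(xs) {
  if (all(is.na(xs))) NA_real_ else sum(xs, na.rm = TRUE)
}

long <- melt(x, id.vars = c("nomem_encr", "timeline.compressed"))

wide <- dcast(
  long,
  nomem_encr + timeline.compressed ~ variable,
  value.var = "value",
  fun.aggregate = sum_or_na
)
wide
```

With this aggregate, the rows for 800009/95 and 800012/79 keep NA in sel01 to sel04 instead of showing 0. Summing is still only safe if each id/time pair has at most one non-missing value per column; otherwise the values would be added together.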

This seems to work. Thank you very much! – Elisabeth


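Update: I just noticed that when I use this code on my data, it returns zeros, ones and the occasional two rather than the actual values. I copy-pasted your syntax and applied it to the whole data set. Any idea what might be going wrong? I also get this message: Aggregation function missing: defaulting to length – Elisabeth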


structure(list(nomem_encr = c(800009L, 800009L, 800012L, 800015L, 800015L, 800015L), timeline.compressed = c(79, 95, 79, 28, 40, 52), sel01 = c(1L, 0L, 0L, 1L, 1L, 0L), sel02 = c(1L, 0L, 0L, 1L, 1L, 0L), sel03 = c(1L, 0L, 0L, 1L, 1L, 0L), close_num = c(1L, 1L, 1L, 1L, 1L, 1L), gener_sat = c(1L, 1L, 1L, 1L, 1L, 1L)), .Names = c("nomem_encr", "timeline.compressed", "sel01", "sel02", "sel03", "close_num", "gener_sat"), class = "data.frame", row.names = c(NA, 6L)) – Elisabeth
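
The message "Aggregation function missing: defaulting to length" is what dcast prints when no fun.aggregate is supplied and at least one combination of the id variables occurs more than once; it then counts rows instead of returning values, which would fit the zeros, ones and twos described above. Whether that is what happened with the full data set cannot be told from the thread. As a rough diagnostic sketch in base R (the object name dat is an assumption, standing in for your full data frame), you can list the rows that share an id/time pair:

```r
# Sample data from the question; replace with your own data frame.
dat <- data.frame(
  nomem_encr = c(800009L, 800009L, 800009L, 800012L, 800015L, 800015L),
  timeline.compressed = c(79, 79, 95, 79, 28, 28),
  sel01 = c(NA, 6L, NA, NA, NA, 7L),
  close_num = c(1, NA, 0.2, 1, 0.8, NA)
)

# The columns that are supposed to identify one row each.
keys <- dat[, c("nomem_encr", "timeline.compressed")]

# TRUE for every row whose key appears more than once
# (scanning from both ends flags all members of a duplicate group).
dup_rows <- duplicated(keys) | duplicated(keys, fromLast = TRUE)

# Rows that dcast would have to aggregate.
dat[dup_rows, ]
```

If this returns any rows, the cast needs an explicit fun.aggregate; also check that the fun.aggregate argument really was passed in the call that produced the message.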

0

You can do this with dplyr + tidyr:

library(dplyr) 
library(tidyr) 

df %>% 
    group_by(nomem_encr, timeline.compressed) %>% 
    summarize_all(funs(sort(.)[1])) 

Result:

# A tibble: 4 x 8
# Groups:   nomem_encr [?]
  nomem_encr timeline.compressed sel01 sel02 sel03 sel04 close_num gener_sat
       <int>               <dbl> <int> <int> <int> <int>     <dbl>     <int>
1     800009                  79     6     6     3     6       1.0         7
2     800009                  95    NA    NA    NA    NA       0.2         7
3     800012                  79    NA    NA    NA    NA       1.0         8
4     800015                  28     7     7     5     6       0.8         7
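
funs() has since been deprecated in dplyr, so here is a sketch of the same idea written with across(), assuming dplyr 1.0.0 or later. The helper first_non_na is a made-up name for this illustration. It takes the first non-missing value in row order, whereas sort(.)[1] takes the smallest one; the two agree as long as each group has at most one non-missing value per column, as in the sample data.

```r
library(dplyr)

df <- data.frame(
  nomem_encr = c(800009L, 800009L, 800009L, 800012L, 800015L, 800015L),
  timeline.compressed = c(79, 79, 95, 79, 28, 28),
  sel01 = c(NA, 6L, NA, NA, NA, 7L),
  sel02 = c(NA, 6L, NA, NA, NA, 7L),
  sel03 = c(NA, 3L, NA, NA, NA, 5L),
  sel04 = c(NA, 6L, NA, NA, NA, 6L),
  close_num = c(1, NA, 0.2, 1, 0.8, NA),
  gener_sat = c(7L, NA, 7L, 8L, 7L, NA)
)

# Hypothetical helper: first non-missing value of a vector.
# Indexing an empty vector with [1] yields an NA of the same type,
# so all-NA groups stay NA and the column types are preserved.
first_non_na <- function(v) v[!is.na(v)][1]

collapsed <- df %>%
  group_by(nomem_encr, timeline.compressed) %>%
  # across(everything()) skips the grouping columns automatically
  summarise(across(everything(), first_non_na), .groups = "drop")

collapsed
```

Using .groups = "drop" returns an ungrouped tibble, so later steps such as replacing NA are applied to all columns in the same way.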

If you want to replace the NAs with zeros, you can do the following:

df %>% 
    group_by(nomem_encr, timeline.compressed) %>% 
    summarize_all(funs(sort(.)[1])) %>% 
    mutate_all(funs(replace(., is.na(.), 0))) 

Result:

# A tibble: 4 x 8
# Groups:   nomem_encr [3]
  nomem_encr timeline.compressed sel01 sel02 sel03 sel04 close_num gener_sat
       <int>               <dbl> <dbl> <dbl> <dbl> <dbl>     <dbl>     <dbl>
1     800009                  79     6     6     3     6       1.0         7
2     800009                  95     0     0     0     0       0.2         7
3     800012                  79     0     0     0     0       1.0         8
4     800015                  28     7     7     5     6       0.8         7

Data:

df = structure(list(nomem_encr = c(800009L, 800009L, 800009L, 800012L, 
800015L, 800015L), timeline.compressed = c(79, 79, 95, 79, 28, 
28), sel01 = c(NA, 6L, NA, NA, NA, 7L), sel02 = c(NA, 6L, NA, 
NA, NA, 7L), sel03 = c(NA, 3L, NA, NA, NA, 5L), sel04 = c(NA, 
6L, NA, NA, NA, 6L), close_num = c(1, NA, 0.2, 1, 0.8, NA), gener_sat = c(7L, 
NA, 7L, 8L, 7L, NA)), .Names = c("nomem_encr", "timeline.compressed", 
"sel01", "sel02", "sel03", "sel04", "close_num", "gener_sat"), class = "data.frame", row.names = c(NA, 
6L))