2016-06-16 15 views
0

Keep rows in a data frame that fulfil multiple constraints

I have a data frame as follows:

> sampledput 
      V1          V2    V3 
1 GSM1010983        adipose Bisulfite-Seq 
2 GSM1120330        adipose Bisulfite-Seq 
3 GSM1120331        adipose Bisulfite-Seq 
4 GSM1282348        adipose Bisulfite-Seq 
5 GSM1282357        adipose Bisulfite-Seq 
6 GSM906416        adipose ChIP-Seq input 
7 GSM906394        adipose  H3K27ac 
8 GSM1010958        adipose  mRNA-Seq 
9 GSM1120304        adipose  mRNA-Seq 
10 GSM1120305        adipose  mRNA-Seq 
11 GSM621443 adipose derived mesenchymal stem cells ChIP-Seq input 
12 GSM621420 adipose derived mesenchymal stem cells  H3K27me3 
13 GSM621446 adipose derived mesenchymal stem cells  H3K36me3 
14 GSM621418 adipose derived mesenchymal stem cells  H3K4me1 
15 GSM621458 adipose derived mesenchymal stem cells  H3K4me3 
16 GSM670020 adipose derived mesenchymal stem cells   H3K9ac 
17 GSM621398 adipose derived mesenchymal stem cells  H3K9me3 

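I want to keep the rows where the value in column V2 stays the same (e.g., adipose) and the values in column V3 include all of Bisulfite-Seq, H3K27ac, ChIP-Seq input and mRNA-Seq. If a value is duplicated in V3, take only one of the duplicates. As you can see, for the rows with the values mRNA-Seq and Bisulfite-Seq I picked just one each, so the output I would get is: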

5 GSM1282357        adipose Bisulfite-Seq 
6 GSM906416        adipose ChIP-Seq input 
7 GSM906394        adipose  H3K27ac 
8 GSM1010958        adipose  mRNA-Seq 

Here is the dput:

structure(list(V1 = structure(c(2L, 5L, 6L, 7L, 8L, 17L, 16L, 
1L, 3L, 4L, 12L, 11L, 13L, 10L, 14L, 15L, 9L), .Label = c("GSM1010958", 
"GSM1010983", "GSM1120304", "GSM1120305", "GSM1120330", "GSM1120331", 
"GSM1282348", "GSM1282357", "GSM621398", "GSM621418", "GSM621420", 
"GSM621443", "GSM621446", "GSM621458", "GSM670020", "GSM906394", 
"GSM906416"), class = "factor"), V2 = structure(c(1L, 1L, 1L, 
1L, 1L, 1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 2L, 2L, 2L), .Label = c("adipose", 
"adipose derived mesenchymal stem cells"), class = "factor"), 
    V3 = structure(c(1L, 1L, 1L, 1L, 1L, 2L, 3L, 10L, 10L, 10L, 
    2L, 4L, 5L, 6L, 7L, 8L, 9L), .Label = c("Bisulfite-Seq", 
    "ChIP-Seq input", "H3K27ac", "H3K27me3", "H3K36me3", "H3K4me1", 
    "H3K4me3", "H3K9ac", "H3K9me3", "mRNA-Seq"), class = "factor")), .Names = c("V1", 
"V2", "V3"), class = "data.frame", row.names = c(NA, -17L)) 
+0

Why don't the first four rows satisfy your constraints? The value in V2 is 'adipose' and the value in 'V3' is 'Bisulfite-Seq'. – ZachTurn

+0

@ZachTurn Yes, you are right, they would also be in the output. – Newbie

+0

@ZachTurn Actually, what I want here is to remove the duplicates and keep only one row per category. – Newbie

Answers

1

Edit: a "better" solution

I actually like this one better, because I think the code is more logical:

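library(dplyr) 
sampledput %>% group_by(V2) %>% 
    # keep only the V2 groups that contain all four required V3 values
    filter(all(c("Bisulfite-Seq","H3K27ac","ChIP-Seq input","mRNA-Seq") %in% V3)) %>% 
    # keep the first row per (V2, V3); current dplyr needs .keep_all = TRUE to retain V1
    distinct(V2, V3, .keep_all = TRUE) 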

Source: local data frame [4 x 3] 
Groups: V2 [1] 

      V1  V2    V3 
     (fctr) (fctr)   (fctr) 
1 GSM1010983 adipose Bisulfite-Seq 
2 GSM906416 adipose ChIP-Seq input 
3 GSM906394 adipose  H3K27ac 
4 GSM1010958 adipose  mRNA-Seq 

This tests, for each value of V2, that all of the V3 values you want are present in that group. It then still filters out any duplicates.
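For readers without dplyr, the same two steps (a per-group "contains all wanted values" test, then de-duplication on the V2/V3 pair) can be sketched in base R. This is only an illustration on a small made-up data frame; `df`, `wanted`, `has_all` and the row labels are assumptions and are not part of the original data.

```r
# Hypothetical stand-in for sampledput: same column names, fewer rows.
df <- data.frame(
  V1 = c("a1", "a2", "a3", "a4", "a5", "b1", "b2"),
  V2 = c(rep("adipose", 5), rep("stem cells", 2)),
  V3 = c("Bisulfite-Seq", "Bisulfite-Seq", "ChIP-Seq input", "H3K27ac",
         "mRNA-Seq", "ChIP-Seq input", "H3K27me3"),
  stringsAsFactors = FALSE
)
wanted <- c("Bisulfite-Seq", "H3K27ac", "ChIP-Seq input", "mRNA-Seq")

# One logical per V2 group: does the group contain every wanted V3 value?
has_all <- tapply(df$V3, df$V2, function(v) all(wanted %in% v))

# Keep the rows that belong to a complete group (looked up by group name).
keep <- df[as.vector(has_all[as.character(df$V2)]), ]

# Drop repeated (V2, V3) pairs, keeping the first occurrence of each.
result <- keep[!duplicated(keep[c("V2", "V3")]), ]
print(result)
```

Like the dplyr version, this keeps every V3 value of a complete group; if a complete group could also hold unwanted V3 values, an extra `keep$V3 %in% wanted` filter would be needed.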

Original solution

A dplyr solution. Calling distinct(V2, V3) keeps the first occurrence whenever there are duplicates.

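library(dplyr) 
sampledput %>% group_by(V2) %>% 
    # keep only the rows whose V3 is one of the four required values
    filter(V3 %in% c("Bisulfite-Seq","H3K27ac","ChIP-Seq input","mRNA-Seq")) %>% 
    # first row per (V2, V3), then keep only groups that have all four values
    distinct(V2, V3, .keep_all = TRUE) %>% filter(length(unique(V3))==4) 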

Source: local data frame [4 x 3] 
Groups: V2 [2] 

      V1          V2    V3 
     (fctr)         (fctr)   (fctr) 
1 GSM1010983        adipose Bisulfite-Seq 
2 GSM906416        adipose ChIP-Seq input 
3 GSM906394        adipose  H3K27ac 
4 GSM1010958        adipose  mRNA-Seq 

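One caveat, though: in your desired output you list GSM1282357, whereas my solution returns GSM1010983. I am not sure whether that is a problem for you.

You will have to test whether this generalizes to your whole data set, but it does produce the output you want.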
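One way to test this on a larger data set is a small sanity check on the result: every remaining V2 group should contain exactly the four wanted V3 values, each once. The helper below is a hypothetical sketch in base R; `check_result` and the toy data frames are assumptions, not part of the original answer.

```r
# Hypothetical helper: TRUE if every V2 group in `res` holds exactly the
# wanted V3 values, with no value repeated.
check_result <- function(res, wanted) {
  groups <- split(as.character(res$V3), as.character(res$V2))
  all(vapply(
    groups,
    function(v) setequal(v, wanted) && anyDuplicated(v) == 0,
    logical(1)
  ))
}

wanted <- c("Bisulfite-Seq", "H3K27ac", "ChIP-Seq input", "mRNA-Seq")

# A result that satisfies the constraints.
good <- data.frame(V2 = rep("adipose", 4), V3 = wanted,
                   stringsAsFactors = FALSE)

# A result that still carries a duplicated V3 value.
bad <- rbind(good, data.frame(V2 = "adipose", V3 = "mRNA-Seq",
                              stringsAsFactors = FALSE))

print(check_result(good, wanted))
print(check_result(bad, wanted))
```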

+0

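Thanks for the answer. There is one misunderstanding: the output should not contain the last row, because for 'adipose derived mesenchymal stem cells' in 'V2', 'V3' does not contain 'mRNA-Seq'. Only rows whose 'V2' group satisfies all 4 conditions should be kept. – Newbie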

+0

Yes, any of the distinct values of V2 or V3 is fine; that is not a problem. – Newbie

+0

@Newbie OK, I understand now. Let me edit. – ZachTurn

1

Maybe a bit too simple, but...

library(dplyr) 
result <- sampledput %>% group_by(V2, V3) %>% summarise(V1 = V1[length(V1)]) 

This returns the last GSM for each (V2, V3) group, as in your desired output.
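For comparison, "take the last row of each (V2, V3) pair" can also be sketched in base R with `duplicated(..., fromLast = TRUE)`. The toy data frame below is hypothetical. Like the summarise call above, this sketch does not check that a V2 group contains all four required V3 values.

```r
# Hypothetical toy data: two V3 values are repeated within one V2 group.
df <- data.frame(
  V1 = c("a1", "a2", "a3", "a4", "a5"),
  V2 = rep("adipose", 5),
  V3 = c("Bisulfite-Seq", "Bisulfite-Seq", "H3K27ac", "mRNA-Seq", "mRNA-Seq"),
  stringsAsFactors = FALSE
)

# fromLast = TRUE marks every occurrence except the last one as a duplicate,
# so negating it keeps the last row of each (V2, V3) pair.
last_rows <- df[!duplicated(df[c("V2", "V3")], fromLast = TRUE), ]
print(last_rows)
```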

0

We can also use data.table:

library(data.table) 
setDT(sampledput)[, .(V1 = last(V1)), .(V2, V3)]