2016-08-24 69 views
1

我有一个做了如下方式的数据帧两两部门:如何执行基于行的分组

df <- structure(list(celltype = structure(c(1L, 1L, 2L, 2L, 3L, 3L, 
4L, 4L, 5L, 5L, 6L, 6L, 7L, 7L, 8L, 8L, 9L, 9L, 10L, 10L), .Label = c("Bcells", 
"DendriticCells", "Macrophages", "Monocytes", "NKCells", "Neutrophils", 
"StemCells", "StromalCells", "abTcells", "gdTCells"), class = "factor"), 
    sample = c("SP ID control", "SP ID treated", "SP ID control", 
    "SP ID treated", "SP ID control", "SP ID treated", "SP ID control", 
    "SP ID treated", "SP ID control", "SP ID treated", "SP ID control", 
    "SP ID treated", "SP ID control", "SP ID treated", "SP ID control", 
    "SP ID treated", "SP ID control", "SP ID treated", "SP ID control", 
    "SP ID treated"), `mean(score)` = c(0.160953535029424, 0.155743474395545, 
    0.104788051104575, 0.125247035158472, -0.159665650045289, 
    -0.134662049979712, 0.196249441751866, 0.212256889027029, 
    0.0532668251890109, 0.0738264693971133, 0.151828478029596, 
    0.159941552142933, -0.14128323638966, -0.120556640790534, 
    0.196518649474078, 0.185264282171863, 0.0654641151966543, 
    0.0837989059507186, 0.145111577618456, 0.145448549866796)), .Names = c("celltype", 
"sample", "mean(score)"), row.names = c(7L, 8L, 17L, 18L, 27L, 
28L, 37L, 38L, 47L, 48L, 57L, 58L, 67L, 68L, 77L, 78L, 87L, 88L, 
97L, 98L), class = "data.frame") 

它看起来像这样:

> df 
     celltype  sample mean(score) 
7   Bcells SP ID control 0.16095354 
8   Bcells SP ID treated 0.15574347 
17 DendriticCells SP ID control 0.10478805 
18 DendriticCells SP ID treated 0.12524704 
27 Macrophages SP ID control -0.15966565 
28 Macrophages SP ID treated -0.13466205 
37  Monocytes SP ID control 0.19624944 
38  Monocytes SP ID treated 0.21225689 
47  NKCells SP ID control 0.05326683 
48  NKCells SP ID treated 0.07382647 
57 Neutrophils SP ID control 0.15182848 
58 Neutrophils SP ID treated 0.15994155 
67  StemCells SP ID control -0.14128324 
68  StemCells SP ID treated -0.12055664 
77 StromalCells SP ID control 0.19651865 
78 StromalCells SP ID treated 0.18526428 
87  abTcells SP ID control 0.06546412 
88  abTcells SP ID treated 0.08379891 
97  gdTCells SP ID control 0.14511158 
98  gdTCells SP ID treated 0.14544855 

我想要做的是根据cell type分组内的treatedcontrol样本计算得分分数。

下面的Excel图像说明了这个例子。我们在最右栏之后。例如在Bcells(0.155/0.161 = 0.967)。

enter image description here

在这一天结束时,我想获得的是看起来像这样的DF:

celltype   sample   Pairwise division 
Bcells    SP ID treated 0.967630031 
DendriticCells  SP ID treated 1.195241574 
Macrophages   SP ID treated 0.843400255 
Monocytes   SP ID treated 1.081566841 
NKCells    SP ID treated 1.385974647 
Neutrophils   SP ID treated 1.053435786 
StemCells   SP ID treated 0.853297563 
StromalCells  SP ID treated 0.942731303 
abTcells   SP ID treated 1.280073915 
gdTCells   SP ID treated 1.002322158 

我如何能实现在R里面?

回答

2

如果蔓延到广泛的形式,这是很简单的:

library(tidyr) 
library(dplyr) 

df %>% spread(sample, `mean(score)`) %>% 
    mutate(pairwise_division = `SP ID treated`/`SP ID control`) 

##   celltype SP ID control SP ID treated pairwise_division 
## 1   Bcells 0.16095354 0.15574347   0.9676300 
## 2 DendriticCells 0.10478805 0.12524704   1.1952416 
## 3  Macrophages -0.15966565 -0.13466205   0.8434003 
## 4  Monocytes 0.19624944 0.21225689   1.0815668 
## 5   NKCells 0.05326683 0.07382647   1.3859746 
## 6  Neutrophils 0.15182848 0.15994155   1.0534358 
## 7  StemCells -0.14128324 -0.12055664   0.8532976 
## 8 StromalCells 0.19651865 0.18526428   0.9427313 
## 9  abTcells 0.06546412 0.08379891   1.2800739 
## 10  gdTCells 0.14511158 0.14544855   1.0023222 

请注意,你应该解决您的列名,这样你就不必经常使用反引号。

要获得精确期望的结果,收集回长,过滤器,只处理行,并选择所需的列:

df %>% spread(sample, `mean(score)`) %>% 
    mutate(pairwise_division = `SP ID treated`/`SP ID control`) %>% 
    gather(sample, `mean(score)`, starts_with('SP')) %>% 
    filter(sample == 'SP ID treated') %>% 
    select(celltype, sample, pairwise_division) 

##   celltype  sample pairwise_division 
## 1   Bcells SP ID treated   0.9676300 
## 2 DendriticCells SP ID treated   1.1952416 
## 3  Macrophages SP ID treated   0.8434003 
## 4  Monocytes SP ID treated   1.0815668 
## 5   NKCells SP ID treated   1.3859746 
## 6  Neutrophils SP ID treated   1.0534358 
## 7  StemCells SP ID treated   0.8532976 
## 8 StromalCells SP ID treated   0.9427313 
## 9  abTcells SP ID treated   1.2800739 
## 10  gdTCells SP ID treated   1.0023222 

等效版本是在基地可能与data.table,如果你喜欢。或采取直接的路线:

aggregate(cbind(pairwise_division = `mean(score)`) ~ celltype, 
      df[order(df$celltype, df$sample), ], 
      FUN = function(x){x[2]/x[1]}) 

##   celltype pairwise_division 
## 1   Bcells   0.9676300 
## 2 DendriticCells   1.1952416 
## 3  Macrophages   0.8434003 
## 4  Monocytes   1.0815668 
## 5   NKCells   1.3859746 
## 6  Neutrophils   1.0534358 
## 7  StemCells   0.8532976 
## 8 StromalCells   0.9427313 
## 9  abTcells   1.2800739 
## 10  gdTCells   1.0023222 
+0

谢谢,但怎么来的值不是你的结果的第一行'0.967630031'? – neversaint

+2

糟糕,向后分开并贴出错误的版本。固定。 – alistaire

5

如果您的数据是有序和完全配对:

pair_index <- 1:(dim(df)[1]/2)*2 
df[pair_index,'pairwise-division'] <- df[pair_index,3]/df[pair_index-1,3] 
df[pair_index,c(1,2,4)]