How to average groups of data frame columns based on another data frame

I have two CSV data sets. The first looks like this:

gene,stem1,stem2,stem3,b1,b2,b3,t1 
foo,20,10,11,23,22,79,3 
bar,17,13,505,12,13,88,1 
qui,17,13,5,12,13,88,3

And the second looks like this:

celltype,phenotype 
SC,stem1 
BC,b2 
SC,stem2 
SC,stem3 
BC,b1 
TC,t1 
BC,b3

Loaded as data frames, they look like this:

In [5]: import pandas as pd 
In [7]: main_df = pd.read_table("http://dpaste.com/2MRRRM3.txt", sep=",") 

In [8]: main_df 
Out[8]: 
  gene  stem1  stem2  stem3  b1  b2  b3  t1
0  foo     20     10     11  23  22  79   3
1  bar     17     13    505  12  13  88   1
2  qui     17     13      5  12  13  88   3


In [11]: source_df = pd.read_table("http://dpaste.com/091PNE5.txt", sep=",") 

In [12]: source_df 
Out[12]: 
  celltype phenotype
0       SC     stem1
1       BC        b2
2       SC     stem2
3       SC     stem3
4       BC        b1
5       TC        t1
6       BC        b3
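
The dpaste URLs above may no longer resolve, so here is a minimal sketch that rebuilds both frames from the CSV text shown in the question. Reading from `io.StringIO` with `pd.read_csv` is an assumption made for reproducibility; it is not how the original session loaded the data.

```python
import io

import pandas as pd

# CSV text copied from the question (trailing spaces removed).
MAIN_CSV = """gene,stem1,stem2,stem3,b1,b2,b3,t1
foo,20,10,11,23,22,79,3
bar,17,13,505,12,13,88,1
qui,17,13,5,12,13,88,3
"""

SOURCE_CSV = """celltype,phenotype
SC,stem1
BC,b2
SC,stem2
SC,stem3
BC,b1
TC,t1
BC,b3
"""

# Parse the in-memory text exactly as if it were a file.
main_df = pd.read_csv(io.StringIO(MAIN_CSV))
source_df = pd.read_csv(io.StringIO(SOURCE_CSV))

print(main_df)
print(source_df)
```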

What I want to do is average the columns of main_df according to the grouping given in source_df, so that the final result looks like this:

     SC              BC             TC
foo  (20+10+11)/3    (23+22+79)/3   3/1
bar  (17+13+505)/3   (12+13+88)/3   1/1
qui  (17+13+5)/3     (12+13+88)/3   3/1

How can I do this?

Source

2016-01-21 neversaint

You can convert source_df to a dict that maps each phenotype to its cell type, then pass that dict to .groupby() with axis=1 on main_df:

main_df.set_index('gene', inplace=True) 
col_dict = source_df.set_index('phenotype').squeeze().to_dict() 
main_df.groupby(col_dict, axis=1).mean() 

             BC          SC  TC
gene
foo   41.333333   13.666667   3
bar   37.666667  178.333333   1
qui   37.666667   11.666667   3

Source
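
As far as I know, `groupby(..., axis=1)` is deprecated in recent pandas releases (2.1 and later). A sketch of an equivalent that avoids `axis=1` is to transpose, group the rows with the same dict, and transpose back. The inline CSV text below stands in for the dpaste files and is an assumption for the sake of a self-contained example.

```python
import io

import pandas as pd

# Inline copies of the question's data (assumed stand-ins for the dpaste files).
main_df = pd.read_csv(io.StringIO(
    "gene,stem1,stem2,stem3,b1,b2,b3,t1\n"
    "foo,20,10,11,23,22,79,3\n"
    "bar,17,13,505,12,13,88,1\n"
    "qui,17,13,5,12,13,88,3\n"
), index_col="gene")

source_df = pd.read_csv(io.StringIO(
    "celltype,phenotype\n"
    "SC,stem1\nBC,b2\nSC,stem2\nSC,stem3\nBC,b1\nTC,t1\nBC,b3\n"
))

# Map each phenotype (a column of main_df) to its cell type.
col_dict = source_df.set_index("phenotype")["celltype"].to_dict()

# Transpose so phenotypes become rows, group those rows by cell type,
# average within each group, then transpose back to genes-by-celltype.
result = main_df.T.groupby(col_dict).mean().T

print(result)
```

Selecting the `celltype` column explicitly does the same job as `.squeeze()` in the answer above, but does not depend on source_df having exactly one remaining column.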

2016-01-21 03:02:49 Stefan

You can set the index on both source_df and main_df, then use pd.concat to line them up and groupby on celltype:

main_df.set_index('gene', inplace=True) 
source_df.set_index("phenotype", inplace=True) 

In [30]: pd.concat([main_df.T, source_df], axis=1) 
Out[30]: 
gene   foo  bar  qui celltype
b1      23   12   12       BC
b2      22   13   13       BC
b3      79   88   88       BC
stem1   20   17   17       SC
stem2   10   13   13       SC
stem3   11  505    5       SC
t1       3    1    3       TC


In [33]: pd.concat([main_df.T, source_df], axis=1).groupby(['celltype']).mean().T 
Out[33]: 
celltype         BC          SC  TC
gene
foo       41.333333   13.666667   3
bar       37.666667  178.333333   1
qui       37.666667   11.666667   3

Source
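
One thing to watch with this approach: a phenotype that has no entry in source_df gets a missing celltype, and groupby drops missing keys by default, so that column would silently vanish from the averages. Below is a hedged, self-contained variant that uses `DataFrame.join` instead of `pd.concat` (a swap of mine, not part of the answer) and checks for unmapped columns first. The inline CSV text is again an assumed stand-in for the dpaste files.

```python
import io

import pandas as pd

# Inline copies of the question's data (assumed stand-ins for the dpaste files).
main_df = pd.read_csv(io.StringIO(
    "gene,stem1,stem2,stem3,b1,b2,b3,t1\n"
    "foo,20,10,11,23,22,79,3\n"
    "bar,17,13,505,12,13,88,1\n"
    "qui,17,13,5,12,13,88,3\n"
), index_col="gene")

source_df = pd.read_csv(io.StringIO(
    "celltype,phenotype\n"
    "SC,stem1\nBC,b2\nSC,stem2\nSC,stem3\nBC,b1\nTC,t1\nBC,b3\n"
))

# Rows are now phenotypes; attach each phenotype's cell type by index.
joined = main_df.T.join(source_df.set_index("phenotype"))

# Fail loudly if any column of main_df has no cell type assigned.
unmapped = joined.index[joined["celltype"].isna()].tolist()
if unmapped:
    raise ValueError(f"no celltype for columns: {unmapped}")

# Average per cell type, then flip back to genes-by-celltype.
result = joined.groupby("celltype").mean().T

print(result)
```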

2016-01-21 05:57:31
