Python: how do I compute the collaboration between pairs in a pandas DataFrame?

I have this:

import pandas as pd 
import numpy as np 

df = pd.DataFrame({'Item': ['A', 'A', 'A', 'B', 'B', 'C', 'C', 'C', 'C'], 
                   'Name': ['Tom', 'John', 'Paul', 'Tom', 'Frank', 'Tom', 'John', 'Richard', 'James'], 
                   'Weight': [2, 2, 2, 3, 3, 5, 5, 5, 5]}) 
df 
Item Name Weight 
A Tom  4 
A John 4 
A Paul 4 
B Tom  3 
B Frank 3 
C Tom  5 
C John 5 
C Richard 5 
C James 5 

For each person, I want the list of people who worked on the same Items, with each shared Item weighted by 1/Weight:

df1 
Name    People       Times 
Tom  [John, Paul, Frank, Richard, James]  [(1/4+1/5),1/4,1/3,1/5,1/5] 
John [Tom, Richard, James]      [(1/4+1/5),1/5,1/5] 
Paul [Tom, John]        [1/4,1/4] 
Frank [Tom]          [1/3] 
Richard [Tom, John, James]      [1/5,1/5,1/5] 
James [Tom, John, Richard]      [1/5,1/5,1/5] 
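
To make the weighting concrete, using the Weight values shown in the table above: Tom and John work together on Item A (Weight 4) and Item C (Weight 5), so that pair counts for 1/4 + 1/5 = 0.45, while Tom and Frank only share Item B (Weight 3) and count for 1/3.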

To count the number of collaborations without taking Weight into account, this is what I did:

#merge M:N by column Item 
df1 = pd.merge(df, df, on=['Item']) 

#remove self-pairs where Name_x == Name_y 
df1 = df1[~(df1['Name_x'] == df1['Name_y'])] 
#print df1 

#create lists 
df1 = df1.groupby('Name_x')['Name_y'].apply(lambda x: x.tolist()).reset_index() 
print df1 
    Name_x          Name_y 
0 Frank          [Tom] 
1 James      [Tom, John, Richard] 
2  John   [Tom, Paul, Tom, Richard, James] 
3  Paul        [Tom, John] 
4 Richard       [Tom, John, James] 
5  Tom [John, Paul, Frank, John, Richard, James] 


#get count by np.unique 
df1['People'] = df1['Name_y'].apply(lambda a: np.unique(a, return_counts=True)[0]) 
df1['times'] = df1['Name_y'].apply(lambda a: np.unique(a, return_counts=True)[1]) 
#remove column Name_y 
df1 = df1.drop('Name_y', axis=1).rename(columns={'Name_x':'Name'}) 
print df1 
     Name        People   times 
0 Frank        [Tom]    [1] 
1 James     [John, Richard, Tom]  [1, 1, 1] 
2  John   [James, Paul, Richard, Tom]  [1, 1, 1, 2] 
3  Paul       [John, Tom]   [1, 1] 
4 Richard     [James, John, Tom]  [1, 1, 1] 
5  Tom [Frank, James, John, Paul, Richard] [1, 1, 2, 1, 1] 
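
In this last frame, times is simply the number of Items each pair shares; for example, John and Tom appear together on Items A and C, which is why Tom shows up with a 2 in John's row.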

In this last dataframe I have the count of collaborations between all pairs, but I would like the collaborations weighted by the Items' Weight, as in df1 above.

Answer

For the weighted counts, start with:

df = pd.DataFrame({'Item': ['A', 'A', 'A', 'B', 'B', 'C', 'C', 'C', 'C'], 
        'Name': ['Tom', 'John', 'Paul', 'Tom', 'Frank', 'Tom', 'John', 'Richard', 'James'], 
        'Weight': [2, 2, 2, 3, 3, 5, 5, 5, 5]}) 

df1 = pd.merge(df, df, on=['Item']) 
df1 = df1[~(df1['Name_x'] == df1['Name_y'])].set_index(['Name_x', 'Name_y']).drop(['Item', 'Weight_y'], axis=1) 

You can use .apply() to compute the values and .unstack() to get a wide format:

collab = df1.groupby(level=['Name_x', 'Name_y']).apply(lambda x: np.sum(1/x)).unstack().loc[:, 'Weight_x'] 

Name_y  Frank James John Paul Richard  Tom 
Name_x             
Frank   NaN NaN NaN NaN  NaN 0.333333 
James   NaN NaN 0.2 NaN  0.2 0.200000 
John   NaN 0.2 NaN 0.5  0.2 0.700000 
Paul   NaN NaN 0.5 NaN  NaN 0.500000 
Richard  NaN 0.2 0.2 NaN  NaN 0.200000 
Tom  0.333333 0.2 0.7 0.5  0.2  NaN 
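
Here 1/x takes the reciprocal of the remaining Weight_x column, np.sum adds up the contributions of every Item a pair shares, and .unstack() pivots Name_y into the columns shown above. The same wide table could also be built by adding the reciprocal weight as a column and pivoting; a minimal sketch of that variant (the name pairs and the use of pivot_table are mine, not part of the original answer):

#re-merge and drop self-pairs, as before 
pairs = pd.merge(df, df, on='Item') 
pairs = pairs[pairs['Name_x'] != pairs['Name_y']] 

#each shared Item contributes 1/Weight to the pair 
pairs['w'] = 1.0 / pairs['Weight_x'] 

#pivot_table sums those contributions per (Name_x, Name_y) pair 
collab = pairs.pivot_table(index='Name_x', columns='Name_y', values='w', aggfunc='sum') 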

Then iterate over the rows and convert to lists:

df = pd.DataFrame(columns=['People', 'Times']) 
for p, data in collab.iterrows(): 
    s = data.dropna() 
    df.loc[p] = [s.index.tolist(), s.values] 

             People \ 
Frank         [Tom] 
James     [John, Richard, Tom] 
John    [James, Paul, Richard, Tom] 
Paul        [John, Tom] 
Richard     [James, John, Tom] 
Tom  [Frank, James, John, Paul, Richard] 

             Times 
Frank      [0.333333333333] 
James       [0.2, 0.2, 0.2] 
John      [0.2, 0.5, 0.2, 0.7] 
Paul        [0.5, 0.5] 
Richard      [0.2, 0.2, 0.2] 
Tom  [0.333333333333, 0.2, 0.7, 0.5, 0.2] 
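
The loop could equally be written without iterrows by stacking the wide table back into long form and grouping per person; a sketch of that alternative (not part of the original answer):

#stack() drops the NaNs and returns a Series indexed by (Name_x, Name_y) 
long_form = collab.stack() 
grouped = long_form.groupby(level=0) 

df2 = pd.DataFrame({'People': grouped.apply(lambda s: s.index.get_level_values('Name_y').tolist()), 
                    'Times': grouped.apply(list)}) 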

This is what I want, but I get the following error – emax


Sorry, I skipped a step, see the update. – Stefan


Awesome!!!!!! – emax
