2016-07-22 64 views
0

我想在python做的,很明显,操作简单:熊猫:如果总和列中的值一致

我有一些数据集,说6,我想总结如果值一列的值另外两列重合。之后,我想划分已经加上的数据集的数量,在这种情况下,6(即计算算术平均值)的列值。另外我想总结0,如果其他列的值不一致。

我写下来这里有两个dataframes,作为例子:

Code1 Code2 Distance 0 15.0 15.0 2 1 15.0 60.0 3 2 15.0 69.0 2 3 15.0 434.0 1 4 15.0 842.0 0

Code1 Code2 Distance 0 14.0 15.0 4 1 14.0 60.0 7 2 15.0 15.0 0 3 15.0 60.0 1 4 15.0 69.0 9

的第一列是df.index列。然后,只有'代码1'和'代码2'列重合时,我才会计算'距离'列的总和。在这种情况下,所需的输出会是这样的:

Code1 Code2 Distance 0 14.0 15.0 2 1 14.0 60.0 3.5 2 15.0 15.0 1 3 15.0 60.0 2 4 15.0 69.0 5.5 5 15.0 434.0 0.5 6 15.0 842.0 0

我试图做到这一点使用条件语句,但对于两个以上的df是真的很难做。熊猫有没有办法更快地做到这一点?

任何帮助:-)

+0

难道你'Code1'和'Code2'在一个数据帧一样吗? –

+0

我不确定我是否理解,如果Code1和Code2匹配,您想要添加距离列的值,彼此之间?在df之间?独立的指数?另外,如果你有N个DataFrame都具有相同的列,为什么你不能只用一个大的df来处理所有的数据并且使用像sum col这样的条件? – nico

+0

@AntonProtopopov,是的,可能是一样的。 –

回答

1

理解你可以把你所有的数据帧中的列表,然后使用reduce要么appendmerge他们。 看看减少here

首先,为样本数据生成定义一些函数。

import pandas 
import numpy as np 

# GENERATE DATA 
# Code 1 between 13 and 15 
def generate_code_1(n): 
    return np.floor(np.random.rand(n,1) * 3 + 13) 

# Code 2 between 1 and 1000 
def generate_code_2(n): 
    return np.floor(np.random.rand(n,1) * 1000) + 1 

# Distance between 0 and 9 
def generate_distance(n): 
    return np.floor(np.random.rand(n,1) * 10) 

# Generate a data frame as hstack of 3 arrays 
def generate_data_frame(n): 
    data = np.hstack([ 
     generate_code_1(n) 
     ,generate_code_2(n) 
     ,generate_distance(n) 
    ]) 
    df = pandas.DataFrame(data=data, columns=['Code 1', 'Code 2', 'Distance']) 
    # Remove possible duplications of Code 1 and Code 2. Take smallest distance in case of duplications. 
    # Duplications will break merge method however will not break append method 
    df = df.groupby(['Code 1', 'Code 2'], as_index=False) 
    df = df.aggregate(np.min) 
    return df 

# Generate n data frames each with m rows in a list 
def generate_data_frames(n, m, with_count=False): 
    df_list = [] 
    for k in range(0, n): 
     df = generate_data_frame(m) 
     # Add count column, needed for merge method to keep track of how many cases we have seen 
     if with_count: 
      df['Count'] = 1 
     df_list.append(df) 
    return df_list 

Append方法(更快,更短,更好)

df_list = generate_data_frames(94, 5) 

# Append all data frames together using reduce 
df_append = reduce(lambda df_1, df_2 : df_1.append(df_2), df_list) 

# Aggregate by Code 1 and Code 2 
df_append_grouped = df_append.groupby(['Code 1', 'Code 2'], as_index=False) 
df_append_result = df_append_grouped.aggregate(np.mean) 
df_append_result 

合并方法

df_list = generate_data_frames(94, 5, with_count=True) 

# Function to be passed to reduce. Merge 2 data frames and update Distance and Count 
def merge_dfs(df_1, df_2): 
    df = pandas.merge(df_1, df_2, on=['Code 1', 'Code 2'], how='outer', suffixes=('', '_y')) 
    df = df.fillna(0) 
    df['Distance'] = df['Distance'] + df['Distance_y'] 
    df['Count'] = df['Count'] + df['Count_y'] 
    del df['Distance_y'] 
    del df['Count_y'] 
    return df 

# Use reduce to apply merge over the list of data frames 
df_merge_result = reduce(merge_dfs, df_list) 

# Replace distance with its mean and drop Count 
df_merge_result['Distance'] = df_merge_result['Distance']/df_merge_result['Count'] 
del df_merge_result['Count'] 
df_merge_result