
Finding the indices of duplicate rows in a pandas DataFrame

What is the pandas way of finding the indices of identical rows in a given DataFrame without iterating over individual rows?

It is possible to find all the duplicated rows with unique = df[df.duplicated()], then iterate over them with unique.iterrows() and extract the indices of the matching entries with the help of pd.where(). Is there a pandas way of doing this without the loop? (A sketch of this manual approach is shown after the example below.)

Example: given a DataFrame with the following structure:

  | param_a | param_b | param_c
1 | 0       | 0       | 0
2 | 0       | 2       | 1
3 | 2       | 1       | 1
4 | 0       | 2       | 1
5 | 2       | 1       | 1
6 | 0       | 0       | 0

Output:

[(1, 6), (2, 4), (3, 5)] 
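For reference, a minimal sketch of the manual approach mentioned above, rebuilding the example frame; it uses a plain boolean comparison per representative row rather than pd.where(), and the groups come out in a different (but equivalent) order:

import pandas as pd

# Reconstruct the example DataFrame from the question
df = pd.DataFrame({'param_a': [0, 0, 2, 0, 2, 0],
                   'param_b': [0, 2, 1, 2, 1, 0],
                   'param_c': [0, 1, 1, 1, 1, 0]},
                  index=[1, 2, 3, 4, 5, 6])

# df.duplicated() flags every occurrence after the first, so `dupes` holds
# one later copy of each duplicated row.
dupes = df[df.duplicated()]
groups = []
for _, row in dupes.iterrows():
    mask = (df == row).all(axis=1)        # rows identical to this representative
    groups.append(tuple(df.index[mask]))

print(groups)   # [(2, 4), (3, 5), (1, 6)] -- same groups, different order

This loops once per duplicated row and scans the whole frame each time, which is what the answers below avoid.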

Answers


Use duplicated with keep=False to select all duplicated rows, then groupby by all columns, convert the index values of each group to a tuple, and finally convert the output Series to a list:

df = df[df.duplicated(keep=False)] 

df = df.groupby(df.columns.tolist()).apply(lambda x: tuple(x.index)).tolist() 
print (df) 
[(1, 6), (2, 4), (3, 5)] 

If you also want to see the duplicated values (this works on the filtered DataFrame from the first step, before it is converted to a list):

df1 = (df.groupby(df.columns.tolist()) 
     .apply(lambda x: tuple(x.index)) 
     .reset_index(name='idx')) 
print (df1) 
   param_a  param_b  param_c     idx
0        0        0        0  (1, 6)
1        0        2        1  (2, 4)
2        2        1        1  (3, 5)

Approach #1

Here's a vectorized approach inspired by this post -

import numpy as np

def group_duplicate_index(df):
    a = df.values
    sidx = np.lexsort(a.T)        # sort row positions so identical rows become adjacent
    b = a[sidx]

    # True where a sorted row equals its predecessor, padded with False on both ends
    m = np.concatenate(([False], (b[1:] == b[:-1]).all(1), [False]))
    idx = np.flatnonzero(m[1:] != m[:-1])   # start/stop boundaries of each duplicate run
    I = df.index[sidx].tolist()             # original index labels in sorted order
    return [I[i:j] for i,j in zip(idx[::2],idx[1::2]+1)]

Sample run -

In [42]: df 
Out[42]: 
   param_a  param_b  param_c
1        0        0        0
2        0        2        1
3        2        1        1
4        0        2        1
5        2        1        1
6        0        0        0

In [43]: group_duplicate_index(df) 
Out[43]: [[1, 6], [3, 5], [2, 4]] 
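To see how the boundary detection works, here is a rough trace of the intermediates of group_duplicate_index on the example data; the values in the comments are what each expression evaluates to for this particular frame:

import numpy as np
import pandas as pd

df = pd.DataFrame([[0, 0, 0], [0, 2, 1], [2, 1, 1], [0, 2, 1], [2, 1, 1], [0, 0, 0]],
                  columns=['param_a', 'param_b', 'param_c'], index=range(1, 7))

a = df.values
sidx = np.lexsort(a.T)                   # array([0, 5, 2, 4, 1, 3]) -- the columns act as sort
                                         # keys (last column first); equal rows end up adjacent
b = a[sidx]

# True where a sorted row equals its predecessor, padded with False on both ends
m = np.concatenate(([False], (b[1:] == b[:-1]).all(1), [False]))
idx = np.flatnonzero(m[1:] != m[:-1])    # array([0, 1, 2, 3, 4, 5]) -- run boundaries
I = df.index[sidx].tolist()              # [1, 6, 3, 5, 2, 4] -- original labels, sorted order

# pairing the boundaries slices I into one group per run of identical rows
print([I[i:j] for i, j in zip(idx[::2], idx[1::2] + 1)])   # [[1, 6], [3, 5], [2, 4]]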

Approach #2

For integer-valued dataframes, we could reduce each row to a single scalar, which lets us work with a 1D array and gives us even better performance, like so -

def group_duplicate_index_v2(df):
    a = df.values
    # encode each row as one scalar by treating its values as digits in base (max+1)
    s = (a.max()+1)**np.arange(df.shape[1])
    sidx = a.dot(s).argsort()     # sorting the scalars brings identical rows together
    b = a[sidx]

    m = np.concatenate(([False], (b[1:] == b[:-1]).all(1), [False]))
    idx = np.flatnonzero(m[1:] != m[:-1])
    I = df.index[sidx].tolist()
    return [I[i:j] for i,j in zip(idx[::2],idx[1::2]+1)]
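To see why the row-to-scalar step works: with non-negative integers, a.dot(s) reads each row's values as digits in base a.max()+1, so equal rows collapse to equal scalars and distinct rows to distinct ones. A tiny illustration (the array below is just a hand-picked subset of the example values):

import numpy as np

a = np.array([[0, 0, 0],
              [0, 2, 1],
              [2, 1, 1],
              [0, 2, 1]])
s = (a.max() + 1) ** np.arange(a.shape[1])   # base-3 place values: [1, 3, 9]
print(a.dot(s))                              # [ 0 15 14 15] -- one scalar per row; the two
                                             # identical rows (positions 1 and 3) collide, as intended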

Runtime test

Other approach(es) -

def groupby_app(df): # @jezrael's soln 
    df = df[df.duplicated(keep=False)] 
    df = df.groupby(df.columns.tolist()).apply(lambda x: tuple(x.index)).tolist() 
    return df 

Timings -

In [274]: df = pd.DataFrame(np.random.randint(0,10,(100000,3))) 

In [275]: %timeit group_duplicate_index(df) 
10 loops, best of 3: 36.1 ms per loop 

In [276]: %timeit group_duplicate_index_v2(df) 
100 loops, best of 3: 15 ms per loop 

In [277]: %timeit groupby_app(df) # @jezrael's soln 
10 loops, best of 3: 25.9 ms per loop