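Difference between every row and column in two DataFrames (Python / Pandas)

Is there a more efficient way to compare every column in every row of one DF with every column in every row of another DF? This feels sloppy to me, but my loop / apply attempts have been much slower.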

import numpy as np
import pandas as pd

df1 = pd.DataFrame({'a': np.random.randn(1000),
        'b': [1, 2] * 500, 
        'c': np.random.randn(1000)}, 
        index=pd.date_range('1/1/2000', periods=1000)) 
df2 = pd.DataFrame({'a': np.random.randn(100), 
       'b': [2, 1] * 50, 
       'c': np.random.randn(100)}, 
       index=pd.date_range('1/1/2000', periods=100)) 
df1 = df1.reset_index() 
df1['embarrassingHackInd'] = 0 
df1.set_index('embarrassingHackInd', inplace=True) 
df1.rename(columns={'index':'origIndex'}, inplace=True) 
df1['df1Date'] = df1.origIndex.astype(np.int64) // 10**9 
df1['df2Date'] = 0 
df2 = df2.reset_index() 
df2['embarrassingHackInd'] = 0 
df2.set_index('embarrassingHackInd', inplace=True) 
df2.rename(columns={'index':'origIndex'}, inplace=True) 
df2['df2Date'] = df2.origIndex.astype(np.int64) // 10**9 
df2['df1Date'] = 0 
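%timeit df3 = abs(df1-df2)

10 loops, best of 3: 60.6 ms per loop

I need to know which rows were compared, hence the ugliness of adding each frame's original index as a column to the DFs being compared, so that it ends up in the final DF.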

Thanks in advance for any assistance.

Source

2014-08-31 howMuchCheeseIsTooMuchCheese

I forgot to mention, my actual DFs have a few million rows and dozens of columns to compare. At that scale, the apply attempts take hours. – howMuchCheeseIsTooMuchCheese 2014-08-31 21:37:02

See: http://stackoverflow.com/questions/17095101/outputting-difference-in-two-pandas-dataframes-side-by-side-highlighting-the-d – EdChum 2014-08-31 22:07:31

@EdChum Yes, I saw that one. It determines what changed between two DFs, not the numerical difference. – howMuchCheeseIsTooMuchCheese 2014-09-10 16:24:53

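The code you posted shows a clever way to produce a subtraction table. However, it doesn't play to Pandas' strengths. Pandas DataFrames store the underlying data in column-based blocks, so data retrieval is fastest when done by column, not by row. Since all the rows have the same index, the subtraction is performed by row (pairing each row with every other row), which means there is a lot of row-based data retrieval going on in df1-df2. That is not ideal for Pandas, particularly when not all the columns have the same dtype.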
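As a rough illustration of the dtype point (the tiny frame below is made up for this sketch and is not taken from the question): pulling a column out of a DataFrame keeps that column's dtype, while pulling a row out of a frame with mixed dtypes forces all the values into one common dtype.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame: one float column and one string column.
df = pd.DataFrame({'a': [0.5, 1.5], 'label': ['x', 'y']})

col = df['a']      # column access: the values keep their float64 dtype
row = df.iloc[0]   # row access: values from both columns land in one Series

print(col.dtype)   # float64
print(row.dtype)   # object
```

Arithmetic on object-dtype rows generally cannot use NumPy's fast numeric loops, which is one plausible reason the row-by-row subtraction is slow.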

Subtraction tables are something NumPy is good at:

In [5]: x = np.arange(10) 

In [6]: y = np.arange(5) 

In [7]: x[:, np.newaxis] - y 
Out[7]: 
array([[ 0, -1, -2, -3, -4],
       [ 1,  0, -1, -2, -3],
       [ 2,  1,  0, -1, -2],
       [ 3,  2,  1,  0, -1],
       [ 4,  3,  2,  1,  0],
       [ 5,  4,  3,  2,  1],
       [ 6,  5,  4,  3,  2],
       [ 7,  6,  5,  4,  3],
       [ 8,  7,  6,  5,  4],
       [ 9,  8,  7,  6,  5]])

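You can think of x as a column of df1, and y as a column of df2. You'll see below that NumPy can handle all the columns of df1 and all the columns of df2 in basically the same way, using basically the same syntax.

The code below defines orig and using_numpy. orig is the code you posted; using_numpy is an alternative that performs the subtraction on NumPy arrays: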

In [2]: %timeit orig(df1.copy(), df2.copy()) 
10 loops, best of 3: 96.1 ms per loop 

In [3]: %timeit using_numpy(df1.copy(), df2.copy()) 
10 loops, best of 3: 19.9 ms per loop

import numpy as np 
import pandas as pd 
N = 100 
df1 = pd.DataFrame({'a': np.random.randn(10*N), 
        'b': [1, 2] * 5*N, 
        'c': np.random.randn(10*N)}, 
        index=pd.date_range('1/1/2000', periods=10*N)) 
df2 = pd.DataFrame({'a': np.random.randn(N), 
       'b': [2, 1] * (N//2), 
       'c': np.random.randn(N)}, 
       index=pd.date_range('1/1/2000', periods=N)) 

def orig(df1, df2): 
    df1 = df1.reset_index() # 312 µs per loop 
    df1['embarrassingHackInd'] = 0 # 75.2 µs per loop 
    df1.set_index('embarrassingHackInd', inplace=True) # 526 µs per loop 
    df1.rename(columns={'index':'origIndex'}, inplace=True) # 209 µs per loop 
    df1['df1Date'] = df1.origIndex.astype(np.int64) // 10**9 # 23.1 µs per loop 
    df1['df2Date'] = 0 

    df2 = df2.reset_index() 
    df2['embarrassingHackInd'] = 0 
    df2.set_index('embarrassingHackInd', inplace=True) 
    df2.rename(columns={'index':'origIndex'}, inplace=True) 
    df2['df2Date'] = df2.origIndex.astype(np.int64) // 10**9 
    df2['df1Date'] = 0 
    df3 = abs(df1-df2) # 88.7 ms per loop <-- this is the bottleneck 
    return df3 

def using_numpy(df1, df2): 
    df1.index.name = 'origIndex' 
    df2.index.name = 'origIndex' 
    df1.reset_index(inplace=True) 
    df2.reset_index(inplace=True) 
    df1_date = df1['origIndex'] 
    df2_date = df2['origIndex'] 
    df1['origIndex'] = df1_date.astype(np.int64) 
    df2['origIndex'] = df2_date.astype(np.int64) 

    arr1 = df1.values 
    arr2 = df2.values 
    arr3 = np.abs(arr1[:,np.newaxis,:]-arr2) # 3.32 ms per loop vs 88.7 ms 
    arr3 = arr3.reshape(-1, 4) 
    index = pd.MultiIndex.from_product(
     [df1_date, df2_date], names=['df1Date', 'df2Date']) 
    result = pd.DataFrame(arr3, index=index, columns=df1.columns) 
    # You could stop here, but the rest makes the result more similar to orig 
    result.reset_index(inplace=True, drop=False) 
    result['df1Date'] = result['df1Date'].astype(np.int64) // 10**9 
    result['df2Date'] = result['df2Date'].astype(np.int64) // 10**9 
    return result 

def is_equal(expected, result): 
    expected.reset_index(inplace=True, drop=True) 
    result.reset_index(inplace=True, drop=True) 

    # expected has dtypes 'O', while result has some float and int dtypes. 
    # Make all the dtypes float for a quick and dirty comparison check 
    expected = expected.astype('float') 
    result = result.astype('float') 
    columns = ['a','b','c','origIndex','df1Date','df2Date'] 
    return expected[columns].equals(result[columns]) 

expected = orig(df1.copy(), df2.copy()) 
result = using_numpy(df1.copy(), df2.copy()) 
assert is_equal(expected, result)

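How x[:, np.newaxis] - y works:

This expression takes advantage of NumPy broadcasting. To understand broadcasting, and NumPy in general, it pays to know the shapes of the arrays: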

In [6]: x.shape 
Out[6]: (10,) 

In [7]: x[:, np.newaxis].shape 
Out[7]: (10, 1) 

In [8]: y.shape 
Out[8]: (5,)

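The [:, np.newaxis] adds a new axis to x on the right, so the shape becomes (10, 1). So x[:, np.newaxis] - y subtracts an array of shape (5,) from an array of shape (10, 1).

On the face of it, that doesn't make sense, but NumPy arrays broadcast their shapes according to certain rules to try to make them compatible.

The first rule is that new axes can be added on the left. So an array of shape (5,) can broadcast itself to shape (1, 5).

The next rule is that an axis of length 1 can broadcast itself to any length. The values in the array are simply repeated as often as needed along the extra dimension.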
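The two rules can be spelled out by hand. The sketch below is only an illustration: the names y_2d, y_repeated and y_broadcast are introduced here, and np.repeat makes a real copy whereas broadcasting itself does not copy data.

```python
import numpy as np

y = np.arange(5)                           # shape (5,)

# Rule 1: a new axis may be added on the left, giving shape (1, 5).
y_2d = y[np.newaxis, :]

# Rule 2: the length-1 axis is repeated to the required length (10 here).
y_repeated = np.repeat(y_2d, 10, axis=0)   # shape (10, 5)

# np.broadcast_to applies both rules in a single call.
y_broadcast = np.broadcast_to(y, (10, 5))

print(y_2d.shape)                               # (1, 5)
print(y_repeated.shape)                         # (10, 5)
print(np.array_equal(y_repeated, y_broadcast))  # True
```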

So when arrays of shape (10, 1) and (1, 5) are combined in a NumPy arithmetic operation, they are both broadcast up to arrays of shape (10, 5):

In [14]: broadcasted_x, broadcasted_y = np.broadcast_arrays(x[:, np.newaxis], y) 

In [15]: broadcasted_x 
Out[15]: 
array([[0, 0, 0, 0, 0],
       [1, 1, 1, 1, 1],
       [2, 2, 2, 2, 2],
       [3, 3, 3, 3, 3],
       [4, 4, 4, 4, 4],
       [5, 5, 5, 5, 5],
       [6, 6, 6, 6, 6],
       [7, 7, 7, 7, 7],
       [8, 8, 8, 8, 8],
       [9, 9, 9, 9, 9]])

In [16]: broadcasted_y 
Out[16]: 
array([[0, 1, 2, 3, 4],
       [0, 1, 2, 3, 4],
       [0, 1, 2, 3, 4],
       [0, 1, 2, 3, 4],
       [0, 1, 2, 3, 4],
       [0, 1, 2, 3, 4],
       [0, 1, 2, 3, 4],
       [0, 1, 2, 3, 4],
       [0, 1, 2, 3, 4],
       [0, 1, 2, 3, 4]])

So x[:, np.newaxis] - y is equivalent to broadcasted_x - broadcasted_y.
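This equivalence is easy to check directly. A minimal sketch, where the names implicit and explicit are introduced only for this check:

```python
import numpy as np

x = np.arange(10)
y = np.arange(5)

# Explicitly expand both operands to the common shape (10, 5).
broadcasted_x, broadcasted_y = np.broadcast_arrays(x[:, np.newaxis], y)

implicit = x[:, np.newaxis] - y            # broadcasting happens behind the scenes
explicit = broadcasted_x - broadcasted_y   # operands already have shape (10, 5)

print(implicit.shape)                      # (10, 5)
print(np.array_equal(implicit, explicit))  # True
```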

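Now, with this simpler example under our belt, we can look at arr1[:,np.newaxis,:]-arr2.

arr1 has shape (1000, 4) and arr2 has shape (100, 4). We want to subtract the items along the axis of length 4, for each row along the 1000-length axis and each row along the 100-length axis. In other words, we want the subtraction to form an array of shape (1000, 100, 4).

Importantly, we don't want the 1000-axis to interact with the 100-axis. We want them on separate axes.

So if we add an axis to arr1 like this: arr1[:,np.newaxis,:], then its shape becomes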

In [22]: arr1[:, np.newaxis, :].shape 
Out[22]: (1000, 1, 4)

And now NumPy broadcasting pumps up both arrays to the common shape (1000, 100, 4). Voila, a subtraction table.
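The same thing can be seen on a smaller scale. In this sketch the arrays are hypothetical stand-ins with 3 and 2 rows rather than 1000 and 100:

```python
import numpy as np

# Hypothetical small stand-ins for arr1 (3 rows) and arr2 (2 rows), 4 columns each.
arr1 = np.arange(12, dtype=float).reshape(3, 4)
arr2 = np.arange(8, dtype=float).reshape(2, 4)

# Shapes (3, 1, 4) and (2, 4) broadcast to (3, 2, 4).
table = np.abs(arr1[:, np.newaxis, :] - arr2)
print(table.shape)   # (3, 2, 4)

# table[i, j] holds |arr1[i] - arr2[j]| for every pair of rows.
print(np.array_equal(table[2, 1], np.abs(arr1[2] - arr2[1])))  # True
```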

To massage the values into a 2D array of shape (1000*100, 4) that can back a DataFrame, we can use reshape:

arr3 = arr3.reshape(-1, 4)

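The -1 tells NumPy to replace -1 with whatever positive integer is needed for the reshape to make sense. Since arr3 has 1000*100*4 values, the -1 is replaced with 1000*100. Using -1 is nicer than writing 1000*100 because the code keeps working even if we change the number of rows in df1 and df2.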
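After the reshape, it helps to know which pair of rows each flattened row came from. The following sketch uses made-up sizes (n1, n2 and ncols are names introduced here) and assumes NumPy's default C order, under which the rows line up with the order produced by pd.MultiIndex.from_product:

```python
import numpy as np
import pandas as pd

n1, n2, ncols = 3, 2, 4   # hypothetical sizes standing in for 1000, 100 and 4
table = np.arange(n1 * n2 * ncols).reshape(n1, n2, ncols)

flat = table.reshape(-1, ncols)   # -1 is resolved to n1 * n2
print(flat.shape)                 # (6, 4)

# Row k of flat corresponds to the pair (i, j) = divmod(k, n2).
k = 5
i, j = divmod(k, n2)
print(np.array_equal(flat[k], table[i, j]))   # True

# MultiIndex.from_product enumerates the (i, j) pairs in the same order.
index = pd.MultiIndex.from_product([range(n1), range(n2)], names=['i', 'j'])
print(index[k] == (i, j))                     # True
```

This is why using_numpy can pair the reshaped array with a product index of the two date columns without any explicit loop.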

Source

2014-09-01 01:57:36 unutbu

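Could you explain how 'x[:, np.newaxis]' works? I know that x[:] just slices the entire table, but I don't understand what is going on with 'np.newaxis' (and how). And what is this syntax? Is it specific to this case, or can it be used in other ways? – 2014-09-01 11:33:45

I've added an explanation of how 'x[:, np.newaxis] - y' works. – unutbu 2014-09-01 12:29:59

Thanks a lot, that explains a lot. – 2014-09-01 13:24:20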
