2017-02-13 65 views
4

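Treat nan as zero in numpy array summation, except where all arrays have nan

I have two numpy arrays, NS and EW, to sum up. Each of them has missing values at different positions, like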

NS = array([[ 1., 2., nan],
            [ 4., 5., nan],
            [ 6., nan, nan]])
EW = array([[ 1., 2., nan],
            [ 4., nan, nan],
            [ 6., nan, 9.]])

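How can I carry out the summation in a numpy way that treats nan as zero if only one array has nan at a position, and keeps nan if both arrays have nan at the same position?

The result I expect to see is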

SUM = array([[ 2., 4., nan], 
      [ 8., 5., nan], 
      [ 12., nan, 9.]]) 

When I try

SUM=np.add(NS,EW) 

it gives me

SUM=array([[ 2., 4., nan], 
     [ 8., nan, nan], 
     [ 12., nan, nan]]) 

When I try

SUM = np.nansum(np.dstack((NS,EW)),2) 

it gives me

SUM=array([[ 2., 4., 0.], 
     [ 8., 5., 0.], 
     [ 12., 0., 9.]]) 
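
The two attempts fail for opposite reasons: np.add propagates nan whenever either operand is nan, while np.nansum turns every nan into zero, so a position that is nan in both arrays comes out as 0. instead of nan. A minimal sketch reproducing both behaviours on the data above (the names `plain`, `zeroed` and `both_nan` are illustrative and not from the original post):

```python
import numpy as np

nan = np.nan
NS = np.array([[1., 2., nan],
               [4., 5., nan],
               [6., nan, nan]])
EW = np.array([[1., 2., nan],
               [4., nan, nan],
               [6., nan, 9.]])

# nan in either operand propagates to the result.
plain = np.add(NS, EW)

# Every nan is treated as zero, including positions where both inputs are nan.
zeroed = np.nansum(np.dstack((NS, EW)), axis=2)

# Positions where both inputs are nan: the ones nansum reports as 0.
both_nan = np.isnan(NS) & np.isnan(EW)
```

The mask `both_nan` is the missing piece: it marks exactly the entries of `zeroed` that should be set back to nan.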

Of course, I can achieve my goal with element-wise operations,

SUM = np.empty_like(NS)  # output array; must exist before the loop fills it
for i in range(np.size(NS,0)):
    for j in range(np.size(NS,1)):
        if np.isnan(NS[i,j]) and np.isnan(EW[i,j]):
            SUM[i,j] = np.nan
        elif np.isnan(NS[i,j]):
            SUM[i,j] = EW[i,j]
        elif np.isnan(EW[i,j]):
            SUM[i,j] = NS[i,j]
        else:
            SUM[i,j] = NS[i,j]+EW[i,j]

but it is very slow. So I'm looking for a more idiomatic (vectorized) numpy solution to this problem.

Thanks for your help!

Answers

4

Approach #1: One approach with np.where -

def sum_nan_arrays(a,b): 
    ma = np.isnan(a) 
    mb = np.isnan(b) 
    return np.where(ma&mb, np.nan, np.where(ma,0,a) + np.where(mb,0,b)) 

Sample run -

In [43]: NS 
Out[43]: 
array([[ 1., 2., nan], 
     [ 4., 5., nan], 
     [ 6., nan, nan]]) 

In [44]: EW 
Out[44]: 
array([[ 1., 2., nan], 
     [ 4., nan, nan], 
     [ 6., nan, 9.]]) 

In [45]: sum_nan_arrays(NS, EW) 
Out[45]: 
array([[ 2., 4., nan], 
     [ 8., 5., nan], 
     [ 12., nan, 9.]]) 

Approach #2: A possibly faster one that mixes in boolean-indexing -

def sum_nan_arrays_v2(a,b): 
    ma = np.isnan(a) 
    mb = np.isnan(b) 
    m_keep_a = ~ma & mb 
    m_keep_b = ma & ~mb 
    out = a + b 
    out[m_keep_a] = a[m_keep_a] 
    out[m_keep_b] = b[m_keep_b] 
    return out 
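
Why this works: `a + b` is already correct wherever both inputs are numbers and wherever both are nan, so only the positions where exactly one input is nan need to be overwritten with the value from the other array. A self-contained check against the expected result from the question, as a sketch (the function is repeated so the block runs on its own; `expected` is copied from the question):

```python
import numpy as np

def sum_nan_arrays_v2(a, b):
    # Same logic as approach #2, repeated so this block is runnable alone.
    ma = np.isnan(a)
    mb = np.isnan(b)
    out = a + b                    # nan wherever either input is nan
    out[~ma & mb] = a[~ma & mb]    # only b is nan: keep a's value
    out[ma & ~mb] = b[ma & ~mb]    # only a is nan: keep b's value
    return out

nan = np.nan
NS = np.array([[1., 2., nan],
               [4., 5., nan],
               [6., nan, nan]])
EW = np.array([[1., 2., nan],
               [4., nan, nan],
               [6., nan, 9.]])
expected = np.array([[2., 4., nan],
                     [8., 5., nan],
                     [12., nan, 9.]])

result = sum_nan_arrays_v2(NS, EW)
# assert_array_equal treats nans in matching positions as equal.
np.testing.assert_array_equal(result, expected)
```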

Runtime test -

In [140]: # Setup input arrays with 4/9 ratio of NaNs (same as in the question) 
    ...: a = np.random.rand(3000,3000) 
    ...: b = np.random.rand(3000,3000) 
    ...: a.ravel()[np.random.choice(range(a.size), size=4000000, replace=0)] = np.nan 
    ...: b.ravel()[np.random.choice(range(b.size), size=4000000, replace=0)] = np.nan 
    ...: 

In [141]: np.nanmax(np.abs(sum_nan_arrays(a, b) - sum_nan_arrays_v2(a, b))) # Verify 
Out[141]: 0.0 

In [142]: %timeit sum_nan_arrays(a, b) 
10 loops, best of 3: 141 ms per loop 

In [143]: %timeit sum_nan_arrays_v2(a, b) 
10 loops, best of 3: 177 ms per loop 

In [144]: # Setup input arrays with lesser NaNs 
    ...: a = np.random.rand(3000,3000) 
    ...: b = np.random.rand(3000,3000) 
    ...: a.ravel()[np.random.choice(range(a.size), size=4000, replace=0)] = np.nan 
    ...: b.ravel()[np.random.choice(range(b.size), size=4000, replace=0)] = np.nan 
    ...: 

In [145]: np.nanmax(np.abs(sum_nan_arrays(a, b) - sum_nan_arrays_v2(a, b))) # Verify 
Out[145]: 0.0 

In [146]: %timeit sum_nan_arrays(a, b) 
10 loops, best of 3: 69.6 ms per loop 

In [147]: %timeit sum_nan_arrays_v2(a, b) 
10 loops, best of 3: 38 ms per loop 
+0

It works perfectly, and it is also 200 times faster than the element-wise operation I was using. Thanks for your help! – Superstar

1

I think we can be a bit more concise, in the same vein as Divakar's second approach. With a = NS and b = EW:

na = numpy.isnan(a) 
nb = numpy.isnan(b) 
a[na] = 0 
b[nb] = 0 
a += b 
na &= nb 
a[na] = numpy.nan 

The operations are done in place where possible to save memory, assuming that is feasible in your scenario. The final result is in a.
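
Note that the in-place snippet overwrites both inputs: a ends up holding the result and the nans in b become zeros. If the inputs must be preserved, one possible variant is sketched below under the hypothetical name `sum_nan_keep_inputs` (not part of the original answer); it trades the memory saving for temporary arrays:

```python
import numpy as np

def sum_nan_keep_inputs(a, b):
    """Hypothetical helper: nan-as-zero sum that does not modify a or b."""
    na = np.isnan(a)
    nb = np.isnan(b)
    out = np.where(na, 0.0, a)     # new array, nans in a replaced by 0
    out += np.where(nb, 0.0, b)    # add b with its nans replaced by 0
    out[na & nb] = np.nan          # restore nan where both inputs were nan
    return out

nan = np.nan
NS = np.array([[1., 2., nan],
               [4., 5., nan],
               [6., nan, nan]])
EW = np.array([[1., 2., nan],
               [4., nan, nan],
               [6., nan, 9.]])

SUM = sum_nan_keep_inputs(NS, EW)  # NS and EW still contain their nans
```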

+0

Yes, using less memory is preferable, since the computation may be performed on large matrices. I will switch to your solution in my code. Thanks! – Superstar

2

Actually, your nansum approach almost worked; you just need to put the nans back in afterwards:

def add_ignore_nans(a, b): 
    stacked = np.array([a, b]) 
    res = np.nansum(stacked, axis=0) 
    res[np.all(np.isnan(stacked), axis=0)] = np.nan 
    return res 

>>> add_ignore_nans(a, b) 
array([[ 2., 4., nan], 
     [ 8., 5., nan], 
     [ 12., nan, 9.]]) 

This will be slower than @Divakar's answer, but I wanted to mention that you were very close! :-)
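
One thing the nansum route offers is that it extends to more than two arrays without extra logic. A sketch, assuming all inputs are float arrays of the same shape; the function name `add_ignore_nans_many` and the third array `UD` are made up for illustration:

```python
import numpy as np

def add_ignore_nans_many(*arrays):
    """Hypothetical generalization: nan only where every input is nan."""
    stacked = np.stack(arrays)                    # shape (n_arrays, rows, cols)
    res = np.nansum(stacked, axis=0)              # nans counted as zero
    res[np.isnan(stacked).all(axis=0)] = np.nan   # all-nan positions stay nan
    return res

nan = np.nan
NS = np.array([[1., 2., nan],
               [4., 5., nan],
               [6., nan, nan]])
EW = np.array([[1., 2., nan],
               [4., nan, nan],
               [6., nan, 9.]])
UD = np.array([[nan, 1., nan],   # illustrative third array
               [1., nan, nan],
               [nan, nan, nan]])

total = add_ignore_nans_many(NS, EW, UD)
```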

+0

I see, I was missing an extra logical-and statement to filter the indices. Thanks for your help! – Superstar