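Python: fast iteration over an np.array

I have a 1D np array with more than 150 million data points, filled from a binary data file using np.fromfile.

Given that array, I need to add a value 'val' to every point unless that point equals 'x'.

In addition, every value in the array will (depending on its value) correspond to another value that I want to store in a second list.

Explanation of variables:

**temps = np.arange(-30.00, 0.01, 0.01, dtype='float32')

**slr is a list; index 0 in temps corresponds to index 0 in slr, and so on. Both lists have the same length.

Here is my current code: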

import sys
import numpy as np

with open("file.dat", "rb") as f:
    array = np.fromfile(f, dtype=np.float32)
    f.close()

#This is the process below that I need to speed up

T_SLR = np.array(np.zeros(len(array), dtype='Float64'))
for i in range(0,len(array)):
    if array[i] != float(-9.99e+08):
        array[i] = array[i] - 273.15
    if array[i] in temps:
        index, = np.where(temps==array[i])[0]
        T_SLR = slr[index]
    else:
        T_SLR[i] = 0.00

Source

2015-12-03 user2938093

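It looks like your sensor may only return values in 0.01 degree increments. Is that right? And if so, was 'temps' chosen to cover every temperature between -30 and 0, or do you really want samples without a hundredths fraction to go into 'T_SLR'? –

Yes, temps should hold every 0.01 step from -30 to 0. Each temperature value there corresponds to an slr value in the list slr. T_SLR is a new list (it will have the same length as 'array'). The values of array are compared against temps, and if a value is found there, its index is taken. That index is used to pull the value from slr, which is then appended to T_SLR – user2938093

The slowest point in your code is the O(n) traversal of the list: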

if array[i] in temps: 
    index, = np.where(temps==array[i])[0]

Since temps is not very large, you can convert it to a dict:

temps2 = dict(zip(temps, range(len(temps))))

and make the lookup O(1):

if array[i] in temps2: 
    index = temps2[array[i]]

You can also try to avoid the for loop to speed things up. For example, the following code:

for i in range(0,len(array)): 
    if array[i] != float(-9.99e+08): 
     array[i] = array[i] - 273.15

can be written as:

array[array!=float(-9.99e+08)] -= 273.15

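Another problem in your code is the float comparison. You should not use the exact-equality operators == or != on floats; try numpy.isclose, or convert the floats to int.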
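A minimal sketch of both suggestions, assuming the temps table from the question and a single made-up reading (the names `value`, `idx_close` and `idx_int` are only for illustration). The tolerance `atol=1e-4` is an assumption, picked to be well below the 0.01 step:

```python
import numpy as np

# Lookup table from the question
temps = np.arange(-30.00, 0.01, 0.01, dtype='float32')

# Hypothetical single reading, already in degrees Celsius
value = np.float32(-29.99)

# Option 1: tolerance-based comparison instead of ==
idx_close = np.flatnonzero(np.isclose(temps, value, atol=1e-4))

# Option 2: scale to hundredths of a degree and round to integers,
# so that equality tests become exact integer comparisons
temps_int = np.round(temps * 100).astype(np.int64)
value_int = int(np.round(value * 100))
idx_int = np.flatnonzero(temps_int == value_int)
```

Both approaches should find the same position in temps; the integer version has the side benefit that the scaled value can be turned into an index directly.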

Source

2015-12-03 02:08:46 eph

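Since your selection criteria appear to be pointwise, there is no reason to read all 150 million points at once. You can use the count argument of np.fromfile to limit the size of the array you compare at a time. Once you process chunks larger than a few thousand points, the overhead of the loop over chunks stops mattering, and you will not hog memory with one giant array of all 150 million points.

slr and temps look like an index translation table. You can replace the search in temps with a float comparison and a computed lookup. Since -9.99e+8 is clearly outside the search criteria, you do not need any special handling for those points.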

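import os
import numpy as np

# slr is the lookup table from the question; it must be a NumPy array here
N = 10000  # number of float32 values read per chunk
with open("file.dat", "rb") as f:
    size_of_TMPprs = os.fstat(f.fileno()).st_size  # file size in bytes
    T_SLR = np.zeros(size_of_TMPprs // 4, dtype=np.float64)  # 4 bytes per float32
    t_off = 0
    array = np.fromfile(f, count=N, dtype=np.float32)
    while array.size > 0:
        array -= 273.15
        index = np.where((array >= -30) & (array <= 0))[0]
        # np.round returns floats, so cast to int before indexing slr
        T_SLR[t_off+index] = slr[np.round((array[index]+30)*100).astype(int)]
        t_off += array.size
        array = np.fromfile(f, count=N, dtype=np.float32)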

If you want T_SLR to contain the last entry in slr when the measured value exceeds zero, you can simplify this even more. In that case you can use

array = np.maximum(np.minimum(array, 0), -30)

to limit the range of the values in array, and then simply use it to compute the index into slr as above (without using where in this case).
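A small sketch of the clamped variant, under some assumptions: slr is a NumPy array with 3001 entries (a made-up stand-in table is used here), and np.clip is used as an equivalent of the np.maximum/np.minimum pair:

```python
import numpy as np

# Stand-in lookup table with hypothetical values:
# one entry per 0.01 degree from -30 to 0
slr = np.linspace(5.0, 20.0, 3001)

# Hypothetical chunk of readings, already converted to degrees Celsius
array = np.array([-45.0, -30.0, -12.34, 0.0, 7.5], dtype=np.float32)

# Clamp to [-30, 0]; same effect as np.maximum(np.minimum(array, 0), -30)
clamped = np.clip(array, -30, 0)

# Compute the table index directly; np.round returns floats, so cast to int
index = np.round((clamped + 30) * 100).astype(int)
T_SLR = slr[index]
```

Note that with clamping, the -9.99e+08 fill value also maps to the first slr entry, so it would need to be masked separately if that is not what you want.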

Source

2015-12-03 02:47:00

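I used "os.fstat(f.fileno()).st_size" in place of size_of_TMPprs, but got the following error: TypeError: only length-1 arrays can be converted to Python scalars, on the line T_SLR[t_off+index] = slr[int((array[index]+30)*100)] – user2938093

Sorry! int() should be np.round(), which returns an array value that can be used as an index. I have changed it in the answer. –

I also noticed that a float32 has 4 bytes, not 32 as I originally calculated. The answer has been changed accordingly. –

When you use with open, do not close the file yourself. The with context does that automatically. I also changed the generic array name to something with less risk of shadowing something else (like np.array?)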

with open("file.dat", "rb") as f: 
    data = np.fromfile(f, dtype=np.float32)

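First, there is no need to wrap np.zeros in np.array. It is already an array. len(data) is fine if data is 1D, but I prefer to work with the shape tuple.

T_SLR = np.zeros(data.shape, dtype='float64')

Boolean indexing/masking lets you act on the whole array at once:

mask = data != -9.99e8  # don't need `float` here
                        # using a != test with floats is a poor idea
data[mask] -= 273.15

I need to refine the != test. It is fine for integers, but not for floats. Something like np.abs(data+9.99e8)>1 is better.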
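As a rough sketch of that tolerance test, with a few made-up readings (the name `FILL` is introduced here only for illustration; the threshold of 1 is the one suggested above):

```python
import numpy as np

FILL = -9.99e8  # missing-data marker from the question

# Hypothetical readings in Kelvin, including one fill value
data = np.array([273.15, FILL, 250.0], dtype=np.float32)

# Tolerance-based test instead of `!=`: anything further than 1 away
# from the fill value counts as a real measurement
mask = np.abs(data - FILL) > 1

# Convert only the real measurements to degrees Celsius, in place
data[mask] -= 273.15
```

The fill value stays untouched, and only the real measurements are shifted.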

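Similarly, in is not a good test with floats. And with integers, in and where do redundant work.

Assuming temps is 1D, np.where(...) returns a 1-element tuple. [0] selects that element, returning an array. The , in index, is then redundant. index, = np.where() without the [0] should have worked.

T_SLR[i] is already 0 from the initialization of the array. There is no need to set it again.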

for i in range(0,len(array)): 
    if array[i] in temps: 
     index, = np.where(temps==array[i])[0] 
     T_SLR = slr[index] 
    else: 
     T_SLR[i] = 0.00

I think we can get rid of this iteration as well, but I will come back to that later.

In [461]: temps=np.arange(-30.00,0.01,0.01, dtype='float32') 
In [462]: temps 
Out[462]: 
array([ -3.00000000e+01, -2.99899998e+01, -2.99799995e+01, ..., 
     -1.93138123e-02, -9.31358337e-03, 6.86645508e-04], dtype=float32) 
In [463]: temps.shape 
Out[463]: (3001,)

No wonder that doing array[i] in temps and np.where(temps==array[i]) is slow.

We can cut out the in test by looking at the result of where:

In [464]: np.where(temps==12.34) 
Out[464]: (array([], dtype=int32),) 
In [465]: np.where(temps==temps[3]) 
Out[465]: (array([3], dtype=int32),)

If there is no match, where returns an empty array.

In [466]: idx,=np.where(temps==temps[3]) 
In [467]: idx.shape 
Out[467]: (1,) 
In [468]: idx,=np.where(temps==123.34) 
In [469]: idx.shape 
Out[469]: (0,)

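in can be faster than where if the match is early in the list, but it is just as slow, if not slower, when the match is at the end or there is no match at all.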

In [478]: timeit np.where(temps==temps[-1])[0].shape[0]>0 
10000 loops, best of 3: 35.6 µs per loop 
In [479]: timeit temps[-1] in temps 
10000 loops, best of 3: 39.9 µs per loop

A rounding approach:

In [487]: (np.round(temps,2)/.01).astype(int) 
Out[487]: array([-3000, -2999, -2998, ..., -2, -1,  0])

The adjustment I suggest:

T_SLR = -(np.round(data, 2)/.01).astype(int)
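To show how a rounded integer index could replace the loop entirely, here is a hedged sketch. It assumes slr is a NumPy array with one entry per 0.01 degree from -30 to 0 (a made-up stand-in is used), and it rounds every reading to the nearest hundredth, which is not quite the same as the exact-match test in the question:

```python
import numpy as np

# Stand-in lookup table with hypothetical values, same length as temps
slr = np.linspace(5.0, 20.0, 3001)

# Hypothetical readings in degrees Celsius; the last one is the fill value
data = np.array([-30.0, -29.99, -15.5, 0.0, 12.0, -9.99e8], dtype=np.float32)

# Position in the table: hundredths of a degree above -30.
# Work in float64 to limit round-off from the float32 input.
idx = np.round((data.astype(np.float64) + 30.0) * 100).astype(np.int64)

# Only indices that fall inside the table get a value; the rest stay 0
valid = (idx >= 0) & (idx < slr.size)
T_SLR = np.zeros(data.shape, dtype=np.float64)
T_SLR[valid] = slr[idx[valid]]
```

Out-of-range readings and the fill value simply fail the bounds check, so they need no special handling.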

Source

2015-12-03 03:44:40 hpaulj

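Hi, thanks. I have incorporated your changes and I follow the detailed reply. However, it is the for loop that I need to eliminate. Iterating over every index of the 'data' array is very slow and often crashes the kernel. Any help on this would be much appreciated. – user2938093

Look at the shape of 'temps'. It is big. We need to think of a better way of testing, or of mapping 'data' values to 'indices'. – hpaulj

The shape of temps is (3001,0). For reference, this is starting to teeter at the edge of my Python knowledge. Because I have only worked with smaller files so far, I have been able to get by with the crude method above. – user2938093

Since temps is sorted, you can use np.searchsorted and avoid any explicit loop: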

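array[array != float(-9.99e+08)] -= 273.15
indices = np.searchsorted(temps, array)
# Keep only indices that are in bounds for temps
mask = indices < temps.shape[0]
# Of those, keep only the entries that match a temps value exactly
mask[mask] &= temps[indices[mask]] == array[mask]
# Look up slr where there is a match; everything else stays 0
# (slr must be a NumPy array for this indexing to work)
T_SLR = np.zeros(array.shape, dtype=np.float64)
T_SLR[mask] = slr[indices[mask]]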
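To see how a searchsorted lookup behaves, here is a small self-contained sketch with tiny made-up tables (the values of temps, slr and array below are hypothetical, and the names `in_bounds` and `match` are introduced only for this example):

```python
import numpy as np

# Tiny stand-in tables; temps must be sorted for searchsorted to work
temps = np.array([-3.0, -2.0, -1.0, 0.0], dtype=np.float32)
slr = np.array([10.0, 11.0, 12.0, 13.0])

# Hypothetical readings: two exact matches, one in-between value,
# one above the table, and the fill value
array = np.array([-2.0, -1.5, 0.0, 5.0, -9.99e8], dtype=np.float32)

# Insertion positions that would keep temps sorted
indices = np.searchsorted(temps, array)

# Positions equal to temps.size are past the end of the table
in_bounds = indices < temps.size

# An entry is a real match only if the table value at that position is equal
match = np.zeros(array.shape, dtype=bool)
match[in_bounds] = temps[indices[in_bounds]] == array[in_bounds]

# Clip the indices so the lookup is always valid, then zero out non-matches
T_SLR = np.where(match, slr[np.minimum(indices, temps.size - 1)], 0.0)
```

Only the readings that coincide exactly with a temps entry get an slr value; everything else is set to 0, which mirrors the behaviour of the original loop.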

Source

2015-12-03 05:38:11 Jaime
