2016-11-29 139 views
5

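Fastest way to get the Hamming distance for integer arrays

Let a and b be vectors of the same size holding 8-bit integers (0-255). I want to count the number of bits in which those vectors differ, i.e. the Hamming distance between the vectors formed by concatenating the binary representations of those numbers. For example: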

a = [127, 255]
b = [127, 240]

Using the numpy library:

np.bitwise_xor(a,b) 
# Output: array([ 0, 15]) 

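What I need now is to take the binary representation of each element of the array above and count the number of 1s across all elements. The example above gives a Hamming distance of 0 + 4 = 4. Is there a fast and elegant solution for this in Python?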

+1

Wouldn't that be '0 + 1' instead, since 254 is all 1s except for one bit, while 255 is all 1s? – Divakar

+0

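You probably just need a standard popcount recipe, broadcast over the array, and then sum the result. You may be able to get a speedup by viewing the array's buffer as a larger dtype. – user2357112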

+0

@Divakar That was my mistake. Good catch. The numbers in the sample data are now updated to 240. –
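
The popcount suggestion in the comments can be sketched as follows. This is a minimal sketch, not code from the thread: `popcount_u8` and `hamming_popcount` are hypothetical names, the recipe is the usual parallel bit count applied to `uint8` values, and it assumes every input fits in the range 0-255.

```python
import numpy as np

def popcount_u8(x):
    """Count the set bits in each element of a uint8 array (hypothetical helper)."""
    x = np.asarray(x, dtype=np.uint8)
    x = x - ((x >> 1) & 0x55)            # each 2-bit field now holds a count 0..2
    x = (x & 0x33) + ((x >> 2) & 0x33)   # each 4-bit field now holds a count 0..4
    return (x + (x >> 4)) & 0x0F         # low nibble holds the total count 0..8

def hamming_popcount(a, b):
    """Hamming distance between two equal-length sequences of 8-bit integers."""
    diff = np.bitwise_xor(np.asarray(a, dtype=np.uint8),
                          np.asarray(b, dtype=np.uint8))
    return int(popcount_u8(diff).sum())

print(hamming_popcount([127, 255], [127, 240]))  # 4
```

The sketch works one byte at a time; viewing the buffer as a larger dtype, as the comment suggests, would need wider masks and an input length that is a multiple of the wider item size.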

Answers

6

Approach #1: We can broadcast the inputs against the eight single-bit masks and count the bits that differ, like so:

def hamming_distance(a, b): 
    r = (1 << np.arange(8))[:,None] 
    return np.count_nonzero((a & r) != (b & r)) 

Sample run:

In [144]: a = [127,255] 
    ...: b = [127,240] 
    ...: 

In [145]: hamming_distance(a, b) 
Out[145]: 4 
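
To make the broadcasting in Approach #1 easier to follow, here is a small sketch that exposes the intermediate arrays. The variable names `bits_a`, `bits_b` and `differs` are illustrative and not part of the answer; the inputs are converted to arrays up front for clarity.

```python
import numpy as np

a = np.array([127, 255])
b = np.array([127, 240])

# Column vector of the eight single-bit masks: 1, 2, 4, ..., 128.
r = (1 << np.arange(8))[:, None]      # shape (8, 1)

# Broadcasting (8, 1) against (2,) gives one row per bit position.
bits_a = (a & r) != 0                 # shape (8, 2)
bits_b = (b & r) != 0                 # shape (8, 2)
differs = bits_a != bits_b            # True where the two inputs disagree

print(differs.shape)                  # (8, 2)
print(differs.sum(axis=0))            # differing bits per element: [0 4]
print(np.count_nonzero(differs))      # 4
```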

Approach #2: Using a bitwise XOR, we can find the number of differing binary bits between a and b:

def hamming_distance_v2(a, b): 
    r = (1 << np.arange(8))[:,None] 
    return np.count_nonzero((np.bitwise_xor(a,b) & r) != 0) 
+0

Approach #2 throws an exception: a TypeError involving 'list' and 'list'. –

+0

@DebasishMitra Added a better way of doing the `xor` there. – Divakar
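
The exception mentioned in the comments is consistent with applying the `^` operator directly to two Python lists, which lists do not support. The sketch below reproduces that situation and shows one way around it, assuming the inputs may arrive as plain lists; `hamming_distance_xor` is a hypothetical name, not a function from the answer.

```python
import numpy as np

a = [127, 255]
b = [127, 240]

# Plain lists do not implement ^, so this raises TypeError.
try:
    result = a ^ b
except TypeError as exc:
    print("lists:", exc)

def hamming_distance_xor(a, b):
    """XOR-based bit count that accepts lists as well as arrays."""
    r = (1 << np.arange(8))[:, None]           # single-bit masks, shape (8, 1)
    diff = np.asarray(a) ^ np.asarray(b)        # convert first, then XOR
    return int(np.count_nonzero((diff & r) != 0))

print(hamming_distance_xor(a, b))  # 4
```

`np.bitwise_xor(a, b)` performs the same conversion internally, which is why the version in the answer accepts lists.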

1

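Maybe not the most efficient way, but the easiest IMO is to convert your output array to strings in binary form, then sum all the characters converted back to integers...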

import numpy as np 

output = np.random.randint(0,63,10) 
hamming = ['{:b}'.format(x).count('1') for x in output] 
0
sum(bin(x).count("1") for x in np.bitwise_xor(a,b)) 
4

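If you are going to call the distance function many times during one execution of your program, you can gain some speed by using a precomputed table of bit counts. Here's (yet another) version of the Hamming distance function: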

# _nbits[k] is the number of 1s in the binary representation of k for 0 <= k < 256. 
_nbits = np.array(
     [0, 1, 1, 2, 1, 2, 2, 3, 1, 2, 2, 3, 2, 3, 3, 4, 1, 2, 2, 3, 2, 3, 3, 
     4, 2, 3, 3, 4, 3, 4, 4, 5, 1, 2, 2, 3, 2, 3, 3, 4, 2, 3, 3, 4, 3, 4, 
     4, 5, 2, 3, 3, 4, 3, 4, 4, 5, 3, 4, 4, 5, 4, 5, 5, 6, 1, 2, 2, 3, 2, 
     3, 3, 4, 2, 3, 3, 4, 3, 4, 4, 5, 2, 3, 3, 4, 3, 4, 4, 5, 3, 4, 4, 5, 
     4, 5, 5, 6, 2, 3, 3, 4, 3, 4, 4, 5, 3, 4, 4, 5, 4, 5, 5, 6, 3, 4, 4, 
     5, 4, 5, 5, 6, 4, 5, 5, 6, 5, 6, 6, 7, 1, 2, 2, 3, 2, 3, 3, 4, 2, 3, 
     3, 4, 3, 4, 4, 5, 2, 3, 3, 4, 3, 4, 4, 5, 3, 4, 4, 5, 4, 5, 5, 6, 2, 
     3, 3, 4, 3, 4, 4, 5, 3, 4, 4, 5, 4, 5, 5, 6, 3, 4, 4, 5, 4, 5, 5, 6, 
     4, 5, 5, 6, 5, 6, 6, 7, 2, 3, 3, 4, 3, 4, 4, 5, 3, 4, 4, 5, 4, 5, 5, 
     6, 3, 4, 4, 5, 4, 5, 5, 6, 4, 5, 5, 6, 5, 6, 6, 7, 3, 4, 4, 5, 4, 5, 
     5, 6, 4, 5, 5, 6, 5, 6, 6, 7, 4, 5, 5, 6, 5, 6, 6, 7, 5, 6, 6, 7, 6, 
     7, 7, 8], dtype=np.uint8) 


def hamming_distance1(a, b): 
    c = np.bitwise_xor(a, b) 
    n = _nbits[c].sum() 
    return n 
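
As an alternative to typing the table out by hand, the same 256 bit counts could be generated at import time. This is only a sketch: `build_nbits` and `hamming_distance_table` are hypothetical names, and `np.unpackbits` is swapped in here as one convenient way to count bits, not something the answer uses.

```python
import numpy as np

def build_nbits():
    """Return a uint8 array t where t[k] is the number of 1 bits in k, 0 <= k < 256."""
    values = np.arange(256, dtype=np.uint8)[:, None]   # shape (256, 1)
    bits = np.unpackbits(values, axis=1)                # shape (256, 8), one bit per column
    return bits.sum(axis=1).astype(np.uint8)

nbits = build_nbits()

def hamming_distance_table(a, b):
    """Look up the bit count of each XOR result and add them up."""
    c = np.bitwise_xor(a, b)
    return int(nbits[c].sum())

print(hamming_distance_table([127, 255], [127, 240]))  # 4
```

Generating the table avoids transcription errors, at the price of a slightly different one-time initialization cost than the literal array.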

In the following, a and b are the Python lists of length 32 given in a comment to the question. divakar_hamming_distance() and divakar_hamming_distance_v2() are from @Divakar's answer.

Here are the timings of @Divakar's functions:

In [116]: %timeit divakar_hamming_distance(a, b) 
The slowest run took 5.57 times longer than the fastest. This could mean that an intermediate result is being cached. 
100000 loops, best of 3: 11.3 µs per loop 

In [117]: %timeit divakar_hamming_distance_v2(a, b) 
The slowest run took 5.35 times longer than the fastest. This could mean that an intermediate result is being cached. 
100000 loops, best of 3: 10.3 µs per loop 

hamming_distance1(a, b) is a bit faster:

In [118]: %timeit hamming_distance1(a, b) 
The slowest run took 6.04 times longer than the fastest. This could mean that an intermediate result is being cached. 
100000 loops, best of 3: 7.42 µs per loop 

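On my computer, initializing _nbits takes about 11 microseconds, so there is no advantage to using hamming_distance1 if you only call the function once. If you call it three or more times, there is a net gain in performance.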
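
The break-even point can be checked with a little arithmetic on the timings quoted above. This sketch assumes those figures (11 µs to initialize, 7.42 µs per table lookup call, 11.3 µs and 10.3 µs for the two broadcast versions) and will not carry over to other machines; `break_even_calls` is a hypothetical helper.

```python
import math

def break_even_calls(init_cost_us, table_call_us, other_call_us):
    """Smallest number of calls at which the one-time table setup has paid for itself."""
    saving_per_call = other_call_us - table_call_us
    return math.ceil(init_cost_us / saving_per_call)

INIT_US = 11.0      # one-time cost of building _nbits, as quoted above
TABLE_US = 7.42     # hamming_distance1 on Python lists

print(break_even_calls(INIT_US, TABLE_US, 11.3))  # vs divakar_hamming_distance: 3
print(break_even_calls(INIT_US, TABLE_US, 10.3))  # vs divakar_hamming_distance_v2: 4
```

With these numbers the table pays off after three calls against the first function and after four against the second.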

If the inputs are already numpy arrays, all the functions are significantly faster:

In [119]: aa = np.array(a) 

In [120]: bb = np.array(b) 

In [121]: %timeit divakar_hamming_distance_v2(aa, bb) 
The slowest run took 8.22 times longer than the fastest. This could mean that an intermediate result is being cached. 
100000 loops, best of 3: 5.72 µs per loop 

In [122]: %timeit hamming_distance1(aa, bb) 
The slowest run took 12.67 times longer than the fastest. This could mean that an intermediate result is being cached. 
100000 loops, best of 3: 2.77 µs per loop 

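Of course, if you always do that conversion immediately before computing the Hamming distance, the time it takes must be included in the overall timing. However, if you write the code that generates a and b to take advantage of numpy earlier, you may already have them as numpy arrays by the time you compute the Hamming distance.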
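
As an illustration of keeping the data in numpy from the start, the sketch below creates the inputs directly as `uint8` arrays so that no list conversion is needed later. The random data and the name `hamming_unpackbits` are assumptions made for the example; `np.unpackbits` is used for the bit count here and requires `uint8` input.

```python
import numpy as np

# Hypothetical data source: 32 random bytes per vector, created as uint8 arrays.
rng = np.random.default_rng(0)
aa = rng.integers(0, 256, size=32, dtype=np.uint8)
bb = rng.integers(0, 256, size=32, dtype=np.uint8)

def hamming_unpackbits(a, b):
    """Hamming distance for uint8 arrays: expand the XOR into bits and sum them."""
    return int(np.unpackbits(np.bitwise_xor(a, b)).sum())

print(hamming_unpackbits(aa, bb))
```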


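(I also experimented a bit with a precomputed 2-d array of the Hamming distances between pairs of 8-bit values, an array with shape (256, 256), but the initialization cost is higher and the performance gain is small.)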
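
The 2-d table mentioned above was not shown in the answer; one way it might look is sketched here. The names `pair_distance` and `hamming_distance_2d` are hypothetical, and the construction is an assumption about what such a table could contain.

```python
import numpy as np

# Bit count of every byte value, then the distance for every pair (i, j) = popcount(i ^ j).
k = np.arange(256)
bit_counts = np.array([bin(v).count("1") for v in range(256)], dtype=np.uint8)
pair_distance = bit_counts[k[:, None] ^ k]     # shape (256, 256)

def hamming_distance_2d(a, b):
    """Look up the distance of each (a[i], b[i]) pair directly and sum."""
    return int(pair_distance[np.asarray(a), np.asarray(b)].sum())

print(hamming_distance_2d([127, 255], [127, 240]))  # 4
```

This replaces the XOR at call time with a second index, which fits the observation that the gain over the 1-d table is small.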