快速阈值和位打包算法（可能的改进？）

我正在执行一个算法，执行一个8位灰度图像的全局阈值为1位（位打包，使1字节包含8个像素）单色图片。灰度图像中的每个像素的亮度值可以为0 - 255.快速阈值和位打包算法（可能的改进？）

我的环境是Microsoft Visual Studio C++中的Win32。

我有兴趣尽量优化算法出好奇心，1位图像将变成TIFF。目前我正在将FillOrder设置为MSB2LSB（最高有效位至最低有效位），因为TIFF规范暗示了这一点（它不一定需要是MSB2LSB）

仅仅为那些不知道的人：

MSB2LSB将像素从左到右排列在一个字节中，就像在X坐标增加时像素在图像中定向一样。如果您在X轴上从左到右遍历灰度图像，显然要求您在打包当前字节中的位时考虑“向后”。有了这个说法，让我告诉你我目前有什么（这是C语言，我没有尝试ASM或编译器内部函数，但仅仅因为我没有什么经验，但这是可能的）。

因为单色图像的每字节8个像素，单色图像的宽度将

(grayscaleWidth+7)/8;

仅供参考，我认为我的最大图像为6000个像素宽：

我首先要做的（之前的任何图像被处理）是

1）计算量的查找表我需要转移到给定的X中的特定字节从我的灰度图像坐标：

int _shift_lut[6000]; 

for(int x = 0 ; x < 6000; x++) 
{ 
    _shift_lut[x] = 7-(x%8); 
}

有了这个查询表，我可以包一个单色位值到我的东西，如工作的当前字节：

monochrome_pixel |= 1 << _shift_lut[ grayX ];

从而结束了在做

monochrome_pixel |= 1 << _shift_lut[ 7-(x%8)];

是一个巨大的速度增加我计算的第二个查找表是一个查找表，它告诉我在灰度像素上给出一个X像素的单色像素中的X索引。这个非常简单的LUT计算如下这样：

int xOffsetLut[6000]; 
int element_size=8; //8 bits 
for(int x = 0; x < 6000; x++) 
{ 
    xOffsetLut[x]=x/element_size; 
}

这LUT让我做的事情一样

monochrome_image[ xOffsetLut[ GrayX ] ] = packed_byte; //packed byte contains 8 pixels

我的灰度图像是一个简单的无符号字符*，所以是我的黑白图像;

这是我如何初始化单色图像：

int bitPackedScanlineStride = (grayscaleWidth+7)/8; 
int bitpackedLength=bitPackedScanlineStride * grayscaleHeight; 
unsigned char * bitpack_image = new unsigned char[bitpackedLength]; 
memset(bitpack_image,0,bitpackedLength);

然后，我打电话给我的双稳态功能就像这样：

binarize(
    gray_image.DataPtr(), 
    bitpack_image, 
    globalFormThreshold, 
    grayscaleWidth, 
    grayscaleHeight, 
    bitPackedScanlineStride, 
    bitpackedLength, 
    _shift_lut, 
    xOffsetLut);

这里是我的二值化功能（你可以看到我做了一些循环展开，这可能会或可能不会帮助）。

void binarize(unsigned char grayImage[], unsigned char bitPackImage[], int threshold, int grayscaleWidth, int grayscaleHeight, int bitPackedScanlineStride, int bitpackedLength, int shiftLUT[], int xOffsetLUT[]) 
{ 
    int yoff; 
    int byoff; 
    unsigned char bitpackPel=0; 
    unsigned char pel1=0; 
    unsigned char pel2=0; 
    unsigned char pel3=0; 
    unsigned char pel4=0; 
    unsigned char pel5=0; 
    unsigned char pel6=0; 
    unsigned char pel7=0; 
    unsigned char pel8=0; 
    int checkX=grayscaleWidth; 
    int checkY=grayscaleHeight; 

    for (int by = 0 ; by < checkY; by++) 
    { 
    yoff=by*grayscaleWidth; 
    byoff=by*bitPackedScanlineStride; 

    for(int bx = 0; bx < checkX; bx+=32) 
    { 
     bitpackPel = 0; 

     //pixel 1 in bitpack image 
     pel1=grayImage[yoff+bx]; 
     pel2=grayImage[yoff+bx+1]; 
     pel3=grayImage[yoff+bx+2]; 
     pel4=grayImage[yoff+bx+3]; 
     pel5=grayImage[yoff+bx+4]; 
     pel6=grayImage[yoff+bx+5]; 
     pel7=grayImage[yoff+bx+6]; 
     pel8=grayImage[yoff+bx+7]; 

     bitpackPel |= ((pel1<=threshold) << shiftLUT[bx]); 
     bitpackPel |= ((pel2<=threshold) << shiftLUT[bx+1]); 
     bitpackPel |= ((pel3<=threshold) << shiftLUT[bx+2]); 
     bitpackPel |= ((pel4<=threshold) << shiftLUT[bx+3]); 
     bitpackPel |= ((pel5<=threshold) << shiftLUT[bx+4]); 
     bitpackPel |= ((pel6<=threshold) << shiftLUT[bx+5]); 
     bitpackPel |= ((pel7<=threshold) << shiftLUT[bx+6]); 
     bitpackPel |= ((pel8<=threshold) << shiftLUT[bx+7]); 

     bitPackImage[byoff+(xOffsetLUT[bx])] = bitpackPel; 

     //pixel 2 in bitpack image 
     pel1=grayImage[yoff+bx+8]; 
     pel2=grayImage[yoff+bx+9]; 
     pel3=grayImage[yoff+bx+10]; 
     pel4=grayImage[yoff+bx+11]; 
     pel5=grayImage[yoff+bx+12]; 
     pel6=grayImage[yoff+bx+13]; 
     pel7=grayImage[yoff+bx+14]; 
     pel8=grayImage[yoff+bx+15]; 

     bitpackPel = 0; 

     bitpackPel |= ((pel1<=threshold) << shiftLUT[bx+8] ); 
     bitpackPel |= ((pel2<=threshold) << shiftLUT[bx+9] ); 
     bitpackPel |= ((pel3<=threshold) << shiftLUT[bx+10]); 
     bitpackPel |= ((pel4<=threshold) << shiftLUT[bx+11]); 
     bitpackPel |= ((pel5<=threshold) << shiftLUT[bx+12]); 
     bitpackPel |= ((pel6<=threshold) << shiftLUT[bx+13]); 
     bitpackPel |= ((pel7<=threshold) << shiftLUT[bx+14]); 
     bitpackPel |= ((pel8<=threshold) << shiftLUT[bx+15]); 

     bitPackImage[byoff+(xOffsetLUT[bx+8])] = bitpackPel; 

     //pixel 3 in bitpack image 
     pel1=grayImage[yoff+bx+16]; 
     pel2=grayImage[yoff+bx+17]; 
     pel3=grayImage[yoff+bx+18]; 
     pel4=grayImage[yoff+bx+19]; 
     pel5=grayImage[yoff+bx+20]; 
     pel6=grayImage[yoff+bx+21]; 
     pel7=grayImage[yoff+bx+22]; 
     pel8=grayImage[yoff+bx+23]; 

     bitpackPel = 0; 

     bitpackPel |= ((pel1<=threshold) << shiftLUT[bx+16] ); 
     bitpackPel |= ((pel2<=threshold) << shiftLUT[bx+17] ); 
     bitpackPel |= ((pel3<=threshold) << shiftLUT[bx+18]); 
     bitpackPel |= ((pel4<=threshold) << shiftLUT[bx+19]); 
     bitpackPel |= ((pel5<=threshold) << shiftLUT[bx+20]); 
     bitpackPel |= ((pel6<=threshold) << shiftLUT[bx+21]); 
     bitpackPel |= ((pel7<=threshold) << shiftLUT[bx+22]); 
     bitpackPel |= ((pel8<=threshold) << shiftLUT[bx+23]); 

     bitPackImage[byoff+(xOffsetLUT[bx+16])] = bitpackPel; 

     //pixel 4 in bitpack image 
     pel1=grayImage[yoff+bx+24]; 
     pel2=grayImage[yoff+bx+25]; 
     pel3=grayImage[yoff+bx+26]; 
     pel4=grayImage[yoff+bx+27]; 
     pel5=grayImage[yoff+bx+28]; 
     pel6=grayImage[yoff+bx+29]; 
     pel7=grayImage[yoff+bx+30]; 
     pel8=grayImage[yoff+bx+31]; 

     bitpackPel = 0; 

     bitpackPel |= ((pel1<=threshold) << shiftLUT[bx+24] ); 
     bitpackPel |= ((pel2<=threshold) << shiftLUT[bx+25] ); 
     bitpackPel |= ((pel3<=threshold) << shiftLUT[bx+26]); 
     bitpackPel |= ((pel4<=threshold) << shiftLUT[bx+27]); 
     bitpackPel |= ((pel5<=threshold) << shiftLUT[bx+28]); 
     bitpackPel |= ((pel6<=threshold) << shiftLUT[bx+29]); 
     bitpackPel |= ((pel7<=threshold) << shiftLUT[bx+30]); 
     bitpackPel |= ((pel8<=threshold) << shiftLUT[bx+31]); 

     bitPackImage[byoff+(xOffsetLUT[bx+24])] = bitpackPel; 
    } 
} 
}

我知道这个算法可能会遗漏每行中的一些尾随像素，但不用担心这一点。

正如你可以看到每个单色字节，我处理8个灰度像素。

如果您看到 pel8 < =门槛是一个整洁的小把戏解析为0或1，比如果{}其他快得多{}

对于XI的每个增量收拾了一下进入了更高序位比以前的X

因此对于灰度图像

1 | 2 | 3 | 4 | 5 | 6 | 7 | 8

这在第一组的8个像素是什么字节看起来像位（显然每个编号位是举ST已经处理相应编号的像素的阈值的结果，但你的想法）

1 | 2 | 3 | 4 | 5 | 6 | 7 | 8

呼这应该是它。随意使用一些漂亮的捣蛋技巧来获得一些乐趣，这些技巧会挤掉这个算法中的更多汁液。

编译器优化开启后，此功能在核心2 duo机器上的大约5000 x 2200像素图像上平均需要16毫秒。

编辑：

R.，的建议是，以除去偏移LUT，只是使用常量这实际上是完全合理...我已经修改每个像素的的OR'ing是因为这样：

void binarize(unsigned char grayImage[], unsigned char bitPackImage[], int threshold, int grayscaleWidth, int grayscaleHeight, int bitPackedScanlineStride, int bitpackedLength, int shiftLUT[], int xOffsetLUT[]) 
{ 
int yoff; 
int byoff; 
unsigned char bitpackPel=0; 
unsigned char pel1=0; 
unsigned char pel2=0; 
unsigned char pel3=0; 
unsigned char pel4=0; 
unsigned char pel5=0; 
unsigned char pel6=0; 
unsigned char pel7=0; 
unsigned char pel8=0; 
int checkX=grayscaleWidth-32; 
int checkY=grayscaleHeight; 

for (int by = 0 ; by < checkY; by++) 
{ 
    yoff=by*grayscaleWidth; 
    byoff=by*bitPackedScanlineStride; 

    for(int bx = 0; bx < checkX; bx+=32) 
    { 
     bitpackPel = 0; 

     //pixel 1 in bitpack image 
     pel1=grayImage[yoff+bx]; 
     pel2=grayImage[yoff+bx+1]; 
     pel3=grayImage[yoff+bx+2]; 
     pel4=grayImage[yoff+bx+3]; 
     pel5=grayImage[yoff+bx+4]; 
     pel6=grayImage[yoff+bx+5]; 
     pel7=grayImage[yoff+bx+6]; 
     pel8=grayImage[yoff+bx+7]; 

     /*bitpackPel |= ((pel1<=threshold) << shiftLUT[bx]); 
     bitpackPel |= ((pel2<=threshold) << shiftLUT[bx+1]); 
     bitpackPel |= ((pel3<=threshold) << shiftLUT[bx+2]); 
     bitpackPel |= ((pel4<=threshold) << shiftLUT[bx+3]); 
     bitpackPel |= ((pel5<=threshold) << shiftLUT[bx+4]); 
     bitpackPel |= ((pel6<=threshold) << shiftLUT[bx+5]); 
     bitpackPel |= ((pel7<=threshold) << shiftLUT[bx+6]); 
     bitpackPel |= ((pel8<=threshold) << shiftLUT[bx+7]);*/ 
     bitpackPel |= ((pel1<=threshold) << 7); 
     bitpackPel |= ((pel2<=threshold) << 6); 
     bitpackPel |= ((pel3<=threshold) << 5); 
     bitpackPel |= ((pel4<=threshold) << 4); 
     bitpackPel |= ((pel5<=threshold) << 3); 
     bitpackPel |= ((pel6<=threshold) << 2); 
     bitpackPel |= ((pel7<=threshold) << 1); 
     bitpackPel |= ((pel8<=threshold) ); 

     bitPackImage[byoff+(xOffsetLUT[bx])] = bitpackPel; 

     //pixel 2 in bitpack image 
     pel1=grayImage[yoff+bx+8]; 
     pel2=grayImage[yoff+bx+9]; 
     pel3=grayImage[yoff+bx+10]; 
     pel4=grayImage[yoff+bx+11]; 
     pel5=grayImage[yoff+bx+12]; 
     pel6=grayImage[yoff+bx+13]; 
     pel7=grayImage[yoff+bx+14]; 
     pel8=grayImage[yoff+bx+15]; 

     bitpackPel = 0; 

     /*bitpackPel |= ((pel1<=threshold) << shiftLUT[bx+8] ); 
     bitpackPel |= ((pel2<=threshold) << shiftLUT[bx+9] ); 
     bitpackPel |= ((pel3<=threshold) << shiftLUT[bx+10]); 
     bitpackPel |= ((pel4<=threshold) << shiftLUT[bx+11]); 
     bitpackPel |= ((pel5<=threshold) << shiftLUT[bx+12]); 
     bitpackPel |= ((pel6<=threshold) << shiftLUT[bx+13]); 
     bitpackPel |= ((pel7<=threshold) << shiftLUT[bx+14]); 
     bitpackPel |= ((pel8<=threshold) << shiftLUT[bx+15]);*/ 
     bitpackPel |= ((pel1<=threshold) << 7); 
     bitpackPel |= ((pel2<=threshold) << 6); 
     bitpackPel |= ((pel3<=threshold) << 5); 
     bitpackPel |= ((pel4<=threshold) << 4); 
     bitpackPel |= ((pel5<=threshold) << 3); 
     bitpackPel |= ((pel6<=threshold) << 2); 
     bitpackPel |= ((pel7<=threshold) << 1); 
     bitpackPel |= ((pel8<=threshold) ); 


     bitPackImage[byoff+(xOffsetLUT[bx+8])] = bitpackPel; 

     //pixel 3 in bitpack image 
     pel1=grayImage[yoff+bx+16]; 
     pel2=grayImage[yoff+bx+17]; 
     pel3=grayImage[yoff+bx+18]; 
     pel4=grayImage[yoff+bx+19]; 
     pel5=grayImage[yoff+bx+20]; 
     pel6=grayImage[yoff+bx+21]; 
     pel7=grayImage[yoff+bx+22]; 
     pel8=grayImage[yoff+bx+23]; 

     bitpackPel = 0; 

     /*bitpackPel |= ((pel1<=threshold) << shiftLUT[bx+16] ); 
     bitpackPel |= ((pel2<=threshold) << shiftLUT[bx+17] ); 
     bitpackPel |= ((pel3<=threshold) << shiftLUT[bx+18]); 
     bitpackPel |= ((pel4<=threshold) << shiftLUT[bx+19]); 
     bitpackPel |= ((pel5<=threshold) << shiftLUT[bx+20]); 
     bitpackPel |= ((pel6<=threshold) << shiftLUT[bx+21]); 
     bitpackPel |= ((pel7<=threshold) << shiftLUT[bx+22]); 
     bitpackPel |= ((pel8<=threshold) << shiftLUT[bx+23]);*/ 
      bitpackPel |= ((pel1<=threshold) << 7); 
     bitpackPel |= ((pel2<=threshold) << 6); 
     bitpackPel |= ((pel3<=threshold) << 5); 
     bitpackPel |= ((pel4<=threshold) << 4); 
     bitpackPel |= ((pel5<=threshold) << 3); 
     bitpackPel |= ((pel6<=threshold) << 2); 
     bitpackPel |= ((pel7<=threshold) << 1); 
     bitpackPel |= ((pel8<=threshold) ); 


     bitPackImage[byoff+(xOffsetLUT[bx+16])] = bitpackPel; 

     //pixel 4 in bitpack image 
     pel1=grayImage[yoff+bx+24]; 
     pel2=grayImage[yoff+bx+25]; 
     pel3=grayImage[yoff+bx+26]; 
     pel4=grayImage[yoff+bx+27]; 
     pel5=grayImage[yoff+bx+28]; 
     pel6=grayImage[yoff+bx+29]; 
     pel7=grayImage[yoff+bx+30]; 
     pel8=grayImage[yoff+bx+31]; 

     bitpackPel = 0; 

     /*bitpackPel |= ((pel1<=threshold) << shiftLUT[bx+24] ); 
     bitpackPel |= ((pel2<=threshold) << shiftLUT[bx+25] ); 
     bitpackPel |= ((pel3<=threshold) << shiftLUT[bx+26]); 
     bitpackPel |= ((pel4<=threshold) << shiftLUT[bx+27]); 
     bitpackPel |= ((pel5<=threshold) << shiftLUT[bx+28]); 
     bitpackPel |= ((pel6<=threshold) << shiftLUT[bx+29]); 
     bitpackPel |= ((pel7<=threshold) << shiftLUT[bx+30]); 
     bitpackPel |= ((pel8<=threshold) << shiftLUT[bx+31]);*/ 
     bitpackPel |= ((pel1<=threshold) << 7); 
     bitpackPel |= ((pel2<=threshold) << 6); 
     bitpackPel |= ((pel3<=threshold) << 5); 
     bitpackPel |= ((pel4<=threshold) << 4); 
     bitpackPel |= ((pel5<=threshold) << 3); 
     bitpackPel |= ((pel6<=threshold) << 2); 
     bitpackPel |= ((pel7<=threshold) << 1); 
     bitpackPel |= ((pel8<=threshold) ); 


     bitPackImage[byoff+(xOffsetLUT[bx+24])] = bitpackPel; 
    } 
} 
}

我现在使用（GCC）4.1.2在Intel Xeon 5670上进行测试。在这些规范下，硬编码bitshift比使用我原来的LUT算法慢4 ms。在Xeon和GCC中，LUT算法平均需要8.61 ms，而硬编码的bitshift平均需要12.285 ms。

来源

2010-09-14 alessandro ferrucci

您的查找表是无用的。简单地计算一下shift（如果你正确地做了，而不是使用'％'操作符和一个有符号整数，这是浪费很慢的），比表查找要快得多。或者，更好的是，您可以展开循环并对8个班次进行硬编码。通常通过常量移位比变量bitshift快很多，所以它可以帮助很多。 – 2010-09-14 00:39:49

我修改了算法以简单地使用恒定的位移......它实际上最终比LUT慢4 ms。我现在在GCC 1.4.2上使用Intel Xeon。使用LUT的算法平均需要8.61毫秒，没有LUT平均需要12.285毫秒。 – 2010-09-14 01:25:51

+1 to R ..，第二个lut同样没用，因为'x/8'将变成'x >> 3'，它比'*（lut + x）'更快，因为你不需要解引用指针。如果你真的认为疯狂的可移植性是有价值的（并且不被你正在使用的其他构造所排除），那么你可以使用'x/CHAR_BIT'。 – 2010-09-14 01:29:38

尝试类似：

unsigned i, w8=w>>3, x; 
for (i=0; i<w8; i++) { 
    x = thres-src[0]>>1&0x80; 
    x |= thres-src[1]>>2&0x40; 
    x |= thres-src[2]>>3&0x20; 
    x |= thres-src[3]>>4&0x10; 
    x |= thres-src[4]>>5&0x08; 
    x |= thres-src[5]>>6&0x04; 
    x |= thres-src[6]>>7&0x02; 
    x |= thres-src[7]>>8&0x01; 
    out[i] = x; 
    src += 8; 
}

你可以计算出额外的代码，在宽度行的末尾其余不是8的倍数，或者你可以只垫/校准源以确保它是8的倍数。

来源

2010-09-14 01:49:08

你确定这些转换不能从0到7而不是1到8（假设阈值和src都是8位值）。 – caf 2010-09-14 04:50:10

是的，我选择了正确的移位值。我正在下移第8位而不是第7位，因为我想从整数结果中获得借位。无论'thres-src [k]'是否包装模UINT_MAX + 1'，位7都可以是0或1。 – 2010-09-14 12:36:40

你可以很容易地用SSE做到这一点，一次处理16个像素，例如

负载向量（16×8位无符号）
附加（255 - 阈值），以每个元素
使用PMOVMSKB到符号位提取成16位字
存储区16位字

使用SSE内在函数的示例代码（警告：未经测试！）：

void threshold_and_pack(
    const uint8_t * in_image,  // input image, 16 byte aligned, height rows x width cols, width = multiple of 16 
    uint8_t * out_image,   // output image, 2 byte aligned, height rows x width/8 cols, width = multiple of 2 
    const uint8_t threshold,  // threshold 
    const int width, 
    const int height) 
{ 
    const __m128i vThreshold = _mm_set1_epi8(255 - threshold); 
    int i, j; 

    for (i = 0; i < height; ++i) 
    { 
     const __m128i * p_in = (__m128i *)&in_image[i * width]; 
     uint16_t * p_out = (uint16_t *)&out_image[i * width/CHAR_BIT]; 

     for (j = 0; j < width; j += 16) 
     { 
      __m128i v = _mm_load_si128(p_in); 
      uint16_t b; 

      v = _mm_add_epi8(v, vThreshold); 
      b = _mm_movemask_epi8(v); // use PMOVMSKB to pack sign bits into 16 bit word 

      *p_out = b; 

      p_in++; 
      p_out++; 
     } 
    } 
}

来源

2010-09-14 10:55:45

快速阈值和位打包算法（可能的改进？）

回答

相关问题