2010-09-14 55 views
1

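Fast threshold and bit packing algorithm (possible improvements?)

I am working on an algorithm that performs a global threshold of an 8-bit grayscale image into a 1-bit (bit-packed, so that 1 byte contains 8 pixels) monochrome image. Each pixel in the grayscale image can have a luminance value of 0 to 255.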

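My environment is Win32 in Microsoft Visual Studio C++.

I am interested in optimizing the algorithm as much as possible out of curiosity. The 1-bit image will be turned into a TIFF. Currently I am setting the FillOrder to MSB2LSB (most significant bit to least significant bit) because the TIFF spec suggests this (it does not necessarily need to be MSB2LSB).

Just for those who don't know:

MSB2LSB orders pixels from left to right within a byte, just as pixels are oriented in the image as the X coordinate increases. If you traverse the grayscale image from left to right along the X axis, this obviously requires you to think "backwards" as you pack the bits into the current byte. With that said, let me show you what I currently have (this is in C; I have not attempted ASM or compiler intrinsics yet, only because I have very little experience with them, but that would be a possibility).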

Because the monochrome image holds 8 pixels per byte, the width of the monochrome image in bytes will be

(grayscaleWidth+7)/8;
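As an illustration of the stride formula and the MSB2LSB ordering described above, here is a minimal sketch. It is not the poster's code: the helper names `packed_stride` and `set_pixel_msb2lsb` are hypothetical, and it assumes the packed row has been zeroed beforehand.

```c
#include <stddef.h>
#include <stdint.h>

/* Bytes needed for one bit-packed row: width rounded up to a multiple of 8, then divided by 8. */
static size_t packed_stride(size_t width)
{
    return (width + 7) / 8;
}

/* Set the bit for pixel x in a bit-packed row using MSB2LSB fill order:
   x = 0 maps to bit 7 of byte 0, x = 7 to bit 0 of byte 0, x = 8 to bit 7 of byte 1.
   Assumes the row was zero-initialized. */
static void set_pixel_msb2lsb(uint8_t *row, size_t x)
{
    row[x / 8] |= (uint8_t)(1u << (7 - (x % 8)));
}
```

For a 6000-pixel-wide row this gives a stride of 750 bytes, and a 9-pixel-wide row needs 2 bytes.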

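FYI, I assume my largest image to be 6000 pixels wide:

The first thing I do (before any image is processed) is

1) calculate a lookup table of the shift amounts needed to place a bit into its byte, given an X coordinate from my grayscale image: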

int _shift_lut[6000]; 

for(int x = 0 ; x < 6000; x++) 
{ 
    _shift_lut[x] = 7-(x%8); 
} 

With this lookup table I can pack a monochrome bit value into the current byte I am working on, like so:

monochrome_pixel |= 1 << _shift_lut[ grayX ]; 

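which ends up being a huge speed increase over doing

monochrome_pixel |= 1 << (7-(x%8)); 

2) The second lookup table I calculate tells me the X index into my monochrome image given an X pixel in the grayscale image. This very simple LUT is calculated like so: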

int xOffsetLut[6000]; 
int element_size=8; //8 bits 
for(int x = 0; x < 6000; x++) 
{ 
    xOffsetLut[x]=x/element_size; 
} 

This LUT lets me do things like

monochrome_image[ xOffsetLut[ GrayX ] ] = packed_byte; //packed byte contains 8 pixels 

My grayscale image is a simple unsigned char *, and so is my monochrome image.

This is how I initialize the monochrome image:

int bitPackedScanlineStride = (grayscaleWidth+7)/8; 
int bitpackedLength=bitPackedScanlineStride * grayscaleHeight; 
unsigned char * bitpack_image = new unsigned char[bitpackedLength]; 
memset(bitpack_image,0,bitpackedLength); 

Then I call my binarize function like so:

binarize(
    gray_image.DataPtr(), 
    bitpack_image, 
    globalFormThreshold, 
    grayscaleWidth, 
    grayscaleHeight, 
    bitPackedScanlineStride, 
    bitpackedLength, 
    _shift_lut, 
    xOffsetLut); 

Here is my binarize function (as you can see, I did some loop unrolling, which may or may not help).

void binarize(unsigned char grayImage[], unsigned char bitPackImage[], int threshold, int grayscaleWidth, int grayscaleHeight, int bitPackedScanlineStride, int bitpackedLength, int shiftLUT[], int xOffsetLUT[]) 
{ 
    int yoff; 
    int byoff; 
    unsigned char bitpackPel=0; 
    unsigned char pel1=0; 
    unsigned char pel2=0; 
    unsigned char pel3=0; 
    unsigned char pel4=0; 
    unsigned char pel5=0; 
    unsigned char pel6=0; 
    unsigned char pel7=0; 
    unsigned char pel8=0; 
    int checkX=grayscaleWidth; 
    int checkY=grayscaleHeight; 

    for (int by = 0 ; by < checkY; by++) 
    { 
    yoff=by*grayscaleWidth; 
    byoff=by*bitPackedScanlineStride; 

    for(int bx = 0; bx < checkX; bx+=32) 
    { 
     bitpackPel = 0; 

     //pixel 1 in bitpack image 
     pel1=grayImage[yoff+bx]; 
     pel2=grayImage[yoff+bx+1]; 
     pel3=grayImage[yoff+bx+2]; 
     pel4=grayImage[yoff+bx+3]; 
     pel5=grayImage[yoff+bx+4]; 
     pel6=grayImage[yoff+bx+5]; 
     pel7=grayImage[yoff+bx+6]; 
     pel8=grayImage[yoff+bx+7]; 

     bitpackPel |= ((pel1<=threshold) << shiftLUT[bx]); 
     bitpackPel |= ((pel2<=threshold) << shiftLUT[bx+1]); 
     bitpackPel |= ((pel3<=threshold) << shiftLUT[bx+2]); 
     bitpackPel |= ((pel4<=threshold) << shiftLUT[bx+3]); 
     bitpackPel |= ((pel5<=threshold) << shiftLUT[bx+4]); 
     bitpackPel |= ((pel6<=threshold) << shiftLUT[bx+5]); 
     bitpackPel |= ((pel7<=threshold) << shiftLUT[bx+6]); 
     bitpackPel |= ((pel8<=threshold) << shiftLUT[bx+7]); 

     bitPackImage[byoff+(xOffsetLUT[bx])] = bitpackPel; 

     //pixel 2 in bitpack image 
     pel1=grayImage[yoff+bx+8]; 
     pel2=grayImage[yoff+bx+9]; 
     pel3=grayImage[yoff+bx+10]; 
     pel4=grayImage[yoff+bx+11]; 
     pel5=grayImage[yoff+bx+12]; 
     pel6=grayImage[yoff+bx+13]; 
     pel7=grayImage[yoff+bx+14]; 
     pel8=grayImage[yoff+bx+15]; 

     bitpackPel = 0; 

     bitpackPel |= ((pel1<=threshold) << shiftLUT[bx+8] ); 
     bitpackPel |= ((pel2<=threshold) << shiftLUT[bx+9] ); 
     bitpackPel |= ((pel3<=threshold) << shiftLUT[bx+10]); 
     bitpackPel |= ((pel4<=threshold) << shiftLUT[bx+11]); 
     bitpackPel |= ((pel5<=threshold) << shiftLUT[bx+12]); 
     bitpackPel |= ((pel6<=threshold) << shiftLUT[bx+13]); 
     bitpackPel |= ((pel7<=threshold) << shiftLUT[bx+14]); 
     bitpackPel |= ((pel8<=threshold) << shiftLUT[bx+15]); 

     bitPackImage[byoff+(xOffsetLUT[bx+8])] = bitpackPel; 

     //pixel 3 in bitpack image 
     pel1=grayImage[yoff+bx+16]; 
     pel2=grayImage[yoff+bx+17]; 
     pel3=grayImage[yoff+bx+18]; 
     pel4=grayImage[yoff+bx+19]; 
     pel5=grayImage[yoff+bx+20]; 
     pel6=grayImage[yoff+bx+21]; 
     pel7=grayImage[yoff+bx+22]; 
     pel8=grayImage[yoff+bx+23]; 

     bitpackPel = 0; 

     bitpackPel |= ((pel1<=threshold) << shiftLUT[bx+16] ); 
     bitpackPel |= ((pel2<=threshold) << shiftLUT[bx+17] ); 
     bitpackPel |= ((pel3<=threshold) << shiftLUT[bx+18]); 
     bitpackPel |= ((pel4<=threshold) << shiftLUT[bx+19]); 
     bitpackPel |= ((pel5<=threshold) << shiftLUT[bx+20]); 
     bitpackPel |= ((pel6<=threshold) << shiftLUT[bx+21]); 
     bitpackPel |= ((pel7<=threshold) << shiftLUT[bx+22]); 
     bitpackPel |= ((pel8<=threshold) << shiftLUT[bx+23]); 

     bitPackImage[byoff+(xOffsetLUT[bx+16])] = bitpackPel; 

     //pixel 4 in bitpack image 
     pel1=grayImage[yoff+bx+24]; 
     pel2=grayImage[yoff+bx+25]; 
     pel3=grayImage[yoff+bx+26]; 
     pel4=grayImage[yoff+bx+27]; 
     pel5=grayImage[yoff+bx+28]; 
     pel6=grayImage[yoff+bx+29]; 
     pel7=grayImage[yoff+bx+30]; 
     pel8=grayImage[yoff+bx+31]; 

     bitpackPel = 0; 

     bitpackPel |= ((pel1<=threshold) << shiftLUT[bx+24] ); 
     bitpackPel |= ((pel2<=threshold) << shiftLUT[bx+25] ); 
     bitpackPel |= ((pel3<=threshold) << shiftLUT[bx+26]); 
     bitpackPel |= ((pel4<=threshold) << shiftLUT[bx+27]); 
     bitpackPel |= ((pel5<=threshold) << shiftLUT[bx+28]); 
     bitpackPel |= ((pel6<=threshold) << shiftLUT[bx+29]); 
     bitpackPel |= ((pel7<=threshold) << shiftLUT[bx+30]); 
     bitpackPel |= ((pel8<=threshold) << shiftLUT[bx+31]); 

     bitPackImage[byoff+(xOffsetLUT[bx+24])] = bitpackPel; 
    } 
} 
} 

I know this algorithm could potentially miss some trailing pixels in each row, but don't worry about that.
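For readers who do need the trailing pixels, one way to handle a row whose width is not a multiple of 8 is sketched below. This is an assumption-laden illustration rather than the poster's function: `binarize_row` is a hypothetical name, it processes a single row without unrolling or lookup tables, and it leaves the unused low bits of the last byte at 0.

```c
#include <stddef.h>
#include <stdint.h>

/* Binarize one grayscale row into a bit-packed row (MSB2LSB), including a
   partial last byte. A bit is set where pixel <= threshold, matching the
   comparison used in the question. */
static void binarize_row(const uint8_t *gray, uint8_t *packed,
                         size_t width, uint8_t threshold)
{
    size_t full = width / 8;   /* number of complete output bytes */
    size_t x = 0;

    for (size_t b = 0; b < full; b++, x += 8) {
        unsigned byte = 0;
        for (unsigned k = 0; k < 8; k++)
            byte |= (unsigned)(gray[x + k] <= threshold) << (7 - k);
        packed[b] = (uint8_t)byte;
    }

    if (x < width) {           /* 1 to 7 trailing pixels; unused low bits stay 0 */
        unsigned byte = 0;
        for (unsigned k = 0; x + k < width; k++)
            byte |= (unsigned)(gray[x + k] <= threshold) << (7 - k);
        packed[full] = (uint8_t)byte;
    }
}
```

The caller would invoke this once per row, passing `gray + y * width` and `packed + y * stride`, with the stride computed as `(width + 7) / 8`.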

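As you can see, for each monochrome byte I process 8 grayscale pixels.

Note that pel8<=threshold is a neat little trick: it resolves to 0 or 1, which is much faster than if{} else{}.

For every increment of X I pack a bit into a lower-order bit than the previous X.

So for the first group of 8 pixels in the grayscale image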

1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 

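this is what the bits in the byte look like (obviously each numbered bit is just the result of thresholding the correspondingly numbered pixel, but you get the idea)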

1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 

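That should be it. Feel free to have some fun with nifty bit-twiddling tricks that would squeeze more juice out of this algorithm.

With compiler optimizations turned on, this function takes on average 16 milliseconds on an approximately 5000 x 2200 pixel image on a Core 2 Duo machine.

EDIT:

R..'s suggestion was to remove the shift LUT and just use constants, which is actually perfectly logical... I have modified the OR'ing of each pixel to be as follows: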

void binarize(unsigned char grayImage[], unsigned char bitPackImage[], int threshold, int grayscaleWidth, int grayscaleHeight, int bitPackedScanlineStride, int bitpackedLength, int shiftLUT[], int xOffsetLUT[]) 
{ 
int yoff; 
int byoff; 
unsigned char bitpackPel=0; 
unsigned char pel1=0; 
unsigned char pel2=0; 
unsigned char pel3=0; 
unsigned char pel4=0; 
unsigned char pel5=0; 
unsigned char pel6=0; 
unsigned char pel7=0; 
unsigned char pel8=0; 
int checkX=grayscaleWidth-32; 
int checkY=grayscaleHeight; 

for (int by = 0 ; by < checkY; by++) 
{ 
    yoff=by*grayscaleWidth; 
    byoff=by*bitPackedScanlineStride; 

    for(int bx = 0; bx < checkX; bx+=32) 
    { 
     bitpackPel = 0; 

     //pixel 1 in bitpack image 
     pel1=grayImage[yoff+bx]; 
     pel2=grayImage[yoff+bx+1]; 
     pel3=grayImage[yoff+bx+2]; 
     pel4=grayImage[yoff+bx+3]; 
     pel5=grayImage[yoff+bx+4]; 
     pel6=grayImage[yoff+bx+5]; 
     pel7=grayImage[yoff+bx+6]; 
     pel8=grayImage[yoff+bx+7]; 

     /*bitpackPel |= ((pel1<=threshold) << shiftLUT[bx]); 
     bitpackPel |= ((pel2<=threshold) << shiftLUT[bx+1]); 
     bitpackPel |= ((pel3<=threshold) << shiftLUT[bx+2]); 
     bitpackPel |= ((pel4<=threshold) << shiftLUT[bx+3]); 
     bitpackPel |= ((pel5<=threshold) << shiftLUT[bx+4]); 
     bitpackPel |= ((pel6<=threshold) << shiftLUT[bx+5]); 
     bitpackPel |= ((pel7<=threshold) << shiftLUT[bx+6]); 
     bitpackPel |= ((pel8<=threshold) << shiftLUT[bx+7]);*/ 
     bitpackPel |= ((pel1<=threshold) << 7); 
     bitpackPel |= ((pel2<=threshold) << 6); 
     bitpackPel |= ((pel3<=threshold) << 5); 
     bitpackPel |= ((pel4<=threshold) << 4); 
     bitpackPel |= ((pel5<=threshold) << 3); 
     bitpackPel |= ((pel6<=threshold) << 2); 
     bitpackPel |= ((pel7<=threshold) << 1); 
     bitpackPel |= ((pel8<=threshold) ); 

     bitPackImage[byoff+(xOffsetLUT[bx])] = bitpackPel; 

     //pixel 2 in bitpack image 
     pel1=grayImage[yoff+bx+8]; 
     pel2=grayImage[yoff+bx+9]; 
     pel3=grayImage[yoff+bx+10]; 
     pel4=grayImage[yoff+bx+11]; 
     pel5=grayImage[yoff+bx+12]; 
     pel6=grayImage[yoff+bx+13]; 
     pel7=grayImage[yoff+bx+14]; 
     pel8=grayImage[yoff+bx+15]; 

     bitpackPel = 0; 

     /*bitpackPel |= ((pel1<=threshold) << shiftLUT[bx+8] ); 
     bitpackPel |= ((pel2<=threshold) << shiftLUT[bx+9] ); 
     bitpackPel |= ((pel3<=threshold) << shiftLUT[bx+10]); 
     bitpackPel |= ((pel4<=threshold) << shiftLUT[bx+11]); 
     bitpackPel |= ((pel5<=threshold) << shiftLUT[bx+12]); 
     bitpackPel |= ((pel6<=threshold) << shiftLUT[bx+13]); 
     bitpackPel |= ((pel7<=threshold) << shiftLUT[bx+14]); 
     bitpackPel |= ((pel8<=threshold) << shiftLUT[bx+15]);*/ 
     bitpackPel |= ((pel1<=threshold) << 7); 
     bitpackPel |= ((pel2<=threshold) << 6); 
     bitpackPel |= ((pel3<=threshold) << 5); 
     bitpackPel |= ((pel4<=threshold) << 4); 
     bitpackPel |= ((pel5<=threshold) << 3); 
     bitpackPel |= ((pel6<=threshold) << 2); 
     bitpackPel |= ((pel7<=threshold) << 1); 
     bitpackPel |= ((pel8<=threshold) ); 


     bitPackImage[byoff+(xOffsetLUT[bx+8])] = bitpackPel; 

     //pixel 3 in bitpack image 
     pel1=grayImage[yoff+bx+16]; 
     pel2=grayImage[yoff+bx+17]; 
     pel3=grayImage[yoff+bx+18]; 
     pel4=grayImage[yoff+bx+19]; 
     pel5=grayImage[yoff+bx+20]; 
     pel6=grayImage[yoff+bx+21]; 
     pel7=grayImage[yoff+bx+22]; 
     pel8=grayImage[yoff+bx+23]; 

     bitpackPel = 0; 

     /*bitpackPel |= ((pel1<=threshold) << shiftLUT[bx+16] ); 
     bitpackPel |= ((pel2<=threshold) << shiftLUT[bx+17] ); 
     bitpackPel |= ((pel3<=threshold) << shiftLUT[bx+18]); 
     bitpackPel |= ((pel4<=threshold) << shiftLUT[bx+19]); 
     bitpackPel |= ((pel5<=threshold) << shiftLUT[bx+20]); 
     bitpackPel |= ((pel6<=threshold) << shiftLUT[bx+21]); 
     bitpackPel |= ((pel7<=threshold) << shiftLUT[bx+22]); 
     bitpackPel |= ((pel8<=threshold) << shiftLUT[bx+23]);*/ 
      bitpackPel |= ((pel1<=threshold) << 7); 
     bitpackPel |= ((pel2<=threshold) << 6); 
     bitpackPel |= ((pel3<=threshold) << 5); 
     bitpackPel |= ((pel4<=threshold) << 4); 
     bitpackPel |= ((pel5<=threshold) << 3); 
     bitpackPel |= ((pel6<=threshold) << 2); 
     bitpackPel |= ((pel7<=threshold) << 1); 
     bitpackPel |= ((pel8<=threshold) ); 


     bitPackImage[byoff+(xOffsetLUT[bx+16])] = bitpackPel; 

     //pixel 4 in bitpack image 
     pel1=grayImage[yoff+bx+24]; 
     pel2=grayImage[yoff+bx+25]; 
     pel3=grayImage[yoff+bx+26]; 
     pel4=grayImage[yoff+bx+27]; 
     pel5=grayImage[yoff+bx+28]; 
     pel6=grayImage[yoff+bx+29]; 
     pel7=grayImage[yoff+bx+30]; 
     pel8=grayImage[yoff+bx+31]; 

     bitpackPel = 0; 

     /*bitpackPel |= ((pel1<=threshold) << shiftLUT[bx+24] ); 
     bitpackPel |= ((pel2<=threshold) << shiftLUT[bx+25] ); 
     bitpackPel |= ((pel3<=threshold) << shiftLUT[bx+26]); 
     bitpackPel |= ((pel4<=threshold) << shiftLUT[bx+27]); 
     bitpackPel |= ((pel5<=threshold) << shiftLUT[bx+28]); 
     bitpackPel |= ((pel6<=threshold) << shiftLUT[bx+29]); 
     bitpackPel |= ((pel7<=threshold) << shiftLUT[bx+30]); 
     bitpackPel |= ((pel8<=threshold) << shiftLUT[bx+31]);*/ 
     bitpackPel |= ((pel1<=threshold) << 7); 
     bitpackPel |= ((pel2<=threshold) << 6); 
     bitpackPel |= ((pel3<=threshold) << 5); 
     bitpackPel |= ((pel4<=threshold) << 4); 
     bitpackPel |= ((pel5<=threshold) << 3); 
     bitpackPel |= ((pel6<=threshold) << 2); 
     bitpackPel |= ((pel7<=threshold) << 1); 
     bitpackPel |= ((pel8<=threshold) ); 


     bitPackImage[byoff+(xOffsetLUT[bx+24])] = bitpackPel; 
    } 
} 
} 

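I am now testing on an Intel Xeon 5670, using (GCC) 4.1.2. Under these specs the hardcoded bitshift is about 4 ms slower than my original LUT algorithm. On the Xeon with GCC, the LUT algorithm takes on average 8.61 ms and the hardcoded bitshift takes on average 12.285 ms.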

+3

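Your lookup tables are useless. Simply computing the shift (if you do it correctly, rather than using the '%' operator with a signed integer, which is wastefully slow) would be much faster than a table lookup. Or, better yet, you could unroll the loop and hard-code the 8 shifts. Usually a shift by a constant is much faster than a variable bitshift, so that could help a lot. – 2010-09-14 00:39:49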

+0

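I modified the algorithm to simply use constant bit shifts... it actually ended up being about 4 ms slower than the LUT. I am now using an Intel Xeon with GCC 4.1.2. The algorithm using the LUT takes on average 8.61 ms, and without the LUT it takes on average 12.285 ms. – 2010-09-14 01:25:51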

+0

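+1 to R..; the second LUT is equally useless, because 'x/8' will become 'x >> 3', which is faster than '*(lut + x)' since you don't need to dereference a pointer. If you really think insane portability is valuable (and not precluded by the other constructs you are using), you could use 'x/CHAR_BIT'. – 2010-09-14 01:29:38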
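The commenters' suggestion of computing the byte index and shift directly, instead of reading them from the two lookup tables, could look roughly like the sketch below. The helper names `byte_index` and `bit_shift` are hypothetical, and the sketch assumes an unsigned X coordinate so that division and modulo by 8 reduce to a shift and a mask.

```c
#include <stddef.h>
#include <stdint.h>

/* Byte that holds pixel x in a bit-packed row; equal to x / 8 for unsigned x. */
static inline size_t byte_index(size_t x)
{
    return x >> 3;
}

/* Left-shift amount for pixel x in MSB2LSB order; equal to 7 - (x % 8). */
static inline unsigned bit_shift(size_t x)
{
    return 7u - (unsigned)(x & 7u);
}
```

Whether this is actually faster than the tables depends on the compiler and CPU, as the timings reported in the question suggest, so it is worth measuring both.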

Answers

2

Try something like:

unsigned i, w8=w>>3, x; 
for (i=0; i<w8; i++) { 
    x = thres-src[0]>>1&0x80; 
    x |= thres-src[1]>>2&0x40; 
    x |= thres-src[2]>>3&0x20; 
    x |= thres-src[3]>>4&0x10; 
    x |= thres-src[4]>>5&0x08; 
    x |= thres-src[5]>>6&0x04; 
    x |= thres-src[6]>>7&0x02; 
    x |= thres-src[7]>>8&0x01; 
    out[i] = x; 
    src += 8; 
} 

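You can work out the extra code for the remainder at the end of a row when the width is not a multiple of 8, or you can just pad/align the source to ensure it is a multiple of 8.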

+1

Are you sure those shifts shouldn't go from 0 to 7 rather than 1 to 8 (assuming the threshold and src are both 8-bit values)? – caf 2010-09-14 04:50:10

+0

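Yes, I chose the shift values correctly. I am shifting down bit 8, not bit 7, because I want the borrow from the integer result. Bit 7 could be 0 or 1 regardless of whether 'thres-src[k]' wrapped modulo 'UINT_MAX+1'. – 2010-09-14 12:36:40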
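The borrow argument in the comment above can be checked with a small sketch. The names `above_threshold` and `pack8_borrow` are hypothetical, and the sketch assumes both the threshold and the pixels are 8-bit values. Note that this test is 1 where the pixel exceeds the threshold, which is the opposite polarity of the `pel<=threshold` comparison in the question.

```c
#include <stdint.h>

/* Returns 1 when pixel > threshold, else 0, without a comparison.
   For 8-bit inputs the true difference lies in [-255, 255]. Computed in
   unsigned arithmetic, a negative difference wraps modulo UINT_MAX + 1 and
   sets every bit from bit 8 upward, so bit 8 acts as the borrow flag.
   Bit 7 cannot be used: 200 - 0 = 200 has bit 7 set with no borrow. */
static unsigned above_threshold(uint8_t threshold, uint8_t pixel)
{
    unsigned diff = (unsigned)threshold - (unsigned)pixel;
    return (diff >> 8) & 1u;
}

/* Pack 8 pixels into one byte, MSB first, using the borrow bit. */
static uint8_t pack8_borrow(const uint8_t *src, uint8_t threshold)
{
    unsigned x = 0;
    for (int k = 0; k < 8; k++)
        x |= above_threshold(threshold, src[k]) << (7 - k);
    return (uint8_t)x;
}
```

To match the question's polarity, the result can be inverted, for example by writing `(uint8_t)~pack8_borrow(src, threshold)`.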

1

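You can do this quite easily with SSE, processing 16 pixels at a time, e.g.

  • load a vector (16 x 8-bit unsigned)
  • add (255 - threshold) to each element
  • use PMOVMSKB to extract the sign bits into a 16-bit word
  • store the 16-bit word

Example code using SSE intrinsics (warning: untested!):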

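#include <emmintrin.h> // SSE2 intrinsics
#include <limits.h>    // CHAR_BIT
#include <stdint.h>    // uint8_t, uint16_t

void threshold_and_pack(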
    const uint8_t * in_image,  // input image, 16 byte aligned, height rows x width cols, width = multiple of 16 
    uint8_t * out_image,   // output image, 2 byte aligned, height rows x width/8 cols, width = multiple of 2 
    const uint8_t threshold,  // threshold 
    const int width, 
    const int height) 
{ 
    const __m128i vThreshold = _mm_set1_epi8(255 - threshold); 
    int i, j; 

    for (i = 0; i < height; ++i) 
    { 
     const __m128i * p_in = (__m128i *)&in_image[i * width]; 
     uint16_t * p_out = (uint16_t *)&out_image[i * width/CHAR_BIT]; 

     for (j = 0; j < width; j += 16) 
     { 
      __m128i v = _mm_load_si128(p_in); 
      uint16_t b; 

      v = _mm_add_epi8(v, vThreshold); 
      b = _mm_movemask_epi8(v); // use PMOVMSKB to pack sign bits into 16 bit word 

      *p_out = b; 

      p_in++; 
      p_out++; 
     } 
    } 
}
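Two details are worth keeping in mind when adapting the untested code above. `_mm_add_epi8` wraps modulo 256, so for some inputs the sign bit does not reflect the comparison (with threshold 100, pixels 0 and 240 both end up with the sign bit set). PMOVMSKB also places the sign bit of byte 0 in bit 0 of the result, so the pixels come out LSB first within each byte rather than MSB2LSB. The following is a minimal sketch of one possible variant, not the answerer's method: it swaps in an unsigned min/compare for the thresholding and reverses the bits of each byte afterwards. The names `reverse_bits_per_byte` and `threshold16_msb2lsb` are hypothetical, and it assumes an x86 target with SSE2.

```c
#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stdint.h>

/* Reverse the bit order inside each byte of a 16-bit value (bytes stay in place). */
static unsigned reverse_bits_per_byte(unsigned m)
{
    m = ((m & 0x5555u) << 1) | ((m >> 1) & 0x5555u);  /* swap adjacent bits    */
    m = ((m & 0x3333u) << 2) | ((m >> 2) & 0x3333u);  /* swap bit pairs        */
    m = ((m & 0x0F0Fu) << 4) | ((m >> 4) & 0x0F0Fu);  /* swap nibbles per byte */
    return m;
}

/* Threshold 16 pixels (bit = 1 where pixel <= threshold) and write 2 packed
   bytes in MSB2LSB order. src may be unaligned. */
static void threshold16_msb2lsb(const uint8_t *src, uint8_t *dst, uint8_t threshold)
{
    const __m128i vthr = _mm_set1_epi8((char)threshold);
    __m128i v = _mm_loadu_si128((const __m128i *)src);

    /* Unsigned min(v, thr) equals v exactly when v <= thr. */
    __m128i le = _mm_cmpeq_epi8(_mm_min_epu8(v, vthr), v);

    /* Bit i of the mask is the result for pixel i (LSB first). */
    unsigned m = (unsigned)_mm_movemask_epi8(le);

    m = reverse_bits_per_byte(m);
    dst[0] = (uint8_t)(m & 0xFFu);  /* pixels 0..7  */
    dst[1] = (uint8_t)(m >> 8);     /* pixels 8..15 */
}
```

A full-image version would call this for every group of 16 pixels in a row and fall back to scalar code for any remainder. If the TIFF is written with FillOrder set to LSB2MSB instead, the bit reversal step could be dropped.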