我正在执行一个算法,执行一个8位灰度图像的全局阈值为1位(位打包,使1字节包含8个像素)单色图片。灰度图像中的每个像素的亮度值可以为0 - 255.快速阈值和位打包算法(可能的改进?)
我的环境是Microsoft Visual Studio C++中的Win32。
我有兴趣尽量优化算法出好奇心,1位图像将变成TIFF。目前我正在将FillOrder设置为MSB2LSB(最高有效位至最低有效位),因为TIFF规范暗示了这一点(它不一定需要是MSB2LSB)
仅仅为那些不知道的人:
MSB2LSB将像素从左到右排列在一个字节中,就像在X坐标增加时像素在图像中定向一样。如果您在X轴上从左到右遍历灰度图像,显然要求您在打包当前字节中的位时考虑“向后”。有了这个说法,让我告诉你我目前有什么(这是C语言,我没有尝试ASM或编译器内部函数,但仅仅因为我没有什么经验,但这是可能的)。
因为单色图像的每字节8个像素,单色图像的宽度将
(grayscaleWidth+7)/8;
仅供参考,我认为我的最大图像为6000个像素宽:
我首先要做的(之前的任何图像被处理)是
1)计算量的查找表我需要转移到给定的X中的特定字节从我的灰度图像坐标:
int _shift_lut[6000];
for(int x = 0 ; x < 6000; x++)
{
_shift_lut[x] = 7-(x%8);
}
有了这个查询表,我可以包一个单色位值到我的东西,如工作的当前字节:
monochrome_pixel |= 1 << _shift_lut[ grayX ];
从而结束了在做
monochrome_pixel |= 1 << _shift_lut[ 7-(x%8)];
是一个巨大的速度增加我计算的第二个查找表是一个查找表,它告诉我在灰度像素上给出一个X像素的单色像素中的X索引。这个非常简单的LUT计算如下这样:
int xOffsetLut[6000];
int element_size=8; //8 bits
for(int x = 0; x < 6000; x++)
{
xOffsetLut[x]=x/element_size;
}
这LUT让我做的事情一样
monochrome_image[ xOffsetLut[ GrayX ] ] = packed_byte; //packed byte contains 8 pixels
我的灰度图像是一个简单的无符号字符*,所以是我的黑白图像;
这是我如何初始化单色图像:
int bitPackedScanlineStride = (grayscaleWidth+7)/8;
int bitpackedLength=bitPackedScanlineStride * grayscaleHeight;
unsigned char * bitpack_image = new unsigned char[bitpackedLength];
memset(bitpack_image,0,bitpackedLength);
然后,我打电话给我的双稳态功能就像这样:
binarize(
gray_image.DataPtr(),
bitpack_image,
globalFormThreshold,
grayscaleWidth,
grayscaleHeight,
bitPackedScanlineStride,
bitpackedLength,
_shift_lut,
xOffsetLut);
这里是我的二值化功能(你可以看到我做了一些循环展开,这可能会或可能不会帮助)。
void binarize(unsigned char grayImage[], unsigned char bitPackImage[], int threshold, int grayscaleWidth, int grayscaleHeight, int bitPackedScanlineStride, int bitpackedLength, int shiftLUT[], int xOffsetLUT[])
{
int yoff;
int byoff;
unsigned char bitpackPel=0;
unsigned char pel1=0;
unsigned char pel2=0;
unsigned char pel3=0;
unsigned char pel4=0;
unsigned char pel5=0;
unsigned char pel6=0;
unsigned char pel7=0;
unsigned char pel8=0;
int checkX=grayscaleWidth;
int checkY=grayscaleHeight;
for (int by = 0 ; by < checkY; by++)
{
yoff=by*grayscaleWidth;
byoff=by*bitPackedScanlineStride;
for(int bx = 0; bx < checkX; bx+=32)
{
bitpackPel = 0;
//pixel 1 in bitpack image
pel1=grayImage[yoff+bx];
pel2=grayImage[yoff+bx+1];
pel3=grayImage[yoff+bx+2];
pel4=grayImage[yoff+bx+3];
pel5=grayImage[yoff+bx+4];
pel6=grayImage[yoff+bx+5];
pel7=grayImage[yoff+bx+6];
pel8=grayImage[yoff+bx+7];
bitpackPel |= ((pel1<=threshold) << shiftLUT[bx]);
bitpackPel |= ((pel2<=threshold) << shiftLUT[bx+1]);
bitpackPel |= ((pel3<=threshold) << shiftLUT[bx+2]);
bitpackPel |= ((pel4<=threshold) << shiftLUT[bx+3]);
bitpackPel |= ((pel5<=threshold) << shiftLUT[bx+4]);
bitpackPel |= ((pel6<=threshold) << shiftLUT[bx+5]);
bitpackPel |= ((pel7<=threshold) << shiftLUT[bx+6]);
bitpackPel |= ((pel8<=threshold) << shiftLUT[bx+7]);
bitPackImage[byoff+(xOffsetLUT[bx])] = bitpackPel;
//pixel 2 in bitpack image
pel1=grayImage[yoff+bx+8];
pel2=grayImage[yoff+bx+9];
pel3=grayImage[yoff+bx+10];
pel4=grayImage[yoff+bx+11];
pel5=grayImage[yoff+bx+12];
pel6=grayImage[yoff+bx+13];
pel7=grayImage[yoff+bx+14];
pel8=grayImage[yoff+bx+15];
bitpackPel = 0;
bitpackPel |= ((pel1<=threshold) << shiftLUT[bx+8] );
bitpackPel |= ((pel2<=threshold) << shiftLUT[bx+9] );
bitpackPel |= ((pel3<=threshold) << shiftLUT[bx+10]);
bitpackPel |= ((pel4<=threshold) << shiftLUT[bx+11]);
bitpackPel |= ((pel5<=threshold) << shiftLUT[bx+12]);
bitpackPel |= ((pel6<=threshold) << shiftLUT[bx+13]);
bitpackPel |= ((pel7<=threshold) << shiftLUT[bx+14]);
bitpackPel |= ((pel8<=threshold) << shiftLUT[bx+15]);
bitPackImage[byoff+(xOffsetLUT[bx+8])] = bitpackPel;
//pixel 3 in bitpack image
pel1=grayImage[yoff+bx+16];
pel2=grayImage[yoff+bx+17];
pel3=grayImage[yoff+bx+18];
pel4=grayImage[yoff+bx+19];
pel5=grayImage[yoff+bx+20];
pel6=grayImage[yoff+bx+21];
pel7=grayImage[yoff+bx+22];
pel8=grayImage[yoff+bx+23];
bitpackPel = 0;
bitpackPel |= ((pel1<=threshold) << shiftLUT[bx+16] );
bitpackPel |= ((pel2<=threshold) << shiftLUT[bx+17] );
bitpackPel |= ((pel3<=threshold) << shiftLUT[bx+18]);
bitpackPel |= ((pel4<=threshold) << shiftLUT[bx+19]);
bitpackPel |= ((pel5<=threshold) << shiftLUT[bx+20]);
bitpackPel |= ((pel6<=threshold) << shiftLUT[bx+21]);
bitpackPel |= ((pel7<=threshold) << shiftLUT[bx+22]);
bitpackPel |= ((pel8<=threshold) << shiftLUT[bx+23]);
bitPackImage[byoff+(xOffsetLUT[bx+16])] = bitpackPel;
//pixel 4 in bitpack image
pel1=grayImage[yoff+bx+24];
pel2=grayImage[yoff+bx+25];
pel3=grayImage[yoff+bx+26];
pel4=grayImage[yoff+bx+27];
pel5=grayImage[yoff+bx+28];
pel6=grayImage[yoff+bx+29];
pel7=grayImage[yoff+bx+30];
pel8=grayImage[yoff+bx+31];
bitpackPel = 0;
bitpackPel |= ((pel1<=threshold) << shiftLUT[bx+24] );
bitpackPel |= ((pel2<=threshold) << shiftLUT[bx+25] );
bitpackPel |= ((pel3<=threshold) << shiftLUT[bx+26]);
bitpackPel |= ((pel4<=threshold) << shiftLUT[bx+27]);
bitpackPel |= ((pel5<=threshold) << shiftLUT[bx+28]);
bitpackPel |= ((pel6<=threshold) << shiftLUT[bx+29]);
bitpackPel |= ((pel7<=threshold) << shiftLUT[bx+30]);
bitpackPel |= ((pel8<=threshold) << shiftLUT[bx+31]);
bitPackImage[byoff+(xOffsetLUT[bx+24])] = bitpackPel;
}
}
}
我知道这个算法可能会遗漏每行中的一些尾随像素,但不用担心这一点。
正如你可以看到每个单色字节,我处理8个灰度像素。
如果您看到 pel8 < =门槛 是一个整洁的小把戏解析为0或1,比如果{}其他快得多{}
对于XI的每个增量收拾了一下进入了更高序位比以前的X
因此对于灰度图像
1 | 2 | 3 | 4 | 5 | 6 | 7 | 8
这在第一组的8个像素是什么字节看起来像位(显然每个编号位是举ST已经处理相应编号的像素的阈值的结果,但你的想法)
1 | 2 | 3 | 4 | 5 | 6 | 7 | 8
呼这应该是它。随意使用一些漂亮的捣蛋技巧来获得一些乐趣,这些技巧会挤掉这个算法中的更多汁液。
编译器优化开启后,此功能在核心2 duo机器上的大约5000 x 2200像素图像上平均需要16毫秒。
编辑:
R.,的建议是,以除去偏移LUT,只是使用常量这实际上是完全合理...我已经修改每个像素的的OR'ing是因为这样:
void binarize(unsigned char grayImage[], unsigned char bitPackImage[], int threshold, int grayscaleWidth, int grayscaleHeight, int bitPackedScanlineStride, int bitpackedLength, int shiftLUT[], int xOffsetLUT[])
{
int yoff;
int byoff;
unsigned char bitpackPel=0;
unsigned char pel1=0;
unsigned char pel2=0;
unsigned char pel3=0;
unsigned char pel4=0;
unsigned char pel5=0;
unsigned char pel6=0;
unsigned char pel7=0;
unsigned char pel8=0;
int checkX=grayscaleWidth-32;
int checkY=grayscaleHeight;
for (int by = 0 ; by < checkY; by++)
{
yoff=by*grayscaleWidth;
byoff=by*bitPackedScanlineStride;
for(int bx = 0; bx < checkX; bx+=32)
{
bitpackPel = 0;
//pixel 1 in bitpack image
pel1=grayImage[yoff+bx];
pel2=grayImage[yoff+bx+1];
pel3=grayImage[yoff+bx+2];
pel4=grayImage[yoff+bx+3];
pel5=grayImage[yoff+bx+4];
pel6=grayImage[yoff+bx+5];
pel7=grayImage[yoff+bx+6];
pel8=grayImage[yoff+bx+7];
/*bitpackPel |= ((pel1<=threshold) << shiftLUT[bx]);
bitpackPel |= ((pel2<=threshold) << shiftLUT[bx+1]);
bitpackPel |= ((pel3<=threshold) << shiftLUT[bx+2]);
bitpackPel |= ((pel4<=threshold) << shiftLUT[bx+3]);
bitpackPel |= ((pel5<=threshold) << shiftLUT[bx+4]);
bitpackPel |= ((pel6<=threshold) << shiftLUT[bx+5]);
bitpackPel |= ((pel7<=threshold) << shiftLUT[bx+6]);
bitpackPel |= ((pel8<=threshold) << shiftLUT[bx+7]);*/
bitpackPel |= ((pel1<=threshold) << 7);
bitpackPel |= ((pel2<=threshold) << 6);
bitpackPel |= ((pel3<=threshold) << 5);
bitpackPel |= ((pel4<=threshold) << 4);
bitpackPel |= ((pel5<=threshold) << 3);
bitpackPel |= ((pel6<=threshold) << 2);
bitpackPel |= ((pel7<=threshold) << 1);
bitpackPel |= ((pel8<=threshold) );
bitPackImage[byoff+(xOffsetLUT[bx])] = bitpackPel;
//pixel 2 in bitpack image
pel1=grayImage[yoff+bx+8];
pel2=grayImage[yoff+bx+9];
pel3=grayImage[yoff+bx+10];
pel4=grayImage[yoff+bx+11];
pel5=grayImage[yoff+bx+12];
pel6=grayImage[yoff+bx+13];
pel7=grayImage[yoff+bx+14];
pel8=grayImage[yoff+bx+15];
bitpackPel = 0;
/*bitpackPel |= ((pel1<=threshold) << shiftLUT[bx+8] );
bitpackPel |= ((pel2<=threshold) << shiftLUT[bx+9] );
bitpackPel |= ((pel3<=threshold) << shiftLUT[bx+10]);
bitpackPel |= ((pel4<=threshold) << shiftLUT[bx+11]);
bitpackPel |= ((pel5<=threshold) << shiftLUT[bx+12]);
bitpackPel |= ((pel6<=threshold) << shiftLUT[bx+13]);
bitpackPel |= ((pel7<=threshold) << shiftLUT[bx+14]);
bitpackPel |= ((pel8<=threshold) << shiftLUT[bx+15]);*/
bitpackPel |= ((pel1<=threshold) << 7);
bitpackPel |= ((pel2<=threshold) << 6);
bitpackPel |= ((pel3<=threshold) << 5);
bitpackPel |= ((pel4<=threshold) << 4);
bitpackPel |= ((pel5<=threshold) << 3);
bitpackPel |= ((pel6<=threshold) << 2);
bitpackPel |= ((pel7<=threshold) << 1);
bitpackPel |= ((pel8<=threshold) );
bitPackImage[byoff+(xOffsetLUT[bx+8])] = bitpackPel;
//pixel 3 in bitpack image
pel1=grayImage[yoff+bx+16];
pel2=grayImage[yoff+bx+17];
pel3=grayImage[yoff+bx+18];
pel4=grayImage[yoff+bx+19];
pel5=grayImage[yoff+bx+20];
pel6=grayImage[yoff+bx+21];
pel7=grayImage[yoff+bx+22];
pel8=grayImage[yoff+bx+23];
bitpackPel = 0;
/*bitpackPel |= ((pel1<=threshold) << shiftLUT[bx+16] );
bitpackPel |= ((pel2<=threshold) << shiftLUT[bx+17] );
bitpackPel |= ((pel3<=threshold) << shiftLUT[bx+18]);
bitpackPel |= ((pel4<=threshold) << shiftLUT[bx+19]);
bitpackPel |= ((pel5<=threshold) << shiftLUT[bx+20]);
bitpackPel |= ((pel6<=threshold) << shiftLUT[bx+21]);
bitpackPel |= ((pel7<=threshold) << shiftLUT[bx+22]);
bitpackPel |= ((pel8<=threshold) << shiftLUT[bx+23]);*/
bitpackPel |= ((pel1<=threshold) << 7);
bitpackPel |= ((pel2<=threshold) << 6);
bitpackPel |= ((pel3<=threshold) << 5);
bitpackPel |= ((pel4<=threshold) << 4);
bitpackPel |= ((pel5<=threshold) << 3);
bitpackPel |= ((pel6<=threshold) << 2);
bitpackPel |= ((pel7<=threshold) << 1);
bitpackPel |= ((pel8<=threshold) );
bitPackImage[byoff+(xOffsetLUT[bx+16])] = bitpackPel;
//pixel 4 in bitpack image
pel1=grayImage[yoff+bx+24];
pel2=grayImage[yoff+bx+25];
pel3=grayImage[yoff+bx+26];
pel4=grayImage[yoff+bx+27];
pel5=grayImage[yoff+bx+28];
pel6=grayImage[yoff+bx+29];
pel7=grayImage[yoff+bx+30];
pel8=grayImage[yoff+bx+31];
bitpackPel = 0;
/*bitpackPel |= ((pel1<=threshold) << shiftLUT[bx+24] );
bitpackPel |= ((pel2<=threshold) << shiftLUT[bx+25] );
bitpackPel |= ((pel3<=threshold) << shiftLUT[bx+26]);
bitpackPel |= ((pel4<=threshold) << shiftLUT[bx+27]);
bitpackPel |= ((pel5<=threshold) << shiftLUT[bx+28]);
bitpackPel |= ((pel6<=threshold) << shiftLUT[bx+29]);
bitpackPel |= ((pel7<=threshold) << shiftLUT[bx+30]);
bitpackPel |= ((pel8<=threshold) << shiftLUT[bx+31]);*/
bitpackPel |= ((pel1<=threshold) << 7);
bitpackPel |= ((pel2<=threshold) << 6);
bitpackPel |= ((pel3<=threshold) << 5);
bitpackPel |= ((pel4<=threshold) << 4);
bitpackPel |= ((pel5<=threshold) << 3);
bitpackPel |= ((pel6<=threshold) << 2);
bitpackPel |= ((pel7<=threshold) << 1);
bitpackPel |= ((pel8<=threshold) );
bitPackImage[byoff+(xOffsetLUT[bx+24])] = bitpackPel;
}
}
}
我现在使用(GCC)4.1.2在Intel Xeon 5670上进行测试。在这些规范下,硬编码bitshift比使用我原来的LUT算法慢4 ms。在Xeon和GCC中,LUT算法平均需要8.61 ms,而硬编码的bitshift平均需要12.285 ms。
您的查找表是无用的。简单地计算一下shift(如果你正确地做了,而不是使用'%'操作符和一个有符号整数,这是浪费很慢的),比表查找要快得多。或者,更好的是,您可以展开循环并对8个班次进行硬编码。通常通过常量移位比变量bitshift快很多,所以它可以帮助很多。 – 2010-09-14 00:39:49
我修改了算法以简单地使用恒定的位移......它实际上最终比LUT慢4 ms。我现在在GCC 1.4.2上使用Intel Xeon。使用LUT的算法平均需要8.61毫秒,没有LUT平均需要12.285毫秒。 – 2010-09-14 01:25:51
+1 to R ..,第二个lut同样没用,因为'x/8'将变成'x >> 3',它比'*(lut + x)'更快,因为你不需要解引用指针。如果你真的认为疯狂的可移植性是有价值的(并且不被你正在使用的其他构造所排除),那么你可以使用'x/CHAR_BIT'。 – 2010-09-14 01:29:38