LockBits performance-critical code

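I have a method that needs to be as fast as possible. It uses unsafe memory pointers, and this is my first foray into this kind of coding, so I know it can probably be made faster.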

/// <summary> 
    /// Copies bitmapdata from one bitmap to another at a specified point on the output bitmapdata 
    /// </summary> 
    /// <param name="sourcebtmpdata">The sourcebitmap must be smaller than the destbitmap</param> 
    /// <param name="destbtmpdata"></param> 
    /// <param name="point">The point on the destination bitmap to draw at</param> 
    private static unsafe void CopyBitmapToDest(BitmapData sourcebtmpdata, BitmapData destbtmpdata, Point point) 
    { 
     // calculate total number of rows to draw. 
     var totalRow = Math.Min(
      destbtmpdata.Height - point.Y, 
      sourcebtmpdata.Height); 


     //loop through each row on the source bitmap and get mem pointers 
     //to the source bitmap and dest bitmap 
     for (int i = 0; i < totalRow; i++) 
     { 
      int destRow = point.Y + i; 

      //get the pointer to the start of the current pixel "row" on the output image 
      byte* destRowPtr = (byte*)destbtmpdata.Scan0 + (destRow * destbtmpdata.Stride); 
      //get the pointer to the start of the FIRST pixel row on the source image 
      byte* srcRowPtr = (byte*)sourcebtmpdata.Scan0 + (i * sourcebtmpdata.Stride); 

      int pointX = point.X; 
      //the rowSize is pre-computed before the loop to improve performance 
      int rowSize = Math.Min(destbtmpdata.Width - pointX, sourcebtmpdata.Width); 
      //for each row each set each pixel 
      for (int j = 0; j < rowSize; j++) 
      { 
       int firstBlueByte = ((pointX + j)*3); 

       int srcByte = j *3; 
       destRowPtr[(firstBlueByte)] = srcRowPtr[srcByte]; 
       destRowPtr[(firstBlueByte) + 1] = srcRowPtr[srcByte + 1]; 
       destRowPtr[(firstBlueByte) + 2] = srcRowPtr[srcByte + 2]; 
      } 


     } 
    }

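So is there any way to make this faster? I'm ignoring the TODOs for now; I'll deal with them later once I have some baseline performance measurements.

Update: Sorry, I should have mentioned that I'm using this instead of Graphics.DrawImage because I'm implementing multithreading, and because of that I can't use DrawImage.

Update 2: I'm still not satisfied with the performance; I'm sure there are a few more ms to be had.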

2009-04-11 Lee Treveil

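Why are you calling LockBits? Are you doing something else directly on the bitmap that isn't in the code you posted? Because instead of locking the bitmap and copying it byte by byte, you could call Graphics.DrawImage. – 2009-04-11 18:23:38

I've added another answer... see if it helps :-) – chakrit 2009-04-11 19:11:06

There was something fundamentally wrong with the code that I can't believe I didn't notice until now: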

byte* destRowPtr = (byte*)destbtmpdata.Scan0 + (destRow * destbtmpdata.Stride);

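This gets a pointer to the destination row, but not to the column it copies to; in the old code that was done inside the rowSize loop. It now looks like: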

byte* destRowPtr = (byte*)destbtmpdata.Scan0 + (destRow * destbtmpdata.Stride) + pointX * 3;

So now we have the correct destination pointer, and we can get rid of that loop. Using suggestions from Vilx- and Rob, the code now looks like:

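 // CopyMemory is a P/Invoke declaration of kernel32's RtlMoveMemory 
 // (requires using System.Runtime.InteropServices;) 
 [DllImport("kernel32.dll", EntryPoint = "RtlMoveMemory")] 
 private static extern void CopyMemory(IntPtr dest, IntPtr src, uint count); 

 private static unsafe void CopyBitmapToDestSuperFast(BitmapData sourcebtmpdata, BitmapData destbtmpdata, Point point) 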
    { 
     //calculate total number of rows to copy. 
     //using ternary operator instead of Math.Min, few ms faster 
     int totalRows = (destbtmpdata.Height - point.Y < sourcebtmpdata.Height) ? destbtmpdata.Height - point.Y : sourcebtmpdata.Height; 
     //calculate the width of the image to draw, this cuts off the image 
     //if it goes past the width of the destination image 
     int rowWidth = (destbtmpdata.Width - point.X < sourcebtmpdata.Width) ? destbtmpdata.Width - point.X : sourcebtmpdata.Width; 

     //loop through each row on the source bitmap and get mem pointers 
     //to the source bitmap and dest bitmap 
     for (int i = 0; i < totalRows; i++) 
     { 
      int destRow = point.Y + i; 

      //get the pointer to the start of the current pixel "row" and column on the output image 
      byte* destRowPtr = (byte*)destbtmpdata.Scan0 + (destRow * destbtmpdata.Stride) + point.X * 3; 

      //get the pointer to the start of the FIRST pixel row on the source image 
      byte* srcRowPtr = (byte*)sourcebtmpdata.Scan0 + (i * sourcebtmpdata.Stride); 

      //RtlMoveMemory function 
      CopyMemory(new IntPtr(destRowPtr), new IntPtr(srcRowPtr), (uint)rowWidth * 3); 

     } 
    }

Copying a 500×500 image into a grid on a 5000×5000 image 50 times took 00:00:07.9948993. With the changes above it now takes 00:00:01.8714263. Much better.
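The fix above comes down to two things: the destination start for row `point.Y + i` is `(point.Y + i) * Stride + point.X * 3`, and each clipped row is moved with a single block copy. A minimal managed sketch of the same idea, assuming 24bpp pixels, non-negative `x`/`y`, and that both images are available as `byte[]` buffers laid out like `BitmapData` (rows `stride` bytes apart). The class and method names are hypothetical, and `Buffer.BlockCopy` stands in for the `CopyMemory`/`RtlMoveMemory` call.

```csharp
using System;

public static class RowCopy
{
    // Bytes per pixel for Format24bppRgb (assumed pixel format).
    public const int PixelSize = 3;

    // Copies a srcWidth x srcHeight 24bpp image into dest at (x, y),
    // clipping anything past the right or bottom edge. Assumes x, y >= 0.
    public static void CopyRows(
        byte[] src, int srcWidth, int srcHeight, int srcStride,
        byte[] dest, int destWidth, int destHeight, int destStride,
        int x, int y)
    {
        // Clip once, outside the loop.
        int rows = Math.Min(destHeight - y, srcHeight);
        int rowBytes = Math.Min(destWidth - x, srcWidth) * PixelSize;
        if (rows <= 0 || rowBytes <= 0) return;

        for (int i = 0; i < rows; i++)
        {
            // Start of source row i, and of pixel (x, y + i) in the destination.
            int srcOffset = i * srcStride;
            int destOffset = (y + i) * destStride + x * PixelSize;

            // One block move per row instead of one assignment per byte.
            Buffer.BlockCopy(src, srcOffset, dest, destOffset, rowBytes);
        }
    }
}
```

For example, copying a 2×2 source (stride 8, since 6 bytes of pixels are padded to a multiple of 4) to (1, 1) of a 4×3 destination (stride 12) writes only destination bytes 15 to 20 and 27 to 32.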

2009-05-13 15:50:25

Hmm... I'm not sure whether the .NET bitmap data format is fully compatible with the Windows GDI32 functions...

But one of the first Win32 APIs I learned was BitBlt:

BOOL BitBlt(
    HDC hdcDest, 
    int nXDest, 
    int nYDest, 
    int nWidth, 
    int nHeight, 
    HDC hdcSrc, 
    int nXSrc, 
    int nYSrc, 
    DWORD dwRop 
);

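It's the fastest way to copy data, if I remember correctly.

Here is the P/Invoke signature of BitBlt for C#, along with related usage information; it's a great read for anyone working with high-performance graphics in C#:

http://www.pinvoke.net/default.aspx/gdi32/BitBlt.html

Definitely worth a look.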

2009-04-11 18:13:56 chakrit

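Yes, I looked at that, but I can't use it because of the handles; I'm only manipulating raw Image objects. – 2009-04-11 18:27:44

Have you looked at StretchDIBits? – 2009-05-12 18:15:16

I think the stride and the row-count limits can be calculated in advance.

I precalculated all the multiplications, which results in the following code: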

private static unsafe void CopyBitmapToDest(BitmapData sourcebtmpdata, BitmapData destbtmpdata, Point point) 
{ 
    //TODO: It is expected that the bitmap PixelFormat is Format24bppRgb but this could change in the future 
    const int pixelSize = 3; 

    // calculate total number of rows to draw. 
    var totalRow = Math.Min(
     destbtmpdata.Height - point.Y, 
     sourcebtmpdata.Height); 

    var rowSize = Math.Min(
     (destbtmpdata.Width - point.X) * pixelSize, 
     sourcebtmpdata.Width * pixelSize); 

    // starting point of copy operation; the destination pointer also 
    // includes the X offset in bytes (point.X * pixelSize) 
    byte* srcPtr = (byte*)sourcebtmpdata.Scan0; 
    byte* destPtr = (byte*)destbtmpdata.Scan0 + point.Y * destbtmpdata.Stride + point.X * pixelSize; 

    // loop through each row 
    for (int i = 0; i < totalRow; i++) { 

     // draw the entire row 
     for (int j = 0; j < rowSize; j++) 
      destPtr[j] = srcPtr[j]; 

     // advance each pointer by 1 row 
     destPtr += destbtmpdata.Stride; 
     srcPtr += sourcebtmpdata.Stride; 
    } 

}

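Not fully tested, but you should be able to get it working.

I've removed the multiplications from the loop (replacing them with precalculated values) and removed most of the branching, so it should be slightly faster.

Let me know if this helps :-)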
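The precompute-and-advance approach in this answer can be sketched on managed buffers: the clip limits and starting offsets are computed once, and each row offset then moves forward by its stride instead of being recomputed with a multiply. This is a hypothetical helper assuming 24bpp `byte[]` buffers laid out like `BitmapData`; as in the asker's later fix, the destination start includes `x * 3` bytes, not just `x`.

```csharp
using System;

public static class StrideCopy
{
    // Copies a 24bpp source buffer into a 24bpp destination at (x, y),
    // precomputing the clip limits and advancing row offsets by the stride.
    public static void CopyAdvancing(
        byte[] src, int srcWidth, int srcHeight, int srcStride,
        byte[] dest, int destWidth, int destHeight, int destStride,
        int x, int y)
    {
        const int pixelSize = 3;
        int totalRows = Math.Min(destHeight - y, srcHeight);
        int rowBytes = Math.Min(destWidth - x, srcWidth) * pixelSize;

        // Starting offsets; the destination start includes the X column in bytes.
        int srcRow = 0;
        int destRow = y * destStride + x * pixelSize;

        for (int i = 0; i < totalRows; i++)
        {
            for (int j = 0; j < rowBytes; j++)
                dest[destRow + j] = src[srcRow + j];

            // Advance each offset by one row; no multiplication inside the loop.
            srcRow += srcStride;
            destRow += destStride;
        }
    }
}
```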

2009-04-11 18:54:26 chakrit

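Thanks, I just ran some tests and it's 3 times slower! I can't understand it, because it looks faster. – 2009-04-11 19:47:21

Bah! :-( ... I guess the Math.Min calls might be the culprit... Anyway, I'm off to bed now... – chakrit 2009-04-11 19:59:47

The inner loop is where much of the time will be spent (but take measurements to make sure):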

for (int j = 0; j < sourcebtmpdata.Width; j++) 
{ 
    destRowPtr[(point.X + j) * 3] = srcRowPtr[j * 3]; 
    destRowPtr[((point.X + j) * 3) + 1] = srcRowPtr[(j * 3) + 1]; 
    destRowPtr[((point.X + j) * 3) + 2] = srcRowPtr[(j * 3) + 2]; 
}

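Get rid of the multiplies and the array indexing (which is a multiply under the hood) and replace them with a pointer that you increment.
The same goes for the +1 and +2: increment a pointer.
Your compiler probably won't keep recomputing point.X (check this), but make a local variable just in case. It won't matter for a single iteration, but it might be recomputed on every iteration.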
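A managed analogue of replacing `j * 3` with incremented pointers is sketched below. Running indices take the place of `destRowPtr`/`srcRowPtr` so it compiles without `unsafe`. The helper name is hypothetical; it assumes 24bpp data (blue, green, red byte order) and that `destIndex` already points at the destination column, i.e. already includes `pointX * 3`.

```csharp
public static class IncrementCopy
{
    // Copies `pixels` 24bpp pixels from src (starting at srcIndex) to dest
    // (starting at destIndex) using incremented indices instead of j * 3.
    public static void CopyPixels(byte[] src, int srcIndex, byte[] dest, int destIndex, int pixels)
    {
        int s = srcIndex;                // plays the role of srcRowPtr
        int d = destIndex;               // plays the role of destRowPtr (already offset by pointX * 3)
        int end = srcIndex + pixels * 3; // the only multiplication, done once

        while (s < end)
        {
            dest[d++] = src[s++]; // blue
            dest[d++] = src[s++]; // green
            dest[d++] = src[s++]; // red
        }
    }
}
```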

2009-04-11 22:51:21

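You might want to take a look at Eigen.

It's a C++ template library that uses the SSE (2 and later) and AltiVec instruction sets, with graceful fallback to non-vectorized code.

Fast. (See benchmarks.)
Expression templates allow temporaries to be removed intelligently and enable lazy evaluation when appropriate; Eigen takes care of this automatically and handles aliasing too in most cases.
Explicit vectorization is performed for the SSE (2 and later) and AltiVec instruction sets, with graceful fallback to non-vectorized code. Expression templates allow these optimizations to be performed globally for whole expressions.
With fixed-size objects, dynamic memory allocation is avoided, and loops are unrolled when that makes sense.
For large matrices, special attention is paid to cache-friendliness.

You could implement your function in C++ and then call it from C#.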

2009-05-09 18:16:01 lothar

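I'm with lothar: C++/CLI for high-performance code, C# for pretty code. – GregC 2009-05-09 23:00:29

You don't always need pointers to get good speed. This should be within a few ms of the original: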

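 private static void CopyBitmapToDest(BitmapData sourcebtmpdata, BitmapData destbtmpdata, Point point) 
    { 
     // copy both locked buffers, including any stride padding, into managed arrays 
     // (Marshal is in System.Runtime.InteropServices) 
     byte[] src = new byte[sourcebtmpdata.Stride * sourcebtmpdata.Height]; 
     byte[] dest = new byte[destbtmpdata.Stride * destbtmpdata.Height]; 
     Marshal.Copy(sourcebtmpdata.Scan0, src, 0, src.Length); 
     Marshal.Copy(destbtmpdata.Scan0, dest, 0, dest.Length); 

     // clip the copy to the destination (24bpp: 3 bytes per pixel) 
     int rows = Math.Min(destbtmpdata.Height - point.Y, sourcebtmpdata.Height); 
     int rowBytes = Math.Min(destbtmpdata.Width - point.X, sourcebtmpdata.Width) * 3; 
     if (rows <= 0 || rowBytes <= 0) return; 

     for (int i = 0; i < rows; i++) 
     { 
      int srcIndex = i * sourcebtmpdata.Stride; 
      int destIndex = (point.Y + i) * destbtmpdata.Stride + point.X * 3; 
      Array.Copy(src, srcIndex, dest, destIndex, rowBytes); 
     } 
     Marshal.Copy(dest, 0, destbtmpdata.Scan0, dest.Length); 
    }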

2009-05-09 21:57:29

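I'm looking at your C# code and I can't recognize anything familiar; it looks like a ton of C++. By the way, it looks like DirectX/XNA needs to become your new best friend. Just my 2 cents. Don't shoot the messenger.

If you must rely on the CPU to do this: I've done some 24-bit layout optimizations myself, and I can tell you that memory access speed should be your bottleneck. Use SSE3 instructions for the fastest possible bytewise access. That means C++ and inline assembly. In pure C you'll be about 30% slower on most machines.

Keep in mind that for this kind of operation, modern GPUs are much faster than CPUs.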

2009-05-09 23:03:20 GregC

I'm not sure whether this will give extra performance, but I've seen this pattern a lot in Reflector.

So:

int srcByte = j *3; 
destRowPtr[(firstBlueByte)] = srcRowPtr[srcByte]; 
destRowPtr[(firstBlueByte) + 1] = srcRowPtr[srcByte + 1]; 
destRowPtr[(firstBlueByte) + 2] = srcRowPtr[srcByte + 2];

becomes:

*destRowPtr++ = *srcRowPtr++; 
*destRowPtr++ = *srcRowPtr++; 
*destRowPtr++ = *srcRowPtr++;

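It may need a few more parentheses.

If the width is fixed, you could unroll the whole row into a few hundred lines. :)

Update

You could also try using a larger type, such as Int32 or Int64, for better performance.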
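To illustrate the wider-type idea, the sketch below moves 8 bytes per step through a `long` and then finishes the 0 to 7 leftover bytes individually. It is a hypothetical helper that assumes .NET Core 2.1 or later (for `Span<T>` and `System.Buffers.Binary.BinaryPrimitives`); in unsafe code the rough equivalent would be casting the row pointers to `long*`. Whether it beats a plain block copy is something to measure, not assume.

```csharp
using System;
using System.Buffers.Binary;

public static class WideCopy
{
    // Copies count bytes 8 at a time through a long, then copies the
    // remaining 0-7 bytes one by one.
    public static void CopyBytes(byte[] src, int srcIndex, byte[] dest, int destIndex, int count)
    {
        int i = 0;
        for (; i + 8 <= count; i += 8)
        {
            // A little-endian read followed by a little-endian write preserves
            // the byte order regardless of the machine's endianness.
            long chunk = BinaryPrimitives.ReadInt64LittleEndian(src.AsSpan(srcIndex + i, 8));
            BinaryPrimitives.WriteInt64LittleEndian(dest.AsSpan(destIndex + i, 8), chunk);
        }
        for (; i < count; i++)
            dest[destIndex + i] = src[srcIndex + i];
    }
}
```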

2009-05-11 14:44:44 leppie

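Unfortunately I don't have time to write out a full solution, but I would look into using the platform RtlMoveMemory() function to move whole rows instead of going byte by byte. That should be a lot faster.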

2009-05-13 11:41:49

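Well, this is getting very close to the limit of how many ms you can squeeze out of the algorithm, but you can get rid of the calls to Math.Min and replace them with a ternary operator instead.

Generally speaking, making a library call takes longer than doing the work yourself, and I wrote a simple test driver to confirm this for Math.Min.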

using System; 
using System.Diagnostics; 

namespace TestDriver 
{ 
    class Program 
    { 
     static void Main(string[] args) 
     { 
      // Report which timer Stopwatch is using 
      if (Stopwatch.IsHighResolution) 
      { Console.WriteLine("Using high resolution timer"); } 
      else 
      { Console.WriteLine("High resolution timer unavailable"); } 
      // Test Math.Min for 10000 iterations 
      Stopwatch sw = Stopwatch.StartNew(); 
      for (int ndx = 0; ndx < 10000; ndx++) 
      { 
       int result = Math.Min(ndx, 5000); 
      } 
      Console.WriteLine(sw.Elapsed.TotalMilliseconds.ToString("0.0000")); 
      // Test ternary operator for 10000 iterations 
      sw = Stopwatch.StartNew(); 
      for (int ndx = 0; ndx < 10000; ndx++) 
      { 
       int result = (ndx < 5000) ? ndx : 5000; 
      } 
      Console.WriteLine(sw.Elapsed.TotalMilliseconds.ToString("0.0000")); 
      Console.ReadKey(); 
     } 
    } 
}

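Results from running the above on my computer, an Intel T2400 @ 1.83GHz. Note that the results vary somewhat, but the ternary operator is generally about 0.01 ms faster. That's not much, but over a large enough data set it will add up.

Using high resolution timer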
0.0539
0.0402
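One caveat about the driver above: `result` is never used, so the JIT may drop or simplify the loop bodies, and 10,000 iterations finish in a few hundredths of a millisecond, which is close to timer noise. A sketch of a somewhat fairer comparison accumulates every result into a sum, warms each method up once so JIT compilation is not timed, and leaves the iteration count to the caller. The class and method names are hypothetical.

```csharp
using System;
using System.Diagnostics;

public static class MinBenchmark
{
    // Sums Math.Min(i, cap) over [0, n) so the loop cannot be dropped as dead code.
    public static long SumWithMathMin(int n, int cap)
    {
        long sum = 0;
        for (int i = 0; i < n; i++)
            sum += Math.Min(i, cap);
        return sum;
    }

    // Same computation with the conditional (ternary) operator.
    public static long SumWithTernary(int n, int cap)
    {
        long sum = 0;
        for (int i = 0; i < n; i++)
            sum += (i < cap) ? i : cap;
        return sum;
    }

    // Times one run of f after a warm-up call that triggers JIT compilation.
    public static double TimeMs(Func<int, int, long> f, int n, int cap, out long result)
    {
        f(n, cap); // warm-up, not timed
        Stopwatch sw = Stopwatch.StartNew();
        result = f(n, cap);
        sw.Stop();
        return sw.Elapsed.TotalMilliseconds;
    }
}
```

For example, `MinBenchmark.TimeMs(MinBenchmark.SumWithMathMin, 10000000, 5000, out long a)` and the same call with `SumWithTernary` return the two timings to compare, and both `out` results should be equal. Absolute numbers will vary by machine and runtime.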

2009-05-13 13:27:16 rjzii
