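Using Halide extern with GPU

I am trying to use an extern function in Halide. In my case, I want to do this on the GPU.

I compile ahead of time (AOT) with an OpenCL target. Of course, OpenCL can still run on the CPU, so I use this: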

halide_set_ocl_device_type("gpu");

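Right now, everything is scheduled with compute_root().

First question: if I use compute_root() with OpenCL on the GPU, will my process be computed on the device, with some CopyHtoD and DtoH transfers? (Or will it stay in host buffers?)

Second question, more related to extern functions. We use some extern calls because some of our algorithms are not written in Halide. The extern call: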

foo.define_extern("cool_foo", args, Float(32), 4);

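The extern definition: extern "C" int cool_foo(buffer_t *in, int w, int h, int z, buffer_t *out) {..}

But inside the cool_foo function, my buffer_t is only loaded in host memory. The dev address is 0 (the default).

If I try to copy the buffer to device memory before the algorithm: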

halide_copy_to_dev(NULL, &in);

It does nothing.

If I make the buffer available only in device memory:

in.host = NULL;

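My host pointer is null, but the device address is still 0.

(In my case dev_dirty is true and host_dirty is false.)

Any ideas?

EDIT (to answer dsharlet)

Here is the structure of my code:

Parse the data correctly on the CPU. -> Send the buffer to the GPU (using halide_copy_to_dev...) -> Enter the Halide pipeline, read the parameters and add a boundary condition -> Go into my extern function -> ...

I do not have a valid buffer_t in my extern function. I schedule everything with compute_root(), but I use HL_TARGET=host-opencl and set the OpenCL device type to gpu. Before entering Halide, I can read my device address and it is fine.

Here is my code:

Before Halide, everything is CPU stuff (pointers), and we transfer it to the GPU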

buffer_t k = { 0, (uint8_t *) k_full, {w_k, h_k, num_patch_x * num_patch_y * 3}, {1, w_k, w_k * h_k}, {0}, sizeof(float), }; 
#if defined(USEGPU) 
    // Transfer into GPU 
    halide_copy_to_dev(NULL, &k); 
    k.host_dirty = false; 
    k.dev_dirty = true; 
    //k.host = NULL; // It's k_full 
#endif 
halide_func(&k);

Inside Halide

ImageParam ... 
Func process; 
process = halide_sub_func(k, width, height, k.channels()); 
process.compute_root(); 

... 

Func halide_sub_func(ImageParam k, Expr width, Expr height, Expr patches) 
{ 
    Func kBounded("kBounded"), kShifted("kShifted"), khat("khat"), khat_tuple("khat_tuple"); 
    kBounded = repeat_image(constant_exterior(k, 0.0f), 0, width, 0, height, 0, patches); 
    kShifted(x, y, pi) = kBounded(x + k.width()/2, y + k.height()/2, pi); 

    khat = extern_func(kShifted, width, height, patches); 
    khat_tuple(x, y, pi) = Tuple(khat(0, x, y, pi), khat(1, x, y, pi)); 

    kShifted.compute_root(); 
    khat.compute_root(); 

    return khat_tuple; 
}

Outside Halide (extern function):

inline .... 
{ 
    //The buffer_t.dev and .host are 0 and null. I expect a null from the host, but the dev.. 
}

2014-10-08 Darkjay

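Can you share the code that defines and schedules the stage before the extern stage? Is it scheduled on the GPU? If not, I think the behavior you are seeing is expected. – dsharlet 2014-10-08 22:03:06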

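Are you aware of the bounds inference protocol for extern array functions? It applies when the host pointer of any buffer is NULL. (Briefly, in this case you need to fill in the extent fields of the buffer_t structures that have NULL host pointers and do nothing else.) If you have already taken care of that, ignore the above.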
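As a rough illustration of that protocol, here is a minimal sketch. It assumes a stand-in struct that mirrors the legacy buffer_t layout so it compiles without the Halide headers, and the names is_bounds_query and sketch_extern are hypothetical. Checking both host and dev is an assumption made here so that a device-only buffer (NULL host, nonzero dev) is not mistaken for a bounds query.

```cpp
#include <cassert>
#include <cstdint>

// Stand-in mirroring the legacy Halide buffer_t layout (assumption for this
// sketch; real code should use the definition from HalideRuntime.h).
struct buffer_t {
    uint64_t dev;
    uint8_t *host;
    int32_t extent[4];
    int32_t stride[4];
    int32_t min[4];
    int32_t elem_size;
    bool host_dirty;
    bool dev_dirty;
};

// Hypothetical helper: true when the buffer carries no data at all,
// which is how a bounds query shows up in an extern stage.
inline bool is_bounds_query(const buffer_t *b) {
    return b->host == nullptr && b->dev == 0;
}

// Hypothetical extern stage that needs the full width x height input.
extern "C" int sketch_extern(buffer_t *in, int width, int height, buffer_t *out) {
    if (is_bounds_query(in) || is_bounds_query(out)) {
        if (is_bounds_query(in)) {
            // Report the region of the input this stage requires.
            in->min[0] = 0;
            in->min[1] = 0;
            in->extent[0] = width;
            in->extent[1] = height;
        }
        return 0;  // Do no computation during a bounds query.
    }
    // The real work on in->host or in->dev would go here.
    return 0;
}
```

Halide then calls the extern stage again with real buffers once it has allocated them according to the reported bounds.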

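If you have tested that the host pointers are non-NULL for all buffers, then calling halide_copy_to_dev should work. Depending on where the buffer came from, you may need to explicitly set host_dirty to true beforehand to get the copy to happen. (I would hope Halide gets this right, and the bit is already set if the buffer came from a previous pipeline stage on the CPU. But if the buffer came from somewhere outside Halide, the dirty bits are probably false from initialization. It seems halide_dev_malloc should set dev_dirty if it allocates device memory, which it currently does not.)

I would expect the dev field to be populated after calling halide_copy_to_dev, since the first thing it does is call halide_dev_malloc. You can try calling halide_dev_malloc explicitly, setting host_dirty, and then calling halide_copy_to_dev.

Is the previous stage on the host or on the GPU? If it is on the GPU, I would expect the input buffer to be on the GPU as well.

This API needs work. I am in the middle of a first refactoring that will help, but ultimately it will require changing the buffer_t structure. It is possible to get most things to work, but it requires modifying the host_dirty and dev_dirty bits, as well as calling the halide_dev* APIs in just the right way. Thank you for your patience.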
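To make the role of the two dirty bits concrete, the following sketch models them under the usual reading: host_dirty means the host copy is newer than the device copy, and dev_dirty means the reverse. The struct is a stand-in for the legacy buffer_t, and CopyNeeded, copy_before_device_read and copy_before_host_read are hypothetical names, not part of the Halide runtime.

```cpp
#include <cassert>
#include <cstdint>

// Stand-in mirroring the legacy Halide buffer_t layout (assumption for this
// sketch; real code should use the definition from HalideRuntime.h).
struct buffer_t {
    uint64_t dev;
    uint8_t *host;
    int32_t extent[4];
    int32_t stride[4];
    int32_t min[4];
    int32_t elem_size;
    bool host_dirty;
    bool dev_dirty;
};

enum class CopyNeeded { None, HostToDevice, DeviceToHost };

// Hypothetical model: a device stage must first receive newer host data.
inline CopyNeeded copy_before_device_read(const buffer_t &b) {
    return b.host_dirty ? CopyNeeded::HostToDevice : CopyNeeded::None;
}

// Hypothetical model: host code must first fetch newer device data.
inline CopyNeeded copy_before_host_read(const buffer_t &b) {
    return b.dev_dirty ? CopyNeeded::DeviceToHost : CopyNeeded::None;
}
```

Under this model, a buffer filled on the CPU outside Halide with both bits left false looks "clean", so a copy-to-device request would have nothing to transfer, which is consistent with the advice to set host_dirty first.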

2014-10-08 22:25:52

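Thanks Zalman. If my buffer is NULL, I fill in the extent fields and exit the extern. But what do I need to do if I want the host to be NULL and dev to hold something? The previous stage of my pipeline (before Halide) is on the CPU, but I want my Halide part to run only on the GPU. One last thing: if I schedule with compute_root, use HL_TARGET=host-opencl and select gpu, will the code run on the GPU (without optimization) or on the CPU? – Darkjay 2014-10-09 02:06:57

I found a solution to my problem.

I am posting the code of the answer here. (Since I did a small offline test, the variable names do not match those in the question.)

Inside Halide (Halide_func.cpp)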

#include <Halide.h>
#include <cstdio>   // printf
#include <cstdlib>  // atoi
#include <vector>


using namespace Halide; 

using namespace Halide::BoundaryConditions; 

Func thirdPartyFunction(ImageParam f); 
Func fourthPartyFunction(ImageParam f); 
Var x, y; 

int main(int argc, char **argv) { 
    // Input: 
    ImageParam f(Float(32), 2, "f"); 

    printf(" Argument: %d\n",argc); 

    int test = atoi(argv[1]); 

    if (test == 1) { 
     Func f1; 
     f1(x, y) = f(x, y) + 1.0f; 
     f1.gpu_tile(x, 256); 
     std::vector<Argument> args(1); 
     args[ 0 ] = f; 
     f1.compile_to_file("halide_func", args); 

    } else if (test == 2) { 
     Func fOutput("fOutput"); 
     Func fBounded("fBounded"); 
     fBounded = repeat_image(f, 0, f.width(), 0, f.height()); 
     fOutput(x, y) = fBounded(x-1, y) + 1.0f; 


     fOutput.gpu_tile(x, 256); 
     std::vector<Argument> args(1); 
     args[ 0 ] = f; 
     fOutput.compile_to_file("halide_func", args); 

    } else if (test == 3) { 
     Func h("hOut"); 

     h = thirdPartyFunction(f); 

     h.gpu_tile(x, 256); 
     std::vector<Argument> args(1); 
     args[ 0 ] = f; 
     h.compile_to_file("halide_func", args); 

    } else { 
     Func h("hOut"); 

     h = fourthPartyFunction(f); 

     std::vector<Argument> args(1); 
     args[ 0 ] = f; 
     h.compile_to_file("halide_func", args); 
    } 
} 

Func thirdPartyFunction(ImageParam f) { 
    Func g("g"); 
    Func fBounded("fBounded"); 
    Func h("h"); 
    //Boundary 
    fBounded = repeat_image(f, 0, f.width(), 0, f.height()); 
    g(x, y) = fBounded(x-1, y) + 1.0f; 
    h(x, y) = g(x, y) - 1.0f; 

    // These must stay commented out if you want to use the GPU schedule.
    //g.compute_root(); //At least one stage schedule alone 
    //h.compute_root(); 

    return h; 
} 

Func fourthPartyFunction(ImageParam f) { 
    Func fBounded("fBounded"); 
    Func g("g"); 
    Func h("h"); 

    //Boundary 
    fBounded = repeat_image(f, 0, f.width(), 0, f.height()); 

    // Preprocess 
    g(x, y) = fBounded(x-1, y) + 1.0f; 

    g.compute_root(); 
    g.gpu_tile(x, y, 256, 1); 


    // Extern 
    std::vector <ExternFuncArgument> args = { g, f.width(), f.height() }; 
    h.define_extern("extern_func", args, Int(16), 3); 

    h.compute_root(); 
    return h; 
}

Extern function (external_func.h):

#include <cstdint> 
#include <cstdio> 
#include <cstdlib> 
#include <cassert> 
#include <cinttypes> 
#include <cstring> 
#include <fstream> 
#include <map> 
#include <vector> 
#include <complex> 
#include <chrono> 
#include <iostream> 


#include <clFFT.h> // Pulls in all the OpenCL headers I need.

using namespace std; 
// Useful stuff. 
void completeDetails2D(buffer_t buffer) { 
    // Read all elements: 
    std::cout << "Buffer information:" << std::endl; 
    std::cout << "Extent: " << buffer.extent[0] << ", " << buffer.extent[1] << std::endl; 
    std::cout << "Stride: " << buffer.stride[0] << ", " << buffer.stride[1] << std::endl; 
    std::cout << "Min: " << buffer.min[0] << ", " << buffer.min[1] << std::endl; 
    std::cout << "Elem size: " << buffer.elem_size << std::endl; 
    std::cout << "Host dirty: " << buffer.host_dirty << ", Dev dirty: " << buffer.dev_dirty << std::endl; 
    printf("Host pointer: %p, Dev pointer: %" PRIu64 "\n\n\n", buffer.host, buffer.dev); 
} 

extern cl_context _ZN6Halide7Runtime8Internal11weak_cl_ctxE; 
extern cl_command_queue _ZN6Halide7Runtime8Internal9weak_cl_qE; 


extern "C" int extern_func(buffer_t * in, int width, int height, buffer_t * out) 
{ 
    printf("In extern\n"); 
    completeDetails2D(*in); 
    printf("Out extern\n"); 
    completeDetails2D(*out); 

    if(in->dev == 0) { 
     // Boundary stuff 
     in->min[0] = 0; 
     in->min[1] = 0; 
     in->extent[0] = width; 
     in->extent[1] = height; 
     return 0; 
    } 

    // Super awesome stuff on GPU 
    // ... 

    cl_context & ctx = _ZN6Halide7Runtime8Internal11weak_cl_ctxE; // Found by zougloub 
    cl_command_queue & queue = _ZN6Halide7Runtime8Internal9weak_cl_qE; // Same 

    printf("ctx: %p\n", ctx); 

    printf("queue: %p\n", queue); 

    cl_mem buffer_in; 
    buffer_in = (cl_mem) in->dev; 
    cl_mem buffer_out; 
    buffer_out = (cl_mem) out->dev; 

    // Just copying data from one buffer to another 
    int err = clEnqueueCopyBuffer(queue, buffer_in, buffer_out, 0, 0, 256*256*4, 0, NULL, NULL); 

    printf("copy: %d\n", err); 

    err = clFinish(queue); 

    printf("finish: %d\n\n", err); 

    return 0; 
}
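The copy above hard-codes the size as 256*256*4 bytes. As a hedged alternative, the size could be derived from the buffer_t fields. The sketch below uses a stand-in struct for the legacy buffer_t and a hypothetical helper named buffer_size_bytes; it assumes non-negative strides and treats dimensions with extent 0 as unused.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Stand-in mirroring the legacy Halide buffer_t layout (assumption for this
// sketch; real code should use the definition from HalideRuntime.h).
struct buffer_t {
    uint64_t dev;
    uint8_t *host;
    int32_t extent[4];
    int32_t stride[4];
    int32_t min[4];
    int32_t elem_size;
    bool host_dirty;
    bool dev_dirty;
};

// Hypothetical helper: bytes spanned by the buffer, taken as the largest
// extent * stride over the used dimensions, times the element size.
inline size_t buffer_size_bytes(const buffer_t &b) {
    size_t elems = 0;
    for (int i = 0; i < 4; ++i) {
        if (b.extent[i] > 0) {
            size_t span = static_cast<size_t>(b.extent[i]) *
                          static_cast<size_t>(b.stride[i]);
            if (span > elems) elems = span;
        }
    }
    return elems * static_cast<size_t>(b.elem_size);
}
```

For a dense 256 x 256 float buffer with strides {1, 256}, this gives the same 256*256*4 bytes as the hard-coded value, and it could be passed as the size argument of clEnqueueCopyBuffer after checking that both buffers are at least that large.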

Finally, the non-Halide part (Halide_test.cpp):

#include <halide_func.h> 
#include <iostream> 
#include <cinttypes> 

#include <external_func.h> 

// Extern function available inside the .o generated. 
#include "HalideRuntime.h" 

int main(int argc, char **argv) { 

    // Init the kernel in GPU 
    halide_set_ocl_device_type("gpu"); 

    // Create a buffer 
    int width = 256; 
    int height = 256; 
    float * bufferHostIn = (float*) malloc(sizeof(float) * width * height); 
    float * bufferHostOut = (float*) malloc(sizeof(float) * width * height); 

    for(int j = 0; j < height; ++j) { 
     for(int i = 0; i < width; ++i) { 
      bufferHostIn[i + j * width] = i+j; 
     } 
    } 

    buffer_t bufferHalideIn = {0, (uint8_t *) bufferHostIn, {width, height}, {1, width, width * height}, {0, 0}, sizeof(float), true, false}; 
    buffer_t bufferHalideOut = {0, (uint8_t *) bufferHostOut, {width, height}, {1, width, width * height}, {0, 0}, sizeof(float), true, false}; 

    printf("IN\n"); 
    completeDetails2D(bufferHalideIn); 
    printf("Data (host): "); 
    for(int i = 0; i < 10; ++ i) { 
     printf(" %f, ", bufferHostIn[i]); 
    } 
    printf("\n"); 

    printf("OUT\n"); 
    completeDetails2D(bufferHalideOut); 

    // Send to GPU 
    halide_copy_to_dev(NULL, &bufferHalideIn); 
    halide_copy_to_dev(NULL, &bufferHalideOut); 
    bufferHalideIn.host_dirty = false; 
    bufferHalideIn.dev_dirty = true; 
    bufferHalideOut.host_dirty = false; 
    bufferHalideOut.dev_dirty = true; 
    // Trick Halide into using the device copy.
    bufferHalideIn.host = NULL; 
    bufferHalideOut.host = NULL; 

    printf("IN After device\n"); 
    completeDetails2D(bufferHalideIn); 

    // Halide function 
    halide_func(&bufferHalideIn, &bufferHalideOut); 

    // Get back to HOST 
    bufferHalideIn.host = (uint8_t*)bufferHostIn; 
    bufferHalideOut.host = (uint8_t*)bufferHostOut; 
    halide_copy_to_host(NULL, &bufferHalideOut); 
    halide_copy_to_host(NULL, &bufferHalideIn); 

    // Validation 
    printf("\nOUT\n"); 
    completeDetails2D(bufferHalideOut); 
    printf("Data (host): "); 
    for(int i = 0; i < 10; ++ i) { 
     printf(" %f, ", bufferHostOut[i]); 
    } 
    printf("\n"); 

    // Free all 
    free(bufferHostIn); 
    free(bufferHostOut); 

}

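You can compile halide_func with test 4 to exercise all of the extern functionality.

Here are some of my conclusions. (Thanks to Zalman and zougloub.)

If you use compute_root alone, the device is not used.
We need gpu() or gpu_tile() in the schedule to invoke the GPU routines. (By the way, you need to put all of your variables in it.)
A gpu_tile size larger than your data will crash.
BoundaryConditions work well on the GPU.
Before calling the extern function, the Func that goes in as input must be scheduled with: f.compute_root(); f.gpu_tile(x, y, ..., ...); The compute_root of an intermediate stage is not implicit.
If the dev address is 0, that is normal: we send back the dimensions, and the extern will be called again.
The last stage is implicitly compute_root().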

2014-10-16 21:06:02 Darkjay
