CUDA and parallel search

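I want to do the following: suppose I have a sorted vector of numbers of size N (N fairly large) and a number x. I want to search in parallel for the correct position of x in this vector. For example:

myVector = [1, 2, 3, ..., 10000] and x = 3.2,

then I have to return 3. The first thread to find the correct position should interrupt the work of the other threads, so the time spent would be minimized: t = min(t_1, t_2, ..., t_numberOfThreads). Do you think finding the correct position could be faster with multiple threads? What about communication between the threads? Once a thread has read a value that does not match the search, the other threads must skip that value during their own search (perhaps via a boolean flag that has to be changed).

Do you have any suggestions to share about this algorithm?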

2013-03-07 ALFRAM

Unless the sorted vector already resides in device memory, using CUDA for this makes no sense. Binary search on the CPU has complexity log2 n. – RoBiK 2013-03-07 10:16:16

You may be interested in [thrust::lower_bound](http://thrust.github.com/doc/group__binary__search.html) or [thrust::partition_point](http://thrust.github.com/doc/group__searching.html#ga1b61bfe7c810941e02b723e050c805ba). If you are not familiar with Thrust, there is a [Quick Start Guide](https://github.com/thrust/thrust/wiki/Quick-Start-Guide). – 2013-03-07 16:10:32
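To make the Thrust suggestion concrete, here is a minimal sketch of how `thrust::lower_bound` could return the insertion position. The helper name `insert_position` is hypothetical, and the copy from a host `std::vector` is only there to keep the sketch self-contained; in a real use case the data would presumably already live on the device.

```cuda
#include <thrust/binary_search.h>
#include <thrust/device_vector.h>
#include <vector>

// Hypothetical helper (not from the thread): returns how many elements of the
// sorted input are smaller than x, i.e. the index at which x would be inserted.
int insert_position(const std::vector<float> &sorted_host, float x)
{
    // Assumption: the data starts on the host, so it is copied to the device here.
    thrust::device_vector<float> d(sorted_host.begin(), sorted_host.end());

    // thrust::lower_bound returns an iterator to the first element >= x.
    return static_cast<int>(thrust::lower_bound(d.begin(), d.end(), x) - d.begin());
}
```

For the question's example (values 1, 2, ..., 10000 and x = 3.2) the first element that is not smaller than x is 4, at index 3, so the call returns 3.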

Some time ago I wrote the following code, which does something similar:

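#include "cuda_runtime.h"
#include "device_launch_parameters.h"

#include <stdio.h>
#include <stdlib.h>

// Each thread compares one element of y with x. A matching thread writes its
// index to g_found[0]; a non-matching thread writes 0 to g_found[i].
__global__ void fast_finder(unsigned int *g_found, float x, float *y)
{
    unsigned int i = blockIdx.x * blockDim.x + threadIdx.x;
    unsigned int pos = (unsigned int)(x == y[i]);
    // A non-matching thread 0 would also write to g_found[0] and race with the
    // matching thread, so it skips the write; the host zeroes g_found instead.
    if (i != 0 || pos) g_found[i * (1 - pos)] = i * pos;
}

int main(int argc, char *argv[])
{
    int N = 65536;
    unsigned int h_found, *d_found;
    float *h_y = (float *)malloc(N * sizeof(float)), *d_y, x = 5.0f;
    int nThreads = 1024, nBlocks = N / nThreads;  // N must be a multiple of nThreads

    for (int i = 0; i < N; ++i) h_y[i] = (float)(N - i - 1);  // unique values, descending

    // Index 0 doubles as the "not found" marker, so a match there is handled on the host.
    if (x != h_y[0]) {
        cudaSetDevice(0);
        cudaMalloc((void **)&d_found, N * sizeof(unsigned int));
        cudaMalloc((void **)&d_y, N * sizeof(float));
        cudaMemset(d_found, 0, N * sizeof(unsigned int));
        cudaMemcpy(d_y, h_y, N * sizeof(float), cudaMemcpyHostToDevice);

        fast_finder<<<nBlocks, nThreads>>>(d_found, x, d_y);
        cudaDeviceSynchronize();  // replaces the deprecated cudaThreadSynchronize()

        cudaMemcpy(&h_found, d_found, sizeof(unsigned int), cudaMemcpyDeviceToHost);
        if (h_found) printf("%g found on %u. position!\n", x, h_found);
        else printf("%g not found!\n", x);

        cudaFree(d_y);
        cudaFree(d_found);

    } else printf("%g found on the first position!\n", x);

    free(h_y);

    getchar();
    return EXIT_SUCCESS;
}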

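Here each thread checks whether the value of y at its global thread index equals x. If it does, the thread writes its index into the first position of the g_found array; otherwise it writes 0 into the g_found position given by its own index. The one exception is thread 0: its zero would land in g_found[0] and could overwrite the match, so that write has to be suppressed. For a y of length 16 that contains the value 5 at the 11th position (index 10), the output is as follows: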

g_found = { 10, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0 }

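In this case y does not need to be sorted, but it must contain only unique values. The code can easily be changed into a version that finds the index at which the given x would be inserted (device part only), as follows: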

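// n (the length of y) is an added parameter so that y[i + 1] is never read out of bounds.
__global__ void fast_finder(unsigned int *g_found, float x, float *y, unsigned int n)
{
    unsigned int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i + 1 >= n) return;  // the last element has no right-hand neighbour
    // y sorted ascending with unique values: exactly one thread can see
    // y[i] <= x < y[i + 1], and x would then be inserted at index i + 1.
    unsigned int pos = (unsigned int)(x >= y[i] && x < y[i + 1]);
    // As before, a non-matching thread 0 must not overwrite g_found[0].
    if (i != 0 || pos) g_found[i * (1 - pos)] = (i + 1) * pos;
}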

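The output will be similar to the one above. When g_found at position 0 is 0, the value x is not present in the array y. Whether the first element of y equals x is checked by the host code before the kernel is even invoked. It should not be a problem to change this part as well, so that it applies the conditions you want.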

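As you can see, in such a solution all threads work together, and there is no need to terminate execution once x is found. It would also be a good idea to apply packet search, meaning that one thread looks through a small part of y, which allows y to be much larger.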
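The packet-search idea could be sketched roughly as follows. This is not the answerer's code: the names `packet_finder` and `packet_search`, the `chunk` parameter, and the use of -1 (rather than 0) as the "not found" marker are all assumptions made for illustration, and error checking is omitted for brevity.

```cuda
#include <cuda_runtime.h>
#include <vector>

// Each thread scans its own packet of `chunk` consecutive elements of y.
// Assumes y holds unique values and *g_found was set to -1 before the launch.
__global__ void packet_finder(int *g_found, float x, const float *y, int n, int chunk)
{
    int first = (blockIdx.x * blockDim.x + threadIdx.x) * chunk;
    int last = min(first + chunk, n);   // the final packet may be shorter
    for (int i = first; i < last; ++i) {
        if (y[i] == x) *g_found = i;    // unique values: at most one thread writes
    }
}

// Hypothetical host wrapper: returns the index of x in y, or -1 if x is absent.
int packet_search(const std::vector<float> &y, float x, int chunk = 16)
{
    const int n = static_cast<int>(y.size());
    if (n == 0) return -1;

    int h_found = -1, *d_found = nullptr;
    float *d_y = nullptr;
    cudaMalloc(&d_found, sizeof(int));
    cudaMalloc(&d_y, n * sizeof(float));
    cudaMemcpy(d_found, &h_found, sizeof(int), cudaMemcpyHostToDevice);
    cudaMemcpy(d_y, y.data(), n * sizeof(float), cudaMemcpyHostToDevice);

    const int threads = 256;
    const int packets = (n + chunk - 1) / chunk;            // one thread per packet
    const int blocks = (packets + threads - 1) / threads;
    packet_finder<<<blocks, threads>>>(d_found, x, d_y, n, chunk);

    // This blocking copy also waits for the kernel to finish.
    cudaMemcpy(&h_found, d_found, sizeof(int), cudaMemcpyDeviceToHost);
    cudaFree(d_y);
    cudaFree(d_found);
    return h_found;
}
```

Because each thread handles `chunk` elements, the grid needs only about n / chunk threads. Giving each thread a contiguous packet keeps the sketch close to the description; a grid-stride loop, where neighbouring threads read neighbouring elements, usually coalesces memory accesses better.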

2013-03-07 10:26:46 Matso

Thank you for your help. – ALFRAM 2013-03-08 14:04:30

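No communication between threads or blocks is needed. Each thread can check whether the value at its current index is greater than or equal to the value being searched for. If so, it returns; most threads will not survive this check.

What remains are the threads whose value is smaller than the searched value. Each of them checks whether the next value is greater than or equal to the query and returns the corresponding index.

This is an untested kernel I wrote at 5 AM.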

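template<typename ty>
__global__ static void search(int *out, ty *list, ty val, int n)
{
    int start = threadIdx.x + blockIdx.x * blockDim.x;
    for (int idx = start; idx < n; idx += gridDim.x * blockDim.x) {
        // list is sorted, so every later element this thread visits is >= val as well.
        if (list[idx] >= val) return;
        // list[idx] < val: the last element has no successor to read.
        if (idx == n - 1) { *out = idx; return; }
        ty next = list[idx + 1];
        if (next >= val) {
            *out = next == val ? (idx + 1) : idx;
            return;
        }
    }
    // Note: *out is never written when val <= list[0]; initialise it on the host.
}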

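That said, you really don't want to do this. On the CPU you are looking at O(log n) in the worst case, which means that searching a billion elements takes roughly 30 steps (log2(10^9) ≈ 30). Unless the data is already on the GPU and you want to avoid a memory copy, this is better done on the CPU.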
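For comparison, the CPU-side binary search could look like the following minimal sketch. It is plain host-side C++ (so it also compiles unchanged inside a .cu file); the function name `lower_bound_index` and the optional `steps` counter are hypothetical additions that only serve to make the O(log n) behaviour visible.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical helper: index of the first element of the sorted vector v that
// is >= x, i.e. the insertion position of x. If steps is non-null, it receives
// the number of loop iterations that were needed.
std::size_t lower_bound_index(const std::vector<float> &v, float x, int *steps = nullptr)
{
    std::size_t lo = 0, hi = v.size();   // the answer always lies in [lo, hi]
    int count = 0;
    while (lo < hi) {
        std::size_t mid = lo + (hi - lo) / 2;
        if (v[mid] < x) lo = mid + 1;    // everything up to mid is too small
        else hi = mid;                   // v[mid] >= x: the answer is mid or earlier
        ++count;
    }
    if (steps) *steps = count;
    return lo;
}
```

Each iteration at least halves the remaining range, so the 10000-element example from the question needs at most 14 iterations, and the standard library's `std::lower_bound` gives the same position without hand-written code.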

2013-03-07 10:25:05
