Matlab histc with vector bin

如果我有两个矩阵A和B，大小为[mxn]和[pxn]，我想要找出B中每行出现在A中的次数，例如：Matlab histc with vector bin

>> A = rand(5,3) 

A = 

    0.1419 0.6557 0.7577 
    0.4218 0.0357 0.7431 
    0.9157 0.8491 0.3922 
    0.7922 0.9340 0.6555 
    0.9595 0.6787 0.1712 

>> B = [A(2,:); A(1,:); A(2,:); A(3,:); A(3,:); A(4,:); A(5,:)] 

B = 

    0.4218 0.0357 0.7431 
    0.1419 0.6557 0.7577 
    0.4218 0.0357 0.7431 
    0.9157 0.8491 0.3922 
    0.9157 0.8491 0.3922 
    0.7922 0.9340 0.6555 
    0.9595 0.6787 0.1712

在这种情况下，作为答案

ans = 

    1  2  2  1  1

虽然不像这个例子中，在广义M >> p

如果A和B是向量MATLAB的histc会做的工作，但有似乎并不等同alent如果垃圾箱是矢量。

目前我做的：

for i=1:length(B) 
    indices(i) = find(abs(A/B(i,:)-1) < 1e-15); 
    % division requires a tolerance due to numerical issues 
end 
histc(indices, 1:size(A,1)) 

ans = 

    1  2  2  1  1

但因为我有很多这样的矩阵B，A和B都大，这是可怕的慢。任何想法如何改善呢？

编辑：在方法

展望到目前为止，我有以下数据：

A     7871139x3    188907336 double      
B      902x3     21648 double

为了使事情更快，我只是要使用第10排B的

B = B(1:10,:);

请注意，对于完整的应用程序，我（当前）有> 10^4的这种矩阵（这最终将> 10^6 ....）

我的第一种方法：

tic, C = get_vector_index(A,B); toc 
Elapsed time is 36.630107 seconds.

bdecaf的方法

>> tic, C1 = get_vector_index(A,B); toc 
Elapsed time is 28.957243 seconds. 
>> isequal(C, C1) 

ans = 

    1

OLI的pdist2方法（可以通过去除if语句和使用L1距离，而不是L2距离被减小到〜25秒），

>> tic, C2 = get_vector_index(A,B); toc 
Elapsed time is 7.244965 seconds. 

>> isequal(C,C2) 

ans = 

    1

oli的标准化方法

>> tic, C3 = get_vector_index(A,B); toc 
Elapsed time is 3.756682 seconds. 

>> isequal(C,C3) 

ans = 

    1

最后，我想出了另一种方法，我搜索第一列，然后搜索第一列的命中内的第二列，递归直到列耗尽。这是迄今为止最快的....

N = size(A,2); 
loc = zeros(size(B,1),1); 
for i=1:size(B,1) 
    idx{1} = find(A(:,1)==B(i,1)); 
    for k=2:N, 
     idx{k} = idx{k-1}(find(A(idx{k-1},k)==B(i,k))); 
    end 
    loc(i) = idx{end}; 
end 
C = histc(loc, 1:size(A,1));

导致：

>> tic, C4 = get_vector_index(A,B); toc 
Elapsed time is 1.314370 seconds. 

>> isequal(C, C4) 

ans = 

    1

另外请注意，使用intersect慢得多：

>> tic, [~,IA] = intersect(A,B,'rows'); C5 = histc(IA,1:size(A,1)); toc 
Elapsed time is 44.392593 seconds. 

>> isequal(C,C5) 

ans = 

    1

来源

2011-11-30 tdc

顺便说一下，我认识到我当前方法中的矩阵分解也是可能的 - 它可能（几乎）可能得到假的误报，但它会产生一个错误，因为每个B的向量只能有一个匹配在所有的情况下，我有。 – tdc

只是一个技巧 - 只要B中的所有行都在A中有一行，它就会起作用。另外，如果A中有两个几乎相同的行，则会产生问题。行为也与histc不同。 'histc'分类到*最近* bin中 - 当它刚好在垃圾箱中时，你做出了它。 – bdecaf

也要小心使用长度 - 它将始终返回最大尺寸。所以如果'p bdecaf

我居然把这个解决方案作为这个问题的编辑，而是接受一个答案的缘故，我会把解决方案在这里也：

N = size(A,2); 
loc = zeros(size(B,1),1); 
for i=1:size(B,1) 
    idx{1} = find(A(:,1)==B(i,1)); 
    for k=2:N, 
     idx{k} = idx{k-1}(find(A(idx{k-1},k)==B(i,k))); 
    end 
    loc(i) = idx{end}; 
end 
C = histc(loc, 1:size(A,1));

导致：

>> tic, C4 = get_vector_index(A,B); toc 
Elapsed time is 1.314370 seconds. 

>> isequal(C, C4) 

ans = 

    1

来源

2011-12-15 17:47:03 tdc

也许你可以规范他们，这样你检查他们的点积为1

A = rand(5,3); 
B = [A(2,:); A(1,:); A(2,:); A(3,:); A(3,:); A(4,:); A(5,:)]; 
A2=bsxfun(@times,A,1./sqrt(sum(A.^2,2))); %%% normalize A 
B2=bsxfun(@times,B,1./sqrt(sum(B.^2,2))) %%% normalize B 
sum(A2*B2'>1-10e-9,2) %%% check that the dotproduct is close to 1 

ans = 

    1 
    2 
    2 
    1 
    1

如果你需要的东西更快，但近似的，我推荐你使用FLANN库，它是用于快速近似近邻：

http://www.cs.ubc.ca/~mariusm/index.php/FLANN/FLANN

来源

2011-11-30 12:14:34 Oli

我很喜欢这种方法，但矩阵乘法不适合内存（例如A是7871139x3和B是903x3） – tdc

我也看过FLANN，但我不认为我可以安装它，因为我没有sudoers帐户 - 从源代码编译需要cmake（这是没有安装，并安装，只是炸弹为我，大概是因为我不是一个sudoer），目前他们的预建库只是Ubuntu/Debian（服务器是红帽） – tdc

没有sudo权限就可以安装它。 – Oli

我会解决这个问题是这样的：

indices = zeros(size(A,1),1); 
for i=1:size(B,1) 
    distances = sum((repmat(B(i,:),size(A,1),1)-A).^2 ,2); 
    [md,im]=min(distances); 
    if md < 1e-9 
     indices(im) = indices(im)+1; 
    end 
end

如果您删除它，只是将它排入最近的垃圾箱。

来源

2011-11-30 12:27:19 bdecaf

稍微比@oli的方法慢 – tdc

确实是一个优雅。 – bdecaf

其实，这样做的一个简单的方法是：

sum(10e-9>pdist2(A',B'),2)

它计算所有成对距离和门限和计数。

来源

2011-11-30 12:31:29 Oli

这对我的方法有点改进（快5倍）... – tdc

Matlab histc with vector bin

回答

相关问题