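Counting the occurrences of elements in a numpy ndarray

How can I count the number of occurrences of each data point in an ndarray?

What I want to do is run a OneHotEncoder on all values that occur at least N times in my ndarray.

I also want to replace all values that occur fewer than N times with another element that does not appear in the array (let's call it new_value).

So, for example, I have: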

import numpy as np

# dtype=object is needed because the inner lists have different lengths
a = np.array([[[2], [2,3], [3,34]],
       [[3], [4,5], [3,34]],
       [[3], [2,3], [3,4]]], dtype=object)

With a threshold of N = 2, I would like something like this (pseudocode):

b = [OneHotEncoder(a[:,[i]])[0] if count(a[:,[i]]) >= N
     else OneHotEncoder(new_value) for i in range(a.shape[1])]

To make the replacement clear: leaving the OneHotEncoder aside and using new_value = 10, my array should look like:

a = np.array([[[10], [2,3], [3,34]],
       [[3], [10], [3,34]],
       [[3], [2,3], [10]]], dtype=object)

Asked 2013-07-24 by user2616532

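Do you really need an array of lists? That cripples numpy pretty badly: many operations that would normally be handled by fast C function calls (equality comparisons, for example) now have to be relayed to expensive Python calls. @Ophion's code solves your problem as stated, but you should seriously consider whether a different approach (a float array with np.nan, or an int array with e.g. -1 marking missing values), one that lets you take full advantage of numpy's capabilities, would be a better choice. – Jaime

The structure comes from considering various bigram/trigram combinations. If I have the entry [3,2,1], I want to consider the unigrams [3], [2], [1], but also the bigrams [3,2] and [2,1], so the entry becomes [[3], [2], [1], [3,2], [2,1]]. I didn't write the code and I don't want to modify it because it is very complicated; I just want to see whether performance (in terms of correct predictions) improves when rare events are filtered out and all put into the same category. But you are probably right, and I should speed things up, since I am waiting anyway. – user2616532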
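
A minimal sketch of the padded-integer layout suggested above, assuming every n-gram has at most two items and that -1 never occurs as a real value. Both are assumptions, and the names `PAD`, `ngrams`, and `rows` are made up for illustration. Each n-gram becomes one fixed-width row, so `np.unique` with `axis=0` can count whole n-grams without falling back to Python-level comparisons:

```python
import numpy as np

PAD = -1  # assumed sentinel for "no value"; must not occur in the data

# Hypothetical flat list of unigrams and bigrams, taken from the question.
ngrams = [[2], [2, 3], [3, 34], [3], [4, 5], [3, 34], [3], [2, 3], [3, 4]]

# Pad every n-gram to width 2 so the result is a regular integer array.
rows = np.full((len(ngrams), 2), PAD, dtype=np.int64)
for i, gram in enumerate(ngrams):
    rows[i, :len(gram)] = gram

# Count identical rows; inverse maps each row to its unique row.
uniq, inverse, counts = np.unique(
    rows, axis=0, return_inverse=True, return_counts=True)
inverse = inverse.ravel()  # the shape of inverse differs across NumPy versions

# True for rows whose n-gram occurs fewer than 2 times.
rare = counts[inverse] < 2
```

Here `rare` flags the n-grams [2], [4, 5] and [3, 4], which are the three cells the question wants replaced.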

What about something like this?

First, count how many times each unique element appears in an array:

>>> a=np.random.randint(0,5,(3,3)) 
>>> a 
array([[0, 1, 4], 
     [0, 2, 4], 
     [2, 4, 0]]) 
>>> ua,uind=np.unique(a,return_inverse=True) 
>>> count=np.bincount(uind.ravel())
>>> ua 
array([0, 1, 2, 4]) 
>>> count 
array([3, 1, 2, 3])

The ua and count arrays show that 0 appears 3 times, 1 appears once, and so on.
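
On NumPy 1.9 and later, `np.unique` can also return the counts directly through `return_counts=True`, which avoids the separate `np.bincount` call. A minimal sketch, using the same values as the array above:

```python
import numpy as np

a = np.array([[0, 1, 4],
              [0, 2, 4],
              [2, 4, 0]])

# ua: sorted unique values; count: how often each one appears.
ua, count = np.unique(a, return_counts=True)

print(ua)     # [0 1 2 4]
print(count)  # [3 1 2 3]
```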

import numpy as np

def mask_fewest(arr, thresh, replace):
    # ua holds the unique elements; uind maps each element of arr to its index in ua.
    ua, uind = np.unique(arr, return_inverse=True)
    uind = uind.ravel()  # newer NumPy versions return uind with the shape of arr
    # count[i] is the number of times ua[i] appears.
    count = np.bincount(uind)

    # @Jaime's suggestion to make rep_mask faster (np.isin is the current name for np.in1d).
    # Mask the elements that do not appear at least `thresh` times.
    rep_mask = np.isin(uind, np.flatnonzero(count < thresh))

    # Replace the masked elements in place.
    arr.flat[rep_mask] = replace

    return arr


>>> a=np.random.randint(2,8,(4,4))
>>> print(a)
[[6 7 7 3]
 [7 5 4 3]
 [3 5 2 3]
 [3 3 7 7]]


>>> print(mask_fewest(a,2,10))
[[10  7  7  3]
 [ 7  5 10  3]
 [ 3  5 10  3]
 [ 3  3  7  7]]

For your example above (let me know whether you intended a 2D array or a 3D array):

>>> a 
[[[2] [2, 3] [3, 34]] 
[[3] [4, 5] [3, 34]] 
[[3] [2, 3] [3, 4]]] 


>>> mask_fewest(a,2,10) 
[[10 [2, 3] [3, 34]] 
[[3] 10 [3, 34]] 
[[3] [2, 3] 10]]
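
If the cells really are Python lists, one alternative that does not depend on `np.unique` sorting an object array is to count hashable tuples with `collections.Counter`. A minimal sketch: the function name `mask_rare_lists` is hypothetical, and the array is filled cell by cell so that it is guaranteed to be a (3, 3) object array holding one list per cell:

```python
from collections import Counter
import numpy as np

def mask_rare_lists(arr, thresh, replace):
    """Replace list-valued cells that occur fewer than `thresh` times."""
    # Lists are not hashable, so count them as tuples.
    counts = Counter(tuple(cell) for cell in arr.flat)
    out = arr.copy()  # shallow copy; the input array is left untouched
    for idx in np.ndindex(arr.shape):
        if counts[tuple(arr[idx])] < thresh:
            out[idx] = replace
    return out

data = [[[2], [2, 3], [3, 34]],
        [[3], [4, 5], [3, 34]],
        [[3], [2, 3], [3, 4]]]

# Fill an empty object array so each cell holds exactly one list.
a = np.empty((3, 3), dtype=object)
for i, row in enumerate(data):
    for j, cell in enumerate(row):
        a[i, j] = cell

b = mask_rare_lists(a, 2, [10])
```

This loops in Python, so it will be slow on very large inputs; it is meant only to show the counting logic for list-valued cells.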

Answered 2013-07-24 23:51:11 by Daniel

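Thanks a lot, but when I wrote [3,4] I meant an array with two elements, and yes, my dataset will be very large. – user2616532

+1. If I had any money, I would bet that we will soon get an 'np.count_unique' function that calls 'np.bincount' on the indices returned by 'np.unique' with 'return_inverse=True'; it is a construct I find myself typing over and over. As a potential improvement, I am a little bothered by the 2D array you are building and then collapsing to compute the mask: that kind of trick is usually very costly. I find the following to be much faster for large datasets, although much slower for really small ones: 'rep_mask = np.in1d(a, ua[count – Jaime

@Jaime: Thanks for the comment, I had forgotten about 'np.in1d'. I kept looking at 'np.intersect1d' and knew I was missing something. As a side note, I think this will be hard to modify to actually answer the OP's question, since he needs an 'object array'. Should it be deleted? – Daniel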
