Selecting the categorical values with the highest counts when grouping

I have the following table:

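What I need is to aggregate by customer ID so that I get the most frequent category (cat), the second most frequent, and the third most frequent. The output for the table above should be: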

   most freq   2nd most freq   3rd most freq
1  B           A               C
2  A           C               Null
3  B           C               Null
4  C           A               Null

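When there are ties in the counts, I don't really care which comes first and which comes second. For example, for customer 1 the 2nd most frequent and 3rd most frequent categories could be swapped, because each of them occurs only once.

Any SQL would be fine, preferably Hive SQL.

Thanks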

Source

2017-10-16 criticalth

Try using group by twice, with dense_rank() ordering by the count of each cat. I'm not 100% sure, but I think it should work in Hive as well.

select custId, 
    max(case when t.rn = 1 then cat end) as [most freq], 
    max(case when t.rn = 2 then cat end) as [2nd most freq], 
    max(case when t.rn = 3 then cat end) as [3rd most freq] 
from 
(
    select custId, cat, dense_rank() over (partition by custId order by count(*) desc) rn 
    from your_table 
    group by custId, cat 
) t 
group by custId

demo

Following the comments, here is a slightly modified version of the solution for Hive SQL:

select custId, 
    max(case when t.rn = 1 then cat else null end) as most_freq, 
    max(case when t.rn = 2 then cat else null end) as 2nd_most_freq, 
    max(case when t.rn = 3 then cat else null end) as 3rd_most_freq 
from 
(
    -- rank each customer's categories by their count, highest first
    select custId, cat, dense_rank() over (partition by custId order by ct desc) rn 
    from (
        -- one row per (customer, category) with its count
        select custId, cat, count(*) ct 
        from your_table 
        group by custId, cat 
    ) your_table_with_counts 
) t 
group by custId

Hive SQL demo
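
A note on ties: with dense_rank(), tied categories share the same rank, so max() keeps only one of them and the next slot stays empty. If tied categories should instead fill consecutive slots, as in the expected output in the question, a variant with row_number() may be closer to what is wanted. The following is only a sketch. The sample rows, the CTE names and the column aliases second_most_freq and third_most_freq are my own assumptions, not part of the original post, and it assumes a Hive version that accepts CTEs, SELECT without FROM, and top-level UNION ALL.

```sql
-- Hypothetical sample rows (not from the original post), chosen so that
-- customer 1 has a tie between A and C.
with your_table as (
    select 1 as custId, 'B' as cat union all
    select 1, 'B' union all
    select 1, 'A' union all
    select 1, 'C' union all
    select 2, 'A' union all
    select 2, 'A' union all
    select 2, 'C'
),
counts as (
    -- one row per (customer, category) with its frequency
    select custId, cat, count(*) as ct
    from your_table
    group by custId, cat
),
ranked as (
    -- row_number() gives tied categories distinct, consecutive positions;
    -- the secondary sort key cat only makes the tie-break deterministic
    select custId, cat,
           row_number() over (partition by custId order by ct desc, cat) as rn
    from counts
)
select custId,
       max(case when rn = 1 then cat end) as most_freq,
       max(case when rn = 2 then cat end) as second_most_freq,
       max(case when rn = 3 then cat end) as third_most_freq
from ranked
group by custId
order by custId;
```

With these sample rows, customer 1 should come out as B, A, C and customer 2 as A, C, NULL. Swapping row_number() back to dense_rank() would give customer 1 the row B, C, NULL, because A and C would both have rank 2.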

Source

2017-10-16 13:30:15

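Use 'dense_rank' instead of 'row_number', so that ties, if they exist, do not show up as the 2nd and 3rd most frequent values.

@VamsiPrabhala Yes, thanks

Also remove the '[]' around the column aliases, since they are not supported in Hive.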
