2017-03-06 58 views

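I have a Hive table containing hourly data. I want to get the count of hours and, for each column, an array holding the values for all hours. Input table: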

+------+------+------+
| hour | col1 | col2 |
+------+------+------+
| 00   | 0.0  | a    |
| 04   | 0.1  | b    |
| 08   | 0.2  | c    |
| 12   | 0.0  | d    |
+------+------+------+

Following the solution suggested below, I used these functions to get arrays of the column values above:

select count(hour)
      ,map_values(str_to_map(concat_ws(',',collect_set(concat_ws(':',reflect('java.util.UUID','randomUUID'),cast(col1 as string)))))) as col1_arr
      ,map_values(str_to_map(concat_ws(',',collect_set(concat_ws(':',reflect('java.util.UUID','randomUUID'),cast(col2 as string)))))) as col2_arr
from table;

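In the output I get, the values in col2_arr are not in the same order as those in col1_arr. Please suggest how to get the values of different columns into arrays/lists in the same order.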

+-----------+-----------------+----------+
| count(hr) | col1_arr        | col2_arr |
+-----------+-----------------+----------+
| 4         | 0.0,0.1,0.2,0.0 | b,a,c,d  |
+-----------+-----------------+----------+

Required output:

+-----------+-----------------+----------+
| count(hr) | col1_arr        | col2_arr |
+-----------+-----------------+----------+
| 4         | 0.0,0.1,0.2,0.0 | a,b,c,d  |
+-----------+-----------------+----------+

Thanks

Answers

with t as 
     ( 
      select inline 
        (
         array 
         (
          struct('00',0.0) 
          ,struct('04',0.1) 
          ,struct('08',0.2) 
          ,struct('12',0.0) 
         ) 
        ) as (hour,col1) 
     ) 

select count(*),collect_list(col1),max(col1) 
from t 
; 

+-----+-------------------+-----+
| _c0 | _c1               | _c2 |
+-----+-------------------+-----+
| 4   | [0.0,0.1,0.2,0.0] | 0.2 |
+-----+-------------------+-----+

If you want to guarantee the order of the elements within the array (ascending by value), use:

sort_array(collect_list(col1)) 

If you want to eliminate duplicate elements from the array, use:

collect_set(col1) 
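
For comparison, the three variants can be put side by side in one query. This is only a sketch on the same demo data as above; the column aliases are made up for illustration. Note that sort_array orders by element value, not by hour, so on its own it does not keep two columns aligned row by row.

with t as
     (
      select inline
        (
         array
         (
          struct('00',0.0)
          ,struct('04',0.1)
          ,struct('08',0.2)
          ,struct('12',0.0)
         )
        ) as (hour,col1)
     )

select collect_list(col1)             as as_collected    -- duplicates kept, order not guaranteed
      ,sort_array(collect_list(col1)) as sorted_values   -- duplicates kept, ascending by value
      ,collect_set(col1)              as distinct_values -- duplicates removed
from t
;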

Keeping duplicate values without collect_list:

with t as 
     ( 
      select inline 
        (
         array 
         (
          struct('00',0.0) 
          ,struct('04',0.0) 
          ,struct('08',0.1) 
          ,struct('12',0.1) 
         ) 
        ) as (hour,col1) 
     ) 

select map_values(str_to_map(concat_ws(',',collect_set(concat_ws(':',reflect('java.util.UUID','randomUUID'),cast(col1 as string)))))) 
from t 
; 

["0.0","0.0","0.1","0.1"] 
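
Because each column gets its own random UUID keys, the arrays built this way for col1 and col2 have no common order. One possible way to line them up, offered only as a sketch, is to use hour as the map key instead of a UUID and sort the entries before building the map. The assumptions here: hour is unique and zero-padded within the group, the values contain neither ',' nor ':', and str_to_map/map_values keep the entries in insertion order, which is implementation behaviour rather than a documented guarantee, so verify it on your Hive version. The with clause only supplies demo data; on older versions, select from the real table instead.

with t as
     (
      select inline
        (
         array
         (
          struct('00',0.0,'a')
          ,struct('04',0.1,'b')
          ,struct('08',0.2,'c')
          ,struct('12',0.0,'d')
         )
        ) as (hour,col1,col2)
     )

-- key every entry by hour, sort the 'hour:value' strings, then keep only the values
select count(hour)
      ,map_values(str_to_map(concat_ws(',',sort_array(collect_set(concat_ws(':',hour,cast(col1 as string))))))) as col1_arr
      ,map_values(str_to_map(concat_ws(',',sort_array(collect_set(concat_ws(':',hour,col2)))))) as col2_arr
from t
;

Since hour is unique, collect_set drops nothing even when two hours share the same value, and both arrays are ordered by the same key.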

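Thanks for your response! I want to keep duplicate values, but collect_list is not available in Hive 0.10. Is there any other option for keeping duplicate values in a list in Hive 0.10? I have tried collect_set, which eliminates my duplicate values.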


I have an idea, but it will have to wait until tomorrow (note to self: reflect + concat + collect_set + concat_ws + str_to_map + map_values)


See the updated answer, and remember: you asked for it :-)