2017-09-25 43 views

Check the word count in a string and remove words with a lower count - Hive

Suppose I have the following table:

date_part    string_word       id 
2017-08-08  India America Advance Apartments   1 
2017-08-08  Apartments Planner Headlines    1 
2017-08-08  India America Headlines Gucci    1 
2017-08-08  Images Same Thing Africa     2 
2017-08-08  Images          2 
2017-08-07  India America Advance Apartments   2 
2017-08-07  Apartments Planner Headlines    3 
2017-08-07  India America Headlines Gucci    3 
2017-08-07  Images Same Thing Africa     3 
2017-08-07  Images          4 

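Now I want to find the word count for each day and remove the words with a low count. To find the word count, I wrote the following query: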

SELECT date_part, word, COUNT(*) as total_word_count 
FROM table_name LATERAL VIEW explode(split(string_word, ' ')) lTable as word 
where date_part > '2017-08-05' 
GROUP BY date_part, word 

This gives the following:

date_part  word  total_word_count 
2017-08-08  India   2 
2017-08-08  America   2 
2017-08-08  Advance   1 
2017-08-08  Apartments  2 
2017-08-08  Planner   1 
2017-08-08  Headlines  2 
2017-08-08  Gucci   1 
2017-08-08  Images   2 
2017-08-08  Same    1 
2017-08-08  Thing   1 
2017-08-08  Africa   1 
2017-08-07  India   2 
2017-08-07  America   2 
2017-08-07  Advance   1 
2017-08-07  Apartments  2 
2017-08-07  Planner   1 
2017-08-07  Headlines  2 
2017-08-07  Gucci   1 
2017-08-07  Images   2 
2017-08-07  Same    1 
2017-08-07  Thing   1 
2017-08-07  Africa   1 

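Now I want to remove the words whose count is less than 2, i.e. every word with a count of 1 should be removed for each date. The output should look like this: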

date_part    string_word       id 
2017-08-08  India America Apartments     1 
2017-08-08  Apartments Headlines      1 
2017-08-08  India America Headlines     1 
2017-08-08  Images          2 
2017-08-08  Images          2 
2017-08-07  India America Apartments     2 
2017-08-07  Apartments Headlines      3 
2017-08-07  India America Headlines     3 
2017-08-07  Images          3 
2017-08-07  Images          4 

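Here the words with a count of 1 have been removed. This is the output I expect, and it has to be produced every day.

Can someone help me with this?

Thanks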
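The expected result above can be cross-checked outside Hive. The following is a minimal Python sketch, not a Hive solution: the function name `drop_rare_words` and its `min_count` parameter are made up for illustration, and rows are assumed to be `(date_part, string_word, id)` tuples taken from the sample table.

```python
from collections import Counter

def drop_rare_words(rows, min_count=2):
    """Remove words that occur fewer than min_count times within their date_part."""
    # Count every (date_part, word) pair, like GROUP BY date_part, word.
    counts = Counter(
        (date_part, word)
        for date_part, string_word, _id in rows
        for word in string_word.split()
    )
    result = []
    for date_part, string_word, row_id in rows:
        # Keep only the words that are frequent enough on this row's date.
        kept = [w for w in string_word.split()
                if counts[(date_part, w)] >= min_count]
        result.append((date_part, " ".join(kept), row_id))
    return result

# The 2017-08-08 rows from the sample table.
rows = [
    ("2017-08-08", "India America Advance Apartments", 1),
    ("2017-08-08", "Apartments Planner Headlines", 1),
    ("2017-08-08", "India America Headlines Gucci", 1),
    ("2017-08-08", "Images Same Thing Africa", 2),
    ("2017-08-08", "Images", 2),
]

for row in drop_rare_words(rows):
    print(row)
```

For these five rows the sketch prints the same five 2017-08-08 strings as the expected output table, which confirms that the counting has to happen per date before any word is removed.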


Add 'HAVING total_word_count > 1' to the query... –


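@usagi Filtering is fine, but I want to remove the words from the original table. Only words with a count of more than one should remain; the rest should be removed. That is the problem I am trying to solve. – haimen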

Answer

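select  t.date_part
       -- strip every rare word, plus one whitespace character in front of it
       ,regexp_replace(t.string_word, concat('\\s?\\b(', r.words, ')\\b'), '') as string_word
       ,t.id

from    table_name as t

        -- r: one row per date, rare words joined into a regex alternation (w1|w2|...)
        join   (select  date_part
                       ,concat_ws('|', collect_list(col)) as words

                from   (-- w: words that occur exactly once on a given date
                        select  date_part
                               ,e.col

                        from    table_name t
                                lateral view explode(split(t.string_word, '\\s+')) e

                        group by date_part
                                ,e.col

                        having  count(*) = 1
                       ) w

                group by date_part
               ) r

        on      r.date_part = t.date_part
;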

+------------+--------------------------+----+
| date_part  | string_word              | id |
+------------+--------------------------+----+
| 2017-08-07 | India America Apartments | 2  |
| 2017-08-07 | Apartments Headlines     | 3  |
| 2017-08-07 | India America Headlines  | 3  |
| 2017-08-07 | Images                   | 3  |
| 2017-08-07 | Images                   | 4  |
| 2017-08-08 | India America Apartments | 1  |
| 2017-08-08 | Apartments Headlines     | 1  |
| 2017-08-08 | India America Headlines  | 1  |
| 2017-08-08 | Images                   | 2  |
| 2017-08-08 | Images                   | 2  |
+------------+--------------------------+----+
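
One detail of the regexp_replace pattern is easier to see in isolation. The sketch below uses Python's `re` module as a stand-in for the Java regular expressions that Hive uses (an assumption; the two should agree on `\s`, `\b` and alternation for plain ASCII words like these). The helper name `remove_words` and the third sample string are invented for illustration. Because the pattern only consumes whitespace in front of a removed word, removing the first word of a string leaves a leading space. The sample data never hits that case, and wrapping the result in Hive's trim() would be one way to guard against it.

```python
import re

def remove_words(text, words):
    """Strip each listed word plus at most one whitespace character before it."""
    pattern = r"\s?\b(" + "|".join(words) + r")\b"
    return re.sub(pattern, "", text)

# Words that occur once per date in the sample data.
rare = ["Advance", "Planner", "Gucci", "Same", "Thing", "Africa"]

print(repr(remove_words("India America Advance Apartments", rare)))
# 'India America Apartments'
print(repr(remove_words("Images Same Thing Africa", rare)))
# 'Images'

# Hypothetical row in which the rare word comes first.
print(repr(remove_words("Advance India America", rare)))
# ' India America'  (leading space survives)
print(repr(remove_words("Advance India America", rare).strip()))
# 'India America'
```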