2015-11-20

I am trying to find the top 10 mentions (@xxxxx) in my Twitter data. I created the initial table twitter.full_text_ts and loaded my data into it:

create table twitter.full_text_ts as 
select id, cast(concat(substr(ts,1,10), ' ', substr(ts,12,8)) as timestamp) as ts, lat, lon, tweet 
from full_text; 

I have been able to extract the mentions (patterns) using this query:

select id, ts, regexp_extract(lower(tweet), '(.*)@user_(\\S{8})([:| ])(.*)',2) as patterns 
from twitter.full_text_ts 
order by patterns desc 
limit 50; 

Running that query to extract the mentions in the tweets gives me:

USER_a3ed4b5a 2010-03-07 03:46:23 fffed220 
USER_dc8cfa6f 2010-03-05 18:28:39 fffdabf9 
USER_dc8cfa6f 2010-03-05 18:32:55 fffdabf9 
USER_915e3f8c 2010-03-07 03:39:09 fffdabf9 
and so on... 

As you can see, fffed220 and so on are the extracted patterns.

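What I want to do now is count how many times each of these mentions (patterns) occurs and output the top 10. For example, fffdabf9 occurs 20 times, fffxxxx occurs 17 times, and so on.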

Hive won't let me use count with the regexp_extract() patterns? – Cale

Answers

The most readable way to do this would be to save your first query to a temporary table and then do a GROUP BY on the temporary table:

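-- Step 1: save the extracted mentions (no order by/limit here, so every tweet is counted)
create table tmp as 
select id, ts, regexp_extract(lower(tweet), '(.*)@user_(\\S{8})([:| ])(.*)',2) as patterns 
from twitter.full_text_ts; 

-- Step 2: count each pattern and keep the 10 most frequent
select patterns, count(*) as n_mentions 
from tmp 
group by patterns 
order by n_mentions desc 
limit 10; 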

It won't let me group by patterns because it is not a column. – Cale

It worked by querying the temporary table with the selected patterns, from (tt) tnt1 > limit 10; – Cale
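
The grouping error mentioned in the comments is probably because Hive generally does not let GROUP BY refer to an alias defined in the same SELECT list. As a rough sketch, not tested against this dataset, the expression itself can be repeated in the GROUP BY clause, which avoids the temporary table. The n_mentions alias and the filter on empty strings are illustrative additions. The filter assumes that regexp_extract returns an empty string for tweets with no match, which is my understanding of Hive's behavior.

```sql
-- Sketch: group by the extraction expression directly instead of its alias.
-- Assumption: tweets without a mention yield '' and should not be counted.
select regexp_extract(lower(tweet), '(.*)@user_(\\S{8})([:| ])(.*)', 2) as patterns,
       count(*) as n_mentions
from twitter.full_text_ts
where regexp_extract(lower(tweet), '(.*)@user_(\\S{8})([:| ])(.*)', 2) <> ''
group by regexp_extract(lower(tweet), '(.*)@user_(\\S{8})([:| ])(.*)', 2)
order by n_mentions desc
limit 10;
```

Repeating the expression is verbose, so the temporary table or a subquery may still be easier to read when the pattern is long.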

-- Single-query alternative: a CTE takes the place of the temporary table.
-- No order by/limit inside the CTE, so every row is counted.
with mentions as 
(select id, ts, 
regexp_extract(lower(tweet), '(.*)@user_(\\S{8})([:| ])(.*)',2) as patterns 
from twitter.full_text_ts) 
select patterns, count(*) as n_mentions 
from mentions 
group by patterns 
order by n_mentions desc 
limit 10;