2012-07-31 81 views
0

Please consider the following tables. I want to join data from four tables to calculate several weighted scores.

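users holds tens of thousands of Twitter users; their tweets are indexed by sp100_id, which is the id of the company (see sp100) the tweet is talking about. tweets.class holds the sentiment class assigned to each tweet (1 = neutral, 2 = positive, 3 = negative). tweets.rt holds the number of times the tweet has been retweeted. Finally, each user is given a quality score and a follow score, as follows: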

users
-------------------------
user_id  quality  follow
-------------------------
1        2.50     5.00
2        0.75     1.00

tweets
-----------------------------------------------------
tweet_id  sp100_id  nyse_date   user_id  class  rt
-----------------------------------------------------
1         1         2011-03-12  1        1      0
2         1         2011-03-13  1        2      2
3         1         2011-03-13  1        2      1
4         1         2011-03-13  2        2      0
5         1         2011-03-13  2        3      3
6         2         2011-03-12  2        2      3
7         2         2011-03-12  2        2      0
8         2         2011-03-12  1        3      5
9         2         2011-03-13  2        2      0

daterange
----------------
_date
----------------
2011-03-11
2011-03-12
2011-03-13

sp100
----------------
sp100_id  _name
----------------
1         Alcoa
2         Apple
The desired output is a list, per sp100_id and per _date, of the number of positive (class=2) and negative (class=3) tweets, each weighted by rt, quality and follow:

sp100_id  nyse_date   pos-rt  pos-quality  pos-follow  neg-rt  neg-quality  neg-follow
----------------------------------------------------------------------------------------
1         2011-03-11  0       0            0           0       0            0
1         2011-03-12  0       0            0           0       0            0
1         2011-03-13  5 (1)   5.75 (2)     11.00 (3)   3 (4)   0.75 (5)     1.00 (6)
2         2011-03-11  0       0            0           0       0            0
2         2011-03-12  3 (7)   5.00 (8)     10.00 (9)   5.00    2.50         2.50
2         2011-03-13  0       0.75         1.00        0       0            0
----------------------------------------------------------------------------------------

(1) On 2011-03-13, 3 positive tweets for sp100_id 1. 1 tweet retweeted 2 times, 
    1 tweet retweeted 1 time and 1 tweet retweeted 0 times = 2x2+1x1+1x0 = 5 
(2) On 2011-03-13, 2 positive tweets made by user 1, who has quality 2.50 and 
    1 positive tweet made by user 2, who has quality 0.75 = 2x2.50+1x0.75 = 5.75 
(3) On 2011-03-13, 2 positive tweets made by user 1, who has follow 5.00 and 
    1 positive tweet made by user 2, who has follow 1 = 2x5.00+1x1.00 = 11.00 
(4) On 2011-03-13, 1 negative tweet made by user 2, retweeted 3 times = 1x3 = 3 
(5) On 2011-03-13, 1 negative tweet made by user 2, who has quality 0.75, thus 
    1x0.75 = 0.75 
(6) On 2011-03-13, 1 negative tweet made by user 2, who has follow 1.00 so 
    1x1.00 = 1.00 
(7) 1 positive tweet which has been retweeted 3 times, 1 positive tweet without 
    any retweets = 1x3+1x0 = 3 
(8) 2 positive tweets from user 2 x quality 2.50 = 5.00 
(9) 2 positive tweets x follow 5 = 10.00 
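
The weighting described in the footnotes can be checked by hand with a few lines of Python. This is only a sketch over the sample rows above, and it assumes that every tweet contributes its own rt, and its author's quality and follow, exactly once; the helper name `weighted` is made up for illustration. Under that assumption the positive rt total for sp100_id 1 on 2011-03-13 comes out as 3 rather than the 5 in footnote (1), which counts the twice-retweeted tweet as 2x2.

```python
# Sample rows copied from the users and tweets tables above.
users = {
    1: {"quality": 2.50, "follow": 5.00},
    2: {"quality": 0.75, "follow": 1.00},
}

# (tweet_id, sp100_id, nyse_date, user_id, class, rt)
tweets = [
    (1, 1, "2011-03-12", 1, 1, 0),
    (2, 1, "2011-03-13", 1, 2, 2),
    (3, 1, "2011-03-13", 1, 2, 1),
    (4, 1, "2011-03-13", 2, 2, 0),
    (5, 1, "2011-03-13", 2, 3, 3),
    (6, 2, "2011-03-12", 2, 2, 3),
    (7, 2, "2011-03-12", 2, 2, 0),
    (8, 2, "2011-03-12", 1, 3, 5),
    (9, 2, "2011-03-13", 2, 2, 0),
]


def weighted(sp100_id, date, cls):
    """Sum rt, quality and follow over the tweets of one class
    for one company on one day (hypothetical helper)."""
    rows = [t for t in tweets
            if t[1] == sp100_id and t[2] == date and t[4] == cls]
    return (
        sum(t[5] for t in rows),                    # retweet counts
        sum(users[t[3]]["quality"] for t in rows),  # author quality
        sum(users[t[3]]["follow"] for t in rows),   # author follow
    )


print(weighted(1, "2011-03-13", 2))  # (3, 5.75, 11.0)
print(weighted(1, "2011-03-13", 3))  # (3, 0.75, 1.0)
```

The quality and follow totals agree with footnotes (2), (3), (5) and (6); only the rt total differs from footnote (1).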

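I have tried to explain myself as well as I can. Who can help me construct the correct query? As you can see, dates without any tweets (all values zero) also need to be included in the result set. This is what I have so far, but I am having trouble sorting out the rest: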

SELECT 
    s.sp100_id, 
    d._date, 
    COALESCE(c.pos-rt,0)  AS pos-rt, 
    COALESCE(c.pos-quality,0) AS pos-quality, 
    COALESCE(c.pos-follow,0) AS pos-follow, 
    COALESCE(c.neg-rt,0)  AS neg-rt, 
    COALESCE(c.neg-quality,0) AS neg-quality, 
    COALESCE(c.neg-follow,0) AS neg-follow 
FROM sp100 s 
CROSS JOIN daterange d 
LEFT JOIN (
    SELECT 
     sp100_id, 
     nyse_date, 
     COUNT(CASE class WHEN 2 THEN 1 END) * [rt]  AS pos-rt, 
     COUNT(CASE class WHEN 2 THEN 1 END) * [quality] AS pos-quality, 
     COUNT(CASE class WHEN 2 THEN 1 END) * [follow] AS pos-follow, 
     COUNT(CASE class WHEN 3 THEN 1 END) * [rt]  AS neg-rt, 
     COUNT(CASE class WHEN 3 THEN 1 END) * [quality] AS neg-quality, 
     COUNT(CASE class WHEN 3 THEN 1 END) * [follow] AS neg-follow 
    FROM tweets 
    GROUP BY sp100_id, nyse_date 
) c ON s.sp100_id = c.sp100_id AND d._date = c.nyse_date 
ORDER BY s.sp100_id, d._date ASC 

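Obviously, [rt], [quality] and [follow] need to be replaced by the correct syntax. I am not sure about the COUNT(...) either, because as written it first counts the number of tweets, whereas it should take each tweet separately and multiply it by its own retweet count ('rt').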

Can someone help me?

+1

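I'm having some trouble understanding your table footnote (1): the first tweet was retweeted twice; why does it contribute 2 * 2 to 'pos-rt' rather than 1 * 2, while the other two tweets (retweeted once and zero times) contribute 1 * 1 and 1 * 0 respectively? – eggyal 2012-07-31 17:30:07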

+1

In footnote (8), I think the relevant user has 'user_id = 2' and quality = 0.75, so 'pos-quality' should be '1.5'? Likewise, for footnote (9) 'follow = 1.00', so 'pos-follow' should be '2.00'? – eggyal 2012-07-31 17:45:44

+0

You are correct on both counts :-) – Pr0no 2012-07-31 20:09:34

Answer

2

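Assuming I have understood the question correctly (see my comments above), you just need to group the joined tables and SUM() the relevant fields, using IF() to pick out the tweets of the desired class: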

SELECT    sp100.sp100_id                               AS `sp100_id`,
          daterange._date                              AS `nyse_date`,
          SUM(IF(tweets.class=2, tweets.rt,     0))    AS `pos-rt`,
          SUM(IF(tweets.class=2, users.quality, 0))    AS `pos-quality`,
          SUM(IF(tweets.class=2, users.follow,  0))    AS `pos-follow`,
          SUM(IF(tweets.class=3, tweets.rt,     0))    AS `neg-rt`,
          SUM(IF(tweets.class=3, users.quality, 0))    AS `neg-quality`,
          SUM(IF(tweets.class=3, users.follow,  0))    AS `neg-follow`
FROM      sp100
     JOIN daterange
LEFT JOIN tweets ON tweets.nyse_date = daterange._date
                AND tweets.sp100_id  = sp100.sp100_id
LEFT JOIN users  ON tweets.user_id   = users.user_id
GROUP BY  sp100.sp100_id, daterange._date

See it on sqlfiddle.
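
For readers without a MySQL instance to hand, the same approach can be tried against the sample data with Python's built-in sqlite3 module. This is a sketch under a few assumptions, not the answer's exact query: SQLite has no IF(), so CASE WHEN ... ELSE 0 END stands in for it; the hyphenated aliases are quoted with double quotes instead of backticks; the column types are guesses; and an ORDER BY is added so the output order is predictable.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Schema and rows mirror the sample tables in the question; types are assumed.
conn.executescript("""
    CREATE TABLE users     (user_id INTEGER PRIMARY KEY, quality REAL, follow REAL);
    CREATE TABLE sp100     (sp100_id INTEGER PRIMARY KEY, _name TEXT);
    CREATE TABLE daterange (_date TEXT PRIMARY KEY);
    CREATE TABLE tweets    (tweet_id INTEGER PRIMARY KEY, sp100_id INTEGER,
                            nyse_date TEXT, user_id INTEGER,
                            class INTEGER, rt INTEGER);

    INSERT INTO users VALUES (1, 2.50, 5.00), (2, 0.75, 1.00);
    INSERT INTO sp100 VALUES (1, 'Alcoa'), (2, 'Apple');
    INSERT INTO daterange VALUES ('2011-03-11'), ('2011-03-12'), ('2011-03-13');
    INSERT INTO tweets VALUES
        (1, 1, '2011-03-12', 1, 1, 0),
        (2, 1, '2011-03-13', 1, 2, 2),
        (3, 1, '2011-03-13', 1, 2, 1),
        (4, 1, '2011-03-13', 2, 2, 0),
        (5, 1, '2011-03-13', 2, 3, 3),
        (6, 2, '2011-03-12', 2, 2, 3),
        (7, 2, '2011-03-12', 2, 2, 0),
        (8, 2, '2011-03-12', 1, 3, 5),
        (9, 2, '2011-03-13', 2, 2, 0);
""")

# CASE WHEN replaces MySQL's IF(); the CROSS JOIN produces every
# company/date pair so that days without tweets still appear, as zeros.
QUERY = """
SELECT s.sp100_id,
       d._date AS nyse_date,
       SUM(CASE WHEN t.class = 2 THEN t.rt      ELSE 0 END) AS "pos-rt",
       SUM(CASE WHEN t.class = 2 THEN u.quality ELSE 0 END) AS "pos-quality",
       SUM(CASE WHEN t.class = 2 THEN u.follow  ELSE 0 END) AS "pos-follow",
       SUM(CASE WHEN t.class = 3 THEN t.rt      ELSE 0 END) AS "neg-rt",
       SUM(CASE WHEN t.class = 3 THEN u.quality ELSE 0 END) AS "neg-quality",
       SUM(CASE WHEN t.class = 3 THEN u.follow  ELSE 0 END) AS "neg-follow"
FROM sp100 s
CROSS JOIN daterange d
LEFT JOIN tweets t ON t.nyse_date = d._date AND t.sp100_id = s.sp100_id
LEFT JOIN users  u ON u.user_id = t.user_id
GROUP BY s.sp100_id, d._date
ORDER BY s.sp100_id, d._date
"""

rows = conn.execute(QUERY).fetchall()
for row in rows:
    print(row)
```

On the sample data this yields six rows, one per company and date. The row for sp100_id 2 on 2011-03-12 shows pos-quality 1.5 and pos-follow 2.0, which matches the corrections agreed in the comments on the question.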

[Edit] Here is the EXPLAIN output:

id  select_type  table      type    possible_keys             key        key_len  ref                         rows  Extra
---------------------------------------------------------------------------------------------------------------------------------------------------------
1   SIMPLE       sp100      index   NULL                      PRIMARY    4        NULL                        101   Using index; Using temporary; Using filesort
1   SIMPLE       daterange  index   NULL                      _date      3        NULL                        147   Using index; Using join buffer
1   SIMPLE       tweets     ref     query,nyse_date,sp100_id  nyse_date  3        sentimeter.daterange._date  3815
1   SIMPLE       users      eq_ref  PRIMARY                   PRIMARY    4        sentimeter.tweets.user_id   1

+0

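Thanks, this is brilliant :-) However, the query times out on my laptop (where it runs), even with indexes on all fields. Maybe I first have to add extra columns to 'tweets' and fill in quality etc. Then I would not have to compute the scores by joining the 'users' table. – Pr0no 2012-07-31 20:43:12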

+1

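@Pr0no: Look at the 'EXPLAIN' output to see MySQL's query execution plan. It may be that your indexes need tuning (simply indexing every column on its own will not help much, because MySQL can typically use only one index per table in a query: you would do better to build a suitable composite index, but the order of its columns will matter). – eggyal 2012-07-31 20:47:07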
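
The composite-index suggestion can be illustrated with a small sketch. It uses Python's sqlite3 module only so that it is runnable anywhere; SQLite's planner is not MySQL's, so the plan text will not match the EXPLAIN output above. The index name idx_tweets_company_date and the column order (sp100_id, nyse_date) are assumptions chosen to match the join condition on tweets, not something taken from the thread.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE tweets (tweet_id INTEGER PRIMARY KEY, sp100_id INTEGER,
                         nyse_date TEXT, user_id INTEGER,
                         class INTEGER, rt INTEGER)
""")

# One index covering both lookup columns, instead of one index per column.
# The name and column order are hypothetical; the same CREATE INDEX
# statement is also valid MySQL syntax.
conn.execute(
    "CREATE INDEX idx_tweets_company_date ON tweets (sp100_id, nyse_date)"
)

# Ask SQLite how it would look up the tweets for one company on one day.
plan = conn.execute("""
    EXPLAIN QUERY PLAN
    SELECT user_id, class, rt
    FROM tweets
    WHERE sp100_id = 1 AND nyse_date = '2011-03-13'
""").fetchall()

# The human-readable description is the last column of each plan row.
details = [row[-1] for row in plan]
for detail in details:
    print(detail)
```

The printed plan should name the composite index, showing that both equality conditions are served by a single index lookup. In MySQL the equivalent check would be to rerun EXPLAIN on the full query after creating the index.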

+0

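See http://sqlfiddle.com/#!2/c13fc I have added the necessary columns to the 'tweets' table, which makes the 'users' table obsolete. Could you update the query so that I can try running it? I will also post the 'EXPLAIN' for the original query. – Pr0no 2012-07-31 21:02:24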