LAG offset

TL;DR: scroll down to Task 2.

I'm working with the following data set:

email,createdby,createdon 
[email protected],jsmith,2016-10-10 
[email protected],nsmythe,2016-09-09 
[email protected],vstark,2016-11-11 
[email protected],ajohnson,2015-02-03 
[email protected],elear,2015-01-01 
...

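and so on. Every email is guaranteed to have at least one duplicate in the data set.

There are two tasks to solve. I have solved one of them but am struggling with the other. I'll present both for completeness.

TASK 1 (solved): For each row, per email, return an additional column with the name of the user who created the first record with that email.

Expected result for the sample data set above: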

email,createdby,createdon,original_createdby 
[email protected],jsmith,2016-10-10,nsmythe 
[email protected],nsmythe,2016-09-09,nsmythe 
[email protected],vstark,2016-11-11,nsmythe 
[email protected],ajohnson,2015-02-03,elear 
[email protected],elear,2015-01-01,elear

Code that produces the above:

;WITH q0 -- this is just a security measure in case there are unique emails in the data set 
      AS (SELECT t.email 
       FROM  t 
       GROUP BY t.email 
       HAVING COUNT(*) > 1) , 
     q1 
      AS (SELECT q0.email 
         , createdon 
         , createdby 
         , ROW_NUMBER() OVER (PARTITION BY q0.email ORDER BY createdon) rn 
       FROM  t 
       JOIN  q0 
         ON t.email = q0.email) 
    SELECT q1.email 
      , q1.createdon 
      , q1.createdby 
      , LAG(q1.createdby, q1.rn - 1) OVER (ORDER BY q1.email, q1.createdon) original_createdby 
    FROM q1 
    ORDER BY q1.email 
      , q1.rn

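Brief explanation: I partition the data set by email, number the rows in each partition in order of creation date, and finally return the [createdby] value from the record (rn - 1) rows back. It works exactly as expected.

Now, similar to the above, there is Task 2:

TASK 2: For each row, per email, return the name of the user who created the first duplicate, i.e. the name of the user on the row where rn = 2.

Expected result: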

email,createdby,createdon,first_dupl_createdby 
[email protected],jsmith,2016-10-10,jsmith 
[email protected],nsmythe,2016-09-09,jsmith 
[email protected],vstark,2016-11-11,jsmith 
[email protected],ajohnson,2015-02-03,ajohnson 
[email protected],elear,2015-01-01,ajohnson

I want to keep this performant, so I tried to use the LEAD and LAG functions:

WITH q0 
      AS (SELECT t.email 
       FROM  t 
       GROUP BY t.email 
       HAVING COUNT(*) > 1) , 
     q1 
      AS (SELECT q0.email 
         , createdon 
         , createdby 
         , ROW_NUMBER() OVER (PARTITION BY q0.email ORDER BY createdon) rn 
       FROM  t 
       JOIN  q0 
         ON t.email = q0.email) 
    SELECT q1.email 
      , q1.createdon 
      , q1.createdby 
      , q1.rn 
      , CASE q1.rn 
       WHEN 1 THEN LEAD(q1.createdby, 1) OVER (ORDER BY q1.email, q1.createdon) 
       ELSE LAG(q1.createdby, q1.rn - 2) OVER (ORDER BY q1.email, q1.createdon) 
      END AS first_dupl_createdby 
    FROM q1 
    ORDER BY q1.email 
      , q1.rn

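Explanation: for the first record in each partition, return [createdby] from the following record (i.e. from the record containing the first duplicate). For all other records in the same partition, return [createdby] from (rn - 2) records back (i.e. for rn = 2 we stay on the same record, for rn = 3 we go 1 record back, for rn = 4 we go 2 records back, and so on).

The problem comes up in the

ELSE LAG(q1.createdby, q1.rn - 2)

operation. Apparently, against all logic, despite the preceding branch (WHEN 1 THEN ...), the ELSE branch is also evaluated for rn = 1, which passes a negative offset to the LAG function:

Msg 8730, Level 16, State 2, Line 37 Offset parameter for Lag and Lead functions cannot be a negative value.

When I comment out that ELSE line, the whole thing runs fine, but obviously I get no results in the first_dupl_createdby column.

QUESTION: Is there any way to rewrite the CASE expression above (in Task 2) so that it always returns the value from the record with rn = 2 within each partition, but (and this is the important bit) without a self JOIN? I know I could prepare the rn = 2 rows in a separate subquery, but that would mean an extra scan of the whole table and an unnecessary self JOIN.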

2016-11-16 Piotr L

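Edit your question and include the *results* you want to get for your sample data.

This might sound silly, but what if you used 'ROW_NUMBER() ... + 2' as 'rn' in 'q1'? Then in your 'case' expression you could use 'CASE q1.rn WHEN 3 THEN ...... ELSE LAG(q1.createdby, q1.rn)' – Lamak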
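Another way to reach the goal behind that comment (never handing LAG a negative offset) is to clamp the offset itself. The sketch below is untested and rests on stated assumptions: an offset of 0 returns the current row, which the Task 1 query already relies on; the inline rows stand in for table t, and their email values are made-up placeholders; the window is partitioned by email instead of being ordered by email, which is a change from the question's query.

```sql
-- Sketch only: clamp the LAG offset so it is never negative.
-- The sample rows are hypothetical placeholders standing in for table t.
WITH t AS (
    SELECT v.email, v.createdby, CAST(v.createdon AS date) AS createdon
    FROM (VALUES
        ('a@example.com', 'jsmith',   '2016-10-10'),
        ('a@example.com', 'nsmythe',  '2016-09-09'),
        ('a@example.com', 'vstark',   '2016-11-11'),
        ('b@example.com', 'ajohnson', '2015-02-03'),
        ('b@example.com', 'elear',    '2015-01-01')
    ) AS v(email, createdby, createdon)
),
q1 AS (
    -- Number the rows of each email by creation date.
    SELECT t.email, t.createdon, t.createdby,
           ROW_NUMBER() OVER (PARTITION BY t.email ORDER BY t.createdon) AS rn
    FROM t
)
SELECT q1.email, q1.createdon, q1.createdby, q1.rn,
       CASE q1.rn
           -- First row of a partition: look one row ahead, to rn = 2.
           WHEN 1 THEN LEAD(q1.createdby, 1)
                       OVER (PARTITION BY q1.email ORDER BY q1.createdon)
           -- Other rows: step back (rn - 2) rows. The inner CASE yields 0
           -- for rn = 1, so the offset is non-negative on every row.
           ELSE LAG(q1.createdby, CASE WHEN q1.rn > 2 THEN q1.rn - 2 ELSE 0 END)
                OVER (PARTITION BY q1.email ORDER BY q1.createdon)
       END AS first_dupl_createdby
FROM q1
ORDER BY q1.email, q1.rn;
```

Because the windows are partitioned by email, an email that occurs only once would get NULL from LEAD, so the q0 filter from the question is left out of this sketch.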

I think you can simply use the max window function, since you are trying to get the value from row number = 2 for each partition.

SELECT q1.email 
      , q1.createdon 
      , q1.createdby 
      , q1.rn 
      , max(case when rn=2 then q1.createdby end) over(partition by q1.email) first_dup_created_by 
FROM q1 
ORDER BY q1.email, q1.rn

You can use a similar query to get the row number = 1 result for the first scenario.
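As a rough sketch of that remark, the same pattern can return both columns in one pass. The inline rows stand in for table t and their email values are made-up placeholders; the column aliases follow the question's expected-result headers.

```sql
-- Sketch only: one conditional MAX per row number, windowed by email.
-- The sample rows are hypothetical placeholders standing in for table t.
WITH t AS (
    SELECT v.email, v.createdby, CAST(v.createdon AS date) AS createdon
    FROM (VALUES
        ('a@example.com', 'jsmith',   '2016-10-10'),
        ('a@example.com', 'nsmythe',  '2016-09-09'),
        ('a@example.com', 'vstark',   '2016-11-11'),
        ('b@example.com', 'ajohnson', '2015-02-03'),
        ('b@example.com', 'elear',    '2015-01-01')
    ) AS v(email, createdby, createdon)
),
q1 AS (
    -- Number the rows of each email by creation date.
    SELECT t.email, t.createdon, t.createdby,
           ROW_NUMBER() OVER (PARTITION BY t.email ORDER BY t.createdon) AS rn
    FROM t
)
SELECT q1.email, q1.createdon, q1.createdby, q1.rn,
       -- Task 1: creator of the earliest record for this email.
       MAX(CASE WHEN q1.rn = 1 THEN q1.createdby END)
           OVER (PARTITION BY q1.email) AS original_createdby,
       -- Task 2: creator of the first duplicate for this email.
       MAX(CASE WHEN q1.rn = 2 THEN q1.createdby END)
           OVER (PARTITION BY q1.email) AS first_dupl_createdby
FROM q1
ORDER BY q1.email, q1.rn;
```

Each CASE is non-NULL on at most one row per partition, so MAX simply picks that single value and spreads it across the partition.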

2016-11-16 13:25:04

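That's what happens when you are so focused on a particular language feature (in this case LAG/LEAD) that you forget the simple things. This is the most obvious answer and I feel ashamed now. Thanks.

You can get the information for each email using row_number() and conditional aggregation: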

select email, 
     max(case when seqnum = 1 then createdby end) as createdby_first, 
     max(case when seqnum = 2 then createdby end) as createdby_second 
from (select t.*, 
      row_number() over (partition by email order by createdon) as seqnum 
     from t 
    ) t 
group by email;

You can join this information back to the original data to get what you want. I don't see how lag() would naturally be used to solve this problem.
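A minimal sketch of that join, for illustration only. The inline rows stand in for table t with made-up placeholder emails, and the CTE name per_email is hypothetical. Note that this is the self join the question set out to avoid.

```sql
-- Sketch only: aggregate per email, then join back to the detail rows.
-- The sample rows are hypothetical placeholders standing in for table t.
WITH t AS (
    SELECT v.email, v.createdby, CAST(v.createdon AS date) AS createdon
    FROM (VALUES
        ('a@example.com', 'jsmith',   '2016-10-10'),
        ('a@example.com', 'nsmythe',  '2016-09-09'),
        ('a@example.com', 'vstark',   '2016-11-11'),
        ('b@example.com', 'ajohnson', '2015-02-03'),
        ('b@example.com', 'elear',    '2015-01-01')
    ) AS v(email, createdby, createdon)
),
per_email AS (
    -- One row per email holding the first and second creators.
    SELECT s.email,
           MAX(CASE WHEN s.seqnum = 1 THEN s.createdby END) AS createdby_first,
           MAX(CASE WHEN s.seqnum = 2 THEN s.createdby END) AS createdby_second
    FROM (SELECT t.*,
                 ROW_NUMBER() OVER (PARTITION BY t.email ORDER BY t.createdon) AS seqnum
          FROM t) AS s
    GROUP BY s.email
)
SELECT t.email, t.createdby, t.createdon,
       per_email.createdby_first,
       per_email.createdby_second
FROM t
JOIN per_email
  ON per_email.email = t.email
ORDER BY t.email, t.createdon;
```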

2016-11-16 13:24:49

/shrug

; WITH duplicate_email_addresses AS (
    SELECT email 
    FROM t 
    GROUP 
     BY email 
    HAVING Count(*) > 1 
) 
, records_with_duplicate_email_addresses AS (
    SELECT email 
     , createdon 
     , createdby 
     , Row_Number() OVER (PARTITION BY email ORDER BY createdon) AS sequencer 
    FROM t 
    WHERE EXISTS (
      SELECT * 
      FROM duplicate_email_addresses 
      WHERE email = t.email 
     ) 
) 
, second_duplicate_record AS (-- Why do you need any more than this? 
    SELECT email 
     , createdon 
     , createdby 
    FROM records_with_duplicate_email_addresses 
    WHERE sequencer = 2 
) 
SELECT records_with_duplicate_email_addresses.email 
    , records_with_duplicate_email_addresses.createdon 
    , records_with_duplicate_email_addresses.createdby 
    , second_duplicate_record.createdby AS first_duplicate_createdby 
FROM records_with_duplicate_email_addresses 
INNER 
    JOIN second_duplicate_record 
    ON second_duplicate_record.email = records_with_duplicate_email_addresses.email 
;

2016-11-16 13:30:43 gvee

This is exactly what I was trying to avoid (a self join), but thanks for the comprehensive lesson in SQL formatting and naming.
