2017-02-23

Joining three tables (test_1, test_2, test_3) and using GROUP BY with CLOB data

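test_1 and test_3 are the main tables and have no common column. There is a joining table, test_2.
test_1 has sr_id and last_updated_date; test_2 has sr_id and sm_id; test_3 has sm_id and sql_statement. test_3 holds the CLOB data that is causing all the trouble.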

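I have to find the latest sr_id associated with each sm_id. My idea was to use the aggregate function max(last_updated_date) and group by the other columns. That is not working, for several reasons:

  1. One of the columns, sql_statement, contains CLOB data.

  2. I have used a join that I am not familiar with.

Any ideas would help.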

WITH xx as (
    (select ANSWER ,sr_id AS ID from test 
    WHERE Q_ID in (SELECT Q_ID FROM test_2 WHERE field_id='LM_LRE_Q6') 
    ) 
) 
-- end of source data 


SELECT t.ID, t1.n, t1.SM_ID,seg_dtls.SEGMENTation_NAME ,to_char(mst.LAST_UPDATED_DATE,'dd-mon-yyyy hh24:mi:ss'),seg_dtls.sql_statement 
FROM xx t 
CROSS JOIN LATERAL (
     select LEVEL AS n, regexp_substr(t.answer, '\d+', 1, level) as SM_ID 
     from dual 
     connect by regexp_substr(t.answer, '\d+', 1, level) IS NOT NULL 
) t1 
left join test_1 mst 
on mst.sr_id=t.id 
right join test_3 seg_dtls 
on seg_dtls.sm_id=t1.sm_id; 

The sample data looks like this:

sr_id sm_id SEGMENTATION_NAME LAST_UPDATED_DATE 
1108197 958 test_not_in   05-feb-2017 23:56:59  
1108217 958 test_not_in   14-feb-2017 00:37:39 
1108218 958 test_not_in   14-feb-2017 01:39:50 
1108220 958 test_not_in   14-feb-2017 03:39:07 

and the expected output is:

1108220 958 test_not_in   14-feb-2017 03:39:07 

I am not posting the CLOB data because it is huge. Every row contains CLOB data.

table test_3 contains 
q_id  sr_id answer 
1009330 1108246 976~feb_24^941~Test_regionwithcountry 
1009330 1108247 941~Test_regionwithcountry_2016^787~Test_Request_28^976~feb_24 
1009330 1108239 972~test_emea 
1009330 1108240 972~test_emea^827~test_with_region_country 
1009330 1108251 981~MSE100579729 testing. 

The sample data for test_3 looks like the above.
The answer column contains the SM_IDs, and I have to pull them out of it.
For example:

941~Test_regionwithcountry_2016^787~Test_Request_28^976~feb_24 
the sm_id is 941,787,976 

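So I came up with the query shown above.
As for the left and right joins: every sm_id from test_3 is required, so I used a right join there.

Edit 1: the accepted answer gives the SR_ID of each segment that has max(last_updated_date).
I need all the SR_IDs, so I used the MINUS operator to get the ones that do not have the maximum last_updated_date.
I need to append that result set to the accepted answer's result.

This is what I did to get the other SR_IDs.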

select sr_id,segmentation_name,request_status from (with test_31 (q_id, sr_id, answer) as (
(SELECT Q_ID,SR_ID,ANSWER FROM test_3 WHERE Q_ID=(SELECT Q_ID FROM test_4 WHERE FIELD_ID='LM_LRE_Q6')) 
), 
answer_extraction as (
    select q_id, sr_id, 
    regexp_substr(regexp_substr(answer, '[^^]+', 1, level),'\d+') as sm_id 
    from test_31 
    connect by q_id = prior q_id 
    and sr_id = prior sr_id 
    and prior dbms_random.value is not null 
    and regexp_substr(answer, '[^^]+', 1, level) is not null 
) 
select sr_id, 
    sm_id, 
    segmentation_name, 
    LAST_UPDATED_DATE, 
    sql_statement,request_status 
from (
    select t1.sr_id, 
    t2.sm_id, 
    t2.segmentation_name, 
    t1.last_updated_date, 
    t2.sql_statement, 
    t1.request_status 

    from test_4 t4 
    join answer_extraction t3 on t3.q_id = t4.q_id 
    join test_2 t2 on t2.sm_id = t3.sm_id 
    join test_1 t1 on t1.sr_id = t3.sr_id 
) 
) 
minus 

(select sr_id,segmentation_name , request_status from (with test_31 (q_id, sr_id, answer) as (
(SELECT Q_ID,SR_ID,ANSWER FROM test_3 WHERE Q_ID=(SELECT Q_ID FROM test_4 WHERE FIELD_ID='LM_LRE_Q6')) 
), 
answer_extraction as (
    select q_id, sr_id, 
    regexp_substr(regexp_substr(answer, '[^^]+', 1, level), '\d+') as sm_id 
    from test_31 
    connect by q_id = prior q_id 
    and sr_id = prior sr_id 
    and prior dbms_random.value is not null 
    and regexp_substr(answer, '[^^]+', 1, level) is not null 
) 
select sr_id, 
    segmentation_name, 
    sql_statement, 
    request_status 
from (
    select t1.sr_id, 
    t2.sm_id, 
    t2.segmentation_name, 
    t1.last_updated_date, 
    t2.sql_statement, 
    t1.request_status, 
    max(t1.last_updated_date) over (partition by t2.sm_id) as max_updated_date 
    from test_4 t4 
    join answer_extraction t3 on t3.q_id = t4.q_id 
    join test_2 t2 on t2.sm_id = t3.sm_id 
    join test_1 t1 on t1.sr_id = t3.sr_id 
) 
where last_updated_date = max_updated_date)); 


Sample data:
The accepted answer gives the following output for the segment with the maximum time (LAST_UPDATED_DATE, LAST_UPDATED_TIME).

1097661 Submitted o2k lad 30-NOV-15 01-DEC-16 62 CLOB DATA 

The query posted above gives the output below, which is the sr_ids of the segment that have other update dates.

1097621 o2k lad Submitted
1097625 o2k lad Submitted
1097627 o2k lad Submitted
1097632 o2k lad Submitted
1097633 o2k lad Submitted
1097658 o2k lad Pending
1097640 o2k lad Submitted
1097644 o2k lad Submitted
1097646 o2k lad Submitted

Expected output:

sr_id status  segment_name updated_date sql_statement other_sr_id 
1097661 Submitted o2k lad  30-NOV-15  CLOB DATA 1097618,1097621,1097625,1097627,1097632,1097633,1097658,1097640,1097644,1097646 

I need a query that combines the two, so that the last column contains all the older sr_ids.

+0

Please post sample input data and the expected output. That will help everyone. – Tajinder

+1

Your original plan of using 'max(last_updated_date)' seems more promising than the code in your question. Maybe you should start again. –

+0

I know, but I need all the columns, including the one that contains a CLOB. – user3165555

Answer

A fairly simple option is to modify your current query to add an analytic function that finds the maximum date for each ID, something like:

..., max(mst.last_updated_date) over (partition by id) as max_updated_date 

A quick demo of the general idea:

with cte (id, last_updated_date, sql_statement) as (
    select 1, date '2017-01-01', to_clob('stmt 1') from dual 
    union all select 1, date '2017-01-02', to_clob('stmt 2') from dual 
    union all select 1, date '2017-01-03', to_clob('stmt 3') from dual 
    union all select 2, date '2017-01-02', to_clob('stmt 4') from dual 
) 
select id, last_updated_date, sql_statement 
from (
    select id, last_updated_date, sql_statement, 
    max(last_updated_date) over (partition by id) as max_updated_date 
    from cte 
) 
where last_updated_date = max_updated_date; 

        ID LAST_UPDAT SQL_STATEMENT
---------- ---------- --------------------------------------------------------------------------------
         1 2017-01-03 stmt 3
         2 2017-01-02 stmt 4

You could instead use row_number(), rank() or dense_rank() to identify the rows with the latest date and filter on that, but the general idea is the same.
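As an illustration of that alternative, here is a minimal sketch of the same demo using row_number(). The rn alias is made up for this example, and the CTE data is the same made-up data as in the demo. Note that row_number() keeps exactly one row per id even if two rows tie on the latest date (which one is arbitrary), whereas rank() or the max() comparison would keep all of the tied rows.

```sql
-- Same made-up demo data as the max() example.
with cte (id, last_updated_date, sql_statement) as (
    select 1, date '2017-01-01', to_clob('stmt 1') from dual
    union all select 1, date '2017-01-02', to_clob('stmt 2') from dual
    union all select 1, date '2017-01-03', to_clob('stmt 3') from dual
    union all select 2, date '2017-01-02', to_clob('stmt 4') from dual
)
select id, last_updated_date, sql_statement
from (
    select id, last_updated_date, sql_statement,
        -- rn (hypothetical alias) is 1 for the newest row within each id
        row_number() over (partition by id order by last_updated_date desc) as rn
    from cte
)
where rn = 1;
```

The CLOB column only appears in the select list, never in the partition by or order by clauses, which is why this avoids the problem that GROUP BY runs into.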

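However, your current query isn't very clear (or valid before 12c) to begin with. Rather than trying to guess how to fit such a function and filter into it, it might be simpler to start again from the base tables, although that means making a lot of assumptions about what you are doing, and it may overlook things, such as the left and right joins, that may or may not be needed.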

Some made-up data, supplied via CTEs:

with test_1 (sr_id, last_updated_date) as (
    select 1108197, timestamp '2017-02-05 23:56:59' from dual 
    union all select 1108217, timestamp '2017-02-14 00:37:39' from dual 
    union all select 1108218, timestamp '2017-02-14 01:39:50' from dual 
    union all select 1108220, timestamp '2017-02-14 03:39:07' from dual 
), 
test_2 (sm_id, segmentation_name, sql_statement) as (
    select 958, 'test_not_in', to_clob('select * from dual') from dual 
), 
test_3 (q_id, sr_id, answer) as (
    select 41, 1108197, 958 from dual 
    union all select 42, 1108217, 958 from dual 
    union all select 43, 1108218, 958 from dual 
    union all select 44, 1108220, 958 from dual 
), 
test_4 (q_id, field_id) as (
    select 41, 'LM_LRE_Q6' from dual 
    union all select 42, 'LM_LRE_Q6' from dual 
    union all select 43, 'LM_LRE_Q6' from dual 
    union all select 44, 'LM_LRE_Q6' from dual 
) 

Then this gets the same output you showed in the question:

select t1.sr_id, 
    t2.sm_id, 
    t2.segmentation_name, 
    to_char(t1.last_updated_date, 'dd-mon-yyyy hh24:mi:ss') as last_updated_date, 
    t2.sql_statement 
from test_4 t4 
join test_3 t3 on t3.q_id = t4.q_id 
join test_2 t2 on t2.sm_id = t3.answer 
join test_1 t1 on t1.sr_id = t3.sr_id; 

     SR_ID SM_ID SEGMENTATIO LAST_UPDATED_DATE             SQL_STATEMENT
---------- ----- ----------- ----------------------------- --------------------------------------------------------------------------------
   1108197   958 test_not_in 05-feb-2017 23:56:59          select * from dual
   1108217   958 test_not_in 14-feb-2017 00:37:39          select * from dual
   1108218   958 test_not_in 14-feb-2017 01:39:50          select * from dual
   1108220   958 test_not_in 14-feb-2017 03:39:07          select * from dual

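If those wild assumptions are close to right, you can find the row with the most recent date for each sm_id by adding the same analytic max() and filter that the quick demo uses.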
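The query for this step does not appear in the text here, so what follows is only a sketch of what it could look like. It is reconstructed from the later answer_extraction version further down, and repeats the made-up CTE data so that it can be run on its own; it is not necessarily the exact query that was originally posted.

```sql
-- Made-up data, copied from the CTEs above (assumes answer holds a single numeric sm_id).
with test_1 (sr_id, last_updated_date) as (
    select 1108197, timestamp '2017-02-05 23:56:59' from dual
    union all select 1108217, timestamp '2017-02-14 00:37:39' from dual
    union all select 1108218, timestamp '2017-02-14 01:39:50' from dual
    union all select 1108220, timestamp '2017-02-14 03:39:07' from dual
),
test_2 (sm_id, segmentation_name, sql_statement) as (
    select 958, 'test_not_in', to_clob('select * from dual') from dual
),
test_3 (q_id, sr_id, answer) as (
    select 41, 1108197, 958 from dual
    union all select 42, 1108217, 958 from dual
    union all select 43, 1108218, 958 from dual
    union all select 44, 1108220, 958 from dual
),
test_4 (q_id, field_id) as (
    select 41, 'LM_LRE_Q6' from dual
    union all select 42, 'LM_LRE_Q6' from dual
    union all select 43, 'LM_LRE_Q6' from dual
    union all select 44, 'LM_LRE_Q6' from dual
)
select sr_id,
    sm_id,
    segmentation_name,
    to_char(last_updated_date, 'dd-mon-yyyy hh24:mi:ss') as last_updated_date,
    sql_statement
from (
    select t1.sr_id,
        t2.sm_id,
        t2.segmentation_name,
        t1.last_updated_date,
        t2.sql_statement,
        -- latest date seen for each sm_id, repeated on every row of that sm_id
        max(t1.last_updated_date) over (partition by t2.sm_id) as max_updated_date
    from test_4 t4
    join test_3 t3 on t3.q_id = t4.q_id
    join test_2 t2 on t2.sm_id = t3.answer
    join test_1 t1 on t1.sr_id = t3.sr_id
)
-- keep only the row(s) carrying that latest date
where last_updated_date = max_updated_date;
```

With the sample data this should leave only the row for sr_id 1108220, matching the expected output in the question.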

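You will need to adapt it to handle any other restrictions or requirements that aren't clear (for example, including your left/right outer joins).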

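I have deliberately ignored the subquery that splits 'answer' into multiple values. It's possible you have something horrible in there, like a delimited list of IDs, which is a data model problem. If that is the case then you still need to extract the individual values, something like this: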

with answer_extraction as (
    select q_id, sr_id, regexp_substr(answer, '\d+', 1, level) as sm_id 
    from test_3 
    connect by q_id = prior q_id 
    and sr_id = prior sr_id 
    and prior dbms_random.value is not null 
    and regexp_substr(answer, '\d+', 1, level) is not null 
) 
select sr_id, 
    sm_id, 
    segmentation_name, 
    to_char(last_updated_date, 'dd-mon-yyyy hh24:mi:ss') as last_updated_date, 
    sql_statement 
from (
    select t1.sr_id, 
    t2.sm_id, 
    t2.segmentation_name, 
    t1.last_updated_date, 
    t2.sql_statement, 
    max(t1.last_updated_date) over (partition by t2.sm_id) as max_updated_date 
    from test_4 t4 
    join answer_extraction t3 on t3.q_id = t4.q_id 
    join test_2 t2 on t2.sm_id = t3.sm_id 
    join test_1 t1 on t1.sr_id = t3.sr_id 
) 
where last_updated_date = max_updated_date; 

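Based on the actual test_3 content you have added, the regular expression isn't doing quite what you need. The pattern you are using finds 14 numeric values, i.e. every run of digits: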

with test_3 (q_id, sr_id, answer) as (
    select 1009330, 1108246, '976~feb_24^941~Test_regionwithcountry' from dual 
    union all select 1009330, 1108247, '941~Test_regionwithcountry_2016^787~Test_Request_28^976~feb_24' from dual 
    union all select 1009330, 1108239, '972~test_emea' from dual 
    union all select 1009330, 1108240, '972~test_emea^827~test_with_region_country' from dual 
    union all select 1009330, 1108251, '981~MSE100579729 testing.' from dual 
), 
answer_extraction as (
    select q_id, sr_id, regexp_substr(answer, '\d+', 1, level) as sm_id 
    from test_3 
    connect by q_id = prior q_id 
    and sr_id = prior sr_id 
    and prior dbms_random.value is not null 
    and regexp_substr(answer, '\d+', 1, level) is not null 
) 
select * from answer_extraction; 

      Q_ID      SR_ID SM_ID
---------- ---------- ----------
   1009330    1108239 972
   1009330    1108240 972
   1009330    1108240 827
   1009330    1108246 976
   1009330    1108246 24
   1009330    1108246 941
   1009330    1108247 941
   1009330    1108247 2016
   1009330    1108247 787
   1009330    1108247 28
   1009330    1108247 976
   1009330    1108247 24
   1009330    1108251 981
   1009330    1108251 100579729

It seems you only want the bits between the ^ delimiters and the ~ markers. The common way to split a delimited string is:

with test_3 (q_id, sr_id, answer) as (
    select 1009330, 1108246, '976~feb_24^941~Test_regionwithcountry' from dual 
    union all select 1009330, 1108247, '941~Test_regionwithcountry_2016^787~Test_Request_28^976~feb_24' from dual 
    union all select 1009330, 1108239, '972~test_emea' from dual 
    union all select 1009330, 1108240, '972~test_emea^827~test_with_region_country' from dual 
    union all select 1009330, 1108251, '981~MSE100579729 testing.' from dual 
), 
answer_extraction as (
    select q_id, sr_id, regexp_substr(answer, '[^^]+', 1, level) as sm_id 
    from test_3 
    connect by q_id = prior q_id 
    and sr_id = prior sr_id 
    and prior dbms_random.value is not null 
    and regexp_substr(answer, '[^^]+', 1, level) is not null 
) 
select * from answer_extraction; 

      Q_ID      SR_ID SM_ID
---------- ---------- ----------------------------------------
   1009330    1108239 972~test_emea
   1009330    1108240 972~test_emea
   1009330    1108240 827~test_with_region_country
   1009330    1108246 976~feb_24
   1009330    1108246 941~Test_regionwithcountry
   1009330    1108247 941~Test_regionwithcountry_2016
   1009330    1108247 787~Test_Request_28
   1009330    1108247 976~feb_24
   1009330    1108251 981~MSE100579729 testing.

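But you then need to get just the first part of each of those, for example by borrowing your original pattern (other approaches are available!):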

column sm_id format a10 
with test_3 (q_id, sr_id, answer) as (
    select 1009330, 1108246, '976~feb_24^941~Test_regionwithcountry' from dual 
    union all select 1009330, 1108247, '941~Test_regionwithcountry_2016^787~Test_Request_28^976~feb_24' from dual 
    union all select 1009330, 1108239, '972~test_emea' from dual 
    union all select 1009330, 1108240, '972~test_emea^827~test_with_region_country' from dual 
    union all select 1009330, 1108251, '981~MSE100579729 testing.' from dual 
), 
answer_extraction as (
    select q_id, sr_id, 
    regexp_substr(regexp_substr(answer, '[^^]+', 1, level), '\d+') as sm_id 
    from test_3 
    connect by q_id = prior q_id 
    and sr_id = prior sr_id 
    and prior dbms_random.value is not null 
    and regexp_substr(answer, '[^^]+', 1, level) is not null 
) 
select * from answer_extraction; 

      Q_ID      SR_ID SM_ID
---------- ---------- ----------
   1009330    1108239 972
   1009330    1108240 972
   1009330    1108240 827
   1009330    1108246 976
   1009330    1108246 941
   1009330    1108247 941
   1009330    1108247 787
   1009330    1108247 976
   1009330    1108251 981

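Note that the extra regexp_substr() is only in the select list, not in the connect by clause; and that the extracted sm_id is still a string. If test_2.sm_id is a number, add a to_number() call around that pair of regexp_substr() calls in the select list.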
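As a sketch of that last point, assuming test_2.sm_id really is a numeric column (the question does not show its data type), the conversion could look like this. Only one of the sample rows is used, to keep it short.

```sql
-- One sample row from the question; the CTE stands in for the real test_3 table.
with test_3 (q_id, sr_id, answer) as (
    select 1009330, 1108240, '972~test_emea^827~test_with_region_country' from dual
),
answer_extraction as (
    select q_id, sr_id,
        -- inner call: the level-th ^-delimited token; outer call: its first run of digits;
        -- to_number() converts that string so it can be compared with a numeric sm_id
        to_number(regexp_substr(regexp_substr(answer, '[^^]+', 1, level), '\d+')) as sm_id
    from test_3
    connect by q_id = prior q_id
    and sr_id = prior sr_id
    and prior dbms_random.value is not null
    and regexp_substr(answer, '[^^]+', 1, level) is not null
)
select q_id, sr_id, sm_id
from answer_extraction;
```

If a token contains no digits at all, the inner expression is null and to_number() simply returns null, so such a row would drop out of an inner join to test_2 rather than raise an error.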

+0

Thanks Alex. All your assumptions were right. Most of the problem lies with table test_3. I am editing the question to make it clearer. – user3165555

+0

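@user3165555 - your answer values are worse than I imagined. I have added something on how to extract the numbers you are actually interested in, the ones between ^ and ~. You can use the modified 'answer_extraction' CTE with the rest of my original code. –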

+0

Thanks Alex, I learned a lot in the process. – user3165555