2017-08-24 38 views
1

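SQL: find duplicates using several fields (no unique ID) [Solved]

I am trying to find duplicate vendors in a database, using multiple fields from the vendor table and the vendor_address table. The problem is that the more inner joins I add, the more potential results the query loses. Although there are no duplicate vendor IDs, I want to find vendors that look similar and are potential duplicates.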

This is my query so far:

SELECT 
    o.vendor_id 
    ,o.vndr_name_shrt_user 
    ,O.COUNTRY 
    ,O.VENDOR_NAME_SHORT 
    ,B.POSTAL 
    ,B.ADDRESS1 
    ,SAME_ADDRESS_NB 
    ,SAME_POSTAL_NB 
    ,OC.SAME_SHORT_NAME 
    ,oc.SAME_USER_NUM 
FROM VENDOR o 

JOIN vendor_addr B ON o.VENDOR_ID = B.VENDOR_ID 

INNER JOIN (
    SELECT vndr_name_shrt_user, COUNT(*) AS SAME_USER_NUM 
    FROM VENDOR 
    WHERE COUNTRY = 'CANADA' 
    AND VENDOR_STATUS = 'A' 
    GROUP BY vndr_name_shrt_user 
    HAVING COUNT(*) > 1 
) oc on o.vndr_name_shrt_user = oc.vndr_name_shrt_user 

INNER JOIN (SELECT VENDOR_NAME_SHORT, COUNT(*) AS SAME_SHORT_NAME 
    FROM VENDOR 
    WHERE COUNTRY = 'CANADA' 
    AND VENDOR_STATUS = 'A' 
    GROUP BY VENDOR_NAME_SHORT 
    HAVING COUNT(*) > 1 
) oc on o.VENDOR_NAME_SHORT = oc.VENDOR_NAME_SHORT 

INNER JOIN (SELECT POSTAL, COUNT(*) AS SAME_POSTAL_NB 
    FROM vendor_addr 
    WHERE COUNTRY = 'CANADA' 
    AND COUNTRY ='CANADA' 
    AND POSTAL != ' ' 
    GROUP BY POSTAL 
    HAVING COUNT(*) > 1 
) oc on b.POSTAL = oc.POSTAL 

INNER JOIN (SELECT ADDRESS1, COUNT(*) AS SAME_ADDRESS_NB 
    FROM ps_vendor_addr 
    WHERE COUNTRY = 'CANADA' 
    AND COUNTRY ='CANADA' 
    AND ADDRESS1 != ' ' 
    GROUP BY ADDRESS1 
    HAVING COUNT(*) > 1 
) oc on b.ADDRESS1 = oc.ADDRESS1 
WHERE O.COUNTRY ='CANADA' 
    AND B.COUNTY = 'CANADA'; 
+1

Why are you using inner joins? Use left outer joins wherever you don't want to lose data. – kazzi

+1

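Please provide a [MCVE] that includes the DDL statements for your tables, DML statements for some sample data, and the expected output for that data. – MT0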

+0

Thank you, this is a tough one – DangerKev

Answers

0

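Your joins look a bit off, for more than one reason. First, you are using inner joins, which eliminate every row except those that show all of the signs of duplication, and that is not what you want. Also, you use the same alias, oc, for all of the derived tables. That will not work here, and you won't get far with it.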

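Rather than doing it that way, I suggest repeating your base query once for each duplicate indicator, as follows (I removed the same_address_nb and same_postal_nb fields; you'll see why):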

select 
    o.vendor_id 
    ,o.vndr_name_shrt_user 
    ,O.COUNTRY 
    ,O.VENDOR_NAME_SHORT 
    ,B.POSTAL 
    ,B.ADDRESS1 
    -- OC.SAME_SHORT_NAME and oc.SAME_USER_NUM are left out:
    -- this query has no derived table aliased oc to supply them
from VENDOR o 
JOIN vendor_addr B ON o.VENDOR_ID = B.VENDOR_ID 
WHERE O.COUNTRY ='CANADA' 
AND B.COUNTY = 'CANADA' 
AND ... 

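For each of these duplicate indicators, add a nested query like the one below in place of the ellipsis shown above. This example looks for duplicates in vndr_name_shrt_user: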

select 
    o.vendor_id 
    ,o.vndr_name_shrt_user 
    ,O.COUNTRY 
    ,O.VENDOR_NAME_SHORT 
    ,B.POSTAL 
    ,B.ADDRESS1 
    -- OC.SAME_SHORT_NAME and oc.SAME_USER_NUM are left out:
    -- this query has no derived table aliased oc to supply them
    ,'SAME_USER_NUM' as duplicateFlag 
from VENDOR o 
JOIN vendor_addr B ON o.VENDOR_ID = B.VENDOR_ID 
WHERE O.COUNTRY ='CANADA' 
AND B.COUNTY = 'CANADA' 
AND o.vndr_name_shrt_user in 
(
    SELECT 
     vndr_name_shrt_user 
    FROM VENDOR 
    WHERE COUNTRY = o.country 
    AND VENDOR_STATUS = 'A' 
    GROUP BY vndr_name_shrt_user 
    HAVING COUNT(*) > 1 
) 

You can UNION ALL these queries together and then see all of the duplicates.

As a side note, the last two derived tables in your query check COUNTRY = 'CANADA' twice.
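For comparison, the single-query shape from the question can be kept if each derived table gets its own alias and the inner joins become left outer joins, as kazzi's comment suggests. This is only a sketch under assumptions: the aliases un, sn, pc and ad are invented here, vendor_addr is used for both address subqueries (the question mixes vendor_addr and ps_vendor_addr), and the address filter is written as b.country although the question's final filter reads B.COUNTY.

```sql
-- Sketch only: tables and columns are copied from the question.
-- Aliases un, sn, pc, ad are hypothetical names chosen for this example.
SELECT o.vendor_id,
       o.vndr_name_shrt_user,
       o.vendor_name_short,
       b.postal,
       b.address1,
       un.same_user_num,     -- NULL when the user short name is not shared
       sn.same_short_name,   -- NULL when the short name is not shared
       pc.same_postal_nb,    -- NULL when the postal code is not shared
       ad.same_address_nb    -- NULL when the address line is not shared
FROM vendor o
JOIN vendor_addr b ON o.vendor_id = b.vendor_id
LEFT JOIN (
    SELECT vndr_name_shrt_user, COUNT(*) AS same_user_num
    FROM vendor
    WHERE country = 'CANADA' AND vendor_status = 'A'
    GROUP BY vndr_name_shrt_user
    HAVING COUNT(*) > 1
) un ON o.vndr_name_shrt_user = un.vndr_name_shrt_user
LEFT JOIN (
    SELECT vendor_name_short, COUNT(*) AS same_short_name
    FROM vendor
    WHERE country = 'CANADA' AND vendor_status = 'A'
    GROUP BY vendor_name_short
    HAVING COUNT(*) > 1
) sn ON o.vendor_name_short = sn.vendor_name_short
LEFT JOIN (
    SELECT postal, COUNT(*) AS same_postal_nb
    FROM vendor_addr              -- assumption: same table as alias b
    WHERE country = 'CANADA' AND postal != ' '
    GROUP BY postal
    HAVING COUNT(*) > 1
) pc ON b.postal = pc.postal
LEFT JOIN (
    SELECT address1, COUNT(*) AS same_address_nb
    FROM vendor_addr              -- assumption: the question uses ps_vendor_addr here
    WHERE country = 'CANADA' AND address1 != ' '
    GROUP BY address1
    HAVING COUNT(*) > 1
) ad ON b.address1 = ad.address1
WHERE o.country = 'CANADA'
  AND b.country = 'CANADA'        -- assumption: the question writes B.COUNTY
  -- keep a row if at least one duplicate indicator matched
  AND COALESCE(un.same_user_num, sn.same_short_name,
               pc.same_postal_nb, ad.same_address_nb) IS NOT NULL;
```

With left joins, a vendor that matches on only one field is still returned, and the unmatched count columns are simply NULL.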

UPDATE: showing more than one duplicate flag

select 
    o.vendor_id 
    ,o.vndr_name_shrt_user 
    ,O.COUNTRY 
    ,O.VENDOR_NAME_SHORT 
    ,B.POSTAL 
    ,B.ADDRESS1 
    -- OC.SAME_SHORT_NAME and oc.SAME_USER_NUM are left out:
    -- this query has no derived table aliased oc to supply them
    ,'SAME_USER_NUM' as duplicateFlag
from VENDOR o
JOIN vendor_addr B ON o.VENDOR_ID = B.VENDOR_ID
WHERE O.COUNTRY ='CANADA'
AND B.COUNTY = 'CANADA'
AND o.vndr_name_shrt_user in
(
    SELECT
     vndr_name_shrt_user
    FROM VENDOR
    WHERE COUNTRY = o.country
    AND VENDOR_STATUS = 'A'
    GROUP BY vndr_name_shrt_user
    HAVING COUNT(*) > 1
)

UNION ALL

select
    o.vendor_id
    ,o.vndr_name_shrt_user
    ,O.COUNTRY
    ,O.VENDOR_NAME_SHORT
    ,B.POSTAL
    ,B.ADDRESS1
    -- same select list as the first branch, so the UNION ALL columns line up
    ,'VENDOR_NAME_SHORT' as duplicateFlag 
from VENDOR o 
JOIN vendor_addr B ON o.VENDOR_ID = B.VENDOR_ID 
WHERE O.COUNTRY ='CANADA' 
AND B.COUNTY = 'CANADA' 
AND o.VENDOR_NAME_SHORT in 
(
    SELECT 
     VENDOR_NAME_SHORT 
    FROM VENDOR 
    WHERE COUNTRY = o.country 
    AND VENDOR_STATUS = 'A' 
    GROUP BY VENDOR_NAME_SHORT 
    HAVING COUNT(*) > 1 
) 
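If the UNION ALL output lists the same vendor once per flag and that is more rows than wanted, one option is to roll the flags up into a single row per vendor. The following is a minimal sketch, assuming Oracle 11gR2 or later (for LISTAGG and CTE column lists). The sample rows, the CTE name flagged and the column duplicate_flags are invented for illustration, and the country and status filters are omitted to keep it short.

```sql
-- Hypothetical sample data standing in for the VENDOR table.
WITH vendor (vendor_id, vndr_name_shrt_user, vendor_name_short) AS (
    SELECT 1, 'ACME',  'ACME INC'  FROM DUAL UNION ALL
    SELECT 2, 'ACME',  'ACME INC'  FROM DUAL UNION ALL
    SELECT 3, 'BOLT',  'ACME INC'  FROM DUAL UNION ALL
    SELECT 4, 'CEDAR', 'CEDAR LTD' FROM DUAL
),
-- One row per (vendor, duplicate indicator), as in the UNION ALL above.
flagged AS (
    SELECT vendor_id, 'SAME_USER_NUM' AS duplicateFlag
    FROM vendor
    WHERE vndr_name_shrt_user IN (
        SELECT vndr_name_shrt_user
        FROM vendor
        GROUP BY vndr_name_shrt_user
        HAVING COUNT(*) > 1
    )
    UNION ALL
    SELECT vendor_id, 'VENDOR_NAME_SHORT' AS duplicateFlag
    FROM vendor
    WHERE vendor_name_short IN (
        SELECT vendor_name_short
        FROM vendor
        GROUP BY vendor_name_short
        HAVING COUNT(*) > 1
    )
)
-- Collapse to one row per vendor, listing every flag that matched.
SELECT vendor_id,
       LISTAGG(duplicateFlag, ', ')
           WITHIN GROUP (ORDER BY duplicateFlag) AS duplicate_flags
FROM flagged
GROUP BY vendor_id
ORDER BY vendor_id;
```

On this sample data, vendors 1 and 2 carry both flags, vendor 3 carries only VENDOR_NAME_SHORT, and vendor 4 does not appear because it shares neither name.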
+0

Since there is only one duplicate flag, is the query complete as it is, or should I create 'SAME_USER_NUM' as duplicateFlag2? – DangerKev

+0

You would put the different duplicate flags in the last column. I'll add an example – Eli

+0

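In the updated query, should I remove OC.SAME_SHORT_NAME and oc.SAME_USER_NUM, since I created them in the original query? I also get a "too many results" error. Thanks a lot, by the way – DangerKev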

0

Let's take some interesting data with chained duplicates on different attributes:

CREATE TABLE data (ID, A, B, C) AS 
    SELECT 1, 1, 1, 1 FROM DUAL UNION ALL -- Related to #2 on column A 
    SELECT 2, 1, 2, 2 FROM DUAL UNION ALL -- Related to #1 on column A, #3 on B & C, #5 on C 
    SELECT 3, 2, 2, 2 FROM DUAL UNION ALL -- Related to #2 on columns B & C, #5 on C 
    SELECT 4, 3, 3, 3 FROM DUAL UNION ALL -- Related to #5 on column A 
    SELECT 5, 3, 4, 2 FROM DUAL UNION ALL -- Related to #2 and #3 on column C, #4 on A 
    SELECT 6, 5, 5, 5 FROM DUAL;   -- Unrelated 

Now we can get some of the relationships using analytic functions (without any joins):

SELECT d.*, 
     LEAST(
     FIRST_VALUE(id) OVER (PARTITION BY a ORDER BY id), 
     FIRST_VALUE(id) OVER (PARTITION BY b ORDER BY id), 
     FIRST_VALUE(id) OVER (PARTITION BY c ORDER BY id) 
     ) AS duplicate_of 
FROM data d; 

Which gives:

ID A B C DUPLICATE_OF
-- - - - ------------
 1 1 1 1            1
 2 1 2 2            1
 3 2 2 2            2
 4 3 3 3            4
 5 3 4 2            2
 6 5 5 5            6

However, this does not pick up that #4 is related to #5, which is related to #2 and then to #1 ...

This can be found with a hierarchical query:

SELECT id, a, b, c, 
     CONNECT_BY_ROOT(id) AS duplicate_of 
FROM data 
CONNECT BY NOCYCLE (PRIOR a = a OR PRIOR b = b OR PRIOR c = c); 

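However, this produces many, many duplicate rows (because it does not know where to start the hierarchy, it takes each row in turn as a root). Instead, you can use the first query to give the hierarchical query its starting points: the rows where the ID and DUPLICATE_OF values are the same: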

SELECT id, a, b, c, 
     CONNECT_BY_ROOT(id) AS duplicate_of 
FROM (
    SELECT d.*, 
     LEAST(
      FIRST_VALUE(id) OVER (PARTITION BY a ORDER BY id), 
      FIRST_VALUE(id) OVER (PARTITION BY b ORDER BY id), 
      FIRST_VALUE(id) OVER (PARTITION BY c ORDER BY id) 
     ) AS duplicate_of 
    FROM data d 
) 
START WITH id = duplicate_of 
CONNECT BY NOCYCLE (PRIOR a = a OR PRIOR b = b OR PRIOR c = c); 

Which gives:

ID A B C DUPLICATE_OF
-- - - - ------------
 1 1 1 1            1
 2 1 2 2            1
 3 2 2 2            1
 4 3 3 3            1
 5 3 4 2            1
 1 1 1 1            4
 2 1 2 2            4
 3 2 2 2            4
 4 3 3 3            4
 5 3 4 2            4
 6 5 5 5            6

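There are still some duplicated rows, because #4 is a local minimum and is therefore also used as a starting point for the search ... These can be removed with a simple GROUP BY: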

SELECT id, a, b, c,
       MIN(duplicate_of) AS duplicate_of
FROM (
    SELECT id, a, b, c,
           CONNECT_BY_ROOT(id) AS duplicate_of
    FROM (
        SELECT d.*,
               LEAST(
                   FIRST_VALUE(id) OVER (PARTITION BY a ORDER BY id),
                   FIRST_VALUE(id) OVER (PARTITION BY b ORDER BY id),
                   FIRST_VALUE(id) OVER (PARTITION BY c ORDER BY id)
               ) AS duplicate_of
        FROM data d
    )
    START WITH id = duplicate_of
    CONNECT BY NOCYCLE (PRIOR a = a OR PRIOR b = b OR PRIOR c = c)
)
GROUP BY id, a, b, c;

This gives the output:

ID A B C DUPLICATE_OF
-- - - - ------------
 1 1 1 1            1
 2 1 2 2            1
 3 2 2 2            1
 4 3 3 3            1
 5 3 4 2            1
 6 5 5 5            6
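The per-field counts that the question asks for (SAME_USER_NUM and similar) can probably be produced with the same join-free analytic style. The sketch below repeats the sample rows from this answer in a CTE so that it runs on its own. The column names same_a, same_b and same_c are invented for illustration, and CTE column lists assume Oracle 11gR2 or later.

```sql
-- Same six sample rows as the DATA table above, inlined as a CTE.
WITH data (id, a, b, c) AS (
    SELECT 1, 1, 1, 1 FROM DUAL UNION ALL
    SELECT 2, 1, 2, 2 FROM DUAL UNION ALL
    SELECT 3, 2, 2, 2 FROM DUAL UNION ALL
    SELECT 4, 3, 3, 3 FROM DUAL UNION ALL
    SELECT 5, 3, 4, 2 FROM DUAL UNION ALL
    SELECT 6, 5, 5, 5 FROM DUAL
)
SELECT d.*,
       -- number of OTHER rows sharing this row's value in each column
       COUNT(*) OVER (PARTITION BY a) - 1 AS same_a,
       COUNT(*) OVER (PARTITION BY b) - 1 AS same_b,
       COUNT(*) OVER (PARTITION BY c) - 1 AS same_c
FROM data d
ORDER BY id;
```

A row with 0 in all three columns (here only #6) has no direct match on any field. Unlike the hierarchical query, this shows only direct matches, not chains such as #4 to #5 to #2.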
+0

Trying to work through it now. Thanks a lot – DangerKev

+0

The process takes a lot of time – DangerKev

+0

SELECT VENDOR_ID,VENDOR_NAME_SHORT,VNDR_NAME_SHRT_USR,NAME1, MIN(duplicate_of)AS duplicate_of FROM( SELECT VENDOR_ID,VENDOR_NAME_SHORT,VNDR_NAME_SHRT_USR,NAME1, CONNECT_BY_ROOT(VENDOR_ID )AS duplicate_of FROM(SELECT D. *, – DangerKev