SQL：组装非重叠集合

我有一组连续整数，由type组织，编号为table1。所有值在1和10之间，包括在内。SQL：组装非重叠集合

table1: 
row_id set_id type min_value max_value 
1  1  a  1   3 
2  2  a  4   10 
3  3  a  6   10 
4  4  a  2   5 
5  5  b  1   9 
6  6  c  1   7 
7  7  c  3   10 
8  8  d  1   2 
9  9  d  3   3 
10  10  d  4   5 
11  11  d  7   10

在table2，每个type内，我想组装所有可能最大，非重叠集合（尽管这不能由任何set S上的正确type的填充间隙是没关系）。期望的输出：

table2: 
row_id type group_id set_id 
1  a  1   1 
2  a  1   2 
3  a  2   1 
4  a  2   3 
5  a  3   3 
6  a  3   4 
7  b  4   5 
8  c  5   6 
9  c  6   7 
10  d  7   8 
11  d  7   9 
12  d  7   10 
13  d  7   11

我目前的想法是使用的事实，有一个可能的数量有限的事实。步骤：

查找table1含价值1所有集合。将它们复制到table2。
找到table1中的所有组，包含值2，而不是在table2中。
加入从套（2）与上table1type，set_id，以及具有min_value大于group的最大max_value。
对于未加入（3）的（2）的集合，将它们插入到table2中。这些将开始新的group，稍后可能会延长。
对value s 3到10重复步骤（2）到（4）。

我认为这会工作，但它有很多的痛苦中最对接的步骤，尤其是对（2） - 寻找套不table2，以及（4） - 找到设置没有加入。

你知道更快，更有效的方法吗？我的实际数据有数百万个set s，成千上万个type s和数百个value（尽管幸运的是，如示例中那样，value是有界的），所以可伸缩性是必不可少的。

我正在使用PLSQL Developer与Oracle 10g（不像我之前所说 - 谢谢IT部门）。谢谢！

来源

2014-11-14 esa606

你的例子只有有两个元素的结果。可以有两个以上吗？ – 2014-11-14 19:32:36

是的！这只是为了简单。可能是从'1'到'10'的每个“值”都是一个单独的集合，因此是一个单独的元素。我会更新示例数据以显示此内容。 – esa606 2014-11-14 19:35:18

我认为你的过程很好，除了第1步和第2步。你可以用table1中的每一组初始化表。 – 2014-11-14 19:47:25

对于Oracle 10g，您不能使用递归CTE，但通过一些工作，您可以使用类似于connect by语法的方法。首先，你需要生成具有所有非重叠的链接，你可以做一个CTE或在线查看：

select t1.type, t1.set_id, t1.min_value, t1.max_value, 
    t2.set_id as next_set_id, t2.min_value as next_min_value, 
    t2.max_value as next_max_value, 
    row_number() over (order by t1.type, t1.set_id, t2.set_id) as group_id 
from table1 t1 
left join table1 t2 on t2.type = t1.type 
and t2.min_value > t1.max_value 
where not exists (
    select 1 
    from table1 t4 
    where t4.type = t1.type 
    and t4.min_value > t1.max_value 
    and t4.max_value < t2.min_value 
) 
order by t1.type, group_id, t1.set_id, t2.set_id;

这花了一个实验位，它当然有可能我已经错过或丢失有关过程中的规则的一些事情;但是，让你12的伪行，在我以前的答案，这允许开始a/1两个独立链同时限制d值的单链应遵循：

TYPE SET_ID MIN_VALUE MAX_VALUE NEXT_SET_ID NEXT_MIN_VALUE NEXT_MAX_VALUE GROUP_ID 
---- ------ ---------- ---------- ----------- -------------- -------------- -------- 
a   1   1   3   2    4    10  1 
a   1   1   3   3    6    10  2 
a   2   4   10             3 
a   3   6   10             4 
a   4   2   5   3    6    10  5 
b   5   1   9             6 
c   6   1   7             7 
c   7   3   10             8 
d   8   1   2   9    3    3  9 
d   9   3   3   10    4    5  10 
d  10   4   5   11    7    10  11 
d  11   7   10             12

这可以作为CTE;查询该用连接-通过循环：

with t as (
    ... -- same as above query 
) 
select t1.type, 
    dense_rank() over (partition by null 
    order by connect_by_root group_id) as group_id, 
    t1.set_id 
from t t1 
connect by type = prior type 
and set_id = prior next_set_id 
start with not exists (
    select 1 from table1 t2 
    where t2.type = t1.type 
    and t2.max_value < t1.min_value 
) 
and not exists (
    select 1 from t t3 
    where t3.type = t1.type 
    and t3.next_max_value < t1.next_min_value 
) 
order by t1.type, group_id, t1.min_value;

的dense_rank()使组ID相邻的;不确定你是否真的需要这些，或者他们的顺序很重要，所以它是真正的可选项。 connect_by_root给出链的起始组ID，因此虽然初始查询中有12行和12 group_id值，但它们并不全部出现在最终结果中。

连接通过两个prior值，类型和在初始查询中找到的下一个集ID。这创造了所有的连锁店，但拥有自己的连锁店也将包括更短的连锁店 - 对于d，你会看到8,9,10,11，而且还有9,10,11和10,11，您不希望它们作为单独的组。这些被start with条件消除，这可能会被简化。

这给：

TYPE GROUP_ID SET_ID 
---- -------- ------ 
a   1  1 
a   1  2 
a   2  1 
a   2  3 
a   3  4 
a   3  3 
b   4  5 
c   5  6 
c   6  7 
d   7  8 
d   7  9 
d   7  10 
d   7  11

SQL Fiddle demo。

来源

2014-11-18 10:28:05

辉煌，优雅，容易缩放到我的大型数据集。非常感谢！ – esa606 2014-11-18 17:58:16

如果您可以识别所有组及其起始set_id，那么您可以使用递归方法并在单个语句中执行此操作，而不是需要迭代填充表。但是，为了提高速度/效率和资源消耗，您需要对这两种方法进行基准测试 - 无论是针对您的数据量还是在系统的可用资源范围内进行扩展，都需要进行验证。

如果我的理解，当你决定开始您就可以使用查询一次识别所有这些新组：

with t as (
    select t1.type, t1.set_id, t1.min_value, t1.max_value, 
    t2.set_id as next_set_id, t2.min_value as next_min_value, 
    t2.max_value as next_max_value 
    from table1 t1 
    left join table1 t2 on t2.type = t1.type and t2.min_value > t1.max_value 
    where not exists (
    select 1 
    from table1 t3 
    where t3.type = t1.type 
    and t3.max_value < t1.min_value 
) 
) 
select t.type, t.set_id, t.min_value, t.max_value, 
    t.next_set_id, t.next_min_value, t.next_max_value, 
    row_number() over (order by t.type, t.min_value, t.next_min_value) as grp_id 
from t 
where not exists (
    select 1 from t t2 
    where t2.type = t.type 
    and t2.next_max_value < t.next_min_value 
) 
order by grp_id;

棘手位这里获得所有三组a，特别是二组以set_id = 1开头，但d只有一个组。内部select（在CTE中）通过not exists子句查找没有较低非重叠范围的集合，并且外连接到同一个表以获取不重叠的下一个集合，为您提供了两个以set_id = 1开头的组，还有四个以set_id = 9开头。然后，外部选择将忽略除第二个子句以外的最低非重叠项 - 但不必再次打到真正的表。

所以，让你：

TYPE SET_ID MIN_VALUE MAX_VALUE NEXT_SET_ID NEXT_MIN_VALUE NEXT_MAX_VALUE GRP_ID 
---- ------ ---------- ---------- ----------- -------------- -------------- ------ 
a   1   1   3   2    4    10  1 
a   1   1   3   3    6    10  2 
a   4   2   5   3    6    10  3 
b   5   1   9            4 
c   6   1   7            5 
c   7   3   10            6 
d   8   1   2   9    3    3  7

然后，您可以使用它作为锚定构件在recursive subquery factoring clause：

with t as (
    ... 
), 
r (type, set_id, min_value, max_value, 
    next_set_id, next_min_value, next_max_value, grp_id) as (
    select t.type, t.set_id, t.min_value, t.max_value, 
    t.next_set_id, t.next_min_value, t.next_max_value, 
    row_number() over (order by t.type, t.min_value, t.next_min_value) 
    from t 
    where not exists (
    select 1 from t t2 
    where t2.type = t.type 
    and t2.next_max_value < t.next_min_value 
) 
    ...

如果你离开了r CTE与和刚刚做你” d得到相同的七个小组。

递归部件然后使用从该查询以各组的下一个成员的下set_id其范围，并且重复外连接/不存在查找找到下一个（多个）集合再次;停车时没有下一组不重叠：

... 
    union all 
    select r.type, r.next_set_id, r.next_min_value, r.next_max_value, 
    t.set_id, t.min_value, t.max_value, r.grp_id 
    from r 
    left join table1 t 
    on t.type = r.type 
    and t.min_value > r.next_max_value 
    and not exists (
    select 1 from table1 t2 
    where t2.type = r.type 
    and t2.min_value > r.next_max_value 
    and t2.max_value < t.min_value 
) 
    where r.next_set_id is not null -- to stop looking when you reach a leaf node 
) 
...

最后您有一个基于递归CTE得到你想要的列和指定的顺序查询：

... 
select r.type, r.grp_id, r.set_id 
from r 
order by r.type, r.grp_id, r.min_value;

它得到：

TYPE  GRP_ID  SET_ID 
---- ---------- ---------- 
a    1   1 
a    1   2 
a    2   1 
a    2   3 
a    3   4 
a    3   3 
b    4   5 
c    5   6 
c    6   7 
d    7   8 
d    7   9 
d    7   10 
d    7   11

SQL Fiddle demo。

如果您想要显示每组的最小/最大值，并且可以跟踪并显示每组的最小/最大值。我只是显示了问题的列，但是。

来源

2014-11-17 15:14:49

这看起来非常棒！不幸的是，经过几个小时的努力，我意识到我的IT在版本上是错误的 - 真的是10g。有什么办法可以将它转换为10g，因为递归子查询因子是11gR2的新增功能？ – esa606 2014-11-17 21:17:41

SQL：组装非重叠集合

回答

相关问题