PostgreSQL - order by number of matches

I am trying to get a list of the best-matching items for a given list of tags. With the following data:

DROP TABLE IF EXISTS testing_items; 
CREATE TEMP TABLE testing_items(
    id bigserial primary key, 
    tags text[] 
); 
CREATE INDEX ON testing_items using gin (tags); 

INSERT INTO testing_items (tags) VALUES ('{123,456, abc}'); 
INSERT INTO testing_items (tags) VALUES ('{222,333}'); 
INSERT INTO testing_items (tags) VALUES ('{222,555}'); 
INSERT INTO testing_items (tags) VALUES ('{222,123}'); 
INSERT INTO testing_items (tags) VALUES ('{222,123,555,666}');

I have the tags 222, 555 and 666. How can I get a list like this?

PS: it must use the GIN index, because there will be a large number of records.

id   matches 
--   ------- 
5   3 
3   2 
2   1 
4   1

Edit: id 1 should not appear in the list, because it does not match any of the tags:

1   0

Source

2017-02-28 Eduardo

Check it here: http://rextester.com/UTGO74511

If you are using a GIN index, use &&:

select * 
from testing_items 
where not (ARRAY['333','555','666'] && tags); 


 id | tags
----+---------------
  1 | {123,456,abc}
  4 | {222,123}


Source

2017-02-28 23:18:57 McNets

This only returns items that match all of the tags – Eduardo

Sorry, I misunderstood the question. – McNets

Take a look: http://stackoverflow.com/a/24330181/3270427 – McNets
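As the comments above point out, the `NOT (... && tags)` query does not answer the question. The overlap operator can still be useful as a prefilter, though. The following is only a sketch, assuming the `testing_items` table from the question: `&&` is one of the array operators the default GIN operator class supports, so it can narrow the rows down to those sharing at least one tag, and the matches are then counted only for those rows. The column alias `tag` is introduced here for illustration.

```sql
-- Sketch: prefilter with && (GIN-indexable), then count matching tags per row.
SELECT id,
       (SELECT count(*)
        FROM unnest(tags) AS t(tag)
        WHERE tag = ANY (ARRAY['222','555','666'])) AS matches
FROM testing_items
WHERE tags && ARRAY['222','555','666']   -- keeps only rows with at least one match
ORDER BY matches DESC, id;
```

Note that a row with the same tag stored twice would be counted twice by this version. Whether the planner actually picks the index depends on table size and statistics, so it is worth checking with EXPLAIN.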

Unnest the tags, filter the unnested elements and aggregate the remaining ones:

select id, count(distinct u) as matches 
from (
    select id, u 
    from testing_items, 
    lateral unnest(tags) u 
    where u in ('222', '555', '666') 
    ) s 
group by 1 
order by 2 desc 

id | matches 
----+--------- 
    5 |  3 
    3 |  2 
    2 |  1 
    4 |  1 
(4 rows)

Considering all of the answers, it seems that this query combines the good sides of each of them:

select id, count(*) 
from testing_items, 
unnest(array['11','5','8']) u 
where tags @> array[u] 
group by id 
order by 2 desc, 1;

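It performs best in Eduardo's test.

Source

2017-02-28 23:21:17 klin

I believe the IN clause will not use the GIN index, because the @> operator is missing – Eduardo

Right, to use the index you have to reverse the comparison, as in paqash's version (nice try). – klin

Here are my two cents, using unnest and array contains: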

select id, count(*) 
from (
    select unnest(array['222','555','666']) as tag, * 
    from testing_items 
) as w 
where tags @> array[tag] 
group by id 
order by 2 desc

Result:

+------+---------+
| id   | count   |
|------+---------|
| 5    | 3       |
| 3    | 2       |
| 2    | 1       |
| 4    | 1       |
+------+---------+

Source

2017-02-28 23:33:03 paqash

Here is my test with 10 million records, each with 3 tags that are random numbers between 0 and 100:

BEGIN; 
LOCK TABLE testing_items IN EXCLUSIVE MODE; 
INSERT INTO testing_items (tags) SELECT (ARRAY[trunc(random() * 99 + 1), trunc(random() * 99 + 1), trunc(random() * 99 + 1)]) FROM generate_series(1, 10000000) s; 
COMMIT;

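I added ORDER BY c DESC, id LIMIT 5 so as not to wait for a large response.

The @paqash and @klin solutions have similar performance. On my laptop they run in 12 seconds with the tags 11, 8 and 5, but this one runs in 4.6 seconds: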

SELECT id, count(*) as c 
FROM (
SELECT id FROM testing_items WHERE tags @> '{11}' 
UNION ALL 
SELECT id FROM testing_items WHERE tags @> '{8}' 
UNION ALL 
SELECT id FROM testing_items WHERE tags @> '{5}' 
) as items 
GROUP BY id 
ORDER BY c DESC, id 
LIMIT 5

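But I still think there is a faster way.

Source

2017-03-01 02:26:14 Eduardo

I think you found the right way. Unnest is expensive, while simple queries with a constant on the right-hand side make full use of the GIN index. – klin

@> does not use the GIN index – McNets

EXPLAIN says it does – Eduardo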
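For anyone who wants to settle the question above for their own data, here is a minimal sketch of how to check it, assuming the `testing_items` table and the large test load from Eduardo's answer. Temporary tables are not processed by autovacuum, so statistics are refreshed by hand first.

```sql
-- Temp tables are not analyzed automatically; refresh planner statistics first.
ANALYZE testing_items;

-- Show the actual plan and timing for one of the branches of the UNION ALL query.
EXPLAIN (ANALYZE, BUFFERS)
SELECT id
FROM testing_items
WHERE tags @> '{11}';
```

If the GIN index is used, the plan should contain a Bitmap Index Scan on that index. On the five-row sample from the question the planner will most likely prefer a sequential scan, so the check is only meaningful on a table of realistic size.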
