2017-08-31 71 views
3

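Delete duplicate rows but keep many-to-many relationships

I store the results of my TensorFlow image classifier in a SQL database. I have 3 tables: images, terms (categories), and a table that links the two and carries a weight column. Some images have no relationships and some have many.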

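The problem is that the images table contains duplicate rows that need to be deleted. However, if a duplicated image has one or more relationships, I need to keep those many-to-many relationships.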

Here is an example:

Table name: my_images

+----+------------+----------------+
| ID | image_path | image_filename |
+----+------------+----------------+
|  1 | Film 1     | Film 1 001.jpg |
|  2 | Film 1     | Film 1 001.jpg |
|  3 | Film 1     | Film 1 002.jpg |
|  4 | Film 1     | Film 1 002.jpg |
|  5 | Film 1     | Film 1 003.jpg |
|  6 | Film 1     | Film 1 003.jpg |
+----+------------+----------------+

Table name: my_terms

+---------+------------+
| term_id | term_name  |
+---------+------------+
|       1 | cat        |
|       2 | dog        |
|       3 | automobile |
+---------+------------+

Table name: my_term_relationships

+----------+---------+---------+
| image_id | term_id | weight  |
+----------+---------+---------+
|        2 |       1 | 0.58516 |
|        2 |       3 | 0.16721 |
|        3 |       2 | 0.21475 |
+----------+---------+---------+

So in this example, the ideal result is to delete rows 1 and 4, plus either row 5 or row 6, from my_images.

+0

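It has been a long time since I wrote real SQL queries, so I won't post an answer. I would first create a query that deletes the duplicates, like the second most upvoted answer here: https://stackoverflow.com/questions/4685173/delete-all-duplicate-rows-except-for-one-in-mysql Then I would add a subquery requiring that the selected ID exists in your my_term_relationships. Hope it helps – Logar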

+0

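By the way, is it possible that the same image_filename is referenced under different IDs in 'my_term_relationships'? If so, my suggestion above won't work. In that case I suggest cleaning up the 'my_term_relationships' table first, so that each image_filename has only one image_id in that table. Then my comment above would apply, I think – Logar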

Answers

0

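You need to query for two sets of image IDs and use them as filters. Assuming that image_path and image_filename together are what should be UNIQUE:

  1. All my_images IDs that are not referenced in my_term_relationships, even though the corresponding image_path + image_filename pair may be referenced through another ID.
  2. One unique ID for each image_path + image_filename pair that is not referenced in my_term_relationships at all.

Take a look at this query: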

DELETE FROM my_images
WHERE
    ID NOT IN (SELECT DISTINCT image_id FROM my_term_relationships) -- 1
    AND
    ID NOT IN (
        SELECT id FROM (
            SELECT MIN(ID) AS id
            FROM my_images
            LEFT JOIN my_term_relationships ON ID = image_id
            GROUP BY image_path, image_filename
            HAVING COUNT(image_id) = 0
        ) AS u_ids -- 2
    );
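
To try the query above without touching a real database, here is a minimal sketch that loads the sample rows from the question into an in-memory SQLite database through Python's sqlite3 module. SQLite stands in for MySQL here, and the column types are assumptions (the question does not give the schema). SQLite does not have MySQL's "target table" restriction, so the extra derived-table wrapper is left out.

```python
import sqlite3

# In-memory stand-in for the real database; column types are assumed.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE my_images (ID INTEGER PRIMARY KEY, image_path TEXT, image_filename TEXT);
CREATE TABLE my_term_relationships (image_id INTEGER, term_id INTEGER, weight REAL);
INSERT INTO my_images VALUES
    (1, 'Film 1', 'Film 1 001.jpg'), (2, 'Film 1', 'Film 1 001.jpg'),
    (3, 'Film 1', 'Film 1 002.jpg'), (4, 'Film 1', 'Film 1 002.jpg'),
    (5, 'Film 1', 'Film 1 003.jpg'), (6, 'Film 1', 'Film 1 003.jpg');
INSERT INTO my_term_relationships VALUES
    (2, 1, 0.58516), (2, 3, 0.16721), (3, 2, 0.21475);
""")

# Keep every referenced ID (filter 1) and the lowest ID of each pair that
# has no references at all (filter 2); delete everything else.
conn.execute("""
DELETE FROM my_images
WHERE ID NOT IN (SELECT DISTINCT image_id FROM my_term_relationships)
  AND ID NOT IN (
      SELECT MIN(ID)
      FROM my_images
      LEFT JOIN my_term_relationships ON ID = image_id
      GROUP BY image_path, image_filename
      HAVING COUNT(image_id) = 0
  )
""")

remaining_ids = [row[0] for row in conn.execute("SELECT ID FROM my_images ORDER BY ID")]
print(remaining_ids)  # prints [2, 3, 5]
```

On the sample data this removes rows 1, 4 and 6, which matches the result asked for in the question. One caveat: if image_id can be NULL in my_term_relationships, the first NOT IN never evaluates to true and nothing is deleted, so such rows would need to be filtered out first.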

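Note that you have to wrap the my_images subquery used in the DELETE's WHERE clause in another subquery (a derived table). This thread explains why: Can't specify target table for update in FROM clause

Example: removing duplicate rows from my_term_relationships (sqlfiddle)

Example UPDATE query: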

UPDATE my_term_relationships
SET image_id = (
    SELECT MIN(my_images.ID)
    FROM my_images
    JOIN my_images AS ref_image
      ON my_images.image_path = ref_image.image_path
     AND my_images.image_filename = ref_image.image_filename
    -- correlate with the image_id of the row being updated
    WHERE ref_image.ID = my_term_relationships.image_id
);
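
As a rough check of what this UPDATE does, the sketch below runs it in an in-memory SQLite database (again a stand-in for MySQL, with assumed column types) on the variant of the sample data that Logar suggests in the comments, where image IDs 1 and 2 of the same file are both referenced. The aliases first_img and ref_img are only illustrative names.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE my_images (ID INTEGER PRIMARY KEY, image_path TEXT, image_filename TEXT);
CREATE TABLE my_term_relationships (image_id INTEGER, term_id INTEGER, weight REAL);
INSERT INTO my_images VALUES
    (1, 'Film 1', 'Film 1 001.jpg'), (2, 'Film 1', 'Film 1 001.jpg'),
    (3, 'Film 1', 'Film 1 002.jpg'), (4, 'Film 1', 'Film 1 002.jpg'),
    (5, 'Film 1', 'Film 1 003.jpg'), (6, 'Film 1', 'Film 1 003.jpg');
-- IDs 1 and 2 are duplicates of each other and both are referenced.
INSERT INTO my_term_relationships VALUES
    (1, 1, 0.58516), (2, 3, 0.16721), (3, 2, 0.21475);
""")

# Re-point every relationship at the lowest ID that shares the same
# image_path + image_filename as the image it currently references.
conn.execute("""
UPDATE my_term_relationships
SET image_id = (
    SELECT MIN(first_img.ID)
    FROM my_images AS first_img
    JOIN my_images AS ref_img
      ON first_img.image_path = ref_img.image_path
     AND first_img.image_filename = ref_img.image_filename
    WHERE ref_img.ID = my_term_relationships.image_id
)
""")

relationships = conn.execute(
    "SELECT image_id, term_id FROM my_term_relationships ORDER BY rowid"
).fetchall()
print(relationships)  # prints [(1, 1), (1, 3), (3, 2)]
```

After the update both relationships of 'Film 1 001.jpg' point at ID 1, so ID 2 is no longer referenced and can be deleted as a plain duplicate. Note that a relationship whose image_id matches no row in my_images would be set to NULL by this statement, so it may be worth checking for such rows beforehand.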
+0

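After running this query I still have some duplicate image_path + image_filename pairs. Maybe I have rows in my_term_relationships that point to duplicate images. Is there a way to merge these?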

+0

Then you need to run an UPDATE on my_term_relationships before deleting the rows.

+0

I updated the fiddle: http://sqlfiddle.com/#!9/5c4e3e/3

1

Approach this step by step.

First, find the duplicated entries:

SELECT
    image_path, image_filename
FROM my_images
GROUP BY image_path, image_filename
HAVING COUNT(*) > 1

Second, get all rows that belong to one of those duplicates:

SELECT mi.*
FROM my_images mi
JOIN (
    SELECT
        image_path, image_filename
    FROM my_images
    GROUP BY image_path, image_filename
    HAVING COUNT(*) > 1
) dups ON mi.image_path = dups.image_path AND mi.image_filename = dups.image_filename

Finally, get the IDs to delete, that is, the lowest unreferenced ID in each group of duplicates:

SELECT MIN(ID)
FROM my_images mi
JOIN (
    SELECT
        image_path, image_filename
    FROM my_images
    GROUP BY image_path, image_filename
    HAVING COUNT(*) > 1
) dups ON mi.image_path = dups.image_path AND mi.image_filename = dups.image_filename
LEFT JOIN my_term_relationships mtr ON mi.ID = mtr.image_id
WHERE mtr.image_id IS NULL
GROUP BY mi.image_path, mi.image_filename
HAVING COUNT(*) > 0

Check against your real data that the result is correct. If it is, turn the query into a DELETE statement.

DELETE my_images.* FROM my_images
JOIN (
    SELECT MIN(ID) AS ID
    FROM my_images mi
    JOIN (
        SELECT
            image_path, image_filename
        FROM my_images
        GROUP BY image_path, image_filename
        HAVING COUNT(*) > 1
    ) dups ON mi.image_path = dups.image_path AND mi.image_filename = dups.image_filename
    LEFT JOIN my_term_relationships mtr ON mi.ID = mtr.image_id
    WHERE mtr.image_id IS NULL
    GROUP BY mi.image_path, mi.image_filename
    HAVING COUNT(*) > 0
) sq USING(ID);
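
Because the subquery returns only one ID (the MIN) per group, a single run removes at most one row from each group of duplicates. A file that appears three or more times therefore needs the statement to be run repeatedly. The sketch below illustrates that with an in-memory SQLite database (a stand-in for MySQL, with assumed column types). SQLite has no DELETE ... JOIN, so the same subquery is used with WHERE ID IN (...) instead. Row 7 is a hypothetical third copy that is not in the question's data.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE my_images (ID INTEGER PRIMARY KEY, image_path TEXT, image_filename TEXT);
CREATE TABLE my_term_relationships (image_id INTEGER, term_id INTEGER, weight REAL);
INSERT INTO my_images VALUES
    (1, 'Film 1', 'Film 1 001.jpg'), (2, 'Film 1', 'Film 1 001.jpg'),
    (3, 'Film 1', 'Film 1 002.jpg'), (4, 'Film 1', 'Film 1 002.jpg'),
    (5, 'Film 1', 'Film 1 003.jpg'), (6, 'Film 1', 'Film 1 003.jpg'),
    (7, 'Film 1', 'Film 1 003.jpg');  -- hypothetical third copy
INSERT INTO my_term_relationships VALUES
    (2, 1, 0.58516), (2, 3, 0.16721), (3, 2, 0.21475);
""")

# Same selection as the answer's subquery: the lowest unreferenced ID
# of every image_path + image_filename pair that occurs more than once.
DELETE_ONE_PER_GROUP = """
DELETE FROM my_images
WHERE ID IN (
    SELECT MIN(mi.ID)
    FROM my_images mi
    JOIN (
        SELECT image_path, image_filename
        FROM my_images
        GROUP BY image_path, image_filename
        HAVING COUNT(*) > 1
    ) dups ON mi.image_path = dups.image_path
          AND mi.image_filename = dups.image_filename
    LEFT JOIN my_term_relationships mtr ON mi.ID = mtr.image_id
    WHERE mtr.image_id IS NULL
    GROUP BY mi.image_path, mi.image_filename
)
"""

# Repeat until a pass deletes nothing.
deleted_per_pass = []
while True:
    cursor = conn.execute(DELETE_ONE_PER_GROUP)
    if cursor.rowcount == 0:
        break
    deleted_per_pass.append(cursor.rowcount)

remaining_ids = [row[0] for row in conn.execute("SELECT ID FROM my_images ORDER BY ID")]
print(deleted_per_pass, remaining_ids)  # prints [3, 1] [2, 3, 7]
```

The loop always terminates: each pass either deletes at least one row or deletes nothing and stops. Groups in which every copy is referenced are left alone, which is the case the UPDATE on my_term_relationships is meant to handle.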

Edit: to also fix the problem Logar mentioned, run this UPDATE statement before the DELETE statement.

UPDATE my_term_relationships mtr
JOIN (
    SELECT mi.ID, minID
    FROM my_images mi
    JOIN (
        SELECT
            image_path, image_filename, MIN(ID) AS minID
        FROM my_images
        GROUP BY image_path, image_filename
        HAVING COUNT(*) > 1
    ) dups ON mi.image_path = dups.image_path AND mi.image_filename = dups.image_filename
) sq ON mtr.image_id = sq.ID
SET mtr.image_id = sq.minID;
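
Putting the two steps together, here is a sketch of the whole cleanup in one transaction, followed by two sanity checks. It uses an in-memory SQLite database as a stand-in for MySQL, with assumed column types and Logar's variant of the sample data. Two things differ from the statements above and are my own substitutions: the UPDATE is written as a correlated subquery because not every SQLite version supports joins in UPDATE, and the DELETE is simplified. Once every reference points at the lowest ID of its group, it is enough to delete every row that is not the lowest ID of its group.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE my_images (ID INTEGER PRIMARY KEY, image_path TEXT, image_filename TEXT);
CREATE TABLE my_term_relationships (image_id INTEGER, term_id INTEGER, weight REAL);
INSERT INTO my_images VALUES
    (1, 'Film 1', 'Film 1 001.jpg'), (2, 'Film 1', 'Film 1 001.jpg'),
    (3, 'Film 1', 'Film 1 002.jpg'), (4, 'Film 1', 'Film 1 002.jpg'),
    (5, 'Film 1', 'Film 1 003.jpg'), (6, 'Film 1', 'Film 1 003.jpg');
INSERT INTO my_term_relationships VALUES
    (1, 1, 0.58516), (2, 3, 0.16721), (3, 2, 0.21475);
""")

# Step 1: point each relationship at the lowest ID of its image's group.
# The WHERE guard leaves rows alone whose image_id matches no image.
REPOINT_RELATIONSHIPS = """
UPDATE my_term_relationships
SET image_id = (
    SELECT MIN(first_img.ID)
    FROM my_images AS first_img
    JOIN my_images AS ref_img
      ON first_img.image_path = ref_img.image_path
     AND first_img.image_filename = ref_img.image_filename
    WHERE ref_img.ID = my_term_relationships.image_id
)
WHERE image_id IN (SELECT ID FROM my_images)
"""

# Step 2 (simplified, see lead-in): keep only the lowest ID per group.
DELETE_DUPLICATES = """
DELETE FROM my_images
WHERE ID NOT IN (
    SELECT MIN(ID) FROM my_images GROUP BY image_path, image_filename
)
"""

# "with conn" commits both statements together, or rolls both back on error.
with conn:
    conn.execute(REPOINT_RELATIONSHIPS)
    conn.execute(DELETE_DUPLICATES)

# Sanity checks: no duplicate pairs left, no relationship without an image.
duplicate_pairs = conn.execute("""
    SELECT image_path, image_filename FROM my_images
    GROUP BY image_path, image_filename HAVING COUNT(*) > 1
""").fetchall()
orphaned = conn.execute("""
    SELECT COUNT(*) FROM my_term_relationships
    WHERE image_id NOT IN (SELECT ID FROM my_images)
""").fetchone()[0]
remaining_ids = [row[0] for row in conn.execute("SELECT ID FROM my_images ORDER BY ID")]
print(remaining_ids, duplicate_pairs, orphaned)  # prints [1, 3, 5] [] 0
```

On a real MySQL database the same order applies: run the UPDATE, then the DELETE, ideally inside one transaction, and run the two check queries before committing.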
+0

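Again, I believe that if two IDs with the same filename are both referenced in 'my_term_relationships', both of them will be kept in 'my_images'. I would add a first query that updates the IDs in 'my_term_relationships'. To see what I mean, change the inserted values in your fiddle to VALUES (1,1,0.58516), (2,3,0.16721), (3,2,0.21475). – Logar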

+0

Great answer, thanks. However, when I try the DELETE query I get the error "You can't specify target table 'my_images' for update in FROM clause"

+0

@DavidApple Fixed the error. – fancyPants