2010-07-21 65 views
0

我有这样的父子关系分组父母

Paragraph 
--------- 
ParagraphID PK 
// other attributes ... 


Sentence 
-------- 
SentenceID PK 
ParagraphID FK -> Paragraph.ParagraphID 
Text   nvarchar(4000) 
Offset  int 
Score  int 
// other attributes ... 

我想发现是等价的段落;那是包含相同的一组句子的段落。如果两个句子具有相同的文本,偏移量和分数,则认为它们相同 - SentenceID/ParagraphID不是比较的一部分,如果两个段落包含相同的句子集合,则两个段落是等同的。

有人可以告诉我查询相等段落的查询是什么样子吗?

编辑:有约。 150K段落,和1.5M句子。输出应包含ParagraphID,以及与此相同的最低段ID。例如。如果1款和1款相等,则输出将

ParagraphID EquivParagraphID 
1   1 
2   1 

回答

1

总之,你需要为每个段落的签名,然后比较签名。你没有提到输出本身的性质。在这里,我的“M返回逗号分隔ParagraphId值行针对每一相同段落签名

With ParagraphSigs As 
    (
    Select P.ParagraphId 
     , Hashbytes('SHA1' 
       , (
        Select '|' + S1.Text 
         '|' + Cast(S1.Offset As varchar(10)) 
         '|' + Cast(S1.Score As varchar(10)) 
        From Sentence As S1 
        Where S1.ParagraphId = P.ParagraphId 
        Order By S1.SentenceId 
        For Xml Path('') 
        )) As Signature 
    From Paragraph As P 
    ) 
Select Stuff(
      (
      Select ', ' + Cast(PS1.ParagraphId As varchar(10)) 
      From ParagraphSigs As PS1 
      Where PS1.Signature = PS.Signature 
      For Xml Path('') 
      ), 1, 2, '') As Paragraph 
From ParagraphSigs As PS 
Group By PS.Signature 

鉴于您对所需的输出此外,还可以更改查询,像这样:

With ParagraphSigs As 
    (
    Select P.ParagraphId 
     , Hashbytes('SHA1' 
       , (
        Select '|' + S1.Text 
         '|' + Cast(S1.Offset As varchar(10)) 
         '|' + Cast(S1.Score As varchar(10)) 
        From Sentence As S1 
        Where S1.ParagraphId = P.ParagraphId 
        Order By S1.SentenceId 
        For Xml Path('') 
        )) As Signature 
    From Paragraph As P 
    ) 
Select P1.ParagraphId, P2.ParagraphId As EquivParagraphId 
From ParagraphSigs As P1 
    Left Join ParagraphSigs As P2 
     On P2.Signature = P1.Signature 
      And P2.ParagraphId <> P1.ParagraphId 

显然,可能有三个或四个段落共享相同的签名,所以要注意上述结果会给你一个匹配段落的笛卡尔积(例如(P1,P2),(P1,P3),(P2, P1),(P2,P3),(P3,P1),(P3,P2))。

最后在句子上有效地搜索。既然你有另外两个参数,你可以通过先在两个int列比较减少做生成的签名数量:

With ParagraphsNeedingSigs As 
    (
    Select P1.ParagraphId 
    From Paragraph As P1 
    Where Exists (
        Select 1 
        From Paragraph As P2 
        Where P2.ParagraphId <> P1.ParagraphId 
         And P2.Offset = P1.Offet 
         And P2.Score = P1.Score 
        ) 
    ) 
    , ParagraphSigs As 
    (
    Select P.ParagraphId 
     , Hashbytes('SHA1' 
       , (
        Select '|' + S1.Text 
         '|' + Cast(S1.Offset As varchar(10)) 
         '|' + Cast(S1.Score As varchar(10)) 
        From Sentence As S1 
        Where S1.ParagraphId = P.ParagraphId 
        Order By S1.SentenceId 
        For Xml Path('') 
        )) As Signature 
    From ParagraphsNeedingSigs As P 
    ) 
Select P.ParagraphId, P2.ParagraphId As EquivParagraphId 
From Paragraph As P 
    Left Join ParagraphSigs As P1 
     On P1.ParagraphId = P.ParagraphId 
    Left Join ParagraphSigs As P2 
     On P2.Signature = P1.Signature 
      And P2.ParagraphId <> P1.ParagraphId 
+0

谢谢你。我之前没有看到哈希函数,所以这让人大开眼界。即使是错误的匹配,它在统计上是不重要的,有没有一种方法使用键作为提示来平等段落,但仍然使用句子比较来完全确定?另请参阅有关所需输出的更新。 – mdma 2010-07-21 22:41:35

+0

@mdma - 我已更新我的帖子。简而言之,您可以根据两个int列筛选出没有潜在匹配的段落,然后为可能匹配的段落计算散列值。 – Thomas 2010-07-21 23:05:35

+0

我的最终查询与此非常相似。我也试着将它与INTERSECT结合起来,以消除散列冲突,并确保这些句子组合真的相等,但开销从几秒钟到几分钟(我从来没有让它完成)。相反,我将哈希值更改为参考实际数据以便它是唯一的。 – mdma 2010-07-22 17:46:08

1

你既然SQL 2008年上市(我不知道这句法可用在2005年),您可能可以使用EXCEPT或INTERSECT比较。它涉及相关的子查询,因此性能可能是一个问题。

SELECT 
    * 
FROM 
    Paragraph P 
WHERE 
    (SELECT COUNT(*) FROM 
(
    SELECT 
     S1.[Text], 
     S1.Offset, 
     S1.Score 
    FROM 
     Paragraph P1 
    INNER JOIN Sentence S1 ON 
     S1.ParagraphID = P1.ParagraphID 
    WHERE 
     P1.ParagraphID = P.ParagraphID 
    INTERSECT 
    SELECT 
     S2.[Text], 
     S2.Offset, 
     S2.Score 
    FROM 
     Paragraph P2 
    INNER JOIN Sentence S2 ON 
     S2.ParagraphID = P2.ParagraphID 
    WHERE 
     P2.ParagraphID > P.ParagraphID 
) SQ 
) = (SELECT COUNT(*) FROM Sentence P3 WHERE P3.ParagraphID = P.ParagraphID) 
+0

谢谢你。我试着运行查询,但5分钟后停止。有没有一种方法来改善表演?我已经添加了表格大小和期望的输出到问题。 – mdma 2010-07-21 22:44:15