总之,你需要为每个段落的签名,然后比较签名。你没有提到输出本身的性质。在这里,我的“M返回逗号分隔ParagraphId值行针对每一相同段落签名
With ParagraphSigs As
(
Select P.ParagraphId
, Hashbytes('SHA1'
, (
Select '|' + S1.Text
'|' + Cast(S1.Offset As varchar(10))
'|' + Cast(S1.Score As varchar(10))
From Sentence As S1
Where S1.ParagraphId = P.ParagraphId
Order By S1.SentenceId
For Xml Path('')
)) As Signature
From Paragraph As P
)
Select Stuff(
(
Select ', ' + Cast(PS1.ParagraphId As varchar(10))
From ParagraphSigs As PS1
Where PS1.Signature = PS.Signature
For Xml Path('')
), 1, 2, '') As Paragraph
From ParagraphSigs As PS
Group By PS.Signature
鉴于您对所需的输出此外,还可以更改查询,像这样:
With ParagraphSigs As
(
Select P.ParagraphId
, Hashbytes('SHA1'
, (
Select '|' + S1.Text
'|' + Cast(S1.Offset As varchar(10))
'|' + Cast(S1.Score As varchar(10))
From Sentence As S1
Where S1.ParagraphId = P.ParagraphId
Order By S1.SentenceId
For Xml Path('')
)) As Signature
From Paragraph As P
)
Select P1.ParagraphId, P2.ParagraphId As EquivParagraphId
From ParagraphSigs As P1
Left Join ParagraphSigs As P2
On P2.Signature = P1.Signature
And P2.ParagraphId <> P1.ParagraphId
显然,可能有三个或四个段落共享相同的签名,所以要注意上述结果会给你一个匹配段落的笛卡尔积(例如(P1,P2),(P1,P3),(P2, P1),(P2,P3),(P3,P1),(P3,P2))。
最后在句子上有效地搜索。既然你有另外两个参数,你可以通过先在两个int列比较减少做生成的签名数量:
With ParagraphsNeedingSigs As
(
Select P1.ParagraphId
From Paragraph As P1
Where Exists (
Select 1
From Paragraph As P2
Where P2.ParagraphId <> P1.ParagraphId
And P2.Offset = P1.Offet
And P2.Score = P1.Score
)
)
, ParagraphSigs As
(
Select P.ParagraphId
, Hashbytes('SHA1'
, (
Select '|' + S1.Text
'|' + Cast(S1.Offset As varchar(10))
'|' + Cast(S1.Score As varchar(10))
From Sentence As S1
Where S1.ParagraphId = P.ParagraphId
Order By S1.SentenceId
For Xml Path('')
)) As Signature
From ParagraphsNeedingSigs As P
)
Select P.ParagraphId, P2.ParagraphId As EquivParagraphId
From Paragraph As P
Left Join ParagraphSigs As P1
On P1.ParagraphId = P.ParagraphId
Left Join ParagraphSigs As P2
On P2.Signature = P1.Signature
And P2.ParagraphId <> P1.ParagraphId
谢谢你。我之前没有看到哈希函数,所以这让人大开眼界。即使是错误的匹配,它在统计上是不重要的,有没有一种方法使用键作为提示来平等段落,但仍然使用句子比较来完全确定?另请参阅有关所需输出的更新。 – mdma 2010-07-21 22:41:35
@mdma - 我已更新我的帖子。简而言之,您可以根据两个int列筛选出没有潜在匹配的段落,然后为可能匹配的段落计算散列值。 – Thomas 2010-07-21 23:05:35
我的最终查询与此非常相似。我也试着将它与INTERSECT结合起来,以消除散列冲突,并确保这些句子组合真的相等,但开销从几秒钟到几分钟(我从来没有让它完成)。相反,我将哈希值更改为参考实际数据以便它是唯一的。 – mdma 2010-07-22 17:46:08