2012-05-14 56 views
4

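Find and remove repeated substrings

I have a column in a SQL Server 2008 table where part of the string was accidentally repeated.

Does anyone have a quick and easy way to remove the trailing repeated substring?

For example,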

alpha\bravo\charlie\delta\charlie\delta 

should become

alpha\bravo\charlie\delta 
+0

I assume you also want to eliminate multiple dupes, e.g. source = 'alpha\bravo\alpha\bravo\alpha' becomes 'alpha\bravo'?

+0

@AaronBertrand: Yes.

+0

Are you also looking to find the duplicates, and is this the entire string or a substring of something larger?

Answers

7

If you don't already have a numbers table:

SET NOCOUNT ON; 
DECLARE @UpperLimit INT; 
SET @UpperLimit = 4000; 

WITH n(rn) AS 
(
    SELECT ROW_NUMBER() OVER (ORDER BY [object_id]) 
    FROM sys.all_columns 
) 
SELECT [Number] = rn - 1 
INTO dbo.Numbers FROM n 
WHERE rn <= @UpperLimit + 1; 

CREATE UNIQUE CLUSTERED INDEX n ON dbo.Numbers([Number]); 
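A quick sanity check may be worth running here. The numbers are generated from the rows of sys.all_columns, so the table only reaches 4000 if that view holds at least 4001 rows. The sketch below just reports what was actually created; the column aliases are made up for illustration.

```sql
-- Expect LowNumber = 0. HighNumber is 4000 only if sys.all_columns
-- had at least 4001 rows when dbo.Numbers was populated.
SELECT LowNumber   = MIN([Number]),
       HighNumber  = MAX([Number]),
       NumberCount = COUNT(*)
FROM dbo.Numbers;
```

If HighNumber comes back lower than the longest string you need to split, cross joining sys.all_columns to itself in the CTE is one common way to get more rows.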

Now a generic split function that will turn your delimited string into a set:

CREATE FUNCTION dbo.SplitString 
(
    @List NVARCHAR(MAX), 
    @Delim CHAR(1) 
) 
RETURNS TABLE 
AS 
    RETURN (SELECT 
     rn, 
     vn = ROW_NUMBER() OVER (PARTITION BY [Value] ORDER BY rn), 
     [Value] 
     FROM 
     ( 
     SELECT 
      rn = ROW_NUMBER() OVER (ORDER BY [Number]), -- order by each element's start position
      [Value] = LTRIM(RTRIM(SUBSTRING(@List, [Number], 
      CHARINDEX(@Delim, @List + @Delim, [Number]) - [Number]))) 
     FROM dbo.Numbers 
     WHERE Number <= LEN(@List) 
     AND SUBSTRING(@Delim + @List, [Number], 1) = @Delim 
    ) AS x 
    ); 
GO 
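To see what the function hands back, here is a small sketch with a made-up input string. rn is meant to be the position of each element in the list, and vn counts occurrences of the same value, so the repeated alpha shows up twice, once with vn = 1 and once with vn = 2.

```sql
-- Illustrative input only; any '\'-delimited string will do.
SELECT rn, vn, [Value]
FROM dbo.SplitString(N'alpha\bravo\alpha', '\')
ORDER BY rn;
```

Keeping only the rows where vn = 1 is what removes the duplicates in the next step.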

Then a function that puts them back together:

CREATE FUNCTION dbo.DedupeString 
(
    @List NVARCHAR(MAX) 
) 
RETURNS NVARCHAR(MAX) 
AS 
BEGIN 
    RETURN (SELECT newval = STUFF((
    SELECT '\' + x.[Value] FROM dbo.SplitString(@List, '\') AS x 
     WHERE (x.vn = 1) 
     ORDER BY x.rn 
     FOR XML PATH, TYPE).value('.', 'nvarchar(max)'), 1, 1, '') 
    ); 
END 
GO 

Sample usage:

SELECT dbo.DedupeString('alpha\bravo\bravo\charlie\delta\bravo\charlie\delta'); 

Result:

alpha\bravo\charlie\delta 

So you can also say:

UPDATE dbo.MessedUpTable 
    SET OopsColumn = dbo.DedupeString(OopsColumn); 
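Before running the update for real, it is probably worth previewing which rows would change. A minimal sketch, reusing the table and column names from the update above (the Deduped alias is just for illustration):

```sql
-- Lists only the rows whose value would be altered by the update.
SELECT OopsColumn,
       Deduped = dbo.DedupeString(OopsColumn)
FROM dbo.MessedUpTable
WHERE OopsColumn <> dbo.DedupeString(OopsColumn);
```

The same WHERE clause could be added to the UPDATE so that untouched rows are not rewritten. Note that the comparison follows the column's collation and ignores trailing spaces.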

@MikaelEriksson will probably swoop in with a more efficient way to eliminate the duplicates using XML, but this is what I can offer until then. :-)

+0

This worked like a charm. Thanks!

+0

Oops. No, I added a regular "walk the string" version :). I will give XML some thought.

+0

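I added an XML version as well. Preserving the order of the words was a bit tricky. I don't think using XML to split the string will beat a numbers table, though.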

4
create function RemoveDups(@S nvarchar(max)) returns nvarchar(max) 
as 
begin 
    declare @R nvarchar(max) 
    declare @W nvarchar(max) 
    set @R = '' 

    while len(@S) > 0 -- "> 1" would drop a final one-character word
    begin 
    -- Get the first word 
    set @W = left(@S, charindex('/', @S+'/')-1) 

    -- Add word to result if not already added 
    if '/'+@R not like '%/'+@W+'/%' 
    begin 
     set @R = @R + @W + '/' 
    end 

    -- Remove first word 
    set @S = stuff(@S, 1, charindex('/', @S+'/'), '') 
    end 

    return left(@R, len(@R)- 1) 
end 
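A quick way to try the function, assuming it was created in the dbo schema (scalar functions need a two-part name when called). The input is just sample data:

```sql
-- Keeps the first occurrence of each word, in original order.
-- Should return: alpha/beta/gamma/delta
select dbo.RemoveDups('alpha/beta/alpha/gamma/delta/gamma/delta/alpha');
```

Note that this version splits on '/' rather than the '\' used in the question, so the delimiter inside the function would have to be changed for the original data. Because the duplicate check uses LIKE, words containing wildcard characters such as % or _ may not be matched the way you expect.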

As requested by Aaron Bertrand. However, I will not make any claims about which one executes fastest.

-- Table to replace in 
declare @T table 
(
    ID int identity, 
    Value nvarchar(max) 
) 

-- Add some sample data 
insert into @T values ('alpha/beta/alpha/gamma/delta/gamma/delta/alpha') 
insert into @T values ('delta/beta/alpha/beta/alpha/gamma/delta/gamma/delta/alpha') 

-- Update the column 
update T 
set Value = NewValue 
from (
     select T1.ID, 
       Value, 
       stuff((select '/' + T4.Value 
        from (
          select T3.X.value('.', 'nvarchar(max)') as Value, 
            row_number() over(order by T3.X) as rn 
          from T2.X.nodes('/x') as T3(X) 
         ) as T4 
        group by T4.Value 
        order by min(T4.rn) 
        for xml path(''), type).value('.', 'nvarchar(max)'), 1, 1, '') as NewValue 
     from @T as T1 
     cross apply (select cast('<x>'+replace(T1.Value, '/', '</x><x>')+'</x>' as xml)) as T2(X) 
    ) as T 

select * 
from @T 
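To make the splitting step easier to follow on its own, here is a stripped-down sketch of just that part. The variable @Value and the alias S are made up for illustration; the replace/cast/nodes pattern is the same one used in the cross apply above.

```sql
declare @Value nvarchar(max);
set @Value = 'alpha/beta/alpha';

-- Turn 'alpha/beta/alpha' into <x>alpha</x><x>beta</x><x>alpha</x>,
-- then shred the XML into one row per <x> element.
select T.X.value('.', 'nvarchar(max)') as Word
from (select cast('<x>' + replace(@Value, '/', '</x><x>') + '</x>' as xml)) as S(X)
cross apply S.X.nodes('/x') as T(X);
```

One caveat: the cast to xml fails if the value contains characters such as & or <, so data like that would need to be escaped first.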
+1

Very clever XML manipulation there