Would it help to use an algorithm such as Levenshtein distance to compare the similarity between two sequences?
https://en.wikipedia.org/wiki/Levenshtein_distance
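To make the idea concrete, here is a minimal Python sketch of the classic dynamic-programming form of Levenshtein distance. It is only an illustration of the metric, not Oracle's implementation, and the function name `levenshtein` is my own.

```python
def levenshtein(a: str, b: str) -> int:
    """Count the single-character inserts, deletes, and substitutions
    needed to turn string a into string b (case-sensitive)."""
    # prev[j] = distance between the previous prefix of a and the first j chars of b.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]  # turning the first i chars of a into "" takes i deletes
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute, or match for free
        prev = curr
    return prev[-1]


print(levenshtein("Amalia", "Amelia"))    # 1
print(levenshtein("Kirsten", "Kristen"))  # 2
```

A swapped pair of letters counts as two edits here, which is why Kirsten and Kristen are at distance 2 rather than 1.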
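In Oracle specifically, you can use the UTL_MATCH package, whose EDIT_DISTANCE function returns the Levenshtein distance between two strings.
For example: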
--Find closest names based on UTL_MATCH.EDIT_DISTANCE.
with names as
(
--Names data.
select column_value name
from table(sys.odcivarchar2list('Adeline','Alana','Alice','Amalia','Amelia','Annabel',
'Annabelle','Beatrice','Beatrix','Bella','Cara','Catherine','Cecelia','Cecilia',
'Charlotte','Clara','Elena','Elisabeth','Elise','Elizabeth','Ella','Eloise','Emilie',
'Emily','Emmeline','Finola','Fiona','Gwendolen','Gwendolyn','Hallie','Helena','Holly',
'Isabeau','Isabel','Isabelle','Isobel','Juliet','Juliette','Katherine','Kirsten',
'Kristen','Lara','Laura','Lilian','Lillian','Lily','Louise','Lucie','Lucy',
'Madeleine','Madeline','Mara','Millie','Nora','Norah','Sara','Scarlett','Serafina',
'Seraphina','Sofia','Sophia','Stella','Susanna','Susannah','Tahlia','Talia','Thalia',
'Viola','Violet','Vivian','Vivien','Zara'))
)
--Name with the closest matches.
select name1, edit_distance, listagg(name2, ',') within group (order by name2) names
from
(
--Compare strings.
select names1.name name1, names2.name name2
,utl_match.edit_distance(names1.name, names2.name) edit_distance
,min(utl_match.edit_distance(names1.name, names2.name))
over (partition by names1.name) min_edit_distance
from names names1
cross join names names2
--This cross join could get expensive. It may help to add conditions here to
--filter out obvious non-matches. For example, maybe throw out rows where the
--string length is vastly different?
where names1.name <> names2.name
order by 1, 3, 2
)
where edit_distance = min_edit_distance
group by name1, edit_distance
order by 1;
Results:
NAME1     EDIT_DISTANCE NAMES
--------- ------------- -----
Adeline               2 Madeline
Alana                 2 Clara,Elena
Alice                 2 Elise
Amalia                1 Amelia
Amelia                1 Amalia
Annabel               2 Annabelle
Annabelle             2 Annabel
Beatrice              2 Beatrix
Beatrix               2 Beatrice
Bella                 2 Ella,Stella
Cara                  1 Clara,Lara,Mara,Sara,Zara
Catherine             1 Katherine
Cecelia               1 Cecilia
Cecilia               1 Cecelia
Charlotte             4 Scarlett
Clara                 1 Cara
Elena                 2 Alana,Ella,Helena
Elisabeth             1 Elizabeth
Elise                 1 Eloise
Elizabeth             1 Elisabeth
Ella                  2 Bella,Elena
Eloise                1 Elise
Emilie                2 Emily
Emily                 2 Emilie,Lily
Emmeline              3 Adeline,Emilie,Madeline
Finola                2 Fiona,Viola
Fiona                 2 Finola,Viola
Gwendolen             1 Gwendolyn
Gwendolyn             1 Gwendolen
Hallie                2 Millie
Helena                2 Elena
Holly                 3 Bella,Ella,Emily,Hallie,Lily
Isabeau               2 Isabel
Isabel                1 Isobel
Isabelle              2 Isabel
Isobel                1 Isabel
Juliet                2 Juliette
Juliette              2 Juliet
Katherine             1 Catherine
Kirsten               2 Kristen
Kristen               2 Kirsten
Lara                  1 Cara,Laura,Mara,Sara,Zara
Laura                 1 Lara
Lilian                1 Lillian
Lillian               1 Lilian
Lily                  2 Emily,Lucy
Louise                3 Elise,Eloise,Lucie
Lucie                 2 Lucy
Lucy                  2 Lily,Lucie
Madeleine             1 Madeline
Madeline              1 Madeleine
Mara                  1 Cara,Lara,Sara,Zara
Millie                2 Hallie
Nora                  1 Norah
Norah                 1 Nora
Sara                  1 Cara,Lara,Mara,Zara
Scarlett              4 Charlotte
Serafina              2 Seraphina
Seraphina             2 Serafina
Sofia                 2 Sophia
Sophia                2 Sofia
Stella                2 Bella
Susanna               1 Susannah
Susannah              1 Susanna
Tahlia                1 Talia
Talia                 1 Tahlia,Thalia
Thalia                1 Talia
Viola                 2 Finola,Fiona,Violet
Violet                2 Viola
Vivian                1 Vivien
Vivien                1 Vivian
Zara                  1 Cara,Lara,Mara,Sara
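On the idea in the query comments of filtering out obvious non-matches: the edit distance between two strings can never be smaller than the difference in their lengths, so a length check is a cheap way to skip pairs before computing the distance. In the query that might look like adding `and abs(length(names1.name) - length(names2.name)) <= 2` to the WHERE clause, where the cutoff of 2 is an arbitrary choice. Below is a rough Python sketch of the same pruning. The names `closest` and `max_distance` are hypothetical, and a cutoff like this drops names whose nearest match is farther away than the cutoff (Charlotte and Scarlett at distance 4, for instance).

```python
def levenshtein(a, b):
    # Compact row-by-row dynamic programming for edit distance.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = curr
    return prev[-1]


def closest(name, names, max_distance=2):
    """Return (distance, sorted matches) for the nearest other names
    within max_distance, or None if nothing is that close."""
    best, matches = None, []
    for other in names:
        if other == name:
            continue
        # Length difference is a lower bound on edit distance,
        # so this pair cannot be within max_distance: skip it cheaply.
        if abs(len(name) - len(other)) > max_distance:
            continue
        d = levenshtein(name, other)
        if d > max_distance:
            continue
        if best is None or d < best:
            best, matches = d, [other]
        elif d == best:
            matches.append(other)
    return (best, sorted(matches)) if matches else None


names = ["Cara", "Clara", "Lara", "Laura", "Catherine", "Katherine"]
print(closest("Cara", names))       # (1, ['Clara', 'Lara'])
print(closest("Catherine", names))  # (1, ['Katherine'])
```

With the cutoff in place, the expensive comparison only runs for pairs that could plausibly match, which is the point of the suggested filter on the cross join.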
Have you looked at the Variants section of https://en.wikipedia.org/wiki/Soundex? Good luck. – shellter