2016-05-16

How can I write a query that identifies similar-sounding names, possibly including non-English names? Soundex does not seem to handle non-English names well.

The code should be able to recognize that, for example, the following pairs (or most of them) are similar-sounding names:


Helena - Elena 
Violet - Viola 
Beatrix - Beatrice 
Madeline - Madeleine (ma-duh-LINE vs ma-duh-LEN) 
Alice - Elise 
Madeline - Adeline 
Kristen - Kirsten 
Lily - Millie 
Charlotte - Scarlett 
Zara/Lara/Sara/Mara 
Elena - Alana 
Emily - Emmeline 
Amelia - Amalia 
Stella - Bella - Ella 
Isabel - Isabeau 
Holly - Hallie 
Laura - Lara 
Fiona - Finola 
Louise - Eloise 
Cara - Clara 
Susanna vs Susannah 
Nora vs Norah 
Talia vs Tahlia vs Thalia 
Catherine vs Katherine 
Cecilia vs Cecelia 
Lucy vs Lucie 
Vivian vs Vivien 
Lillian vs Lilian 
Gwendolen vs Gwendolyn 
Sofia vs Sophia 
Isabel vs Isobel vs Isabelle 
Seraphina vs Serafina 
Juliet vs Juliette 
Annabel vs Annabelle 
Emily vs Emilie 
Elisabeth vs Elizabeth 
...and non-English names too. 

Have you looked at the Variants section of https://en.wikipedia.org/wiki/Soundex? Good luck. – shellter

Answers


Would it help to use an algorithm such as Levenshtein distance, which measures the similarity between two sequences?

https://en.wikipedia.org/wiki/Levenshtein_distance

In Oracle specifically, you can use the UTL_MATCH package.
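
To get a feel for the metric before running it over a whole list, here is a minimal sketch that compares a few pairs from the question directly. The pairs CTE and the edit_distance_ci column are illustrative names only, not part of UTL_MATCH. The comparison is case-sensitive, so the second column applies UPPER first to show the difference.

```sql
--Minimal sketch: compare individual pairs with UTL_MATCH.EDIT_DISTANCE.
--"pairs" and "edit_distance_ci" are made-up names for illustration.
with pairs as
(
    select 'Helena' name1, 'Elena' name2 from dual union all
    select 'Amelia', 'Amalia' from dual union all
    select 'Charlotte', 'Scarlett' from dual
)
select name1, name2
    --Case-sensitive distance, as used in the full query.
    ,utl_match.edit_distance(name1, name2) edit_distance
    --Same distance after normalizing case.
    ,utl_match.edit_distance(upper(name1), upper(name2)) edit_distance_ci
from pairs;
```

With the case-sensitive comparison, Helena/Elena comes out as 2 (drop the H, then change e to E), which matches the results further down. After UPPER it is 1.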

For example:

--Find closest names based on UTL_MATCH.EDIT_DISTANCE. 
with names as 
(
    --Names data. 
    select column_value name 
    from table(sys.odcivarchar2list('Adeline','Alana','Alice','Amalia','Amelia','Annabel', 
    'Annabelle','Beatrice','Beatrix','Bella','Cara','Catherine','Cecelia','Cecilia', 
    'Charlotte','Clara','Elena','Elisabeth','Elise','Elizabeth','Ella','Eloise','Emilie', 
    'Emily','Emmeline','Finola','Fiona','Gwendolen','Gwendolyn','Hallie','Helena','Holly', 
    'Isabeau','Isabel','Isabelle','Isobel','Juliet','Juliette','Katherine','Kirsten', 
    'Kristen','Lara','Laura','Lilian','Lillian','Lily','Louise','Lucie','Lucy', 
    'Madeleine','Madeline','Mara','Millie','Nora','Norah','Sara','Scarlett','Serafina', 
    'Seraphina','Sofia','Sophia','Stella','Susanna','Susannah','Tahlia','Talia','Thalia', 
    'Viola','Violet','Vivian','Vivien','Zara')) 
) 
--Name with the closest matches. 
select name1, edit_distance, listagg(name2, ',') within group (order by name2) names 
from 
(
    --Compare strings. 
    select names1.name name1, names2.name name2 
     ,utl_match.edit_distance(names1.name, names2.name) edit_distance 
     ,min(utl_match.edit_distance(names1.name, names2.name)) 
      over (partition by names1.name) min_edit_distance 
    from names names1 
    cross join names names2 
    --This cross join could get expensive. It may help to add conditions here to 
    --filter out obvious non-matches. For example, maybe throw out rows where the 
    --string length is vastly different? 
    where names1.name <> names2.name 
    order by 1, 3, 2 
) 
where edit_distance = min_edit_distance 
group by name1, edit_distance 
order by 1; 

Results:

NAME1     EDIT_DISTANCE NAMES
--------- ------------- -----------------------------
Adeline               2 Madeline
Alana                 2 Clara,Elena
Alice                 2 Elise
Amalia                1 Amelia
Amelia                1 Amalia
Annabel               2 Annabelle
Annabelle             2 Annabel
Beatrice              2 Beatrix
Beatrix               2 Beatrice
Bella                 2 Ella,Stella
Cara                  1 Clara,Lara,Mara,Sara,Zara
Catherine             1 Katherine
Cecelia               1 Cecilia
Cecilia               1 Cecelia
Charlotte             4 Scarlett
Clara                 1 Cara
Elena                 2 Alana,Ella,Helena
Elisabeth             1 Elizabeth
Elise                 1 Eloise
Elizabeth             1 Elisabeth
Ella                  2 Bella,Elena
Eloise                1 Elise
Emilie                2 Emily
Emily                 2 Emilie,Lily
Emmeline              3 Adeline,Emilie,Madeline
Finola                2 Fiona,Viola
Fiona                 2 Finola,Viola
Gwendolen             1 Gwendolyn
Gwendolyn             1 Gwendolen
Hallie                2 Millie
Helena                2 Elena
Holly                 3 Bella,Ella,Emily,Hallie,Lily
Isabeau               2 Isabel
Isabel                1 Isobel
Isabelle              2 Isabel
Isobel                1 Isabel
Juliet                2 Juliette
Juliette              2 Juliet
Katherine             1 Catherine
Kirsten               2 Kristen
Kristen               2 Kirsten
Lara                  1 Cara,Laura,Mara,Sara,Zara
Laura                 1 Lara
Lilian                1 Lillian
Lillian               1 Lilian
Lily                  2 Emily,Lucy
Louise                3 Elise,Eloise,Lucie
Lucie                 2 Lucy
Lucy                  2 Lily,Lucie
Madeleine             1 Madeline
Madeline              1 Madeleine
Mara                  1 Cara,Lara,Sara,Zara
Millie                2 Hallie
Nora                  1 Norah
Norah                 1 Nora
Sara                  1 Cara,Lara,Mara,Zara
Scarlett              4 Charlotte
Serafina              2 Seraphina
Seraphina             2 Serafina
Sofia                 2 Sophia
Sophia                2 Sofia
Stella                2 Bella
Susanna               1 Susannah
Susannah              1 Susanna
Tahlia                1 Talia
Talia                 1 Tahlia,Thalia
Thalia                1 Talia
Viola                 2 Finola,Fiona,Violet
Violet                2 Viola
Vivian                1 Vivien
Vivien                1 Vivian
Zara                  1 Cara,Lara,Mara,Sara
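
The comment in the query above suggests throwing out obvious non-matches before computing distances. Here is a minimal sketch of that idea, assuming a cutoff of 2 (an arbitrary value picked for illustration, not something from the original query). It relies on the fact that the edit distance between two strings can never be less than the difference in their lengths, so pairs whose lengths differ by more than the cutoff cannot be within the cutoff and can be skipped. The shortened name list is only there to keep the example small.

```sql
--Sketch: skip pairs whose lengths differ by more than an assumed cutoff of 2.
--Edit distance >= length difference, so skipped pairs would exceed 2 anyway.
with names as
(
    --Small sample of the names data.
    select column_value name
    from table(sys.odcivarchar2list('Helena','Elena','Ella','Lily',
    'Charlotte','Scarlett'))
)
select names1.name name1, names2.name name2
    ,utl_match.edit_distance(names1.name, names2.name) edit_distance
from names names1
cross join names names2
where names1.name <> names2.name
    --Cheap length check; the distance is only computed for surviving pairs.
    and abs(length(names1.name) - length(names2.name)) <= 2
order by 1, 3, 2;
```

Every pair removed this way would have had a distance above 2, so any match at distance 2 or less is still found. Only names whose closest match is farther away than that could see different results, so the cutoff probably needs tuning against real data.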

Thanks!! This helped a lot! –