2017-03-22 82 views

Fuzzy matching between two tables

I have two tables containing customer information such as name and address.

ID  Name    Full Address
1   Anurag  123 CA USA 5001
2   Mike    ABC CA USA 5002
3   Jason   ZYZ TX USA 5003
4   Roshan  HBC MS USA 5004
5   Tony    UYS VT USA 5005

New_ID  Name           Full Address
111     Anurag CH      123 3 Floor CA USA 5001
112     Mike Martin    ABC 2 floorCA USA 5002
113     Jason Bond     ABC farms USA 4008
114     Roshan Kappor  HBC MS USA 5004
115     Tony Smith     UYS VT USA 5005
116     Anurag         123 CA USA 5001

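What is the best way to do a fuzzy match between the two tables above, based on the full address? The fuzzy match should work like a fuzzy VLOOKUP and return only the single best match for each row.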

Desired Output 

ID  Name    Full Address     New ID  Name           Full Address            Match Score
1   Anurag  123 CA USA 5001  116     Anurag         123 CA USA 5001         100
2   Mike    ABC CA USA 5002  112     Mike Martin    ABC 2 floorCA USA 5002  90
3   Jason   ZYZ TX USA 5003  113     Jason Bond     ABC farms USA 4008      89
4   Roshan  HBC MS USA 5004  114     Roshan Kappor  HBC MS USA 5004         90
5   Tony    UYS VT USA 5005  115     Tony Smith     UYS VT USA 5005         90

Which version of Oracle are you on? Have you looked at Oracle Text? See https://community.oracle.com/thread/3583139 and https://docs.oracle.com/cd/E11882_01/text.112/e24436/csql.htm#CCREF0104. The ability to do this should be there... – sers

Answer

3

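Try the UTL_MATCH package. It has two functions that compute the similarity between strings: EDIT_DISTANCE_SIMILARITY and JARO_WINKLER_SIMILARITY.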
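As a quick way to get a feel for the scores, the sketch below runs both UTL_MATCH similarity functions on one address pair taken from the question. Both return an integer from 0 (no similarity) to 100 (identical strings). The column aliases edit_sim and jw_sim are only illustrative.

```sql
-- Compare one address pair with both similarity functions.
-- edit_sim and jw_sim are illustrative aliases, not part of UTL_MATCH.
select UTL_MATCH.EDIT_DISTANCE_SIMILARITY('ABC CA USA 5002', 'ABC 2 floorCA USA 5002') edit_sim,
       UTL_MATCH.JARO_WINKLER_SIMILARITY('ABC CA USA 5002', 'ABC 2 floorCA USA 5002') jw_sim
from dual
```

Running this against a handful of real address pairs is a cheap way to decide which function, and which cut-off, suits the data.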

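Steps:

1) Join t1 to t2 on UTL_MATCH.EDIT_DISTANCE_SIMILARITY(t1.full_adress, t2.full_adress_2) > 0. The function returns the similarity as a percentage (0 to 100), and the 0 here is the minimum similarity accepted. I suggest setting it to 50 or more.

2) Deduplicate with row_number().

3) Return only the row with the highest similarity percentage.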

with tab_1 (ID,name,full_adress) as(
select 1 ,'Anurag' ,'123 CA USA 5001' from dual union all 
select 2 ,'Mike' ,'ABC CA USA 5002' from dual union all 
select 3 ,'Jason' ,'ZYZ TX USA 5003' from dual union all 
select 4 ,'Roshan' ,'HBC MS USA 5004' from dual union all 
select 5 ,'Tony' ,'UYS VT USA 5005' from dual), 
tab_2 (ID_2,name_2,full_adress_2) as (
select 111 ,'Anurag CH'  ,'123 3 Floor CA USA 5001' from dual union all 
select 112 ,'Mike Martin'  ,'ABC 2 floorCA USA 5002' from dual union all 
select 113 ,'Jason Bond'  ,'ABC farms USA 4008' from dual union all 
select 114 ,'Roshan Kappor' ,'HBC MS USA 5004' from dual union all 
select 115 ,'Tony Smith'  ,'UYS VT USA 5005' from dual union all 
select 116 ,'Anurag'   ,'123 CA USA 5001' from dual) 
-- Score every candidate pair, rank the candidates per tab_1 row, keep the best one
select * from (
  select t1.*, t2.*,
         UTL_MATCH.EDIT_DISTANCE_SIMILARITY(t1.full_adress, t2.full_adress_2) SIMILARITY_PERCENT,
         row_number() over (
           partition by t1.id
           order by UTL_MATCH.EDIT_DISTANCE_SIMILARITY(t1.full_adress, t2.full_adress_2) desc
         ) rn_rank
  from tab_1 t1
  -- 0 is the minimum similarity (0-100); raise it (e.g. to 50) to discard weak matches
  join tab_2 t2
    on UTL_MATCH.EDIT_DISTANCE_SIMILARITY(t1.full_adress, t2.full_adress_2) > 0
) where rn_rank = 1
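
One possible variant, offered as a sketch rather than as part of the original answer: it applies the suggested threshold of 50, computes the score once in a CTE (here called scored, a made-up name), and adds ID_2 as a tie-breaker so that the chosen row is deterministic when two candidates score the same. The tie-break rule is an assumption; pick whatever ordering makes sense for your data. Only a few sample rows from the question are inlined to keep it short.

```sql
with tab_1 (ID, name, full_adress) as (
  select 1, 'Anurag', '123 CA USA 5001' from dual union all
  select 3, 'Jason',  'ZYZ TX USA 5003' from dual),
tab_2 (ID_2, name_2, full_adress_2) as (
  select 111, 'Anurag CH',  '123 3 Floor CA USA 5001' from dual union all
  select 113, 'Jason Bond', 'ABC farms USA 4008'      from dual union all
  select 116, 'Anurag',     '123 CA USA 5001'         from dual),
-- scored: every tab_1/tab_2 pair with its similarity, computed once
scored as (
  select t1.*, t2.*,
         UTL_MATCH.EDIT_DISTANCE_SIMILARITY(t1.full_adress, t2.full_adress_2) similarity_percent
  from tab_1 t1
  cross join tab_2 t2)
select * from (
  select s.*,
         -- highest score first; ID_2 breaks ties (assumed rule)
         row_number() over (partition by ID order by similarity_percent desc, ID_2) rn_rank
  from scored s
  where similarity_percent >= 50)
where rn_rank = 1
```

Note that with a threshold above 0, a row in tab_1 that has no candidate at or above the threshold disappears from the result. If every tab_1 row must appear, left join the ranked matches back to tab_1 instead.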