2011-01-27 77 views

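Partial matching in an Oracle database

I have a very large table (over 1 million rows) that holds product names and prices from different sources.

Many products share the same name but have different prices.

Here is the problem:

The same product appears in several rows, but its name is not identical in each of them, for example: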

Row     Product name              Price
-----   -----------------------   -----
Row 1   XYZ - size information    $a
Row 2   XYZ -Brand information    $b
Row 3   xyz                       $c

I want to find all products whose price differs between rows. If the names were identical, I could simply do a self-join on Table1.Product_Name = Table2.Product_Name and Table1.Price != Table2.Price.
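In SQL, that join would look roughly like the sketch below. Here products, product_name and price are placeholders for the real table and column names, which are not shown in this post.

```sql
-- Sketch only: products, product_name and price are assumed names.
-- Pairs up rows whose names are exactly equal but whose prices differ.
select
    t1.product_name,
    t1.price as price_1,
    t2.price as price_2
from
    products t1
    join products t2
      on t1.product_name = t2.product_name
where
    t1.price != t2.price;   -- each pair shows up twice, once in each order
```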

But that will not work in this case :(

Can anyone suggest a solution for this?

Answers


You could try regexp_replace as a step in the right direction:

create table tq84_products (
    name varchar2(50), 
    price varchar2(5) 
); 

Three products:

  • XYZ
  • ABCD
  • efghi

Of these, ABCD has two records with the same price; all the others have different prices.

insert into tq84_products values (' XYZ - size information', '$a'); 
insert into tq84_products values ('XYZ - brand information', '$b'); 
insert into tq84_products values ('xyz'     , '$c'); 

insert into tq84_products values ('Product ABCD'   , '$d'); 
insert into tq84_products values ('Abcd is the best'  , '$d'); 

insert into tq84_products values ('efghi is cheap'   , '$f'); 
insert into tq84_products values ('no, efghi is expensive' , '$g'); 

The SELECT statement follows. Its stop_words subquery removes words that are commonly found in product names.

with split_into_words as (
    -- One row per (product, word position): counter n selects the n-th
    -- captured word through the backreference \n.
    select
        name,
        price,
        upper(
            regexp_replace(name,
                '\W*'        ||
                '(\w+)?\W?+' ||
                '(\w+)?\W?+' ||
                '(\w+)?\W?+' ||
                '(\w+)?\W?+' ||
                '(\w+)?\W?+' ||
                '(\w+)?\W?+' ||
                '(\w+)?\W?+' ||
                '(\w+)?\W?+' ||
                '(\w+)?'     ||
                '.*',
                '\' || submatch.counter
            )
        ) word
    from
        tq84_products,
        (select
            rownum counter
         from
            dual
         connect by
            level < 10          -- counters 1..9, one per capture group
        ) submatch
),
stop_words as (
    -- Words too common in product names to count as a match.
    select 'IS'          word from dual union all
    select 'BRAND'       word from dual union all
    select 'INFORMATION' word from dual
)
select
    w1.price,
    w2.price,
    w1.name,
    w2.name
--  substr(w1.word, 1, 30)                 common_word,
--  count(*) over (partition by w1.name)   cnt
from
    split_into_words w1,
    split_into_words w2
where
    w1.word = w2.word and
    w1.name < w2.name and       -- report each pair of names in one order only
    w1.word is not null and
    w2.word is not null and
    w1.word not in (select word from stop_words) and
    w2.word not in (select word from stop_words) and
    w1.price != w2.price;

This then selects:

$a    $b    XYZ - size information     XYZ - brand information
$b    $c    XYZ - brand information    xyz
$a    $c    XYZ - size information     xyz
$f    $g    efghi is cheap             no, efghi is expensive

So ABCD is not returned, while the others are.
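The word split could probably also be written with regexp_substr, whose fourth argument picks the n-th occurrence of a pattern and so avoids the long backreference pattern. This is an alternative sketch, not what the query above uses. It assumes the same tq84_products table, keeps the same limit of 9 words per name, and relies on \w being supported by the Oracle version in use (the query above already depends on that). The aliases p and n are only illustrative.

```sql
-- Alternative sketch: one row per (product, word position) via regexp_substr.
-- The 4th argument selects the n-th run of word characters; the function
-- returns NULL when the name has fewer than n words.
select
    p.name,
    p.price,
    upper(regexp_substr(p.name, '\w+', 1, n.counter)) word
from
    tq84_products p,
    (select level counter from dual connect by level < 10) n
where
    regexp_substr(p.name, '\w+', 1, n.counter) is not null;
```

If this were used as the body of split_into_words, the two "word is not null" conditions in the final select would no longer be needed, because rows without a word are already filtered out here.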


I will try this. – onsy 2011-01-27 08:55:11