2010-02-18 90 views
1

Parsing addresses with a regular expression

I want to write a loop that uses a regular expression to fill in any of these 4 variables:

$address, $street, $town, $lot 

The loop will be fed strings that may contain information like the following lines:

  • '123 any street, mytown'
  • 'Lot 4 another road, thattown'
  • 'Lot 2 96 other road, her town'
  • 'this ave, this town'
  • 'yourtown'

Since whatever comes after the comma is $town, I thought of

(.*), (.*) 

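Then the first capture could be checked with (Lot \d*) (.*), (.*). If the first capture starts with a number, it is the $address (if it is words separated by spaces, it is the $street). If it is a single word, it is just the $town.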

+0

See also http://stackoverflow.com/questions/642602/regular-expression-for-parsing-mailing-addresses, http://stackoverflow.com/questions/16413/parse-usable-street-address-city-state-zip-from-a-string, etc. – 2010-02-18 15:17:19

Answers

7

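Take a look at Geo::StreetAddress::US if these are US addresses.

Even if they are not, the source of that module should give you an idea of what is involved in parsing free-form street addresses.

Here is a script that handles the addresses you posted (updated; the earlier version merged the lot and the number into one string):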

#!/usr/bin/perl 

use strict; use warnings; 

local $/ = ""; 

my @addresses; 

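while (my $address = <DATA>) { 
    chomp $address; 
    # Collapse whitespace (including the newline inside a record), then trim
    $address =~ s/\s+/ /g; 
    $address =~ s/^ | $//g; 
    my (%address, $rest); 
    # Split on the last comma: reverse, split once, reverse the pieces back
    ($address{town}, $rest) = map { scalar reverse } 
        split(/ ?, ?/, reverse($address), 2); 

    { 
        # $rest is undef when the record contains no comma
        no warnings 'uninitialized'; 
        @address{qw(lot number street)} = 
            $rest =~ /^(?:(Lot [0-9]) )?(?:([0-9]+) )?(.+)\z/; 
    } 
    push @addresses, \%address; 
} 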

use Data::Dumper; 
print Dumper \@addresses; 

__DATA__ 
123 any street, 
mytown 

Lot 4 another road, 
thattown 

Lot 2 96 other road, 
her town 

yourtown 

street, 
town 

Output:

$VAR1 = [ 
      { 
      'lot' => undef, 
      'number' => '123', 
      'street' => 'any street', 
      'town' => 'mytown' 
      }, 
      { 
      'lot' => 'Lot 4', 
      'number' => undef, 
      'street' => 'another road', 
      'town' => 'thattown' 
      }, 
      { 
      'lot' => 'Lot 2', 
      'number' => '96', 
      'street' => 'other road', 
      'town' => 'her town' 
      }, 
      { 
      'lot' => undef, 
      'number' => undef, 
      'street' => undef, 
      'town' => 'yourtown' 
      }, 
      { 
      'lot' => undef, 
      'number' => undef, 
      'street' => 'street', 
      'town' => 'town' 
      } 
     ];
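
For readers who do not use Perl: the reverse/split/reverse step in the script above is a way of splitting on the last comma only. Here is a minimal sketch of the same idea in Python. The function name `split_town` and the two-comma sample address are my own illustrations, not part of the answer.

```python
def split_town(address):
    """Split an address on its last comma into (rest, town).

    With no comma, the whole string is treated as the town and rest is
    None, which mirrors the undef fields in the Perl output above.
    """
    # Collapse runs of whitespace (including newlines) into single spaces.
    address = " ".join(address.split())
    rest, sep, town = address.rpartition(",")
    if not sep:
        return None, address
    return rest.strip(), town.strip()


print(split_town("123 any street,\nmytown"))
print(split_town("yourtown"))
# Hypothetical input with two commas: only the last one separates the town.
print(split_town("Unit 3, 12 High St, Sometown"))
```

Splitting from the right keeps any earlier commas inside the street part, which is presumably why the Perl version reverses the string before calling split with a limit of 2.
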
7

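I suggest you don't try to do all of this in a single regular expression, because it is hard to verify that it is correct.

First, I would split on the comma. Whatever comes after the comma is $town; if there is no comma, the whole string is $town.

Then I would check whether there is any lot information and extract it from the string.

Then I would look for the street/avenue number and name.

Divide and conquer :)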
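
The three steps above could be sketched roughly as follows. This is only an illustration in Python, assuming the input looks like the five sample strings in the question; `parse_address`, `LOT_RE` and `NUMBER_RE` are names I made up, and the patterns deliberately ignore harder cases such as house numbers with letters.

```python
import re

LOT_RE = re.compile(r"^(Lot \d+)\b\s*")   # optional leading "Lot N"
NUMBER_RE = re.compile(r"^(\d+)\s+")      # optional leading house number


def parse_address(text):
    """Parse '[Lot N] [number] street, town' in three small steps."""
    result = {"lot": None, "number": None, "street": None, "town": None}

    # Step 1: whatever follows the last comma is the town;
    # with no comma, the whole string is the town.
    rest, sep, town = text.rpartition(",")
    result["town"] = town.strip()
    rest = rest.strip()
    if not sep or not rest:
        return result

    # Step 2: pull off any lot information.
    m = LOT_RE.match(rest)
    if m:
        result["lot"] = m.group(1)
        rest = rest[m.end():]

    # Step 3: a leading number, then whatever is left is the street name.
    m = NUMBER_RE.match(rest)
    if m:
        result["number"] = m.group(1)
        rest = rest[m.end():]
    result["street"] = rest or None
    return result


for line in ["123 any street, mytown",
             "Lot 4 another road, thattown",
             "Lot 2 96 other road, her town",
             "this ave, this town",
             "yourtown"]:
    print(parse_address(line))
```

Each step is small enough to test on its own, which is the point of not cramming everything into one expression.
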

1

This should separate it into 3 parts - how do you tell address and street apart?

(Lot \d*)? ?([^,]*,)? ?(.*) 

Here is what it gives for your examples:

('', '123 any street,', 'mytown') 
('Lot 4', 'another road,', 'thattown') 
('Lot 2', '96 other road,', 'her town') 
('', 'this ave,', 'this town') 
('', '', 'yourtown') 

If I understand the breakdown correctly, this one separates the address from the street as well:

(Lot \d*)? ?(\d*) ?([^,]*,)? ?(.*) 

('', '123', 'any street,', 'mytown') 
('Lot 4', '', 'another road,', 'thattown') 
('Lot 2', '96', 'other road,', 'her town') 
('', '', 'this ave,', 'this town') 
('', '', '', 'yourtown') 
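
To check the two patterns above against the sample strings, something like the following Python sketch can be used. It is an assumption on my part that the tuples were produced this way; `groups("")` is used so that optional groups that did not take part in the match show up as empty strings, as in the listings above. The names `THREE_PART`, `FOUR_PART` and `split_groups` are illustrative.

```python
import re

THREE_PART = re.compile(r"(Lot \d*)? ?([^,]*,)? ?(.*)")
FOUR_PART = re.compile(r"(Lot \d*)? ?(\d*) ?([^,]*,)? ?(.*)")

SAMPLES = [
    "123 any street, mytown",
    "Lot 4 another road, thattown",
    "Lot 2 96 other road, her town",
    "this ave, this town",
    "yourtown",
]


def split_groups(pattern, text):
    # Every part of both patterns is optional, so match() always succeeds.
    # groups("") replaces groups that did not participate with "".
    return pattern.match(text).groups("")


for pattern in (THREE_PART, FOUR_PART):
    for sample in SAMPLES:
        print(split_groups(pattern, sample))
    print()
```

Note that the street capture keeps its trailing comma, so it still has to be stripped afterwards.
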
+0

House numbers aren't that simple; they can have letters after them (or even before them, IIRC), or a 1/2 after them. – ysth 2010-02-18 17:28:19

+0

@ysth, we need test cases to cover those. Extending the regex isn't hard - guessing the requirements is. – 2010-02-18 19:45:18

0

I couldn't get the last one, but for the first 3 you can use something like this:

if (preg_match('/(?:Lot (\d*)|)(?: |)(?:(\d*)|) (.*), (.*)/m', $subject, $regs)) { 
    $result = $regs[1]; 
} else { 
    $result = ""; 
} 

This is the regex that was tested:

(?:Lot (\d*)|)(?: |)(?:(\d*)|) (.*), (.*) 

You can test it with RegexBuddy (link).
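
As a rough check of what this pattern captures, here is the same expression run through Python's `re.search`, which like `preg_match` looks for a match anywhere in the string (the `/m` flag has no effect here because the pattern has no `^` or `$`). The helper name `capture` is my own. Note that the PHP snippet above keeps only `$regs[1]`, the lot number; the other parts are in `$regs[2]` to `$regs[4]`.

```python
import re

PATTERN = re.compile(r"(?:Lot (\d*)|)(?: |)(?:(\d*)|) (.*), (.*)")


def capture(text):
    """Return (lot, number, street, town), or None if nothing matches."""
    m = PATTERN.search(text)
    return m.groups() if m else None


for sample in ["123 any street, mytown",
               "Lot 4 another road, thattown",
               "Lot 2 96 other road, her town",
               "this ave, this town",
               "yourtown"]:
    print(repr(sample), "->", capture(sample))
```

The first three samples come out as expected. For 'this ave, this town' the pattern requires a space before the street, so the match starts after 'this' and the street is reported as just 'ave'; 'yourtown' does not match at all because the pattern requires a comma.
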

0

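Geo::StreetAddress::US is good for simple addresses, but it can lose context on harder examples. It parses the street name until it finds a suburb. So with "46 7th St. Johns Park", the "St." is consumed too soon, the street type is wrongly assigned to 'Park', and 'CA' becomes the suburb.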

2 Smith St Suburb NJ 12345    2 Smith   St Suburb   NJ 12345 
25 MIRROR LAKE DR LITTLE EGG HARBOR 25 MIRROR LAKE DR Hbr NJ      0 
74B Old Bohema Rd N, St. Johns Park 74 B Old Bohema Rd St Johns Park CA 95472 
74 Mt Baw Baw Rd Suite C Some Park C 74 Mt Baw Baw Rd S Park CA      0 
74 Old Bohema Rd Bldg A Some Park CA 74 Old Bohema Rd B Park CA      0 
74 Old Bohema Rd Rm 123A Some Park C 74 Old Bohema Rd R Park CA      0 
Lot 74 Old Bohema Rd Some Park CA 95 0 Old Bohema Rd S Park CA      0 
22 Glen Alpine Way Some Park CA 9547 22 Glen Alpine Way Park CA      0 
4/6 Bohema Rd, St. Johns Park CA 954 4 6 Bohema  Rd St Johns Park CA 95472 
46 The Parade, St. Johns Park CA 954 46 The     Parade     0 
46 7th St. Johns Park CA 95472   46 7th St Johns Park CA      0 
46 B Avenue Johns Park CA 95472  46 B Avenue Johns Park CA      0 
46 Avenue C Johns Park CA 95472  46 Avenue C Johns Park CA      0 
46 Broadway Johns Park CA 95472  46 Broadway Johns Park CA      0 
46 State Route 19 Johns Park CA 9547 46 State Route 19 Park CA      0 
46 John F Kennedy Drive Johns Park C 46 John F Kennedy Park CA      0 
PO Box 213 Somewhere IO 1234   0 Somewhere   IO      0 
1 BEACH DR SE # 2410 ST PETERSBURG F 1 BEACH DR SE # 2 St PETERSBURG  FL 33701 
# 123 12 BEACH DR SE ST PETERSBURG F 12 BEACH DR SE  St PETERSBURG  FL 33701 
46 Broad Street #12 Suburb CA 95472 46 Broad   St       0 

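I have developed a Perl module that can recognise many of these harder patterns: https://metacpan.org/release/Lingua-EN-AddressParse. It recognises idioms such as "The Parade", nth Street, sub-property addresses such as "46 Broad Street #12", and so on.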