2011-12-08 52 views
5

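Regular expression to match object dimensions

I'll put it right out there: I'm terrible with regular expressions. I've tried to come up with one to solve my problem, but I really don't know much about them.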

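Imagine some sentences along the lines of the following:

  • Hello blah blah. It's around 11 1/2" x 32".
  • The dimensions are 8 x 10-3/5!
  • Probably somewhere in the region of 22" x 17".
  • The roll is quite large: 42 1/2" x 60 yd.
  • They are all 5.76 by 8 frames.
  • Yeah, maybe it's around 84cm long.
  • I think about 13/19".
  • No, it's probably 86 cm actually.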

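I want to extract the item dimensions from these sentences as cleanly as possible. In an ideal world, the regular expression would output the following:

  • 11 1/2" x 32"
  • 8 x 10-3/5
  • 22" x 17"
  • 42 1/2" x 60 yd
  • 5.76 by 8 frames
  • 84cm
  • 13/19"
  • 86 cm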

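I imagine a world where the following rules apply:

  • The following are valid units: {cm, mm, yd, yards, ", ', feet}, though I'd prefer a solution that works for an arbitrary set of units rather than one written explicitly for the units above.
  • A dimension is always described numerically, may or may not have a unit following it, and may or may not have a fractional or decimal part. A fractional part on its own is allowed, e.g. 4/5".
  • Fractional parts always have a / separating the numerator and denominator, and you can assume there is no space between the parts (though if someone takes that into account, that's great!).
  • Dimensions may be one-dimensional or two-dimensional. In the two-dimensional case you can assume the following are acceptable separators between the two dimensions: {x, by}. If a dimension is only one-dimensional, it must have a unit from the set above, i.e. 22 cm is OK, .333 is not, and nor is 4.33 oz.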
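The first three rules can be sketched in Python (used here only for illustration, since the question does not name a language). This is a minimal sketch rather than a definitive pattern: the names UNITS, NUMBER, unit_pattern and measurement_regex are hypothetical, and it covers only a single measurement whose unit is mandatory. The unit alternation is built from a list, so a different set of units can be swapped in.

```python
import re

# Hypothetical unit list copied from the question; replace with any other set.
UNITS = ["cm", "mm", "yd", "yards", '"', "'", "feet"]

# A number is either a bare fraction (4/5) or an integer/decimal with an
# optional fraction attached by a space or hyphen (11 1/2, 10-3/5, 5.76).
NUMBER = r"(?:\d+/\d+|\d+(?:\.\d+)?(?:[ -]\d+/\d+)?)"


def unit_pattern(units):
    """Alternation of escaped units, longest first so that a short unit
    can never shadow a longer one that starts the same way."""
    ordered = sorted(units, key=len, reverse=True)
    return "(?:" + "|".join(re.escape(u) for u in ordered) + ")"


def measurement_regex(units):
    """One number followed by a mandatory unit. The lookahead stops a unit
    from matching the first letters of a longer word."""
    return re.compile(NUMBER + r"\s*" + unit_pattern(units) + r"(?![A-Za-z])")
```

For example, measurement_regex(UNITS).search("maybe it's around 84cm long") finds 84cm, while a sentence containing only 0.332 oz yields no match because oz is not in the list.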

To show you how useless I am with regular expressions (and to show that I have at least tried!), this is as far as I got:

[1-9]+[/ ][x1-9] 

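Update (2)

You guys are very fast and efficient! I'm going to add a few more test cases that the regexes below do not cover:

  • The last but one test case is 12 yd x.
  • The last test case is 99 cm by.
  • This sentence doesn't have dimensions in it: 342/5553/222.
  • Three dimensions? 22" x 17" x 12 cm
  • This is a product code: c720 with another number 83 x better.
  • A number on its own 21.
  • A volume shouldn't match 0.332 oz.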

These should result in the following (# indicates that nothing should match):

  • 12 yd
  • 99 cm
  • #
  • 22" x 17" x 12 cm
  • #
  • #
  • #

I adapted M42's answer below to:

\d+(?:\.\d+)?[\s-]*(?:\d+)?(?:\/\d+)?(?:cm|mm|yd|"|'|feet)(?:\s*x\s*|\s*by\s*)?(?:\d+(?:\.\d+)?[\s*-]*(?:\d+(?:\/\d+)?)?(?:cm|mm|yd|"|'|feet)?)? 

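However, although it solves some of the new test cases, it now fails to match some of the others. It reports:

  • 11 1/2" x 32" PASS
  • (nothing) FAIL
  • 22" x 17" PASS
  • 42 1/2" x 60 yd PASS
  • (nothing) FAIL
  • 84cm PASS
  • 13/19" PASS
  • 86 cm PASS
  • 22" PASS
  • (nothing) FAIL
  • (nothing) FAIL

  • 12 yd x FAIL

  • 99 cm by FAIL
  • 22" x 17" [and also, but separately, '12 cm'] FAIL
  • PASS
  • PASS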
+0

Could you please provide the input strings and the expected output? – Toto

+0

Sure. I've provided them here in a simpler format: http://pastebin.com/txfJs8LX Many thanks! – Edwardr

Answers

5

New version, close to the target, with 2 failed tests:

#!/usr/local/bin/perl 
use Modern::Perl; 
use Test::More; 

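my $re1 = qr/\d+(?:\.\d+)?[\s-]*(?:\d+)?(?:\/\d+)?(?:cm|mm|yd|"|'|feet)/; # number, optional fraction, mandatory unit
my $re2 = qr/(?:\s*x\s*|\s*by\s*)/; # separator between dimensions: "x" or "by"
my $re3 = qr/\d+(?:\.\d+)?[\s-]*(?:\d+)?(?:\/\d+)?(?:cm|mm|yd|"|'|feet|frames)/; # same as $re1, but also accepts "frames"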
my @out = (
'11 1/2" x 32"', 
'8 x 10-3/5', 
'22" x 17"', 
'42 1/2" x 60 yd', 
'5.76 by 8 frames', 
'84cm', 
'13/19"', 
'86 cm', 
'12 yd', 
'99 cm', 
'no match', 
'22" x 17" x 12 cm', 
'no match', 
'no match', 
'no match', 
); 
my $i = 0; 
my $xx = '22" x 17"'; 
while(<DATA>) { 
    chomp; 
    if (/($re1(?:$re2$re3)?(?:$re2$re1)?)/) { 
     ok($1 eq $out[$i], $1 . ' in ' . $_); 
    } else { 
     ok($out[$i] eq 'no match', ' got "no match" in '.$_); 
    } 
    $i++; 
} 
done_testing; 


__DATA__ 
Hello blah blah. It's around 11 1/2" x 32". 
The dimensions are 8 x 10-3/5! 
Probably somewhere in the region of 22" x 17". 
The roll is quite large: 42 1/2" x 60 yd. 
They are all 5.76 by 8 frames. 
Yeah, maybe it's around 84cm long. 
I think about 13/19". 
No, it's probably 86 cm actually. 
The last but one test case is 12 yd x. 
The last test case is 99 cm by. 
This sentence doesn't have dimensions in it: 342/5553/222. 
Three dimensions? 22" x 17" x 12 cm 
This is a product code: c720 with another number 83 x better. 
A number on its own 21. 
A volume shouldn't match 0.332 oz. 

Output:

# Failed test ' got "no match" in The dimensions are 8 x 10-3/5!' 
# at C:\tests\perl\test6.pl line 42. 
# Failed test ' got "no match" in They are all 5.76 by 8 frames.' 
# at C:\tests\perl\test6.pl line 42. 
# Looks like you failed 2 tests of 15. 
ok 1 - 11 1/2" x 32" in Hello blah blah. It's around 11 1/2" x 32". 
not ok 2 - got "no match" in The dimensions are 8 x 10-3/5! 
ok 3 - 22" x 17" in Probably somewhere in the region of 22" x 17". 
ok 4 - 42 1/2" x 60 yd in The roll is quite large: 42 1/2" x 60 yd. 
not ok 5 - got "no match" in They are all 5.76 by 8 frames. 
ok 6 - 84cm in Yeah, maybe it's around 84cm long. 
ok 7 - 13/19" in I think about 13/19". 
ok 8 - 86 cm in No, it's probably 86 cm actually. 
ok 9 - 12 yd in The last but one test case is 12 yd x. 
ok 10 - 99 cm in The last test case is 99 cm by. 
ok 11 - got "no match" in This sentence doesn't have dimensions in it: 342/5553/222. 
ok 12 - 22" x 17" x 12 cm in Three dimensions? 22" x 17" x 12 cm 
ok 13 - got "no match" in This is a product code: c720 with another number 83 x better. 
ok 14 - got "no match" in A number on its own 21. 
ok 15 - got "no match" in A volume shouldn't match 0.332 oz. 
1..15 

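It seems difficult to match 5.76 by 8 frames but not 0.332 oz: sometimes you have to match numbers with units, and sometimes numbers without units.

Sorry, I can't do better.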
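One possible way to reconcile the two cases, sketched in Python under stated assumptions: treat a match as either two or more numbers joined by x/by (units optional on each), or a single number with a mandatory unit. This is not the answer's Perl pattern. The names UNITS, DIMENSION and extract are hypothetical, "frames" is added to the unit list because the test data uses it, and "oz" is deliberately left out so that volumes are rejected.

```python
import re

# Assumed unit list: the question's units plus "frames" from the test data.
UNITS = ["cm", "mm", "yd", "yards", '"', "'", "feet", "frames"]

unit = "(?:" + "|".join(
    re.escape(u) for u in sorted(UNITS, key=len, reverse=True)
) + ")(?![A-Za-z])"  # a unit must not run on into a longer word

# Bare fraction, or integer/decimal with an optional " 1/2" or "-3/5" part.
num = r"(?:\d+/\d+|\d+(?:\.\d+)?(?:[ -]\d+/\d+)?)"
sep = r"\s*(?:x|by)\s*"

opt_unit = f"{num}(?:\\s*{unit})?"  # number, unit optional
req_unit = f"{num}\\s*{unit}"       # number, unit mandatory

DIMENSION = re.compile(
    r"(?<![\w./])"                        # do not start in the middle of a token
    f"(?:{opt_unit}(?:{sep}{opt_unit})+"  # two or more dimensions
    f"|{req_unit})"                       # or one dimension with a unit
)


def extract(sentence):
    """Return the first dimension found in the sentence, or None."""
    match = DIMENSION.search(sentence)
    return match.group(0) if match else None
```

Because the multi-dimension alternative is tried first and requires a number after every separator, a dangling separator as in "12 yd x." falls through to the single-dimension alternative and yields only 12 yd.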

+0

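This matches everything, including the 12 yd after 23.3. But how could it be improved to avoid the following? "12 yd x" currently matches with your regex, but I think it would be best if only 12 yd were matched in that case. Thanks! – Edwardr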

+0

I tried to adapt your answer to some more general cases, but failed. I've updated the question accordingly. – Edwardr

2

One of many possible solutions (it should be NLP compatible, since it uses only basic regex syntax):

foundMatch = Regex.IsMatch(SubjectString, @"\d+(?: |cm|\.|""|/)[\d/""x -]*(?:\b(?:by\s*\d+|cm|yd)\b)?"); 

This will get you your results :)

Explanation:

" 
\d    # Match a single digit 0..9 
    +    # Between one and unlimited times, as many times as possible, giving back as needed (greedy) 
(?:   # Match the regular expression below 
        # Match either the regular expression below (attempting the next alternative only if this one fails) 
     \   # Match the character “ ” literally 
    |    # Or match regular expression number 2 below (attempting the next alternative only if this one fails) 
     cm   # Match the characters “cm” literally 
    |    # Or match regular expression number 3 below (attempting the next alternative only if this one fails) 
     \.   # Match the character “.” literally 
    |    # Or match regular expression number 4 below (attempting the next alternative only if this one fails) 
     ""   # Match the character “""” literally 
    |    # Or match regular expression number 5 below (the entire group fails if this one fails to match) 
    /   # Match the character “/” literally 
) 
[\d/""x -]  # Match a single character present in the list below 
        # A single digit 0..9 
        # One of the characters “/""x” 
        # The character “ ” 
        # The character “-” 
    *    # Between zero and unlimited times, as many times as possible, giving back as needed (greedy) 
(?:    # Match the regular expression below 
    \b    # Assert position at a word boundary 
    (?:   # Match the regular expression below 
        # Match either the regular expression below (attempting the next alternative only if this one fails) 
     by  # Match the characters “by” literally 
     \s  # Match a single character that is a “whitespace character” (spaces, tabs, line breaks, etc.) 
      *  # Between zero and unlimited times, as many times as possible, giving back as needed (greedy) 
     \d  # Match a single digit 0..9 
      +  # Between one and unlimited times, as many times as possible, giving back as needed (greedy) 
     |   # Or match regular expression number 2 below (attempting the next alternative only if this one fails) 
     cm  # Match the characters “cm” literally 
     |   # Or match regular expression number 3 below (the entire group fails if this one fails to match) 
     yd  # Match the characters “yd” literally 
    ) 
    \b    # Assert position at a word boundary 
)?    # Between zero and one times, as many times as possible, giving back as needed (greedy) 
" 
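To see what this pattern actually captures, here is a small Python port (a sketch, not part of the original answer; PATTERN and first_match are hypothetical names). The C# snippet calls Regex.IsMatch, which only reports whether a match exists, so extracting the text needs a search that returns the match. The doubled "" of the C# verbatim string becomes a single " here.

```python
import re

# Python port of the C# pattern above; behaviour on these inputs is the same.
PATTERN = re.compile(r'\d+(?: |cm|\.|"|/)[\d/"x -]*(?:\b(?:by\s*\d+|cm|yd)\b)?')


def first_match(text):
    """Return the first substring matched by PATTERN, or None."""
    m = PATTERN.search(text)
    return m.group(0) if m else None
```

On the question's first sentence this returns 11 1/2" x 32". It stops at 5.76 by 8 for the "frames" sentence, returns only 12 cm for a hypothetical input such as "12 cm x 5 cm" (the first dimension ends in a unit), and accepts the bare number in "A number on its own 21." as 21. including the full stop.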
+0

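Wow, thanks! It doesn't quite match all of the cases I have in mind. For example, if the first dimension ends in mm, cm, yd, etc., it doesn't match. I think I can work out how to adapt it. :-) – Edwardr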

+0

@Edwardr I used your examples, but you can extend it, I guess :) – FailedDev

1

This is all I could get with a regex in 'Perl'. Try to adapt it to your regex flavour:

\d.*\d(?:\s+\S+|\S+) 

Explanation:

\d  # One digit. 
.*  # Any number of characters. 
\d  # One digit. All joined means to find all content between first and last digit. 
\s+\S+ # Some whitespace followed by non-space characters. It tries to match any unit like 'cm' or 'yd'.
|   # Or. Select one of two expressions between parentheses. 
\S+  # Any number of non-space characters. It tries to match double-quotes, or units joined to the 
      # last number. 
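A quick Python check of how this greedy pattern behaves (a sketch; GREEDY and grab are hypothetical names, and the inputs are the sentences from the question). Because .* runs from the first digit to the last digit of the line, the pattern needs at least two digits and has no notion of units, so it also matches the cases the question wants rejected.

```python
import re

# Same pattern as above, with the final group made non-capturing.
GREEDY = re.compile(r"\d.*\d(?:\s+\S+|\S+)")


def grab(sentence):
    """Return the text matched by GREEDY in the sentence, or None."""
    m = GREEDY.search(sentence)
    return m.group(0) if m else None
```

For instance, grab("Yeah, maybe it's around 84cm long.") gives 84cm, but grab("A number on its own 21.") gives 21. and the product-code sentence gives 720 with another number 83 x. A hypothetical single-digit input such as "about 8 cm" is not matched at all.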

My test:

Content of script.pl:

use warnings; 
use strict; 

while (<DATA>) { 
     print qq[$1\n] if m/(\d.*\d(\s+\S+|\S+))/ 
} 

__DATA__ 
Hello blah blah. It's around 11 1/2" x 32". 
The dimensions are 8 x 10-3/5! 
Probably somewhere in the region of 22" x 17". 
The roll is quite large: 42 1/2" x 60 yd. 
They are all 5.76 by 8 frames. 
Yeah, maybe it's around 84cm long. 
I think about 13/19". 
No, it's probably 86 cm actually. 

Running the script:

perl script.pl 

Result:

11 1/2" x 32". 
8 x 10-3/5! 
22" x 17". 
42 1/2" x 60 yd. 
5.76 by 8 frames. 
84cm 
13/19". 
86 cm