Regex to match dimensions of objects



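I'll put it right out there: I'm terrible with regular expressions. I've tried to come up with one to solve my problem, but I really don't know much about them...

Imagine some sentences along the following lines:

  • Hello blah blah. It's around 11 1/2" x 32".
  • The dimensions are 8 x 10-3/5!
  • Probably somewhere in the region of 22" x 17".
  • The roll is quite large: 42 1/2" x 60 yd.
  • They are all 5.76 by 8 frames.
  • Yeah, maybe it's around 84cm long.
  • I think about 13/19".
  • No, it's probably 86 cm actually.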

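I want to extract the item dimensions from these sentences as cleanly as possible. In a perfect world, the regular expression would output the following:

  • 11 1/2" x 32"
  • 8 x 10-3/5
  • 22" x 17"
  • 42 1/2" x 60 yd
  • 5.76 by 8 frames
  • 84cm
  • 13/19"
  • 86 cm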

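I imagine a world where:

  • The following are valid units: {cm, mm, yd, yards, ", ', feet}, though I'd prefer a solution that handles an arbitrary set of units rather than an explicit solution for the units above.
  • A dimension is always described numerically. It may or may not be followed by a unit, and it may or may not have a fractional part. A dimension consisting of a fractional part on its own is allowed, e.g. 4/5".
  • A fractional part always has a / separating the numerator from the denominator, and one can assume there is no space between the parts (although if someone takes that into account, great!).
  • Dimensions may be one-dimensional or two-dimensional, in which case one can assume the following are acceptable for separating two dimensions: {x, by}. A dimension that is only one-dimensional must have a unit from the set above, i.e. 22 cm is OK, .333 is not, and neither is 4.33 oz.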
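
The rules above can be sketched as small building blocks. This is a minimal Python illustration, not a complete answer: the names `unit_alternation`, `NUMBER` and `one_dimension_regex` are made up for the example, and it only covers the one-dimensional case, where a unit is mandatory.

```python
import re

def unit_alternation(units):
    """Build an alternation from an arbitrary list of units."""
    parts = []
    # Longest first, so "yards" is tried before "yd".
    for u in sorted(units, key=len, reverse=True):
        # A unit ending in a letter must not run on into a longer word.
        tail = r"(?![A-Za-z])" if u[-1].isalpha() else ""
        parts.append(re.escape(u) + tail)
    return "(?:" + "|".join(parts) + ")"

# Mixed number ("11 1/2", "10-3/5"), bare fraction ("4/5"), or integer/decimal.
NUMBER = r"(?:\d+[ -]\d+/\d+|\d+/\d+|\d+(?:\.\d+)?)"

def one_dimension_regex(units):
    """A single dimension: a number followed by a mandatory unit."""
    return re.compile(NUMBER + r"\s*" + unit_alternation(units))

UNITS = ["cm", "mm", "yd", "yards", '"', "'", "feet"]
ONE_DIM = one_dimension_regex(UNITS)

for text in ["22 cm", '4/5"', ".333", "4.33 oz"]:
    found = ONE_DIM.search(text)
    print(repr(text), "->", found.group(0) if found else None)
```

With the unit list from the question, the first two inputs match in full and the last two give no match, because neither is followed by a listed unit.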

To show you how useless I am with regular expressions (and to show that I at least tried!), I got this far...

[1-9]+[/ ][x1-9]

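Update (2)

You folks are fast and efficient! I'm going to add a few extra test cases that are not covered by the regular expressions below:

  • The last but one test case is 12 yd x.
  • The last test case is 99 cm by.
  • This sentence doesn't have dimensions in it: 342 / 5553 / 222.
  • Three dimensions? 22" x 17" x 12 cm
  • This is a product code: c720 with another number 83 x better.
  • A number on its own 21.
  • A volume shouldn't match 0.332 oz.

These should give the following results (# means nothing should match):

  • 12 yd
  • 99 cm
  • #
  • 22" x 17" x 12 cm
  • #
  • #
  • #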

I have adapted M42's answer as follows:

\d+(?:\.\d+)?[\s-]*(?:\d+)?(?:\/\d+)?(?:cm|mm|yd|"|'|feet)(?:\s*x\s*|\s*by\s*)?(?:\d+(?:\.\d+)?[\s*-]*(?:\d+(?:\/\d+)?)?(?:cm|mm|yd|"|'|feet)?)?

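But while that resolves some of the new test cases, it now fails to match some of the others. It reports:

  • 11 1/2" x 32" PASS
  • (nothing) FAIL
  • 22" x 17" PASS
  • 42 1/2" x 60 yd PASS
  • (nothing) FAIL
  • 84cm PASS
  • 13/19" PASS
  • 86 cm PASS
  • 22" PASS
  • (nothing) FAIL
  • (nothing) FAIL
  • 12 yd x FAIL
  • 99 cm by FAIL
  • 22" x 17" [and also, but separately, '12 cm'] FAIL
  • PASS
  • PASS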

New version, close to the target: 2 tests fail.

#!/usr/local/bin/perl 
use Modern::Perl;
use Test::More;
my $re1 = qr/\d+(?:\.\d+)?[\s-]*(?:\d+)?(?:\/\d+)?(?:cm|mm|yd|"|'|feet)/;
my $re2 = qr/(?:\s*x\s*|\s*by\s*)/;
my $re3 = qr/\d+(?:\.\d+)?[\s-]*(?:\d+)?(?:\/\d+)?(?:cm|mm|yd|"|'|feet|frames)/;
my @out = (
'11 1/2" x 32"',
'8 x 10-3/5',
'22" x 17"',
'42 1/2" x 60 yd',
'5.76 by 8 frames',
'84cm',
'13/19"',
'86 cm',
'12 yd',
'99 cm',
'no match',
'22" x 17" x 12 cm',
'no match',
'no match',
'no match',
);
my $i = 0;
my $xx = '22" x 17"';
while(<DATA>) {
    chomp;
    if (/($re1(?:$re2$re3)?(?:$re2$re1)?)/) {
        ok($1 eq $out[$i], $1 . ' in ' . $_);
    } else {
        ok($out[$i] eq 'no match', ' got "no match" in '.$_);
    }
    $i++;
}
done_testing;

__DATA__
Hello blah blah. It's around 11 1/2" x 32".
The dimensions are 8 x 10-3/5!
Probably somewhere in the region of 22" x 17".
The roll is quite large: 42 1/2" x 60 yd.
They are all 5.76 by 8 frames.
Yeah, maybe it's around 84cm long.
I think about 13/19".
No, it's probably 86 cm actually.
The last but one test case is 12 yd x.
The last test case is 99 cm by.
This sentence doesn't have dimensions in it: 342 / 5553 / 222.
Three dimensions? 22" x 17" x 12 cm
This is a product code: c720 with another number 83 x better.  
A number on its own 21.
A volume shouldn't match 0.332 oz.

Output:

#   Failed test ' got "no match" in The dimensions are 8 x 10-3/5!'
#   at C:testsperltest6.pl line 42.
#   Failed test ' got "no match" in They are all 5.76 by 8 frames.'
#   at C:testsperltest6.pl line 42.
# Looks like you failed 2 tests of 15.
ok 1 - 11 1/2" x 32" in Hello blah blah. It's around 11 1/2" x 32".
not ok 2 -  got "no match" in The dimensions are 8 x 10-3/5!
ok 3 - 22" x 17" in Probably somewhere in the region of 22" x 17".
ok 4 - 42 1/2" x 60 yd in The roll is quite large: 42 1/2" x 60 yd.
not ok 5 -  got "no match" in They are all 5.76 by 8 frames.
ok 6 - 84cm in Yeah, maybe it's around 84cm long.
ok 7 - 13/19" in I think about 13/19".
ok 8 - 86 cm in No, it's probably 86 cm actually.
ok 9 - 12 yd in The last but one test case is 12 yd x.
ok 10 - 99 cm in The last test case is 99 cm by.
ok 11 -  got "no match" in This sentence doesn't have dimensions in it: 342 / 5553 / 222.
ok 12 - 22" x 17" x 12 cm in Three dimensions? 22" x 17" x 12 cm
ok 13 -  got "no match" in This is a product code: c720 with another number 83 x better.  
ok 14 -  got "no match" in A number on its own 21.
ok 15 -  got "no match" in A volume shouldn't match 0.332 oz.
1..15

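It seems difficult to match 5.76 by 8 frames but not 0.332 oz: sometimes you have to match numbers that have units, and sometimes numbers that have none.

I'm sorry, I can't do any better.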
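
One way around that difficulty is to split the pattern into two alternatives: a run of two or more numbers joined by x or by, where units are optional, and a lone number, where the unit is mandatory. The sketch below is in Python rather than Perl, and it is only one possible approach. The names `build_dimension_regex` and `extract` are hypothetical, and "frames" is added to the unit list as an assumption, much as `$re3` does above.

```python
import re

def build_dimension_regex(units, separators=("x", "by")):
    """Compile a pattern for multi-part dimensions or a single number with a unit."""
    parts = []
    # Longest unit first, so "yards" is tried before "yd".
    for u in sorted(units, key=len, reverse=True):
        # A unit ending in a letter must not run on into a longer word.
        tail = r"(?![A-Za-z])" if u[-1].isalpha() else ""
        parts.append(re.escape(u) + tail)
    unit = r"\s*(?:" + "|".join(parts) + ")"
    # Mixed number, bare fraction, or integer/decimal.
    number = r"(?:\d+[ -]\d+/\d+|\d+/\d+|\d+(?:\.\d+)?)"
    sep = r"\s*(?:" + "|".join(re.escape(s) for s in separators) + r")\s*"
    maybe_unit = number + "(?:" + unit + ")?"
    # Alternative 1: two or more numbers joined by a separator, units optional.
    multi = maybe_unit + "(?:" + sep + maybe_unit + ")+"
    # Alternative 2: a lone number, unit mandatory.
    single = number + unit
    # The lookbehind keeps a match from starting inside "c720" or "0.332".
    return re.compile(r"(?<![\w./])(?:" + multi + "|" + single + ")")

# The question's units plus "frames" (an assumption made for this example).
UNITS = ["cm", "mm", "yd", "yards", '"', "'", "feet", "frames"]
DIMENSION = build_dimension_regex(UNITS)

def extract(sentence):
    """Return the first dimension found in the sentence, or None."""
    found = DIMENSION.search(sentence)
    return found.group(0) if found else None

for sentence in [
    "The dimensions are 8 x 10-3/5!",
    "They are all 5.76 by 8 frames.",
    'Three dimensions? 22" x 17" x 12 cm',
    "This is a product code: c720 with another number 83 x better.",
    "A volume shouldn't match 0.332 oz.",
]:
    print(extract(sentence))
```

Because a unit-less number is only accepted when a separator and a second number follow it, 8 x 10-3/5 and 5.76 by 8 frames are matched while 0.332 oz, 21. and 83 x better are not.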

One of many possible solutions (it should be NLP compatible, as it uses only basic regex syntax):

foundMatch = Regex.IsMatch(SubjectString, @"\d+(?: |cm|\.|""|/)[\d/""x -]*(?:\b(?:by\s*\d+|cm|yd)\b)?");

This will get you your results :)

Explanation:

"
\d                # Match a single digit 0..9
   +              # Between one and unlimited times, as many times as possible, giving back as needed (greedy)
(?:               # Match the regular expression below
                  # Match either the regular expression below (attempting the next alternative only if this one fails)
      \           # Match the character “ ” literally
   |              # Or match regular expression number 2 below (attempting the next alternative only if this one fails)
      cm          # Match the characters “cm” literally
   |              # Or match regular expression number 3 below (attempting the next alternative only if this one fails)
      \.          # Match the character “.” literally
   |              # Or match regular expression number 4 below (attempting the next alternative only if this one fails)
      ""          # Match the character “""” literally
   |              # Or match regular expression number 5 below (the entire group fails if this one fails to match)
      /           # Match the character “/” literally
)
[\d/""x\ -]       # Match a single character present in the list below
                  # A single digit 0..9
                  # One of the characters “/""x”
                  # The character “ ”
                  # The character “-”
   *              # Between zero and unlimited times, as many times as possible, giving back as needed (greedy)
(?:               # Match the regular expression below
   \b             # Assert position at a word boundary
   (?:            # Match the regular expression below
                  # Match either the regular expression below (attempting the next alternative only if this one fails)
         by       # Match the characters “by” literally
         \s       # Match a single character that is a “whitespace character” (spaces, tabs, line breaks, etc.)
            *     # Between zero and unlimited times, as many times as possible, giving back as needed (greedy)
         \d       # Match a single digit 0..9
            +     # Between one and unlimited times, as many times as possible, giving back as needed (greedy)
      |           # Or match regular expression number 2 below (attempting the next alternative only if this one fails)
         cm       # Match the characters “cm” literally
      |           # Or match regular expression number 3 below (the entire group fails if this one fails to match)
         yd       # Match the characters “yd” literally
   )
   \b             # Assert position at a word boundary
)?                # Between zero and one times, as many times as possible, giving back as needed (greedy)
"

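To try this pattern outside .NET, it can be written as a Python raw string. This is a sketch that assumes the two engines treat these particular constructs the same way; the doubled "" of the C# verbatim string becomes a single ", and `first_match` is a hypothetical helper.

```python
import re

# The C# pattern as a Python raw string ("" in the verbatim string is one ").
PATTERN = re.compile(r'\d+(?: |cm|\.|"|/)[\d/"x -]*(?:\b(?:by\s*\d+|cm|yd)\b)?')

def first_match(sentence):
    """Return the first match in the sentence, or None."""
    found = PATTERN.search(sentence)
    return found.group(0) if found else None

print(first_match("The dimensions are 8 x 10-3/5!"))   # 8 x 10-3/5
print(first_match("They are all 5.76 by 8 frames."))   # 5.76 by 8
print(first_match("A number on its own 21."))          # 21.
```

The pattern does not require a unit, so it also picks up the bare 21. from the question's update, and since "frames" is not among its alternatives that match stops at 5.76 by 8.
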
This is as far as I could get with a regex in 'Perl'. Try to adapt it to your regex flavour:

\d.*\d(?:\s+\S+|\S+)

Explanation:

\d        # One digit.
.*        # Any number of characters.
\d        # One digit. Together these match everything between the first and the last digit.
\s+\S+    # Non-space characters after some whitespace. It tries to match a unit like 'cm' or 'yd'.
|         # Or. Selects one of the two expressions between the parentheses.
\S+       # Any number of non-space characters. It tries to match double quotes, or units joined to the
          # last number.

My test:

script.pl:

use warnings;
use strict;
while ( <DATA> ) {
        print qq[$1\n] if m/(\d.*\d(\s+\S+|\S+))/
}
__DATA__
Hello blah blah. It's around 11 1/2" x 32".
The dimensions are 8 x 10-3/5!
Probably somewhere in the region of 22" x 17".
The roll is quite large: 42 1/2" x 60 yd.
They are all 5.76 by 8 frames.
Yeah, maybe it's around 84cm long.
I think about 13/19".
No, it's probably 86 cm actually.

Running the script:

perl script.pl
Result:

11 1/2" x 32".
8 x 10-3/5!
22" x 17".
42 1/2" x 60 yd.
5.76 by 8 frames.
84cm
13/19".
86 cm
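
Several of the results above keep the sentence's closing punctuation. A small post-processing step can trim it, sketched here in Python rather than Perl. The helper `extract_trimmed` is hypothetical, and stripping '.', '!' and ',' is an assumption about which trailing characters are never part of a unit.

```python
import re

# Same greedy pattern as the Perl one-liner above.
GREEDY = re.compile(r"\d.*\d(?:\s+\S+|\S+)")

def extract_trimmed(sentence):
    """Return the greedy match without trailing sentence punctuation, or None."""
    found = GREEDY.search(sentence)
    if not found:
        return None
    # Assumption: these trailing characters belong to the sentence, not the unit.
    return found.group(0).rstrip(".!,")

print(extract_trimmed('I think about 13/19".'))               # 13/19"
print(extract_trimmed("The dimensions are 8 x 10-3/5!"))      # 8 x 10-3/5
print(extract_trimmed("Yeah, maybe it's around 84cm long."))  # 84cm
```

The pattern never checks for a unit, so a sentence such as "A number on its own 21." still produces a match (21 after trimming).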
