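Why doesn't this regular expression work the way I expect?

Like pretty much everyone else, I have only a passing knowledge of regular expressions.

Even so, I thought this one would be fairly straightforward, yet it isn't working the way I think it should.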

Section\s+(\d+\.\d+)\s+([^\n]+)

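As I read it, the expression above should match:

the word "Section",
followed by one or more whitespace characters,
followed by some digits, a dot, and some more digits,
followed by some whitespace,
followed by some text that does not include a newline

When I test my regex on Rubular like this, why doesn't it match any of the following?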

Section 2.1  Expenses of the Initial Public Offering  
Section 2.2  Termination of Professional Services Agreement  
Section 2.3  Individual Noteholders Fee  
Section 2.4  Proceeds to the Company  
Section 2.5  Repayment of Notes and Redemption of Preferred Stock

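Not for the first time, I realize there is something fundamental about regular expressions that I simply don't get. Would anyone care to enlighten me?

There are non-breaking space characters (U+00A0) in your string. These may not be matched by the regex "whitespace" shorthand (\s).

Non-breaking spaces are used in markup (for example HTML: &nbsp;) to indicate that no automatic line break should be inserted at that position.

Wikipedia reference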
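Since Rubular runs Ruby's regex engine, here is a minimal Ruby sketch of the difference. The sample line is hypothetical (I typed the U+00A0 characters in by hand as escapes), and swapping \s for the POSIX bracket class [[:space:]] is just one possible workaround, not necessarily the best one for your data:

```ruby
# Hypothetical sample: every gap uses U+00A0 (no-break space) instead of U+0020.
LINE = "Section\u00A02.1\u00A0\u00A0Expenses of the Initial Public Offering"

# Original pattern: in Ruby, \s only covers space, \t, \r, \n, \f and \v.
ASCII_WS = /Section\s+(\d+\.\d+)\s+([^\n]+)/

# Variant using the Unicode-aware POSIX bracket class, which includes U+00A0.
UNICODE_WS = /Section[[:space:]]+(\d+\.\d+)[[:space:]]+([^\n]+)/

puts "\\s version matched: #{!ASCII_WS.match(LINE).nil?}"

if (m = UNICODE_WS.match(LINE))
  puts "(#{m[1]}) <<#{m[2]}>>"
end
```

With this sample, the \s version should report no match, while the [[:space:]] version captures the section number and the title.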

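Using the link you provided, I noticed that if you "replace" the spaces on one line of the sample text (with ordinary spaces you type yourself), the regex matches that line. It almost looks like a bug in the regex checker?

To see what I mean, leave the sample as it is and use just \s+ as your regex. Not every space is matched.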
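If you want to see which "spaces" are the odd ones out, something like the following Ruby sketch could help. The sample string and the helper name whitespace_report are made up for illustration; the idea is simply to list the code point of every whitespace character and whether \s would match it:

```ruby
# Hypothetical sample: one ordinary space, then two no-break spaces.
SAMPLE = "Section 2.1\u00A0\u00A0Expenses"

# For every Unicode whitespace character in the text, return its code point
# label and whether Ruby's \s shorthand matches it.
def whitespace_report(text)
  text.each_char.select { |ch| ch.match?(/[[:space:]]/) }.map do |ch|
    [format("U+%04X", ch.ord), ch.match?(/\s/)]
  end
end

whitespace_report(SAMPLE).each do |code_point, matched|
  puts "#{code_point} matched by \\s: #{matched}"
end
```

Any character reported as U+00A0 with "false" is a no-break space that \s skips, which would explain why only some of the gaps are highlighted.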

In Perl, it works:

use strict;
use warnings;
my @list = ( "Section 2.1  Expenses of the Initial Public Offering",
             "Section 2.2  Termination of Professional Services Agreement",
             "Section 2.3  Individual Noteholders Fee",
             "Section 2.4  Proceeds to the Company",
             "Section 2.5  Repayment of Notes and Redemption of Preferred Stock",
           );
foreach my $item (@list)
{
    print "$item:\n($1) <<$2>>\n" if ($item =~ m/Section\s+(\d+\.\d+)\s+([^\n]+)/);
}

Output:

Section 2.1  Expenses of the Initial Public Offering:
(2.1) <<Expenses of the Initial Public Offering>>
Section 2.2  Termination of Professional Services Agreement:
(2.2) <<Termination of Professional Services Agreement>>
Section 2.3  Individual Noteholders Fee:
(2.3) <<Individual Noteholders Fee>>
Section 2.4  Proceeds to the Company:
(2.4) <<Proceeds to the Company>>
Section 2.5  Repayment of Notes and Redemption of Preferred Stock:
(2.5) <<Repayment of Notes and Redemption of Preferred Stock>>

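This leads me to infer that either you are not using Perl, or you are using Perl but have not embedded the expression into the match correctly. Of the two, I think it is more likely that you are not using Perl.

I modified the Perl script to read from standard input.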

while (<>)
{
    chomp;
    print "$_:\n";
    print "($1) <<$2>>\n" if ($_ =~ m/Section\s+(\d+\.\d+)\s+([^\n]+)/);
}

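When I feed it standard input that contains UTF-8 U+00A0 (0xC2 0xA0) in place of the spaces, Perl 5.14.1 on Mac OS X 10.7.1 does not match the regex either. However, when I adjust the script to include this line before the while loop, it works as expected: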

binmode(STDIN, ':utf8');
