What is the correct syntax for filtering out the MIDDLE DOT Unicode character with a Perl regular expression?

I'm trying to work out the correct syntax to filter the MIDDLE DOT Unicode character (U+00B7) out of a string while keeping the original string:

     $_ =~ s/test_of_character (.*[^\x{00b7}])/$1/gi;

With the code above, I can't see how to keep the original string before removing the middle dot from it.

To remove all Unicode MIDDLE DOT characters from a string, you can write

s/\N{MIDDLE DOT}//g

or

tr/\N{MIDDLE DOT}//d
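
As a minimal sketch of the two forms above side by side (the sample string and the variable names `$text`, `$via_s`, `$via_tr`, and `$deleted` are made up for illustration), the following applies each one to its own copy of the same text. It assumes the script file is saved as UTF-8, which is why `use utf8` is enabled.

```perl
#!/usr/bin/env perl
use strict;
use warnings;
use utf8;                    # the literal below is written in UTF-8
use charnames ':full';       # only needed before Perl 5.16; harmless later

# Hypothetical sample text containing three MIDDLE DOT characters.
my $text = "a·b·c·d";

# s///g removes every match from the copy.
(my $via_s = $text) =~ s/\N{MIDDLE DOT}//g;

# tr///d deletes the listed characters and returns how many it removed.
my $via_tr  = $text;
my $deleted = ($via_tr =~ tr/\N{MIDDLE DOT}//d);

print "via s: $via_s\n";
print "via tr: $via_tr ($deleted removed)\n";
```

Both copies end up identical; `tr` is usually the lighter choice for deleting single characters, while `s///` is needed once the pattern is more than a character list.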

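It isn't clear to me what "keeping the original string" means, but if you want to leave $_ unchanged and remove the MIDDLE DOT characters from a copy of it, you can write

(my $modified = $_) =~ s/\N{MIDDLE DOT}//g

or

my $modified = s/\N{MIDDLE DOT}//gr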

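If you're using Perl with Unicode, you should read the following manual pages:

perlunicode (Perl Unicode support)
perluniintro (Perl Unicode introduction)
perlunitut (Perl Unicode tutorial)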

The first shows that you can write a Unicode code point such as U+00B7 using the notation:

\N{U+00B7}

You can also use the Unicode character name:

\N{MIDDLE DOT}
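
A quick way to convince yourself that the two notations denote the same character is sketched below. The variable names `$by_code` and `$by_name` are illustrative only; `charnames::viacode` is the standard function for looking up a character name from a code point.

```perl
#!/usr/bin/env perl
use strict;
use warnings;
use charnames ':full';       # enables \N{NAME} and charnames::viacode

my $by_code = "\N{U+00B7}";        # written by code point
my $by_name = "\N{MIDDLE DOT}";    # written by character name

my $same = ($by_code eq $by_name) ? "yes" : "no";
my $name = charnames::viacode(0xB7);

printf "same character: %s\n", $same;
printf "code point: U+%04X\n", ord($by_name);
printf "name: %s\n", $name;
```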

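The rest is basic regular expression handling. If you need to preserve the original string and your Perl is modern enough, you can use the /r modifier on the substitution (added in Perl 5.14.0). Alternatively, for older versions of Perl, you can copy the string and edit the copy, as with $altans below.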

#!/usr/bin/env perl
use strict;
use warnings;
use feature 'unicode_strings';
use utf8;                     # this source file contains UTF-8 literals
binmode(STDOUT, ":utf8");     # encode output as UTF-8
my $string = "This is some text with a ·•· middle dot or four \N{U+00B7}\N{MIDDLE DOT} in it";
print "string = $string\n";
# /r returns the modified copy and leaves $string untouched (Perl 5.14+)
my $answer = ($string =~ s/\N{MIDDLE DOT}//gr);
# Copy first, then edit the copy (works on older Perls too)
my $altans;
($altans = $string) =~ s/\N{U+00B7}//g;
# Fix grammar!
$answer =~ s/\ba\b/no/;
$answer =~ s/ or four //;
print "string = $string\n";
print "answer = $answer\n";
print "altans = $altans\n";

Output:

string = This is some text with a ·•· middle dot or four ·· in it
string = This is some text with a ·•· middle dot or four ·· in it
answer = This is some text with no • middle dot in it
altans = This is some text with a • middle dot or four  in it

Note that the larger dot in the middle is U+2022, BULLET.

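ikegami pointed out in a comment:

Note that \x{00B7} and \xB7 will match the same character as \N{U+00B7}.

Indeed they do, as this extension of the code above shows: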

#!/usr/bin/env perl
use strict;
use warnings;
use feature 'unicode_strings';
use utf8;
binmode(STDOUT, ":utf8");
my $string = "This is some text with a ·•· middle dot or four \N{U+00B7}\N{MIDDLE DOT} in it";
print "string = $string\n";
my $answer = ($string =~ s/\N{MIDDLE DOT}//gr);
my $altans;
($altans = $string) =~ s/\N{U+00B7}//g;
# Fix grammar!
$answer =~ s/\ba\b/no/;
$answer =~ s/ or four //;
print "string = $string\n";
print "answer = $answer\n";
print "altans = $altans\n";
# \xB7 matches MIDDLE DOT as well
my $extan1 = $string;
$extan1 =~ s/\xB7//g;
print "extan1 = $extan1\n";
# \x{...} works for any code point: MIDDLE DOT, the letter "e", and BULLET
my $extan2 = $string;
$extan2 =~ s/\x{00B7}//g;
$extan2 =~ s/\x{0065}//g;
$extan2 =~ s/\x{2022}//g;
print "extan2 = $extan2\n";

Output:

string = This is some text with a ·•· middle dot or four ·· in it
string = This is some text with a ·•· middle dot or four ·· in it
answer = This is some text with no • middle dot in it
altans = This is some text with a • middle dot or four  in it
extan1 = This is some text with a • middle dot or four  in it
extan2 = This is som txt with a  middl dot or four  in it

This is Perl: TMTOWTDI. There's more than one way to do it!

Here is a general answer that uses your own regex, slightly modified:

$_ =~ s/([^\x{00b7}]*+)\x{00b7}+/$1/g;

The inverse (preferred) equivalent is

$_ =~ s/\x{00b7}+//g;
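
To check that the two substitutions above really produce the same result, a small comparison can be sketched as follows. The sample strings and the names `@samples`, `$general`, `$simple`, and `$mismatches` are hypothetical; the possessive quantifier `*+` requires Perl 5.10 or later.

```perl
#!/usr/bin/env perl
use strict;
use warnings;

# Hypothetical inputs: dots in the middle, at the start, at the end,
# absent, and in runs.
my @samples = (
    "a\x{00b7}b",
    "\x{00b7}\x{00b7}lead",
    "trail\x{00b7}",
    "none",
    "a\x{00b7}\x{00b7}b\x{00b7}c",
);

my $mismatches = 0;
for my $sample (@samples) {
    # Capture the text before each run of dots and put it back.
    (my $general = $sample) =~ s/([^\x{00b7}]*+)\x{00b7}+/$1/g;
    # Simply delete each run of dots.
    (my $simple = $sample) =~ s/\x{00b7}+//g;
    $mismatches++ if $general ne $simple;
}

print "mismatches: $mismatches\n";
```

The simpler form does less work because it never has to capture and re-insert the surrounding text, which is a plausible reason to prefer it.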
