How do I split text into "steps" using a regular expression in Perl?

I am trying to split text into "steps". Suppose my text is

my $steps = "1.Do this. 2.Then do that. 3.And then maybe that. 4.Complete!";

I would like the output to be:

"1.Do this."
"2.Then do that."
"3.And then maybe that."
"4.Complete!"

I'm not very good with regex, so help would be great!

I have tried many combinations, such as:

split /(\s\d.)/

but it separates the numbering from the text.

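I would indeed use split. However, you need a lookahead to keep the number out of the match, so that it is not consumed as part of the separator.

my @steps = split /\s+(?=\d+\.)/, $steps;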
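
For reference, here is a minimal runnable sketch of the split-with-lookahead approach, applied to the sample text from the question. The variable names $text and @pieces are chosen only for this illustration.

```perl
use strict;
use warnings;
use feature 'say';

my $text = '1.Do this. 2.Then do that. 3.And then maybe that. 4.Complete!';

# Split on whitespace only where a step number and a period follow.
# The lookahead (?=...) checks for the number without consuming it,
# so each piece keeps its own numbering.
my @pieces = split /\s+(?=\d+\.)/, $text;

# Print each piece in double quotes, one per line.
say qq("$_") for @pieces;
```

With this input the script should print the four quoted lines requested in the question.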

All step descriptions start with a number followed by a period, and then non-digits up to the next number. So capture all such patterns

my @s = $steps =~ / [0-9]+\. [^0-9]+ /xg;
say for @s;

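This works only if there are definitely no digits in the step descriptions, as is the case for any approach that relies on matching a digit (even one followed by a period, because of decimal numbers).†

If there may be digits in there, we need to know more about the structure of the text.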

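Another delimiting pattern to consider is the punctuation that ends a sentence (. and ! in these examples), provided that a step's description contains no such characters elsewhere and does not consist of multiple sentences

my @s = $steps =~ / [0-9]+\. .*? [.!] /xg;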

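Extend the list of patterns that end an item's description as needed, for example with ? and/or with the ." sequence, since punctuation often goes inside quotes.†

If an item can have multiple sentences, or uses end-of-sentence punctuation in the middle of a sentence (perhaps as part of a quotation), then tighten the condition for the end of an item by combining the two footnotes: sentence-ending punctuation that is followed by number+period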

my @s = $steps =~ /[0-9]+\. .*? (?: \."|!"|[.!]) (?=\s+[0-9]+\. | \z)/xg;
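
As a rough illustration of this combined condition, the sketch below runs the same pattern, spread over several lines under /x, against a made-up string whose items contain quotes and more than one sentence. The sample text and the names $text and @items are assumptions for this example only.

```perl
use strict;
use warnings;
use feature 'say';

# Hypothetical sample: item 1 has two sentences, items 1 and 2 use quotes.
my $text = '1. Do "this." Then wait. 2. Say "done!" 3. Finish.';

my @items = $text =~ /
    [0-9]+\. .*?              # number, period, then as little text as possible
    (?: \." | !" | [.!] )     # sentence-ending punctuation, possibly before a closing quote
    (?= \s+[0-9]+\. | \z )    # only if the next step number or the end of string follows
/xg;

say for @items;
```

Here the period after "this" does not end item 1, because it is not followed by whitespace and a step number, so "Then wait." stays attached to the first item.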

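If this isn't good enough, then we really need a more precise description of that text.

† Using the "number-period" pattern to delimit item descriptions, as in

/ [0-9]+\. .*? (?=\s+[0-9]+\. | \z) /xg;

(or in a lookahead in split) fails with text like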

1. Only $2.50 or 1. Version 2.4.1 ...

† To include text like 1. Do "this." and 2. Or "that!", we need

/ [0-9]+\. .*? (?: \." | !" | [.!?]) /xg;
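
The version-number case can be reproduced with a short sketch. The sample sentence and the names $text and @parts are made up for this illustration.

```perl
use strict;
use warnings;
use feature 'say';

# Hypothetical text whose first step mentions a version number.
my $text = '1. Install version 2.4.1 first. 2. Run it.';

# The "2." at the start of "2.4.1" follows whitespace,
# so the lookahead mistakes it for a step number.
my @parts = split /\s+(?=[0-9]+\.)/, $text;

say scalar(@parts), ' pieces:';
say "  [$_]" for @parts;
```

The text has two steps, but the split should return three pieces, with the first step cut in two just before the version number.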

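The sample code below demonstrates how a regex can populate the %steps hash in a single line of code.

Once you have the data, you can slice and dice it however you like.

Check whether the sample fits your problem.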

use strict;
use warnings;
use feature 'say';
use Data::Dumper;
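my($str,%steps,$re);
$str   = '1.Do this. 2.Then do that. 3.And then maybe that. 4.Complete!';
# Capture the step number, then the non-digit text up to a closing period.
# Step 4 ends in "!" rather than ".", so it is not matched.
$re    = qr/(\d+)\.(\D+)\./;
# In list context the match returns (number, text) pairs, which fill the hash.
%steps = $str =~ /$re/g;
say Dumper(\%steps);
say "$_. $steps{$_}" for sort keys %steps;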

Output

$VAR1 = {
'1' => 'Do this',
'2' => 'Then do that',
'3' => 'And then maybe that'
};
1. Do this
2. Then do that
3. And then maybe that
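
Note that the output above has no entry for step 4, because "4.Complete!" does not end in a period. As a possible variation (not the answerer's code), the sketch below accepts either "." or "!" as the closing punctuation, keeps that punctuation in the stored text, and sorts the keys numerically. The pattern is an assumption about what the text looks like and would need adjusting for other inputs.

```perl
use strict;
use warnings;
use feature 'say';

my $str = '1.Do this. 2.Then do that. 3.And then maybe that. 4.Complete!';

# Capture the number and the text up to a "." or "!" that is followed
# by the next step number or by the end of the string.
my %steps = $str =~ /(\d+)\.\s*(.*?[.!])(?=\s+\d+\.|\z)/g;

# Sort numerically so that a step 10 would come after step 9.
say "$_. $steps{$_}" for sort { $a <=> $b } keys %steps;
```

The default string sort used in the original sample is fine for fewer than ten steps. The numeric comparison only matters once the step numbers reach two digits.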
