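I have a list of well-parsed, multi-paragraph documents (all paragraphs are separated by \n\n and sentences by "."). I'd like to split these documents into sentences, each prefixed with a number indicating its paragraph within the document. For example, the (two-paragraph) input is:
First sentence of the 1st paragraph. Second sentence of the 1st paragraph. \n\n
First sentence of the 2nd paragraph. Second sentence of the 2nd paragraph. \n\n
Ideally, the output should be: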
1 First sentence of the 1st paragraph.
1 Second sentence of the 1st paragraph.
2 First sentence of the 2nd paragraph.
2 Second sentence of the 2nd paragraph.
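I'm familiar with the Lingua::Sentences package in Perl, which can split documents into sentences. However, it does not handle paragraph numbering. So I'm wondering whether there is another way to achieve the above (the documents contain no abbreviations). Any help is greatly appreciated. Thanks!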
If you can rely on the period (.) as the delimiter, you can do this:
perl -00 -nlwe 'print qq($. $_) for split /(?<=\.)/' yourfile.txt
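Explanation:
-00 sets the input record separator to the empty string, which is paragraph mode, so each record read is one paragraph.
-l chomps each input record and sets the output record separator to the input record separator, which in this case translates to two newlines.
Then we simply split after each period with a lookbehind assertion and print each sentence, prefixed with $., the current record number. In paragraph mode that is the paragraph number.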
Since you mentioned Lingua::Sentences, I think one option is to manipulate the raw output of that module a little to get what you need:
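use strict;
use warnings;
use Lingua::Sentence;
# Assumption: the text is English and is read whole from STDIN or a file argument.
my $splitter = Lingua::Sentence->new("en");
my $text = do { local $/; <> };
# split() returns the sentences separated by newlines; this assumes paragraph breaks survive as blank lines.
my @paragraphs = split /\n{2,}/, $splitter->split($text);
foreach my $index (0 .. $#paragraphs) {
    my $paragraph = join "\n\n", map { $index + 1 . " $_" }
        split /\n/, $paragraphs[$index];
    print "$paragraph\n\n";
}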