Perl split function: split before the regex match, not after it


#!/usr/bin/perl
use warnings ;
use strict ;
use Data::Dumper qw(Dumper);
my $str = " 30th Mar 2020 5:53:18 pm Elvis Presly: BJ: Bloomberg Runs30th Mar 2020 5:53:27 pm Elvis Presly: DS: ICE DATA = INC101848366130th Mar 2020 6:42:43 pm Boris Putin: Cortese's ICE logs is for the Bloomberg Runs issue30th Mar 2020 6:43:28 pm Elvis Presly: yeap31st Mar 2020 4:11:22 am Indie Rock: VK : RE: XS2018777099 & XS2018777172 - INC1018491954 31st Mar 2020 6:31:17 am Dash Riprock: NW: RE: SABSM 6.125 YTW - INC101849584331st Mar 2020 6:52:06 am Dash Riprock: KB: RE: Cpty issue for Trader on CDS STATS bookings - SDS 42625375 - PENDING ROKOS CAPITAL MANAGEMENT (JERSEY) LP - INC101849631331st Mar 2020 7:26:40 am Dash Riprock: AP: RE: Rolling 7yrs - INC101849710231st Mar 2020 7:45:36 am Dash Riprock: JK: RE: Chris White books - INC101849738031st Mar 2020 8:11:10 am Charlie Brown: KB: RE: BOOKBUILDER Allocs Delays - urgent - INC101849791631st Mar 2020 8:21:15 am Charlie Brown: VK: RE: Can you get me set up to view TRAX History?  - INC101849813331st Mar 2020 8:30:36 am Charlie Brown: WJ: RE: Bulking Booking P&L - INC101849829231st Mar 2020
";

#my @words = split / /, $str ;
my @words = split /(\d+th|st|rd)/, $str;

print Dumper \@words;

The split does what it is supposed to do:

$VAR1 = [
' ',
'30th',
' Mar 2020 5:53:18 pm Elvis Presly: BJ: Bloomberg Runs',
'30th',
' Mar 2020 5:53:27 pm Elvis Presly: DS: ICE DATA = INC',
'101848366130th', (this did not split  - it happens) 
' Mar 2020 6:42:43 pm Boris Putin: Cortese's ICE logs is for the Bloomberg Runs 
issue',
'30th',
' Mar 2020 6:43:

However, what I really need is for each line to end just before the next date, so that the data is listed as:

$VAR1 = [
' ',
'30th Mar 2020 5:53:18 pm Elvis Presly: BJ: Bloomberg Runs',
'30th Mar 2020 5:53:27 pm Elvis Presly: DS: ICE DATA = INC',
'101848366130th',
' Mar 2020 6:42:43 pm Boris Putin: Cortese's ICE logs is for the Bloomberg Runs 
issue',
'30th Mar 2020 6:43:

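With split, you specify what lies between the pieces you want to keep. In this case, the separator is a zero-length string that is followed by a date. For that, you can use the following:

split /(?=\d+(?:th|st|nd|rd))/, $str

You also need nd, for "second" (2nd, 22nd).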
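
To make the difference between a capturing separator and a lookahead concrete, here is a minimal sketch on a made-up sample string (the string and the variable names are illustrative, not taken from the question). The sample uses single-digit days on purpose: because \d+ can start matching at any digit of a run, the lookahead form also splits between the digits of a day such as 30th.

```perl
use strict;
use warnings;

# Hypothetical sample: two records run together, each starting with a date.
my $sample = ' 1st Mar foo2nd Apr bar';

# Capturing group: each matched separator comes back as its own element.
my @captured = split /(\d+(?:th|st|nd|rd))/, $sample;

# Zero-width lookahead: nothing is consumed, so each date stays attached
# to the text that follows it.
my @lookahead = split /(?=\d+(?:th|st|nd|rd))/, $sample;

print join('|', @captured), "\n";
print join('|', @lookahead), "\n";
```

With the capturing form the dates are separate elements; with the lookahead form each element after the first begins with its date.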

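split /(?=\d{2}(?:st|nd|rd|th) \w{3} \d{4})/a, $str;

/a forces an ASCII-only interpretation of \d, so a digit such as "६" does not match.

The zero-width lookahead assertion (?=...) is used to split the string before the match, i.e. at the start of each date (thanks to ikegami for the idea).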
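
As a rough check, the pattern can be tried on a few records taken from the question's data. This sketch assumes every day is written with two digits, as in that data, and a Perl recent enough to support the /a modifier (5.14 or later); the shortened string is only an excerpt.

```perl
use strict;
use warnings;

# Three records from the question's data, run together without separators.
my $str = ' 30th Mar 2020 5:53:18 pm Elvis Presly: BJ: Bloomberg Runs'
        . '30th Mar 2020 5:53:27 pm Elvis Presly: DS: ICE DATA = INC1018483661'
        . '30th Mar 2020 6:43:28 pm Elvis Presly: yeap';

# Split at every position that is immediately followed by
# "<two digits><ordinal suffix> <three word chars> <four digits>".
my @records = split /(?=\d{2}(?:st|nd|rd|th) \w{3} \d{4})/a, $str;

print "[$_]\n" for @records;
```

Because exactly two digits are required, the incident number INC1018483661 stays whole and the split lands just before the following 30th.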

You can also use

split /(\d{2}(?:st|nd|rd|th) \w{3} \d{4}(?: \d{1,2}:\d{2}:\d{2} [ap]m)?) ?/a, $str;

to separate out the timestamps:

$VAR1 = [
' ',
'30th Mar 2020 5:53:18 pm',
'Elvis Presly: BJ: Bloomberg Runs',
'30th Mar 2020 5:53:27 pm',
'Elvis Presly: DS: ICE DATA = INC1018483661',
# ...
'31st Mar 2020 8:30:36 am',
'Charlie Brown: WJ: RE: Bulking Booking P&L - INC1018498292',
'31st Mar 2020'
];
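
Since the returned list alternates between timestamps and messages, one possible way to consume it is to pair the elements up. The following is only a sketch on a two-record excerpt; the @entries structure and its stamp/message keys are hypothetical names chosen for illustration.

```perl
use strict;
use warnings;

# Two records from the question's data, run together without separators.
my $str = ' 30th Mar 2020 5:53:18 pm Elvis Presly: BJ: Bloomberg Runs'
        . '30th Mar 2020 5:53:27 pm Elvis Presly: yeap';

# The capturing group makes split return each timestamp as its own element;
# the trailing " ?" swallows the space between timestamp and message.
my @parts = split /(\d{2}(?:st|nd|rd|th) \w{3} \d{4}(?: \d{1,2}:\d{2}:\d{2} [ap]m)?) ?/a, $str;

# The first element is whatever preceded the first timestamp (a single space here).
shift @parts;

# The rest alternates: timestamp, message, timestamp, message, ...
my @entries;
while (@parts) {
    my ($stamp, $message) = splice @parts, 0, 2;
    push @entries, { stamp => $stamp, message => $message // '' };
}

printf "%s => %s\n", $_->{stamp}, $_->{message} for @entries;
```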

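Note that \d+st|nd does not do what you need: the \d+ is part of the first alternative only, so you need parentheses to group the alternatives together. I used the non-capturing variant (?:...) to prevent split from including the matches in its return value.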
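
The precedence issue can be seen on a tiny made-up input (the text and variable names below are illustrative only):

```perl
use strict;
use warnings;

# "nd" occurs inside "and" with no digit in front of it.
my $text = 'and 21st';

# Ungrouped: parsed as (\d+st) OR (nd), so the bare "nd" in "and" matches first.
my ($ungrouped) = $text =~ /(\d+st|nd)/;

# Grouped: \d+ applies to every suffix, so only "21st" can match.
my ($grouped) = $text =~ /(\d+(?:st|nd))/;

print "ungrouped: $ungrouped\n";
print "grouped:   $grouped\n";
```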

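It is not clear why the OP wants to use split.

As an alternative, a substitution can be used to inject a newline before each date.

A split is then added to place the lines into an array, giving the desired result.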

use strict;
use warnings;
use feature 'say';
use Data::Dumper;
my $str = " 30th Mar 2020 5:53:18 pm Elvis Presly: BJ: Bloomberg Runs30th Mar 2020 5:53:27 pm Elvis Presly: DS: ICE DATA = INC101848366130th Mar 2020 6:42:43 pm Boris Putin: Cortese's ICE logs is for the Bloomberg Runs issue30th Mar 2020 6:43:28 pm Elvis Presly: yeap31st Mar 2020 4:11:22 am Indie Rock: VK : RE: XS2018777099 & XS2018777172 - INC1018491954 31st Mar 2020 6:31:17 am Dash Riprock: NW: RE: SABSM 6.125 YTW - INC101849584331st Mar 2020 6:52:06 am Dash Riprock: KB: RE: Cpty issue for Trader on CDS STATS bookings - SDS 42625375 - PENDING ROKOS CAPITAL MANAGEMENT (JERSEY) LP - INC101849631331st Mar 2020 7:26:40 am Dash Riprock: AP: RE: Rolling 7yrs - INC101849710231st Mar 2020 7:45:36 am Dash Riprock: JK: RE: Chris White books - INC101849738031st Mar 2020 8:11:10 am Charlie Brown: KB: RE: BOOKBUILDER Allocs Delays - urgent - INC101849791631st Mar 2020 8:21:15 am Charlie Brown: VK: RE: Can you get me set up to view TRAX History?  - INC101849813331st Mar 2020 8:30:36 am Charlie Brown: WJ: RE: Bulking Booking P&L - INC101849829231st Mar 2020
";
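# Insert a newline before every date (day with ordinal suffix, month, year).
$str =~ s/(\d{1,2}(?:th|st|nd|rd) \w{3} \d{4})/\n$1/g;
say $str;
# Split on the injected newlines to get one record per array element.
my @lines = split /\n/, $str;
say Dumper(\@lines);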
