Parsing a file using regular expressions



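I have a large text file. It is basically a CSV file, but it contains many different sections, so the file as a whole does not look like a proper CSV to me. Part of the file looks like this: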

7.27.27.2. Frame Counts: 2
Timestamp,Transmitted,Received Seconds,Frames,
1.818,"47,702","24,026"
2.847,"121,038","66,424"
3.818,"192,749","105,993"
4.851,"270,454","147,068"
5.817,"343,582","184,994"
6.818,"422,937","227,679"
7.847,"494,787","268,220"
8.847,"568,388","307,350"
9.818,"636,640","344,092"
10.824,"712,211","383,849"
11.846,"786,823","423,941"
12.818,"863,526","465,542"
13.847,"936,019","504,298"
14.847,"1,007,358","543,600"
15.847,"1,072,079","578,770"
16.847,"1,135,907","613,742"
17.847,"1,204,749","649,329"
18.817,"1,269,150","684,052"
19.817,"1,340,923","720,234"
20.860,"1,409,920","758,060"
21.847,"1,480,912","798,166"
22.101,"1,491,235","803,900"
23.108,"1,491,235","803,900"
7.27.28. Frame Rate
Rates can vary due to round-off errors in calculations. Timestamp,Transmit rate,Receive rate Seconds,Frames/s,
1.818,"39,450","39,390"
2.847,"112,400","112,500"
3.818,"114,600","114,600"
4.851,"115,000","115,000"
5.817,"115,000","114,900"
6.818,"121,900","121,600"
7.847,"109,200","109,500"
8.847,"112,700","112,600"
9.818,"108,100","108,200"
10.824,"114,700","114,600"
11.846,"112,200","112,200"
12.818,"121,700","121,700"
13.847,"108,100","108,100"
14.847,"110,600","110,600"
15.847,"99,900","99,770"
16.847,"98,790","98,910"
17.847,"104,400","104,400"
18.817,"102,200","102,300"
19.817,"108,000","108,000"
20.860,"102,400","102,400"
21.847,"112,500","112,600"
22.101,"63,410","63,470"
23.108,0.00,0.00
7.27.28.1. Frame Rate: 1




Test Model: IPSEC-JENKINS Version: 53 Result: canceled Date: June 10, 2022 5:10:46 AM PDT Test Duration: 00:00:25.436
7. Test Results for IPSEC
7.1. Component Description Component: Application Simulator

Component,Resource Used IPSEC,np3-0
7.2. Test Component Criteria Number,Description 1,The total number of sessions opened must reach the specified target within the allotted time.: (maxConcurrentAppFlows>=sessions.target) 2,The total number of failed application transactions must be no more than 5 percent of the attempted application transactions.: ((appUnsuccessful*100)<=(appAttempted*5)) 3,The session rate must reach the specified target within the allotted time.: (maxAppFlowRate>=sessions.targetPerSecond)
7.3. Settings Parameter,Value Resource Percentage,50 Application Profile,MixCISCO MIX 4451 Delay Start,00:00:00 Data Rate/Data Rate Unlimited,false Data Rate/Data Rate Scope,Limit Aggregate Throughput Data Rate/Data Rate Unit,Megabits / Second Data Rate/Data Rate Type,Constant Data Rate/Minimum Data Rate,10000 Data Rate/Maximum Data Rate,10000 Session/Super Flow Configuration/Maximum Simultaneous Super Flows,1030 Session/Super Flow Configuration/Maximum Simultaneous Active Flows,0 Session/Super Flow Configuration/Maximum Super Flows Per Second,1030 Session/Super Flow Configuration/Unlimited Super Flow Open Rate,false Session/Super Flow Configuration/Unlimited Super Flow Close Rate,false Session/Super Flow Configuration/Target Minimum Simultaneous Flows,1 Session/Super Flow Configuration/Target Minimum Super Flows Per Second,1 Session/Super Flow Configuration/Target Number of Successful Matches,0 Session/Super Flow Configuration/Engine Selection,Advanced (Max Features) Session/Super Flow Configuration/Performance Emphasis,Balanced Session/Super Flow Configuration/Resource Allocation Override,Automatic Session/Super Flow Configuration/Statistic Detail,Maximum App Configuration/Remove all DNS actions,false App Configuration/Streams Per Super Flow,1 App Configuration/Content Fidelity,Normal App Configuration/Replace Streams at Runtime,true Source Port/Port Distribution Type,Random Source Port/Minimum Port Number,1024 Source Port/Maximum Port Number,65535 TCP Configuration/Maximum Segment Size (MSS),1260 TCP Configuration/Aging Time Data Type,Seconds TCP Configuration/Aging Time,0 TCP Configuration/Reset at End,false TCP Configuration/Retry Quantum,500 TCP Configuration/Retry Count,3 TCP Configuration/Delay ACKs,true TCP Configuration/Disable Piggy-back data on ACK (experimental),false TCP Configuration/Delayed ACKs ms,0 TCP Configuration/ACK every N (experimental),0 TCP Configuration/Initial Receive Window,5792 TCP Configuration/TCP Window Scale,0 TCP 
Configuration/Dynamic Receive Window Size,true TCP Configuration/Add Segment Timestamps,true TCP Configuration/Piggy-back Data on 3-way Handshake ACK,false TCP Configuration/Piggy-back Data on Shutdown FIN,false TCP Configuration/Initial Congestion Window,4 TCP Configuration/Explicit Congestion Notification,Support ECN TCP Configuration/Raw Flags,-1 TCP Configuration/Connect Delay,0 TCP Configuration/TCP Keepalive Timer,0 TCP Configuration/4-way Close,false TCP Configuration/Send PSH with all data segments,false IPv4 Configuration/TTL,32 IPv4 Configuration/TOS/DSCP,0x0 IPv6 Configuration/Hop Limit,64 IPv6 Configuration/Traffic Class,0x0 IPv6 Configuration/Flow Label,0x0 SSL Configuration/Session Reuse Capacity,Low SSL Configuration/Server Record Length,0 SSL Configuration/Client Record Length,0 Ramp Up Profile/Ramp Up Profile Type,Calculated Ramp Up Profile/Min Connection Rate,1 Ramp Up Profile/Max Connection Rate,1 Ramp Up Profile/Increment n Connections per Interval,1 Ramp Up Profile/Fixed Time Interval,00:00:01 Session Ramp Distribution/Ramp Up Behavior,Full Open Session Ramp Distribution/SYN Only Retry Mode,Obey Retry Count Session Ramp Distribution/Ramp Up Duration,00:00:00 Session Ramp Distribution/Steady-State Behavior,Open and Close Sessions Session Ramp Distribution/Steady-State Time Interval,00:02:15 Session Ramp Distribution/Ramp Down Behavior,Full Close Session Ramp Distribution/Ramp Down Time Interval,00:00:05 Experimental Advanced Settings/TCP Segments Credit,32 Experimental Advanced Settings/Send maximum size segments when possible,false Load Profile/,None Preset the component was created from,Appsim Default
7.4. App Profile Summary Weighted by flows Name,Weight,% Bandwidth,% Flows,Bytes,Flows,Seed CISCO MARCH G729 - DIA,"15,392",,,,,1 CISCO MARCH HTTP APPLICATION - DIA,"6,453",,,,,1 CISCO MARCH HTTP 32K GET - DIA,"14,969",,,,,1 CISCO MARCH HTTPS 16K - DIA,"31,729",,,,,1 CISCO MARCH CITRIX - DIA,282,,,,,1 CISCO MARCH HTTPS 64K - DIA,"9,130",,,,,1 CISCO MARCH MS-EXCHANGE - DIA,"13,212",,,,,1 CISCO MARCH HTTPS Live Streaming - DIA,584,,,,,1 CISCO MARCH HTTPS 1024K - DIA,617,,,,,1 CISCO MARCH H264 Video New - DIA,"6,576",,,,,1 CISCO MARCH POP3BANDWIDTH,95,,,,,1 CISCO MARCH SMTP,956,,,,,1
7.5. Traffic Appearance Traffic was addressed as defined in the "IPSEC-CURIE" network neighborhood. Interface,Traffic Direction,Network Domain,VLAN,Address Range 1,Client,CLIENT,,2.0.0.10
- 2.0.0.109 2,Server,SERVER,,5.0.0.10 - 5.0.0.109
7.6. Component Results Component,Result IPSEC,canceled
7.7. Application Aggregate Flows
There may be slices in this graph that are too small to be displayed. Protocol,Aggregate Flows (Flows),Aggregate Flows (%) SMTP,242,1.101% RTP,295,1.342% DNS,185,0.842% POP3-Advanced,25,0.114% HTTP,"17,440",79.345% Citrix,69,0.314% Microsoft Exchange,"3,724",16.943%

I want to extract the contents of section 7.27.28, which look like this:

1.818,"39,450","39,390"
2.847,"112,400","112,500"
3.818,"114,600","114,600"
4.851,"115,000","115,000"
5.817,"115,000","114,900"
6.818,"121,900","121,600"
7.847,"109,200","109,500"
8.847,"112,700","112,600"
9.818,"108,100","108,200"
10.824,"114,700","114,600"
11.846,"112,200","112,200"
12.818,"121,700","121,700"
13.847,"108,100","108,100"
14.847,"110,600","110,600"
15.847,"99,900","99,770"
16.847,"98,790","98,910"
17.847,"104,400","104,400"
18.817,"102,200","102,300"
19.817,"108,000","108,000"
20.860,"102,400","102,400"
21.847,"112,500","112,600"
22.101,"63,410","63,470"
23.108,0.00,0.00

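To read the data above, I was thinking of using a regular expression to pull out the section and then parsing that section with csv, but the code below does not work: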

pattern = r"""7.27.28. Frame Rate
Rates can vary due to round-off errors in calculations.
Timestamp,Transmit rate,Receive rate
Seconds,Frames/s,
(.*)
7.27.28.1. Frame Rate: 1"""
match = re.search(pattern, all_of_it)
print(match.group(1))

Could you tell me the correct pattern, or whether there is another way to extract the data?

This is not a regex answer, but it may still be useful.

The key trick here is to use text.split("\n\n") to partition the text on blank lines, and then use startswith to select the segment of interest.
text = """
7.27.27.2. Frame Counts: 2
Timestamp,Transmitted,Received Seconds,Frames,
1.818,"47,702","24,026"
2.847,"121,038","66,424"
3.818,"192,749","105,993"
4.851,"270,454","147,068"
5.817,"343,582","184,994"
6.818,"422,937","227,679"
7.847,"494,787","268,220"
8.847,"568,388","307,350"
9.818,"636,640","344,092"
10.824,"712,211","383,849"
11.846,"786,823","423,941"
12.818,"863,526","465,542"
13.847,"936,019","504,298"
14.847,"1,007,358","543,600"
15.847,"1,072,079","578,770"
16.847,"1,135,907","613,742"
17.847,"1,204,749","649,329"
18.817,"1,269,150","684,052"
19.817,"1,340,923","720,234"
20.860,"1,409,920","758,060"
21.847,"1,480,912","798,166"
22.101,"1,491,235","803,900"
23.108,"1,491,235","803,900"
7.27.28. Frame Rate
Rates can vary due to round-off errors in calculations. Timestamp,Transmit rate,Receive rate Seconds,Frames/s,
1.818,"39,450","39,390"
2.847,"112,400","112,500"
3.818,"114,600","114,600"
4.851,"115,000","115,000"
5.817,"115,000","114,900"
6.818,"121,900","121,600"
7.847,"109,200","109,500"
8.847,"112,700","112,600"
9.818,"108,100","108,200"
10.824,"114,700","114,600"
11.846,"112,200","112,200"
12.818,"121,700","121,700"
13.847,"108,100","108,100"
14.847,"110,600","110,600"
15.847,"99,900","99,770"
16.847,"98,790","98,910"
17.847,"104,400","104,400"
18.817,"102,200","102,300"
19.817,"108,000","108,000"
20.860,"102,400","102,400"
21.847,"112,500","112,600"
22.101,"63,410","63,470"
23.108,0.00,0.00
7.27.28.1. Frame Rate: 1




Test Model: IPSEC-JENKINS Version: 53 Result: canceled Date: June 10, 2022 5:10:46 AM PDT Test Duration: 00:00:25.436
7. Test Results for IPSEC
7.1. Component Description Component: Application Simulator

Component,Resource Used IPSEC,np3-0
7.2. Test Component Criteria Number,Description 1,The total number of sessions opened must reach the specified target within the allotted time.: (maxConcurrentAppFlows>=sessions.target) 2,The total number of failed application transactions must be no more than 5 percent of the attempted application transactions.: ((appUnsuccessful*100)<=(appAttempted*5)) 3,The session rate must reach the specified target within the allotted time.: (maxAppFlowRate>=sessions.targetPerSecond)
7.3. Settings Parameter,Value Resource Percentage,50 Application Profile,MixCISCO MIX 4451 Delay Start,00:00:00 Data Rate/Data Rate Unlimited,false Data Rate/Data Rate Scope,Limit Aggregate Throughput Data Rate/Data Rate Unit,Megabits / Second Data Rate/Data Rate Type,Constant Data Rate/Minimum Data Rate,10000 Data Rate/Maximum Data Rate,10000 Session/Super Flow Configuration/Maximum Simultaneous Super Flows,1030 Session/Super Flow Configuration/Maximum Simultaneous Active Flows,0 Session/Super Flow Configuration/Maximum Super Flows Per Second,1030 Session/Super Flow Configuration/Unlimited Super Flow Open Rate,false Session/Super Flow Configuration/Unlimited Super Flow Close Rate,false Session/Super Flow Configuration/Target Minimum Simultaneous Flows,1 Session/Super Flow Configuration/Target Minimum Super Flows Per Second,1 Session/Super Flow Configuration/Target Number of Successful Matches,0 Session/Super Flow Configuration/Engine Selection,Advanced (Max Features) Session/Super Flow Configuration/Performance Emphasis,Balanced Session/Super Flow Configuration/Resource Allocation Override,Automatic Session/Super Flow Configuration/Statistic Detail,Maximum App Configuration/Remove all DNS actions,false App Configuration/Streams Per Super Flow,1 App Configuration/Content Fidelity,Normal App Configuration/Replace Streams at Runtime,true Source Port/Port Distribution Type,Random Source Port/Minimum Port Number,1024 Source Port/Maximum Port Number,65535 TCP Configuration/Maximum Segment Size (MSS),1260 TCP Configuration/Aging Time Data Type,Seconds TCP Configuration/Aging Time,0 TCP Configuration/Reset at End,false TCP Configuration/Retry Quantum,500 TCP Configuration/Retry Count,3 TCP Configuration/Delay ACKs,true TCP Configuration/Disable Piggy-back data on ACK (experimental),false TCP Configuration/Delayed ACKs ms,0 TCP Configuration/ACK every N (experimental),0 TCP Configuration/Initial Receive Window,5792 TCP Configuration/TCP Window Scale,0 TCP 
Configuration/Dynamic Receive Window Size,true TCP Configuration/Add Segment Timestamps,true TCP Configuration/Piggy-back Data on 3-way Handshake ACK,false TCP Configuration/Piggy-back Data on Shutdown FIN,false TCP Configuration/Initial Congestion Window,4 TCP Configuration/Explicit Congestion Notification,Support ECN TCP Configuration/Raw Flags,-1 TCP Configuration/Connect Delay,0 TCP Configuration/TCP Keepalive Timer,0 TCP Configuration/4-way Close,false TCP Configuration/Send PSH with all data segments,false IPv4 Configuration/TTL,32 IPv4 Configuration/TOS/DSCP,0x0 IPv6 Configuration/Hop Limit,64 IPv6 Configuration/Traffic Class,0x0 IPv6 Configuration/Flow Label,0x0 SSL Configuration/Session Reuse Capacity,Low SSL Configuration/Server Record Length,0 SSL Configuration/Client Record Length,0 Ramp Up Profile/Ramp Up Profile Type,Calculated Ramp Up Profile/Min Connection Rate,1 Ramp Up Profile/Max Connection Rate,1 Ramp Up Profile/Increment n Connections per Interval,1 Ramp Up Profile/Fixed Time Interval,00:00:01 Session Ramp Distribution/Ramp Up Behavior,Full Open Session Ramp Distribution/SYN Only Retry Mode,Obey Retry Count Session Ramp Distribution/Ramp Up Duration,00:00:00 Session Ramp Distribution/Steady-State Behavior,Open and Close Sessions Session Ramp Distribution/Steady-State Time Interval,00:02:15 Session Ramp Distribution/Ramp Down Behavior,Full Close Session Ramp Distribution/Ramp Down Time Interval,00:00:05 Experimental Advanced Settings/TCP Segments Credit,32 Experimental Advanced Settings/Send maximum size segments when possible,false Load Profile/,None Preset the component was created from,Appsim Default
7.4. App Profile Summary Weighted by flows Name,Weight,% Bandwidth,% Flows,Bytes,Flows,Seed CISCO MARCH G729 - DIA,"15,392",,,,,1 CISCO MARCH HTTP APPLICATION - DIA,"6,453",,,,,1 CISCO MARCH HTTP 32K GET - DIA,"14,969",,,,,1 CISCO MARCH HTTPS 16K - DIA,"31,729",,,,,1 CISCO MARCH CITRIX - DIA,282,,,,,1 CISCO MARCH HTTPS 64K - DIA,"9,130",,,,,1 CISCO MARCH MS-EXCHANGE - DIA,"13,212",,,,,1 CISCO MARCH HTTPS Live Streaming - DIA,584,,,,,1 CISCO MARCH HTTPS 1024K - DIA,617,,,,,1 CISCO MARCH H264 Video New - DIA,"6,576",,,,,1 CISCO MARCH POP3BANDWIDTH,95,,,,,1 CISCO MARCH SMTP,956,,,,,1
7.5. Traffic Appearance Traffic was addressed as defined in the "IPSEC-CURIE" network neighborhood. Interface,Traffic Direction,Network Domain,VLAN,Address Range 1,Client,CLIENT,,2.0.0.10
- 2.0.0.109 2,Server,SERVER,,5.0.0.10 - 5.0.0.109
7.6. Component Results Component,Result IPSEC,canceled
7.7. Application Aggregate Flows
There may be slices in this graph that are too small to be displayed. Protocol,Aggregate Flows (Flows),Aggregate Flows (%) SMTP,242,1.101% RTP,295,1.342% DNS,185,0.842% POP3-Advanced,25,0.114% HTTP,"17,440",79.345% Citrix,69,0.314% Microsoft Exchange,"3,724",16.943%
"""
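from io import StringIO
from pandas import read_csv

# Split the text into segments at blank lines and stop at the first
# segment that begins with "Rates" (this relies on the sections being
# separated by blank lines in the real file).
for line in text.split("\n\n"):
    if line.startswith("Rates"):
        break
# Drop the explanatory sentence so the segment starts with the column names.
line = line.replace("Rates can vary due to round-off errors in calculations. ", "")
df = read_csv(StringIO(line))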
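The split("\n\n") approach only works when blank lines separate the sections. If the real file has none around section 7.27.28, as in the excerpt in the question, a line-by-line variant of the same idea may be easier to adapt. The following is a minimal sketch using only the standard library; `sample`, `parse_row` and `section_rows` are names made up for this illustration, and the sketch assumes that every data row has exactly three numeric fields.

```python
import csv

# Short excerpt in the same layout as the report in the question.
sample = '''7.27.27.2. Frame Counts: 2
Timestamp,Transmitted,Received Seconds,Frames,
1.818,"47,702","24,026"
7.27.28. Frame Rate
Rates can vary due to round-off errors in calculations. Timestamp,Transmit rate,Receive rate Seconds,Frames/s,
1.818,"39,450","39,390"
2.847,"112,400","112,500"
23.108,0.00,0.00
7.27.28.1. Frame Rate: 1
'''


def parse_row(line):
    """Return the line as three floats, or None if it is not a data row."""
    fields = next(csv.reader([line]), [])
    if len(fields) != 3:
        return None
    try:
        # Quoted values such as "39,450" use commas as thousands separators.
        return [float(field.replace(",", "")) for field in fields]
    except ValueError:
        return None


def section_rows(text, heading):
    """Collect the data rows that follow `heading` until the data ends."""
    rows, inside = [], False
    for line in text.splitlines():
        if not inside:
            # Keep skipping lines until the heading line is found.
            inside = line.strip() == heading
            continue
        row = parse_row(line)
        if row is not None:
            rows.append(row)
        elif rows:
            break  # first non-data line after the rows closes the section
    return rows


rows = section_rows(sample, "7.27.28. Frame Rate")
for row in rows:
    print(row)
```

Because rows are recognised by their shape rather than by their position, the description and column-name lines after the heading are skipped without naming them, and the next section heading ends the collection.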

Here is a better, regex-based solution. There may well be a way to do this with less regex, but I am no expert. Apologies for the poor variable naming!

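import re

text = "all your text"  # placeholder: put the full contents of the file here
LONG_LINE = "Rates can vary due to round-off errors in calculations. Timestamp,Transmit rate,Receive rate Seconds,Frames/s,"
LAST_ROW = "7.27.28.1. Frame Rate: 1"
# re.escape makes the dots in the markers match literally;
# re.DOTALL lets (.*) run across line breaks.
regex = re.compile(f"({re.escape(LONG_LINE)})(.*)({re.escape(LAST_ROW)})", re.MULTILINE | re.DOTALL)
m = regex.search(text)
your_section = m.group(2)  # everything between the two markers

# Keep everything from the first line that starts with a digit.
regex2 = re.compile(r"(^\d)(.*)", re.MULTILINE | re.DOTALL)
m2 = regex2.search(your_section)
print("".join(m2.groups()).strip())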
1.818,"39,450","39,390"
2.847,"112,400","112,500"
3.818,"114,600","114,600"
4.851,"115,000","115,000"
5.817,"115,000","114,900"
6.818,"121,900","121,600"
7.847,"109,200","109,500"
8.847,"112,700","112,600"
9.818,"108,100","108,200"
10.824,"114,700","114,600"
11.846,"112,200","112,200"
12.818,"121,700","121,700"
13.847,"108,100","108,100"
14.847,"110,600","110,600"
15.847,"99,900","99,770"
16.847,"98,790","98,910"
17.847,"104,400","104,400"
18.817,"102,200","102,300"
19.817,"108,000","108,000"
20.860,"102,400","102,400"
21.847,"112,500","112,600"
22.101,"63,410","63,470"
23.108,0.00,0.00
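
For reference, one likely reason the pattern in the question fails is that `.` does not match a newline by default, so `(.*)` cannot span several rows unless re.DOTALL is set. The sketch below combines that flag with the csv module, which is the route the question had in mind. The names `sample`, `START` and `END` are chosen for this illustration, and it assumes both section headings sit on lines of their own with no trailing whitespace.

```python
import csv
import io
import re

# Short excerpt in the same layout as the report in the question.
sample = '''7.27.27.2. Frame Counts: 2
Timestamp,Transmitted,Received Seconds,Frames,
1.818,"47,702","24,026"
7.27.28. Frame Rate
Rates can vary due to round-off errors in calculations. Timestamp,Transmit rate,Receive rate Seconds,Frames/s,
1.818,"39,450","39,390"
2.847,"112,400","112,500"
23.108,0.00,0.00
7.27.28.1. Frame Rate: 1
'''

START = "7.27.28. Frame Rate"
END = "7.27.28.1. Frame Rate: 1"

# re.MULTILINE makes ^ and $ match at every line boundary; re.DOTALL lets
# the lazy group (.*?) cross newlines and stop at the first END heading.
pattern = re.compile(
    rf"^{re.escape(START)}\n(.*?)^{re.escape(END)}$",
    re.MULTILINE | re.DOTALL,
)

match = pattern.search(sample)
body = match.group(1)

# csv.reader handles the quoted fields; keep only rows whose first field
# starts with a digit, which drops the description/column-name line.
rows = [row for row in csv.reader(io.StringIO(body)) if row and row[0][:1].isdigit()]
for row in rows:
    print(row)
```

The fields come back as strings; a value such as "39,450" still needs its comma removed before it can be converted to a number.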
