Python: converting a text file with multi-line records into a pandas DataFrame



I have a protocol dump in a plain text file, in the following format:

Frame 380: 19 bytes on wire (152 bits), 19 bytes captured (152 bits)
Bluetooth HCI H4
[Direction: Sent (0x00)]
HCI Packet Type: ACL Data (0x02)
0000  02 0b 00 0e 00 0a 00 01 00 05 0e 06 00 07 07 00   ................
0010  00 00 00                                          ...
Frame 381: 8 bytes on wire (64 bits), 8 bytes captured (64 bits)
Bluetooth HCI H4
[Direction: Rcvd (0x01)]
HCI Packet Type: HCI Event (0x04)
0000  04 13 05 01 0b 00 01 00                           ........
Frame 382: 23 bytes on wire (184 bits), 23 bytes captured (184 bits)
Bluetooth HCI H4
[Direction: Rcvd (0x01)]
HCI Packet Type: ACL Data (0x02)
0000  02 0b 20 12 00 0e 00 01 00 05 12 0a 00 47 00 00   .. ..........G..
0010  00 00 00 01 02 00 04                              .......

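In this simplified example, the frame numbers (380, 381, and so on) are part of the first line of each frame in the text format. I want to convert it into a pandas DataFrame that looks like this: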

FrameNumber                                   Details                                  
|---------------------------------------------------------------------------------------|
|            | Frame 380: 19 bytes on wire (152 bits), 19 bytes captured (152 bits)     |
|            | Bluetooth HCI H4                                                         |
|   380      |     [Direction: Sent (0x00)]                                             |
|            |     HCI Packet Type: ACL Data (0x02)                                     |
|            | 0000  02 0b 00 0e 00 0a 00 01 00 05 0e 06 00 07 07 00   ................ |
|            | 0010  00 00 00                                                           |
|---------------------------------------------------------------------------------------|
|            | Frame 381: 8 bytes on wire (64 bits), 8 bytes captured (64 bits)         |
|            | Bluetooth HCI H4                                                         |
|   381      |     [Direction: Rcvd (0x01)]                                             |
|            |     HCI Packet Type: HCI Event (0x04)                                    |
|            | 0000  04 13 05 01 0b 00 01 00                           ........         |
|---------------------------------------------------------------------------------------|
|            | Frame 382: 23 bytes on wire (184 bits), 23 bytes captured (184 bits)     |
|            | Bluetooth HCI H4                                                         |
|   382      |     [Direction: Rcvd (0x01)]                                             |
|            |     HCI Packet Type: ACL Data (0x02)                                     |
|            | 0000  02 0b 20 12 00 0e 00 01 00 05 12 0a 00 47 00 00   .. ..........G.. |
|            | 0010  00 00 00 01 02 00 04                              .......          |
+---------------------------------------------------------------------------------------+

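I tried pandas read_csv(), but with my limited knowledge of multi-line regular expression matching I could not get it to work. Can anyone help me find a simple way to do this?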

Another solution, using the re module:

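import re
import pandas as pd

all_data = []
with open("data.txt", "r") as f_in:
    # Each match is one frame: from a line starting with "Frame <n>" up to
    # (but not including) the next such line, or the end of the file.
    for g, n in re.findall(
        r"^(Frame (\d+).*?)\s*(?=^Frame \d+|\Z)", f_in.read(), flags=re.M | re.S
    ):
        all_data.append({"FrameNumber": int(n), "Details": g})

df = pd.DataFrame(all_data)
print(df)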

Prints:

|    |   FrameNumber | Details                                                                  |
|---:|--------------:|:-------------------------------------------------------------------------|
|  0 |           380 | Frame 380: 19 bytes on wire (152 bits), 19 bytes captured (152 bits)     |
|    |               | Bluetooth HCI H4                                                         |
|    |               |     [Direction: Sent (0x00)]                                             |
|    |               |     HCI Packet Type: ACL Data (0x02)                                     |
|    |               | 0000  02 0b 00 0e 00 0a 00 01 00 05 0e 06 00 07 07 00   ................ |
|    |               | 0010  00 00 00                                          ...              |
|  1 |           381 | Frame 381: 8 bytes on wire (64 bits), 8 bytes captured (64 bits)         |
|    |               | Bluetooth HCI H4                                                         |
|    |               |     [Direction: Rcvd (0x01)]                                             |
|    |               |     HCI Packet Type: HCI Event (0x04)                                    |
|    |               | 0000  04 13 05 01 0b 00 01 00                           ........         |
|  2 |           382 | Frame 382: 23 bytes on wire (184 bits), 23 bytes captured (184 bits)     |
|    |               | Bluetooth HCI H4                                                         |
|    |               |     [Direction: Rcvd (0x01)]                                             |
|    |               |     HCI Packet Type: ACL Data (0x02)                                     |
|    |               | 0000  02 0b 20 12 00 0e 00 01 00 05 12 0a 00 47 00 00   .. ..........G.. |
|    |               | 0010  00 00 00 01 02 00 04                              .......          |
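
To see how the pattern splits the text without needing a file on disk, here is a minimal sketch that applies the same regular expression to an in-memory string. The sample is a shortened, illustrative version of the dump, and the names `sample`, `pattern` and `records` are assumptions made for this example.

```python
import re

# Shortened, illustrative sample of the dump (not the full data from the question).
sample = (
    "Frame 380: 19 bytes on wire\n"
    "Bluetooth HCI H4\n"
    "0000  02 0b 00\n"
    "Frame 381: 8 bytes on wire\n"
    "Bluetooth HCI H4\n"
)

# re.M lets ^ match at the start of every line; re.S lets . match newlines.
# The lazy .*? stops at the first point where the lookahead sees the next
# "Frame <n>" line or the end of the string (\Z).
pattern = re.compile(r"^(Frame (\d+).*?)\s*(?=^Frame \d+|\Z)", flags=re.M | re.S)

# findall returns (whole_block, frame_number) tuples, one per frame.
records = [(int(n), block) for block, n in pattern.findall(sample)]

for number, block in records:
    print(number, repr(block))
```

The trailing `\s*` sits outside the outer capture group, so the newline that separates two frames is consumed but not stored in `Details`.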

Using str.extract and groupby:

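import pandas as pd

df = pd.read_fwf("input2.txt", header=None, names=["Details"])
# Tag each "Frame <n>" header line, then forward-fill the tag onto the lines below it.
df["FrameNumber"] = (df["Details"].str.extract(r"(Frame \d+)", expand=False)
                     .where(df["Details"].str.startswith("Frame")).ffill())
# Join each frame's lines back into a single multi-line string.
out = df.groupby("FrameNumber", as_index=False).agg("\n".join)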

Output:

+---------------+--------------------------------------------------------------------------+
| FrameNumber   | Details                                                                  |
|---------------+--------------------------------------------------------------------------|
| Frame 380     | Frame 380: 19 bytes on wire (152 bits), 19 bytes captured (152 bits)     |
|               | Bluetooth HCI H4                                                         |
|               | [Direction: Sent (0x00)]                                                 |
|               | HCI Packet Type: ACL Data (0x02)                                         |
|               | 0000  02 0b 00 0e 00 0a 00 01 00 05 0e 06 00 07 07 00   ................ |
|               | 0010  00 00 00                                          ...              |
| Frame 381     | Frame 381: 8 bytes on wire (64 bits), 8 bytes captured (64 bits)         |
|               | Bluetooth HCI H4                                                         |
|               | [Direction: Rcvd (0x01)]                                                 |
|               | HCI Packet Type: HCI Event (0x04)                                        |
|               | 0000  04 13 05 01 0b 00 01 00                           ........         |
| Frame 382     | Frame 382: 23 bytes on wire (184 bits), 23 bytes captured (184 bits)     |
|               | Bluetooth HCI H4                                                         |
|               | [Direction: Rcvd (0x01)]                                                 |
|               | HCI Packet Type: ACL Data (0x02)                                         |
|               | 0000  02 0b 20 12 00 0e 00 01 00 05 12 0a 00 47 00 00   .. ..........G.. |
|               | 0010  00 00 00 01 02 00 04                              .......          |
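
If a multi-line regular expression feels hard to maintain, the same grouping can be done line by line in plain Python. The following is only a sketch under stated assumptions: the helper `parse_frames` and the constant `FRAME_START` are hypothetical names introduced here, and it assumes every record begins with a line of the form `Frame <number>:`.

```python
import re

# Hypothetical helper names; assumes each record starts with "Frame <number>:".
FRAME_START = re.compile(r"^Frame (\d+):")


def parse_frames(lines):
    """Group lines into one dict per frame, keyed by the frame number."""
    rows = []
    for line in lines:
        line = line.rstrip("\n")
        match = FRAME_START.match(line)
        if match:
            # A header line opens a new record.
            rows.append({"FrameNumber": int(match.group(1)), "Details": [line]})
        elif rows and line.strip():
            # Any other non-blank line belongs to the most recent frame.
            rows[-1]["Details"].append(line)
    # Collapse each frame's lines into one multi-line string.
    for row in rows:
        row["Details"] = "\n".join(row["Details"])
    return rows


sample_lines = [
    "Frame 380: 19 bytes on wire\n",
    "Bluetooth HCI H4\n",
    "0000  02 0b 00\n",
    "Frame 381: 8 bytes on wire\n",
    "Bluetooth HCI H4\n",
]
rows = parse_frames(sample_lines)
```

Because an open file object yields its lines when iterated, `parse_frames` can be given the file handle directly, and the resulting list of dicts can be passed to `pd.DataFrame(rows)` to get the `FrameNumber` and `Details` columns. Unlike the `read_fwf` approach, this keeps the leading indentation of each line.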
