Reading a text file with an unusual delimiter into a pandas DataFrame



I have a text file that looks like this:

Hypothesis:
drink
Reference:
Drake
WER:
100.0
Time:
2.416645050048828
"---------------------------"
Hypothesis:
Ed Sheeran
Reference:
Ed Sheeran
WER:
0.0
Time:
2.854194164276123

When I try to read it into a pandas DataFrame with ["Hypothesis", "Reference", "WER", "Time"] as the columns, it raises an error.

I tried:

txt= pd.read_csv("/home/kolagaza/Desktop/IAIS_en.txt", sep="---------------------------", header = None, engine='python')
data.columns = ["Hypothesis", "Reference","WER","Time"]

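I don't think you can read that text file directly into a pandas DataFrame without some preprocessing first. One approach is to convert your input into the records format that pandas accepts, i.e. a list of dictionaries, like this: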

[{'Hypothesis': 'drink', 'Reference': 'Drake', 'WER': '100.0', 'Time': '2.416645050048828'},
{'Hypothesis': 'Ed Sheeran','Reference': 'Ed Sheeran', 'WER': '0.0', 'Time': '2.854194164276123'}]

I tried the following code and it worked for me (I copied your sample text file):

import pandas as pd

records = []
with open("/home/kolagaza/Desktop/IAIS_en.txt", "r") as fh:
    # drop blank lines and strip surrounding whitespace
    lines = [line.strip() for line in fh.readlines() if line.strip()]
    # join everything, glue each "Key:" to its value, then split on the separator
    # so that each element represents one row of the final dataframe
    lines = ",".join(lines).replace(':,', ':').split('"---------------------------"')

# now convert each element into a record (a dict of column name -> value)
for line in lines:
    record = {}
    for keyval in line.split(','):
        if len(keyval) > 0:
            # split on the first colon only, in case the value itself contains one
            key, val = keyval.split(':', 1)
            record[key] = val
    records.append(record)

df = pd.DataFrame(records)
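
The string-joining approach above depends on the values containing no commas. As an alternative, here is a minimal sketch of a line-by-line parser that avoids that restriction. It is only an illustration, not part of the code I tested on your file: the names parse_records, FIELDS and sample are made up for this example, and it assumes every "Key:" line is followed by exactly one non-empty value line. It also converts WER and Time to floats, which the version above leaves as strings.

```python
import io

# Column names expected in the file (taken from the question).
FIELDS = ("Hypothesis", "Reference", "WER", "Time")
NUMERIC = ("WER", "Time")


def parse_records(lines):
    """Turn alternating 'Key:' / value lines into a list of dicts.

    Assumes each key line is followed by one non-empty value line and that
    records are separated by a line of dashes (optionally quoted).
    """
    records, record, key = [], {}, None
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # skip blank lines
        if key is not None:
            # The previous line was a key, so this line is its value.
            record[key] = float(line) if key in NUMERIC else line
            key = None
        elif line.strip('"').startswith("---"):
            # Separator line: close the current record.
            if record:
                records.append(record)
            record = {}
        elif line.endswith(":") and line[:-1] in FIELDS:
            key = line[:-1]
    if record:
        records.append(record)  # the last record has no trailing separator
    return records


# In-memory copy of the sample from the question; a real file handle
# opened with open(...) can be passed in the same way.
sample = io.StringIO(
    "Hypothesis:\ndrink\nReference:\nDrake\nWER:\n100.0\n"
    "Time:\n2.416645050048828\n"
    '"---------------------------"\n'
    "Hypothesis:\nEd Sheeran\nReference:\nEd Sheeran\nWER:\n0.0\n"
    "Time:\n2.854194164276123\n"
)
records = parse_records(sample)
```

Passing the resulting list to pd.DataFrame(records) should then give one row per block, with WER and Time as numeric columns.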
