I need to read a SQL output file with the following format into Python or a pandas DataFrame. What would be the best way to do that?
-[ RECORD 1 ]--------------------------------
a | test
b | test
c | test
-[ RECORD 2 ]--------------------------------
a | test
b | test
c | test
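This code converts the input file into a "normal" CSV file. It is not generic: your example is probably contrived (you likely don't really have columns called a, b, c with every value equal to test), so it may need adjusting, but it's a start. I think it was inspired by sed, so take it with a grain of salt!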
1) Convert the file into a regular CSV file
import csv

def transform_to_csv(in_file_path, out_file_path):
    column_names = []
    values = []
    first_record = True
    with open(in_file_path) as infile, open(out_file_path, "w", newline="") as outfile:
        # csv.writer quotes values that contain commas or quotes
        writer = csv.writer(outfile, lineterminator="\n")
        infile.readline()  # skip the first "-[ RECORD 1 ]" line
        while True:
            line = infile.readline().rstrip("\n")
            if not line or line.startswith("-"):
                # finished with a record (separator line, blank line or end of file)
                if values:
                    if first_record:
                        # the header is known once the first record is complete
                        writer.writerow(column_names)
                        first_record = False
                    writer.writerow(values)
                    values = []
                if not line:
                    break
            else:
                # accumulating fields for the next record;
                # split on the first "|" only, so values may contain "|"
                name, value = line.split("|", 1)
                values.append(value.strip())
                if first_record:
                    column_names.append(name.strip())
This gives us a new file in CSV format:
a,b,c
test,test,test
test,test,test
2) Now do the usual pandas stuff
import pandas as pd
infile = "in.txt"
outfile = "out.csv"
transform_to_csv(infile, outfile)
df = pd.read_csv(outfile)
print(df.head())
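If the intermediate CSV file is not needed, the records can also be collected in memory. The following is only a minimal sketch, not a tested general solution: it assumes the layout shown in the question (which looks like psql's expanded display), that every record starts with a `-[ RECORD n ]` line, and that each field sits on a single `name | value` line. The function name `parse_expanded_records` and the `sample` text are made up for illustration.

```python
import re

# Matches record separators such as "-[ RECORD 1 ]-----" (assumed layout).
RECORD_HEADER = re.compile(r"^-\[ RECORD \d+ \]")

def parse_expanded_records(lines):
    """Turn 'name | value' lines grouped by record headers into a list of dicts."""
    records = []
    current = None
    for raw in lines:
        line = raw.rstrip("\n")
        if RECORD_HEADER.match(line):
            # A header line starts a new record.
            current = {}
            records.append(current)
        elif "|" in line and current is not None:
            # Split on the first "|" only, so values may contain "|".
            name, value = line.split("|", 1)
            current[name.strip()] = value.strip()
    return records

# Hypothetical sample, copied from the format in the question.
sample = """\
-[ RECORD 1 ]--------------------------------
a | test
b | test
c | test
-[ RECORD 2 ]--------------------------------
a | test
b | test
c | test
"""

records = parse_expanded_records(sample.splitlines())
print(records)
```

The resulting list of dicts can be passed straight to `pd.DataFrame(records)`. Unlike `pd.read_csv`, that route does no type inference, so every value stays a string until you convert the columns you care about, for example with `pd.to_numeric`.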