Convert a txt file to a csv file, splitting specific rows into columns

I currently have data like this (the ... just means there are more lines; there is no need to post the whole file here):

376 932
noms sommets
0000 Abbesses
0001 Alexandre Dumas
0002 Alma Marceau
...
0375 Étienne Marcel
coord sommets
0000 308 536
0001 472 386
0002 193 404
...
0375 347 412
arcs values
0 238 41
0 159 46
1 12 36
1 235 44
...
367 366 120.0

When converted to csv, the data should look like this, with three columns:


noms	sommets	coord sommets
0000	Abbesses	308 536

from pathlib import Path
import pandas as pd
f = Path("metro")
lines = [[], [], []]
file_num = -1
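for line in f.read_text().split("\n"):
    # skip blank lines
    if not line:
        continue
    # first token is the key (or header word), the rest is the value
    cells = line.split(maxsplit=1)
    # a header line starts the next section
    if cells[0] in ["noms", "coord", "arcs"]:
        file_num += 1
    # lines before the first header (e.g. "376 932") are ignored
    if file_num >= 0:
        lines[file_num].append(cells)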

def get_df(data):
    df1 = pd.DataFrame(data)
    # use the first row (the section header) as the column names
    df1.columns = df1.iloc[0]
    df1 = df1.drop(index=0)
    df1.columns.name = None
    return df1

df1 = get_df(lines[0])
df2 = get_df(lines[1])
df3 = get_df(lines[2])
df2.columns = [df1.columns[0], " ".join(df2.columns)]
res = pd.merge(df1, df2, how="outer", on="noms")
#    noms          sommets coord sommets
# 0  0000         Abbesses       308 536
# 1  0001  Alexandre Dumas       472 386
# 2  0002     Alma Marceau       193 404
res.to_csv("metro.csv")

Edit: to fix the encoding problem, pass the encoding you want to read_text().

for line in f.read_text(encoding="latin-1").split("\n"):
    ...
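As a small illustration of why the encoding argument matters, here is a minimal sketch. It assumes the file is stored as Latin-1, and it uses a hypothetical temporary sample file and a hypothetical helper named `read_lines`, neither of which comes from the question. The name Étienne Marcel decodes correctly as Latin-1, while the same bytes are not valid UTF-8.

```python
import tempfile
from pathlib import Path


def read_lines(path, encoding="latin-1"):
    """Read a text file with an explicit encoding and return its lines."""
    return Path(path).read_text(encoding=encoding).splitlines()


# Hypothetical sample file, written as Latin-1 for the demonstration.
tmp_dir = tempfile.mkdtemp()
sample = Path(tmp_dir) / "metro_sample"
sample.write_bytes("0375 Étienne Marcel\n".encode("latin-1"))

# Decoding with the matching codec recovers the accented name.
latin1_lines = read_lines(sample, encoding="latin-1")

# The Latin-1 byte for "É" is not a valid UTF-8 sequence here,
# so decoding the same file as UTF-8 raises an error.
try:
    read_lines(sample, encoding="utf-8")
    utf8_failed = False
except UnicodeDecodeError:
    utf8_failed = True
```

If the real file turns out to be UTF-8 instead, the same call with `encoding="utf-8"` would be the one to use.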

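Edit: you did not say how you want to handle the columns under "arcs values", so I left df3 as it is.

This can also be done without any imports.

There are a few safety checks because of noise in the data.

Also, I am using dictionaries because they are very fast when looking up key/value pairs.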

with open("metro", encoding="latin-1") as infile:
    data = infile.read().splitlines()
nom_start = "noms sommets"
coord_start = "coord sommets"
end = "arcs values"
mode = None
# use a dict, as key lookups on dicts are very fast
result = {}
for line in data:
    # this one is needed due to the first line:
    # skip everything until the "noms sommets" header is found
    if mode is None:
        if line == nom_start:
            mode = nom_start
        continue
    line = line.strip()
    # safety check: ignore blank lines
    if line != "":
        if line == end:
            # skip the end data
            break
        key, value = line.split(maxsplit=1)
        if mode == nom_start:
            if line != coord_start:
                result[key] = {"sommets": value}
            else:
                # the "coord sommets" header switches to the second section
                mode = coord_start
        else:
            result[key]["coord sommets"] = value

# CSV separator
SEP = ";"
with open("output.csv", "w", encoding="latin-1") as outfile:
    # CSV header
    outfile.write(f"noms{SEP}sommets{SEP}coord sommets\n")
    for key, val in result.items():
        outfile.write(f'{key}{SEP}{val["sommets"]}{SEP}{val["coord sommets"]}\n')
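Writing the rows by hand works as long as no value contains the separator. As a hedged alternative, the standard library's `csv` module quotes such values automatically. The sketch below is not part of the answer above: the `result` dict is a hypothetical stand-in for the one built by the parsing loop, and `write_rows` is an illustrative helper name.

```python
import csv
import io

# Hypothetical stand-in for the `result` dict built by the parsing loop.
result = {
    "0000": {"sommets": "Abbesses", "coord sommets": "308 536"},
    "0001": {"sommets": "Alexandre Dumas", "coord sommets": "472 386"},
}


def write_rows(result, outfile, sep=";"):
    """Write the header and one row per station to an open text file."""
    writer = csv.writer(outfile, delimiter=sep, lineterminator="\n")
    writer.writerow(["noms", "sommets", "coord sommets"])
    for key, val in result.items():
        # .get() leaves the cell empty if a station lacks a field
        writer.writerow([key, val.get("sommets", ""), val.get("coord sommets", "")])


# An in-memory buffer is used here so the sketch does not touch the disk.
buffer = io.StringIO()
write_rows(result, buffer)
csv_text = buffer.getvalue()
```

To write a real file instead, pass a file opened with `open("output.csv", "w", encoding="latin-1", newline="")`; the `csv` documentation recommends `newline=""` for files handed to a writer.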

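Quite an interesting question. I assume the file contains more columns, or sets of keys/variables, than the example shows, so you do not want to hardcode the column names.

I would create a new empty data frame, then read the input file line by line, checking whether each line is the next new column name (it does not start with digits). The values that follow are collected into a dictionary, and that dictionary is then merged into the new data frame as a new column.

So I would do it like this: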

import pandas as pd
# create an Empty DataFrame object
df_new = pd.DataFrame({"recordkey": []})
# read all input lines
inputfilename = "inputfile.txt"
with open(inputfilename, 'r') as file1:
    Lines = file1.readlines()
tmpdict = {}
colname = ""
# iterate through all lines
for idx in range(len(Lines)):
    line = Lines[idx]
    # this is assuming all keys are exactly 4 digits
    iscolname = not (line[:4].isdigit())

    if not iscolname:
        # split on the first space for key and value
        tmp = line.split(" ", 1)
        getkey = tmp[0].strip()
        getvalue = tmp[1].strip()
        # add to dictionary
        tmpdict[getkey] = getvalue
    # new column or last line
    if iscolname or idx == len(Lines)-1:
        # new column (except skip for first line of file)
        if colname != "":
            # create new column from dictionary
            df_tmp = pd.DataFrame(tmpdict.items(), columns=["recordkey", colname])
            df_new = df_new.merge(df_tmp, how='outer', on='recordkey')
        # keep new column name
        colname = line.strip()
        tmpdict = {}
# display dataframe
print(df_new)
# write dataframe to csv
fileoutput = "outputfile.csv"
df_new.to_csv(fileoutput, sep=",", index=False)
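The idea of detecting header lines instead of hardcoding them can also be sketched without pandas. This is only an illustration under stated assumptions, not the method above: a header is taken to be any non-empty line whose first token is not all digits, lines before the first header are ignored, and the helper names `parse_sections` and `to_rows` are hypothetical.

```python
def parse_sections(lines):
    """Group data lines under the most recent header line.

    Assumption: a header is any non-empty line whose first token is not
    all digits; lines before the first header are ignored.
    """
    sections = {}
    current = None
    for raw in lines:
        line = raw.strip()
        if not line:
            continue
        first, _, rest = line.partition(" ")
        if not first.isdigit():
            # a new header starts a new section
            current = line
            sections[current] = {}
        elif current is not None:
            # data line: first token is the key, the rest is the value
            sections[current][first] = rest.strip()
    return sections


def to_rows(sections, headers):
    """Outer-join the chosen sections on their record key."""
    # dict.fromkeys keeps first-seen order and drops duplicate keys
    keys = dict.fromkeys(k for h in headers for k in sections[h])
    return [[k] + [sections[h].get(k, "") for h in headers] for k in keys]


# Shortened sample in the same layout as the question's data.
sample = """376 932
noms sommets
0000 Abbesses
0001 Alexandre Dumas
coord sommets
0000 308 536
0001 472 386
arcs values
0 238 41
""".splitlines()

sections = parse_sections(sample)
rows = to_rows(sections, ["noms sommets", "coord sommets"])
```

Note that the "arcs values" section repeats its first number across lines, so storing it in a dict keeps only the last entry per key; the sketch parses it but does not join it into the rows.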
