I have a txt file that looks like this:
Quod equidem non reprehendo;
Lorem ipsum dolor sit amet, consectetur adipiscing elit. Quibus natura iure responderit non esse verum aliunde finem beate vivendi, a se principia rei gerendae peti; Quae enim adhuc protulisti, popularia sunt, ego autem a te elegantiora desidero. Duo Reges: constructio interrete. Tum Lucius: Mihi vero ista valde probata sunt, quod item fratri puto. Bestiarum vero nullum iudicium puto. Nihil enim iam habes, quod ad corpus referas; Deinde prima illa, quae in congressu solemus: Quid tu, inquit, huc? Et homini, qui ceteris animantibus plurimum praestat, praecipue a natura nihil datum esse dicemus?
=========================================================================
Planet   Number   festival    animal
                  colour      book
Mercury  First    firecrack   phone
Venus    Last     kite        computer
Earth    Country  rangoli     tv
Jupiter  C.COD    bomb
---------------------------------------------------------------------
11       4526     diwali      dog
                  holi        bigb
12       Joe      diwali      111
45       Doe      sankaranti  acer
65       UK       diwali      pan
67       22       diwali
-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Planet   Number   festival    animal
                  colour      book
Mercury  First    firecrack   phone
Venus    Last     kite        computer
Earth    Country  rangoli     tv
Jupiter  C.COD    bomb
-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
45       5637     ganesh      tiger
                  holi        cinema
67       micael   holi        222
78       john     diwali      xamoi
90       france   diwali      hp
34       34       diwali
I want to convert this text file to CSV format. The output I want to show: output
My code:
from itertools import groupby, chain

with open("file.txt", "r") as fin, \
     open("file.csv", "w") as fout:
    # Split the file into runs of non-blank lines
    for key, group in groupby(fin, key=lambda line: bool(line.strip())):
        if key:
            # Transpose the whitespace-split lines and flatten them into one row
            zipped = zip(*(line.rstrip().split() for line in group))
            fout.write(",".join(chain(*zipped)) + "\n")
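This will do what you ask. It's just a matter of gathering fields until we reach a trigger to write them out, AND ignoring the text at the start, AND ignoring all headers except the first.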
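fin = open('file.txt')
fout = open('file.csv','w')

gather = []
skipping = True   # still inside the text before the '====' line
first = True      # nothing has been written yet
for line in fin:
    if skipping:
        # Stop skipping once the '====' separator has been seen
        skipping = line.find('====') < 0
    elif line.find('----') >= 0:
        # A dashed line ends a block; write it unless it is a repeated header
        if gather and (first or gather[0] != 'Planet'):
            print( ','.join(gather), file=fout )
        gather = []
        first = False
    else:
        gather.extend( line.strip().split() )
# Write the final block, which has no dashed line after it
if gather:
    print( ','.join(gather), file=fout )

fin.close()
fout.close()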
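To see what this loop produces without touching any files, here is a minimal self-contained sketch of the same gathering logic. The function name `gather_rows` and the shortened inline sample are assumptions made for illustration; they are not part of the answer. Note that fields come out in reading order, line by line, so the order within a row differs from a column-wise layout.

```python
import io

def gather_rows(lines):
    """Yield one list of fields per block, using the same triggers as the loop above."""
    gather = []
    skipping = True   # still inside the free-text preamble
    first = True      # no block has been emitted yet
    for line in lines:
        if skipping:
            # The preamble ends at the '====' separator line.
            skipping = '====' not in line
        elif '----' in line:
            # A dashed line closes a block; drop repeated 'Planet' headers.
            if gather and (first or gather[0] != 'Planet'):
                yield gather
            gather = []
            first = False
        else:
            gather.extend(line.split())
    # The last block has no dashed line after it.
    if gather:
        yield gather

# Hypothetical, shortened sample with the same overall shape as the question's file.
sample = """\
Some introductory text
==========
Planet Number
colour
Mercury First
----------
11 4526
holi
12 Joe
----------
Planet Number
colour
Mercury First
----------
45 5637
holi
67 micael
"""

rows = list(gather_rows(io.StringIO(sample)))
for row in rows:
    print(','.join(row))
```

Writing the rows to a file instead would only require replacing the final `print` with one that passes `file=fout`, as in the answer.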
The relevant blocks of the file seem to have a roughly fixed-width column structure, so you could try using pandas.read_fwf on them:
from io import StringIO
from itertools import groupby
import pandas as pd

# Keep lines that are neither blank nor dashed separators
def keep(line): return bool(line.strip()) and not line.startswith("---")

with open('file.txt', 'r') as fin, \
     open('file.csv','w') as fout:
    # Skip everything up to and including the '===' line
    while True:
        if next(fin).startswith("==="): break
    first = True
    for key, group in groupby(fin, key=keep):
        if key:
            # Read the block as fixed-width columns, then flatten it column by column
            line = ",".join(
                pd.read_fwf(StringIO("".join(group)), header=None)
                .stack().sort_index(level=1).dropna().astype(str)
                .str.replace(r"^(-?\d+)\.0+$", r"\1", regex=True)
            ) + "\n"
            if first:
                # Remember the header so later repeats can be dropped
                header, first = line, False
                fout.write(line)
            elif line != header:
                fout.write(line)
Result in file.csv:
Planet,Mercury,Venus,Earth,Jupiter,Number,First,Last,Country,C.COD,festival,colour,firecrack,kite,rangoli,bomb,animal,book,phone,computer,tv
11,12,45,65,67,4526,Joe,Doe,UK,22,diwali,holi,diwali,sankaranti,diwali,diwali,dog,bigb,111,acer,pan
45,67,78,90,34,5637,micael,john,france,34,ganesh,holi,holi,diwali,diwali,diwali,tiger,cinema,222,xamoi,hp
If you don't care about the number format, you can remove the .str.replace(r"^(-?\d+)\.0+$", r"\1", regex=True) step.
But: is this really the actual format of your file?
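On why the number-format step exists: a column that mixes integers with empty cells is typically read by pandas as floats, so a value such as 22 would come out as 22.0 after the conversion to strings. The sketch below applies the pattern the answer appears to intend, using the standard `re` module instead of pandas so that it runs without dependencies (that substitution is an assumption made here). The helper name `strip_trailing_zero` is hypothetical.

```python
import re

# An optional sign, digits, then a literal '.' followed only by zeros.
PATTERN = re.compile(r"^(-?\d+)\.0+$")

def strip_trailing_zero(value):
    """Turn '22.0' back into '22'; leave every other string unchanged."""
    return PATTERN.sub(r"\1", value)

for value in ["22.0", "-7.00", "1.5", "C.COD", "4526"]:
    print(value, "->", strip_trailing_zero(value))
```

Strings such as `C.COD` or `1.5` do not match the pattern, so text fields and genuine decimals pass through untouched.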
I believe you can use the pandas library to convert the txt file to csv:
# importing the pandas library
import pandas as pd

# reading the given txt file
# and creating a dataframe
dataframe1 = pd.read_csv("input_file.txt")

# storing this dataframe in a csv file
dataframe1.to_csv('output_file.csv',
                  index=False)
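One caveat worth sketching: `pd.read_csv` assumes comma-separated input by default, so a whitespace-delimited text file would need an explicit `sep`. The example below is a minimal sketch on an inline, regular table with one record per line; the sample data and variable names are made up for illustration. It does not handle the free-text preamble or the multi-line blocks in the question's file.

```python
from io import StringIO
import pandas as pd

# Hypothetical whitespace-delimited input with one record per line.
text = """\
Planet Number festival
Mercury 11 diwali
Venus 12 holi
"""

# sep=r"\s+" splits on runs of whitespace instead of commas.
df = pd.read_csv(StringIO(text), sep=r"\s+")

# index=False keeps the row index out of the CSV; with no path given,
# to_csv returns the CSV text as a string.
csv_text = df.to_csv(index=False)
print(csv_text)
```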