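I have a .txt file with 683,500 lines. Every 7 lines describe a different person and contain:
- ID
- Name
- Work position
- Date 1 (year-month)
- Date 2 (year-month)
- Gross payment
- Service time
I want to read the .txt and output (as JSON, CSV, TXT, or even into a database) each person as a single row with 7 columns, for example: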
ID Name Work position Date 1 Date 2 Gross payment Service time
ID Name Work position Date 1 Date 2 Gross payment Service time
ID Name Work position Date 1 Date 2 Gross payment Service time
ID Name Work position Date 1 Date 2 Gross payment Service time
Sample from the text file:
00000000886
MANUEL DE JESUS SUBERVI PEÑA
MAESTRO MEDIA GENERAL
2006-08
2021-09
30556.04
15.7
00000000086
MANUEL DE JESUS SUBERVI PEÑA
MAESTRO MEDIA GENERAL
2006-01
2022-09
30556.04
15.7
000100000086
MANUEL DE JESUS SUBERVI PEÑA
MAESTRO MEDIA GENERAL
2006-01
2021-09
30556.04
15.7
import csv

#opening file
file = open(r"C:\Users\Redford\Documents\Proyecto automatizacion\data1.txt") #open file
counter = 0
total_lines = len(file.readlines()) #count lines
#print('Total lines:', x)

#reading from file
content = file.read()
colist = content.split()
print(colist)

#read data from data1.txt and write in data2.txt
lines = open(r"C:\Users\Redford\Documents\Proyecto automatizacion\data1.txt")
arr = []
with open('data2.txt', 'w') as f:
    for line in lines:
        #arr.append(line)
        f.write(line)
I'm new to programming and don't know how to translate this logic into code.
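Your code does not collect multiple lines in order to write them out as a single row.
Use this approach:
- read the file line by line
- collect each line, without its trailing \n, into a list
- if the list length reaches 7, write it to the csv and clear the list
- repeat until done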
Create the data file:
with open("t.txt", "w") as f:
    f.write("""00000000886\nMANUEL DE JESUS SUBERVI PEÑA\nMAESTRO MEDIA GENERAL\n2006-08\n2021-09\n30,556.04\n15.7
00000000086\nMANUEL DE JESUS SUBERVI PEÑA\nMAESTRO MEDIA GENERAL\n2006-01\n2021-09\n30,556.04\n15.7
00100000086\nMANUEL DE JESUS SUBERVI PEÑA\nMAESTRO MEDIA GENERAL\n2006-01\n2021-09\n30,556.04\n15.7""")
Program:
import csv

with open("t.csv", "w", newline="") as wr, open("t.txt") as r:
    # create a csv writer
    writer = csv.writer(wr)

    # uncomment if you want a header over your data
    # h = ["ID","Name","Work position","Date 1","Date 2",
    #      "Gross payment","Service time"]
    # writer.writerow(h)

    person = []
    for line in r:  # could use enumerate as well, this works ok
        # collect line data minus the \n into list
        person.append(line.strip())

        # this person is finished, write, clear list
        if len(person) == 7:
            # leveraged the csv module writer, look it up if you need
            # to customize it further regarding quoting etc
            writer.writerow(person)
            person = []  # reset list for next person

    # something went wrong, your file is inconsistent, write remainder
    if person:
        writer.writerow(person)

print(open("t.csv").read())
Output:
00000000886,MANUEL DE JESUS SUBERVI PEÑA,MAESTRO MEDIA GENERAL,2006-08,2021-09,"30,556.04",15.7
00000000086,MANUEL DE JESUS SUBERVI PEÑA,MAESTRO MEDIA GENERAL,2006-01,2021-09,"30,556.04",15.7
00100000086,MANUEL DE JESUS SUBERVI PEÑA,MAESTRO MEDIA GENERAL,2006-01,2021-09,"30,556.04",15.7
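Read up on: csv module, writer.
"Gross payment" needs to be quoted because it contains a ',', which is the csv delimiter. The module does this automatically.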
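To see the automatic quoting in isolation, here is a minimal sketch that writes one record to an in-memory buffer and reads it back. The use of `io.StringIO` instead of a real file is an assumption made only to keep the example self-contained.

```python
import csv
import io

# One record whose payment field contains the delimiter ','.
row = ["00000000886", "MANUEL DE JESUS SUBERVI PEÑA", "MAESTRO MEDIA GENERAL",
       "2006-08", "2021-09", "30,556.04", "15.7"]

# Write to an in-memory buffer (assumption: stands in for the real csv file).
buffer = io.StringIO(newline="")
csv.writer(buffer).writerow(row)
csv_text = buffer.getvalue()
print(csv_text)

# Reading back with csv.reader undoes the quoting and restores the 7 fields.
parsed = next(csv.reader(io.StringIO(csv_text, newline="")))
print(parsed == row)
```

Only the field containing the comma is wrapped in quotes; the other six are written as they are, and `csv.reader` returns the original seven values.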
On top of @PatrickArtner's great answer, I would like to propose a solution based on itertools:
import csv
import itertools


def file_grouper_itertools(
        in_filepath="t.txt",
        out_filepath="t.csv",
        size=7):
    # newline='' is what the csv module recommends for files it writes to
    with open(in_filepath, 'r') as in_file, \
            open(out_filepath, 'w', newline='') as out_file:
        writer = csv.writer(out_file)
        # the same iterator repeated `size` times, so that each block
        # consumes `size` consecutive lines
        args = [iter(in_file)] * size
        for block in itertools.zip_longest(*args, fillvalue=' '):
            # equivalent, for the given input, to:
            # block = [x.rstrip('\n') for x in block]
            block = ''.join(block).rstrip('\n').split('\n')
            writer.writerow(block)
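The idea here is to loop over the file in blocks of the required size. This gets faster as the group size grows, because the main loop runs fewer iterations.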
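The chunking idiom may be easier to follow on a small in-memory list. The sketch below is only an illustration: the sample `lines`, the group size of 3 and the empty-string `fillvalue` are assumptions chosen to make the last, incomplete block visible.

```python
import itertools

lines = ["a\n", "b\n", "c\n", "d\n", "e\n", "f\n", "g\n"]
size = 3

# The SAME iterator object repeated `size` times: every tuple that
# zip_longest builds pulls `size` consecutive items from it.
args = [iter(lines)] * size
blocks = [
    [item.rstrip("\n") for item in block]
    for block in itertools.zip_longest(*args, fillvalue="")
]
print(blocks)
# [['a', 'b', 'c'], ['d', 'e', 'f'], ['g', '', '']]
```

The padded last block shows what happens when the number of lines is not a multiple of the group size. On Python 3.12 and later, `itertools.batched` offers similar grouping without padding.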
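Some micro-benchmarks suggest that your use case should benefit from this approach compared with the manual loop (adapted here into a function):
import csv


def file_grouper_manual(
        in_filepath="t.txt",
        out_filepath="t.csv",
        size=7):
    # newline='' is what the csv module recommends for files it writes to
    with open(in_filepath, 'r') as in_file, \
            open(out_filepath, 'w', newline='') as out_file:
        writer = csv.writer(out_file)
        block = []
        for line in in_file:
            block.append(line.rstrip('\n'))
            if len(block) == size:
                writer.writerow(block)
                block = []
        # write any incomplete trailing block
        if block:
            writer.writerow(block)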
Benchmark:
n = 100_000
k = 7
with open("t.txt", "w") as f:
    for i in range(n):
        # trailing "\n" so that consecutive groups do not run together
        f.write("\n".join(["0123456"] * k) + "\n")

%timeit file_grouper_manual()
# 1 loop, best of 5: 325 ms per loop
%timeit file_grouper_itertools()
# 1 loop, best of 5: 230 ms per loop
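The `%timeit` lines are IPython magics and only work in IPython or Jupyter. In a plain Python script, a rough equivalent can be sketched with the standard `timeit` module. Here `group_lines` and the synthetic `sample` list are hypothetical stand-ins for whichever function and input you actually want to time.

```python
import timeit


def group_lines(lines, size=7):
    """Hypothetical stand-in for the function being timed."""
    return [lines[i:i + size] for i in range(0, len(lines), size)]


sample = ["0123456"] * 70_000  # synthetic input: 10,000 groups of 7

# One call per measurement, repeated 5 times, keeping the best time,
# which mirrors the "1 loop, best of 5" line that %timeit prints.
timings = timeit.repeat(lambda: group_lines(sample), number=1, repeat=5)
best = min(timings)
print(f"best of 5: {best * 1000:.1f} ms per loop")
```

Absolute numbers depend on the machine and the disk, so only the relative ordering of the approaches is meaningful.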
Alternatively, you can use Pandas, which is very convenient but requires all the input to fit into available memory (that should not be a problem in your case, but it can be for larger inputs):
import numpy as np
import pandas as pd


def file_grouper_pandas(in_filepath="t.txt", out_filepath="t.csv", size=7):
    with open(in_filepath) as in_file:
        data = [x.rstrip('\n') for x in in_file.readlines()]
    # reshape requires the number of lines to be an exact multiple of `size`
    df = pd.DataFrame(np.array(data).reshape((-1, size)), columns=list(range(size)))
    # consistent with the other solutions
    df.to_csv(out_filepath, header=False, index=False)

%timeit file_grouper_pandas()
# 1 loop, best of 5: 666 ms per loop
If you do a lot of work with tables and data, NumPy and Pandas are very useful libraries.
import numpy as np
import pandas as pd

columns = ['ID', 'Name', 'Work position', 'Date 1 (year - month)', 'Date 2 (year - month)',
           'Gross payment', 'Service time']

with open('oldfile.txt', 'r') as stream:
    # read file into a list of lines
    lines = stream.readlines()

# remove newline character from each element of the list.
lines = [line.strip('\n') for line in lines]

# Figure out how many rows there will be in the table
# (integer division, since the number of sections must be a whole number)
number_of_people = len(lines) // 7

# Split data into rows
data = np.array_split(lines, number_of_people)

# Convert data to pandas dataframe
df = pd.DataFrame(data, columns=columns)
Once the data is in a Pandas DataFrame, it is easy to output it in any of the formats you listed. For example, to write csv:
df.to_csv('newfile.csv')
Or, for json:
df.to_json('newfile.json')
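The question also lists a database as an acceptable target. As a minimal sketch, assuming the lines have already been read into a list, the standard-library `sqlite3` module can store one row per person. The table name `people`, the column names and the in-memory database are hypothetical choices for illustration.

```python
import sqlite3

# Two sample people, 7 lines each, as they would come out of the .txt file.
lines = [
    "00000000886", "MANUEL DE JESUS SUBERVI PEÑA", "MAESTRO MEDIA GENERAL",
    "2006-08", "2021-09", "30,556.04", "15.7",
    "00000000086", "MANUEL DE JESUS SUBERVI PEÑA", "MAESTRO MEDIA GENERAL",
    "2006-01", "2021-09", "30,556.04", "15.7",
]
size = 7

# Slice the flat list into one tuple per person.
people = [tuple(lines[i:i + size]) for i in range(0, len(lines), size)]

# ":memory:" keeps the sketch self-contained; pass a file path to persist.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE people (id TEXT, name TEXT, work_position TEXT, "
    "date_1 TEXT, date_2 TEXT, gross_payment TEXT, service_time TEXT)"
)
conn.executemany("INSERT INTO people VALUES (?, ?, ?, ?, ?, ?, ?)", people)
conn.commit()

row_count = conn.execute("SELECT COUNT(*) FROM people").fetchone()[0]
first_id = conn.execute("SELECT id FROM people ORDER BY rowid LIMIT 1").fetchone()[0]
print(row_count, first_id)
conn.close()
```

Every column is declared as TEXT here so that the leading zeros of the ID survive. Converting the payment and service time to numbers would be a separate step.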