How to put every n rows into n columns

I have a .txt file with 683,500 lines. Every 7 lines describe a different person and contain:

ID
Name
Work position
Date 1 (year-month)
Date 2 (year-month)
Gross payment
Service time

I want to read the .txt and output each person as 7 columns (the output could be JSON, CSV, txt, or even a database), for example:

ID    Name     Work position   Date 1   Date 2    Gross payment     Service time
ID    Name     Work position   Date 1   Date 2    Gross payment     Service time
ID    Name     Work position   Date 1   Date 2    Gross payment     Service time
ID    Name     Work position   Date 1   Date 2    Gross payment     Service time

Sample from the text:

00000000886
MANUEL DE JESUS SUBERVI PEÑA
MAESTRO MEDIA GENERAL
2006-08
2021-09
30556.04
15.7
00000000086
MANUEL DE JESUS SUBERVI PEÑA
MAESTRO MEDIA GENERAL
2006-01
2022-09
30556.04
15.7
00100000086
MANUEL DE JESUS SUBERVI PEÑA
MAESTRO MEDIA GENERAL
2006-01
2021-09
30556.04
15.7

import csv
#opening file
file = open (r"C:\Users\Redford\Documents\Proyecto automatizacion\data1.txt") #open file
counter = 0
total_lines = len(file.readlines()) #count lines
#print('Total lines:', x)
#reading from file
content = file.read()
colist  = content.split ()
print(colist)

#read data from data1.txt and write in data2.txt
lines = open (r"C:\Users\Redford\Documents\Proyecto automatizacion\data1.txt")
arr = []
with open('data2.txt', 'w') as f:
    for line in lines:
        #arr.append(line)
        f.write (line)

I'm new to programming and don't know how to turn this logic into code.

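Your code does not collect multiple lines in order to write them out as a single row.

Use this approach:

read the file line by line
collect each line, without its \n, into a list
if the list length reaches 7, write it to the csv and clear the list
repeat until done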

Create the data file:

with open("t.txt", "w") as f:
    f.write("""00000000886\nMANUEL DE JESUS SUBERVI PEÑA\nMAESTRO MEDIA GENERAL\n2006-08\n2021-09\n30,556.04\n15.7
00000000086\nMANUEL DE JESUS SUBERVI PEÑA\nMAESTRO MEDIA GENERAL\n2006-01\n2021-09\n30,556.04\n15.7
00100000086\nMANUEL DE JESUS SUBERVI PEÑA\nMAESTRO MEDIA GENERAL\n2006-01\n2021-09\n30,556.04\n15.7""")

Program:

import csv

with open("t.csv", "w", newline="") as wr, open("t.txt") as r:
    # create a csv writer
    writer = csv.writer(wr)
    # uncomment if you want a header over your data
    # h = ["ID", "Name", "Work position", "Date 1", "Date 2",
    #      "Gross payment", "Service time"]
    # writer.writerow(h)
    person = []
    for line in r:  # could use enumerate as well, this works ok
        # collect line data minus the \n into list
        person.append(line.strip())
        # this person is finished, write, clear list
        if len(person) == 7:
            # leveraged the csv module writer, look it up if you need
            # to customize it further regarding quoting etc
            writer.writerow(person)
            person = []  # reset list for next person
    # something went wrong, your file is inconsistent, write remainder
    if person:
        writer.writerow(person)

print(open("t.csv").read())

Output:

00000000886,MANUEL DE JESUS SUBERVI PEÑA,MAESTRO MEDIA GENERAL,2006-08,2021-09,"30,556.04",15.7
00000000086,MANUEL DE JESUS SUBERVI PEÑA,MAESTRO MEDIA GENERAL,2006-01,2021-09,"30,556.04",15.7
00100000086,MANUEL DE JESUS SUBERVI PEÑA,MAESTRO MEDIA GENERAL,2006-01,2021-09,"30,556.04",15.7
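
The quotes around 30,556.04 in the output above come from the csv writer, not from the input file. To see this in isolation, here is a minimal sketch that formats one row into an in-memory buffer. The helper name `row_to_csv_line` is made up for this illustration; `csv.writer` and `io.StringIO` are standard library.

```python
import csv
import io

def row_to_csv_line(row):
    """Format one row with csv.writer and return it without the line terminator."""
    buffer = io.StringIO()
    # Default dialect: fields are quoted only when they contain a
    # delimiter, a quote character, or a line break.
    csv.writer(buffer).writerow(row)
    return buffer.getvalue().rstrip("\r\n")

print(row_to_csv_line(["00000000886", "30,556.04", "15.7"]))
# 00000000886,"30,556.04",15.7
```

Reading the file back with `csv.reader` removes the quotes again, so the value round-trips as 30,556.04.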

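Read up on: csv module - writer

The "Gross payment" value needs to be quoted because it contains a ',', which is the csv delimiter. The module does this automatically.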

On top of @PatrickArtner's great answer, I would like to suggest a solution based on itertools:

import csv
import itertools

def file_grouper_itertools(
        in_filepath="t.txt",
        out_filepath="t.csv",
        size=7):
    with open(in_filepath, 'r') as in_file, \
            open(out_filepath, 'w', newline='') as out_file:
        writer = csv.writer(out_file)
        # the same iterator repeated `size` times, so each block
        # consumes `size` consecutive lines
        args = [iter(in_file)] * size
        for block in itertools.zip_longest(*args, fillvalue=''):
            # equivalent, for the given input, to:
            # block = [x.rstrip('\n') for x in block]
            block = ''.join(block).rstrip('\n').split('\n')
            writer.writerow(block)

The idea here is to loop over blocks of the required size. For larger group sizes this gets faster, because the main loop runs fewer iterations.
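
To make the block-wise looping easier to follow, here is a small sketch of the same idiom applied to an in-memory list instead of a file. The function name `chunks` and the sample values are invented for the illustration; only `itertools.zip_longest` comes from the standard library.

```python
import itertools

def chunks(iterable, size, fillvalue=None):
    """Return an iterator of tuples holding `size` consecutive items each."""
    # One iterator object repeated `size` times: every slot of a tuple
    # pulls the next item from that single shared iterator.
    args = [iter(iterable)] * size
    # The last tuple is padded with `fillvalue` if the input runs short.
    return itertools.zip_longest(*args, fillvalue=fillvalue)

lines = ["id1", "name1", "job1", "id2", "name2", "job2", "id3"]
for block in chunks(lines, 3, fillvalue=""):
    print(block)
# ('id1', 'name1', 'job1')
# ('id2', 'name2', 'job2')
# ('id3', '', '')
```

The padding in the last tuple is what signals an incomplete final record.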

Some micro-benchmarks suggest that your use case should benefit from this approach compared with a manual loop (adapted into a function):

import csv

def file_grouper_manual(
        in_filepath="t.txt",
        out_filepath="t.csv",
        size=7):
    with open(in_filepath, 'r') as in_file, \
            open(out_filepath, 'w', newline='') as out_file:
        writer = csv.writer(out_file)
        block = []
        for line in in_file:
            block.append(line.rstrip('\n'))
            if len(block) == size:
                writer.writerow(block)
                block = []
        # write any incomplete trailing block
        if block:
            writer.writerow(block)

Benchmark:

n = 100_000
k = 7
with open("t.txt", "w") as f:
    for i in range(n):
        # trailing "\n" so consecutive records do not run together
        f.write("\n".join(["0123456"] * k) + "\n")

%timeit file_grouper_manual()
# 1 loop, best of 5: 325 ms per loop
%timeit file_grouper_itertools()
# 1 loop, best of 5: 230 ms per loop

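Alternatively, you could use Pandas, which is very convenient but requires all the input to fit into available memory (this should not be a problem in your case, but it could be for larger inputs):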

import numpy as np
import pandas as pd

def file_grouper_pandas(in_filepath="t.txt", out_filepath="t.csv", size=7):
    with open(in_filepath) as in_file:
        data = [x.rstrip('\n') for x in in_file.readlines()]
    # reshape requires the number of lines to be a multiple of `size`
    df = pd.DataFrame(np.array(data).reshape((-1, size)), columns=list(range(size)))
    # consistent with the other solutions
    df.to_csv(out_filepath, header=False, index=False)

%timeit file_grouper_pandas()
# 1 loop, best of 5: 666 ms per loop

If you do a lot of work with tables and data, NumPy and Pandas are very useful libraries.

import numpy as np
import pandas as pd

columns = ['ID', 'Name', 'Work position', 'Date 1 (year - month)', 'Date 2 (year - month)',
           'Gross payment', 'Service time']
with open('oldfile.txt', 'r') as stream:
    # read file into a list of lines
    lines = stream.readlines()
# remove newline character from each element of the list.
lines = [line.strip('\n') for line in lines]
# Figure out how many rows there will be in the table
# (integer division, so the section count is an int)
number_of_people = len(lines) // 7
# Split data into rows
data = np.array_split(lines, number_of_people)
# Convert data to pandas dataframe
df = pd.DataFrame(data, columns=columns)

Once the data is in a Pandas DataFrame, it is easy to output it in any of the formats you listed. For example, to output to csv, you can do:

df.to_csv('newfile.csv')

Or for json, it would be:

df.to_json('newfile.json')
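
Since the question also mentions a database, the same grouping can feed SQLite through the standard library. This is only a sketch under assumptions: the table name `people`, the column names, and the helper `load_people` are invented here, every value is stored as text, and an incomplete trailing record is skipped rather than written.

```python
import sqlite3

FIELDS_PER_PERSON = 7  # taken from the question: 7 lines per person

def load_people(lines, db_path=":memory:"):
    """Group `lines` into 7-field records and insert them into SQLite."""
    fields = [line.rstrip("\n") for line in lines]
    # Keep only complete records; a short trailing group is dropped.
    usable = len(fields) - len(fields) % FIELDS_PER_PERSON
    records = [tuple(fields[i:i + FIELDS_PER_PERSON])
               for i in range(0, usable, FIELDS_PER_PERSON)]

    conn = sqlite3.connect(db_path)
    # Hypothetical schema: TEXT everywhere so IDs keep their leading zeros.
    conn.execute(
        "CREATE TABLE IF NOT EXISTS people ("
        "id TEXT, name TEXT, work_position TEXT, date_1 TEXT, "
        "date_2 TEXT, gross_payment TEXT, service_time TEXT)"
    )
    conn.executemany("INSERT INTO people VALUES (?, ?, ?, ?, ?, ?, ?)", records)
    conn.commit()
    return conn

sample = ["00000000886\n", "MANUEL DE JESUS SUBERVI PEÑA\n",
          "MAESTRO MEDIA GENERAL\n", "2006-08\n", "2021-09\n",
          "30,556.04\n", "15.7\n"]
conn = load_people(sample)
print(conn.execute("SELECT id, name FROM people").fetchall())
# [('00000000886', 'MANUEL DE JESUS SUBERVI PEÑA')]
```

An open file object can be passed as `lines` as well, and a file path as `db_path` makes the database persistent.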
