How to parse a CSV file whose rows have a varying number of elements, without using external libraries



I am trying to parse a CSV file with Python; after the first row, the number of elements per row increases from 6 to 7.

CSV example:

Title,Name,Job,Email,Address,ID
Eng.,"FirstName, LastName",Engineer,email@company.com,ACME Company,1234567
Eng.,"FirstName, LastName",Engineer,email@company.com,ACME Company,1234567

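I need a way to format the output and present it as a clean table.

As far as I understand, the problem with my code is that from the second row onward the number of CSV elements increases from 6 to 7. As a result, it throws the following error.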

print(stringFormat.format(item.split(',')[0], item.split(',')[1], item.split(',')[2],
item.split(',')[3], item.split(',')[4], item.split(',')[5],))
IndexError: list index out of range

My code:

stringFormat = "{:>10} {:>10} {:>10} {:>10} {:>10}  {:>10}"
with open("the_file", 'r') as file:
    for item in file.readlines():
        print(stringFormat.format(item.split(',')[0], item.split(',')[1],
                                  item.split(',')[2], item.split(',')[3],
                                  item.split(',')[4], item.split(',')[5],
                                  item.split(',')[6]))
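The changing element count can be reproduced without the file. The following is a minimal sketch that uses the two kinds of sample rows above as string literals: a plain `split(',')` also cuts at the comma inside the quoted name, so a data row yields 7 pieces while the header yields only 6, and indexing the header with `[6]` is what raises the `IndexError`.

```python
# Sample rows copied from the CSV example above
header = 'Title,Name,Job,Email,Address,ID'
row = 'Eng.,"FirstName, LastName",Engineer,email@company.com,ACME Company,1234567'

header_parts = header.split(',')
row_parts = row.split(',')

print(len(header_parts))  # 6
print(len(row_parts))     # 7

# The quoted name has been cut in two, quotes included
print(repr(row_parts[1]), repr(row_parts[2]))
```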

You can do this with a very simple for loop, as shown below. I added a print statement to show the effect.

# 'r' is not needed, it is the default value if omitted
with open("file_name") as infile:
    result = []
    # split the read() into a list of lines
    # I prefer this over readlines() as this removes the EOL character
    # automagically (I mean the `\n` char)
    for line in infile.read().splitlines():
        # check if line is empty (stripping all spaces)
        if len(line.strip()) == 0:
            continue
        # another way would be to check for ',' characters
        if ',' not in line:
            continue
        # set some helper variables
        line_result = []
        found_quote = False
        element = ""
        # iterate over the line by character
        for c in line:
            # toggle the found_quote if quote found
            if c == '"':
                found_quote = not found_quote
                continue
            if c == ",":
                if found_quote:
                    # a comma inside quotes belongs to the element
                    element += c
                else:
                    # append the element to the line_result and reset element
                    line_result.append(element)
                    element = ""
            else:
                # append c to the element
                element += c
        # append leftover element to the line_result
        line_result.append(element)

        # append the line_result to the final result
        result.append(line_result)
        print(len(line_result), line_result)

print('------------------------------------------------------------')
stringFormat = "{:>10} {:>20} {:>20} {:>20} {:>20}  {:>10}"
for line in result:
    print(stringFormat.format(*line))

Output

6 ['Title', 'Name', 'Job', 'Email', 'Address', 'ID']
6 ['Eng.', 'FirstName, LastName', 'Engineer', 'email@company.com', 'ACME Company', '1234567']
6 ['Eng.', 'FirstName, LastName', 'Engineer', 'email@company.com', 'ACME Company', '1234567']
------------------------------------------------------------
Title                 Name                  Job                Email              Address          ID
Eng.  FirstName, LastName             Engineer    email@company.com         ACME Company     1234567
Eng.  FirstName, LastName             Engineer    email@company.com         ACME Company     1234567
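The column widths in the format string above are hard-coded. As one possible variation, the widths could be derived from the parsed data instead. The sketch below assumes a list of lists like `result` from the code above; the helper name `format_table` is made up for illustration and is not part of the original answer.

```python
def format_table(rows):
    """Right-align every column to the width of its longest cell."""
    n_cols = max(len(r) for r in rows)
    # pad short rows with empty strings so every row has n_cols cells
    padded = [list(r) + [""] * (n_cols - len(r)) for r in rows]
    # widest cell per column decides that column's width
    widths = [max(len(r[i]) for r in padded) for i in range(n_cols)]
    return [" ".join(f"{cell:>{w}}" for cell, w in zip(r, widths))
            for r in padded]

# Parsed rows as produced by the character loop above
result = [
    ['Title', 'Name', 'Job', 'Email', 'Address', 'ID'],
    ['Eng.', 'FirstName, LastName', 'Engineer', 'email@company.com',
     'ACME Company', '1234567'],
]
for text in format_table(result):
    print(text)
```

Because each width is computed from the data, a longer name or address widens its own column rather than breaking the alignment.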

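Some adjustments after the discussion

A note on sorting a list of lists: the sort compares the first elements of the inner lists with each other. If they match, it compares the second elements, and so on. You may therefore want to move the ID column to the second position in the result list, since it appears to be the so-called unique identifier (UID).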
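The comparison order described above can be checked on a small list. This is only an illustrative sketch with made-up rows (they are not from the question's file); it also shows `sorted(..., key=...)` as an alternative to moving columns around.

```python
# Hypothetical rows: [title, name, id]
rows = [
    ['Eng.', 'B Name', '2'],
    ['Dr.', 'Z Name', '3'],
    ['Eng.', 'A Name', '1'],
]

# Default ordering: first element, then second element on ties
by_default = sorted(rows)
print(by_default)

# Ordering by the ID column only, without rearranging the columns.
# The IDs are strings here, so they compare as text; int(r[2]) would
# compare them numerically.
by_id = sorted(rows, key=lambda r: r[2])
print(by_id)
```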

with open("file_name") as infile:
    lines = infile.read().splitlines()
# set the header and remove it from lines.
header = lines.pop(0).split(',')
# rearrange the header to put the last element (date) first
# -1 gets the last element (eg, count from end)
header.insert(0, header.pop(-1))
# store the header length as this will speed up the process for longer files
# otherwise you would have to call len(header) in each iteration of the loop
header_len = len(header)
result = []
for line in lines:
    if ',' not in line:
        continue
    # split the line once here, so we don't have to split it a million
    # times in the rest of the loop
    split_line = line.split(',')
    if len(split_line) > header_len:
        # note, you can remove the strip('"') if you want to keep the quotation marks
        # also note that .pop() removes the element "in place", which is why I
        # use .pop(1) twice. first time it gets firstname, second time it gets lastname
        split_line.insert(1, f"{split_line.pop(1)},{split_line.pop(1)}".strip('"'))
    # move the date element to the start
    split_line.insert(0, split_line.pop(-1))
    # do some slicing on the date element to turn it into YYYYMMDD as this allows for
    # proper sorting without any hassle. I'm assuming the date you provided is in the format
    # MM/DD/YYYY. You can easily move the order around if it's DD/MM/YYYY
    # Also, pad day/month with leading zeros using f"{string:>02}"
    split_line[0] = f"{split_line[0].split('/')[2]}{split_line[0].split('/')[0]:>02}{split_line[0].split('/')[1]:>02}"
    result.append(split_line)
# sort it. Since the date is in numeric format, and the first element, it sorts
# properly automagically
result.sort()
# if you want you can re-format the date again. you can do so with some list slicing
# since the date string is now properly formatted this is very easy to do
# because the sort() above happens outside the initial loop, we cannot do it inside said loop
# note: the slices below write the date back as DD/MM/YYYY
for line in result:
    line[0] = f"{line[0][6:]}/{line[0][4:6]}/{line[0][0:4]}"
# insert the header
result.insert(0, header)

stringFormat = "{:>10} {:>25} {:>20} {:>20} {:>20} {:>10} {:>10}"
for line in result:
    print(stringFormat.format(*line))

# write it as a CSV file with ; used as separator instead
with open("output.csv", "w") as outfile:
    for line in result:
        outfile.write(";".join(line) + "\n")
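The date handling above rewrites the date column twice (to YYYYMMDD for sorting, then back). A possible alternative is to leave the stored value alone and only compute a sortable key. The sketch below assumes the MM/DD/YYYY format mentioned in the comments; the helper name `date_key` and the sample dates are invented for illustration, and `str.zfill` is used for the zero padding in place of the `:>02` format spec.

```python
def date_key(date_text):
    """Turn 'MM/DD/YYYY' into a sortable 'YYYYMMDD' string."""
    month, day, year = date_text.split('/')
    # zfill pads single-digit months and days with a leading zero
    return f"{year}{month.zfill(2)}{day.zfill(2)}"

# Hypothetical rows with the date already moved to the first column
rows = [
    ['12/1/2021', 'Eng.'],
    ['3/15/2021', 'Eng.'],
    ['1/2/2022', 'Eng.'],
]

# Sort by the computed key; the original date text stays untouched
rows.sort(key=lambda r: date_key(r[0]))
for r in rows:
    print(r)
```

With this approach there is no second loop to restore the date, and the output keeps whatever date format the input used.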

You could try something like this. The for loop uses the length of the split item, so it works with rows of variable length.

stringFormats = ["{:>10}", "{:>10}", "{:>10}", "{:>10}", "{:>10}", "{:>10}", "{:>10}"]
with open("the_file", 'r') as file:
    for item in file.readlines():
        s_item = item.split(',')
        f_item = ''
        for x in range(len(s_item)):
            f_item += stringFormats[x].format(s_item[x])
        print(f_item)

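Of course, you need at least as many format strings as there are fields in the longest row. If you never need different options per column, you can change the format strings back to a single string instead of indexing into a list.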

stringFormat = "{:>10}"
with open("the_file", 'r') as file:
    for item in file.readlines():
        s_item = item.split(',')
        f_item = ''
        for a_field in s_item:
            f_item += stringFormat.format(a_field)
        print(f_item)
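The two variants above can also be combined: per-column widths where they are known, and a fallback width for any extra fields, so a row longer than the list does not raise an `IndexError`. This is only a sketch; `column_widths`, `default_width` and `format_row` are hypothetical names, and the widths are arbitrary. Note that, like the loops above, it takes fields that were already split, so a plain `split(',')` will still cut a quoted "FirstName, LastName" into two fields.

```python
column_widths = [6, 12, 12]   # hypothetical widths for the first three columns
default_width = 10            # used for any column beyond the list

def format_row(fields):
    """Right-align each field, falling back to default_width."""
    parts = []
    for i, field in enumerate(fields):
        width = column_widths[i] if i < len(column_widths) else default_width
        parts.append(f"{field:>{width}}")
    return "".join(parts)

print(format_row(['Eng.', 'Engineer', 'ACME', '1234567']))
```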
