How to handle a column delimiter that appears inside a column's value without using any built-in Python library



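I want to read a CSV file in Python 3, but due to some restrictions I cannot use any libraries. In almost every row, one or more columns contain a comma (","), and as the number of columns grows, using row.split(',') causes problems.

My code is: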

import csv
file_name = "train_1.csv"
columns = [
    "PassengerId",
    "Survived",
    "Pclass",
    "Name",
    "Sex",
    "Age",
    "SibSp",
    "Parch",
    "Ticket",
    "Fare",
    "Cabin",
    "Embarked"
]
print("Total columns should be: {}".format(len(columns)))
with open(file_name, 'r') as reader:
    for line in reader.readlines():
        row_data = line.split(',')
        if len(row_data) != len(columns):
            print('This row does not have the required # of columns: {}'.format(
                len(row_data)))
            print(row_data)

My (incorrect) output is:

['1', '0', '3', '"Braund', ' Mr. Owen Harris"', 'male', '22', '1', '0', 'A/5 21171', '7.25', '', 'S\n']

Instead, it should be:

['1', '0', '3', '"Braund, Mr. Owen Harris"', 'male', '22', '1', '0', 'A/5 21171', '7.25', '', 'S']

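The extra column comes from the name being split into two pieces instead of staying as one; there is also a stray \n in the last column.

My main concern, however, is the extra column created by the split. Note: the csv reader solves this problem, but because of the library restriction I can't use any libraries.

Part of the input is: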

PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
1,0,3,"Braund, Mr. Owen Harris",male,22,1,0,A/5 21171,7.25,,S
2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Thayer)",female,38,1,0,PC 17599,71.2833,C85,C
3,1,3,"Heikkinen, Miss. Laina",female,26,0,0,STON/O2. 3101282,7.925,,S
4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35,1,0,113803,53.1,C123,S

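The full data is available here.

The comma inside the Name column's value splits the name into two columns. The following solution fixes that and also strips the newline from the Embarked column's value.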

print("Total columns should be: {}".format(len(columns)))
with open(file_name, 'r') as reader:
    for line in reader.readlines():
        # Drop the trailing newline before splitting
        row_data = line.replace('\n', '').split(',')
        if len(row_data) != len(columns):
            # Rejoin the two halves of the Name value and drop the extra piece
            row_data[3] = row_data[3] + ',' + row_data[4]
            del row_data[4]
        print(row_data)
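
The fix above assumes exactly one extra comma per row. A minimal sketch of a generalization, assuming only the Name column (index 3) can ever contain commas, could rejoin however many surplus pieces there are. The helper name `merge_name_pieces` and its parameters are hypothetical, not part of the original code.

```python
def merge_name_pieces(row_data, expected, name_index=3):
    """Rejoin surplus pieces produced by commas inside one known column.

    Assumption: only the column at name_index can contain commas,
    so every surplus piece belongs to that column.
    """
    extra = len(row_data) - expected
    if extra <= 0:
        return row_data
    end = name_index + extra + 1
    merged = ','.join(row_data[name_index:end])
    return row_data[:name_index] + [merged] + row_data[end:]


line = '1,0,3,"Braund, Mr. Owen Harris",male,22,1,0,A/5 21171,7.25,,S\n'
row = merge_name_pieces(line.rstrip('\n').split(','), expected=12)
print(row)
```

This still breaks if any other column contains a comma, so it is only a stopgap for this particular file.
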
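
Not being able to use any built-in modules is an odd restriction, but writing your own CSV parser is easy.

As you noticed, you have to handle values that contain commas; CSV deals with those by quoting the whole string.

The full data in the link also contains a row that adds another wrinkle: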

889,0,3,"Johnston, Miss. Catherine Helen ""Carrie""",female,,1,2,W./C. 6607,23.45,,S

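This value has an embedded comma, so it is quoted. However, it also contains quotes, so the CSV format "escapes" them by doubling each one. I assume you want to preserve those escaped quotes.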
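
To make the doubling rule concrete, here is a small sketch of how a single quoted field could be turned back into its plain value. The helper name `unquote_field` is hypothetical and only illustrates the escaping convention described above.

```python
def unquote_field(field):
    """Strip the surrounding quotes and collapse doubled quotes.

    Fields that are not wrapped in double quotes are returned unchanged.
    """
    if len(field) >= 2 and field.startswith('"') and field.endswith('"'):
        return field[1:-1].replace('""', '"')
    return field


raw = '"Johnston, Miss. Catherine Helen ""Carrie"""'
print(unquote_field(raw))
```
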

def csv_values(text_line, delim=','):
    row = []
    embedded = False
    parts = []
    for word in text_line.split(delim):
        # Set flag marking start of quoted value
        if word.startswith('"'):
            embedded = True
        if embedded:
            # If scanning a quoted value (with embedded commas),
            # add the current portion to the accumulator
            # word = word.replace('""', r'"')
            parts.append(word)
        else:
            # Otherwise, append the value to the collection
            row.append(word)
        # Unset flag, marking end of quoted value
        if word.endswith('"'):
            embedded = False
            # Add the accumulated value
            # row.append(','.join(parts)[1:-1])
            row.append(','.join(parts))
            # Reset the accumulator
            parts = []
    return row

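This implementation is my "as-is" approach, meaning the only thing it does is accumulate the pieces of values that contain embedded commas. Using rows 882-891, I get this result: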

['882', '0', '3', '"Markun, Mr. Johann"', 'male', '33', '0', '0', '349257', '7.8958', '', 'S']
['883', '0', '3', '"Dahlberg, Miss. Gerda Ulrika"', 'female', '22', '0', '0', '7552', '10.5167', '', 'S']
['884', '0', '2', '"Banfield, Mr. Frederick James"', 'male', '28', '0', '0', 'C.A./SOTON 34068', '10.5', '', 'S']
['885', '0', '3', '"Sutehall, Mr. Henry Jr"', 'male', '25', '0', '0', 'SOTON/OQ 392076', '7.05', '', 'S']
['886', '0', '3', '"Rice, Mrs. William (Margaret Norton)"', 'female', '39', '0', '5', '382652', '29.125', '', 'Q']
['887', '0', '2', '"Montvila, Rev. Juozas"', 'male', '27', '0', '0', '211536', '13', '', 'S']
['888', '1', '1', '"Graham, Miss. Margaret Edith"', 'female', '19', '0', '0', '112053', '30', 'B42', 'S']
['889', '0', '3', '"Johnston, Miss. Catherine Helen ""Carrie"""', 'female', '', '1', '2', 'W./C. 6607', '23.45', '', 'S']
['890', '1', '1', '"Behr, Mr. Karl Howell"', 'male', '26', '0', '0', '111369', '30', 'C148', 'C']
['891', '0', '3', '"Dooley, Mr. Patrick"', 'male', '32', '0', '0', '370376', '7.75', '', 'Q']

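If you would rather drop the surrounding quotes and unescape the embedded quotes, uncomment the two commented-out code lines in the function (the word.replace('""', r'"') line and the row.append(','.join(parts)[1:-1]) line) and comment out the row.append(','.join(parts)) line. That gives this result: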

['882', '0', '3', 'Markun, Mr. Johann', 'male', '33', '0', '0', '349257', '7.8958', '', 'S']
['883', '0', '3', 'Dahlberg, Miss. Gerda Ulrika', 'female', '22', '0', '0', '7552', '10.5167', '', 'S']
['884', '0', '2', 'Banfield, Mr. Frederick James', 'male', '28', '0', '0', 'C.A./SOTON 34068', '10.5', '', 'S']
['885', '0', '3', 'Sutehall, Mr. Henry Jr', 'male', '25', '0', '0', 'SOTON/OQ 392076', '7.05', '', 'S']
['886', '0', '3', 'Rice, Mrs. William (Margaret Norton)', 'female', '39', '0', '5', '382652', '29.125', '', 'Q']
['887', '0', '2', 'Montvila, Rev. Juozas', 'male', '27', '0', '0', '211536', '13', '', 'S']
['888', '1', '1', 'Graham, Miss. Margaret Edith', 'female', '19', '0', '0', '112053', '30', 'B42', 'S']
['889', '0', '3', 'Johnston, Miss. Catherine Helen "Carrie"', 'female', '', '1', '2', 'W./C. 6607', '23.45', '', 'S']
['890', '1', '1', 'Behr, Mr. Karl Howell', 'male', '26', '0', '0', '111369', '30', 'C148', 'C']
['891', '0', '3', 'Dooley, Mr. Patrick', 'male', '32', '0', '0', '370376', '7.75', '', 'Q']

Either way, you can use the function like this:

with open(file_name, 'r') as in_file:
    # File objects have no splitlines(); read the text first
    csv_lines = in_file.read().splitlines()
# Separate header from rest
headers, lines = csv_lines[0], csv_lines[1:]
for line in lines:
    print(csv_values(line))
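
One caveat with splitting on the delimiter and rejoining: accumulation can stop early when a piece inside a quoted value itself ends with a quote, as in "a ""b"", c". A character-by-character scan avoids that. The following is only a sketch under stated assumptions, not a full CSV implementation: the name `parse_csv_line` is hypothetical, the trailing newline is assumed to be stripped already, and quoted values that span several lines are not handled.

```python
def parse_csv_line(line, delim=',', quote='"'):
    """Split one CSV line into fields, scanning one character at a time."""
    fields = []
    current = []
    in_quotes = False
    i = 0
    while i < len(line):
        ch = line[i]
        if in_quotes:
            if ch == quote:
                if i + 1 < len(line) and line[i + 1] == quote:
                    # Doubled quote inside a quoted value: keep one literal quote
                    current.append(quote)
                    i += 1
                else:
                    # Single quote: the quoted value ends here
                    in_quotes = False
            else:
                current.append(ch)
        elif ch == quote:
            # Opening quote: delimiters are literal until the closing quote
            in_quotes = True
        elif ch == delim:
            fields.append(''.join(current))
            current = []
        else:
            current.append(ch)
        i += 1
    # The last field has no trailing delimiter
    fields.append(''.join(current))
    return fields


sample = '889,0,3,"Johnston, Miss. Catherine Helen ""Carrie""",female,,1,2,W./C. 6607,23.45,,S'
print(parse_csv_line(sample))
```

This variant returns values with the surrounding quotes removed and doubled quotes unescaped, which corresponds to the second output listing above.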

After looking at the CSV file, I found that the Name column is the messy one and should be handled specially.

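file_name = "train_1.csv"
columns = [
    "PassengerId",
    "Survived",
    "Pclass",
    "Name",
    "Sex",
    "Age",
    "SibSp",
    "Parch",
    "Ticket",
    "Fare",
    "Cabin",
    "Embarked"
]
print("Total columns should be: {}".format(len(columns)))
header = False
with open(file_name, 'r') as reader:
    for line in reader.readlines():
        # Strip only the newline, so a final line without one stays intact
        line = line.rstrip('\n')
        if not header:
            # Skip the header row
            header = True
            continue
        # Everything before the first quote; drop the empty piece after the last comma
        line_pre_name = line.split('"', 1)[0].split(',')[:-1]
        # Text between the first and last quote, so doubled quotes inside
        # the name do not cut it short (assumes Name is the only quoted column)
        name = [line.split('"', 1)[1].rsplit('"', 1)[0]]
        # Everything after the last quote; drop the empty piece before the first comma
        line_post_name = line.split('"')[-1].split(',')[1:]
        row_data = line_pre_name + name + line_post_name
        if len(row_data) != len(columns):
            print('This row does not have the required # of columns: {}'.format(
                len(row_data)))
            print(row_data)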
