Python loop to remove duplicates + originals from a CSV import



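I have a CSV file to import, and I want to skip both the duplicate rows and their original rows, based on the user number in the first column. I'm using the StringIO module. My current approach is below. It's incorrect: although it skips the duplicate rows, I believe it still imports the original row. What is the best way to skip both the duplicates and the originals when importing from a CSV?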

from io import StringIO

def csv_import(stream):
    ostream = StringIO()
    headers = stream.readline()
    ostream.write(headers)
    seen_user_numbers = {}
    for row in stream:
        list_row = row.split(',')
        user_number = list_row[0]
        if user_number in seen_user_numbers:
            # Duplicate: skipped here, but the original row was already written
            seen_user_numbers.pop(user_number)
            continue
        seen_user_numbers[user_number] = True
        ostream.write(row)
    ostream.seek(0)
    return ostream

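You can't know whether a row should be included until you reach the end of the input file, because a duplicate may still appear later. So you need to hold all the rows that haven't been excluded in memory, and only write them to the output once the whole input has been read.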

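You can do this with a dictionary (dictionaries preserve insertion order in Python 3.7+, so the output keeps the order of the input):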

from io import StringIO

def csv_import(stream):
    ostream = StringIO()
    headers = stream.readline()
    ostream.write(headers)
    outputLines = dict()  # will use None for lines to exclude
    for row in stream:
        list_row = row.split(',')
        user_number = list_row[0]

        if user_number in outputLines:
            # Seen before: exclude every row with this user number
            outputLines[user_number] = None
        else:
            outputLines[user_number] = row

    # filter(None, ...) drops the entries that were marked as None
    for row in filter(None, outputLines.values()):
        ostream.write(row)
    ostream.seek(0)
    return ostream
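
If the user number (or any other field) might be quoted or contain commas, splitting on ',' can misread the row. Below is a rough sketch of the same idea built on the standard csv module and collections.Counter, counting first and writing second. The function name csv_import_unique and the sample data are made up for illustration, and it assumes the whole file fits in memory, as above.

```python
import csv
from collections import Counter
from io import StringIO


def csv_import_unique(stream):
    """Return a StringIO holding the header plus the rows whose
    first column (the user number) occurs exactly once."""
    reader = csv.reader(stream)
    headers = next(reader, None)
    rows = [row for row in reader if row]  # skip blank lines

    # First pass: count how often each user number occurs.
    counts = Counter(row[0] for row in rows)

    ostream = StringIO()
    writer = csv.writer(ostream, lineterminator="\n")
    if headers is not None:
        writer.writerow(headers)
    # Second pass: keep only rows whose user number is unique.
    writer.writerows(row for row in rows if counts[row[0]] == 1)
    ostream.seek(0)
    return ostream


# Hypothetical sample: users 1 and 2 are repeated, user 3 is not.
sample = StringIO("user,name\n1,Ann\n2,Bob\n1,Cy\n3,Dee\n2,Eve\n2,Flo\n")
print(csv_import_unique(sample).getvalue())
```

With this sample only the header and the row for user 3 should remain. Counting occurrences also makes it easy to change the rule later, for example keeping rows that appear at most twice.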
