Get the values and row numbers of non-unique entries in a CSV file



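I need to check that all values in each column of a CSV file are unique, and get the row number and value of every non-unique entry. Example file: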

Vendor,Email,Country
Nick,nick@gmail.com,US
Joe,joe@gmail.com,NL
Nick,nk@gmail.com,GB
Mary,nk@gmail.com,AU

Expected result:

name = [{'Nick':3},]
email = [{'nk@gmail.com':4},]

Or something similar; the exact output format is up to you.

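All I have managed so far is a test for uniqueness, without locating the duplicates. I have two approaches.

1. Using the Python csv library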

import csv

# ParseError is the exception class used in my project (its import is not shown here)
with open(file) as csvfile:
    reader = csv.DictReader(csvfile)
    # Check that vendor names and emails are unique within the file, without querying the DB
    vendor_list = []
    email_list = []
    for count, rows in enumerate(reader, 1):
        vendor_list.append(rows['Vendor'])
        email_list.append(rows['Email'])
    vendor_unique = set(vendor_list)
    email_unique = set(email_list)
    # A set drops repeats, so a length mismatch means at least one duplicate
    if len(vendor_list) != len(vendor_unique) or len(email_list) != len(email_unique):
        raise ParseError('Your name and email should be unique')

2. Using pandas

import pandas as pd

record = pd.read_csv(file)
# unique() drops repeats, so a length mismatch means at least one duplicate
if len(record['Email']) != len(record['Email'].unique()):
    raise ParseError('Your email should be unique')
elif len(record['Vendor']) != len(record['Vendor'].unique()):
    raise ParseError('Your name should be unique')

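Pandas solution: mark the repeated values (every occurrence except the first) with Series.duplicated(), filter them with DataFrame.loc and boolean indexing, and finally swap keys and values so that the output is a dict mapping each value to its row number: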

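# df is the DataFrame returned by pd.read_csv(file), called record in the question
# k is the 0-based index label, so k + 1 is the 1-based data row (header not counted)
name = {v: k + 1 for k, v in df.loc[df['Vendor'].duplicated(), 'Vendor'].items()}
print(name)
# {'Nick': 3}
email = {v: k + 1 for k, v in df.loc[df['Email'].duplicated(), 'Email'].items()}
print(email)
# {'nk@gmail.com': 4}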
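
The dict comprehensions above keep only one row number per value, so if a value appeared three or more times, only its last repeat would survive. As a complement, here is a minimal sketch using only the standard csv module (the first approach from the question) that checks every column and records all repeat rows. The function name find_repeats and the inlined SAMPLE text are hypothetical, introduced only for illustration, and the sketch assumes well-formed rows with no extra fields.

```python
import csv
import io
from collections import defaultdict

# Sample data from the question, inlined so the sketch runs on its own.
SAMPLE = """Vendor,Email,Country
Nick,nick@gmail.com,US
Joe,joe@gmail.com,NL
Nick,nk@gmail.com,GB
Mary,nk@gmail.com,AU
"""


def find_repeats(csvfile):
    """Return {column: {value: [row numbers]}} for repeated values.

    Row numbers are 1-based and count data rows only (the header is
    skipped). The first occurrence of a value is not reported, only
    the later repeats. Hypothetical helper, not part of any library.
    """
    reader = csv.DictReader(csvfile)
    seen = defaultdict(set)      # column -> values seen so far
    repeats = defaultdict(dict)  # column -> {value: [row numbers]}
    for row_number, row in enumerate(reader, 1):
        for column, value in row.items():
            if value in seen[column]:
                repeats[column].setdefault(value, []).append(row_number)
            else:
                seen[column].add(value)
    return dict(repeats)


result = find_repeats(io.StringIO(SAMPLE))
print(result)
# {'Vendor': {'Nick': [3]}, 'Email': {'nk@gmail.com': [4]}}
```

With a real file, the same function could be called as find_repeats(open(path, newline='')). A column with no repeats (Country here) simply does not appear in the result, so an empty dict means every column is unique.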
