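I need to convert a CSV file to JSON, but the output is not what I expected. The address fields (line1, line2, etc.) are repeated in the JSON output, and I need to remove those duplicates.
Input data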
7,priya,kannan,shanthapriya794@gmail.com,07-12-1994,"123","456",67,mdu,tn,india
7,priya,kannan,shanthapriya7964@gmail.com,07-12-1994,"123","456",67,mdu,tn,india
Expected output
[ {
"source_id": 7,
"fname": "priya",
"lname": "kannan",
"date_of_birth": "07-12-1994",
"email": ["shanthapriya794@gmail.com", "shanthapriya7964@gmail.com"],
"address": [{
"line1": 123,
"line2": 456,
"line3": 67,
"city": "mdu",
"state": "tn",
"country": "india"
}]
}]
Actual output
[ {
"source_id": 7,
"fname": "priya",
"lname": "kannan",
"date_of_birth": "07-12-1994",
"email": ["shanthapriya794@gmail.com", "shanthapriya7964@gmail.com"],
"address": [{
"line1": 123,
"line2": 456,
"line3": 67,
"city": "mdu",
"state": "tn",
"country": "india"
}, {
"line1": 123,
"line2": 456,
"line3": 67,
"city": "mdu",
"state": "tn",
"country": "india"
}]
}]
Code tried
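import pandas as pd

# path is the location of the input CSV file
g_cols = ['source_id', 'fname', 'lname', 'email', 'date_of_birth']
df = pd.read_csv(path, sep=",", header=0)
# every column that is not a grouping column is treated as an address field
cols = df.columns[~df.columns.isin(g_cols)]
# collect the unique emails of each person into a tuple
g_cols.remove('email')
df = (df.sort_values(g_cols)
        .set_index(g_cols)
        .assign(email=df.groupby(g_cols)['email'].agg(lambda x: tuple(pd.unique(x))))
        .reset_index())
g_cols.append('email')
# nest the address columns as a list of records per person
df1 = df.groupby(g_cols)[cols].apply(lambda x: x.to_dict('records')).reset_index(name='address').to_dict('records')
print(df1)
df2 = pd.DataFrame(df1)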
Use the drop_duplicates() method at this step:
df1 = df.drop_duplicates().groupby(g_cols)[cols].apply(lambda x: x.to_dict('records')).reset_index(name='address').to_dict('records')
df1
Output:
[{'source_id': 7,
'fname': 'priya',
'lname': 'kannan',
'date_of_birth': '07-12-1994',
'email': ('shanthapriya794@gmail.com', 'shanthapriya7964@gmail.com'),
'address': [{'ln1': 123,
'ln2': 456,
'ln3': 67,
'cty': 'mdu',
'state': 'tn',
'cntry': 'india'}]}]
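The reason this works is that once the emails have been aggregated into one tuple per person, the two input rows become identical, so drop_duplicates() keeps only one of them and the address list ends up with a single entry. A minimal sketch of that effect, using a small hand-built frame whose values are assumed for illustration (it is not the original CSV):

```python
import pandas as pd

# Hand-built stand-in for the frame after the email aggregation step:
# both rows carry the same email tuple and the same address values.
emails = ("shanthapriya794@gmail.com", "shanthapriya7964@gmail.com")
frame = pd.DataFrame({
    "source_id": [7, 7],
    "email": [emails, emails],
    "line1": [123, 123],
    "city": ["mdu", "mdu"],
})

# Tuples are hashable, so drop_duplicates() can compare the email column.
deduped = frame.drop_duplicates()
print(len(frame), len(deduped))
```

Note that this relies on the emails being stored as tuples; a column of lists would not be hashable.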
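Full code:
import pandas as pd

# path is the location of the input CSV file
g_cols = ['source_id', 'fname', 'lname', 'email', 'date_of_birth']
df = pd.read_csv(path, sep=",", header=0)
# every column that is not a grouping column is treated as an address field
cols = df.columns[~df.columns.isin(g_cols)]
# collect the unique emails of each person into a tuple
g_cols.remove('email')
df = (df.sort_values(g_cols)
        .set_index(g_cols)
        .assign(email=df.groupby(g_cols)['email'].agg(lambda x: tuple(pd.unique(x))))
        .reset_index())
g_cols.append('email')
# drop the now-identical rows before nesting the address columns
df1 = df.drop_duplicates().groupby(g_cols)[cols].apply(lambda x: x.to_dict('records')).reset_index(name='address').to_dict('records')
print(df1)
df2 = pd.DataFrame(df1)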
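As an alternative sketch (a different technique from the one above, not a replacement for it), the same nested structure could be built by iterating over the groups explicitly and de-duplicating the emails and the addresses within each group, then serializing with the standard json module. The header names and the inline CSV text are assumptions for illustration, since the post does not show the header row, and to_native is a hypothetical helper.

```python
import io
import json

import pandas as pd

# Assumed header row; the two data rows are the ones from the question.
csv_text = """source_id,fname,lname,email,date_of_birth,line1,line2,line3,city,state,country
7,priya,kannan,shanthapriya794@gmail.com,07-12-1994,"123","456",67,mdu,tn,india
7,priya,kannan,shanthapriya7964@gmail.com,07-12-1994,"123","456",67,mdu,tn,india
"""
df = pd.read_csv(io.StringIO(csv_text))

person_cols = ["source_id", "fname", "lname", "date_of_birth"]
address_cols = ["line1", "line2", "line3", "city", "state", "country"]

records = []
for key, group in df.groupby(person_cols, sort=False):
    # key is a tuple with one value per grouping column
    record = dict(zip(person_cols, key))
    # keep each distinct email once, in order of appearance
    record["email"] = group["email"].drop_duplicates().tolist()
    # keep each distinct address once
    record["address"] = group[address_cols].drop_duplicates().to_dict("records")
    records.append(record)


def to_native(obj):
    """Hypothetical fallback: convert numpy scalars to plain Python values."""
    if hasattr(obj, "item"):
        return obj.item()
    raise TypeError(f"Cannot serialize {type(obj).__name__}")


json_text = json.dumps(records, indent=2, default=to_native)
result = json.loads(json_text)
print(json_text)
```

The explicit loop is longer than the chained groupby, but it avoids grouping on a tuple-valued column and makes it easy to see where each kind of duplicate is removed.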