Is there a simple way to load a JSON file with the following structure:
('ID_1', 'col1_1', 'col2_1', 'col3_1', 'key1', 'value1', 'col6_1')
('ID_1', 'col1_1', 'col2_1', 'col3_1', 'key2', 'value2', 'col6_1')
('ID_1', 'col1_1', 'col2_1', 'col3_1', 'key3', 'value3', 'col6_1')
('ID_2', 'col1_2', 'col2_2', 'col3_2', 'key1', 'value1', 'col6_2')
('ID_2', 'col1_2', 'col2_2', 'col3_2', 'key2', 'value2', 'col6_2')
('ID_2', 'col1_2', 'col2_2', 'col3_2', 'key3', 'value3', 'col6_2')
and turn it into:
('ID_1', 'col1_1', 'col2_1', 'col3_1', 'key1', 'key2', 'key3', 'col6_1')
('ID_2', 'col1_2', 'col2_2', 'col3_2', 'key1', 'key2', 'key3', 'col6_2')
where value1, value2, value3 are assigned to key1, key2, key3 respectively?
I would like to use pandas or pyspark functions for this.
This file structure is not valid JSON, but once the data is loaded you can use DataFrame.drop_duplicates()
to remove the duplicate rows:
import pandas as pd

# Sample data with duplicate 'brand' values.
df = pd.DataFrame({
    'brand': ['Yum Yum', 'Yum Yum', 'Indomie', 'Indomie', 'Indomie'],
    'style': ['cup', 'cup', 'cup', 'pack', 'pack'],
    'rating': [4, 4, 3.5, 15, 5]
})

# Keep only the first row for each brand and renumber the index.
df.drop_duplicates(subset=['brand'], keep='first', inplace=True, ignore_index=True)
print(df)
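With subset=['brand'] and keep='first', only the first row for each brand survives: the first 'Yum Yum' row (rating 4.0) and the first 'Indomie' row (rating 3.5), reindexed as 0 and 1.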
API reference: pandas.DataFrame.drop_duplicates
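The reshape asked about in the question itself is a pivot of the key/value pairs into columns rather than a de-duplication. A minimal pandas sketch, assuming the rows have already been parsed into a DataFrame and using invented column names (id, col1, key, value, col6) that would need to be adapted to the real schema:

import pandas as pd

# Toy rows in the shape shown in the question; column names are
# assumptions for illustration only.
rows = pd.DataFrame({
    'id':    ['ID_1', 'ID_1', 'ID_1', 'ID_2', 'ID_2', 'ID_2'],
    'col1':  ['col1_1'] * 3 + ['col1_2'] * 3,
    'key':   ['key1', 'key2', 'key3'] * 2,
    'value': ['value1', 'value2', 'value3'] * 2,
    'col6':  ['col6_1'] * 3 + ['col6_2'] * 3,
})

# Spread each distinct key into its own column, keeping one row per
# (id, col1, col6) combination; aggfunc='first' just picks the single
# value present for each cell.
wide = (rows.pivot_table(index=['id', 'col1', 'col6'], columns='key',
                         values='value', aggfunc='first')
            .reset_index())
wide.columns.name = None
print(wide)

This yields one row per ID with key1, key2 and key3 as columns holding value1, value2 and value3, which matches the target layout in the question.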