I'm building a process to "outer join" two CSV files and export the result as a JSON object.
import pandas

# read the source csv files
firstcsv = pandas.read_csv('file1.csv', names = ['main_index','attr_one','attr_two'])
secondcsv = pandas.read_csv('file2.csv', names = ['main_index','attr_three','attr_four'])
# merge them
output = firstcsv.merge(secondcsv, on='main_index', how='outer')
jsonresult = output.to_json(orient='records')
print(jsonresult)
Right now, the two CSV files look like this:
file1.csv:
1, aurelion, sol
2, lee, sin
3, cute, teemo
file2.csv:
1, midlane, mage
2, jungler, melee
I would like the resulting JSON output to be:
[{"main_index":1,"attr_one":"aurelion","attr_two":"sol","attr_three":"midlane","attr_four":"mage"},
{"main_index":2,"attr_one":"lee","attr_two":"sin","attr_three":"jungler","attr_four":"melee"},
{"main_index":3,"attr_one":"cute","attr_two":"teemo"}]
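Instead, for the row with main_index = 3, I'm getting:
{"main_index":3,"attr_one":"cute","attr_two":"teemo","attr_three":null,"attr_four":null}]
So nulls are automatically added to the output. I would like to remove them. I looked around but couldn't find a proper way to do it.
Hope someone can help me!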
Since we're using a DataFrame, pandas will "fill in" the missing values with NaN, i.e.
>>> print(output)
   main_index  attr_one attr_two attr_three attr_four
0           1  aurelion      sol    midlane      mage
1           2       lee      sin    jungler     melee
2           3      cute    teemo        NaN       NaN
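I can't see any option in the pandas to_json documentation for skipping null values: https://pandas.pydata.org/pandas-docs/stable/stable/generated/pandas.dataframe.to_json.to_json.html
So the way I came up with involves rebuilding the JSON string. This probably isn't very performant for large datasets with millions of rows (but there are fewer than 200 champions in League, so it shouldn't be a big issue!)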
from collections import OrderedDict
import json
jsonresult = output.to_json(orient='records')
# read the json string to get a list of dictionaries
rows = json.loads(jsonresult)
# new_rows = [
# # rebuild the dictionary for each row, only including non-null values
# {key: val for key, val in row.items() if pandas.notnull(val)}
# for row in rows
# ]
# to maintain order use Ordered Dict
new_rows = [
OrderedDict([
(key, row[key]) for key in output.columns
if (key in row) and pandas.notnull(row[key])
])
for row in rows
]
new_json_output = json.dumps(new_rows)
You'll find that new_json_output has dropped all the keys that have NaN values, and kept the column order:
>>> print(new_json_output)
[{"main_index": 1, "attr_one": " aurelion", "attr_two": " sol", "attr_three": " midlane", "attr_four": " mage"},
{"main_index": 2, "attr_one": " lee", "attr_two": " sin", "attr_three": " jungler", "attr_four": " melee"},
{"main_index": 3, "attr_one": " cute", "attr_two": " teemo"}]
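A possible variation on the above, offered only as a sketch: the round trip through a JSON string can be skipped by filtering the row dictionaries that to_dict(orient='records') returns. The inline io.StringIO buffers are stand-ins for file1.csv and file2.csv, and skipinitialspace=True is an addition that is not in the original code; it is assumed here so that the values come out without the leading space seen in the output above. Plain dicts keep insertion order on Python 3.7+, so OrderedDict is not needed in that case.

```python
import io
import json

import pandas

# Inline stand-ins for file1.csv and file2.csv (same rows as in the question)
file1 = io.StringIO("1, aurelion, sol\n2, lee, sin\n3, cute, teemo\n")
file2 = io.StringIO("1, midlane, mage\n2, jungler, melee\n")

# Assumption: skipinitialspace=True is added here to drop the blank after each comma
firstcsv = pandas.read_csv(file1, names=['main_index', 'attr_one', 'attr_two'],
                           skipinitialspace=True)
secondcsv = pandas.read_csv(file2, names=['main_index', 'attr_three', 'attr_four'],
                            skipinitialspace=True)
output = firstcsv.merge(secondcsv, on='main_index', how='outer')

# One dict per row, keeping only the cells that are not null/NaN
new_rows = [
    {key: val for key, val in row.items() if pandas.notnull(val)}
    for row in output.to_dict(orient='records')
]
new_json_output = json.dumps(new_rows)
print(new_json_output)
```

With this input, the record for main_index 3 should contain only main_index, attr_one and attr_two. This still builds one Python dict per row, so it is not expected to scale much better than the approach above on very large frames.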
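I was trying to achieve the same thing and found the following solution, which I think should be pretty fast (although I haven't tested that). It's a bit too late to answer the original question, but it may be useful to some.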
import pandas as pd

# Data
df = pd.DataFrame([
{"main_index":1,"attr_one":"aurelion","attr_two":"sol","attr_three":"midlane","attr_four":"mage"},
{"main_index":2,"attr_one":"lee","attr_two":"sin","attr_three":"jungler","attr_four":"melee"},
{"main_index":3,"attr_one":"cute","attr_two":"teemo"}
])
This gives a DataFrame with missing values.
>>> print(df)
  attr_four  attr_one attr_three attr_two  main_index
0      mage  aurelion    midlane      sol           1
1     melee       lee    jungler      sin           2
2       NaN      cute        NaN    teemo           3
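To convert it to JSON, transpose the DataFrame so that each original row becomes a column, then apply to_json() to each of those columns after filtering out the empty cells. Then join the resulting JSON strings, separated by commas, and wrap them in brackets.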
# To json
json_df = df.T.apply(lambda row: row[~row.isnull()].to_json())
json_wrapped = "[%s]" % ",".join(json_df)
Then
>>> print(json_wrapped)
[{"attr_four":"mage","attr_one":"aurelion","attr_three":"midlane","attr_two":"sol","main_index":1},{"attr_four":"melee","attr_one":"lee","attr_three":"jungler","attr_two":"sin","main_index":2},{"attr_one":"cute","attr_two":"teemo","main_index":3}]
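The same idea can also be written without the transpose, by iterating over the rows and calling dropna() on each one before serialising. This is only a sketch of an equivalent formulation, not a faster one; json_rows is a name introduced here for illustration. On recent pandas versions the columns follow the key order of the input dicts rather than the alphabetical order shown above.

```python
import pandas as pd

# Same data as above: the third record has no attr_three / attr_four
df = pd.DataFrame([
    {"main_index": 1, "attr_one": "aurelion", "attr_two": "sol", "attr_three": "midlane", "attr_four": "mage"},
    {"main_index": 2, "attr_one": "lee", "attr_two": "sin", "attr_three": "jungler", "attr_four": "melee"},
    {"main_index": 3, "attr_one": "cute", "attr_two": "teemo"},
])

# iterrows() yields each row as a Series; dropna() removes its missing cells
# before the row is turned into a JSON object
json_rows = [row.dropna().to_json() for _, row in df.iterrows()]

# Join the per-row objects with commas and wrap them in brackets
json_wrapped = "[%s]" % ",".join(json_rows)
print(json_wrapped)
```

Because the JSON array is assembled by hand from the per-row strings, it is worth checking the result with json.loads() if it will be consumed elsewhere.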