Python - pandas - How to remove null values from to_json after a DataFrame merge



I am building a process to "outer join" two CSV files and export the result as a JSON object.

import pandas

# read the source csv files
firstcsv = pandas.read_csv('file1.csv', names=['main_index', 'attr_one', 'attr_two'])
secondcsv = pandas.read_csv('file2.csv', names=['main_index', 'attr_three', 'attr_four'])
# merge them
output = firstcsv.merge(secondcsv, on='main_index', how='outer')
jsonresult = output.to_json(orient='records')
print(jsonresult)

Now, the two CSV files look like this:

file1.csv:
1, aurelion, sol
2, lee, sin
3, cute, teemo
file2.csv:
1, midlane, mage
2, jungler, melee

and I would like the resulting JSON output to be:

[{"main_index":1,"attr_one":"aurelion","attr_two":"sol","attr_three":"midlane","attr_four":"mage"},
{"main_index":2,"attr_one":"lee","attr_two":"sin","attr_three":"jungler","attr_four":"melee"},
{"main_index":3,"attr_one":"cute","attr_two":"teemo"}]

Instead, I am getting this for the line with main_index = 3:

{"main_index":3,"attr_one":"cute","attr_two":"teemo","attr_three":null,"attr_four":null}]

So nulls are automatically added to the output. I would like to remove them - I looked around but could not find a proper way to do it.

Hope someone can help me!

Since we are using a DataFrame, pandas fills the missing values with NaN, i.e.

>>> print(output)
      main_index   attr_one attr_two attr_three attr_four
0           1   aurelion      sol    midlane      mage
1           2        lee      sin    jungler     melee
2           3       cute    teemo        NaN       NaN
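
For anyone who wants to reproduce this without the files on disk, here is a minimal sketch. It assumes the CSV contents shown in the question and feeds them in through io.StringIO. Note that skipinitialspace=True is my own addition (it strips the space after each comma) and is not part of the original code. The NaN cells in the merged frame are what to_json writes out as null.

```python
import io
import pandas

# In-memory stand-ins for file1.csv and file2.csv (same rows as in the question)
file1 = io.StringIO("1, aurelion, sol\n2, lee, sin\n3, cute, teemo\n")
file2 = io.StringIO("1, midlane, mage\n2, jungler, melee\n")

# skipinitialspace=True is an assumption added here, not in the original code
firstcsv = pandas.read_csv(file1, names=['main_index', 'attr_one', 'attr_two'],
                           skipinitialspace=True)
secondcsv = pandas.read_csv(file2, names=['main_index', 'attr_three', 'attr_four'],
                            skipinitialspace=True)

# main_index 3 has no match in the second file, so its attr_three/attr_four are NaN
output = firstcsv.merge(secondcsv, on='main_index', how='outer')
jsonresult = output.to_json(orient='records')
print(jsonresult)
```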

I can't see any option in the pandas to_json documentation to skip null values: https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.to_json.html

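So the way I came up with involves rebuilding the JSON string. This probably won't perform well on large datasets with millions of rows (but there are fewer than 200 champions in League, so it shouldn't be a big problem!)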

from collections import OrderedDict
import json
jsonresult = output.to_json(orient='records')
# read the json string to get a list of dictionaries
rows = json.loads(jsonresult)
# new_rows = [
#     # rebuild the dictionary for each row, only including non-null values
#     {key: val for key, val in row.items() if pandas.notnull(val)}
#     for row in rows
# ]
# to maintain order use Ordered Dict
new_rows = [
    OrderedDict([
        (key, row[key]) for key in output.columns
        if (key in row) and pandas.notnull(row[key])
    ])
    for row in rows
]
new_json_output = json.dumps(new_rows)

You will find that new_json_output has dropped all keys that have NaN values, and kept the order:

>>> print(new_json_output)
[{"main_index": 1, "attr_one": " aurelion", "attr_two": " sol", "attr_three": " midlane", "attr_four": " mage"},
 {"main_index": 2, "attr_one": " lee", "attr_two": " sin", "attr_three": " jungler", "attr_four": " melee"},
 {"main_index": 3, "attr_one": " cute", "attr_two": " teemo"}]
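
If the round trip through a JSON string is a concern, one possible variant is to take the rows straight from the DataFrame with to_dict(orient='records') and filter those instead. This is only a sketch: the frame below is a hand-built stand-in for the merged output, and it relies on plain dicts keeping insertion order (Python 3.7+), so OrderedDict is not needed here.

```python
import json
import pandas

# Hand-built stand-in for the merged frame (NaN where file2.csv had no matching row)
output = pandas.DataFrame({
    'main_index': [1, 2, 3],
    'attr_one': ['aurelion', 'lee', 'cute'],
    'attr_two': ['sol', 'sin', 'teemo'],
    'attr_three': ['midlane', 'jungler', float('nan')],
    'attr_four': ['mage', 'melee', float('nan')],
})

# One dict per row, with keys in column order
rows = output.to_dict(orient='records')

# Rebuild each row, keeping only the non-null values
new_rows = [
    {key: val for key, val in row.items() if pandas.notnull(val)}
    for row in rows
]
new_json_output = json.dumps(new_rows)
print(new_json_output)
```

This still builds one Python dict per row, so I would not expect it to scale much better on very large frames; it only avoids serializing and parsing the JSON twice.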

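I was trying to achieve the same thing and found the following solution, which I think should be pretty fast (although I haven't tested it). It is a bit too late to answer the original question, but it may be useful to someone.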

import pandas as pd

# Data
df = pd.DataFrame([
    {"main_index":1,"attr_one":"aurelion","attr_two":"sol","attr_three":"midlane","attr_four":"mage"},
    {"main_index":2,"attr_one":"lee","attr_two":"sin","attr_three":"jungler","attr_four":"melee"},
    {"main_index":3,"attr_one":"cute","attr_two":"teemo"}
])

This gives a DataFrame with missing values.

>>> print(df)
  attr_four  attr_one attr_three attr_two  main_index
0      mage  aurelion    midlane      sol           1
1     melee       lee    jungler      sin           2
2       NaN      cute        NaN    teemo           3

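To convert it to JSON, you can transpose the DataFrame and apply to_json() to each record after filtering out the null values. Then join the resulting JSON strings, separated by commas, and wrap them in brackets.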

# To JSON
json_df = df.T.apply(lambda row: row[~row.isnull()].to_json())
json_wrapped = "[%s]" % ",".join(json_df)

Then

>>> print(json_wrapped)
[{"attr_four":"mage","attr_one":"aurelion","attr_three":"midlane","attr_two":"sol","main_index":1},{"attr_four":"melee","attr_one":"lee","attr_three":"jungler","attr_two":"sin","main_index":2},{"attr_one":"cute","attr_two":"teemo","main_index":3}]
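
A possible variation on the same idea, offered as a sketch rather than a tested alternative: apply along axis=1 so the transpose is not needed, use dropna() in place of the boolean mask, and parse the assembled string back with json.loads to confirm it is valid JSON. The name json_rows is just an illustrative choice, and the key order in the result depends on the pandas version (older versions sort the columns alphabetically, as in the output above).

```python
import json
import pandas as pd

df = pd.DataFrame([
    {"main_index": 1, "attr_one": "aurelion", "attr_two": "sol", "attr_three": "midlane", "attr_four": "mage"},
    {"main_index": 2, "attr_one": "lee", "attr_two": "sin", "attr_three": "jungler", "attr_four": "melee"},
    {"main_index": 3, "attr_one": "cute", "attr_two": "teemo"},
])

# axis=1 passes each row as a Series; dropna() removes its missing entries
json_rows = df.apply(lambda row: row.dropna().to_json(), axis=1)
json_wrapped = "[%s]" % ",".join(json_rows)

# Parse the hand-assembled string back to check that it is valid JSON
parsed = json.loads(json_wrapped)
print(parsed[2])
```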
