Strange JSON to CSV conversion



I have about 700 MB of data in the following format:

{"hash":"b2f405b1589efd8b013869d1d5e605367643db20844572ea7bf788f8575c38d6","block_timestamp":"2020-05-08 13:21:33 UTC","addresses":["3E17PiWGJqP8945KRZHuPdsFSU59othGEQ"]}
{"hash":"6609073b5979d768933f2ea7d4f1723d07c03a3e08f48adff21b9f1d79cee164","block_timestamp":"2020-05-08 13:39:39 UTC","addresses":["3CfewsC7Xjp2oJSBT2zUQkYSXfzo2nuGha"]}
{"hash":"5c7d95f903ea505d9ab82d1090944780c00e91d343ae66e94610bff1d614f90f","block_timestamp":"2020-04-05 23:19:30 UTC","addresses":["1ztVt2xwNwgzH3W9SJ2nMgPMuZpUg8m5w"]}
{"hash":"7eb120e9b50dbc25f13415b3c899efe2cfaf870a7f49995aa6e3b672a1992e56","block_timestamp":"2020-04-08 05:41:51 UTC","addresses":["1HckjUpRGcrrRAtFaaCAUaGjsPx9oYmLaZ"]}
{"hash":"be202b37aa218461827138ff32e3dfa74945808f3ecb574fb5287e99c8ae6a33","block_timestamp":"2020-04-04 09:53:28 UTC","addresses":["3Jk8HaC8Sjq6Ufig9NkWFoFcfzC5a3CNyL"]}

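Currently this data is saved as a JSON file, and I want to convert it to CSV. The usual Python and Bash approaches do not give me the correct result. (Note that there are no commas between the lines, just a newline.)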

I want a CSV file with the header hash, block_timestamp, addresses. How can I do this?

The simplest way is to do it in a single line of bash. For example, if your file is named tt, just run:

# Split on double quotes rather than on ":" because the timestamps contain colons.
awk -F'"' '{print $4 "," $8 "," $12}' tt

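Alternatively, if you really want to do this in Python, you can read the file line by line, parse each line with json.loads, and then print the line in CSV format (this is what ranka47 does in their answer). The problem is that this is much slower for large files.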

I think this is what you want. It assumes that each JSON object has only one address under its addresses key.

Note: when you read the file line by line, Python loads only one line at a time, so this can handle large data.

import json

# Read the input one line at a time so the whole file is never held in memory.
with open("input.txt", "r") as fp, open("output.txt", "w") as ofp:
    ofp.write("hash,block_timestamp,addresses\n")
    for line in fp:
        json_obj = json.loads(line)
        # Assumes exactly one address per record.
        ofp.write(json_obj["hash"] + "," + json_obj["block_timestamp"] + "," + json_obj["addresses"][0] + "\n")

# Print the result to check it.
with open("output.txt") as fp:
    for line in fp:
        print(line, end="")

Output:

hash,block_timestamp,addresses
b2f405b1589efd8b013869d1d5e605367643db20844572ea7bf788f8575c38d6,2020-05-08 13:21:33 UTC,3E17PiWGJqP8945KRZHuPdsFSU59othGEQ
6609073b5979d768933f2ea7d4f1723d07c03a3e08f48adff21b9f1d79cee164,2020-05-08 13:39:39 UTC,3CfewsC7Xjp2oJSBT2zUQkYSXfzo2nuGha
5c7d95f903ea505d9ab82d1090944780c00e91d343ae66e94610bff1d614f90f,2020-04-05 23:19:30 UTC,1ztVt2xwNwgzH3W9SJ2nMgPMuZpUg8m5w
7eb120e9b50dbc25f13415b3c899efe2cfaf870a7f49995aa6e3b672a1992e56,2020-04-08 05:41:51 UTC,1HckjUpRGcrrRAtFaaCAUaGjsPx9oYmLaZ
be202b37aa218461827138ff32e3dfa74945808f3ecb574fb5287e99c8ae6a33,2020-04-04 09:53:28 UTC,3Jk8HaC8Sjq6Ufig9NkWFoFcfzC5a3CNyL
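
The script above keeps only the first entry of addresses. If some records could carry more than one address, one option is to join them into a single field. The following is a minimal sketch, not part of the original answer: the function name jsonl_to_csv and the ";" separator are assumptions made for illustration, and the records in the demonstration are made up. It uses csv.writer so that any field containing a comma would be quoted correctly.

```python
import csv
import io
import json


def jsonl_to_csv(lines, out, sep=";"):
    """Write one CSV row per JSON line, joining multiple addresses with `sep`.

    `jsonl_to_csv` and `sep` are illustrative names, not from the original answer.
    """
    writer = csv.writer(out)
    writer.writerow(["hash", "block_timestamp", "addresses"])
    for line in lines:
        obj = json.loads(line)
        writer.writerow([obj["hash"], obj["block_timestamp"], sep.join(obj["addresses"])])


# Small in-memory demonstration with made-up records.
sample = [
    '{"hash":"aa","block_timestamp":"2020-05-08 13:21:33 UTC","addresses":["addr1"]}',
    '{"hash":"bb","block_timestamp":"2020-05-08 13:39:39 UTC","addresses":["addr2","addr3"]}',
]
buf = io.StringIO()
jsonl_to_csv(sample, buf)
print(buf.getvalue(), end="")
```

For a real file, the same function should accept open file objects instead of the in-memory sample, for example the input opened with open("input.txt") and the output opened with open("output.csv", "w", newline="").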

If you have pandas, you can do the following:

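import json
import pandas as pd

# Note: this loads every record into memory at once.
with open("input.txt", "r") as fp:
    contents = [json.loads(line) for line in fp]
df = pd.DataFrame(contents)
# Each "addresses" value is a list; keep its first element (assumes one address per record).
df["addresses"] = df["addresses"].str[0]
df.to_csv("output.csv", index=False)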

This can be done by deserializing each line and passing the resulting dict to csv.DictWriter.

import csv
import json

with open('data.json') as jf, open('data.csv', 'w', newline='') as f:
    # Handle the first row individually because we need to work out the
    # column headings
    line = next(jf)
    dict_ = json.loads(line)
    writer = csv.DictWriter(f, dict_.keys())
    writer.writeheader()
    # Note: the "addresses" list is written as its string form, e.g. ['3E17...']
    writer.writerow(dict_)
    # Loop through the rest of the file
    for line in jf:
        dict_ = json.loads(line)
        writer.writerow(dict_)

If you don't want a header row, the code can be simplified to use csv.writer:

import csv
import json

with open('data.json') as jf, open('data.csv', 'w', newline='') as f:
    writer = csv.writer(f)
    for line in jf:
        dict_ = json.loads(line)
        writer.writerow(dict_.values())
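
Both versions call json.loads on every line, which raises json.JSONDecodeError on a blank or truncated line and aborts the conversion. With a 700 MB file it may be worth tolerating such lines. Here is a minimal sketch under that assumption; the helper name iter_records is hypothetical, and skipping bad lines is a choice made for illustration, not something the answers above do.

```python
import json
import sys


def iter_records(lines):
    """Yield (line_number, dict) for each non-blank line that parses as JSON.

    Hypothetical helper: blank lines are skipped, and malformed lines are
    reported on stderr and skipped instead of stopping the whole run.
    """
    for number, line in enumerate(lines, start=1):
        line = line.strip()
        if not line:
            continue
        try:
            yield number, json.loads(line)
        except json.JSONDecodeError as exc:
            print(f"skipping line {number}: {exc}", file=sys.stderr)


# Made-up sample: one good line, one blank, one truncated, one good.
sample = ['{"hash":"aa"}', "", '{"hash":"bb"', '{"hash":"cc"}']
records = list(iter_records(sample))
print(records)
```

The yielded dicts can be passed to writer.writerow exactly as in the loops above, and the line numbers make it easier to find the offending input afterwards.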

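Actually, in my opinion, the simplest way is:
- Open the file with a powerful editor, since your file is large.
- Replace all the dict keys with empty text, e.g. replace "hash": with nothing, and so on.
- Also replace {, }, [ and ] with empty text.
- What is left is well-formed CSV.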

Haha, no code needed.
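
For anyone who would rather script the same find-and-replace than run it in an editor, the steps can be sketched as below. This is only an illustration: the name strip_json_syntax is made up, the sample record is invented, and the approach assumes that no value ever contains a brace, a bracket, or one of the key strings. Note that the fields stay double-quoted, and a record with several addresses would produce extra columns.

```python
import csv
import re

# Matches the three known keys together with their trailing colon.
KEYS = re.compile(r'"(?:hash|block_timestamp|addresses)":')
# Matches the JSON braces and brackets that are to be deleted.
BRACKETS = re.compile(r"[{}\[\]]")


def strip_json_syntax(line):
    """Apply the find-and-replace steps to one line (illustrative name)."""
    return BRACKETS.sub("", KEYS.sub("", line)).strip()


sample = '{"hash":"aa","block_timestamp":"2020-05-08 13:21:33 UTC","addresses":["addr1"]}'
row = strip_json_syntax(sample)
print(row)

# The remaining quoted fields parse as ordinary CSV.
fields = next(csv.reader([row]))
print(fields)
```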
