如何用pandas转换csv中的嵌套json



我有一个嵌套的json文件(100k行(,看起来像这样:

{"UniqueId":"4224f3c9-323c-e911-a820-a7f2c9e35195","TransactionDateUTC":"2019-03-01 15:00:52.627 UTC","Itinerary":"MUC-CPH-ARN-MUC","OriginAirportCode":"MUC","DestinationAirportCode":"CPH","OneWayOrReturn":"Return","Segment":[{"DepartureAirportCode":"MUC","ArrivalAirportCode":"CPH","SegmentNumber":"1","LegNumber":"1","NumberOfPassengers":"1"},{"DepartureAirportCode":"ARN","ArrivalAirportCode":"MUC","SegmentNumber":"2","LegNumber":"1","NumberOfPassengers":"1"}]}

我正在尝试创建一个csv,这样它就可以很容易地加载到rdbms中。我试图在pandas中使用json_normalize((,但甚至在到达那里之前,我就收到了以下错误。

with open('transactions.json') as data_file:    
data = json.load(data_file)

JSONDecodeError: Extra data: line 2 column 1 (char 466)

如果您的问题源于读取json文件本身,那么我只需要使用:

json.loads() 

然后使用

pd.read_csv()

如果你的问题源于从json dict到数据帧的转换,你可以使用这个:

test = {"UniqueId":"4224f3c9-323c-e911-a820-a7f2c9e35195","TransactionDateUTC":"2019-03-01 15:00:52.627 UTC","Itinerary":"MUC-CPH-ARN-MUC","OriginAirportCode":"MUC","DestinationAirportCode":"CPH","OneWayOrReturn":"Return","Segment":[{"DepartureAirportCode":"MUC","ArrivalAirportCode":"CPH","SegmentNumber":"1","LegNumber":"1","NumberOfPassengers":"1"},{"DepartureAirportCode":"ARN","ArrivalAirportCode":"MUC","SegmentNumber":"2","LegNumber":"1","NumberOfPassengers":"1"}]}
import json
import pandas
# convert json to string and read
df = pd.read_json(json.dumps(test), convert_axes=True)
# 'unpack' the dict as series and merge them with original df
df = pd.concat([df, df.Segment.apply(pd.Series)], axis=1)
# remove dict
df.drop('Segment', axis=1, inplace=True)

这将是我的方法,但可能还有更方便的方法。

第一步:在记录文件上循环

由于您的文件每行有一个JSON记录,因此您需要对文件中的所有记录进行循环,您可以这样做:

with open('transactions.json', encoding="utf8") as data_file:
for line in data_file:
data = json.loads(line) 
# or
df = pd.read_json(line, convert_axes=True)
# do something with data or df

第二步:编写CSV文件

现在,您可以将其与csv.writer结合起来,将文件转换为CSV文件。

with open('transactions.csv', "w", encoding="utf8") as csv_file:
writer = csv.writer(csv_file)
#Loop for each record, somehow:
#row = build list with row contents
writer.writerow(row)

把它们放在一起

我将读取第一条记录一次,以获得显示密钥并将其输出为CSV标头,然后我将读取整个文件并一次将其转换为一条记录:

import copy
import csv
import json
import pandas as pd
# Read the first JSON record to get the keys that we'll use as headers for the CSV file
with open('transactions.json', encoding="utf8") as data_file:
keys = list(json.loads(next(data_file)).keys())
# Our CSV headers are going to be the keys from the first row, except for
# segments, which we'll replace (arbitrarily) by three numbered segment column
# headings.
keys.pop()
base_keys = copy.copy(keys)
keys.extend(["Segment1", "Segment2", "Segment3"])
with open('transactions.csv', "w", encoding="utf8") as csv_file:
writer = csv.writer(csv_file)
writer.writerow(keys)  # Write the CSV headers
with open('transactions.json', encoding="utf8") as data_file:
for line in data_file:
data = json.loads(line)
row = [data[k] for k in base_keys] + data["Segment"]
writer.writerow(row)

生成的CSV文件在每个Segmenti列中仍将有一个JSON记录。如果你想以不同的方式格式化每个段,你可以定义一个format_segment(segment)函数,并用以下列表理解取代data["Segment"][format_segment(segment) for segment in data["Segment"]]

最新更新