将带有三种格式数据的JSON文件解析为三个单独的数据范围



我有JSON数据,其中包含我需要读取的三种不同类型的记录,将每种类型解析到每种类型中,并将其存储为三个单独的数据帧。我在这里使用了虚拟数据,因为实际记录很长。数据为每种记录类型具有不同数量的列,并具有一些共同的列名称。模式基本上是:

架构1:每个项目ID的记录具有宽格格式的基本特征,看起来像:

[id,year,object_name,base_price]

架构2和3:每个是长长格式的一组不同的功能,用于单个ID,看起来像:

[id, feature1, price1]
[id, feature2, price2]
[id, feature1, size1]
[id, feature2, size2]

这是JSON数据的外观

[
{"id": ["500068"], "year": ["2018"], "object_name": "barn"},
{"id": ["500069"], "year": ["2018"], "object_name": "well"},
{"id": ["500068"], "year": ["2018"], "feature": "extension","price": "65"},
{"id": ["500068"], "year": ["2018"], "feature": "stalls","price": "72"},
{"id": ["500070"], "year": ["2018"], "feature": "depth_extension","price": "55"},
{"id": ["500070"], "year": ["2018"], "feature": "cover","price": "55"},
{"id": ["500070"], "year": ["2018"], "feature": "depth_extension","size": "1050"},
{"id": ["500070"], "year": ["2018"], "feature": "cover","size": "1160"},
]

我试图将JSON变成可以尝试这样做的格式,但我被卡在这里。

import pandas as pd
with open(file, 'r') as f:
    data = f.readlines()
# remove the trailing "n" from each line
data = map(lambda x: x.rstrip(), data)
# each element of 'data' is an individual JSON object.
# i want to convert it into an *array* of JSON objects
# which, in and of itself, is one large JSON object
# basically... add square brackets to the beginning
# and end, and have all the individual business JSON objects
# separated by a comma
data_json_str = "[" + ','.join(data) + "]"
# now, load it into pandas 
# (I get stuck here since they are different record types
data_df = pd.read_json(data_json_str)

我希望将数据读取为具有适当列名称的三个不同的熊猫数据框。感谢您的帮助!

从JSON数据的各种列名创建3个不同的数据帧,您可以尝试以下代码。

import pandas as pd
json_data='[{"id": ["500068"], "year": ["2018"], "object_name": "barn"},{"id": ["500069"], "year": ["2018"], "object_name": "well"},{"id": ["500068"], "year": ["2018"], "feature": "extension","price": "65"},{"id": ["500068"], "year": ["2018"], "feature": "stalls","price": "72"},{"id": ["500070"], "year": ["2018"], "feature": "depth_extension","price": "55"},{"id": ["500070"], "year": ["2018"], "feature": "cover","price": "55"},{"id": ["500070"], "year": ["2018"], "feature": "depth_extension","size": "1050"},{"id": ["500070"], "year": ["2018"], "feature": "cover","size": "1160"},]'
list_data=eval(json_data)
df=pd.DataFrame(list_data)
df1=df[df['size'].notnull()]
df1=df1.drop(['object_name','price'], 1)
print(df1)
print('n')
df2=df[df['price'].notnull()]
df2=df2.drop(['object_name','size'], 1)
print(df2)
print('n')
df3=df[df['object_name'].notnull()]
df3=df3.drop(['feature', 'price', 'size'], 1)
print(df3)
print('n')

输出:

           feature        id  size    year
6  depth_extension  [500070]  1050  [2018]
7            cover  [500070]  1160  [2018]

           feature        id price    year
2        extension  [500068]    65  [2018]
3           stalls  [500068]    72  [2018]
4  depth_extension  [500070]    55  [2018]
5            cover  [500070]    55  [2018]

         id object_name    year
0  [500068]        barn  [2018]
1  [500069]        well  [2018]

希望这可以帮助您,祝您有美好的一天。:(

最新更新