How to normalize a JSON string column of a pandas DataFrame



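I am trying to split a column that holds an array of JSON objects into multiple columns, creating one row per object. Here is the sample data: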

signalid    monthyear   readings
5135         201901    [{"v":"90","q":1,"t":1546444800000,"g":0}]
5135         201901    [{"v":"50","q":1,"t":1546444900000,"g":0}]
5135         201901    [{"v":"40","q":1,"t":1546445800000,"g":0}]
5135         201901    [{"v":"30","q":1,"t":1546446800000,"g":0},{"v":"35","q":1,"t":1546446900000,"g":0}]

Details about the data:

  1. The data is fetched from a Cassandra database.

  2. The column's data type in Cassandra is text.

  3. dataframe.readings[0]

'[{"v":"9817","q":1,"t":1548979150085,"g":0},{"v":"9821","q":1,"t":1548979151475,"g":0}]'

  4. dataframe.readings.dtype

    dtype('O')

Normalization code:

normalizeddataframe = json_normalize(data=dataframe, record_path='readings', 
                            meta=["signalid", "monthyear"])

Expected output:

signalid    monthyear  v     q         t           g
5135         201901    90    1    1546444800000    0
5135         201901    50    1    1546444900000    0
5135         201901    40    1    1546445800000    0
5135         201901    30    1    1546446800000    0
5135         201901    35    1    1546446900000    0

Result:

TypeError                                 Traceback (most recent call last)
<ipython-input-13-486775fb3431> in <module>
     60 #normalizeddataframe = pandas.read_json(dataframe.readings)
     61 normalizeddataframe = json_normalize(data=dataframe, record_path='readings', 
---> 62                             meta=["signalid", "monthyear", "fromtime", "totime", "avg", "insertdate", "max", "min"])
     63 #normalizeddataframe = json_normalize(pandas.Series.to_json(dataframe.readings))
     64 print("Processig completed");
C:\ProgramData\Anaconda3\lib\site-packages\pandas\io\json\normalize.py in json_normalize(data, record_path, meta, meta_prefix, record_prefix, errors, sep)
    260                 records.extend(recs)
    261 
--> 262     _recursive_extract(data, record_path, {}, level=0)
    263 
    264     result = DataFrame(records)
C:\ProgramData\Anaconda3\lib\site-packages\pandas\io\json\normalize.py in _recursive_extract(data, path, seen_meta, level)
    236         else:
    237             for obj in data:
--> 238                 recs = _pull_field(obj, path[0])
    239 
    240                 # For repeating the metadata later
C:\ProgramData\Anaconda3\lib\site-packages\pandas\io\json\normalize.py in _pull_field(js, spec)
    183                 result = result[field]
    184         else:
--> 185             result = result[spec]
    186 
    187         return result
TypeError: string indices must be integers
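
I know this is a few years old, but I thought I would add my answer in case anyone else runs into this problem. It focuses on preserving the structure of the input and output specified by the author. I am sure there is some room for optimization.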

import pandas as pd
import json
signalid = [5135, 5135, 5135, 5135]
monthyear = [201901, 201901, 201901, 201901]
readings = ['[{"v":"90","q":1,"t":1546444800000,"g":0}]', '[{"v":"50","q":1,"t":1546444900000,"g":0}]', '[{"v":"40","q":1,"t":1546445800000,"g":0}]', '[{"v":"30","q":1,"t":1546446800000,"g":0},{"v":"35","q":1,"t":1546446900000,"g":0}]']
# Reconstruct sample DataFrame
df = pd.DataFrame({
        "signalid": signalid,
        "monthyear": monthyear,
        "readings": readings
    })
# Convert strings to JSON objects
df['readings'] = df['readings'].map(json.loads)
# Can't use nested lists of JSON objects in pd.json_normalize
df = df.explode(column='readings').reset_index(drop=True)
# pd.json_normalize expects a list of JSON objects not a DataFrame
df = pd.concat([df[['signalid', 'monthyear']], pd.json_normalize(df['readings'])], axis=1) 
print(df)

Output:

   signalid  monthyear   v  q              t  g
0      5135     201901  90  1  1546444800000  0
1      5135     201901  50  1  1546444900000  0
2      5135     201901  40  1  1546445800000  0
3      5135     201901  30  1  1546446800000  0
4      5135     201901  35  1  1546446900000  0
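
For what it's worth, the json_normalize call from the question can also be made to work once the strings are parsed, because record_path expects a list of dicts rather than a DataFrame. A minimal sketch, assuming a pandas version that exposes pd.json_normalize (1.0 or later) and a shortened copy of the sample data; the names records and flat are only for illustration:

```python
import json
import pandas as pd

# Shortened copy of the sample data (assumption: readings holds JSON strings)
df = pd.DataFrame({
    "signalid": [5135, 5135],
    "monthyear": [201901, 201901],
    "readings": [
        '[{"v":"90","q":1,"t":1546444800000,"g":0}]',
        '[{"v":"30","q":1,"t":1546446800000,"g":0},{"v":"35","q":1,"t":1546446900000,"g":0}]',
    ],
})

# Parse each JSON string into a list of dicts
df["readings"] = df["readings"].map(json.loads)

# json_normalize wants a list of dicts, so convert the rows first
records = df.to_dict(orient="records")
flat = pd.json_normalize(records, record_path="readings", meta=["signalid", "monthyear"])

# The meta columns are appended last; reorder to match the expected output
flat = flat[["signalid", "monthyear", "v", "q", "t", "g"]]
print(flat)
```

Note that the meta columns may come back with object dtype, so a cast might be needed if numeric types matter.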

Suppose we have a data frame with a column named readings whose values are

[{"v":"90","q":1,"t":1546444800000,"g":0}]

[{"v":"50","q":1,"t":1546444900000,"g":0}]

[{"v":"40","q":1,"t":1546445800000,"g":0}]

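import json
# Write each key into its own column row by row; data[key] = value would overwrite the whole column
for idx, row in data['readings'].items():
    records = json.loads(row) if isinstance(row, str) else row
    for value_dict in records:
        for key, value in value_dict.items():
            data.loc[idx, key] = value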

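You did not mention the expected output when a list contains more than one JSON object. I think this might help you. If you describe the desired output, I can adapt my code accordingly.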
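
If the goal is one output row per JSON object, a plain list comprehension is one way to get there. A rough sketch, assuming readings holds JSON strings and using a shortened copy of the sample data; rows and result are names made up for this example:

```python
import json
import pandas as pd

# Shortened copy of the sample data (assumption: readings holds JSON strings)
data = pd.DataFrame({
    "signalid": [5135, 5135],
    "monthyear": [201901, 201901],
    "readings": [
        '[{"v":"90","q":1,"t":1546444800000,"g":0}]',
        '[{"v":"30","q":1,"t":1546446800000,"g":0},{"v":"35","q":1,"t":1546446900000,"g":0}]',
    ],
})

# Build one flat dict per JSON object, repeating the row's identifying columns
rows = [
    {"signalid": sid, "monthyear": my, **reading}
    for sid, my, raw in zip(data["signalid"], data["monthyear"], data["readings"])
    for reading in json.loads(raw)
]
result = pd.DataFrame(rows)
print(result)
```

A row whose list holds two objects therefore yields two output rows.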

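Maybe there is a simpler solution, but this one should work. The idea is to convert each entry of the readings column into a data frame whose index is given by signalid and monthyear.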

This function converts a dictionary into a data frame and sets the index:

def to_df(index, x):
    return pd.DataFrame.from_dict(x, orient='index').T.set_index(pd.MultiIndex.from_arrays(index, names=['signalid', 'monthyear']))

The next function triggers the calls to to_df:

def concat(y):
    return pd.concat(map(lambda x: to_df([[y.signalid], [y.monthyear]], x), y.readings))

We now apply concat to each row. The result is a Series of data frames, which we can concatenate to obtain the final data:

s = df.apply(concat, axis=1)
pd.concat(s.tolist())
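
The two functions above expect every entry of readings to be a list of dicts already, so string data has to be parsed first. Here is an end-to-end sketch under that assumption, using a shortened copy of the sample data. As a substitution of my own, it iterates with itertuples instead of df.apply, and it calls reset_index to turn the MultiIndex back into ordinary columns; frames and result are illustrative names:

```python
import json
import pandas as pd

# Shortened copy of the sample data (assumption: readings holds JSON strings)
df = pd.DataFrame({
    "signalid": [5135, 5135],
    "monthyear": [201901, 201901],
    "readings": [
        '[{"v":"90","q":1,"t":1546444800000,"g":0}]',
        '[{"v":"30","q":1,"t":1546446800000,"g":0},{"v":"35","q":1,"t":1546446900000,"g":0}]',
    ],
})
df["readings"] = df["readings"].map(json.loads)

def to_df(index, x):
    # One dict becomes a one-row frame indexed by (signalid, monthyear)
    return pd.DataFrame.from_dict(x, orient="index").T.set_index(
        pd.MultiIndex.from_arrays(index, names=["signalid", "monthyear"])
    )

def concat(y):
    # Stack the one-row frames for every dict in this row's readings
    return pd.concat(map(lambda x: to_df([[y.signalid], [y.monthyear]], x), y.readings))

# itertuples yields namedtuples, so y.signalid etc. work as in the original
frames = [concat(row) for row in df.itertuples(index=False)]
result = pd.concat(frames).reset_index()
print(result)
```

The value columns end up with object dtype because each one-row frame is built from mixed types.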
