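I am trying to split a column that contains arrays of lists into multiple columns, creating multiple rows where needed. Here is the sample data: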
signalid monthyear readings
5135 201901 [{"v":"90","q":1,"t":1546444800000,"g":0}]
5135 201901 [{"v":"50","q":1,"t":1546444900000,"g":0}]
5135 201901 [{"v":"40","q":1,"t":1546445800000,"g":0}]
5135 201901 [{"v":"30","q":1,"t":1546446800000,"g":0},{"v":"35","q":1,"t":1546446900000,"g":0}]
Details about the data:
The data is fetched from a Cassandra database.
The column data type in Cassandra is text.
dataframe.readings[0]
'[{"v":"9817","q":1,"t":1548979150085,"g":0},{"v":"9821","q":1,"t":1548979151475,"g":0}]'
dataframe.readings.dtype
dtype('O')
Normalization code:
normalizeddataframe = json_normalize(data=dataframe, record_path='readings',
meta=["signalid", "monthyear"])
Expected output:
signalid monthyear v q t g
5135 201901 90 1 1546444800000 0
5135 201901 50 1 1546444900000 0
5135 201901 40 1 1546445800000 0
5135 201901 30 1 1546446800000 0
5135 201901 35 1 1546446900000 0
Result:
TypeError Traceback (most recent call last)
<ipython-input-13-486775fb3431> in <module>
60 #normalizeddataframe = pandas.read_json(dataframe.readings)
61 normalizeddataframe = json_normalize(data=dataframe, record_path='readings',
---> 62 meta=["signalid", "monthyear", "fromtime", "totime", "avg", "insertdate", "max", "min"])
63 #normalizeddataframe = json_normalize(pandas.Series.to_json(dataframe.readings))
64 print("Processig completed");
C:\ProgramData\Anaconda3\lib\site-packages\pandas\io\json\normalize.py in json_normalize(data, record_path, meta, meta_prefix, record_prefix, errors, sep)
260 records.extend(recs)
261
--> 262 _recursive_extract(data, record_path, {}, level=0)
263
264 result = DataFrame(records)
C:\ProgramData\Anaconda3\lib\site-packages\pandas\io\json\normalize.py in _recursive_extract(data, path, seen_meta, level)
236 else:
237 for obj in data:
--> 238 recs = _pull_field(obj, path[0])
239
240 # For repeating the metadata later
C:\ProgramData\Anaconda3\lib\site-packages\pandas\io\json\normalize.py in _pull_field(js, spec)
183 result = result[field]
184 else:
--> 185 result = result[spec]
186
187 return result
TypeError: string indices must be integers
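This question is a few years old, but I thought I would add my answer in case anyone else runs into the same problem. It focuses on preserving the structure of the input and output specified by the author. I am sure there is some room for optimization.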
import pandas as pd
import json
signalid = [5135, 5135, 5135, 5135]
monthyear = [201901, 201901, 201901, 201901]
readings = ['[{"v":"90","q":1,"t":1546444800000,"g":0}]', '[{"v":"50","q":1,"t":1546444900000,"g":0}]', '[{"v":"40","q":1,"t":1546445800000,"g":0}]', '[{"v":"30","q":1,"t":1546446800000,"g":0},{"v":"35","q":1,"t":1546446900000,"g":0}]']
# Reconstruct sample DataFrame
df = pd.DataFrame({
"signalid": signalid,
"monthyear": monthyear,
"readings": readings
})
# Convert strings to JSON objects
df['readings'] = df['readings'].map(json.loads)
# Can't use nested lists of JSON objects in pd.json_normalize
df = df.explode(column='readings').reset_index(drop=True)
# pd.json_normalize expects a list of JSON objects not a DataFrame
df = pd.concat([df[['signalid', 'monthyear']], pd.json_normalize(df['readings'])], axis=1)
print(df)
Output:
signalid monthyear v q t g
0 5135 201901 90 1 1546444800000 0
1 5135 201901 50 1 1546444900000 0
2 5135 201901 40 1 1546445800000 0
3 5135 201901 30 1 1546446800000 0
4 5135 201901 35 1 1546446900000 0
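As far as I can tell, the TypeError in the question comes from passing the DataFrame itself to json_normalize: iterating over a DataFrame yields its column labels, so the function ends up indexing a string with 'readings'. A minimal sketch of a variant that keeps the record_path/meta call from the question is below. It assumes pandas 1.0 or later (for pd.json_normalize), and the names records and flat are only illustrative.

```python
import json
import pandas as pd

df = pd.DataFrame({
    "signalid": [5135, 5135],
    "monthyear": [201901, 201901],
    "readings": [
        '[{"v":"40","q":1,"t":1546445800000,"g":0}]',
        '[{"v":"30","q":1,"t":1546446800000,"g":0},{"v":"35","q":1,"t":1546446900000,"g":0}]',
    ],
})

# Parse the JSON text, then pass json_normalize a list of dicts rather than a DataFrame
records = df.assign(readings=df["readings"].map(json.loads)).to_dict(orient="records")

# record_path selects the nested list; meta repeats the parent fields on every row
flat = pd.json_normalize(records, record_path="readings", meta=["signalid", "monthyear"])

# Reorder the columns to match the layout requested in the question
flat = flat[["signalid", "monthyear", "v", "q", "t", "g"]]
print(flat)
```

Note that the meta columns come back with object dtype, so a cast may be needed if numeric types matter downstream.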
Suppose we have a DataFrame with a column named readings whose values are
[{"v":"90","q":1,"t":1546444800000,"g":0}]
[{"v":"50","q":1,"t":1546444900000,"g":0}]
[{"v":"40","q":1,"t":1546445800000,"g":0}]
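# Assumes each entry of data['readings'] is already a parsed list of dicts
# (e.g. via json.loads) that holds a single dict.
for idx, row in data['readings'].items():
    for value_dict in row:
        for key in value_dict:
            # Write to this row only; data[key] = ... would overwrite the whole column
            data.loc[idx, key] = value_dict[key]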
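You did not mention the expected output when a list contains more than one JSON object. I think this may help you. If you describe the desired output, I can adapt my code to it.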
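For lists that hold more than one object, one plausible reading of the question is that each object should become its own row. A minimal sketch of that interpretation, building the rows in plain Python before constructing the DataFrame, might look like this. The names rows and flat are illustrative, and the sketch assumes the readings column still holds JSON text.

```python
import json
import pandas as pd

data = pd.DataFrame({
    "signalid": [5135, 5135],
    "monthyear": [201901, 201901],
    "readings": [
        '[{"v":"40","q":1,"t":1546445800000,"g":0}]',
        '[{"v":"30","q":1,"t":1546446800000,"g":0},{"v":"35","q":1,"t":1546446900000,"g":0}]',
    ],
})

# One output row per JSON object; the parent fields are repeated for each object
rows = [
    {"signalid": sid, "monthyear": my, **reading}
    for sid, my, text in zip(data["signalid"], data["monthyear"], data["readings"])
    for reading in json.loads(text)
]

flat = pd.DataFrame(rows)
print(flat)
```

Building a list of dicts first avoids growing the DataFrame cell by cell, which tends to be slow on larger inputs.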
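There may be a simpler solution, but this one should work. The idea is to convert each entry of the readings column into a DataFrame whose index is given by signalid and monthyear. This assumes the entries of readings have already been parsed into lists of dicts, for example with json.loads.
This function converts a dictionary into a DataFrame and sets the index: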
def to_df(index, x):
    return pd.DataFrame.from_dict(x, orient='index').T.set_index(pd.MultiIndex.from_arrays(index, names=['signalid', 'monthyear']))
The next function triggers the calls to to_df:
def concat(y):
    return pd.concat(map(lambda x: to_df([[y.signalid], [y.monthyear]], x), y.readings))
We now apply concat to each row. The result is a Series of DataFrames, which we can concatenate to obtain the final data:
s = df.apply(concat, axis=1)
pd.concat(s.tolist())
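The concatenated result carries signalid and monthyear as a MultiIndex rather than as ordinary columns. To get the flat layout from the question, reset_index() can be chained on. The sketch below is self-contained and, as an assumption made only to keep it short, replaces df.apply with a plain loop over itertuples. It also assumes the readings have already been parsed into lists of dicts.

```python
import pandas as pd

def to_df(index, x):
    # One-row DataFrame for a single reading, indexed by (signalid, monthyear)
    frame = pd.DataFrame.from_dict(x, orient="index").T
    return frame.set_index(pd.MultiIndex.from_arrays(index, names=["signalid", "monthyear"]))

df = pd.DataFrame({
    "signalid": [5135, 5135],
    "monthyear": [201901, 201901],
    "readings": [
        [{"v": "40", "q": 1, "t": 1546445800000, "g": 0}],
        [{"v": "30", "q": 1, "t": 1546446800000, "g": 0},
         {"v": "35", "q": 1, "t": 1546446900000, "g": 0}],
    ],
})

# Build one small frame per reading, then stack them
frames = [
    to_df([[row.signalid], [row.monthyear]], reading)
    for row in df.itertuples(index=False)
    for reading in row.readings
]

# reset_index moves signalid and monthyear from the MultiIndex back into columns
result = pd.concat(frames).reset_index()
print(result)
```

Because each one-row frame is built from mixed values, the v, q, t and g columns end up with object dtype and may need an explicit cast.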