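I have a DataFrame in which one column holds JSON strings. I want to keep a single value from each JSON object as the column's value and drop the rest. I have already collected the values I need, but I don't know how to join them back to the DataFrame so that they replace the existing column: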
import json
import pandas as pd

df = pd.DataFrame({
    'bank_account': [101, 102, 201, 301],
    'data': [
        '{"uid": 100, "account_type": 1, "account_data": {"currency": {"current": 1000, "minimum": -500}, "fees": {"monthly": 13.5}}, "user_name": "Alice"}',
        '{"uid": 100, "account_type": 2, "account_data": {"currency": {"current": 2000, "minimum": 0}, "fees": {"monthly": 0}}, "user_name": "Alice"}',
        '{"uid": 200, "account_type": 1, "account_data": {"currency": {"current": 3000, "minimum": 0}, "fees": {"monthly": 13.5}}, "user_name": "Bob"}',
        '{"uid": 300, "account_type": 1, "account_data": {"currency": {"current": 4000, "minimum": 0}, "fees": {"monthly": 13.5}}, "user_name": "Carol"}'
    ]},
    index=['Alice', 'Alice', 'Bob', 'Carol']
)
lst = []
for d in df['data']:
    d = pd.read_json(d, lines=True)['uid'].values[0]
    lst.append(d)
s = pd.DataFrame(lst)
df['data'] = s
print(s)
print(df)
This returns:
     0
0  100
1  100
2  200
3  300
       bank_account  data
Alice           101   NaN
Alice           102   NaN
Bob             201   NaN
Carol           301   NaN
Right now I don't understand why the data column shows only NaN values. Any help is appreciated.
Updated question: some rows contain a JSON array (a list of objects) rather than a single object. Here is what I have so far:
import json
import pandas as pd

df = pd.DataFrame({
    'bank_account': [101, 102, 201, 301],
    'data': [
        '[{"uid": 100, "account_type": 1, "account_data": {"currency": {"current": 1000, "minimum": -500}, "fees": {"monthly": 13.5}}, "user_name": "Alice"},{"uid": 150, "account_type": 1, "account_data": {"currency": {"current": 1000, "minimum": -500}, "fees": {"monthly": 13.5}}, "user_name": "jer"}]',
        '{"uid": 100, "account_type": 2, "account_data": {"currency": {"current": 2000, "minimum": 0}, "fees": {"monthly": 0}}, "user_name": "Alice"}',
        '{"uid": 200, "account_type": 1, "account_data": {"currency": {"current": 3000, "minimum": 0}, "fees": {"monthly": 13.5}}, "user_name": "Bob"}',
        '{"uid": 300, "account_type": 1, "account_data": {"currency": {"current": 4000, "minimum": 0}, "fees": {"monthly": 13.5}}, "user_name": "Carol"}'
    ]},
    index=['Alice', 'Alice', 'Bob', 'Carol']
)
# df["data"] = df["data"].apply(lambda x: pd.read_json(x, lines=True)["uid"][0])
df["data"] = df["data"].apply(lambda array : (",".join(list(map(lambda x : pd.read_json(x, lines=True)["uid"][0], array),(df['data'])))))
print(df)
This works for me:
df = pd.DataFrame({
    'bank_account': [101, 102, 201, 301],
    'data': [
        '{"uid": 100, "account_type": 1, "account_data": {"currency": {"current": 1000, "minimum": -500}, "fees": {"monthly": 13.5}}, "user_name": "Alice"}',
        '{"uid": 100, "account_type": 2, "account_data": {"currency": {"current": 2000, "minimum": 0}, "fees": {"monthly": 0}}, "user_name": "Alice"}',
        '{"uid": 200, "account_type": 1, "account_data": {"currency": {"current": 3000, "minimum": 0}, "fees": {"monthly": 13.5}}, "user_name": "Bob"}',
        '{"uid": 300, "account_type": 1, "account_data": {"currency": {"current": 4000, "minimum": 0}, "fees": {"monthly": 13.5}}, "user_name": "Carol"}'
    ]},
    index=['Alice', 'Alice', 'Bob', 'Carol']
)
df["data"] = df["data"].apply(lambda x: pd.read_json(x, lines=True)["uid"][0])
Your code doesn't work because df and s have different indexes. To fix your code, set df['data'] = s[0].values (instead of df['data'] = s) before the two print statements.
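A side note on the apply approach above: as far as I know, recent pandas releases (2.1 and later) emit a deprecation warning when a literal JSON string is passed to read_json. A minimal sketch that wraps each string in io.StringIO, which should also work on older versions; the two-row frame and its shortened JSON are made up for illustration:

```python
from io import StringIO

import pandas as pd

# Small illustrative frame; the JSON strings are shortened stand-ins.
df = pd.DataFrame(
    {
        "bank_account": [101, 201],
        "data": [
            '{"uid": 100, "user_name": "Alice"}',
            '{"uid": 200, "user_name": "Bob"}',
        ],
    },
    index=["Alice", "Bob"],
)

# Wrap each string in StringIO so read_json treats it as a file-like object,
# then take the uid of the single parsed record.
df["data"] = df["data"].apply(
    lambda x: pd.read_json(StringIO(x), lines=True)["uid"][0]
)
print(df)
```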
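As @rachwa pointed out, the problem is that the indexes don't match: s has a numeric index, while df is indexed by name. If you assign lst directly instead of casting it to a DataFrame, you get the desired result, i.e.
df['data'] = lst
works as expected.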
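To make the index mismatch concrete, here is a minimal sketch on a made-up three-row frame (the names aligned and positional are only for illustration). Assigning a Series aligns on index labels, while assigning a plain array or list goes by position:

```python
import pandas as pd

df = pd.DataFrame(
    {"bank_account": [101, 201, 301]},
    index=["Alice", "Bob", "Carol"],
)
s = pd.Series([100, 200, 300])  # default integer index 0, 1, 2

# Label-based alignment: no label of s matches df's index, so every cell is NaN.
aligned = df.copy()
aligned["data"] = s

# A NumPy array (or a plain list) carries no index, so it is assigned by position.
positional = df.copy()
positional["data"] = s.to_numpy()

print(aligned)
print(positional)
```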
You can also use json.loads instead of read_json (it should be faster):
import json
df['data'] = [json.loads(d)['uid'] for d in df['data']]
Output:
       bank_account  data
Alice           101   100
Alice           102   100
Bob             201   200
Carol           301   300
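For the updated question, where some rows hold a JSON array instead of a single object, the same json.loads idea can probably be extended. The sketch below assumes you want all uid values of a row joined with commas, which is my reading of the attempted code and not something the question states outright; extract_uids is a hypothetical helper name, and the JSON is shortened for illustration:

```python
import json

import pandas as pd


def extract_uids(raw):
    """Return the uid values found in a JSON object or array, comma-joined."""
    parsed = json.loads(raw)
    # Treat a single object as a one-element list so both shapes share one path.
    records = parsed if isinstance(parsed, list) else [parsed]
    return ",".join(str(record["uid"]) for record in records)


df = pd.DataFrame(
    {
        "bank_account": [101, 102],
        "data": [
            '[{"uid": 100, "user_name": "Alice"}, {"uid": 150, "user_name": "jer"}]',
            '{"uid": 100, "user_name": "Alice"}',
        ],
    },
    index=["Alice", "Alice"],
)

df["data"] = df["data"].apply(extract_uids)
print(df)
```

Note that the result is a string column; if you would rather keep a list of integers per row, return the list comprehension without the join.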