Why does pandas.read_json change the values of long integers?



I don't understand why id_1 and id_2 are changed when printed.

I have a JSON file named test_data.json:

{
    "objects": {
        "value": {
            "1298543947669573634": {
                "timestamp": "Wed Aug 26 08:52:57 +0000 2020",
                "id_1": "1298543947669573634",
                "id_2": "1298519559306190850"
            }
        }
    }
}

Output:

python test_data.py 
id_1                 id_2                 timestamp
0  1298543947669573632  1298519559306190848 2020-08-26 08:52:57+00:00

My code, in test_data.py:

import pandas as pd
import json

file = "test_data.json"
with open(file, "r") as f:
    all_data = json.loads(f.read())

data = pd.read_json(json.dumps(all_data['objects']['value']), orient='index')
data = data.reset_index(drop=True)
print(data.head())

How can I fix this so that the numeric values are interpreted correctly?

  • Using Python 3.8.5 and pandas 1.1.1

Current implementation

  • First, the code reads the file and converts it from str to dict with json.loads:
with open(file, "r") as f:
    all_data = json.loads(f.read())
  • Then 'value' is converted back to a str:
json.dumps(all_data['objects']['value'])
  • With orient='index', the keys become the column headers and the values fill the rows.
    • At this point the data is also converted to int, and the values change.
    • I suspect a floating-point conversion issue in this step.
      • Pandas issue #20608: read_json reads large integers-as-strings incorrectly if dtype is not explicitly mentioned.
pd.read_json(json.dumps(all_data['objects']['value']), orient='index')
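The suspected float round-trip can be reproduced without pandas at all. A value near 1.3e18 needs 61 bits, but a double's mantissa only holds 53, so the low bits are rounded away:

```python
s = "1298543947669573634"

# Python ints are arbitrary precision, so a direct parse is exact
print(int(s))         # 1298543947669573634

# What read_json effectively does: str -> float -> int;
# the double can't hold all 61 bits, so the low bits round away
print(int(float(s)))  # 1298543947669573632
```

This matches the output above exactly: 1298543947669573634 becomes 1298543947669573632.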

Updated code

Option 1

  • Use pandas.DataFrame.from_dict and then convert the columns to numeric.
file = "test_data.json"
with open(file, "r") as f:
    all_data = json.loads(f.read())

# use .from_dict
data = pd.DataFrame.from_dict(all_data['objects']['value'], orient='index')

# convert columns to numeric
data[['id_1', 'id_2']] = data[['id_1', 'id_2']].apply(pd.to_numeric, errors='coerce')
data = data.reset_index(drop=True)

# display(data)
                        timestamp                 id_1                 id_2
0  Wed Aug 26 08:52:57 +0000 2020  1298543947669573634  1298519559306190850

print(data.info())
[out]:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1 entries, 0 to 0
Data columns (total 3 columns):
 #   Column     Non-Null Count  Dtype 
---  ------     --------------  ----- 
 0   timestamp  1 non-null      object
 1   id_1       1 non-null      int64 
 2   id_2       1 non-null      int64 
dtypes: int64(2), object(1)
memory usage: 152.0+ bytes
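One difference from the read_json output: from_dict leaves timestamp as an object column. If you also want it parsed into a real datetime, pd.to_datetime with an explicit format works; a small sketch, using a hypothetical single-row frame with the same timestamp layout as above:

```python
import pandas as pd

# Hypothetical single-row frame mirroring the timestamp format above
data = pd.DataFrame({'timestamp': ["Wed Aug 26 08:52:57 +0000 2020"]})

# Parse the 'Wed Aug 26 08:52:57 +0000 2020' layout into a tz-aware datetime
data['timestamp'] = pd.to_datetime(data['timestamp'],
                                   format='%a %b %d %H:%M:%S %z %Y')
print(data['timestamp'].dtype)  # datetime64[ns, UTC]
```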

Option 2

  • Use pandas.json_normalize and then convert the columns to numeric.
file = "test_data.json"
with open(file, "r") as f:
    all_data = json.loads(f.read())

# read all_data into a dataframe
df = pd.json_normalize(all_data['objects']['value'])

# rename the columns
df.columns = [x.split('.')[1] for x in df.columns]

# convert to numeric
df[['id_1', 'id_2']] = df[['id_1', 'id_2']].apply(pd.to_numeric, errors='coerce')

# display(df)
                        timestamp                 id_1                 id_2
0  Wed Aug 26 08:52:57 +0000 2020  1298543947669573634  1298519559306190850

print(df.info())
[out]:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1 entries, 0 to 0
Data columns (total 3 columns):
 #   Column     Non-Null Count  Dtype 
---  ------     --------------  ----- 
 0   timestamp  1 non-null      object
 1   id_1       1 non-null      int64 
 2   id_2       1 non-null      int64 
dtypes: int64(2), object(1)
memory usage: 152.0+ bytes
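For context on the rename step: json_normalize flattens the nested dict into dotted column names (outer key + '.' + field name), which is why the list comprehension above splits on '.'. A small sketch with hypothetical inline data shaped like all_data['objects']['value']:

```python
import pandas as pd

# Hypothetical nested dict mirroring all_data['objects']['value']
nested = {"1298543947669573634": {"timestamp": "Wed Aug 26 08:52:57 +0000 2020",
                                  "id_1": "1298543947669573634"}}

df = pd.json_normalize(nested)
print(list(df.columns))
# ['1298543947669573634.timestamp', '1298543947669573634.id_1']

# Keep only the field name after the dot, as in Option 2
df.columns = [c.split('.')[1] for c in df.columns]
print(list(df.columns))  # ['timestamp', 'id_1']
```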

This is caused by issue 20608, and it still happens in the current pandas release, 1.2.4.

Here is my workaround, which on my data is even slightly faster than read_json:

import pathlib
import pandas as pd

def broken_load_json(path):
    """There's an open issue: https://github.com/pandas-dev/pandas/issues/20608
    about read_json loading large integers incorrectly because it converts
    from string to float to int, losing precision."""
    df = pd.read_json(pathlib.Path(path), orient='index')
    return df

def orjson_load_json(path):
    import orjson  # The builtin json module would also work
    with open(path) as f:
        d = orjson.loads(f.read())
    df = pd.DataFrame.from_dict(d, orient='index')  # Builds the index from the dict's keys as strings, sadly
    # Fix the dtype of the index
    df = df.reset_index()
    df['index'] = df['index'].astype('int64')
    df = df.set_index('index')
    return df
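A quick usage sketch of the workaround's core idea, using the builtin json module in place of orjson (which may not be installed); the sample file and its contents here are hypothetical:

```python
import json
import pathlib
import tempfile

import pandas as pd

# Hypothetical sample shaped like the data the workaround expects:
# the top-level keys are the large record IDs.
sample = {"1298543947669573634": {"id_1": "1298543947669573634",
                                  "id_2": "1298519559306190850"}}

with tempfile.TemporaryDirectory() as tmp:
    path = pathlib.Path(tmp) / "sample.json"
    path.write_text(json.dumps(sample))

    # Same steps as orjson_load_json, with the builtin json module:
    # parse exactly, build the frame, then fix the index dtype explicitly.
    # No str -> float round-trip is involved, so no precision is lost.
    with open(path) as f:
        d = json.loads(f.read())
    df = pd.DataFrame.from_dict(d, orient='index')
    df = df.reset_index()
    df['index'] = df['index'].astype('int64')
    df = df.set_index('index')

print(df.index[0])  # 1298543947669573634 -- exact
```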

Note that my answer preserves the exact ID values, which is what matters for my use case.
