Python and Pandas versions change how numbers are interpreted after reading a DataFrame

I have two environments:

Environment #1 (old):

Python 3.7.5
Pandas 0.23.4

Environment #2 (new):

Python 3.8.10
Pandas 1.3.4

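When I load the same CSV file in both environments with pd.read_csv('name_of_my_csv_file.csv', delimiter=';', dtype=str), I notice that Python or Pandas misinterprets some of the numbers (not all of them, roughly 12 out of 50,000 rows). In Environment #1 (old), a misinterpreted number looks like 7546.168415200001, whereas the number in the Excel file is actually 7546.1684152. Environment #2 (new) interprets the number correctly, i.e. as 7546.1684152.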

>>> amount_old
7546.168415200001
>>>
>>> amount_new
7546.1684152
>>>
>>> # Types of both numbers from DataFrame
>>> type(amount_old)
<class 'numpy.float64'>
>>>
>>> type(amount_new)
<class 'numpy.float64'>
>>>

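Based on this, I have two questions:

What causes this difference?
How can I make sure that in Environment #2 (new) I get the same numbers as in Environment #1 (old)?

The reason I need the values from Environment #2 (new) to match those from Environment #1 (old) is that I have a test that compares hashes of DataFrames, and it fails because of these differing numbers. In both cases the hash is created with pd.util.hash_pandas_object(my_dataframe_from_excel). The test then compares the hashes and fails, because even the smallest change in a number produces a different hash.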
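To illustrate why the hash-based test is so sensitive, here is a minimal sketch using the two values from the question. Rounding before hashing is shown only as one possible mitigation; the DECIMALS constant is an assumption for illustration and is not something the original test used.

```python
import pandas as pd

# The two values from the question: old-environment result vs. new-environment result.
df_old = pd.DataFrame({"amount": [7546.168415200001]})
df_new = pd.DataFrame({"amount": [7546.1684152]})

# The raw hashes differ because the underlying float64 bit patterns differ.
raw_equal = pd.util.hash_pandas_object(df_old).equals(
    pd.util.hash_pandas_object(df_new)
)

# Hypothetical mitigation: round to a fixed number of decimals before hashing.
DECIMALS = 6  # assumed tolerance; choose what the data actually requires
rounded_equal = pd.util.hash_pandas_object(df_old.round(DECIMALS)).equals(
    pd.util.hash_pandas_object(df_new.round(DECIMALS))
)

print(raw_equal, rounded_equal)
```

Rounding is not a complete fix: two values that straddle a rounding boundary can still round to different results, so a tolerance-based comparison is usually more robust than comparing hashes of floats.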

Edit: I am using pd.read_csv(), not pd.read_excel().
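The thread does not pin down the exact cause. One plausible contributor is the float parser used by read_csv: starting with pandas 1.2, the C engine's default float_precision switched to the more accurate "high" parser, while older versions defaulted to a faster parser that could be off in the last bit. Note that with dtype=str no float parsing happens inside read_csv at all, so if the floats are produced by a later conversion step this may not apply. The sketch below assumes a hypothetical one-column CSV and requests the round-trip parser explicitly, which is available in both pandas versions mentioned.

```python
import io

import pandas as pd

# Hypothetical one-column CSV standing in for the real file.
csv_text = "amount\n7546.1684152\n"

# 'round_trip' asks the C parser to use Python's own string-to-float conversion.
df = pd.read_csv(
    io.StringIO(csv_text),
    delimiter=";",
    float_precision="round_trip",
)
parsed = float(df["amount"].iloc[0])

print(repr(parsed))
```

Setting float_precision explicitly in both environments removes the dependence on each version's default parser, at some cost in parsing speed.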

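To make sure the DataFrame is loaded without any interpretation, use dtype=object in both environments.

From the read_excel (Pandas 0.23.4) documentation on dtype:

Use object to preserve data as stored in Excel and not interpret dtype.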
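As a minimal sketch of loading without interpretation, the snippet below reads a small hypothetical semicolon-delimited CSV (the column names and rows are made up for illustration) with dtype=str, so every cell stays as the text that was in the file.

```python
import io

import pandas as pd

# Hypothetical sample data; the real file and its columns are not shown in the thread.
csv_text = "id;amount\n1;7546.1684152\n2;0.1\n"

# dtype=str keeps every cell as text, so no float parsing takes place.
df = pd.read_csv(io.StringIO(csv_text), delimiter=";", dtype=str)

first_amount = df["amount"].iloc[0]
print(type(first_amount).__name__, first_amount)

# Hashing the string data means the hash input no longer depends on float parsing.
hashes = pd.util.hash_pandas_object(df)
print(len(hashes))
```

If the values are later converted to floats (for example with astype(float)), the conversion step can reintroduce small differences, so the comparison should happen on the strings or with a tolerance.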

To compare the values, use np.isclose:

>>> import numpy as np
>>> amount_old = 7546.168415200001
>>> amount_new = 7546.1684152
>>> amount_old == amount_new
False
>>> np.isclose(amount_old, amount_new)
True

With DataFrames:

>>> import pandas as pd
>>> df_old = pd.DataFrame({'amount': [7546.168415200001]})
>>> df_new = pd.DataFrame({'amount': [7546.1684152]})
>>> df_old['amount'] == df_new['amount']
0    False
Name: amount, dtype: bool
>>> np.isclose(df_old['amount'], df_new['amount'])
array([ True])
# Or without np.isclose
>>> df_old['amount'].sub(df_new['amount']).abs() <= 1e-6
0    True
Name: amount, dtype: bool
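For a test suite, the same tolerance idea can be expressed with pandas' own testing helper instead of hashes. This is a sketch of an alternative, not what the original test did; it relies on assert_frame_equal comparing floats approximately by default.

```python
import pandas as pd

df_old = pd.DataFrame({"amount": [7546.168415200001]})
df_new = pd.DataFrame({"amount": [7546.1684152]})

# The default comparison is approximate for floats, so this passes silently.
pd.testing.assert_frame_equal(df_old, df_new)

# An exact comparison detects the last-digit difference and raises.
try:
    pd.testing.assert_frame_equal(df_old, df_new, check_exact=True)
    exact_match = True
except AssertionError:
    exact_match = False

print(exact_match)
```

The approximate comparison keeps the test meaningful for real data changes while ignoring differences in the last few bits of a float.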

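So, in the end, the problem appears to be the float representation in Environment #1 (old), where the number was misinterpreted as described in my question above. The value in Environment #2 (new) is actually the correct one. This means we need to adjust the test to match the output of the new environment rather than that of the old one.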
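To see how small the discrepancy is, the following standard-library sketch prints the exact binary values behind both numbers and measures their distance in units of the float64 spacing at that magnitude. The spacing is derived by hand from math.frexp because math.ulp is only available from Python 3.9.

```python
import math
from decimal import Decimal

amount_old = 7546.168415200001
amount_new = 7546.1684152

# Decimal(float) shows the exact binary value each float64 actually stores.
print(Decimal(amount_old))
print(Decimal(amount_new))

# Spacing between neighbouring float64 values at this magnitude (one ULP).
_, exponent = math.frexp(amount_new)
ulp = 2.0 ** (exponent - 53)

# Number of representable steps separating the two values.
steps = (amount_old - amount_new) / ulp
print(steps)

# Python's own float() maps the decimal text to amount_new.
python_parse_matches_new = float("7546.1684152") == amount_new
print(python_parse_matches_new)
```

The two values are only a step or two apart in the last place, which is consistent with the old environment having rounded the parsed text slightly differently rather than having read different data.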

Thanks to everyone for the help.
