What changed between pandas 1.1.5 and 1.3.4 that alters the set_index/reset_index behavior?



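We have some code that ran fine until someone on our team upgraded pandas from 1.1.5 to 1.3.4. Below is a simplified version of the code that triggers the problem. Ideally, I'd like to know how to change the set_index and/or reset_index calls so that they work under both 1.1.5 and 1.3.4.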

pandas 1.1.5:

>>> import pandas
>>> from pandas import Timestamp
>>> df = pandas.DataFrame({'label': {1000: 'apple',
1001: 'carrot',
1002: 'carrot',
1003: 'apple',
1004: 'apple',
1005: 'carrot'},
'date': {1000: Timestamp('2021-10-27 00:00:00'),
1001: Timestamp('2021-10-27 00:00:00'),
1002: Timestamp('2021-10-28 00:00:00'),
1003: Timestamp('2021-10-28 00:00:00'),
1004: Timestamp('2021-10-29 00:00:00'),
1005: Timestamp('2021-10-29 00:00:00')},
'stock': {1000: 100,
1001: 150,
1002: 75,
1003: 50,
1004: 200,
1005: 20}})
>>> df_rolling = df.set_index(['label', 'date']).groupby(level='label').rolling(window=7, min_periods=1).sum()
>>> df_rolling
        stock
label        
apple   100.0
apple   150.0
apple   350.0
carrot  150.0
carrot  225.0
carrot  245.0
>>> df_rolling.index
MultiIndex([( 'apple',),
            ( 'apple',),
            ( 'apple',),
            ('carrot',),
            ('carrot',),
            ('carrot',)],
           names=['label'])
>>> df_rolling = df_rolling.reset_index()
>>> df_rolling.index
RangeIndex(start=0, stop=6, step=1)

pandas 1.3.4:

>>> import pandas
>>> from pandas import Timestamp
>>> df = pandas.DataFrame({'label': {1000: 'apple',
1001: 'carrot',
1002: 'carrot',
1003: 'apple',
1004: 'apple',
1005: 'carrot'},
'date': {1000: Timestamp('2021-10-27 00:00:00'),
1001: Timestamp('2021-10-27 00:00:00'),
1002: Timestamp('2021-10-28 00:00:00'),
1003: Timestamp('2021-10-28 00:00:00'),
1004: Timestamp('2021-10-29 00:00:00'),
1005: Timestamp('2021-10-29 00:00:00')},
'stock': {1000: 100,
1001: 150,
1002: 75,
1003: 50,
1004: 200,
1005: 20}})
>>> df_rolling = df.set_index(['label', 'date']).groupby(level='label').rolling(window=7, min_periods=1).sum()
>>> df_rolling
                          stock
label  label  date             
apple  apple  2021-10-27  100.0
              2021-10-28  150.0
              2021-10-29  350.0
carrot carrot 2021-10-27  150.0
              2021-10-28  225.0
              2021-10-29  245.0
>>> df_rolling.index
MultiIndex([( 'apple',  'apple', '2021-10-27'),
            ( 'apple',  'apple', '2021-10-28'),
            ( 'apple',  'apple', '2021-10-29'),
            ('carrot', 'carrot', '2021-10-27'),
            ('carrot', 'carrot', '2021-10-28'),
            ('carrot', 'carrot', '2021-10-29')],
           names=['label', 'label', 'date'])
>>> df_rolling = df_rolling.reset_index()
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-53-8b81c1e32ea2> in <module>
----> 1 df_rolling = df_rolling.reset_index()
/usr/local/lib/python3.8/dist-packages/pandas/util/_decorators.py in wrapper(*args, **kwargs)
309                     stacklevel=stacklevel,
310                 )
--> 311             return func(*args, **kwargs)
312 
313         return wrapper
/usr/local/lib/python3.8/dist-packages/pandas/core/frame.py in reset_index(self, level, drop, inplace, col_level, col_fill)
5797                     )
5798 
-> 5799                 new_obj.insert(0, name, level_values)
5800 
5801         new_obj.index = new_index
/usr/local/lib/python3.8/dist-packages/pandas/core/frame.py in insert(self, loc, column, value, allow_duplicates)
4412         if not allow_duplicates and column in self.columns:
4413             # Should this be a different kind of error??
-> 4414             raise ValueError(f"cannot insert {column}, already exists")
4415         if not isinstance(loc, int):
4416             raise TypeError("loc must be int")
ValueError: cannot insert label, already exists

I'm not sure whether this solves your problem, since I can't test it on both versions, but try using the inplace=True parameter, like this:

df_rolling.reset_index(inplace=True)

That way there is no need to assign the whole DF back to the same variable.
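To make the difference between the two call styles concrete, here is a minimal sketch on a small made-up frame (the names `frame`, `flat` and `result` are illustrative, not from the question). It only shows how reset_index behaves with and without inplace=True; it does not by itself address the duplicated index level seen under 1.3.4.

```python
import pandas as pd

# Small illustrative frame with a named index (hypothetical data).
frame = pd.DataFrame(
    {"stock": [100, 150]},
    index=pd.Index(["apple", "carrot"], name="label"),
)

# Without inplace, reset_index() returns a new DataFrame and leaves
# `frame` itself untouched.
flat = frame.reset_index()

# With inplace=True, the call returns None and modifies `frame` directly.
result = frame.reset_index(inplace=True)

print(flat.columns.tolist())
print(frame.columns.tolist())
print(result)
```

Either style should end up with the same columns; the only difference is whether you rebind the variable yourself.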

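Note that in Python the equals sign = only assigns from right to left: the name on the left is bound to the object on the right, and the right-hand side is not modified. So if you write

a = b 

a now refers to the value of b, and b itself is unchanged. Coming back to your problem, the reassignment is therefore unlikely to be the cause. The traceback shows the ValueError being raised inside reset_index itself, at new_obj.insert(0, name, level_values), because under 1.3.4 the index has two levels named label. For that reason inplace=True may well fail in the same way.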
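As a quick check of how assignment behaves in plain Python, the sketch below (variable names are illustrative) rebinds one name and then inspects both sides.

```python
a = [1, 2, 3]
b = [4, 5, 6]

# Rebind the name `a` to the object that `b` refers to.
a = b

# `b` still refers to its original list; only `a` changed what it points to.
print(b)
print(a is b)
```

After the assignment both names refer to the same list, and nothing has been copied back into b.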

Another option is to go through a temporary DF using the .copy() method, like this:

myDF = pd.DataFrame(blabla...)
tempDF = myDF.reset_index()
myDF = tempDF.copy()

I prefer the "in-place" solution if both work. Give them a try and report back.
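The index output in the question suggests that the two versions differ in what groupby(level=...).rolling() puts into the result index, rather than in reset_index itself. One way to sidestep that, sketched below, is to keep label and date as ordinary columns and compute the per-label rolling sum with groupby(...).transform, so no index level can collide with a column. This is a sketch under assumptions: it has not been run against 1.1.5 or 1.3.4 here, and the `pd` alias and the list-based construction of the sample frame are just for illustration.

```python
import pandas as pd

# Same sample data as in the question, built from lists for brevity.
df = pd.DataFrame(
    {
        "label": ["apple", "carrot", "carrot", "apple", "apple", "carrot"],
        "date": pd.to_datetime(
            ["2021-10-27", "2021-10-27", "2021-10-28",
             "2021-10-28", "2021-10-29", "2021-10-29"]
        ),
        "stock": [100, 150, 75, 50, 200, 20],
    },
    index=range(1000, 1006),
)

# Sort so that rows within each label are in date order, then compute the
# rolling sum per label. transform() returns values aligned with the rows,
# so label and date never have to be moved into the index.
df_rolling = df.sort_values(["label", "date"]).copy()
df_rolling["stock"] = (
    df_rolling.groupby("label")["stock"]
    .transform(lambda s: s.rolling(window=7, min_periods=1).sum())
)

# Replace the original 1000..1005 index with a plain RangeIndex.
df_rolling = df_rolling.reset_index(drop=True)
print(df_rolling)
```

The result keeps label, date and stock as columns with a RangeIndex, which is the shape the original reset_index call was aiming for, plus the date column that the 1.1.5 output had lost.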
