TypeError: incompatible index of inserted column with frame index when grouping by 2 columns



I have a dataset that looks like this (plus some other columns):

Value         Theme       Country
-1.975767     Weather     China
-0.540979     Fruits      China
-2.359127     Fruits      China
-2.815604     Corona      Brazil
-0.929755     Weather     UK
-0.929755     Weather     UK

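I want to find the standard deviation of the values after grouping by theme and country (as explained here: calculating the standard deviation by grouping on two columns).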

df = pd.read_csv('./Brazil.csv')
df['std'] = df.groupby(['themes', 'country'])['value'].std()

However, at the moment I get this error:

File /usr/local/Cellar/ipython/8.0.1/libexec/lib/python3.10/site-packages/pandas/core/frame.py:3656, in DataFrame.__setitem__(self, key, value)
3653     self._setitem_array([key], value)
3654 else:
3655     # set column
-> 3656     self._set_item(key, value)
File /usr/local/Cellar/ipython/8.0.1/libexec/lib/python3.10/site-packages/pandas/core/frame.py:3833, in DataFrame._set_item(self, key, value)
3823 def _set_item(self, key, value) -> None:
3824     """
3825     Add series to DataFrame in specified column.
3826 
(...)
3831     ensure homogeneity.
3832     """
-> 3833     value = self._sanitize_column(value)
3835     if (
3836         key in self.columns
3837         and value.ndim == 1
3838         and not is_extension_array_dtype(value)
3839     ):
3840         # broadcast across multiple columns if necessary
3841         if not self.columns.is_unique or isinstance(self.columns, MultiIndex):
File /usr/local/Cellar/ipython/8.0.1/libexec/lib/python3.10/site-packages/pandas/core/frame.py:4534, in DataFrame._sanitize_column(self, value)
4532 # We should never get here with DataFrame value
4533 if isinstance(value, Series):
-> 4534     return _reindex_for_setitem(value, self.index)
4536 if is_list_like(value):
4537     com.require_length_match(value, self.index)
File /usr/local/Cellar/ipython/8.0.1/libexec/lib/python3.10/site-packages/pandas/core/frame.py:10985, in _reindex_for_setitem(value, index)
10981     if not value.index.is_unique:
10982         # duplicate axis
10983         raise err
> 10985     raise TypeError(
10986         "incompatible index of inserted column with frame index"
10987     ) from err
10988 return reindexed_value
TypeError: incompatible index of inserted column with frame index

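A simpler solution should be to call expanding on the grouped values and then remove the group levels from the index of the result with droplevel, so that the new column lines up with the frame index: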

df['std'] = (df.groupby(['Theme', 'Country'])['Value']
               .expanding()
               .std()
               .droplevel([0, 1]))
print(df)
      Value    Theme Country       std
0 -1.975767  Weather   China       NaN
1 -0.540979   Fruits   China       NaN
2 -2.359127   Fruits   China  1.285625
3 -2.815604   Corona  Brazil       NaN
4 -0.929755  Weather      UK       NaN
5 -0.929755  Weather      UK  0.000000
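
For context on the error itself: the grouped aggregation returns one value per (Theme, Country) pair, indexed by those two keys, so it cannot be aligned with the row index of df. If a single standard deviation per group is wanted rather than a running value, a minimal sketch could use transform, which returns a result aligned to the original rows. It assumes the sample data from the question; the column name group_std is made up for illustration.

```python
import io
import pandas as pd

# Sample data copied from the question
text_csv = """Value,Theme,Country
-1.975767,Weather,China
-0.540979,Fruits,China
-2.359127,Fruits,China
-2.815604,Corona,Brazil
-0.929755,Weather,UK
-0.929755,Weather,UK"""
df = pd.read_csv(io.StringIO(text_csv))

# Aggregating gives one row per (Theme, Country) group, indexed by the
# group keys (a two-level MultiIndex), not by the row labels of df.
group_std = df.groupby(['Theme', 'Country'])['Value'].std()
print(group_std.index.names)
print(df.index.nlevels)

# transform broadcasts each group's result back onto the original rows,
# so the returned Series shares df.index and can be assigned directly.
# 'group_std' is a hypothetical column name.
df['group_std'] = df.groupby(['Theme', 'Country'])['Value'].transform('std')
print(df)
```

Groups with a single row get NaN here, because the sample standard deviation needs at least two observations.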

You can use the rolling method to compute the cumulative standard deviation within each group.

Code

import io
import pandas as pd

# Create a sample dataframe
text_csv = '''Value,Theme,Country
-1.975767,Weather,China
-0.540979,Fruits,China
-2.359127,Fruits,China
-2.815604,Corona,Brazil
-0.929755,Weather,UK
-0.929755,Weather,UK'''
df = pd.read_csv(io.StringIO(text_csv))
# Calculate cumulative standard deviations
df_std = df.groupby(['Theme', 'Country'], as_index=False)['Value'].rolling(len(df), min_periods=1).std()
# Merge the original df with the cumulative std values
df_std = df.join(df_std.drop(['Theme', 'Country'], axis=1).rename(columns={'Value': 'CorrectedStd'}))

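Output

      Value    Theme Country  CorrectedStd
0 -1.975767  Weather   China           NaN
1 -0.540979   Fruits   China           NaN
2 -2.359127   Fruits   China      1.285625
3 -2.815604   Corona  Brazil           NaN
4 -0.929755  Weather      UK           NaN
5 -0.929755  Weather      UK      0.000000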
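
A rolling window as long as the whole frame with min_periods=1 should behave like an expanding window within each group, which is why both answers arrive at the same kind of running value. The following is a small sketch to check that on the sample data, not a general proof; the variable names are chosen for illustration.

```python
import io
import numpy as np
import pandas as pd

# Sample data copied from the question
text_csv = """Value,Theme,Country
-1.975767,Weather,China
-0.540979,Fruits,China
-2.359127,Fruits,China
-2.815604,Corona,Brazil
-0.929755,Weather,UK
-0.929755,Weather,UK"""
df = pd.read_csv(io.StringIO(text_csv))

grouped = df.groupby(['Theme', 'Country'])['Value']

# Both results carry the group keys as the first two index levels;
# dropping them leaves the original row labels, which are then sorted.
rolling_std = (grouped.rolling(len(df), min_periods=1)
                      .std()
                      .droplevel([0, 1])
                      .sort_index())
expanding_std = (grouped.expanding()
                        .std()
                        .droplevel([0, 1])
                        .sort_index())

# Compare element-wise, treating NaN in the same position as equal
same = np.allclose(rolling_std.to_numpy(), expanding_std.to_numpy(), equal_nan=True)
print(same)
```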
