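I am learning to use pipelines, and I built a very simple pipeline with a FunctionTransformer to add a new column, an ordinal encoder, and a LinearRegression model.
However, when I run the pipeline I get a SettingWithCopyWarning, and I have isolated the problem to the FunctionTransformer.
Here is the code. I have left out everything that is not needed (such as the ordinal encoder and the regression in the pipeline):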
import pandas as pd
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import FunctionTransformer

def weekfunc(df):
    df['date'] = pd.to_datetime(df.loc[:,'date'])
    df['weekend'] = df.loc[:, 'date'].dt.weekday
    df['weekend'].replace(range(5), 0, inplace = True)
    df['weekend'].replace([5,6], 1, inplace = True)
    return df

get_weekend = FunctionTransformer(weekfunc)
pipe = Pipeline([
    ('weekend transform', get_weekend),
])
pipe.transform(X_train)
This gives me the following warnings:
/opt/conda/lib/python3.7/site-packages/ipykernel_launcher.py:12: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead
See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
if sys.path[0] == '':
/opt/conda/lib/python3.7/site-packages/ipykernel_launcher.py:13: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead
See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
del sys.path[0]
/opt/conda/lib/python3.7/site-packages/pandas/core/generic.py:6619: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame
See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
return self._update_inplace(result)
This is strange, because I can do the same thing without the FunctionTransformer and not get the warning.
I am really confused, so any help is appreciated.
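The warning is telling you that you may not have done what you meant to do. You are setting values on something that may be a copy of a slice of another DataFrame: the copy gets updated, but the original df may not. That is the problem.
Pandas warns you because this can lead to serious mistakes, especially when you are working with large datasets.
Let's demonstrate: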
df=pd.DataFrame({'city':['San Fransisco', 'Nairobi'], 'score':[123,95]})
Add 2 to the score for the subset where the city is Nairobi:
df['score']=df.loc[df['city']=='Nairobi','score']+2
Result:
            city  score
0  San Fransisco    NaN
1        Nairobi   97.0
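Notice that although the update worked for Nairobi, it wiped out the San Fransisco score: the assignment replaced the whole column, so the row that was not selected became NaN. That is what the warning is all about.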
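To make the NaN less mysterious, here is a small sketch that splits the assignment into two steps. The intermediate name `subset` is only for illustration. It shows that the right-hand side contains just the Nairobi row, and that pandas aligns on the index when the whole column is assigned.

```python
import pandas as pd

df = pd.DataFrame({'city': ['San Fransisco', 'Nairobi'], 'score': [123, 95]})

# The right-hand side only contains the selected row (index label 1)
subset = df.loc[df['city'] == 'Nairobi', 'score'] + 2
print(subset.index.tolist())

# Assigning it to the whole column aligns on the index:
# labels missing from `subset` (here, label 0) are filled with NaN
df['score'] = subset
print(df['score'].isna().tolist())
```

The first print shows a single index label, and the second shows that the row missing from `subset` is now NaN.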
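What is the right way? The right way is to mask out whatever does not need updating. One way to do that is what the warning suggests: use the .loc accessor to select the cells you want to update.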
df.loc[df['city']=='Nairobi','score']=df.loc[df['city']=='Nairobi','score']+2
Result:
            city  score
0  San Fransisco    123
1        Nairobi     97
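Applied to the function in the question, one possible sketch is to work on an explicit copy inside the function and to build the weekend flag in a single assignment instead of using an in-place replace. The sample data below is hypothetical, and it assumes that X_train was obtained by slicing a larger DataFrame, which the question does not show.

```python
import pandas as pd
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import FunctionTransformer

def weekfunc(df):
    # Work on an explicit copy so the caller's DataFrame is never modified
    df = df.copy()
    df['date'] = pd.to_datetime(df['date'])
    # dt.weekday is 0 for Monday through 6 for Sunday; 5 and 6 are the weekend
    df['weekend'] = (df['date'].dt.weekday >= 5).astype(int)
    return df

# Hypothetical data standing in for the asker's dataset
full = pd.DataFrame({
    'date': ['2021-01-01', '2021-01-02', '2021-01-03', '2021-01-04'],
    'y': [1, 2, 3, 4],
})
X_train = full[full['y'] > 1]  # a slice of a larger frame (assumption)

pipe = Pipeline([
    ('weekend_transform', FunctionTransformer(weekfunc)),
])
out = pipe.fit_transform(X_train)
print(out)
```

Because the function never writes to the object it received, X_train is left unchanged and the new column only appears in the transformed output.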