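I'm cleaning data for a machine learning project, replacing missing values in the 'Age' column with zero and missing values in the 'Fare' column with the column mean. The code looks like this: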
train_data['Age'] = train_data['Age'].fillna(0)
mean = train_data['Fare'].mean()
train_data['Fare'] = train_data['Fare'].fillna(mean)
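Since I have to do this many times for other datasets, I want to automate the process with a generic function that takes a DataFrame as input, modifies it, and returns the modified DataFrame. The code looks like this:
def data_cleaning(df):
    df['Age'] = df['Age'].fillna(0)
    fare_mean = df['Fare'].mean()
    df['Fare'] = df['Fare'].fillna()
    return df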
However, when I pass in the training data DataFrame:
train_data = data_cleaning(train_data)
I get the following warning and error:
/opt/conda/lib/python3.7/site-packages/ipykernel_launcher.py:2:
SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead
See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
/tmp/ipykernel_42/1440633985.py in <module>
1 #print(train_data)
----> 2 train_data = data_cleaning(train_data)
3 cross_val_data = data_cleaning(cross_val_data)
/tmp/ipykernel_42/3053068338.py in data_cleaning(df)
2 df['Age'] = df['Age'].fillna(0)
3 fare_mean = df['Fare'].mean()
----> 4 df['Fare'] = df['Fare'].fillna()
5 return df
/opt/conda/lib/python3.7/site-packages/pandas/util/_decorators.py in wrapper(*args,
**kwargs)
309 stacklevel=stacklevel,
310 )
--> 311 return func(*args, **kwargs)
312
313 return wrapper
/opt/conda/lib/python3.7/site-packages/pandas/core/series.py in fillna(self, value,
method, axis, inplace, limit, downcast)
4820 inplace=inplace,
4821 limit=limit,
-> 4822 downcast=downcast,
4823 )
4824
/opt/conda/lib/python3.7/site-packages/pandas/core/generic.py in fillna(self, value,
method, axis, inplace, limit, downcast)
6311 """
6312 inplace = validate_bool_kwarg(inplace, "inplace")
-> 6313 value, method = validate_fillna_kwargs(value, method)
6314
6315 self._consolidate_inplace()
/opt/conda/lib/python3.7/site-packages/pandas/util/_validators.py in
validate_fillna_kwargs(value, method, validate_scalar_dict_value)
368
369 if value is None and method is None:
--> 370 raise ValueError("Must specify a fill 'value' or 'method'.")
371 elif value is None and method is not None:
372 method = clean_fill_method(method)
ValueError: Must specify a fill 'value' or 'method'.
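After some research, I found that I'm supposed to use the apply() and map() functions instead, but I'm not sure how to pass in the column mean. This also doesn't scale well, because I'd have to compute every fillna value before feeding it into the function, which is cumbersome. So I'd like to ask: is there a better way to automate data cleaning?
The line df['Fare'] = df['Fare'].fillna() in your function doesn't fill the N/A values with anything, which is why it raises an error. You should change it to df['Fare'] = df['Fare'].fillna(fare_mean).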
If you want to use the function from another file in the same directory, you can import it in that file:
from file_that_contain_function import function_name
If you want it to be reusable across your workspace or virtual environment, you will probably need to create your own Python package.
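On the scaling concern in the question, one possible approach is to describe the fill rules as a dict and let the function compute the values itself. The following is only a sketch under assumptions: the function name fill_missing and its strategies parameter are hypothetical, and it relies on DataFrame.fillna accepting a {column: value} dict.

```python
import pandas as pd


def fill_missing(df, strategies):
    """Return a new DataFrame with NaNs filled column by column.

    `strategies` (hypothetical parameter) maps a column name to either
    a scalar fill value or a callable that takes the column (a Series)
    and returns the fill value, e.g. pd.Series.mean.
    """
    fill_values = {}
    for column, strategy in strategies.items():
        if callable(strategy):
            # Compute the fill value from the column itself;
            # Series reductions such as mean() skip NaN by default.
            fill_values[column] = strategy(df[column])
        else:
            fill_values[column] = strategy
    # fillna with a dict fills each listed column with its own value
    # and returns a new DataFrame, leaving the input untouched.
    return df.fillna(fill_values)


# Small made-up frame standing in for the real training data.
train_data = pd.DataFrame(
    {"Age": [22.0, float("nan"), 35.0], "Fare": [10.0, 20.0, float("nan")]}
)
cleaned = fill_missing(train_data, {"Age": 0, "Fare": pd.Series.mean})
```

With this shape, adding another column or another dataset only means adding an entry to the dict, and none of the fill values has to be computed by hand beforehand.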
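So, yes, the other answer explains where the error comes from.
However, the warning at the top has nothing to do with filling NaNs. It is telling you that you are setting values on a copy of a slice of a DataFrame. Change your code to
def data_cleaning(df):
    df['Age'] = df.loc[:, 'Age'].fillna(0)
    fare_mean = df['Fare'].mean()
    df['Fare'] = df.loc[:, 'Fare'].fillna(fare_mean) # <- and also fix this error
    return df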
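I'd also suggest searching this site for that specific warning, since there are hundreds of posts that cover it in detail and explain how to deal with it.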
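As a rough illustration of one commonly suggested way to avoid this warning, the function can work on an explicit copy of its input. This is a sketch under an assumption the question doesn't confirm: that train_data was itself produced by slicing or filtering another DataFrame. The frames full and subset below are made-up stand-ins.

```python
import pandas as pd


def data_cleaning(df):
    # Take an explicit copy so the assignments below target a
    # standalone DataFrame rather than a slice of another one.
    df = df.copy()
    df["Age"] = df["Age"].fillna(0)
    fare_mean = df["Fare"].mean()
    df["Fare"] = df["Fare"].fillna(fare_mean)
    return df


# Made-up data; `subset` plays the role of a sliced training set.
full = pd.DataFrame(
    {
        "Age": [22.0, float("nan"), 35.0, 41.0],
        "Fare": [10.0, 20.0, float("nan"), 30.0],
    }
)
subset = full.iloc[:3]
cleaned = data_cleaning(subset)
```

The copy costs some extra memory, but it also means the caller's DataFrame is left unmodified, which tends to make a reusable cleaning function easier to reason about.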