I am trying to apply a Box-Cox transformation to a single column, but I can't get it to work. Can someone help me with this?
from sklearn.datasets import fetch_california_housing
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import statsmodels.api as sm
from sklearn.preprocessing import PowerTransformer
california_housing = fetch_california_housing(as_frame=True).frame
california_housing
power = PowerTransformer(method='box-cox', standardize=True)
california_housing['MedHouseVal']=power.fit_transform(california_housing['MedHouseVal'])
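The function power.fit_transform requires the input to have shape (n, 1) when there is a single feature, not (n,). california_housing['MedHouseVal'] has shape (n,) because it is a pd.Series. You can fix this by reshaping, i.e. by replacing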
power.fit_transform(california_housing['MedHouseVal'])
with
power.fit_transform(california_housing['MedHouseVal'].to_numpy().reshape(-1, 1))
Alternatively, select a list of columns with california_housing[['MedHouseVal']] (which gives a pd.DataFrame) instead of a single column with california_housing['MedHouseVal'] (which gives a pd.Series), i.e. use
power.fit_transform(california_housing[['MedHouseVal']])
Note that
print(california_housing['MedHouseVal'].shape)
print(california_housing[['MedHouseVal']].shape)
prints
(20640,)
(20640, 1)
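To make the shape issue concrete, here is a minimal self-contained sketch. It uses synthetic, strictly positive data as a stand-in for the housing column (the lognormal sample and the names df and series_failed are assumptions for illustration, not part of the original question), so it runs without downloading the dataset.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import PowerTransformer

# Hypothetical stand-in data: strictly positive, right-skewed values.
rng = np.random.default_rng(0)
df = pd.DataFrame({"MedHouseVal": rng.lognormal(mean=0.5, sigma=0.6, size=1000)})

power = PowerTransformer(method="box-cox", standardize=True)

# Passing the Series (shape (n,)) is rejected because a 2D input is expected.
try:
    power.fit_transform(df["MedHouseVal"])
    series_failed = False
except ValueError:
    series_failed = True

# Passing a one-column DataFrame (shape (n, 1)) works.
transformed = power.fit_transform(df[["MedHouseVal"]])

# The result also has shape (n, 1); flatten it before assigning to one column.
df["MedHouseVal"] = np.asarray(transformed).ravel()
```

Flattening with ravel() before assignment keeps the column one-dimensional, which avoids relying on how pandas handles a (n, 1) array on assignment.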
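Another option is to use scipy.stats.boxcox, which returns the transformed values together with the fitted lambda (hence the [0] below):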
from sklearn.datasets import fetch_california_housing
from scipy.stats import boxcox
california_housing = fetch_california_housing(as_frame=True).frame
california_housing['MedHouseVal'] = boxcox(california_housing['MedHouseVal'])[0]
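One difference worth keeping in mind: scipy.stats.boxcox only applies the power transform, while PowerTransformer with standardize=True additionally rescales the result to zero mean and unit variance. The sketch below compares the two on synthetic positive data (the sample and variable names are assumptions for illustration). With standardize=False the two approaches should agree closely, since both estimate lambda by maximum likelihood.

```python
import numpy as np
from scipy.stats import boxcox
from sklearn.preprocessing import PowerTransformer

# Hypothetical stand-in data: strictly positive values, as Box-Cox requires.
rng = np.random.default_rng(0)
x = rng.lognormal(mean=0.5, sigma=0.6, size=1000)

# scipy accepts a 1D array and returns (transformed values, fitted lambda).
values_scipy, lam_scipy = boxcox(x)

# scikit-learn needs a 2D input; standardize=False skips the rescaling step.
pt = PowerTransformer(method="box-cox", standardize=False)
values_sklearn = pt.fit_transform(x.reshape(-1, 1)).ravel()
lam_sklearn = pt.lambdas_[0]

print(lam_scipy, lam_sklearn)
```

If you need the standardized output that the original code asked for, either keep PowerTransformer with standardize=True or standardize the scipy result yourself.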