dtype='numeric' 与字节/字符串数组不兼容

我尝试对以下数据执行线性回归

Country   Age   Salary Purchased
0   France  44.0  72000.0        No
1    Spain  27.0  48000.0       Yes
2  Germany  30.0  54000.0        No
3    Spain  38.0  61000.0        No
4  Germany  40.0      NaN       Yes
5   France  35.0  58000.0       Yes
6    Spain   NaN  52000.0        No
7   France  48.0  79000.0       Yes
8  Germany  50.0  83000.0        No
9   France  37.0  67000.0       Yes*

我试过了:

from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder
oHe=OneHotEncoder()
ct=ColumnTransformer(transformers=[('encoder',oHe,[0])],remainder='passthrough')
X=np.array(ct.fit_transform(X),dtype=np.str)
from sklearn.model_selection import train_test_split
X_train,X_test,y_train,y_test=train_test_split(X,y,test_size=0.2,random_state=1)

from sklearn.linear_model import LinearRegression
regressor = LinearRegression()
regressor.fit(X_train,y_train)

我得到的错误信息如下，我甚至尝试通过重塑它们，但它没有成功:

ValueError: dtype='numeric' is not compatible with arrays of bytes/strings.Convert your data to numeric values explicitly instead.

这里我从你的代码中做了一个例子:

from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder
data = {'Country': ['France', 'Spain', 'Germany', 'Spain', 'Germany', 'France', 'Spain', 'France', 'Germany', 'France'],
'Age': [44.0, 27.0, 30.0, 38.0, 40.0, 35.0, None, 48.0, 50.0, 37.0],
'Salary': [72000.0, 48000.0, 54000.0, 61000.0, None, 58000.0, 52000.0, 79000.0, 83000.0, 67000.0],
'Purchased': ['No', 'Yes', 'No', 'No', 'Yes', 'Yes', 'No', 'Yes', 'No', 'Yes']}
df = pd.DataFrame(data)
df = df.dropna()
X = df[["Country","Age"]]
y = df["Salary"]
ohe=OneHotEncoder()
ct=ColumnTransformer(transformers=[('encoder',ohe,[0])],remainder='passthrough')
X=np.array(ct.fit_transform(X))
from sklearn.model_selection import train_test_split
X_train,X_test,y_train,y_test=train_test_split(X,y,test_size=0.2,random_state=1)
from sklearn.linear_model import LinearRegression
regressor = LinearRegression()
regressor.fit(X_train,y_train)
regressor.score(X_test, y_test) #0.705

I got no error with this.

相关内容

最新更新

热门标签：