I am trying to perform linear regression on the following data:
   Country   Age   Salary Purchased
0   France  44.0  72000.0        No
1    Spain  27.0  48000.0       Yes
2  Germany  30.0  54000.0        No
3    Spain  38.0  61000.0        No
4  Germany  40.0      NaN       Yes
5   France  35.0  58000.0       Yes
6    Spain   NaN  52000.0        No
7   France  48.0  79000.0       Yes
8  Germany  50.0  83000.0        No
9   France  37.0  67000.0       Yes
Here is what I tried:
import numpy as np
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OneHotEncoder

oHe = OneHotEncoder()
ct = ColumnTransformer(transformers=[('encoder', oHe, [0])], remainder='passthrough')
X = np.array(ct.fit_transform(X), dtype=np.str)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=1)

regressor = LinearRegression()
regressor.fit(X_train, y_train)
I get the error below. I even tried reshaping the arrays, but that did not help:
ValueError: dtype='numeric' is not compatible with arrays of bytes/strings.Convert your data to numeric values explicitly instead.
Here is a working example built from your code:
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OneHotEncoder

data = {'Country': ['France', 'Spain', 'Germany', 'Spain', 'Germany', 'France', 'Spain', 'France', 'Germany', 'France'],
        'Age': [44.0, 27.0, 30.0, 38.0, 40.0, 35.0, None, 48.0, 50.0, 37.0],
        'Salary': [72000.0, 48000.0, 54000.0, 61000.0, None, 58000.0, 52000.0, 79000.0, 83000.0, 67000.0],
        'Purchased': ['No', 'Yes', 'No', 'No', 'Yes', 'Yes', 'No', 'Yes', 'No', 'Yes']}
df = pd.DataFrame(data)
df = df.dropna()  # drop the rows with a missing Age or Salary

X = df[["Country", "Age"]]
y = df["Salary"]

# One-hot encode column 0 (Country) and pass Age through unchanged
ohe = OneHotEncoder()
ct = ColumnTransformer(transformers=[('encoder', ohe, [0])], remainder='passthrough')
X = np.array(ct.fit_transform(X))  # no dtype=np.str here, so the values stay numeric

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=1)

regressor = LinearRegression()
regressor.fit(X_train, y_train)
regressor.score(X_test, y_test)  # 0.705
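I got no error with this. The main difference from your code is that the encoded array is not cast with dtype=np.str, so the features stay numeric instead of being turned into strings. The rows with missing values are also dropped with dropna() before fitting, since LinearRegression does not accept NaN.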