dtype='numeric' is not compatible with arrays of bytes/strings. Convert your data to numeric values explicitly instead



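I am doing linear regression with scikit-learn. I have tried various things, including reshaping the data, but the code still raises an error. The dataset is: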

    R&D Spend  Administration  Marketing Spend       State     Profit
0   165349.20       136897.80        471784.10    New York  192261.83
1   162597.70       151377.59        443898.53  California  191792.06
2   153441.51       101145.55        407934.54     Florida  191050.39
3   144372.41       118671.85        383199.62    New York  182901.99
4   142107.34        91391.77        366168.42     Florida  166187.94
5   131876.90        99814.71        362861.36    New York  156991.12
6   134615.46       147198.87        127716.82  California  156122.51
7   130298.13       145530.06        323876.68     Florida  155752.60
8   120542.52       148718.95        311613.29    New York  152211.77
9   123334.88       108679.17        304981.62  California  149759.96
10  101913.08       110594.11        229160.95     Florida  146121.95
11  100671.96        91790.61        249744.55  California  144259.40
12   93863.75       127320.38        249839.44     Florida  141585.52
13   91992.39       135495.07        252664.93  California  134307.35
14  119943.24       156547.42        256512.92     Florida  132602.65

I tried the following code:

import numpy as np
import pandas as pd

# Dataset
dataset = pd.read_csv(r'50_Startups.csv')
X = dataset.iloc[:, :-1]
y = dataset.iloc[:, -1]

# Encoding categorical data (column 3 is State)
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder
oHe = OneHotEncoder()
ct = ColumnTransformer(transformers=[('encoder', oHe, [3])], remainder='passthrough')
X = np.array(ct.fit_transform(X), dtype=np.str)

# Splitting into training and test sets
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=1)

# Training the multiple linear regression
from sklearn.linear_model import LinearRegression
regressor = LinearRegression()
regressor.fit(X_train, y_train)

The error is:

ValueError: dtype='numeric' is not compatible with arrays of bytes/strings.
Convert your data to numeric values explicitly instead.

You should use a numeric dtype for X:

X = np.array(ct.fit_transform(X), dtype=np.float64)

Then the regression fits without error:

regressor.fit(X_train, y_train)
regressor.coef_
# array([ 2.21054629e+03,  2.33695693e+03, -4.54750322e+03,  8.05301486e-01,
#        -9.57801181e-03,  1.17912512e-02])
regressor.intercept_
# 52971.480360281625
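
As a self-contained illustration of the fix, here is a minimal sketch that uses only the first six rows of the dataset shown in the question, built in memory instead of read from 50_Startups.csv. The names encoded, X_str and X_num are made up for this example, and sparse_threshold=0 is an extra setting (not in the original code) that asks ColumnTransformer to always return a dense array. The coefficients will differ from the ones above because only six rows are used.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import OneHotEncoder

# First six rows of the dataset from the question (assumption: enough to illustrate).
dataset = pd.DataFrame({
    "R&D Spend": [165349.20, 162597.70, 153441.51, 144372.41, 142107.34, 131876.90],
    "Administration": [136897.80, 151377.59, 101145.55, 118671.85, 91391.77, 99814.71],
    "Marketing Spend": [471784.10, 443898.53, 407934.54, 383199.62, 366168.42, 362861.36],
    "State": ["New York", "California", "Florida", "New York", "Florida", "New York"],
    "Profit": [192261.83, 191792.06, 191050.39, 182901.99, 166187.94, 156991.12],
})
X = dataset.iloc[:, :-1]
y = dataset.iloc[:, -1]

# One-hot encode column 3 (State) and pass the numeric columns through.
# sparse_threshold=0 forces a dense result so np.array() behaves as expected.
ct = ColumnTransformer(
    transformers=[("encoder", OneHotEncoder(), [3])],
    remainder="passthrough",
    sparse_threshold=0,
)
encoded = ct.fit_transform(X)

# Casting to a string dtype produces the kind of array that fit() rejects.
X_str = np.array(encoded, dtype=str)
# Casting to float64 produces a numeric array that fit() accepts.
X_num = np.array(encoded, dtype=np.float64)

regressor = LinearRegression().fit(X_num, y)
print(X_str.dtype.kind, X_num.dtype)
```

The printed dtype kind of X_str is a Unicode string kind, which is what the dtype='numeric' check in scikit-learn complains about, while X_num is plain float64.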

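Here, we first use LabelEncoder to convert the categorical variable to numeric values, and then apply OneHotEncoder to the converted numeric data. Finally, remove the dtype argument from the np.array() call so that the converted data ends up with an appropriate numeric dtype.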

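from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, LabelEncoder
le = LabelEncoder()
oHe = OneHotEncoder()
# Replace the State strings in column 3 with integer labels
X.iloc[:, 3] = le.fit_transform(X.iloc[:, 3])
ct = ColumnTransformer(transformers=[('encoder', oHe, [3])], remainder='passthrough')
# No dtype argument: NumPy keeps the numeric dtype of the transformed data
X = np.array(ct.fit_transform(X))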
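
As a possible alternative, not taken from the answers here: in scikit-learn 0.20 and later, OneHotEncoder accepts string categories directly, so the LabelEncoder step and the manual np.array() conversion can be skipped by putting the encoder and the regressor in one Pipeline. The sketch below assumes the first six rows of the question's dataset built in memory. The step names "encode", "state" and "regress" are arbitrary labels chosen for this example, and handle_unknown="ignore" is an optional setting added so that a state unseen during fitting does not raise at prediction time.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

# First six rows of the dataset from the question (assumption: enough to illustrate).
dataset = pd.DataFrame({
    "R&D Spend": [165349.20, 162597.70, 153441.51, 144372.41, 142107.34, 131876.90],
    "Administration": [136897.80, 151377.59, 101145.55, 118671.85, 91391.77, 99814.71],
    "Marketing Spend": [471784.10, 443898.53, 407934.54, 383199.62, 366168.42, 362861.36],
    "State": ["New York", "California", "Florida", "New York", "Florida", "New York"],
    "Profit": [192261.83, 191792.06, 191050.39, 182901.99, 166187.94, 156991.12],
})
X = dataset.drop(columns="Profit")
y = dataset["Profit"]

# The pipeline one-hot encodes State, passes the numeric columns through,
# and fits the regression on the numeric result, so no string array is created.
model = Pipeline(steps=[
    ("encode", ColumnTransformer(
        transformers=[("state", OneHotEncoder(handle_unknown="ignore"), ["State"])],
        remainder="passthrough",
    )),
    ("regress", LinearRegression()),
])
model.fit(X, y)
predictions = model.predict(X)
print(predictions.shape)
```

Keeping the encoding inside the pipeline also means the same transformation is applied automatically to X_test when calling model.predict after a train/test split.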
