How to use linear regression with categorical variables in sklearn

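I am trying to run some speed comparison tests between Python and R, and I am stuck on one problem: linear regression with categorical variables in sklearn.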

R code:

# Start the clock!
ptm <- proc.time()
ptm
test_data = read.csv("clean_hold.out.csv")
# Regression Model
model_liner = lm(test_data$HH_F ~ ., data = test_data)
# Stop the clock
new_ptm <- proc.time() - ptm

Python code:

import pandas as pd
import time
from sklearn.linear_model import LinearRegression
from sklearn.feature_extraction import DictVectorizer
start = time.time()
test_data = pd.read_csv("./clean_hold.out.csv")
x_train = [col for col in test_data.columns[1:] if col != 'HH_F']
y_train = ['HH_F']
model_linear = LinearRegression(normalize=False)
model_linear.fit(test_data[x_train], test_data[y_train])

But it does not work for me:

    return X.astype(np.float32 if X.dtype == np.int32 else np.float64)
ValueError: could not convert string to float: Bee True

I tried another approach:

test_data = pd.read_csv("./clean_hold.out.csv").to_dict()
v = DictVectorizer(sparse=False)
X = v.fit_transform(test_data)

However, I got another error:

File "C:\Anaconda32\lib\site-packages\sklearn\feature_extraction\dict_vectorizer.py", line 258, in transform
    Xa[i, vocab[f]] = dtype(v)
TypeError: float() argument must be a string or a number

I don't understand how Python is supposed to handle this...

Data example: http://screencast.com/t/hYyyu7nU9hQm

I had to do some encoding before calling fit.
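As one illustration of such encoding, the DictVectorizer attempt from the question can be adapted. A likely cause of the failure is that `DataFrame.to_dict()` returns a column-oriented dict by default, whereas DictVectorizer works on one dict per row. The sketch below assumes a tiny made-up frame in place of clean_hold.out.csv; the column names `species` and `size` are hypothetical.

```python
import pandas as pd
from sklearn.feature_extraction import DictVectorizer

# Made-up rows standing in for clean_hold.out.csv (column names are hypothetical).
test_data = pd.DataFrame({
    "species": ["Bee", "Wasp", "Bee"],
    "size": [1.0, 2.0, 3.0],
})

# orient="records" yields a list with one {column: value} dict per row,
# which is the input shape DictVectorizer works with.
records = test_data.to_dict(orient="records")

# String values become one indicator column per distinct value
# (named "column=value"); numeric values are passed through unchanged.
v = DictVectorizer(sparse=False)
X = v.fit_transform(records)

print(sorted(v.vocabulary_))
print(X)
```

The resulting `X` is purely numeric, so it can be handed to `LinearRegression.fit` together with the target column.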

Several classes can be used:

LabelEncoder: turns each distinct string into an incremental integer value
OneHotEncoder: uses the one-of-K scheme to turn each distinct string into its own binary indicator column
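To make the difference between the two classes concrete, here is a minimal sketch on a made-up column of strings. It assumes scikit-learn 0.20 or later, where OneHotEncoder accepts string input directly; the `colors` array is purely illustrative.

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder, OneHotEncoder

# Hypothetical categorical column; the values are made up for illustration.
colors = np.array(["red", "green", "blue", "green"])

# LabelEncoder: one integer per distinct string (classes are sorted).
label_encoder = LabelEncoder()
labels = label_encoder.fit_transform(colors)

# OneHotEncoder: one binary column per distinct string.
# It expects 2-D input, hence the reshape; the default output is a
# scipy sparse matrix, converted to a dense array here only for printing.
onehot_encoder = OneHotEncoder()
onehot = onehot_encoder.fit_transform(colors.reshape(-1, 1)).toarray()

print(label_encoder.classes_)
print(labels)
print(onehot)
```

For a linear model the integer codes from LabelEncoder imply an ordering between categories that usually does not exist, which is one reason one-hot columns tend to be the safer choice for features.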

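I wanted a scalable solution but did not get any answers, so I chose OneHotEncoder, which binarizes all the strings. It works well, but if you have many distinct strings the matrix grows very quickly and needs a lot of memory.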
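The OneHotEncoder approach can be sketched end to end roughly as follows. This is not the exact code I ran: the data frame is a made-up stand-in for clean_hold.out.csv, and only the target name `HH_F` comes from the question; `species` and `size` are hypothetical columns. It assumes scikit-learn 0.20 or later for `ColumnTransformer`. OneHotEncoder produces a sparse matrix by default, which may soften the memory growth when there are many distinct strings.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder

# Made-up data; HH_F is built as 2 * size plus a per-species offset
# (Bee 10, Wasp 20, Ant 30) so the fit is easy to check by hand.
test_data = pd.DataFrame({
    "species": ["Bee", "Wasp", "Bee", "Ant", "Wasp", "Ant"],
    "size": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0],
    "HH_F": [12.0, 24.0, 16.0, 38.0, 30.0, 42.0],
})

X = test_data.drop(columns="HH_F")
y = test_data["HH_F"]

# One-hot encode only the string column; pass numeric columns through.
categorical_cols = ["species"]
preprocess = ColumnTransformer(
    [("onehot", OneHotEncoder(handle_unknown="ignore"), categorical_cols)],
    remainder="passthrough",
)

# Chain the encoding and the regression so fit/predict take the raw frame.
model = make_pipeline(preprocess, LinearRegression())
model.fit(X, y)

# Predict for a new, hypothetical row.
new_rows = pd.DataFrame({"species": ["Bee"], "size": [2.0]})
prediction = model.predict(new_rows)
print(prediction)
```

Wrapping the encoder in a pipeline keeps the same encoding for training and prediction, and `handle_unknown="ignore"` encodes categories unseen during fit as all zeros instead of raising an error.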
