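I am trying to classify the wine data set here - http://archive.ics.uci.edu/ml/datasets/Wine+Quality - using logistic regression (with method='bfgs' and the L1 norm), and I hit a singular matrix error (raise LinAlgError('Singular matrix')), even though the matrix has full rank [I tested this with np.linalg.matrix_rank(data[train_cols].values)].
That is how I came to the conclusion that some features might be linear combinations of others. To check this, I experimented with grid search / LinearSVC - the error I get is below, along with my code & data set.
I can see that only 6/7 of the features are actually "independent" - I infer this by comparing x_train_new[0] with the rows of x_train (so I can work out which columns are redundant).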
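A quick way to probe the linear-dependence suspicion is to compare the numerical rank of the feature matrix with its column count and to look at its condition number. The sketch below uses made-up data (the arrays `X` and `X_near` are hypothetical, not the wine data) and only illustrates that a matrix can report full rank while still being nearly singular, which may or may not be what happens in the logistic regression case above.

```python
import numpy as np

# Hypothetical feature matrix: column 2 is an exact linear combination of columns 0 and 1.
rng = np.random.default_rng(0)
a = rng.normal(size=50)
b = rng.normal(size=50)
X = np.column_stack([a, b, 2.0 * a - 3.0 * b])

rank_exact = np.linalg.matrix_rank(X)
print("columns:", X.shape[1], "rank:", rank_exact)

# Add a tiny perturbation: the rank test now passes, but the matrix is still ill-conditioned.
X_near = X.copy()
X_near[:, 2] += 1e-9 * rng.normal(size=50)
rank_near = np.linalg.matrix_rank(X_near)
cond_near = np.linalg.cond(X_near)
print("rank of perturbed matrix:", rank_near)
print("condition number of perturbed matrix:", cond_near)
```

A very large condition number suggests that some columns are close to redundant even when matrix_rank reports full rank.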
# Train & test DATA CREATION
from sklearn.svm import LinearSVC
import numpy, random
import pandas as pd
df = pd.read_csv("https://github.com/ekta1007/Predicting_wine_quality/blob/master/wine_red_dataset.csv")
#,skiprows=0, sep=',')
df=df.dropna(axis=1,how='any') # also tried how='all' - still get NaN errors as below
header=list(df.columns.values) # or df.columns
X = df[df.columns - [header[-1]]] # header[-1] = ['quality'] - this is to make the code generic enough
Y = df[header[-1]] # df['quality']
rows = random.sample(df.index, int(len(df)*0.7)) # indexing the rows that will be picked in the train set
x_train, y_train = X.ix[rows],Y.ix[rows] # Fetching the data frame using indexes
x_test,y_test = X.drop(rows),Y.drop(rows)
# Training the classifier using Linear Support Vector Classification (LinearSVC).
clf = LinearSVC(C=0.01, penalty="l1", dual=False) #,tol=0.0001,fit_intercept=True, intercept_scaling=1)
clf.fit(x_train, y_train)
x_train_new = clf.fit_transform(x_train, y_train)
#print x_train_new #works
clf.predict(x_test) # does NOT work and gives NaN errors for some x_tests
clf.score(x_test, y_test) # Does NOT work
clf.coef_ # Works, but I am not sure, if this is OK, given huge NaN's - or does the coef's get impacted ?
clf.predict(x_train)
552 NaN
209 NaN
427 NaN
288 NaN
175 NaN
427 NaN
748 7
552 NaN
429 NaN
[... and MORE]
Name: quality, Length: 1119
clf.predict(x_test)
76 NaN
287 NaN
420 7
812 NaN
443 7
420 7
430 NaN
373 5
624 5
[..and More]
Name: quality, Length: 480
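Strangely, when I run clf.predict(x_train), I still see some NaNs - what am I doing wrong? After training the model on this data, this should not happen, right?
Following this thread, I also checked that there are no nulls in my CSV file (though I relabeled "quality" to just the labels 5 and 7, from range(3,10)): How to fix "NaN or infinity" issue for sparse matrix in Python?
Also - here are the data types of x_test & y_test/train ...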
x_test
<class 'pandas.core.frame.DataFrame'>
Int64Index: 480 entries, 1 to 1596
Data columns:
alcohol 480 non-null values
chlorides 480 non-null values
citric acid 480 non-null values
density 480 non-null values
fixed acidity 480 non-null values
free sulfur dioxide 480 non-null values
pH 480 non-null values
residual sugar 480 non-null values
sulphates 480 non-null values
total sulfur dioxide 480 non-null values
volatile acidity 480 non-null values
dtypes: float64(11)
y_test
1 5
10 5
18 5
21 5
30 5
31 7
36 7
40 5
50 5
52 7
53 5
55 5
57 5
60 5
61 5
[..And MORE]
Name: quality, Length: 480
Finally..
clf.score(x_test, y_test)
Traceback (most recent call last):
File "<pyshell#31>", line 1, in <module>
clf.score(x_test, y_test)
File "C:\Python27\lib\site-packages\sklearn\base.py", line 279, in score
return accuracy_score(y, self.predict(X))
File "C:\Python27\lib\site-packages\sklearn\metrics\metrics.py", line 742, in accuracy_score
y_true, y_pred = check_arrays(y_true, y_pred)
File "C:\Python27\Lib\site-packages\sklearn\utils\validation.py", line 215, in check_arrays
File "C:\Python27\Lib\site-packages\sklearn\utils\validation.py", line 18, in _assert_all_finite
ValueError: Array contains NaN or infinity.
#I also explicitly checked for NaN's as here -:
for i in df.columns:
df[i].isnull()
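The loop above produces one full boolean mask per column, which is hard to scan by eye. A more compact check is sketched here on a small made-up frame (`df_demo` and its values are hypothetical stand-ins, not the wine data): it counts missing values per column and also tests for infinities, which isnull() does not flag.

```python
import numpy as np
import pandas as pd

# Hypothetical frame standing in for the wine data; one value is deliberately missing.
df_demo = pd.DataFrame({
    "alcohol": [9.4, 9.8, np.nan],
    "pH": [3.51, 3.20, 3.26],
    "quality": [5, 5, 7],
})

# Number of missing values in each column.
nan_per_column = df_demo.isnull().sum()

# True if any cell in the whole frame is missing.
has_nan = bool(df_demo.isnull().values.any())

# True only if every value is a finite number (catches NaN and +/-inf).
all_finite = bool(np.isfinite(df_demo.to_numpy(dtype=float)).all())

print(nan_per_column)
print("any NaN:", has_nan, "all finite:", all_finite)
```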
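Note: please also mention whether my thought process in using LinearSVC is correct, given my use case, or whether I should use grid search instead.
Disclaimer: parts of this code were built on suggestions from similar contexts on Stack Overflow and other sources - my real use case is just to assess whether this method is a good fit for my scenario. That's all.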
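This worked. The only thing I needed to change was to use x_test.values, and likewise .values for the other pandas DataFrames (x_train, y_train, y_test). As pointed out, the only reason was the incompatibility between pandas DataFrames and scikit-learn (which uses NumPy arrays).
# Changing your pandas DataFrame to work with scikit-learn by converting it to a NumPy array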
>>> type(x_test)
<class 'pandas.core.frame.DataFrame'>
>>> type(x_test.values)
<type 'numpy.ndarray'>
This hack comes from this post, http://python.dzone.com/articles/python-making-making-scikit-learn------, and from @andreasmueller, who pointed out the inconsistency.
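For reference, here is a minimal sketch of the same fix end to end. It assumes a recent pandas and scikit-learn, uses synthetic two-class data (the names `features` and `labels` are hypothetical, not the wine CSV), uses `.to_numpy()` as the newer spelling of `.values`, and swaps the manual random.sample split for `train_test_split`. Recent scikit-learn versions generally accept DataFrames directly, so the explicit conversion may not be needed there.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.svm import LinearSVC

# Hypothetical two-class data standing in for the relabeled wine set (labels 5 and 7).
rng = np.random.default_rng(0)
features = pd.DataFrame(rng.normal(size=(200, 4)), columns=["f0", "f1", "f2", "f3"])
labels = pd.Series(np.where(features["f0"] + features["f1"] > 0, 7, 5), name="quality")

# Convert to plain NumPy arrays so no pandas index is carried into the estimator.
X = features.to_numpy()
y = labels.to_numpy()

# 70/30 split, mirroring the proportions used in the question.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

# Same estimator settings as in the question.
clf = LinearSVC(C=0.01, penalty="l1", dual=False)
clf.fit(X_train, y_train)

pred = clf.predict(X_test)
print("prediction type:", type(pred))
print("any NaN in predictions:", bool(np.isnan(pred.astype(float)).any()))
print("accuracy:", clf.score(X_test, y_test))
```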