sklearn: what does bestfeatures.fit(X, Y) mean, and how do I define X and Y?



My code is not working, and I think it is because X and Y are not defined. I got the code from a book, which never actually says how they are defined.

import pandas as pd
from matplotlib import pyplot
import seaborn as sns
from sklearn.feature_selection import SelectKBest
from sklearn.feature_selection import chi2
from sklearn.datasets import load_digits
from pandas import read_csv
from pandas.plotting import scatter_matrix

filename = '/Users/rahulparmeshwar/Documents/Algo Bots/Data/Live Data/Tester.csv'
data = read_csv(filename)
correlation = data.corr()
bestfeatures = SelectKBest(k=5)
fit = bestfeatures.fit(X,Y)
dfscores = pd.DataFrame(fit.scores_)
dfcolumns = pd.DataFrame(X.columns)
featurescores = pd.concat([dfcolumns,dfscores],axis=1)
pd.set_option('display.width',100)
data.head(1)
print(data)
scatter_matrix(data)
pyplot.show()
print(featurescores.nlargest('2,score'))

I have already checked the sklearn documentation, but it was not very helpful. Any help is appreciated.

X and y should be the feature set and the target variable that you load from your data file. Here is a typical way to define them:

data = read_csv(filename)
y = data['target variable name']
X = data.drop('target variable name', axis=1)
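To make the split above concrete, here is a minimal sketch of the question's SelectKBest step with X and y defined first. The DataFrame is synthetic and the column names (`f1`, `f2`, `f3`, `target`) are hypothetical stand-ins for whatever is in your CSV. It assumes a classification-style target and uses `f_classif`, which is SelectKBest's default scoring function; the `chi2` scorer imported in the question would additionally require non-negative feature values. Note also that `nlargest` takes the count and the column name as two separate arguments, not as the single string `'2,score'`.

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import SelectKBest, f_classif

# Synthetic stand-in for read_csv(filename); column names are hypothetical.
rng = np.random.default_rng(0)
n = 200
data = pd.DataFrame({
    "f1": rng.normal(size=n),
    "f2": rng.normal(size=n),
    "f3": rng.normal(size=n),
})
# The target is built to depend almost entirely on f1.
data["target"] = (data["f1"] + 0.1 * rng.normal(size=n) > 0).astype(int)

# y is the column you want to predict; X is everything else.
y = data["target"]
X = data.drop("target", axis=1)

# Score every feature against y and keep the 2 best.
bestfeatures = SelectKBest(score_func=f_classif, k=2)
fit = bestfeatures.fit(X, y)

# Pair each column name with its score, then list the top scorers.
featurescores = pd.DataFrame({"feature": X.columns, "score": fit.scores_})
top = featurescores.nlargest(2, "score")
print(top)
```

With real data, replace the synthetic DataFrame with `read_csv(filename)` and `"target"` with the name of the column you want to predict.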

See the example here.

https://github.com/ASH-WICUS/Notebooks/blob/master/Accuracies%20of%20Different%20Regressors%20-%20Housing%20Prices.ipynb

You can download the sample data from here.

https://raw.githubusercontent.com/RuiChang123/Regression_for_house_price_estimation/master/final_data.csv

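You should be able to get it working relatively easily. Then modify the code to fit your specific needs. Remember that "X" consists of the independent variables, that is, the variables whose influence on the dependent variable ("y") you want to measure.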

This should help too!

https://mljar.com/blog/feature-importance-in-random-forest/
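As a rough illustration of the approach that link describes, the sketch below ranks features by a random forest's impurity-based importances. It is not taken from the linked article; the data is synthetic, the column names are hypothetical, and a classifier is assumed (for a numeric target, `RandomForestRegressor` exposes the same `feature_importances_` attribute).

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

# Synthetic data; column names are hypothetical.
rng = np.random.default_rng(0)
n = 300
X = pd.DataFrame({
    "f1": rng.normal(size=n),
    "f2": rng.normal(size=n),
    "f3": rng.normal(size=n),
})
# The target depends only on f1.
y = (X["f1"] > 0).astype(int)

# Fit the forest on the same X / y split used for SelectKBest.
model = RandomForestClassifier(n_estimators=100, random_state=0)
model.fit(X, y)

# One importance value per column, sorted from most to least important.
importances = pd.Series(model.feature_importances_, index=X.columns)
importances = importances.sort_values(ascending=False)
print(importances)
```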
