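My code isn't working, and I think it's because X and Y are not defined. I got the code from a book, and it doesn't actually explain how they are defined.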
import pandas as pd
from matplotlib import pyplot
import seaborn as sns
from sklearn.feature_selection import SelectKBest
from sklearn.feature_selection import chi2
from sklearn.datasets import load_digits
from pandas import read_csv
from pandas.plotting import scatter_matrix
filename = '/Users/rahulparmeshwar/Documents/Algo Bots/Data/Live Data/Tester.csv'
data = read_csv(filename)
correlation = data.corr()
bestfeatures = SelectKBest(k=5)
fit = bestfeatures.fit(X,Y)
dfscores = pd.DataFrame(fit.scores_)
dfcolumns = pd.DataFrame(X.columns)
featurescores = pd.concat([dfcolumns,dfscores],axis=1)
pd.set_option('display.width',100)
data.head(1)
print(data)
scatter_matrix(data)
pyplot.show()
print(featurescores.nlargest('2,score'))
I've checked the sklearn documentation, but it wasn't very helpful. Any help would be appreciated.
X and y should be the feature set and the target variable that you load from your data file. Here is a typical way to define them:
data = read_csv(filename)
y = data['target variable name']
X = data.drop('target variable name', axis=1)
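To see the whole flow end to end, here is a minimal sketch that defines X and y and then runs SelectKBest. It rests on assumptions: the DataFrame is synthetic (it stands in for your CSV), and the column names `feature_0` ... `feature_5` and `target` are hypothetical. It passes `f_classif` explicitly, which is the default score function for SelectKBest; `chi2` would need non-negative feature values. It also labels the score columns so that `nlargest(2, 'score')` can be called with the count and the column name as separate arguments.

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import SelectKBest, f_classif

# Synthetic stand-in for the CSV file; all column names here are hypothetical.
rng = np.random.default_rng(0)
data = pd.DataFrame(
    rng.normal(size=(200, 6)),
    columns=[f"feature_{i}" for i in range(6)],
)
# Hypothetical binary target that depends mostly on feature_0.
data["target"] = (data["feature_0"] + 0.5 * data["feature_3"] > 0).astype(int)

# y is the target column; X is every other column.
y = data["target"]
X = data.drop("target", axis=1)

# Score each feature against the target (ANOVA F-value) and keep the best 5.
selector = SelectKBest(score_func=f_classif, k=5)
fit = selector.fit(X, y)

# Pair each feature name with its score, using named columns.
featurescores = pd.DataFrame({"feature": X.columns, "score": fit.scores_})

# nlargest takes the number of rows and the column name as two arguments.
top2 = featurescores.nlargest(2, "score")
print(top2)
```

With your own data, replace the synthetic block with `data = read_csv(filename)` and use the real name of your target column.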
See the example here:
https://github.com/ASH-WICUS/Notebooks/blob/master/Accuracies%20of%20Different%20Regressors%20-%20Housing%20Prices.ipynb
You can download the sample data here:
https://raw.githubusercontent.com/RuiChang123/Regression_for_house_price_estimation/master/final_data.csv
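You should be able to get it working fairly easily, then modify the code to suit your specific needs. Remember that "X" consists of the independent variables, the ones whose effect on the dependent variable (that is, "y") you want to measure.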
This should help too!
https://mljar.com/blog/feature-importance-in-random-forest/