I am trying to run a supervised machine learning experiment using scikit-learn's SelectKBest, but I am not sure how to create a new dataframe after finding the best features:
Let's assume I want to select the 5 best features:
from sklearn.feature_selection import SelectKBest, f_classif
select_k_best_classifier = SelectKBest(score_func=f_classif, k=5).fit_transform(features_dataframe, targeted_class)
Now, if I add the line:
import pandas as pd
dataframe = pd.DataFrame(select_k_best_classifier)
I get a new dataframe without the feature names (only an index running from 0 to 4), but I want to create a dataframe with the newly selected features, like this:
dataframe = pd.DataFrame(fit_transformed_features, columns=features_names)
My question is: how do I create the features_names list?
I know that I should use:
select_k_best_classifier.get_support()
which returns an array of booleans, where a True value marks a column of the original dataframe that should be selected.
How should I use this boolean array together with the array of all feature names, which I can get via feature_names = list(features_dataframe.columns.values)?
No loop is needed.
# Create and fit selector
selector = SelectKBest(f_classif, k=5)
selector.fit(features_df, target)
# Get columns to keep and create new dataframe with those only
cols_idxs = selector.get_support(indices=True)
features_df_new = features_df.iloc[:,cols_idxs]
For me, this code works fine and feels more "pythonic":
mask = select_k_best_classifier.get_support()
new_features = features_dataframe.columns[mask]
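If you also want the reduced dataframe itself, you can slice with either the names or the mask. A minimal sketch, assuming select_k_best_classifier here is a SelectKBest instance that was fitted with .fit() on features_dataframe (not the array returned by fit_transform):
new_dataframe = features_dataframe[new_features]   # slice by the selected names
# new_dataframe = features_dataframe.loc[:, mask]  # or slice by the boolean mask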
You can do the following:
mask = select_k_best_classifier.get_support()  # list of booleans
new_features = []  # The list of your K best features
for bool_val, feature in zip(mask, feature_names):
    if bool_val:
        new_features.append(feature)
Then use those names when building the dataframe:
dataframe = pd.DataFrame(fit_transformed_features, columns=new_features)
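A minimal end-to-end sketch of this loop-based approach, assuming features_dataframe and targeted_class are defined as in the question:
import pandas as pd
from sklearn.feature_selection import SelectKBest, f_classif

# Fit and transform in one step; the fitted selector keeps the support mask
selector = SelectKBest(score_func=f_classif, k=5)
fit_transformed_features = selector.fit_transform(features_dataframe, targeted_class)

# Collect the names of the selected columns with the loop shown above
feature_names = list(features_dataframe.columns.values)
mask = selector.get_support()
new_features = []
for bool_val, feature in zip(mask, feature_names):
    if bool_val:
        new_features.append(feature)

dataframe = pd.DataFrame(fit_transformed_features, columns=new_features)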
The following code will help you find the K most important features together with their F-scores. Let X be a pandas dataframe whose columns are all the features and y be the list of class labels.
import pandas as pd
from sklearn.feature_selection import SelectKBest, f_classif
# Suppose we select the 5 features with the top 5 F-scores
selector = SelectKBest(f_classif, k=5)
# New dataframe with the selected features, for later use in the classifier.
# fit() works too if you only want the feature names and their corresponding scores.
X_new = selector.fit_transform(X, y)
names = X.columns.values[selector.get_support()]
scores = selector.scores_[selector.get_support()]
names_scores = list(zip(names, scores))
ns_df = pd.DataFrame(data = names_scores, columns=['Feat_names', 'F_Scores'])
#Sort the dataframe for better visualization
ns_df_sorted = ns_df.sort_values(['F_Scores', 'Feat_names'], ascending = [False, True])
print(ns_df_sorted)
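If you also want the reduced dataframe itself, not just the names and scores, a small follow-up to the snippet above (using the X_new and names it already defines):
# Rebuild a dataframe from the selected columns, keeping the original row index
X_new_df = pd.DataFrame(X_new, columns=names, index=X.index)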
Select the 10 best features according to chi2:
from sklearn.feature_selection import SelectKBest, chi2
KBest = SelectKBest(chi2, k=10).fit(X, y)
Get the features using get_support():
f = KBest.get_support(1) #the most important features
Create a new df called X_new:
X_new = X[X.columns[f]]  # final features
In scikit-learn 1.0, transformers have a get_feature_names_out method, which means you can write
dataframe = pd.DataFrame(fit_transformed_features, columns=transformer.get_feature_names_out())
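Put together with the variable names from the question, a minimal sketch assuming scikit-learn >= 1.0 and the features_dataframe / targeted_class names used above:
import pandas as pd
from sklearn.feature_selection import SelectKBest, f_classif

selector = SelectKBest(score_func=f_classif, k=5)
fit_transformed_features = selector.fit_transform(features_dataframe, targeted_class)

# Fitted on a dataframe, the selector knows the input names,
# so get_feature_names_out() returns the names of the kept columns
dataframe = pd.DataFrame(fit_transformed_features,
                         columns=selector.get_feature_names_out(),
                         index=features_dataframe.index)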
There is yet another alternative; however, it is not as fast as the solutions above.
# Use the selector to retrieve the best features
X_new = select_k_best_classifier.fit_transform(train[feature_cols], train['is_attributed'])

# Get back the kept features as a DataFrame with dropped columns as all 0s
selected_features = pd.DataFrame(select_k_best_classifier.inverse_transform(X_new),
                                 index=train.index,
                                 columns=feature_cols)
selected_columns = selected_features.columns[selected_features.var() != 0]
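The names in selected_columns can then be used to slice the original dataframe; a small follow-up, assuming the train dataframe from the snippet above:
# Keep only the columns that survived the selection
train_selected = train[selected_columns]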
# Fit the SelectKBest instance
select_k_best_classifier = SelectKBest(score_func=f_classif, k=5).fit(features_dataframe, targeted_class)
# Extract the required features
new_features = select_k_best_classifier.get_feature_names_out(features_names)
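To turn this into the dataframe the question asks for, a minimal follow-up sketch using the fitted selector from above:
import pandas as pd

# transform() keeps only the selected columns; rebuild a dataframe around them
dataframe = pd.DataFrame(select_k_best_classifier.transform(features_dataframe),
                         columns=new_features,
                         index=features_dataframe.index)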
Assume you want to select the 10 best features:
import pandas as pd
from sklearn.feature_selection import SelectKBest, chi2

selector = SelectKBest(score_func=chi2, k=10)
X_new = selector.fit_transform(X, y)

# get_feature_names_out() returns only the selected names
# (feature_names_in_ would list all input features, not just the 10 kept ones)
features_names = selector.get_feature_names_out()
print(features_names)
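If you also want the reduced data as a dataframe rather than just the names, a small follow-up to the snippet above (assuming X is a pandas dataframe):
# Wrap the selected columns back into a dataframe with the matching names
X_new_df = pd.DataFrame(X_new, columns=features_names, index=X.index)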