Keeping column names after Boruta-py feature selection in Python



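Below is the Boruta implementation in Python. It is a feature selection method that weeds out correlated, useless, and redundant variables, and helps keep only the relevant features of a dataset before running an ML algorithm or a data analysis.

Basically, if my df looks like this: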

df
Feature 1    Feature 2     Feature 3    Feature 4................Feature 700

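then after running Boruta I get an array:

[True, False, True, ..., False] etc., from feat_selector.support_

This says that the first and third features were selected, while the second and the 700th were not. But I don't get the column names from my original df, such as Feature 1, Feature 2, etc.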

from boruta import BorutaPy
from sklearn.ensemble import RandomForestClassifier

# NOTE: BorutaPy accepts NumPy arrays only. If X_train and y_train are pandas
# objects, convert them with the .values attribute first.
X_train = X_train.values
y_train = y_train.values
rf = RandomForestClassifier(n_jobs=-1, class_weight='balanced', max_depth=5)
# define Boruta feature selection method
feat_selector = BorutaPy(rf, n_estimators='auto', verbose=2, random_state=1)
# find all relevant features 
feat_selector.fit(X_train, y_train)
# check selected features 
[IN]feat_selector.support_
[OUT]
array([False, False, False, False, False, False, False, False, False,
False, False, True, False, False, False, False, False, False,
False, False, False,
.................. False, False, False, False, True])

More code:

[IN]print (feat_selector.n_features_)
[OUT]441 #441 features were selected out of total 700 in my case.
# call transform() on X to filter it down to selected features
[IN]X_filtered = feat_selector.transform(X_train)
[OUT]
[[ 0  0  0 ...  0  0  0]
[24  6  0 ...  0  0  0]
[ 0  0  0 ... 43  0  0]
...
[ 0  0  0 ...  0  0  0]]

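So basically I get the selected features as a mask in feat_selector.support_, but Boruta does not give me the column names from the original X_train. How can I keep the column names?

Judging from the source code, support_ is a boolean mask array.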

support_ : array of shape [n_features]
The mask of selected features - only confirmed ones are True.

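So you can apply this mask to the column names to get the feature names:

X_train.columns[feat_selector.support_]

This returns the names of the selected columns. Note that X_train must still be the pandas DataFrame at this point. If it has already been overwritten with X_train.values, as in the question, save X_train.columns before converting.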
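As a minimal sketch of the idea, the snippet below uses a tiny made-up DataFrame and a hard-coded boolean array standing in for feat_selector.support_, so it runs without fitting Boruta. The names X_train_df, support, and X_filtered_df are hypothetical, and the column slice is only a stand-in for feat_selector.transform(), on the assumption that transform keeps the masked columns in their original order.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the original X_train DataFrame (made-up data).
X_train_df = pd.DataFrame(
    {
        "Feature 1": [0, 24, 0],
        "Feature 2": [1, 6, 0],
        "Feature 3": [0, 0, 43],
        "Feature 4": [5, 5, 5],
    }
)

# Hard-coded stand-in for feat_selector.support_ after fitting (assumption).
support = np.array([True, False, True, False])

# Boolean-mask the column Index to recover the selected feature names.
selected_columns = X_train_df.columns[support]

# Stand-in for feat_selector.transform(X_train_df.values):
# keep only the columns where the mask is True.
X_filtered = X_train_df.values[:, support]

# Re-attach the column names and the original row index to the filtered array.
X_filtered_df = pd.DataFrame(
    X_filtered, columns=selected_columns, index=X_train_df.index
)

print(list(selected_columns))  # ['Feature 1', 'Feature 3']
```

Keeping the DataFrame under its own name and passing only .values to the selector avoids losing the column labels in the first place.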
