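Below is my Boruta implementation in Python. Boruta is a feature selection method that removes correlated, useless, and redundant variables, so that only the relevant features of a dataset are kept before running an ML algorithm or data analysis.
Basically, if my df looks like this: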
df
Feature 1 Feature 2 Feature 3 Feature 4................Feature 700
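then after running Boruta, feat_selector.support_ gives me a boolean array:
[True, False, True, ..., False]
This means the first and third features were selected, while the second and the 700th were not. But I do not get the column names as they appear in the original df, such as Feature 1, Feature 2, and so on.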
# NOTE: BorutaPy accepts numpy arrays only. If X_train and y_train are pandas objects, convert them with .values first.
X_train = X_train.values
y_train = y_train.values
from sklearn.ensemble import RandomForestClassifier
from boruta import BorutaPy
rf = RandomForestClassifier(n_jobs=-1, class_weight='balanced', max_depth=5)
# define Boruta feature selection method
feat_selector = BorutaPy(rf, n_estimators='auto', verbose=2, random_state=1)
# find all relevant features
feat_selector.fit(X_train, y_train)
# check selected features
[IN]feat_selector.support_
[OUT]
array([False, False, False, False, False, False, False, False, False,
False, False, True, False, False, False, False, False, False,
False, False, False,
.................. False, False, False, False, True])
More code:
[IN]print(feat_selector.n_features_)
[OUT]441  # in my case, 441 of the 700 features were selected
# call transform() on X to filter it down to selected features
[IN]X_filtered = feat_selector.transform(X_train)
[OUT]
[[ 0 0 0 ... 0 0 0]
[24 6 0 ... 0 0 0]
[ 0 0 0 ... 43 0 0]
...
[ 0 0 0 ... 0 0 0]]
So basically feat_selector.support_ tells me which features were selected, but I do not get the column names as they were in the original X_train. How can I keep the column names?
Judging from the source code, support_ is a boolean mask array:
support_ : array of shape [n_features]
The mask of selected features - only confirmed ones are True.
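To make the idea of a mask concrete, here is a minimal sketch in plain NumPy. The array `X` and the values in `mask` are made up for illustration; `mask` only stands in for `feat_selector.support_`, and no Boruta run is involved. Indexing the column axis with the mask should give the same columns that transform() keeps, and the number of True entries should correspond to n_features_.

```python
import numpy as np

# Toy feature matrix: 3 samples, 4 features (illustrative data only).
X = np.array([[0, 24, 0, 5],
              [1, 6, 0, 7],
              [0, 0, 43, 9]])

# Hypothetical mask standing in for feat_selector.support_.
mask = np.array([True, False, True, False])

# Indexing the column axis with the mask keeps only the True columns.
X_selected = X[:, mask]

# Positions of the selected features, and how many there are.
selected_idx = np.flatnonzero(mask)
n_selected = int(mask.sum())

print(X_selected.shape, selected_idx, n_selected)
```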
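So you can use it to index the column names and get the names of the selected features. Do this on the original DataFrame, that is, before X_train is overwritten with X_train.values, because a numpy array has no .columns attribute:
X_train.columns[feat_selector.support_]
This returns the names of the selected columns.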
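A small sketch of the whole idea with pandas, under a few assumptions: the DataFrame is kept under its own name (`X_train_df`, a hypothetical name) instead of being overwritten by `.values`, and `support_mask` is a hard-coded stand-in for `feat_selector.support_` after fitting. The toy data has 4 features rather than 700.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the original X_train; keep it as a DataFrame.
X_train_df = pd.DataFrame(
    [[0, 24, 0, 5], [1, 6, 0, 7], [0, 0, 43, 9]],
    columns=["Feature 1", "Feature 2", "Feature 3", "Feature 4"],
)

# Hypothetical mask standing in for feat_selector.support_ after fit().
# With Boruta you would fit on X_train_df.values and read the mask afterwards.
support_mask = np.array([True, False, True, False])

# Boolean-index the column Index to recover the selected names.
selected_columns = X_train_df.columns[support_mask].tolist()

# Keep only the selected columns, with their labels intact.
X_filtered_df = X_train_df.loc[:, support_mask]

print(selected_columns)
print(X_filtered_df.head())
```

Passing `X_train_df.values` to fit() and leaving `X_train_df` untouched means the column labels are still available afterwards, which is the only thing the mask needs.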