I'm trying to cluster some text documents using scikit-learn. I'm trying out both DBSCAN and MeanShift and want to determine which hyperparameters (e.g. bandwidth for MeanShift and eps for DBSCAN) work best for the kind of data I'm using (news articles).

I have some testing data which consists of pre-labeled clusters. I have been trying to use scikit-learn's GridSearchCV, but don't understand how (or whether) it can be applied in this case, since it needs to split the test data, whereas I want to run the evaluation on the entire dataset and compare the results against the pre-labeled data.

I have been trying to specify a scoring function that compares the estimator's labels to the true labels, but of course it doesn't work, because only a sample of the data gets clustered, not all of it.

What would be an appropriate approach here?
The following function for DBSCAN might help. I wrote it to iterate over the hyperparameters eps and min_samples, and it includes optional arguments for the minimum and maximum number of clusters. Since DBSCAN is unsupervised, I have not included an evaluation parameter.
from sklearn.cluster import DBSCAN


def dbscan_grid_search(X_data, lst, clst_count, eps_space=(0.5,),
                       min_samples_space=(5,), min_clust=0, max_clust=10):
    """
    Performs a hyperparameter grid search for DBSCAN.

    Parameters:
        * X_data            = data used to fit the DBSCAN instance
        * lst               = a list to store the results of the grid search
        * clst_count        = a list to store the number of non-noise clusters
        * eps_space         = the range of values for the eps parameter
        * min_samples_space = the range of values for the min_samples parameter
        * min_clust         = the minimum number of clusters required after each search iteration in order for a result to be appended to lst
        * max_clust         = the maximum number of clusters allowed after each search iteration in order for a result to be appended to lst

    Example:

    # Loading libraries
    from sklearn import datasets
    from sklearn.preprocessing import StandardScaler
    import numpy as np

    # Loading iris dataset
    iris = datasets.load_iris()
    X = iris.data[:, :]
    y = iris.target

    # Scaling X data
    dbscan_scaler = StandardScaler()
    dbscan_scaler.fit(X)
    dbscan_X_scaled = dbscan_scaler.transform(X)

    # Setting empty lists in global environment
    dbscan_clusters = []
    cluster_count = []

    # Inputting function parameters
    dbscan_grid_search(X_data=dbscan_X_scaled,
                       lst=dbscan_clusters,
                       clst_count=cluster_count,
                       eps_space=np.arange(0.1, 5, 0.1),
                       min_samples_space=np.arange(1, 50, 1),
                       min_clust=3,
                       max_clust=6)
    """
    # Importing Counter to count the amount of data in each cluster
    from collections import Counter

    # Starting a tally of total iterations
    n_iterations = 0

    # Looping over each combination of hyperparameters
    for eps_val in eps_space:
        for samples_val in min_samples_space:
            dbscan_grid = DBSCAN(eps=eps_val,
                                 min_samples=samples_val)

            # Fitting and predicting cluster labels
            clusters = dbscan_grid.fit_predict(X=X_data)

            # Counting the amount of data in each cluster
            cluster_count = Counter(clusters)

            # Saving the number of clusters (noise points, labeled -1, are excluded)
            n_clusters = len(set(clusters)) - (1 if -1 in clusters else 0)

            # Increasing the iteration tally with each run of the loop
            n_iterations += 1

            # Appending to lst each time the n_clusters criteria is met
            if min_clust <= n_clusters <= max_clust:
                lst.append([eps_val,
                            samples_val,
                            n_clusters])
                clst_count.append(cluster_count)

    # Printing grid search summary information
    print(f"Search complete.\nYour list is now of length {len(lst)}.")
    print(f"Hyperparameter combinations checked: {n_iterations}.\n")
Have you considered implementing the search yourself?

Implementing a for loop is not particularly hard, and even if you want to optimize two parameters it is still fairly easy.
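As a minimal sketch of such a loop (X and y_true are placeholders for your vectorized documents and pre-labeled clusters, and adjusted_rand_score is just one possible way to compare the two labelings):

import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.metrics import adjusted_rand_score

best = (None, None, -1.0)  # (eps, min_samples, score)
for eps in np.arange(0.1, 2.0, 0.1):
    for min_samples in range(2, 10):
        labels = DBSCAN(eps=eps, min_samples=min_samples).fit_predict(X)
        # Compare the clustering against the pre-labeled clusters
        score = adjusted_rand_score(y_true, labels)
        if score > best[2]:
            best = (eps, min_samples, score)
print("best eps / min_samples / ARI:", best)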
However, for both DBSCAN and MeanShift, I advise first understanding your similarity measure. It makes more sense to choose the parameters based on an understanding of that measure than to tune them to match some labels (which carries a high risk of overfitting).

In other words: at what distance do you consider two articles to belong to the same cluster?

If this distance varies too much between different data points, these algorithms will fail badly, and you may need to find a normalized distance function so that the actual similarity values are meaningful. TF-IDF is the standard for text, but mostly in a retrieval context; it may work considerably worse in a clustering context.
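Purely as an illustration of one way to get such a bounded distance for text (the documents list is a placeholder): TfidfVectorizer l2-normalizes its rows by default, so the cosine distance between two articles stays in [0, 1] and eps becomes directly interpretable.

from sklearn.cluster import DBSCAN
from sklearn.feature_extraction.text import TfidfVectorizer

documents = ["first news article ...", "second news article ...", "..."]  # placeholder texts
tfidf = TfidfVectorizer().fit_transform(documents)  # rows are l2-normalized by default
# eps now reads as "how dissimilar two articles may be and still count as neighbors"
labels = DBSCAN(eps=0.7, min_samples=2, metric="cosine").fit_predict(tfidf)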
Also beware that MeanShift (similar to k-means) needs to recompute coordinates; on text data, this may yield undesirable results, where the updated coordinates actually become worse instead of better.
You can pass GridSearchCV's cv parameter as "an iterable yielding (train, test) splits as arrays of indices" (quoting the documentation).

For DBSCAN in particular there is another problem: it has no predict method. I use the solution from this answer.

Here is the sample code.
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import make_scorer
# The scorer function: counts how many predicted labels match the ground truth
def cmp(y_true, y_pred):
    return np.sum(y_pred == y_true)

class DBSCANWrapper(DBSCAN):
    # Won't work if `_X` is not the same X used in `self.fit`
    def predict(self, _X, _y=None):
        return self.labels_
# Let X be your data to cluster, e.g.:
X = np.random.rand(100, 10)
# Let y_true be the groundtruth clustering result, e.g.:
y_true = np.random.randint(5, size=100)
# hyper parameters to search, e.g.:
hyperparams_dict = {'eps': np.linspace(0.1, 1.0, 10)}
# Notice here, the spec of `cv`:
cv = [(np.arange(X.shape[0]), np.arange(X.shape[0]))]
search = GridSearchCV(DBSCANWrapper(), hyperparams_dict, scoring=make_scorer(cmp), cv=cv)
search.fit(X, y_true)
print(search.best_params_)
As for "but of course it doesn't work, because only a sample of the data gets clustered, not all of it": if you do want to fit on a training set and evaluate on a test set different from the training set (which of course won't work for DBSCAN), the solution above also works; just modify the cv = ... line.
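For instance, a single fixed split could be passed like this (purely illustrative index ranges; as noted, this is not meaningful for DBSCAN itself):

train_idx = np.arange(0, 80)
test_idx = np.arange(80, 100)
cv = [(train_idx, test_idx)]  # GridSearchCV fits on train_idx and scores on test_idx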