"The least populated class in y has only 1 ... groups for any class cannot be less than 2." without trai



I am trying to run this code with a dataset relating corona cases to corona deaths. I can't find any reason why the error should arise from the way I handle the split of the X and y DataFrames, but I also don't fully understand the error.

Does anyone know what is going wrong here?

import numpy as np
import pandas as pd
from sklearn.ensemble import HistGradientBoostingClassifier
from sklearn import preprocessing

# read CSVs
X_test = pd.read_csv("test.csv")
y_output = pd.read_csv("sample_submission.csv")
data_train = pd.read_csv("train.csv")
X_train = data_train.drop(columns=["Next Week's Deaths"])
y_train = data_train["Next Week's Deaths"]

# prepare for fit (transform Location strings into classes)
Location = data_train["Location"]
le = preprocessing.LabelEncoder()
le.fit(Location)
LocationToInt = le.transform(Location)
LocationDict = dict(zip(Location, LocationToInt))
X_train["Location"] = X_train["Location"].replace(LocationDict)

# train and run
model = HistGradientBoostingClassifier(max_bins=255, max_iter=100)
model.fit(X_train, y_train)

Traceback:

Input In [89], in <cell line: 29>()
     27 #train and run
     28 model = HistGradientBoostingClassifier(max_bins=255, max_iter=100)
---> 29 model.fit(X_train, y_train)
File ~\AppData\Local\Programs\Python\Python310\lib\site-packages\sklearn\ensemble\_hist_gradient_boosting\gradient_boosting.py:348, in BaseHistGradientBoosting.fit(self, X, y, sample_weight)
    343 # Save the state of the RNG for the training and validation split.
    344 # This is needed in order to have the same split when using
    345 # warm starting.
    347 if sample_weight is None:
--> 348     X_train, X_val, y_train, y_val = train_test_split(
    349         X,
    350         y,
    351         test_size=self.validation_fraction,
    352         stratify=stratify,
    353         random_state=self._random_seed,
    354     )
    355     sample_weight_train = sample_weight_val = None
    356 else:
    357     # TODO: incorporate sample_weight in sampling here, as well as
    358     # stratify
File ~\AppData\Local\Programs\Python\Python310\lib\site-packages\sklearn\model_selection\_split.py:2454, in train_test_split(test_size, train_size, random_state, shuffle, stratify, *arrays)
   2450         CVClass = ShuffleSplit
   2452     cv = CVClass(test_size=n_test, train_size=n_train, random_state=random_state)
-> 2454     train, test = next(cv.split(X=arrays[0], y=stratify))
   2456 return list(
   2457     chain.from_iterable(
   2458         (_safe_indexing(a, train), _safe_indexing(a, test)) for a in arrays
   2459     )
   2460 )
File ~\AppData\Local\Programs\Python\Python310\lib\site-packages\sklearn\model_selection\_split.py:1613, in BaseShuffleSplit.split(self, X, y, groups)
   1583 """Generate indices to split data into training and test set.
   1584 
   1585 Parameters
   (...)
   1610 to an integer.
   1611 """
   1612 X, y, groups = indexable(X, y, groups)
-> 1613 for train, test in self._iter_indices(X, y, groups):
   1614     yield train, test
File ~\AppData\Local\Programs\Python\Python310\lib\site-packages\sklearn\model_selection\_split.py:1953, in StratifiedShuffleSplit._iter_indices(self, X, y, groups)
   1951 class_counts = np.bincount(y_indices)
   1952 if np.min(class_counts) < 2:
-> 1953     raise ValueError(
   1954         "The least populated class in y has only 1"
   1955         " member, which is too few. The minimum"
   1956         " number of groups for any class cannot"
   1957         " be less than 2."
   1958     )
   1960 if n_train < n_classes:
   1961     raise ValueError(
   1962         "The train_size = %d should be greater or "
   1963         "equal to the number of classes = %d" % (n_train, n_classes)
   1964     )
ValueError: The least populated class in y has only 1 member, which is too few. The minimum number of groups for any class cannot be less than 2.


HistGradientBoostingClassifier internally splits the dataset into training and validation sets. The default validation size is 10% (check out the validation_fraction parameter in the docs).

In your case, there is a class with only one element, so if it goes into the training split the classifier cannot validate on that class, and vice versa. The point is: you need at least two examples of each class.
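The failure can be reproduced in isolation. Here is a minimal sketch, using hypothetical toy data (not the asker's dataset), of the stratified split that fit performs internally:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical toy data: class 2 has only one member.
X = np.arange(10).reshape(-1, 1)
y = np.array([0, 0, 0, 0, 1, 1, 1, 1, 1, 2])

# Stratified splitting requires at least 2 members per class,
# so this raises the same ValueError as in the traceback.
try:
    train_test_split(X, y, test_size=0.1, stratify=y)
except ValueError as e:
    print(e)
```

With the singleton class removed from y, the same call succeeds.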

How to fix it? Well, first you need a proper diagnosis: run the following code to see which class is the problem:

import numpy as np
unq, cnt = np.unique(y_train, return_counts=True)
for u, c in zip(unq, cnt):
    print(f"class {u} contains {c}")

Now what? Well, first make sure those results make sense to you and that there was no earlier mistake (maybe you misread your CSV, or lost data earlier).

Then, if the problem persists, your options are the following:

  • Collect more data. Not always possible, but this is the best.

  • Add synthetic data. For example, imblearn is a sklearn-like library for handling imbalance problems like yours. It provides several well-known oversampling methods. You can also create your own synthetic data, since you know what it represents.

  • Remove the class with one example. This means reframing your problem a little, but it may work. Just drop that row. You could also relabel it as one of the closest labels; for example, if you have positive, negative, and neutral classes and a single neutral example, maybe you can relabel it as negative.

  • Group the low-cardinality classes. This is useful when you have several classes, say 10, and some of them, say 3, have very few examples. You can merge those low-cardinality classes into a single class "other" and convert your problem into a similar one with fewer but better-populated classes; in this example you would have 8 classes instead of 10.
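The last option can be sketched with pandas; the label series and the threshold below are hypothetical, chosen only to show the grouping step:

```python
import pandas as pd

# Hypothetical label series; classes "d" and "e" each occur once.
y = pd.Series(["a", "a", "a", "b", "b", "c", "c", "d", "e"])

# Merge every class with fewer than 2 examples into an "other" bucket.
counts = y.value_counts()
rare = counts[counts < 2].index
y_grouped = y.where(~y.isin(rare), "other")

print(y_grouped.value_counts())
```

After grouping, every remaining class has at least two members, so the stratified validation split inside fit can succeed.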

Which alternative is best? It really depends on your problem.

EDIT: The answer above assumes you are solving a classification problem (telling which class an example belongs to). If you are solving a regression task (predicting a quantity), replace HistGradientBoostingClassifier with HistGradientBoostingRegressor.
