I am trying to run this code on a dataset relating corona cases to corona deaths. I can't see any reason why the error should be caused by the way I split the X and y DataFrames, but I also don't fully understand the error.
Does anyone know what is going wrong here?
import numpy as np
import pandas as pd
from sklearn.ensemble import HistGradientBoostingClassifier
from sklearn import preprocessing
X_test = pd.read_csv("test.csv")
y_output = pd.read_csv("sample_submission.csv")
data_train = pd.read_csv("train.csv")
X_train = data_train.drop(columns=["Next Week's Deaths"])
y_train = data_train["Next Week's Deaths"]
#prepare for fit (transform Location strings into classes)
Location = data_train["Location"]
le = preprocessing.LabelEncoder()
le.fit(Location)
LocationToInt = le.transform(Location)
LocationDict = dict(zip(Location, LocationToInt))
X_train["Location"] = X_train["Location"].replace(LocationDict)
#train and run
model = HistGradientBoostingClassifier(max_bins=255, max_iter=100)
model.fit(X_train, y_train)
Traceback:
Input In [89], in <cell line: 29>()
27 #train and run
28 model = HistGradientBoostingClassifier(max_bins=255, max_iter=100)
---> 29 model.fit(X_train, y_train)
File ~\AppData\Local\Programs\Python\Python310\lib\site-packages\sklearn\ensemble\_hist_gradient_boosting\gradient_boosting.py:348, in BaseHistGradientBoosting.fit(self, X, y, sample_weight)
343 # Save the state of the RNG for the training and validation split.
344 # This is needed in order to have the same split when using
345 # warm starting.
347 if sample_weight is None:
--> 348 X_train, X_val, y_train, y_val = train_test_split(
349 X,
350 y,
351 test_size=self.validation_fraction,
352 stratify=stratify,
353 random_state=self._random_seed,
354 )
355 sample_weight_train = sample_weight_val = None
356 else:
357 # TODO: incorporate sample_weight in sampling here, as well as
358 # stratify
File ~\AppData\Local\Programs\Python\Python310\lib\site-packages\sklearn\model_selection\_split.py:2454, in train_test_split(test_size, train_size, random_state, shuffle, stratify, *arrays)
2450 CVClass = ShuffleSplit
2452 cv = CVClass(test_size=n_test, train_size=n_train, random_state=random_state)
-> 2454 train, test = next(cv.split(X=arrays[0], y=stratify))
2456 return list(
2457 chain.from_iterable(
2458 (_safe_indexing(a, train), _safe_indexing(a, test)) for a in arrays
2459 )
2460 )
File ~\AppData\Local\Programs\Python\Python310\lib\site-packages\sklearn\model_selection\_split.py:1613, in BaseShuffleSplit.split(self, X, y, groups)
1583 """Generate indices to split data into training and test set.
1584
1585 Parameters
(...)
1610 to an integer.
1611 """
1612 X, y, groups = indexable(X, y, groups)
-> 1613 for train, test in self._iter_indices(X, y, groups):
1614 yield train, test
File ~\AppData\Local\Programs\Python\Python310\lib\site-packages\sklearn\model_selection\_split.py:1953, in StratifiedShuffleSplit._iter_indices(self, X, y, groups)
1951 class_counts = np.bincount(y_indices)
1952 if np.min(class_counts) < 2:
-> 1953 raise ValueError(
1954 "The least populated class in y has only 1"
1955 " member, which is too few. The minimum"
1956 " number of groups for any class cannot"
1957 " be less than 2."
1958 )
1960 if n_train < n_classes:
1961 raise ValueError(
1962 "The train_size = %d should be greater or "
1963 "equal to the number of classes = %d" % (n_train, n_classes)
1964 )
ValueError: The least populated class in y has only 1 member, which is too few. The minimum number of groups for any class cannot be less than 2.
HistGradientBoostingClassifier internally splits the dataset into a training and a validation set. The validation split defaults to 10% of the data (check the validation_fraction parameter in the docs).
In your case, one of the classes has only a single example, so if it goes into the training split the classifier cannot validate on that class, and vice versa. The bottom line: you need at least two examples of every class.
How to fix it? Well, first you need a proper diagnosis. Run the following code to see which class is the problem:
import numpy as np
unq, cnt = np.unique(y_train, return_counts=True)
for u, c in zip(unq, cnt):
    print(f"class {u} contains {c} samples")
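If y_train is a pandas Series (as it is when taken from a DataFrame column), value_counts gives the same diagnosis in one line; a small sketch with made-up toy labels:

```python
import pandas as pd

y_train = pd.Series([0, 0, 1, 1, 1, 2])  # toy labels for illustration
counts = y_train.value_counts()
print(counts[counts < 2])  # classes that will trigger the error
```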
Now what? Well, first make sure those counts make sense to you and that there is no earlier bug (perhaps your CSV was read incorrectly, or data is missing).
Then, if the problem persists, your options are the following:
- Collect more data. Not always possible, but it is the best option.
- Add synthetic data. For example, imblearn is an sklearn-like library for dealing with imbalanced problems like yours; it provides several well-known oversampling methods. You can also create your own synthetic data, since you know what it represents.
- Remove the class that has only one example. This means reframing your problem slightly, but it may work: just drop that row. You could also relabel it as the closest label; for example, if you have positive, negative and neutral classes and a single neutral example, perhaps you can relabel it as negative.
- Group the low-cardinality classes. This is useful when you have several classes, say 10, of which a few, say 3, have very few examples. You can merge those low-cardinality classes into a single "other" class, turning your problem into a similar one with fewer but better-populated classes: in this example you would have 8 classes instead of 10.
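The last two options can be sketched with plain pandas. The helper names here are hypothetical, and the naive row duplication is only a stand-in for imblearn's proper oversamplers:

```python
import pandas as pd

def oversample_rare(df, label_col, min_count=2, seed=0):
    """Duplicate rows of classes that have fewer than min_count samples."""
    counts = df[label_col].value_counts()
    rare = counts[counts < min_count].index
    extra = [df[df[label_col] == cls].sample(min_count - counts[cls],
                                             replace=True, random_state=seed)
             for cls in rare]
    return pd.concat([df] + extra, ignore_index=True)

def group_rare(labels, min_count=2, other="other"):
    """Relabel classes with fewer than min_count samples as `other`."""
    counts = labels.value_counts()
    return labels.where(~labels.isin(counts[counts < min_count].index), other)

df = pd.DataFrame({"x": range(6), "label": ["a", "a", "a", "b", "b", "c"]})
print(oversample_rare(df, "label")["label"].value_counts().min())  # now >= 2
print(group_rare(df["label"]).value_counts())  # "c" becomes "other"
```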
Which is the best alternative? It really depends on your problem.
EDIT: the answer above assumes you are solving a classification problem (predicting which class an example belongs to). If you are instead solving a regression task (predicting a quantity), replace HistGradientBoostingClassifier with HistGradientBoostingRegressor.