ValueError: Input contains NaN, infinity or a value too large for dtype('float64') with scikit-learn

# SpectralCoclustering is importable from sklearn.cluster directly;
# the sklearn.cluster.bicluster path is no longer available in recent releases.
from sklearn.cluster import SpectralCoclustering
from sklearn.feature_extraction.text import TfidfVectorizer


def number_normalizer(tokens):
    """
    Map all numeric tokens to a placeholder.
    For many applications, tokens that begin with a number are not directly
    useful, but the fact that such a token exists can be relevant.  By applying
    this form of dimensionality reduction, some methods may perform better.
    """
    return ("#NUMBER" if token[0].isdigit() else token for token in tokens)


class NumberNormalizingVectorizer(TfidfVectorizer):
    def build_tokenizer(self):
        tokenize = super(NumberNormalizingVectorizer, self).build_tokenizer()
        return lambda doc: list(number_normalizer(tokenize(doc)))


# data: the collection of about 30k tweet strings (loading code not shown)
vectorizer = NumberNormalizingVectorizer(stop_words='english', min_df=5)
cocluster = SpectralCoclustering(n_clusters=5, svd_method='arpack', random_state=0)
X = vectorizer.fit_transform(data)
cocluster.fit(X)

I chose SpectralCoclustering to cluster about 30k tweets, and everything works until the data X is passed into cocluster.fit.

It raises the error shown below.

.env/lib/python3.5/site-packages/sklearn/utils/validation.py", line 43, in _assert_all_finite
    " or a value too large for %r." % X.dtype)
ValueError: Input contains NaN, infinity or a value too large for dtype('float64').

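Then I typed in the check from the place where the error is reported (linked below), but it returns "False". It should be True when the error occurs, right?

So how else can I track down the error? Thanks!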

https://github.com/scikit-learn/scikit-learn/blob/main/main/sklearn/utils/validation.py#l43

>>> X.dtype.char in np.typecodes['AllFloat'] and not np.isfinite(X.sum()) and not np.isfinite(X).all()
False
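One reason the expression above can evaluate to False is that `and` short-circuits: as soon as `X.sum()` is finite, the remaining terms are never evaluated. The following is a minimal diagnostic sketch, not the library's own check. It looks at the stored values of a sparse matrix directly and also counts all-zero rows and columns, on the assumption (not confirmed in this thread) that a normalization step dividing by row or column sums could turn such rows into inf. The function name `diagnose` and the toy matrix are hypothetical.

```python
import numpy as np
from scipy import sparse


def diagnose(X):
    """Count non-finite stored values and all-zero rows/columns (hypothetical helper)."""
    X = sparse.csr_matrix(X)
    # Stored entries that are NaN or +/-inf.
    non_finite = int((~np.isfinite(X.data)).sum())
    # Rows/columns whose absolute sum is exactly zero, i.e. empty documents or unused terms.
    row_sums = np.asarray(abs(X).sum(axis=1)).ravel()
    col_sums = np.asarray(abs(X).sum(axis=0)).ravel()
    return {
        "non_finite": non_finite,
        "empty_rows": int((row_sums == 0).sum()),
        "empty_cols": int((col_sums == 0).sum()),
    }


# Toy stand-in for the tf-idf matrix: the second row and the third column are empty.
toy = np.array([[0.0, 1.5, 0.0],
                [0.0, 0.0, 0.0],
                [2.0, 0.0, 0.0]])
report = diagnose(toy)
print(report)
```

If `empty_rows` is nonzero for the real matrix, dropping those rows before fitting would be one thing to try; whether that resolves this particular error is not established here.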

Today I ran into the same error in another sklearn module, and the following troubleshooting steps helped me:

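1. Try to reproduce the error with uninteresting input data. Replace your 30k large numbers with 30 integers between 0 and 10.
   (In my case, I could not reproduce the error this way.)
2. Check your data for inf/NaN values. If there are any, replace them with a constant. For example, replace inf with LARGE_NUMBER.
   (In my case, the error still did not go away.)
3. Make your constant LARGE_NUMBER smaller. If your actual data ranges from -100 to 100, it can make a difference whether the constant is 10^100 or 10^10.
   (In my case, the error message changed (first success). So I made my LARGE_NUMBER even smaller, and then the error went away.)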
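The replacement steps above could be sketched roughly as follows. This is an illustration under stated assumptions, not code from the answer: the helper name `replace_non_finite`, the value chosen for `LARGE_NUMBER`, and the decision to map NaN to 0.0 are all hypothetical choices.

```python
import numpy as np

# Hypothetical constant; pick something only modestly larger than the real data range.
LARGE_NUMBER = 1e10


def replace_non_finite(values, large_number=LARGE_NUMBER):
    """Return a float64 copy with NaN -> 0.0 and +/-inf -> +/-large_number."""
    cleaned = np.array(values, dtype=np.float64)  # np.array copies by default
    cleaned[np.isnan(cleaned)] = 0.0              # assumption: 0.0 is a neutral value
    cleaned[np.isposinf(cleaned)] = large_number
    cleaned[np.isneginf(cleaned)] = -large_number
    return cleaned


# Trivial stand-in data for the "uninteresting input" step: 30 integers in [0, 10].
rng = np.random.RandomState(0)
toy = rng.randint(0, 11, size=30).astype(np.float64)

raw = np.array([1.0, np.inf, -np.inf, np.nan, -3.5])
cleaned = replace_non_finite(raw)
print(cleaned)
```

If the error persists after this replacement, lowering `large_number` and rerunning mirrors the last step of the list.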

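My guess (in my case) is that the sklearn module sometimes uses an exponential function internally, which leads to this behavior. So your method may end up calling another function with inf as an argument, even though your input contains no such value.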
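To illustrate that guess, here is a small NumPy-only sketch showing how a perfectly finite input can become inf after an intermediate computation. It does not reproduce what any particular sklearn module does internally; the use of `np.exp` and the value of `LARGE_NUMBER` are assumptions for demonstration.

```python
import numpy as np

LARGE_NUMBER = 1e10  # hypothetical placeholder value, finite in float64

inputs = np.array([1.0, LARGE_NUMBER])

# exp overflows float64 for arguments above roughly 709, so the second entry becomes inf.
with np.errstate(over="ignore"):
    outputs = np.exp(inputs)

print(np.isfinite(inputs).all())   # the input itself passes a finiteness check
print(np.isfinite(outputs))        # the intermediate result does not
```

This is why shrinking the placeholder constant can make the error disappear even though the original input never contained inf.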
