What is the difference between "term frequency" and "document frequency"?

Edit: this is the question I ultimately meant to ask: "Understanding min_df and max_df in scikit CountVectorizer"

I was reading the scikit-learn documentation for CountVectorizer and noticed that, where it discusses max_df, it is concerned with a token's document frequency:

max_df : float in range [0.0, 1.0] or int, default=1.0
When building the vocabulary ignore terms that have a document frequency strictly higher than the given threshold (corpus-specific stop words). If float, the parameter represents a proportion of documents, integer absolute counts. This parameter is ignored if vocabulary is not None.

But when it comes to max_features, it is the term frequency that matters:

max_features : int or None, default=None
If not None, build a vocabulary that only consider the top max_features ordered by term frequency across the corpus.

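I'm confused: if we use max_df and set it to, say, 10, aren't we saying "ignore any token that appears more than 10 times"?

And if we set max_features to 100, aren't we saying "use only the 100 tokens that appear most often in the corpus"?

If I've got that right... then what is the difference in wording between "term frequency" and "document frequency"?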

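When you set max_df to 10, you are saying "ignore any token that appears in more than 10 documents". Here you do not consider how many times the token occurs within each document, only how many documents it occurs in.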
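To make that concrete, here is a minimal pure-Python sketch of document frequency and an integer max_df-style filter. It is not scikit-learn's actual implementation; the toy corpus, the whitespace tokenization, and the helper name `document_frequency` are all assumptions made for illustration.

```python
from collections import Counter

# Toy corpus (made up): "apple" is repeated many times, but only in one document.
docs = [
    "apple apple apple apple apple banana",
    "banana cherry",
    "banana cherry",
    "banana date",
]

def document_frequency(documents):
    """Count, for each token, the number of documents that contain it."""
    df = Counter()
    for doc in documents:
        # set() collapses repeats, so a document adds at most 1 per token.
        df.update(set(doc.split()))
    return df

df = document_frequency(docs)

max_df = 3  # integer threshold: an absolute number of documents
# Keep tokens whose document frequency is not strictly higher than max_df.
kept = sorted(tok for tok, n in df.items() if n <= max_df)

print(df["apple"], df["banana"])
print(kept)
```

In this sketch `apple` is kept even though it occurs five times, because all of those occurrences are in a single document, while `banana` is dropped because it appears in four documents, which is more than the threshold of 3.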

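When you set max_features to 100, you are saying "sort the tokens in descending order of term frequency (that is, the total number of times a token occurs across all documents in the corpus), then keep only the top 100 of them".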
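As a rough illustration of that ranking, the pure-Python sketch below sums each token's occurrences over the whole corpus and keeps the top N. Again, this is hand-rolled rather than scikit-learn's code; the toy corpus, the whitespace tokenization, and the helper name `top_terms` are assumptions.

```python
from collections import Counter

# Toy corpus (made up): "apple" occurs 5 times in one document,
# "banana" occurs once in each of four documents.
docs = [
    "apple apple apple apple apple banana",
    "banana cherry",
    "banana cherry",
    "banana date",
]

def top_terms(documents, max_features):
    """Return the max_features tokens with the highest corpus-wide term frequency."""
    tf = Counter()
    for doc in documents:
        # Every occurrence counts, including repeats inside one document.
        tf.update(doc.split())
    return [tok for tok, _ in tf.most_common(max_features)]

top = top_terms(docs, 2)
print(top)
```

Here `apple` ranks first on term frequency (5 occurrences) even though it appears in only one document, which is exactly the case where term frequency and document frequency disagree.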
