Grabbing two different parts of a list

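Using scikit-learn, I've built some basic sentiment classifiers in Python. I'm now trying to evaluate them with cross-validation. My dataset, called training_data, contains 100,000 positive and 100,000 negative tweets (200,000 in total).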

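Each time, I need to hold out a block of 20,000 tweets from the full set for testing and train on the remaining 180,000. The problem is: when the block isn't at either end, how do I get the data on both sides of it?

I tried doing something like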

training_data.data[:20000] + training_data.data[40000:]

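but it says

operands could not be broadcast together with different shapes

However, I was under the impression that dataset.data was just a list.

As requested, here is sample output from training_data.data[1:10]: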

["@karoliiinem i'm personally following the next 300 people that will follow --&gt; @omgfantasy rt once you're done so i'd know ?\n", '@kristensaywha i know s tupid people\n', 'i might be going shopping tomorrow at the beach ) i hope so\n', '@_sophieallam cannae wait for a 5 hour train journey \n', 'wifey needs a hug \n', "i'm scared to drive to daytona with this car \n", "@xxiluvdahviexx i'm so sorry\n", "@chooselogoism that sucks i can't see w/o my glasses at all\n", 'x f actor \n']

I think I'm looking for some operation on a list that gives you all of the data except a specified slice?
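The broadcasting error suggests that training_data.data is a NumPy array rather than a plain list, in which case + means elementwise addition, not concatenation. A minimal sketch under that assumption; the small `data` array and the helper name `all_but_slice` are made up for illustration:

```python
import numpy as np

def all_but_slice(data, start, stop):
    """Return the elements of `data` outside data[start:stop] (hypothetical helper)."""
    data = np.asarray(data)
    # np.concatenate joins the two sides end to end;
    # `+` on arrays would try to add them element by element instead.
    return np.concatenate([data[:start], data[stop:]])

# Toy stand-in for training_data.data: ten short strings instead of tweets.
data = np.array(["t%d" % i for i in range(10)])

test_part = data[2:4]                    # the held-out block
train_part = all_but_slice(data, 2, 4)   # everything on both sides of it

print(test_part)
print(train_part)
```

If the data really is a Python list, data[:start] + data[stop:] gives the same result, since + concatenates lists.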

sklearn.cross_validation.KFold generates these folds for you.

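from sklearn.cross_validation import KFold
# first argument: number of samples to split; second: number of folds
# (180000 / 9 = 20000 test samples per fold)
cv = KFold(180000, 9)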

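This returns an iterator that yields the training and test indices at each step. If your classifier is called your_classifier, your data (the tweets) is X, and your targets (the sentiments) are y, then you can use sklearn.cross_validation.cross_val_score to get the scores for all folds: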

from sklearn.cross_validation import cross_val_score
scores = cross_val_score(your_classifier, X, y, cv=cv, scoring= ...)

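Here scoring is an important choice. If you use a naive scorer such as accuracy, which just counts correct predictions, your classes must be balanced, or you have to be aware that your estimator may simply be predicting the more frequent class. If you want every fold to preserve the class proportions, you may also want to look at sklearn.cross_validation.StratifiedKFold (see its documentation).

You need to use the cross_validation module. It covers many scenarios, one of which is k-fold.

Here is the example copied from the sklearn manual page: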
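To make the warning about accuracy concrete, here is a small sketch using only NumPy. The 90/10 label split is an assumption chosen for illustration (the dataset in the question sounds balanced), and the "classifier" is a stand-in that always predicts the majority class:

```python
import numpy as np

# Toy labels, assumed for illustration: 90 negative (0) and 10 positive (1) examples.
y_true = np.array([0] * 90 + [1] * 10)

# A "classifier" that ignores its input and always predicts the majority class.
y_pred = np.zeros_like(y_true)

# Accuracy is the fraction of correct predictions.
accuracy = np.mean(y_pred == y_true)

# Recall on the positive class: the fraction of true positives that were found.
positives = y_true == 1
recall_positive = np.mean(y_pred[positives] == 1)

print("accuracy:", accuracy)
print("positive recall:", recall_positive)
```

The accuracy comes out high even though no positive example is ever detected, which is why a scorer that looks at each class separately can be a safer choice on imbalanced data.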

>>> import numpy as np
>>> from sklearn import cross_validation
>>> X = np.array([[1, 2], [3, 4], [1, 2], [3, 4]])
>>> y = np.array([1, 2, 3, 4])
>>> kf = cross_validation.KFold(4, n_folds=2)
>>> len(kf)
2
>>> print(kf)
sklearn.cross_validation.KFold(n=4, n_folds=2)
>>> for train_index, test_index in kf:
...    print("TRAIN:", train_index, "TEST:", test_index)
...    X_train, X_test = X[train_index], X[test_index]
...    y_train, y_test = y[train_index], y[test_index]
TRAIN: [2 3] TEST: [0 1]
TRAIN: [0 1] TEST: [2 3]
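Note that the sklearn.cross_validation module used in this thread was deprecated in scikit-learn 0.18 and removed in 0.20. Its replacement is sklearn.model_selection, where KFold takes n_splits and the indices come from its split method. A sketch of the same example, assuming a recent scikit-learn release is installed:

```python
import numpy as np
from sklearn.model_selection import KFold

X = np.array([[1, 2], [3, 4], [1, 2], [3, 4]])
y = np.array([1, 2, 3, 4])

# No shuffling by default, so each test fold is a contiguous block of indices.
kf = KFold(n_splits=2)

splits = []
for train_index, test_index in kf.split(X):
    # Index arrays select the rows on both sides of the held-out block.
    X_train, X_test = X[train_index], X[test_index]
    y_train, y_test = y[train_index], y[test_index]
    splits.append((train_index.tolist(), test_index.tolist()))
    print("TRAIN:", train_index, "TEST:", test_index)
```

In the newer API, cross_val_score is likewise imported from sklearn.model_selection and accepts such a KFold object through its cv parameter.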
