Making predictions with a saved, trained classifier in scikit-learn

I wrote a classifier for tweets in Python and saved it to disk in .pkl format, so that I can run it again and again without retraining it each time. This is the code:

import pandas
import re
from sklearn.feature_extraction import FeatureHasher
from sklearn.feature_selection import SelectKBest
from sklearn.feature_selection import chi2
from sklearn import cross_validation
from sklearn.externals import joblib

#read the dataset of tweets
header_row=['sentiment','tweetid','date','query', 'user', 'text']
train = pandas.read_csv("training.data.csv",names=header_row)
#keep only the right columns
train = train[["sentiment","text"]]
#remove punctuation, special characters, numbers and lower case the text
def remove_spch(text):
    return re.sub("[^a-z]", ' ', text.lower())
train['text'] = train['text'].apply(remove_spch)

#Feature Hashing
def tokens(doc):
    """Extract tokens from doc.
    This uses a simple regex to break strings into tokens.
    """
    return (tok.lower() for tok in re.findall(r"\w+", doc))
n_features = 2**18
hasher = FeatureHasher(n_features=n_features, input_type="string", non_negative=True)
X = hasher.transform(tokens(d) for d in train['text'])
y = train['sentiment']
X_new = SelectKBest(chi2, k=20000).fit_transform(X, y)
a_train, a_test, b_train, b_test = cross_validation.train_test_split(X_new, y, test_size=0.2, random_state=42)
from sklearn.ensemble import RandomForestClassifier 
classifier=RandomForestClassifier(n_estimators=10)                  
classifier.fit(a_train.toarray(), b_train)                            
prediction = classifier.predict(a_test.toarray()) 
#Export the trained model to load it in another project
joblib.dump(classifier, 'my_model.pkl', compress=9)

Suppose I have another Python file in which I want to classify a tweet. How do I go about the classification?

from sklearn.externals import joblib
model_clone = joblib.load('my_model.pkl')
mytweet = 'Uh wow:@medium is doing a crowdsourced data-driven investigation tracking down a disappeared refugee boat'

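Up to hasher.transform I can replicate the same procedure to feed the tweet to the prediction model, but then I run into the problem that I cannot compute the best 20k features. To use SelectKBest, you need to supply both features and labels. Since the label is exactly what I want to predict, I cannot use SelectKBest. So how can I get past this issue and carry on with the prediction?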

I second @EdChum's comment:

you build a model by training it on data that is presumably representative enough for it to handle unseen data

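In practice, this means you need to apply the same FeatureHasher and the already fitted SelectKBest to your new data using transform only, and then call predict. (Refitting the feature steps on the new data would be wrong, because in general that would produce different features.)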

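To do so, either pickle FeatureHasher and SelectKBest separately

or (better)

build a Pipeline containing FeatureHasher, SelectKBest, and RandomForestClassifier, and pickle the whole pipeline. You can then load that pipeline and call predict on new data.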
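The Pipeline option could look roughly like the sketch below. It is a minimal illustration and not the asker's exact setup. It assumes a recent scikit-learn release, where joblib is imported as a standalone package and FeatureHasher takes alternate_sign=False in place of the older non_negative=True. The toy tweets, the reduced n_features and k values, the tokenize_docs helper, and the tweet_pipeline.pkl file name are all made up for the example.

```python
import os
import re
import tempfile

import joblib  # assumption: standalone joblib, not sklearn.externals.joblib
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction import FeatureHasher
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import FunctionTransformer


def tokenize_docs(docs):
    """Lowercase each document and split it into runs of letters."""
    return [re.findall(r"[a-z]+", doc.lower()) for doc in docs]


# Hypothetical toy data standing in for the tweet CSV (0 = negative, 4 = positive).
texts = [
    "I love this sunny day",
    "What a great and happy morning",
    "This is awful, I hate it",
    "Terrible service and a sad ending",
    "Happy to see such great news",
    "I hate this terrible weather",
]
labels = [4, 4, 0, 0, 4, 0]

# Every preprocessing step lives inside the pipeline, so the fitted
# SelectKBest is stored together with the classifier.
pipeline = Pipeline([
    ("tokenize", FunctionTransformer(tokenize_docs)),
    # alternate_sign=False keeps the hashed counts non-negative, which chi2 requires.
    ("hash", FeatureHasher(n_features=2**10, input_type="string", alternate_sign=False)),
    ("select", SelectKBest(chi2, k=20)),  # small k chosen for the toy data
    ("forest", RandomForestClassifier(n_estimators=10, random_state=42)),
])
pipeline.fit(texts, labels)

# Persist the whole pipeline instead of the bare classifier.
model_path = os.path.join(tempfile.mkdtemp(), "tweet_pipeline.pkl")
joblib.dump(pipeline, model_path, compress=9)

# Later (possibly in another script): load and predict on raw text.
loaded = joblib.load(model_path)
mytweet = ("Uh wow:@medium is doing a crowdsourced data-driven investigation "
           "tracking down a disappeared refugee boat")
predicted = loaded.predict([mytweet])
print(predicted)
```

One caveat with this design: pickle stores tokenize_docs by reference, so the script that loads the pipeline must be able to import that function under the same module path, for example from a small shared module used by both the training and the prediction script.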
