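I have trained a classifier that I load through pickle. My main question is whether anything can speed up the classification task. Each text takes almost 1 minute (feature extraction plus classification). Is that normal? Should I use multithreading?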
Here are some code fragments that show the overall flow:

    for item in items:
        review = ''.join(item['review_body'])
        review_features = getReviewFeatures(review)
        normalized_predicted_rating = getPredictedRating(review_features)
        item_processed['rating'] = str(round(float(normalized_predicted_rating),1))

    def getReviewFeatures(review, verbose=True):
        text_tokens = tokenize(review)
        polarity = getTextPolarity(review)
        subjectivity = getTextSubjectivity(review)

        # POS-tag the tokens, then rank bigrams and adjectives by frequency
        taggs = getTaggs(text_tokens)
        bigrams = processBigram(taggs)
        freqBigram = countBigramFreq(bigrams)
        sort_bi = sortMostCommun(freqBigram)
        adjectives = getAdjectives(taggs)
        freqAdjectives = countFreqAdjectives(adjectives)
        sort_adjectives = sortMostCommun(freqAdjectives)
        word_features_adj = list(sort_adjectives)
        word_features = list(sort_bi)

        # Build the feature dictionary: presence and count for each item
        features = {}
        for bigram, freq in word_features:
            features['contains(%s)' % unicode(bigram).encode('utf-8')] = True
            features["count({})".format(unicode(bigram).encode('utf-8'))] = freq
        for word, freq in word_features_adj:
            features['contains(%s)' % unicode(word).encode('utf-8')] = True
            features["count({})".format(unicode(word).encode('utf-8'))] = freq
        features["polarity"] = polarity
        features["subjectivity"] = subjectivity
        if verbose:
            print "Get review features..."
        return features

    import pickle
    import re
    import time

    def getPredictedRating(review_features, verbose=True):
        start_time = time.time()
        classifier = pickle.load(open("LinearSVC5.pickle", "rb"))
        p_rating = classifier.classify(review_features)  # in the form of "# star"
        # Extract the leading number from the "# star" label
        predicted_rating = re.findall(r'\d+', p_rating)[0]
        predicted_rating = int(predicted_rating)
        best_rating = 5
        worst_rating = 1
        normalized_predicted_rating = 0
        normalized_predicted_rating = round(float(predicted_rating)*float(10.0)/((float(best_rating)-float(worst_rating))+float(worst_rating)))
        if verbose:
            print "Get predicted rating..."
            print "ML_RATING: ", normalized_predicted_rating
            print("---Took %s seconds to predict rating for the review---" % (time.time() - start_time))
        return normalized_predicted_rating

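NLTK is a great tool and a good starting point for natural language processing, but it is sometimes not very useful when speed matters, as its authors implicitly acknowledge:
NLTK has been called "a wonderful tool for teaching, and working in, computational linguistics using Python," and "an amazing library to play with natural language."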
So if your problem lies only in the speed of the classifier, you will have to use another resource or write the classifier yourself.
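Before swapping the classifier, it may be worth confirming that classification, rather than feature extraction, is the slow stage. Here is a minimal sketch, assuming the pipeline can be split into stages that feed one another. `time_stages`, `fake_features`, and `fake_classify` are hypothetical stand-ins, not part of NLTK or of the question's code:

```python
import time


def time_stages(stages, value):
    """Run each (name, function) stage in order, passing the output of one
    stage to the next, and record how long each stage took in seconds."""
    timings = {}
    for name, func in stages:
        start = time.time()
        value = func(value)
        timings[name] = time.time() - start
    return value, timings


# Hypothetical stand-ins for getReviewFeatures and the classification step.
def fake_features(review):
    return {"length": len(review)}


def fake_classify(features):
    return "5 star" if features["length"] > 10 else "1 star"


result, timings = time_stages(
    [("features", fake_features), ("classify", fake_classify)],
    "a fairly long review text",
)
for name in ("features", "classify"):
    print("%s took %.6f seconds" % (name, timings[name]))
```

Replacing the stand-ins with the real feature-extraction and classification functions would show which of the two accounts for most of the minute.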
If you want to use a classifier that is likely to be faster, Scikit (scikit-learn) may help you.
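As a rough sketch of what that could look like: scikit-learn's `DictVectorizer` turns feature dictionaries shaped like the ones built in `getReviewFeatures` into a sparse matrix, and `LinearSVC` can then predict a whole batch of reviews in one call. The training data and the `predict_ratings` helper are made up for illustration; this is not the asker's trained model.

```python
from sklearn.feature_extraction import DictVectorizer
from sklearn.svm import LinearSVC

# Toy training data (invented), in the same dict-of-features shape
# that getReviewFeatures() returns.
train_features = [
    {"contains(great)": True, "count(great)": 2, "polarity": 0.8, "subjectivity": 0.6},
    {"contains(good)": True, "count(good)": 1, "polarity": 0.5, "subjectivity": 0.5},
    {"contains(bad)": True, "count(bad)": 1, "polarity": -0.6, "subjectivity": 0.7},
    {"contains(awful)": True, "count(awful)": 3, "polarity": -0.9, "subjectivity": 0.9},
]
train_labels = ["5 star", "4 star", "2 star", "1 star"]

# Fit the dict-to-matrix mapping and the classifier once, up front.
vectorizer = DictVectorizer()
X_train = vectorizer.fit_transform(train_features)
model = LinearSVC()
model.fit(X_train, train_labels)


def predict_ratings(feature_dicts):
    """Predict labels for a batch of feature dicts in a single call.
    Features never seen during fitting are silently ignored."""
    X = vectorizer.transform(feature_dicts)
    return list(model.predict(X))


print(predict_ratings(train_features[:2]))
```

The vectorizer and model are built once and reused for every review, so the per-review cost is only the `transform` and `predict` calls.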
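It seems that you use a dictionary to build the feature vector. I strongly suspect that the problem is there.
The proper way would be to use a numpy ndarray, with examples on rows and features on columns. Something like: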
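
    import numpy as np

    # let's suppose 6 different features = 6-dimensional vector
    # (one row for the single example, six columns for its features)
    feats = np.zeros((1, 6))

    # column 0 contains polarity, column 1 subjectivity, and so on..
    feats[:, 0] = polarity
    feats[:, 1] = subjectivity
    # ....

    # the classifier must accept array input (with scikit-learn this call would be predict)
    classifier.classify(feats)
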
Of course, you must use the same data structure and respect the same conventions during training.
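One way to keep that convention consistent is to fix the column order in a single place and build both the training matrix and the prediction matrix through the same function. A minimal sketch, assuming a small fixed vocabulary; `FEATURE_COLUMNS` and `to_matrix` are hypothetical names chosen for illustration:

```python
import numpy as np

# Hypothetical fixed column layout. The same list must be used
# when training and when predicting.
FEATURE_COLUMNS = ["polarity", "subjectivity", "count(great)", "count(bad)"]
COLUMN_INDEX = {name: i for i, name in enumerate(FEATURE_COLUMNS)}


def to_matrix(feature_dicts):
    """Return an array with one row per example and one column per feature.
    Features missing from a dict stay 0; features not in the layout are ignored."""
    feats = np.zeros((len(feature_dicts), len(FEATURE_COLUMNS)))
    for row, features in enumerate(feature_dicts):
        for name, value in features.items():
            col = COLUMN_INDEX.get(name)
            if col is not None:
                feats[row, col] = float(value)
    return feats


m = to_matrix([
    {"polarity": 0.5, "count(great)": 2, "count(unseen)": 7},
    {"subjectivity": 1.0},
])
print(m.shape)
```

Because every example maps to the same columns, a model trained on `to_matrix(training_dicts)` sees new reviews in exactly the layout it was trained on.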