I'm using a bag of words to classify text. It's working well, but I'd like to know how to add a feature that is not a word.
Here is my sample code.
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.svm import LinearSVC
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.multiclass import OneVsRestClassifier
X_train = np.array(["new york is a hell of a town",
                    "new york was originally dutch",
                    "new york is also called the big apple",
                    "nyc is nice",
                    "the capital of great britain is london. london is a huge metropolis which has a great many number of people living in it. london is also a very old town with a rich and vibrant cultural history.",
                    "london is in the uk. they speak english there. london is a sprawling big city where it's super easy to get lost and i've got lost many times.",
                    "london is in england, which is a part of great britain. some cool things to check out in london are the museum and buckingham palace.",
                    "london is in great britain. it rains a lot in britain and london's fogs are a constant theme in books based in london, such as sherlock holmes. the weather is really bad there.",])
y_train = [[0],[0],[0],[0],[1],[1],[1],[1]]
X_test = np.array(["it's a nice day in nyc",
                   'i loved the time i spent in london, the weather was great, though there was a nip in the air and i had to wear a jacket.'])
target_names = ['Class 1', 'Class 2']
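# Bag-of-words pipeline: word counts -> tf-idf weighting -> one-vs-rest linear SVM.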
classifier = Pipeline([
    ('vectorizer', CountVectorizer(min_df=1, max_df=2)),
    ('tfidf', TfidfTransformer()),
    ('clf', OneVsRestClassifier(LinearSVC()))])
classifier.fit(X_train, y_train)
predicted = classifier.predict(X_test)
for item, labels in zip(X_test, predicted):
    print('%s => %s' % (item, ', '.join(target_names[x] for x in labels)))
Now it's clear that the text about London tends to be much longer than the text about New York. How would I add the length of the text as a feature? Do I need to use a different classification method and then combine the two predictions? Is there a way to do it together with the bag of words? Some sample code would be great -- I'm completely new to machine learning and scikit-learn.
As noted in the comments, this is a combination of a FunctionTransformer, a Pipeline, and a FeatureUnion.
import numpy as np
from sklearn.pipeline import Pipeline, FeatureUnion
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.svm import LinearSVC
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.multiclass import OneVsRestClassifier
from sklearn.preprocessing import FunctionTransformer
X_train = np.array(["new york is a hell of a town",
                    "new york was originally dutch",
                    "new york is also called the big apple",
                    "nyc is nice",
                    "the capital of great britain is london. london is a huge metropolis which has a great many number of people living in it. london is also a very old town with a rich and vibrant cultural history.",
                    "london is in the uk. they speak english there. london is a sprawling big city where it's super easy to get lost and i've got lost many times.",
                    "london is in england, which is a part of great britain. some cool things to check out in london are the museum and buckingham palace.",
                    "london is in great britain. it rains a lot in britain and london's fogs are a constant theme in books based in london, such as sherlock holmes. the weather is really bad there.",])
y_train = np.array([[0],[0],[0],[0],[1],[1],[1],[1]])
X_test = np.array(["it's a nice day in nyc",
                   'i loved the time i spent in london, the weather was great, though there was a nip in the air and i had to wear a jacket.'])
target_names = ['Class 1', 'Class 2']
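# Character count of each document, reshaped to a 2-D column so it can be stacked with the tf-idf features.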
def get_text_length(x):
    return np.array([len(t) for t in x]).reshape(-1, 1)
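# FeatureUnion concatenates the tf-idf features with the text-length column before the classifier.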
classifier = Pipeline([
    ('features', FeatureUnion([
        ('text', Pipeline([
            ('vectorizer', CountVectorizer(min_df=1, max_df=2)),
            ('tfidf', TfidfTransformer()),
        ])),
        ('length', Pipeline([
            ('count', FunctionTransformer(get_text_length, validate=False)),
        ]))
    ])),
    ('clf', OneVsRestClassifier(LinearSVC()))])
classifier.fit(X_train, y_train)
predicted = classifier.predict(X_test)
print(predicted)
This will add the length of the text to the features used by the classifier.
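The same FunctionTransformer pattern extends to any handcrafted numeric feature: just return one column per feature. A minimal sketch, assuming you also want a word count (the get_text_features name and the word-count feature are my own illustration, not part of the original answer):

import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import FunctionTransformer

def get_text_features(x):
    # One row per document, one column per handcrafted feature:
    # character count and word count.
    return np.array([[len(t), len(t.split())] for t in x])

# Drop-in replacement for the 'length' branch of the FeatureUnion above.
length_features = Pipeline([
    ('count', FunctionTransformer(get_text_features, validate=False)),
])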
I'm assuming the new feature you want to add is numeric. Here is my logic: first convert the text to sparse using TfidfTransformer or something similar, then convert the sparse representation to a pandas DataFrame and add the new column, which I assume is numeric. Finally, you may want to convert the DataFrame back to a sparse matrix using scipy or whatever other module you are comfortable with. I'm assuming your data is in a pandas DataFrame called dataset that contains a 'Text Column' and a 'Numeric Column'. Here is some code:
import pandas as pd

dataset = pd.DataFrame({'Text Column': ['Sample Text1', 'Sample Text2'], 'Numeric Column': [2, 1]})
dataset.head()
   Numeric Column   Text Column
0               2  Sample Text1
1               1  Sample Text2
from sklearn.feature_extraction.text import TfidfVectorizer
from scipy import sparse
tv = TfidfVectorizer(min_df = 0.05, max_df = 0.5, stop_words = 'english')
X = tv.fit_transform(dataset['Text Column'])
vocab = tv.get_feature_names_out()
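# Put the tf-idf matrix into a DataFrame so the numeric column can be appended as an ordinary column.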
X1 = pd.DataFrame(X.toarray(), columns = vocab)
X1['Numeric Column'] = dataset['Numeric Column']
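# Convert the combined DataFrame back to a sparse CSR matrix for downstream estimators.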
X_sparse = sparse.csr_matrix(X1.values)
Finally, you may want to run:
print(X_sparse.shape)
print(X.shape)
to make sure the new column has been added successfully.
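If the vocabulary is large, going through a dense DataFrame can use a lot of memory. A minimal sketch of an alternative that stays sparse, assuming the same dataset and tf-idf matrix X as above, using scipy.sparse.hstack:

from scipy import sparse

# Turn the numeric column into an (n_samples, 1) sparse matrix and
# stack it next to the tf-idf features without densifying anything.
numeric_col = sparse.csr_matrix(dataset['Numeric Column'].values.reshape(-1, 1).astype(float))
X_sparse = sparse.hstack([X, numeric_col], format='csr')
print(X_sparse.shape)  # one more column than X.shape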