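I have sentences taken from research studies, along with manually extracted phrases that serve as the keywords I want from each sentence. To build training data for an SVM classifier, I would like to vectorize each sentence together with its keyword.
I am thinking of using dictionaries and applying DictVectorizer from the sklearn library.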
Code:
from sklearn.feature_extraction import DictVectorizer
v = DictVectorizer()
D = [{"sentence":"the laboratory information system was evaluated",
"keyword":"laboratory information system"},
{"sentence":"the electronic health record system was evaluated",
"keyword":"electronic health record system"}]
X = v.fit_transform(D)
print(X)
content = X.toarray()
print(content)
print(v.get_feature_names())
Results:
(0, 1) 1.0
(0, 3) 1.0
(1, 0) 1.0
(1, 2) 1.0
[[0. 1. 0. 1.]
[1. 0. 1. 0.]]
['keyword=electronic health record system', 'keyword=laboratory information system', 'sentence=the electronic health record system was evaluated', 'sentence=the laboratory information system was evaluated']
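Is this approach correct, or how else can I combine each sentence with its manually extracted keyword for vectorization to produce the training data? Many thanks.
I don't think this is ideal, because you are using the entire sentence as a single feature. That will become a problem on large datasets, since every distinct sentence adds its own column.
For example,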
D = [{"sentence":"This is sentence one",
"keyword":"key 1"},
{"sentence":"This is sentence two",
"keyword":"key 2"},
{"sentence":"This is sentence three",
"keyword":"key 3"},
{"sentence":"This is sentence four",
"keyword":"key 2"},
{"sentence":"This is sentence five",
"keyword":"key 1"}]
X will be
[[1. 0. 0. 0. 0. 1. 0. 0.]
[0. 1. 0. 0. 0. 0. 0. 1.]
[0. 0. 1. 0. 0. 0. 1. 0.]
[0. 1. 0. 0. 1. 0. 0. 0.]
[1. 0. 0. 1. 0. 0. 0. 0.]]
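To make the drawback concrete, here is a minimal sketch (the `train` and `unseen` data are made up for illustration). DictVectorizer one-hot encodes each distinct string value, so a sentence that was not present during fitting matches no column at transform time, even if it shares most of its words with the training sentences.

```python
from sklearn.feature_extraction import DictVectorizer

# Hypothetical training data: every sentence is a distinct string.
train = [
    {"sentence": "This is sentence one", "keyword": "key 1"},
    {"sentence": "This is sentence two", "keyword": "key 2"},
    {"sentence": "This is sentence three", "keyword": "key 3"},
]

v = DictVectorizer()
X_train = v.fit_transform(train)

# One column per distinct keyword plus one per distinct sentence.
print(X_train.shape)

# An unseen sentence is silently ignored by transform(); only the
# already-known keyword "key 1" contributes a non-zero entry.
unseen = [{"sentence": "This is sentence six", "keyword": "key 1"}]
X_new = v.transform(unseen).toarray()
print(X_new)
```

The number of columns grows with the number of distinct sentences, and none of those columns carries information that generalizes to new text.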
You could simply apply TfidfVectorizer from scikit-learn, which will likely pick up the important words in the sentences.
Code:
from sklearn.feature_extraction.text import TfidfVectorizer
sentences = [d['sentence'] for d in D]
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(sentences)
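If the keyword should also appear in the features, one possible option (an assumption on my part, not the only way to do it) is to encode the keywords with the same fitted vectorizer and stack the two matrices side by side with `scipy.sparse.hstack`. The question does not show the target labels for the SVM, so this sketch stops at the feature matrix.

```python
from scipy.sparse import hstack
from sklearn.feature_extraction.text import TfidfVectorizer

# Same data as in the question.
D = [
    {"sentence": "the laboratory information system was evaluated",
     "keyword": "laboratory information system"},
    {"sentence": "the electronic health record system was evaluated",
     "keyword": "electronic health record system"},
]

sentences = [d["sentence"] for d in D]
keywords = [d["keyword"] for d in D]

# Fit the vocabulary on the sentences, then reuse it for the keywords
# so both blocks share the same word-to-column mapping.
vectorizer = TfidfVectorizer()
X_sent = vectorizer.fit_transform(sentences)
X_key = vectorizer.transform(keywords)

# First block: sentence TF-IDF weights; second block: keyword TF-IDF weights.
X = hstack([X_sent, X_key]).tocsr()
print(X.shape)
```

Each row then has one block of columns describing the words in the sentence and a second block marking which of those words belong to the keyword.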