I'm using scikit-learn to create feature vectors for documents. My goal is to build a binary classifier (a gender classifier) from these feature vectors.
As features I want the k top words, i.e. the k words with the highest counts in each of the two label documents (k=10 -> 20 features, since there are 2 labels).
Both of my documents (label1document, label2document) are filled with instances like this:
user:somename, post:"A written text which i use"
My understanding so far is that I take all the text from all instances in both documents and build a vocabulary with counts (counts per label, so that I can compare the label data):
from sklearn.feature_extraction.text import CountVectorizer

# These are my documents with all the text
label1document = "car eat essen sleep sleep"
label2document = "eat sleep woman woman woman woman"
vectorizer = CountVectorizer(min_df=1)
corpus = [label1document, label2document]
# Here I create a matrix with the counts of the words from both documents
X = vectorizer.fit_transform(corpus)
Question 1: What do I have to pass to fit_transform so that I get the words with the highest counts from both labels?
X_new = SelectKBest(chi2, k=2).fit_transform( ?? )
Because in the end, I want training data (instances) like this:
<label> <feature1 : value> ... <featureN : value>
Question 2: How do I get that training data from there?
Oliver
import pandas as pd
# get the names of the features
# (in scikit-learn >= 1.0 use get_feature_names_out(); older versions use get_feature_names())
features = vectorizer.get_feature_names_out()
# convert the matrix from sparse to dense
df = pd.DataFrame(X.toarray(), columns=features)
df
will return:
   car  eat  essen  sleep  woman
0    1    1      1      2      0
1    0    1      0      1      4
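If you specifically want the k highest-count words *per label* (one row per label document), a minimal sketch using pandas `nlargest` on the same counts as above (labels per row are an assumption: row 0 = label 1, row 1 = label 2):

```python
import pandas as pd

# Same counts as the DataFrame above (row 0 = label1, row 1 = label2)
df = pd.DataFrame({"car": [1, 0], "eat": [1, 1], "essen": [1, 0],
                   "sleep": [2, 1], "woman": [0, 4]})

k = 2
# For each label row, keep the k words with the highest counts
top_per_label = {row_idx: df.loc[row_idx].nlargest(k) for row_idx in df.index}
print(top_per_label[0].index[0])  # 'sleep' (highest count in label1)
print(top_per_label[1].index[0])  # 'woman' (highest count in label2)
```

Taking the union of the two index lists gives you the (up to) 2*k feature words you described.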
And then to get the most frequent terms:
highest_frequency = df.max().sort_values(ascending=False)
highest_frequency
which will return:
woman 4
sleep 2
essen 1
eat 1
car 1
dtype: int64
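Regarding question 1: `SelectKBest.fit_transform` needs the count matrix *and* a label vector with one label per document. A minimal sketch, assuming row 0 is label 1 and row 1 is label 2 (the labels `y = [0, 1]` are an assumption for this toy corpus):

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_selection import SelectKBest, chi2

corpus = ["car eat essen sleep sleep",
          "eat sleep woman woman woman woman"]
y = [0, 1]  # one label per document (assumed)

vectorizer = CountVectorizer(min_df=1)
X = vectorizer.fit_transform(corpus)

# SelectKBest takes the count matrix X *and* the labels y
selector = SelectKBest(chi2, k=2)
X_new = selector.fit_transform(X, y)
print(X_new.shape)  # (2, 2): 2 documents, 2 selected features
```

Note that chi2 picks the features most dependent on the label, which is not necessarily the same as the raw top-k counts per label.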
Once you have the data in a DataFrame, it is easy to convert it into whatever format you want, e.g.:
df.to_dict()
>>> {u'car': {0: 1, 1: 0},
u'eat': {0: 1, 1: 1},
u'essen': {0: 1, 1: 0},
u'sleep': {0: 2, 1: 1},
u'woman': {0: 0, 1: 4}}
df.to_json()
>>>'{"car":{"0":1,"1":0},"eat":{"0":1,"1":1},"essen":{"0":1,"1":0},"sleep":{"0":2,"1":1},"woman":{"0":0,"1":4}}'
df.to_csv()
>>> ',car,eat,essen,sleep,woman\n0,1,1,1,2,0\n1,0,1,0,1,4\n'
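Regarding question 2: from the DataFrame you can also build `<label> <feature:value>` lines directly by iterating over the rows. A minimal sketch, where the label names `label1`/`label2` are hypothetical:

```python
import pandas as pd

# Counts from the DataFrame above; one row per label document
df = pd.DataFrame({"car": [1, 0], "eat": [1, 1], "essen": [1, 0],
                   "sleep": [2, 1], "woman": [0, 4]})
labels = ["label1", "label2"]  # hypothetical label names

lines = []
for label, (_, row) in zip(labels, df.iterrows()):
    # Join every feature name with its count for this label
    feats = " ".join(f"{name}:{count}" for name, count in row.items())
    lines.append(f"{label} {feats}")

print(lines[0])  # "label1 car:1 eat:1 essen:1 sleep:2 woman:0"
```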
Here is some useful documentation.