Converting a list of words to a list of integers in scikit-learn

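I want to convert a list of words to a list of integers in scikit-learn, and to do this for a corpus made up of a list of word lists. For example, the corpus could be a collection of sentences.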

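I can do it as follows with sklearn.feature_extraction.text.CountVectorizer, but is there a simpler way? I suspect I am missing some CountVectorizer functionality, since this is a common preprocessing step in natural language processing. In this code I first fit the CountVectorizer, and then I have to iterate over every word in every word list to build the integer lists.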

import sklearn.feature_extraction.text
import numpy as np

def reverse_dictionary(mapping):
    '''
    Invert a dict so that its values map back to its keys.
    http://stackoverflow.com/questions/483666/python-reverse-inverse-a-mapping
    '''
    return {v: k for k, v in mapping.items()}
vectorizer = sklearn.feature_extraction.text.CountVectorizer(min_df=1)
corpus = ['This is the first document.',
        'This is the second second document.',
        'And the third one.',
        'Is this the first document? This is right.',]
X = vectorizer.fit_transform(corpus).toarray()
tokenizer = vectorizer.build_tokenizer()
output_corpus = []
for line in corpus: 
    line = tokenizer(line.lower())
    output_line = np.empty_like(line, dtype=int)  # np.int was removed in NumPy 1.24
    for token_number, token in np.ndenumerate(line):
        output_line[token_number] = vectorizer.vocabulary_.get(token) 
    output_corpus.append(output_line)
print('output_corpus: {0}'.format(output_corpus))
word2idx = vectorizer.vocabulary_
print('word2idx: {0}'.format(word2idx))
idx2word = reverse_dictionary(word2idx)
print('idx2word: {0}'.format(idx2word))

Output:

output_corpus: [array([9, 3, 7, 2, 1]), # 'This is the first document.'
                array([9, 3, 7, 6, 6, 1]), # 'This is the second second document.'
                array([0, 7, 8, 4]), # 'And the third one.'
                array([3, 9, 7, 2, 1, 9, 3, 5])] # 'Is this the first document? This is right.'
word2idx: {u'and': 0, u'right': 5, u'third': 8, u'this': 9, u'is': 3, u'one': 4,
           u'second': 6, u'the': 7, u'document': 1, u'first': 2}
idx2word: {0: u'and', 1: u'document', 2: u'first', 3: u'is', 4: u'one', 5: u'right', 
           6: u'second', 7: u'the', 8: u'third', 9: u'this'}

I don't know whether there is a more direct way, but you can simplify the syntax by using map instead of a for loop to iterate over each word.

You can also use build_analyzer(), which handles both preprocessing and tokenization, so there is no need to call lower() explicitly.

analyzer = vectorizer.build_analyzer()
# Python 3: map() returns an iterator, so wrap it in list()
output_corpus = [list(map(lambda x: vectorizer.vocabulary_.get(x), analyzer(line))) for line in corpus]
# In Python 2, map() already returns a list, so the list() call can be dropped.

output_corpus:

[[9, 3, 7, 2, 1], [9, 3, 7, 6, 6, 1], [0, 7, 8, 4], [3, 9, 7, 2, 1, 9, 3, 5]]

Edit

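Thanks to @user3914041: in this case a plain list comprehension is probably preferable. It avoids the lambda, so it can be slightly faster than map (according to "Python List Comprehension Vs. Map" and my own simple tests).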

output_corpus = [[vectorizer.vocabulary_.get(x) for x in analyzer(line)] for line in corpus]
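
To make explicit what the analyzer and the vocabulary lookup are doing, here is a minimal sketch that does not use scikit-learn. It assumes CountVectorizer's default settings (lowercasing and the token pattern `(?u)\b\w\w+\b`, i.e. tokens of two or more word characters) and a vocabulary indexed in sorted order. The names `tokenize` and `encode_corpus` are made up for illustration.

```python
import re

# Assumption: mirrors CountVectorizer's default token pattern (2+ word characters).
TOKEN_PATTERN = re.compile(r"(?u)\b\w\w+\b")

def tokenize(text):
    """Lowercase the text and split it into tokens (hypothetical helper)."""
    return TOKEN_PATTERN.findall(text.lower())

def encode_corpus(corpus):
    """Return (sequences, word2idx), with indices assigned in sorted word order."""
    tokenized = [tokenize(line) for line in corpus]
    vocab = sorted({token for tokens in tokenized for token in tokens})
    word2idx = {word: idx for idx, word in enumerate(vocab)}
    sequences = [[word2idx[token] for token in tokens] for tokens in tokenized]
    return sequences, word2idx

corpus = ['This is the first document.',
          'This is the second second document.',
          'And the third one.',
          'Is this the first document? This is right.']
sequences, word2idx = encode_corpus(corpus)
print(sequences)
# [[9, 3, 7, 2, 1], [9, 3, 7, 6, 6, 1], [0, 7, 8, 4], [3, 9, 7, 2, 1, 9, 3, 5]]
```

Because the vocabulary is built from the same corpus, every token is guaranteed to have an index here. With vocabulary_.get(x) on unseen text, unknown words would come back as None and need separate handling.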

I often use Counter in Python for this problem, for example:

from collections import Counter
corpus = ['This is the first document.',
        'This is the second second document.',
        'And the third one.',
        'Is this the first document? This is right.',]

# Join the sentences into one string and split on whitespace
words = ' '.join(corpus).split()

counts = Counter(words)
vocab = sorted(counts, key=counts.get, reverse=True)
vocab_to_int = {word: ii for ii, word in enumerate(vocab, 1)}

print(vocab_to_int)

Output:

{'the': 1, 'This': 2, 'is': 3, 'first': 4, 'document.': 5, 'second': 6, 'And': 7, 'third': 8, 'one.': 9, 'Is': 10, 'this': 11, 'document?': 12, 'right.': 13}
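
The snippet above stops at the vocabulary. A minimal sketch of the remaining step, turning each sentence into a list of integers, might look like the following. It rebuilds the same mapping so that it runs on its own; `encoded` is an illustrative name, and because the split is on whitespace only, punctuation stays attached to the words and case is preserved.

```python
from collections import Counter

corpus = ['This is the first document.',
          'This is the second second document.',
          'And the third one.',
          'Is this the first document? This is right.']

# Rebuild the frequency-ranked vocabulary (the most frequent word gets index 1).
counts = Counter(word for sentence in corpus for word in sentence.split())
vocab = sorted(counts, key=counts.get, reverse=True)
vocab_to_int = {word: ii for ii, word in enumerate(vocab, 1)}

# Encode each sentence with the same whitespace split used to build the vocabulary.
encoded = [[vocab_to_int[word] for word in sentence.split()] for sentence in corpus]
print(encoded)
# [[2, 3, 1, 4, 5], [2, 3, 1, 6, 6, 5], [7, 1, 8, 9], [10, 11, 1, 4, 12, 2, 3, 13]]
```

Starting the indices at 1 leaves 0 free, which is convenient if the sequences are later padded to equal length.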

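For a given text, CountVectorizer is meant to return a vector holding the count of each word.

For example, for the corpus corpus = ["the cat", "the dog"], the vectorizer will find 3 distinct words, so it will output vectors of dimension 3, where "the" corresponds to the first dimension, "cat" to the second, and "dog" to the third. "the cat" would then be converted to [1, 1, 0], "the dog" to [1, 0, 1], and a sentence with repeated words would have larger values (e.g. "the cat cat" → [1, 2, 0]). The dimension order here is only illustrative: scikit-learn assigns indices in sorted term order, as the word2idx output in the question shows.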
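
The difference between such a count vector and the integer sequence asked for in the question can be sketched in plain Python. This is not CountVectorizer itself: `count_vector` is a hypothetical helper, and the vocabulary order is fixed by hand to match the example above.

```python
from collections import Counter

# Assumption: vocabulary order chosen by hand to match the example ("the", "cat", "dog").
vocab = ["the", "cat", "dog"]

def count_vector(text, vocab):
    """Count how often each vocabulary word occurs in the text (hypothetical helper)."""
    counts = Counter(text.split())
    return [counts[word] for word in vocab]

print(count_vector("the cat", vocab))      # [1, 1, 0]
print(count_vector("the dog", vocab))      # [1, 0, 1]
print(count_vector("the cat cat", vocab))  # [1, 2, 0]
```

The vector always has one entry per vocabulary word and discards word order, which is why it cannot be used directly as a sequence of word indices.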

For what you want, the Zeugma package should serve you well. You only need to do the following (after running pip install zeugma in a terminal):

>>> from zeugma import TextsToSequences
>>> sequencer = TextsToSequences()
>>> sequencer.fit_transform(["this is a sentence.", "and another one."])
array([[1, 2, 3, 4], [5, 6, 7]], dtype=object)

You can always access the index-to-word mapping with:

>>> sequencer.index_word
{1: 'this', 2: 'is', 3: 'a', 4: 'sentence', 5: 'and', 6: 'another', 7: 'one'}

From there you can transform any new sentence with this mapping:

>>> sequencer.transform(["a sentence"])
array([[3, 4]])

I hope this helps!
