Combining text stemming and removal of punctuation in NLTK and scikit-learn



I am using a combination of NLTK and scikit-learn's CountVectorizer for stemming words and tokenization.

Below is an example of the plain usage of the CountVectorizer:

from sklearn.feature_extraction.text import CountVectorizer
vocab = ['The swimmer likes swimming so he swims.']
vec = CountVectorizer().fit(vocab)
sentence1 = vec.transform(['The swimmer likes swimming.'])
sentence2 = vec.transform(['The swimmer swims.'])
# note: in scikit-learn >= 1.0 this method is named get_feature_names_out()
print('Vocabulary: %s' % vec.get_feature_names())
print('Sentence 1: %s' % sentence1.toarray())
print('Sentence 2: %s' % sentence2.toarray())

Which will print:

Vocabulary: ['he', 'likes', 'so', 'swimmer', 'swimming', 'swims', 'the']
Sentence 1: [[0 1 0 1 1 0 1]]
Sentence 2: [[0 0 0 1 0 1 1]]

Now, let's say I want to remove stop words and add stemming. One option would be to do it like this:

from nltk import word_tokenize
from nltk.stem.porter import PorterStemmer
#######
# based on http://www.cs.duke.edu/courses/spring14/compsci290/assignments/lab02.html
stemmer = PorterStemmer()
def stem_tokens(tokens, stemmer):
    stemmed = []
    for item in tokens:
        stemmed.append(stemmer.stem(item))
    return stemmed
def tokenize(text):
    tokens = word_tokenize(text)  # word_tokenize is imported directly above
    stems = stem_tokens(tokens, stemmer)
    return stems
########
vect = CountVectorizer(tokenizer=tokenize, stop_words='english')
vect.fit(vocab)
sentence1 = vect.transform(['The swimmer likes swimming.'])
sentence2 = vect.transform(['The swimmer swims.'])
print('Vocabulary: %s' % vect.get_feature_names())
print('Sentence 1: %s' % sentence1.toarray())
print('Sentence 2: %s' % sentence2.toarray())

Which prints:

Vocabulary: ['.', 'like', 'swim', 'swimmer']
Sentence 1: [[1 1 1 1]]
Sentence 2: [[1 0 1 1]]

But how would I best get rid of the punctuation in the second version?

There are several options. You could try removing the punctuation before tokenization, but that means that don't -> dont:

import string
def tokenize(text):
    # strip punctuation characters before tokenizing
    text = "".join([ch for ch in text if ch not in string.punctuation])
    tokens = word_tokenize(text)
    stems = stem_tokens(tokens, stemmer)
    return stems
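
As a quick check of that caveat, here is a minimal sketch (it assumes the NLTK punkt tokenizer data has been downloaded, e.g. via nltk.download('punkt')):

print(tokenize("Don't stop!"))
# prints ['dont', 'stop'] -- the contraction "don't" has collapsed into "dont"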

Or you could try removing the punctuation after tokenization:

def tokenize(text):
    tokens = word_tokenize(text)
    # drop tokens that are single punctuation characters
    tokens = [i for i in tokens if i not in string.punctuation]
    stems = stem_tokens(tokens, stemmer)
    return stems
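
With this version the contraction survives, because word_tokenize splits it into two tokens and only the pure punctuation tokens are dropped (again a sketch under the same assumptions):

print(tokenize("Don't stop!"))
# prints ['do', "n't", 'stop'] -- the '!' token is removed, but "n't" is kept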

EDIT

The code above works, but it is rather slow because it loops through the same text multiple times:

  • once to remove the punctuation
  • a second time to tokenize
  • a third time to stem

If you have more steps, such as removing digits, removing stopwords, or lowercasing, it is better to lump the steps together as much as possible. Here are several better answers that are more efficient if your data requires more pre-processing steps; a combined single-pass tokenizer is also sketched after the links below:

  • Applying NLTK-based text pre-processing to a pandas dataframe
  • Why is my NLTK function slow when processing the DataFrame?
  • https://www.kaggle.com/alvations/basic-nlp-with-nltk
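
For completeness, here is one way to fold punctuation removal, stopword filtering, and stemming into a single tokenizer callback, so the text is tokenized once and the tokens are filtered and stemmed in a single pass. This is only a sketch, not code from the answers above (the name tokenize_all_in_one is illustrative), and it assumes the NLTK punkt and stopwords corpora are available:

from sklearn.feature_extraction.text import CountVectorizer
from nltk import word_tokenize
from nltk.corpus import stopwords
from nltk.stem.porter import PorterStemmer
import string

stemmer = PorterStemmer()
stop_words = set(stopwords.words('english'))

def tokenize_all_in_one(text):
    stems = []
    for token in word_tokenize(text):
        if token in string.punctuation:    # drop pure punctuation tokens
            continue
        if token.lower() in stop_words:    # drop English stopwords
            continue
        stems.append(stemmer.stem(token))  # stem (NLTK's stem() also lowercases)
    return stems

vect = CountVectorizer(tokenizer=tokenize_all_in_one)
vect.fit(['The swimmer likes swimming so he swims.'])
print(vect.get_feature_names())  # expected: ['like', 'swim', 'swimmer']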
