Create ngrams only from words on the same line (never across line breaks) with scikit-learn CountVectorizer

When using the scikit-learn library in Python, I can use CountVectorizer to create ngrams of a desired length (e.g. 2 words) like so:

from sklearn.feature_extraction.text import CountVectorizer
import nltk

# two lines of text, separated by a line break
myString = 'This is a\nmultiline string'
countVectorizer = CountVectorizer(ngram_range=(2,2))
analyzer = countVectorizer.build_analyzer()
# list of all bigrams found by the default analyzer
listNgramQuery = analyzer(myString)
NgramQueryWeights = nltk.FreqDist(listNgramQuery)
print(NgramQueryWeights.items())

This prints:

dict_items([('is multiline', 1), ('multiline string', 1), ('this is', 1)])

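As the is multiline ngram shows (the single-character token a is dropped by default, because the default token pattern only matches tokens of two or more characters), the engine does not care about the line break in the string.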

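How can I modify the engine that creates the ngrams so that it respects line breaks in the string and only creates ngrams whose words all belong to the same line of text? My expected output would be: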

dict_items([('multiline string', 1), ('this is', 1)])

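I know that I can modify the tokenizer pattern by passing token_pattern=someRegex to CountVectorizer. I also read somewhere that the default regex is u'(?u)\b\w\w+\b'. However, I think this problem is more about ngram creation than about the tokenizer: the issue is not that tokens are created without respecting the line break, but that the ngrams are.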
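To illustrate the point about the tokenizer, here is a minimal sketch that applies that default pattern directly with re.findall (the variable names are only for illustration). The line break acts as an ordinary token separator, so the flat token list no longer carries any line information unless the text is split into lines first:

```python
import re

my_string = 'This is a\nmultiline string'

# Default CountVectorizer token pattern: words of two or more characters
token_pattern = re.compile(r'(?u)\b\w\w+\b')

# Tokenizing the whole string loses the line boundary
tokens = token_pattern.findall(my_string)
print(tokens)
# ['This', 'is', 'multiline', 'string']

# Tokenizing line by line keeps it
tokens_per_line = [token_pattern.findall(line) for line in my_string.split('\n')]
print(tokens_per_line)
# [['This', 'is'], ['multiline', 'string']]
```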

You need to override the analyzer, as described in the documentation.

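import re
from sklearn.feature_extraction.text import CountVectorizer

def bigrams_per_line(doc):
    # handle each line separately so no bigram spans a line break
    for ln in doc.split('\n'):
        # tokens of two or more word characters
        terms = re.findall(r'\w{2,}', ln)
        # pair each token with its successor on the same line
        for bigram in zip(terms, terms[1:]):
            yield '%s %s' % bigram

cv = CountVectorizer(analyzer=bigrams_per_line)
cv.fit(['This is a\nmultiline string'])
# scikit-learn >= 1.0; older versions use cv.get_feature_names()
print(cv.get_feature_names_out().tolist())
# ['This is', 'multiline string']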
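Because a callable analyzer receives the raw document, the default lowercasing is skipped, which is why the result above contains 'This is' rather than the 'this is' of the expected output. One possible variation, sketched here under the assumption that the default preprocessing and tokenization should be kept, is to run the built-in analyzer on each line separately. The helper name default_ngrams_per_line is hypothetical:

```python
from sklearn.feature_extraction.text import CountVectorizer

# Built-in analyzer: default preprocessing and tokenization, bigrams only
default_analyzer = CountVectorizer(ngram_range=(2, 2)).build_analyzer()

def default_ngrams_per_line(doc):
    # Hypothetical helper: apply the built-in analyzer to each line separately
    for line in doc.splitlines():
        yield from default_analyzer(line)

cv = CountVectorizer(analyzer=default_ngrams_per_line)
cv.fit(['This is a\nmultiline string'])
print(sorted(cv.vocabulary_))
# ['multiline string', 'this is']
```

Changing ngram_range on the inner CountVectorizer should give other ngram lengths without touching the helper.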

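The accepted answer works fine, but it only finds bigrams (tokens consisting of exactly two words). To generalize it to ngrams (as in the example code in my question, which uses the ngram_range=(min,max) parameter), the following code can be used: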

import re
from itertools import tee, islice
from sklearn.feature_extraction.text import CountVectorizer

# ngram length bounds, read by ngrams_per_line as globals (example values)
minNgramLength = 1
maxNgramLength = 2

# custom ngram analyzer function, matching only ngrams that belong to the same line
def ngrams_per_line(doc):
    # analyze each line of the input string separately
    for ln in doc.split('\n'):
        # tokenize the line (customize the regex as desired)
        terms = re.findall(r'(?u)\b\w+\b', ln)
        # loop ngram creation for every number between min and max ngram length
        for ngramLength in range(minNgramLength, maxNgramLength+1):
            # find and return all ngrams
            # for ngram in zip(*[terms[i:] for i in range(ngramLength)]): <-- solution without a generator (works the same but has higher memory usage)
            for ngram in zip(*[islice(seq, i, len(terms)) for i, seq in enumerate(tee(terms, ngramLength))]): # <-- solution using a generator
                ngram = ' '.join(ngram)
                yield ngram

Then use the custom analyzer as the argument to CountVectorizer:

cv = CountVectorizer(analyzer=ngrams_per_line)

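Make sure minNgramLength and maxNgramLength are defined in a way that the ngrams_per_line function knows about them (e.g. by declaring them as globals), since they cannot be passed to it as parameters (at least I don't know how).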
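One way around the globals, offered only as a sketch, is functools.partial from the standard library: the analyzer takes the lengths as ordinary parameters, and partial binds them ahead of time so that a one-argument callable remains. The parameter names min_n and max_n are made up for this example:

```python
import re
from functools import partial

def ngrams_per_line(doc, min_n, max_n):
    # Same per-line idea, with the ngram lengths as explicit parameters
    for line in doc.split('\n'):
        terms = re.findall(r'(?u)\b\w+\b', line)
        for n in range(min_n, max_n + 1):
            # slide a window of n tokens across the line
            for i in range(len(terms) - n + 1):
                yield ' '.join(terms[i:i + n])

# partial fixes min_n and max_n, leaving a callable that takes only the document
analyzer = partial(ngrams_per_line, min_n=1, max_n=2)
print(list(analyzer('This is a\nmultiline string')))
# ['This', 'is', 'a', 'This is', 'is a', 'multiline', 'string', 'multiline string']
```

Since CountVectorizer accepts any callable as analyzer, the resulting object can be passed as analyzer=analyzer.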

Dirk's answer is even better than the accepted one. Here is just another hint on how to pass parameters to this function: simply use a closure.

def gen_analyzer(minNgramLength, maxNgramLength):
    def ngrams_per_line(doc):
        ...

    return ngrams_per_line

cv = CountVectorizer(analyzer=gen_analyzer(1, 2))
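For reference, a fuller sketch of the closure, assuming the elided body is the per-line loop from Dirk's answer (written here with the simpler slicing variant rather than tee/islice):

```python
import re

def gen_analyzer(minNgramLength, maxNgramLength):
    # The inner function captures both lengths from the enclosing scope
    def ngrams_per_line(doc):
        for ln in doc.split('\n'):
            terms = re.findall(r'(?u)\b\w+\b', ln)
            for ngramLength in range(minNgramLength, maxNgramLength + 1):
                # zip over shifted copies of the token list; stops at the shortest
                for ngram in zip(*[terms[i:] for i in range(ngramLength)]):
                    yield ' '.join(ngram)
    return ngrams_per_line

analyzer = gen_analyzer(2, 3)
print(list(analyzer('one two three\nfour five')))
# ['one two', 'two three', 'one two three', 'four five']
```

Each call to gen_analyzer returns an independent analyzer, so vectorizers with different ngram ranges can coexist without sharing global state.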
