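I want to use Python to count the occurrences of all bigrams (pairs of adjacent words) in a file. I am dealing with very large files, so I am looking for an efficient approach. I tried using the count method with the regular expression "\w+\s\w+" on the file contents, but it did not prove to be efficient.
For example, suppose I want to count the bigrams in a file a.txt with the following content: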
"the quick person did not realize his speed and the quick person bumped "
For the file above, the set of bigrams and their counts would be:
(the,quick) = 2
(quick,person) = 2
(person,did) = 1
(did,not) = 1
(not,realize) = 1
(realize,his) = 1
(his,speed) = 1
(speed,and) = 1
(and,the) = 1
(person,bumped) = 1
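I came across an example of the Counter object in Python, which is used to count unigrams (single words). It also uses a regular expression approach.
The example looks like this: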
>>> # Find the ten most common words in Hamlet
>>> import re
>>> from collections import Counter
>>> words = re.findall(r'\w+', open('a.txt').read())
>>> print(Counter(words).most_common())
The output of the above code is:
[('the', 2), ('quick', 2), ('person', 2), ('did', 1), ('not', 1),
('realize', 1), ('his', 1), ('speed', 1), ('and', 1), ('bumped', 1)]
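I would like to know whether the Counter object can be used to get the bigram counts. Any approach other than the Counter object or regular expressions would also be appreciated.
Some itertools magic: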
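>>> import re
>>> from collections import Counter
>>> # Python 2 code; on Python 3 use the built-in zip instead of izip
>>> from itertools import islice, izip
>>> words = re.findall(r"\w+",
...     "the quick person did not realize his speed and the quick person bumped")
>>> print(Counter(izip(words, islice(words, 1, None))))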
Output:
Counter({('the', 'quick'): 2, ('quick', 'person'): 2, ('person', 'did'): 1,
('did', 'not'): 1, ('not', 'realize'): 1, ('and', 'the'): 1,
('speed', 'and'): 1, ('person', 'bumped'): 1, ('his', 'speed'): 1,
('realize', 'his'): 1})
Bonus
Get the frequency of any n-gram:
from itertools import tee, islice
def ngrams(lst, n):
    tlst = lst
    while True:
        a, b = tee(tlst)
        l = tuple(islice(a, n))
        if len(l) == n:
            yield l
            next(b)
            tlst = b
        else:
            break
>>> Counter(ngrams(words, 3))
Output:
Counter({('the', 'quick', 'person'): 2, ('and', 'the', 'quick'): 1,
('realize', 'his', 'speed'): 1, ('his', 'speed', 'and'): 1,
('person', 'did', 'not'): 1, ('quick', 'person', 'did'): 1,
('quick', 'person', 'bumped'): 1, ('did', 'not', 'realize'): 1,
('speed', 'and', 'the'): 1, ('not', 'realize', 'his'): 1})
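This also works with lazy iterables and generators. So you can write a generator that reads a file line by line, yields words, and passes them to ngrams to be consumed lazily, without reading the whole file into memory.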
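The lazy pipeline described above might look roughly like the sketch below. Note that it swaps in a deque-based sliding window instead of the tee-based ngrams function, and the names iter_words, sliding_ngrams and count_ngrams_in_file are made up for illustration.

```python
import re
from collections import Counter, deque

def iter_words(lines):
    """Yield words one at a time from an iterable of lines (e.g. an open file)."""
    for line in lines:
        for word in re.findall(r"\w+", line):
            yield word

def sliding_ngrams(iterable, n):
    """Yield n-tuples of consecutive items, holding only n items at a time."""
    window = deque(maxlen=n)  # the oldest item drops out automatically
    for item in iterable:
        window.append(item)
        if len(window) == n:
            yield tuple(window)

def count_ngrams_in_file(path, n=2):
    """Count n-grams in a text file without loading the whole file."""
    with open(path) as handle:  # a file object iterates line by line
        return Counter(sliding_ngrams(iter_words(handle), n))
```

Because the words are streamed, n-grams that span a line break are still counted. The resulting Counter itself still has to fit in memory.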
How about zip()?
import re
from collections import Counter
words = re.findall(r'\w+', open('a.txt').read())
print(Counter(zip(words,words[1:])))
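The zip approach can probably be stretched to n-grams by zipping n shifted slices of the word list. This is only a sketch: zip_ngrams is a made-up helper name, and since each slice copies the list it suits inputs that fit comfortably in memory.

```python
from collections import Counter

def zip_ngrams(words, n):
    """Return an iterator over n-tuples of consecutive items from a list."""
    # One shifted slice per position in the n-gram; zip stops at the shortest.
    return zip(*(words[i:] for i in range(n)))

words = "the quick person did not realize his speed and the quick person bumped".split()
bigram_counts = Counter(zip_ngrams(words, 2))
trigram_counts = Counter(zip_ngrams(words, 3))
print(bigram_counts[("the", "quick")], trigram_counts[("the", "quick", "person")])
```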
Simply use Counter with any n_gram, like so:
from collections import Counter
from nltk.util import ngrams
text = "the quick person did not realize his speed and the quick person bumped "
n_gram = 2
Counter(ngrams(text.split(), n_gram))
>>>
Counter({('and', 'the'): 1,
('did', 'not'): 1,
('his', 'speed'): 1,
('not', 'realize'): 1,
('person', 'bumped'): 1,
('person', 'did'): 1,
('quick', 'person'): 2,
('realize', 'his'): 1,
('speed', 'and'): 1,
('the', 'quick'): 2})
For 3-grams, just change n_gram to 3:
n_gram = 3
Counter(ngrams(text.split(), n_gram))
>>>
Counter({('and', 'the', 'quick'): 1,
('did', 'not', 'realize'): 1,
('his', 'speed', 'and'): 1,
('not', 'realize', 'his'): 1,
('person', 'did', 'not'): 1,
('quick', 'person', 'bumped'): 1,
('quick', 'person', 'did'): 1,
('realize', 'his', 'speed'): 1,
('speed', 'and', 'the'): 1,
('the', 'quick', 'person'): 2})
Starting with Python 3.10, the new pairwise function provides a way to slide over pairs of consecutive elements, so your use case simply becomes:
from itertools import pairwise
import re
from collections import Counter
# text = "the quick person did not realize his speed and the quick person bumped "
Counter(pairwise(re.findall(r'\w+', text)))
# Counter({('the', 'quick'): 2, ('quick', 'person'): 2, ('person', 'did'): 1, ('did', 'not'): 1, ('not', 'realize'): 1, ('realize', 'his'): 1, ('his', 'speed'): 1, ('speed', 'and'): 1, ('and', 'the'): 1, ('person', 'bumped'): 1})
Details of the intermediate results:
re.findall(r'\w+', text)
# ['the', 'quick', 'person', 'did', 'not', 'realize', 'his', ...]
pairwise(re.findall(r'\w+', text))
# an iterator yielding ('the', 'quick'), ('quick', 'person'), ('person', 'did'), ...
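On interpreters older than 3.10, a rough stand-in can be sketched with tee and zip, along the lines of the recipe the itertools documentation listed before pairwise became a built-in. The name pairwise_compat is made up here.

```python
from collections import Counter
from itertools import tee

def pairwise_compat(iterable):
    """Yield overlapping pairs: s -> (s0, s1), (s1, s2), (s2, s3), ..."""
    a, b = tee(iterable)
    next(b, None)  # advance the second copy by one; safe on empty input
    return zip(a, b)

words = "the quick person did not realize his speed and the quick person bumped".split()
pair_counts = Counter(pairwise_compat(words))
```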
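It has been a long time since this question was asked and successfully answered. I benefited from the responses when creating my own solution, and I would like to share it: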
import regex
bigrams_tst = regex.findall(r"\b\w+\s\w+", open(myfile).read(), overlapped=True)
This gives all bigrams that are not interrupted by punctuation.
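If the third-party regex module is not available, something similar can probably be approximated with the standard re module by placing the pattern inside a lookahead, so that matches may overlap. This is a different technique from the overlapped=True call, and the function name and sample text are made up for illustration.

```python
import re
from collections import Counter

def overlapping_bigrams(text):
    """Return word pairs separated by exactly one whitespace character."""
    # The lookahead consumes nothing, so the scan resumes inside the previous
    # match and overlapping pairs are found; the group captures the pair.
    return re.findall(r"\b(?=(\w+\s\w+))", text)

sample = "the quick, person did not"
pairs = overlapping_bigrams(sample)
# "quick, person" is skipped because the comma breaks the pair
counts = Counter(pairs)
```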
You can use CountVectorizer from scikit-learn (pip install scikit-learn) to generate bigrams (or, more generally, any n-gram).
Example (tested with Python 3.6.7 and scikit-learn 0.24.2):
import sklearn.feature_extraction.text
ngram_size = 2
train_set = ['the quick person did not realize his speed and the quick person bumped']
vectorizer = sklearn.feature_extraction.text.CountVectorizer(ngram_range=(ngram_size,ngram_size))
vectorizer.fit(train_set) # build ngram dictionary
ngram = vectorizer.transform(train_set) # get ngram
print('ngram: {0}\n'.format(ngram))
print('ngram.shape: {0}'.format(ngram.shape))
print('vectorizer.vocabulary_: {0}'.format(vectorizer.vocabulary_))
Output:
>>> print('ngram: {0}\n'.format(ngram)) # Shows the bi-gram count
ngram: (0, 0) 1
(0, 1) 1
(0, 2) 1
(0, 3) 1
(0, 4) 1
(0, 5) 1
(0, 6) 2
(0, 7) 1
(0, 8) 1
(0, 9) 2
>>> print('ngram.shape: {0}'.format(ngram.shape))
ngram.shape: (1, 10)
>>> print('vectorizer.vocabulary_: {0}'.format(vectorizer.vocabulary_))
vectorizer.vocabulary_: {'the quick': 9, 'quick person': 6, 'person did': 5, 'did not': 1,
'not realize': 3, 'realize his': 7, 'his speed': 2, 'speed and': 8, 'and the': 0,
'person bumped': 4}