Python: count how many times each word occurs with each POS tag



The file /content/try list.txt contains:

DT The NNP Fulton NNP County NNP Grand NNP Jury VBD said NNP Friday 0 DT an NN investigation IN of NNP Atlanta POS 's JJ recent JJ primary NN election VBD produced DT no NN evidence '' '' IN that DT any NNS irregularities VBD took NN place . . DT The NN jury RB further VBD said IN in JJ term-end NNS presentments IN that DT the NNP City NNP Executive

fname = open('/content/try list.txt', "r")
counts = dict()
for line in fname:
    words = line.split()
    for word in words:
        if word not in counts:
            counts[word] = 1
        else:
            counts[word] += 1
print(counts)
"""
Output: {
'DT': 10, 'The': 2, 'NNP': 11, 'Fulton': 1, 'County': 1,
'Grand': 1, 'Jury': 1, 'VBD': 5, 'said': 2, 'Friday': 1,
'0': 1, 'an': 1, 'NN': 9, 'investigation': 1, 'IN': 8,
'of': 4, 'Atlanta': 2, 'POS': 1, "'s": 1, 'JJ': 4, 'recent': 1,
}
"""

This counts the occurrences of every word and every tag separately, but how can I get the tag counts word-wise?

The expected output should be:

The-->DT:48, Fulton--> NNP:28

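If you want to count how many times a word occurs with a specific POS, you need to iterate over the POS tag and the word together. You also need a more elaborate data structure, for example a dictionary of words in which each value is a dictionary of POS tags, so that you get word -> pos -> count.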

with open('/content/try list.txt', "r") as fname:
    # If all your document is in one file you don't need to do 'for line in fname'
    words = fname.read().split()
counts = dict()
# range(0, len(words), 2) will be [0, 2, 4, 6, ...]
for i in range(0, len(words), 2):
    pos = words[i]
    word = words[i+1]
    # Ensure word is in counts
    if word not in counts:
        counts[word] = dict()
    # Ensure pos is in counts[word]
    if pos not in counts[word]:
        counts[word][pos] = 0
    # Actual counting !
    counts[word][pos] += 1
print(counts)
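
The loop above relies on the tokens strictly alternating between tag and word. The sample in the question contains a lone `0` after `Friday`, which would shift every later pair by one. Below is a minimal sketch of one way to guard against that, assuming the bare `0` is a placeholder with no word attached (an assumption, not something the question states). The shortened `sample` string is made up from the question's data for illustration.

```python
from collections import defaultdict

# Shortened sample in the same format as the file; note the lone "0" after "Friday".
sample = "DT The NNP Fulton VBD said NNP Friday 0 DT an NN investigation DT The NN jury"

# Assumption: a bare "0" token has no word attached, so it is dropped whole.
tokens = [tok for tok in sample.split() if tok != "0"]

# Fail loudly if the remaining tokens cannot form complete tag/word pairs.
if len(tokens) % 2 != 0:
    raise ValueError("tokens do not form complete tag/word pairs")

counts = defaultdict(lambda: defaultdict(int))
# tokens[::2] are the tags, tokens[1::2] are the words that follow them
for pos, word in zip(tokens[::2], tokens[1::2]):
    counts[word][pos] += 1

# Convert the inner defaultdicts to plain dicts for a readable printout
print({word: dict(tags) for word, tags in counts.items()})
```

Filtering whole tokens leaves words that merely contain a `0` untouched.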

You can also use a defaultdict, which removes the need to check whether a key exists!

from collections import defaultdict
# words is the token list read from the file in the previous snippet
counts = defaultdict(lambda: defaultdict(int))
# range(0, len(words), 2) will be [0, 2, 4, 6, ...]
for i in range(0, len(words), 2):
    pos = words[i]
    word = words[i+1]
    # Actual counting !
    counts[word][pos] += 1
print(counts)  # This won't print as pretty but has the same result
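
To get from the nested word -> pos -> count structure to the `The-->DT:48` style the question asks for, something like the following sketch could work. The helper name `format_counts` and the small `counts` dictionary are hypothetical and only serve the illustration.

```python
# Hypothetical nested counts in the word -> pos -> count shape.
counts = {"The": {"DT": 2}, "Fulton": {"NNP": 1}, "said": {"VBD": 2}}

def format_counts(counts):
    """Render every (word, pos, count) triple as 'word-->pos:count'."""
    parts = []
    for word, tags in counts.items():
        for pos, n in tags.items():
            parts.append(f"{word}-->{pos}:{n}")
    return ", ".join(parts)

print(format_counts(counts))
# The-->DT:2, Fulton-->NNP:1, said-->VBD:2
```

The same function accepts the nested defaultdict as well, since it only iterates over `.items()`.
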
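
from collections import Counter

with open('/content/try list.txt', "r") as f:
    text = f.read()
# Remove every '' , . and 0 from the text (this also strips them from inside tokens)
cleantxt = text.replace("''", "").replace(".", "").replace("0", "").split()
# cleantxt[1::2] are the words, cleantxt[::2] are their tags
counts = Counter(zip(cleantxt[1::2], cleantxt[::2]))
print(counts)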

Output:

Counter({('The', 'DT'): 2,
('Fulton', 'NNP'): 1,
('County', 'NNP'): 1,
('Grand', 'NNP'): 1,
('Jury', 'NNP'): 1,
('said', 'VBD'): 2,
('Friday', 'NNP'): 1,
('an', 'DT'): 1,
('investigation', 'NN'): 1,
('of', 'IN'): 1,....
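
A Counter keyed by `(word, tag)` tuples is flat, so looking up all tags of one word means scanning every key. If a per-word view is wanted, the pairs can be regrouped afterwards. This is a small sketch with a hypothetical `pair_counts` in the same shape as the output above; `by_word` is likewise just an illustrative name.

```python
from collections import Counter, defaultdict

# Hypothetical flat counts in the (word, tag) -> count shape.
pair_counts = Counter({("The", "DT"): 2, ("said", "VBD"): 2, ("Fulton", "NNP"): 1})

# Regroup into word -> {tag: count}
by_word = defaultdict(dict)
for (word, pos), n in pair_counts.items():
    by_word[word][pos] = n

for word, tags in by_word.items():
    print(word, "-->", tags)
```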
