Counting occurrences of a POS tag pattern

I have applied POS tagging to one of the columns in my dataframe. For each sentence, I want to count the occurrences of this pattern: NNP, MD, VB.

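For example, I have the following sentence: communications between the Principal and the Contractor shall be in the English language

The POS tags are: (communications, NNS), (between, IN), (the, DT), (Principal, NNP), (and, CC), (the, DT), (Contractor, NNP), (shall, MD), (be, VB), (in, DT), (the, DT), (English, JJ), (language, NN).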

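Note that in the POS tagging result, the pattern (NNP, MD, VB) is present and occurs once. I would like to create a new column in df holding this occurrence count.

Any ideas?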

Thanks in advance

A simple counter function will do what you want!

Input:

import pandas as pd

df = pd.DataFrame({'POS': [
    '(communications, NNS), (between,IN), (the, DT), (Principal, NNP), (and, CC), (the, DT), (Contractor, NNP), (shall, MD), (be,VB), (in, DT), (the, DT), (English, JJ), (language, NN)',
    '(Contractor, NNP), (shall, MD), (be,VB), (communications, NNS), (between,IN), (the, DT), (Principal, NNP), (and, CC), (the, DT), (Contractor, NNP), (shall, MD), (be,VB), (in, DT), (the, DT), (English, JJ), (language, NN)',
    '(and, CC), (the, DT)',
]})

Function:

def counter(pos):
    # Split the string into "(word, tag)" items and collect words and tags.
    words, tags = [], []
    for item in pos.split('), ('):
        temp = item.strip(' )(')
        word, tag = temp.split(',')[0], temp.split(',')[-1].strip()
        words.append(word)
        tags.append(tag)
    length = len(tags)
    if length < 3:
        return 0
    # Count every run of three consecutive tags equal to the pattern.
    count = 0
    for idx in range(length):
        if tags[idx:idx+3] == ['NNP', 'MD', 'VB']:
            count += 1
    return count

Output:

df['occ'] = df['POS'].apply(counter)
df
                                                 POS  occ
0  (communications, NNS), (between,IN), (the, DT)...    1
1  (Contractor, NNP), (shall, MD), (be,VB), (comm...    2
2                               (and, CC), (the, DT)    0
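
If your column holds the tagger output as actual lists of (word, tag) tuples rather than strings, the string parsing can probably be skipped. Here is a minimal sketch under that assumption; the name count_tag_pattern and its pattern parameter are my own for illustration and not part of any library:

```python
def count_tag_pattern(tagged, pattern=("NNP", "MD", "VB")):
    """Count how often `pattern` occurs as a run of consecutive tags.

    `tagged` is assumed to be a list of (word, tag) tuples.
    """
    tags = [tag for _word, tag in tagged]
    target = list(pattern)
    n = len(target)
    if n == 0:
        return 0
    # Slide a window of len(pattern) over the tags and count exact matches.
    return sum(tags[i:i + n] == target for i in range(len(tags) - n + 1))


sentence = [
    ("communications", "NNS"), ("between", "IN"), ("the", "DT"),
    ("Principal", "NNP"), ("and", "CC"), ("the", "DT"),
    ("Contractor", "NNP"), ("shall", "MD"), ("be", "VB"),
    ("in", "DT"), ("the", "DT"), ("English", "JJ"), ("language", "NN"),
]
print(count_tag_pattern(sentence))  # 1
```

With a dataframe this could be applied the same way as above, e.g. df['occ'] = df['tagged'].apply(count_tag_pattern), where 'tagged' is a hypothetical column of tuple lists. Passing a different pattern tuple lets you count other tag sequences.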
