How to return a list from a pos-tagged column



These are the modules I'm using:

import pandas as pd
from nltk.tag import pos_tag
import re
from collections import defaultdict,Counter
from nltk.stem import WordNetLemmatizer
from datetime import datetime
from tqdm import tqdm
import numpy as np
import os
import nltk
nltk.download('stopwords')
nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize,sent_tokenize

I have a df like this:

df = pd.DataFrame({'comments': ['Daniel is really cool',
                                'Daniel is the most',
                                'We had such a',
                                'Very professional operation',
                                'Lots of bookcases']})

Then I run the following:

df['tokenized'] = df['comments'].apply(word_tokenize)
df['tagged'] = df['tokenized'].apply(pos_tag)
df['lower_tagged'] = df['tokenized'].apply(lambda lt: [word.lower() for word in lt]).apply(pos_tag)

The column I'm interested in is lower_tagged:

0    [(daniel, NN), (is, VBZ), (really, RB), (cool,...
1    [(daniel, NN), (is, VBZ), (the, DT), (most, RBS)]
2         [(we, PRP), (had, VBD), (such, JJ), (a, DT)]
3    [(very, RB), (professional, JJ), (operation, NN)]
4            [(lots, NNS), (of, IN), (bookcases, NNS)]

I'm trying to implement a function that returns a list of the 1000 most common nouns in the lower_tagged column.

The expected result should look something like this:

nouns = ['daniel', 'operation', 'bookcases', 'lots']

One approach I've tried is the following:

lower_tag = df['lower_tagged']
print([t[0] for t in lower_tag if t[1] == 'NN'])

However, this just returns an empty list. Another approach I tried:

def list_nouns(df):
    s = lower_tag
    nouns = [word for word, pos in pos_tag(word_tokenize(s)) if pos.startswith('NN')]
    return nouns

However, I get this error: expected string or bytes-like object

Apologies for the long post; any advice would be much appreciated, as I've been stuck on this for a while! Thanks
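A note on why the two attempts fail (my own diagnosis, not part of the original post): iterating over `df['lower_tagged']` yields each row's whole list of `(word, tag)` tuples, so `t[1]` is a tuple such as `('is', 'VBZ')` and never equals `'NN'`, hence the empty list; and the second attempt passes a Series to `word_tokenize`, which expects a string, hence `expected string or bytes-like object`. A nested comprehension reaches the individual tuples, for example:

```python
import pandas as pd

# reproduces the lower_tagged column from the question
df = pd.DataFrame({'lower_tagged': [
    [('daniel', 'NN'), ('is', 'VBZ'), ('really', 'RB'), ('cool', 'JJ')],
    [('daniel', 'NN'), ('is', 'VBZ'), ('the', 'DT'), ('most', 'RBS')],
    [('we', 'PRP'), ('had', 'VBD'), ('such', 'JJ'), ('a', 'DT')],
    [('very', 'RB'), ('professional', 'JJ'), ('operation', 'NN')],
    [('lots', 'NNS'), ('of', 'IN'), ('bookcases', 'NNS')],
]})

# Iterating the column yields each row's whole list, so a second loop
# is needed to reach the individual (word, tag) tuples.
nouns = [
    word
    for tagged in df['lower_tagged']   # each row: a list of (word, tag) tuples
    for word, tag in tagged            # unpack each tuple
    if tag.startswith('NN')            # NN, NNS, NNP, NNPS
]
print(nouns)  # ['daniel', 'daniel', 'operation', 'lots', 'bookcases']
```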

Create a new DataFrame using explode and tolist, then use loc with a boolean index built via str.startswith, and take value_counts with nlargest to keep only the top words:

top_n_words = 2
new_df = pd.DataFrame(
    df['lower_tagged'].explode().tolist(),
    columns=['word', 'part_of_speech']
)
nouns = (
    new_df.loc[new_df['part_of_speech'].str.startswith('NN'), 'word']
    .value_counts()
    .nlargest(top_n_words)
    .index.tolist()
)

Or explode, then use the str accessor with str.startswith to build a boolean index on the Series, and again take value_counts with nlargest to keep only the top words:

top_n_words = 2
s = df['lower_tagged'].explode()
nouns = (
    s[s.str[-1].str.startswith('NN')].str[0]   # keep noun tags, extract the word
    .value_counts()
    .nlargest(top_n_words)
    .index.tolist()
)

Just change top_n_words to choose how many words you need.

nouns for top_n_words = 2 (only daniel occurs more than once, so the second entry is whichever of the tied count-1 nouns the tie-break returns first):

['daniel', 'bookcases']
