How to tokenize a text column in a dataframe using NLTK



My df looks like this:

team_name   text
---------   ----
red         this is text from red team
blue        this is text from blue team
green       this is text from green team
yellow      this is text from yellow team

I am trying to get this:

team_name   text                             text_token
---------   ----                             ----------
red         this is text from red team       'this', 'is', 'text', 'from', 'red','team'
blue        this is text from blue team      'this', 'is', 'text', 'from', 'blue','team'
green       this is text from green team     'this', 'is', 'text', 'from', 'green','team'
yellow      this is text from yellow team    'this', 'is', 'text', 'from', 'yellow','team'

What I have tried:

df['text_token'] = nltk.word_tokenize(df['text'])

But this does not work. How can I get the result I want? Is it also possible to compute a frequency distribution?

Stack Overflow has several examples you can look at.

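This question has already been answered in the linked post: how to use word_tokenize in a data frame. nltk.word_tokenize expects a single string, not a whole column, so it has to be applied row by row: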

df['text_token'] = df.apply(lambda row: nltk.word_tokenize(row['text']), axis=1)
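For the frequency distribution part of the question, here is a minimal sketch rather than a definitive solution. It rebuilds the example df and tokenizes with nltk.tokenize.wordpunct_tokenize, a regex-based tokenizer swapped in here only so the snippet runs without downloading the punkt data that nltk.word_tokenize needs; if punkt is installed, nltk.word_tokenize can be used in its place. The column name text_freq and the variable overall_freq are illustrative names, not part of the original question.

```python
import pandas as pd
from nltk import FreqDist
from nltk.tokenize import wordpunct_tokenize

# Rebuild the example dataframe from the question.
df = pd.DataFrame({
    "team_name": ["red", "blue", "green", "yellow"],
    "text": [
        "this is text from red team",
        "this is text from blue team",
        "this is text from green team",
        "this is text from yellow team",
    ],
})

# Series.apply calls the tokenizer once per string in the column.
# Assumption: wordpunct_tokenize stands in for nltk.word_tokenize here.
df["text_token"] = df["text"].apply(wordpunct_tokenize)

# Per-row frequency distribution (hypothetical column name).
df["text_freq"] = df["text_token"].apply(FreqDist)

# Frequency distribution over every token in the column.
overall_freq = FreqDist(
    token for tokens in df["text_token"] for token in tokens
)

print(df[["team_name", "text_token"]])
print(overall_freq.most_common(5))
```

Applying the function to the single column with df["text"].apply(...) avoids building a row object for each call, which is usually simpler than df.apply(..., axis=1) when only one column is involved.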
