NLP word processing and dataframe unpivoting for a word cloud



I have a dataframe with only two columns, ['content_ID'] and ['content'], and I want to transform it into another dataframe that has an additional column holding the tokenized content. Any hints? Thanks in advance.

import pandas as pd

df = {'content_ID': ['id_A', 'id_B'],
      'content': ['eating apple', 'i love eat fruits and orange']
     }
df = pd.DataFrame(df)

After the transformation:

|content_ID|content|word|
|----------|-------|----|
|id_A|eating apple|eat|
|id_A|eating apple|apple|
|id_B|i love eat fruits and orange|i|
|id_B|i love eat fruits and orange|love|
|id_B|i love eat fruits and orange|eat|
|id_B|i love eat fruits and orange|fruit|
|id_B|i love eat fruits and orange|and|
|id_B|i love eat fruits and orange|orange|

First you need to tokenize, which you can do with str.split. Then, using explode, you can flatten the token lists:

df['tokens'] = df['content'].str.split()
df = df.explode('tokens').reset_index(drop=True)

If you print df, this is the output:

content_ID                       content  tokens
0       id_A                  eating apple  eating
1       id_A                  eating apple   apple
2       id_B  i love eat fruits and orange       i
3       id_B  i love eat fruits and orange    love
4       id_B  i love eat fruits and orange     eat
5       id_B  i love eat fruits and orange  fruits
6       id_B  i love eat fruits and orange     and
7       id_B  i love eat fruits and orange  orange
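Note that the expected table above also shows lemmatized tokens ("eat" for "eating", "fruit" for "fruits"), which plain str.split will not produce. Below is a minimal sketch of a possible follow-up step that lemmatizes the exploded tokens with NLTK's WordNetLemmatizer; the NLTK dependency, the double lemmatize call, and the final word column name are assumptions on my part, not part of the original answer:

import pandas as pd
import nltk
from nltk.stem import WordNetLemmatizer

# One-time download of the WordNet corpus
# (depending on the NLTK version, 'omw-1.4' may also be needed).
nltk.download('wordnet')

df = pd.DataFrame({
    'content_ID': ['id_A', 'id_B'],
    'content': ['eating apple', 'i love eat fruits and orange'],
})

# Tokenize and flatten, exactly as in the answer above.
df['tokens'] = df['content'].str.split()
df = df.explode('tokens').reset_index(drop=True)

# Lemmatize each token. Lemmatizing as a verb first maps "eating" -> "eat";
# the second (default, noun) pass maps plurals such as "fruits" -> "fruit".
# A more robust approach would use real part-of-speech tags per token.
lemmatizer = WordNetLemmatizer()
df['word'] = df['tokens'].apply(
    lambda t: lemmatizer.lemmatize(lemmatizer.lemmatize(t, pos='v'))
)

print(df[['content_ID', 'content', 'word']])

From there, if the word cloud mentioned in the title is the end goal, the word column can be counted (for example with df['word'].value_counts().to_dict()) and passed to the wordcloud package via WordCloud().generate_from_frequencies(...).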
