NLP word processing and DataFrame unpivoting for a word cloud

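I have a DataFrame with only two columns, ['content_ID'] and ['content'], and I want to convert it into another DataFrame with an extra column that holds the tokenized words of the content, one word per row. Any clues? Thanks in advance.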

import pandas as pd

df = pd.DataFrame({
    'content_ID': ['id_A', 'id_B'],
    'content': ['eating apple', 'i love eat fruits and orange'],
})

After the transformation:

|content_ID |content    |word|
|id_A   |eating apple   |eat|
|id_A   |eating apple   |apple|
|id_B   |i love eat fruits and orange   |i|
|id_B   |i love eat fruits and orange   |love|
|id_B   |i love eat fruits and orange   |eat|
|id_B   |i love eat fruits and orange   |fruit|
|id_B   |i love eat fruits and orange   |and|
|id_B   |i love eat fruits and orange   |orange|

You need to tokenize first, which can be done with str.split. Then you can flatten the lists of tokens into one row per token with explode:

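# Split each content string on whitespace into a list of tokens
df['tokens'] = df['content'].str.split()
# Give each token its own row, repeating the other columns
df = df.explode('tokens').reset_index(drop=True)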

If you print df, this is the output:

  content_ID                       content  tokens
0       id_A                  eating apple  eating
1       id_A                  eating apple   apple
2       id_B  i love eat fruits and orange       i
3       id_B  i love eat fruits and orange    love
4       id_B  i love eat fruits and orange     eat
5       id_B  i love eat fruits and orange  fruits
6       id_B  i love eat fruits and orange     and
7       id_B  i love eat fruits and orange  orange
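
Note that this output keeps "eating" and "fruits" as they are, whereas the table in the question shows "eat" and "fruit". Getting those would take a separate stemming or lemmatization step, which str.split does not do. For the word cloud itself, the steps above could be wrapped up roughly as follows. This is only a sketch: the helper name explode_words, the word column name and the lowercasing step are assumptions made for illustration, not part of the original answer.

```python
import pandas as pd

df = pd.DataFrame({
    "content_ID": ["id_A", "id_B"],
    "content": ["eating apple", "i love eat fruits and orange"],
})


def explode_words(frame, text_col="content", word_col="word"):
    """Hypothetical helper: return a copy of frame with one row per word."""
    out = frame.copy()
    # Lowercase (an assumption, so "Apple" and "apple" count as one word),
    # then split on whitespace into a list of tokens per row.
    out[word_col] = out[text_col].str.lower().str.split()
    # One row per token; the other columns are repeated for each token.
    return out.explode(word_col).reset_index(drop=True)


words = explode_words(df)

# Word frequencies, which is what a word cloud is typically drawn from.
counts = words["word"].value_counts()
```

Here words has the three columns content_ID, content and word, and counts is a Series indexed by word. Because no stemming is applied, "eating" and "eat" are counted as two different words.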
