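I have a DataFrame with only two columns, ['content_ID'] and ['content'], and I want to convert it into another DataFrame with one extra column that holds the tokens of the content. Any clues? Thanks in advance.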
import pandas as pd

df = {'content_ID': ['id_A', 'id_B'],
      'content': ['eating apple', 'i love eat fruits and orange']
      }
df = pd.DataFrame(df)
After the transformation:
|content_ID |content |word|
|id_A |eating apple |eat|
|id_A |eating apple |apple|
|id_B |i love eat fruits and orange |i|
|id_B |i love eat fruits and orange |love|
|id_B |i love eat fruits and orange |eat|
|id_B |i love eat fruits and orange |fruit|
|id_B |i love eat fruits and orange |and|
|id_B |i love eat fruits and orange |orange|
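First you need to tokenize the text, which can be done with str.split. Then you can use explode to flatten each list of tokens into one row per token: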
df['tokens'] = df['content'].str.split()
df = df.explode('tokens').reset_index(drop=True)
If you print df, this is the output:
content_ID content tokens
0 id_A eating apple eating
1 id_A eating apple apple
2 id_B i love eat fruits and orange i
3 id_B i love eat fruits and orange love
4 id_B i love eat fruits and orange eat
5 id_B i love eat fruits and orange fruits
6 id_B i love eat fruits and orange and
7 id_B i love eat fruits and orange orange
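
For convenience, the two steps can be wrapped into a small helper. This is only a sketch: the function name `tokenize_rows`, its parameters, and the lowercasing step are my own assumptions and not part of the question, and the new column is named `word` to match the header in the desired output.

```python
import pandas as pd


def tokenize_rows(frame, text_col="content", token_col="word"):
    """Return a copy of `frame` with one row per whitespace-separated token.

    Hypothetical helper: the name and parameters are illustrative only.
    """
    out = frame.copy()
    # Assumption: lowercase before splitting so that "I" and "i" give the same token.
    out[token_col] = out[text_col].str.lower().str.split()
    # explode turns each list element into its own row and repeats the other columns.
    return out.explode(token_col).reset_index(drop=True)


df = pd.DataFrame({
    "content_ID": ["id_A", "id_B"],
    "content": ["eating apple", "i love eat fruits and orange"],
})
result = tokenize_rows(df)
print(result)
```

Note that plain whitespace splitting keeps the words as written ("eating", "fruits"). Getting the base forms shown in the question ("eat", "fruit") would need a separate stemming or lemmatization step, which is not covered here.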