One-hot encoding a DataFrame string column that holds multiple values



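I have a DataFrame "df1" with 1245 rows and two columns, text (object dtype) and topic (object dtype). The topic column contains different numbers that correspond to the labels of the text. Here is an example: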

text                                                topic
1207    June 2019: The French Facility for Global Envi...   3 12 7
1208    May 2019: Participants from multi-stakeholder ...   8
1209    2 July 2019: UN Member States have reached agr...   1 7
1210    30 June 2019: The G20 Leaders’ Summit and asso...   7 8 9 11 12 13 14 15 17

I would like to obtain a one-hot encoded form like this (with an 'S' added in front of each number in the column names):

text                                                S1  S2  S3 ..... S7  S8 S9 etc.
1207    June 2019: The French Facility for Global Envi...   0    0   1  ..... 1   0  0
1208    May 2019: Participants from multi-stakeholder ...   0    0   0 ...... 0   1  0
1209    2 July 2019: UN Member States have reached agr...   1    0   0  ..... 1   0  0
1210    30 June 2019: The G20 Leaders’ Summit and asso...   0    0   0  ......1   1  1

The "difficulty" here is that my texts are multi-label, so simple one-hot encoding code does not fit my case. Any ideas?

If you want to use only pandas, you can do it like this:

import pandas as pd

data = [['June 2019: The French Facility for Global Envi...', '3 12 7'],
        ['May 2019: Participants from multi-stakeholder ...', '8'],
        ['2 July 2019: UN Member States have reached agr...', '1 7'],
        ['30 June 2019: The G20 Leaders’ Summit and asso...', '7 8 9 11 12 13 14 15 17']]
df = pd.DataFrame(data, columns=['text', 'topic'])
# creating list of strings where each value is one number out of topic column
unique_values = ' '.join(df['topic'].values.tolist()).split(' ')
# creating new column for each value in unique_values
for number in unique_values:
    df[f'S{number}'] = 0

# changing 0 to 1 for every Snumber column where topic contains number
for idx, row in df.iterrows():
    for number in row['topic'].split(' '):
        df.loc[idx, f'S{number}'] = 1
df.drop('topic', axis=1, inplace=True)

Result:

text                                                S3  S12 S7  S8  S1  S9  S11 S13 S14 S15 S17
0   June 2019: The French Facility for Global Envi...   1   1   1   0   0   0   0   0   0   0   0
1   May 2019: Participants from multi-stakeholder ...   0   0   0   1   0   0   0   0   0   0   0
2   2 July 2019: UN Member States have reached agr...   0   0   1   0   1   0   0   0   0   0   0
3   30 June 2019: The G20 Leaders’ Summit and asso...   0   1   1   1   0   1   1   1   1   1   1
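The two loops above can probably be replaced by pandas' built-in `Series.str.get_dummies`, which splits each string on a separator and returns one indicator column per distinct token. The following is only a sketch under assumptions: the `text` values are short placeholders rather than the original articles, and `encoded` is a name introduced here for illustration. The columns are sorted numerically so that they come out as S1, S3, S7, ... instead of in order of first appearance.

```python
import pandas as pd

# Placeholder texts (assumption); the topic strings are the ones from the question.
df = pd.DataFrame(
    {
        "text": ["first text", "second text", "third text", "fourth text"],
        "topic": ["3 12 7", "8", "1 7", "7 8 9 11 12 13 14 15 17"],
    }
)

# One 0/1 column per distinct token found in the space-separated strings.
dummies = df["topic"].str.get_dummies(sep=" ")

# The column labels are strings such as "1", "11", "12"; order them as numbers,
# then add the "S" prefix asked for in the question.
dummies = dummies[sorted(dummies.columns, key=int)].add_prefix("S")

# Put the text column back next to the indicator columns.
encoded = pd.concat([df[["text"]], dummies], axis=1)
print(encoded)
```

Only labels that actually occur in the data get a column with this approach, which is also true of the loop-based version.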

With slightly modified data (for readability reasons...):

from io import StringIO

import numpy as np
import pandas as pd

s = """id,text,topic
1207,One,1 2 5
1208,Two,3
1209,Three,1 4
1210,Four,1 2 3"""
df = pd.read_csv(StringIO(s))
# turn each topic string into a list of ints, e.g. "1 2 5" -> [1, 2, 5]
df.topic = df.topic.str.split(' ').apply(lambda x: [int(y) for y in x])
# one row per record, one column per label 0..max (column 0 stays unused)
b = np.zeros((df.topic.size, max(max(x) for x in df.topic) + 1))
for i in df.index:
    # set all columns listed in this row's topics to 1
    b[i, df.topic[i]] = 1
idx = {'id': df.id, 'text': df.text}
idx.update({f'S{i}': b[:, i] for i in range(1, b.shape[1])})
df = pd.DataFrame(idx)
print(df.set_index('id').to_markdown())

This gives you:

|   id | text   |   S1 |   S2 |   S3 |   S4 |   S5 |
|-----:|:-------|-----:|-----:|-----:|-----:|-----:|
| 1207 | One    |    1 |    1 |    0 |    0 |    1 |
| 1208 | Two    |    0 |    0 |    1 |    0 |    0 |
| 1209 | Three  |    1 |    0 |    0 |    1 |    0 |
| 1210 | Four   |    1 |    1 |    1 |    0 |    0 |
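Because the array is sized from 1 up to the largest label, this version also produces an all-zero column for any label in that range that never occurs (like S2 in the question's desired output). A similar effect can be sketched with `Series.str.get_dummies` followed by `reindex` over a fixed label range. This is a hedged sketch, not a definitive implementation: `n_topics = 6` is an assumed total number of labels chosen only for illustration, and `dummies` is a name introduced here.

```python
import pandas as pd

df = pd.DataFrame(
    {
        "id": [1207, 1208, 1209, 1210],
        "text": ["One", "Two", "Three", "Four"],
        "topic": ["1 2 5", "3", "1 4", "1 2 3"],
    }
)

n_topics = 6  # assumption: the full label set is 1..6, even though 6 never occurs

# One 0/1 column per token that occurs in the space-separated topic strings.
dummies = df["topic"].str.get_dummies(sep=" ")

# Convert the string labels to ints so they can be aligned with a numeric range.
dummies.columns = [int(c) for c in dummies.columns]

# Force a column for every label in 1..n_topics; missing labels become all zeros.
dummies = dummies.reindex(columns=range(1, n_topics + 1), fill_value=0).add_prefix("S")

print(pd.concat([df[["id", "text"]], dummies], axis=1))
```

Fixing the column set up front keeps the output shape stable when the same encoding is applied to different subsets of the data.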
