I have labeled data like this:
import pandas as pd

Data = {'text': ['when can I decrease the contribution to my health savings?', 'I love my guinea pig', 'I love my dog'],
        'start': [43, 10, 10],
        'end': [57, 19, 12],
        'entity': ['hsa', 'pet', 'pet'],
        'value': ['health savings', 'guinea pig', 'dog']
        }
df = pd.DataFrame(Data)
                    text  start  end entity           value
0      .. health savings     43   57    hsa  health savings
1   I love my guinea pig     10   19    pet      guinea pig
2          I love my dog     10   12    pet             dog
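I want to split each sentence into words and tag every word. If a word belongs to an entity, it should be tagged with that entity.
I have tried the approach from this question: Split sentences in pandas into sentence number and words
That approach only works when the value is a single word such as "dog"; it does not work when the value is a phrase such as "guinea pig".
I want to do BIO tagging: B marks the beginning of a phrase, I marks a word inside a phrase, and O marks a word outside any phrase.
So the desired output would be: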
Sentence # Word Entity
0 Sentence: 0 when O
1 Sentence: 0 can O
2 Sentence: 0 I O
3 Sentence: 0 decrease O
4 Sentence: 0 the O
5 Sentence: 0 contribution O
6 Sentence: 0 to O
7 Sentence: 0 my O
8 Sentence: 0 health B-hsa
9 Sentence: 0 savings? I-hsa
10 Sentence: 1 I O
11 Sentence: 1 love O
12 Sentence: 1 my O
13 Sentence: 1 guinea B-pet
14 Sentence: 1 pig I-pet
15 Sentence: 2 I O
16 Sentence: 2 love O
17 Sentence: 2 my O
18 Sentence: 2 dog B-pet
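To make the BIO scheme concrete for a single sentence, here is a minimal pure-Python sketch. The helper name `bio_tags` is hypothetical, and it assumes exact word matching, so a token with trailing punctuation such as "savings?" would not match "savings" without extra cleaning.

```python
def bio_tags(words, phrase_words, entity):
    """Tag the first occurrence of phrase_words in words with B-/I- tags."""
    tags = ['O'] * len(words)
    n = len(phrase_words)
    if n == 0:
        return tags
    for i in range(len(words) - n + 1):
        if words[i:i + n] == phrase_words:
            tags[i] = 'B-' + entity          # first word of the phrase
            for j in range(i + 1, i + n):
                tags[j] = 'I-' + entity      # remaining words of the phrase
            break
    return tags


sentence_words = 'I love my guinea pig'.split()
sentence_tags = bio_tags(sentence_words, ['guinea', 'pig'], 'pet')
print(list(zip(sentence_words, sentence_tags)))
```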
Use:
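import numpy as np

# One row per word, keeping value and entity next to each word
df1 = (df.set_index(['value', 'entity'], append=True)
         .text.str.split(expand=True)
         .stack()
         .reset_index(level=3, drop=True)
         .reset_index(name='Word')
         .rename(columns={'level_0': 'Sentence'}))
df1['Sentence'] = 'Sentence: ' + df1['Sentence'].astype(str)

# Strip punctuation before comparing words with the entity value
w = df1['Word'].str.replace(r'[^\w\s]+', '', regex=True)
splitted = df1.pop('value').str.split()
e = df1.pop('entity')

m1 = splitted.str[0].eq(w)  # word equals the first word of the value
# word occurs anywhere in the value
m2 = pd.Series([b in a for a, b in zip(splitted, w)], index=df1.index)

df1['Entity'] = np.select([m1, m2 & ~m1], ['B-' + e, 'I-' + e], default='O')
print(df1)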
Sentence Word Entity
0 Sentence: 0 when O
1 Sentence: 0 can O
2 Sentence: 0 I O
3 Sentence: 0 decrease O
4 Sentence: 0 the O
5 Sentence: 0 contribution O
6 Sentence: 0 to O
7 Sentence: 0 my O
8 Sentence: 0 health B-hsa
9 Sentence: 0 savings? I-hsa
10 Sentence: 1 I O
11 Sentence: 1 love O
12 Sentence: 1 my O
13 Sentence: 1 guinea B-pet
14 Sentence: 1 pig I-pet
15 Sentence: 2 I O
16 Sentence: 2 love O
17 Sentence: 2 my O
18 Sentence: 2 dog B-pet
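Explanation:
- First create a new DataFrame with DataFrame.set_index, Series.str.split and DataFrame.stack
- Do some data cleaning with DataFrame.reset_index and rename (DataFrame.rename_axis could also be used to name the index level)
- Prepend the string 'Sentence: ' to the Sentence column
- Use Series.str.replace to remove punctuation
- Use DataFrame.pop to extract the value and entity columns, and split to turn each value into a list of words
- Create mask m1 by comparing each word with the first value of the split list
- Create mask m2 by checking whether each word appears anywhere in the list
- Create the new column with numpy.select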
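One caveat with the membership mask m2: it tags a word whenever it appears in the value, wherever that word sits in the sentence. As an alternative, here is a hedged sketch that uses the character offsets from the question instead. The function name `bio_by_offsets` is hypothetical. It assumes `start` is the 0-based offset of `value` inside `text`, and it derives the end of the span from `len(value)` rather than reading the `end` column.

```python
import re

import pandas as pd


def bio_by_offsets(frame):
    """Return one row per whitespace-separated word with a BIO tag.

    Assumption: `start` is the 0-based character offset of `value` in `text`.
    """
    rows = []
    for sent_id, row in enumerate(frame.itertuples(index=False)):
        span_start = row.start
        span_end = row.start + len(row.value)  # exclusive end of the entity span
        started = False
        for match in re.finditer(r'\S+', row.text):
            # A word belongs to the entity if its characters overlap the span
            overlaps = match.start() < span_end and match.end() > span_start
            if not overlaps:
                tag = 'O'
            elif not started:
                tag = 'B-' + row.entity
                started = True
            else:
                tag = 'I-' + row.entity
            rows.append((f'Sentence: {sent_id}', match.group(), tag))
    return pd.DataFrame(rows, columns=['Sentence', 'Word', 'Entity'])


sample = pd.DataFrame({
    'text': ['when can I decrease the contribution to my health savings?',
             'I love my guinea pig', 'I love my dog'],
    'start': [43, 10, 10],
    'entity': ['hsa', 'pet', 'pet'],
    'value': ['health savings', 'guinea pig', 'dog'],
})
tagged = bio_by_offsets(sample)
print(tagged)
```

Because the tag depends on position rather than on the word itself, a repeated word outside the span stays O, and punctuation attached to a word (as in "savings?") needs no separate cleanup.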
Step 1: Split the column values on whitespace with the following code:
# Split each value on spaces and stack the pieces, one word per row
s = df['value'].str.split(' ', expand=True).stack()
s.index = s.index.droplevel(-1)  # to line up with df's index
s.name = 'value'  # needs a name to join
del df['value']
df1 = df.join(s)
df1 = df1.reset_index()
The step above breaks your phrases into single words.
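On pandas 0.25 or later, the same one-word-per-row result can probably be reached more directly with DataFrame.explode. This is a sketch on a reduced two-row frame; the names `small` and `words` are made up for the example.

```python
import pandas as pd

small = pd.DataFrame({
    'text': ['I love my guinea pig', 'I love my dog'],
    'entity': ['pet', 'pet'],
    'value': ['guinea pig', 'dog'],
})

# Turn each value into a list of words, then give every word its own row.
# reset_index keeps the original row number in a column named 'index'.
words = (small.assign(value=small['value'].str.split())
              .explode('value')
              .reset_index())
print(words)
```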
Step 2: df1 now holds the new value column; all that remains is to update the entity column to match the new value column.
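prev_id = 'x'
for idx, ser in df1.iterrows():
    if ser['text'] == prev_id:
        # same sentence as the previous row: continuation of the phrase
        df1.loc[idx, 'entity'] = 'I-' + ser['entity']
    else:
        # first word of the phrase (prefix built from the row's own entity
        # instead of a hard-coded 'HSA')
        df1.loc[idx, 'entity'] = 'B-' + ser['entity']
    prev_id = ser['text']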
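The code above updates the entity field. The logic is that, among consecutive rows with the same text, the first row gets the B- prefix and the following rows get the I- prefix.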
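The same rule can also be written without an explicit loop. This is a sketch on a small hand-built frame shaped like the output of step 1; the names `demo`, `is_first` and `prefix` are made up for the example.

```python
import pandas as pd

demo = pd.DataFrame({
    'text': ['I love my guinea pig', 'I love my guinea pig', 'I love my dog'],
    'entity': ['pet', 'pet', 'pet'],
    'value': ['guinea', 'pig', 'dog'],
})

# A row starts a phrase when its text differs from the previous row's text
is_first = demo['text'].ne(demo['text'].shift())
prefix = is_first.map({True: 'B-', False: 'I-'})
demo['entity'] = prefix + demo['entity']
print(demo)
```

Like the loop, this relies on rows from the same sentence being adjacent, and on each sentence carrying a single entity.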
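Step 3: After this, your DataFrame has the same shape as the one in the link you posted, so you can apply the same solution.
The answer above addresses the phrase problem mentioned in your question.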