Pandas: split sentences and tag phrases for BIO tagging



I have labeled data like this:

    import pandas as pd

    Data = {'text': ['when can I decrease the contribution to my health savings?', 'I love my guinea pig', 'I love my dog'],
            'start': [43, 10, 10],
            'end': [57, 19, 12],
            'entity': ['hsa', 'pet', 'pet'],
            'value': ['health savings', 'guinea pig', 'dog']}
    df = pd.DataFrame(Data)
       text               start  end         entity     value
0   .. health savings      43    57          hsa      health savings
1   I love my guinea pig   10    19          pet      guinea pig
2   I love my dog          10    12          pet         dog

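I want to split each sentence into words and tag every word. If a word belongs to an entity, it should be tagged with that entity.

I have already tried the approach from this question: Splitting sentences in pandas into sentence number and words

However, that approach only works when the value is a single word such as "dog"; it does not work when the value is a phrase such as "guinea pig".

I want to perform BIO tagging. B marks the beginning of a phrase, I marks the inside of a phrase, and O marks a word outside any phrase.

So the desired output would be: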

    Sentence #  Word         Entity
0   Sentence: 0 when            O
1   Sentence: 0 can             O
2   Sentence: 0 I               O
3   Sentence: 0 decrease        O
4   Sentence: 0 the             O
5   Sentence: 0 contribution    O
6   Sentence: 0 to              O
7   Sentence: 0 my              O
8   Sentence: 0 health          B-hsa
9   Sentence: 0 savings?        I-hsa
10  Sentence: 1 I               O
11  Sentence: 1 love            O
12  Sentence: 1 my              O
13  Sentence: 1 guinea          B-pet
14  Sentence: 1 pig             I-pet
15  Sentence: 2 I               O
16  Sentence: 2 love            O
17  Sentence: 2 my              O
18  Sentence: 2 dog             B-pet

Use:

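import numpy as np

# One row per word, keeping 'value' and 'entity' next to each word
df1 = (df.set_index(['value','entity'], append=True)
         .text.str.split(expand=True)
         .stack()
         .dropna()
         .reset_index(level=3, drop=True)
         .reset_index(name='Word')
         .rename(columns={'level_0':'Sentence'}))
df1['Sentence'] = 'Sentence: ' + df1['Sentence'].astype(str)
# Strip punctuation so that 'savings?' compares equal to 'savings'
w = df1['Word'].str.replace(r'[^\w\s]+', '', regex=True)
splitted = df1.pop('value').str.split()
e = df1.pop('entity')
# m1: the word is the first word of the value
m1 = splitted.str[0].eq(w)
# m2: the word appears anywhere in the value
m2 = np.array([b in a for a, b in zip(splitted, w)])
df1['Entity'] = np.select([m1, m2 & ~m1], ['B-' + e, 'I-' + e],  default='O')

print(df1)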
       Sentence          Word Entity
0   Sentence: 0          when      O
1   Sentence: 0           can      O
2   Sentence: 0             I      O
3   Sentence: 0      decrease      O
4   Sentence: 0           the      O
5   Sentence: 0  contribution      O
6   Sentence: 0            to      O
7   Sentence: 0            my      O
8   Sentence: 0        health  B-hsa
9   Sentence: 0      savings?  I-hsa
10  Sentence: 1             I      O
11  Sentence: 1          love      O
12  Sentence: 1            my      O
13  Sentence: 1        guinea  B-pet
14  Sentence: 1           pig  I-pet
15  Sentence: 2             I      O
16  Sentence: 2          love      O
17  Sentence: 2            my      O
18  Sentence: 2           dog  B-pet

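Explanation

  1. First, create a new DataFrame with DataFrame.set_index, Series.str.split and DataFrame.stack
  2. Clean up the data with DataFrame.reset_index and rename
  3. Prepend the string 'Sentence: ' to the Sentence column
  4. Remove punctuation with Series.str.replace
  5. Use DataFrame.pop to extract the value and entity columns, and split value into lists of words
  6. Create mask m1 by comparing each word with the first element of the split list
  7. Create mask m2 by checking whether each word appears anywhere in the split list
  8. Create the new column with numpy.select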
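
The masks above compare words with the words of value, so a word that also occurs elsewhere in the sentence would be tagged there too. As an alternative sketch (not the method used above), the start column from the question can decide the tags instead. The helper name bio_tag_by_offset is hypothetical. The sketch assumes that start is a 0-based character offset and that the entity covers len(value) characters from there; the end column is not used because the sample data mixes inclusive and exclusive end positions.

```python
import re

import pandas as pd


def bio_tag_by_offset(df):
    """Return one row per whitespace-separated word with a BIO tag.

    Hypothetical helper: assumes `start` is a 0-based character offset and
    that the entity spans len(value) characters from that offset.
    """
    rows = []
    for i, rec in enumerate(df.itertuples(index=False)):
        ent_start = rec.start
        ent_end = rec.start + len(rec.value)  # exclusive end
        inside = False
        for m in re.finditer(r'\S+', rec.text):
            # A word belongs to the entity if its span overlaps the entity span
            if m.start() < ent_end and m.end() > ent_start:
                tag = ('I-' if inside else 'B-') + rec.entity
                inside = True
            else:
                tag = 'O'
            rows.append((f'Sentence: {i}', m.group(), tag))
    return pd.DataFrame(rows, columns=['Sentence', 'Word', 'Entity'])


df = pd.DataFrame({
    'text': ['when can I decrease the contribution to my health savings?',
             'I love my guinea pig',
             'I love my dog'],
    'start': [43, 10, 10],
    'end': [57, 19, 12],
    'entity': ['hsa', 'pet', 'pet'],
    'value': ['health savings', 'guinea pig', 'dog'],
})
tagged = bio_tag_by_offset(df)
print(tagged)
```

Because the tag depends on character positions, trailing punctuation such as the "?" in "savings?" needs no separate cleanup step.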

Step 1: Split the column values on whitespace with the following code:

# One row per word of 'value'
s = df['value'].str.split(' ').apply(pd.Series).stack().dropna()
s.index = s.index.droplevel(-1) # to line up with df's index
s.name = 'value' # needs a name to join
del df['value']
df1 = df.join(s)
df1 = df1.reset_index()

The step above breaks your phrases into individual words.
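
On pandas 0.25 or later, Series.explode gives the same one-row-per-word result without building an intermediate DataFrame. A minimal sketch, assuming the df from the question (only two of its columns are rebuilt here to keep the example self-contained):

```python
import pandas as pd

df = pd.DataFrame({
    'entity': ['hsa', 'pet', 'pet'],
    'value': ['health savings', 'guinea pig', 'dog'],
})

# str.split() yields a list of words per row; explode() turns each list
# element into its own row and repeats the original index label.
words = df['value'].str.split().explode()

# The repeated index lets the words join back to their source rows.
df1 = df.drop(columns='value').join(words).reset_index()
print(df1)
```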

Step 2: df1 now holds the new values in the value column; all that remains is to update the entity column to match the new value.

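prev_text = None
for idx, ser in df1.iterrows():
    # A repeated text means this word continues the previous word's phrase
    prefix = 'I-' if ser['text'] == prev_text else 'B-'
    # Build the tag from the row's own entity instead of a hard-coded label
    df1.loc[idx, 'entity'] = prefix + ser['entity']
    prev_text = ser['text']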

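The code above rewrites the entity field. The logic is that when consecutive rows share the same text, the first row gets a B- tag and the rows that follow get an I- tag.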
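
The same B/I prefixes can be assigned without a Python loop. A sketch, assuming df1 has the layout produced in step 1, with an index column that identifies the source row of each word; the small frame below is a stand-in for it. Grouping on that column rather than on text also keeps two identical consecutive sentences apart.

```python
import pandas as pd

# Stand-in for the df1 built in step 1 (one row per word of `value`)
df1 = pd.DataFrame({
    'index': [0, 0, 1, 1, 2],
    'entity': ['hsa', 'hsa', 'pet', 'pet', 'pet'],
    'value': ['health', 'savings', 'guinea', 'pig', 'dog'],
})

# Position of each word within its phrase: 0 for the first word, then 1, 2, ...
pos = df1.groupby('index').cumcount()

# The first word of a phrase gets 'B-', every later word gets 'I-'
prefix = pos.eq(0).map({True: 'B-', False: 'I-'})
df1['entity'] = prefix + df1['entity']
print(df1)
```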

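Step 3: After this, your DataFrame resembles the one in the link you posted, so you can simply apply the same solution.

The answer above addresses the phrase problem mentioned in your question.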
