How to split sentences into sentence ID, words, and labels with pandas

I want to convert a pandas DataFrame into a format that can be used by an NER model.

I have a pandas DataFrame that looks like this:

```
Sentence_id    Sentence                                                       labels
1              Did not  enjoy the new Windows 8 and touchscreen functions.    Windows 8
1              Did not  enjoy the new Windows 8 and touchscreen functions.    touchscreen functions
```

Is it possible to convert it into the following format?

```
Sentence_id    words          labels                                                       
1              Did            O
1              not            O
1              enjoy          O
1              the            O
1              new            O
1              Windows        B
1              8              I
1              and            O
1              touchscreen    B
1              functions      I
1              .              O
```

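The first word of a label should be tagged "B" (beginning), and the following words of that label should be tagged "I" (inside). All other words and punctuation marks should be tagged "O" (outside).

The solution is a bit long, but you can do it with df.iterrows().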

```
import string
import pandas as pd

## Assuming name of your dataframe is df

def get_label(word, labels):
    # First word of the label is 'B', later words are 'I', anything else is 'O'
    if word == labels[0]:
        return 'B'
    elif word in labels:
        return 'I'
    else:
        return 'O'

# Maps each word to [Sentence_id, tag]. Because the key is the word itself,
# a word that occurs more than once keeps only a single entry.
data = {}
exclude = set(string.punctuation)
for _, row in df.iterrows():
    # Strip punctuation before splitting so that "functions." becomes "functions"
    words = ''.join(ch for ch in row['Sentence'] if ch not in exclude).split()
    # Keep every punctuation character as its own token
    puncts = [ch for ch in row['Sentence'] if ch in exclude]
    labels = row['labels'].split()
    for word in words:
        if word in data:
            # Word already seen in an earlier row: only upgrade it if this row labels it
            if word in labels:
                data[word][1] = get_label(word, labels)
        else:
            data[word] = [row['Sentence_id'], get_label(word, labels)]
    for punct in puncts:
        data[punct] = [row['Sentence_id'], 'O']

## Processing the dictionary to input into dataframe
ids = []
words = []
labels = []
for key, val in data.items():
    words.append(key)
    ids.append(val[0])
    labels.append(val[1])
new_df = pd.DataFrame({'Sentence_id': ids, 'words': words, 'labels': labels})
new_df
```
Output:

```
    Sentence_id  words        labels
0   1            Did          O
1   1            .            O
2   1            not          O
3   1            enjoy        O
4   1            the          O
5   1            new          O
6   1            Windows      B
7   1            8            I
8   1            and          O
9   1            touchscreen  B
10  1            functions    I
```
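
The dictionary in the answer above is keyed by word, so a word that appears twice in a sentence, or in more than one sentence, ends up as a single row, and the row order depends on dictionary ordering rather than on the sentence. As a possible alternative, here is a minimal sketch that keeps tokens in sentence order and matches each label as a sequence of tokens. The helper names `tokenize` and `bio_tags` are introduced here for illustration only, the regex tokenizer is an assumption (a real NER pipeline may tokenize differently), and the example frame is rebuilt from the data shown in the question.

```python
import re
import pandas as pd

# Example data copied from the question: one row per (sentence, label) pair.
df = pd.DataFrame({
    "Sentence_id": [1, 1],
    "Sentence": ["Did not  enjoy the new Windows 8 and touchscreen functions."] * 2,
    "labels": ["Windows 8", "touchscreen functions"],
})

def tokenize(text):
    # Hypothetical tokenizer: runs of word characters, or single punctuation marks.
    return re.findall(r"\w+|[^\w\s]", text)

def bio_tags(tokens, entities):
    # Start with everything tagged 'O', then mark every occurrence of each entity.
    tags = ["O"] * len(tokens)
    for entity in entities:
        ent = tokenize(entity)
        n = len(ent)
        if n == 0:
            continue
        for i in range(len(tokens) - n + 1):
            if tokens[i:i + n] == ent:
                tags[i] = "B"
                tags[i + 1:i + n] = ["I"] * (n - 1)
    return tags

rows = []
# Group the rows that belong to the same sentence and collect all of its labels.
for (sid, sentence), group in df.groupby(["Sentence_id", "Sentence"], sort=False):
    tokens = tokenize(sentence)
    tags = bio_tags(tokens, group["labels"].tolist())
    rows.extend(
        {"Sentence_id": sid, "words": w, "labels": t} for w, t in zip(tokens, tags)
    )

new_df = pd.DataFrame(rows, columns=["Sentence_id", "words", "labels"])
print(new_df)
```

With this approach the final period stays at the end of the sentence and repeated words get one row each. If two labels overlap in the text, the one processed later overwrites the earlier tags, so overlapping entities would need extra handling.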
