I want to convert a pandas DataFrame into a format that can be used by an NER model.
I have a pandas DataFrame that looks like this:
```
Sentence_id  Sentence                                                    labels
1            Did not enjoy the new Windows 8 and touchscreen functions.  Windows 8
1            Did not enjoy the new Windows 8 and touchscreen functions.  touchscreen functions
```
Is it possible to convert it into the following format?
```
Sentence_id words labels
1 Did O
1 not O
1 enjoy O
1 the O
1 new O
1 Windows B
1 8 I
1 and O
1 touchscreen B
1 functions I
1 . O
```
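The first word of a label should be tagged "B" (beginning), and the following words of the label should be tagged "I" (inside). All other words and punctuation should be tagged "O" (outside).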
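To make the tagging rule concrete, here is a minimal plain-Python sketch of it for a single, already tokenized sentence. The function name `bio_tags` and the idea of passing each label as a list of tokens are assumptions made for illustration; they are not part of the original question.

```python
def bio_tags(tokens, phrases):
    """Return one B/I/O tag per token, given label phrases as token lists."""
    tags = ["O"] * len(tokens)
    for phrase in phrases:
        n = len(phrase)
        if n == 0:
            continue
        # Slide a window over the sentence and tag every exact match.
        for start in range(len(tokens) - n + 1):
            if tokens[start:start + n] == phrase:
                tags[start] = "B"
                for i in range(start + 1, start + n):
                    tags[i] = "I"
    return tags


tokens = ["Did", "not", "enjoy", "the", "new", "Windows", "8",
          "and", "touchscreen", "functions", "."]
phrases = [["Windows", "8"], ["touchscreen", "functions"]]
print(list(zip(tokens, bio_tags(tokens, phrases))))
```

Matching whole phrases rather than single words means a word is only tagged "I" when it directly follows the start of its label.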
The solution is a bit long, but you can do it with df.iterrows():
```
import string

import pandas as pd

# Assumes your DataFrame is named df.

def get_label(word, labels):
    # The first word of the label is 'B', later words are 'I', anything else is 'O'.
    if word == labels[0]:
        return 'B'
    elif word in labels:
        return 'I'
    else:
        return 'O'

# Maps each token to [Sentence_id, tag]. Because the key is the token itself,
# a word that occurs more than once is stored only once.
data = {}
exclude = set(string.punctuation)
for _, row in df.iterrows():
    # Strip punctuation before splitting the sentence into words.
    words = ''.join(ch for ch in row['Sentence'] if ch not in exclude).split()
    # Keep each punctuation character as its own token.
    puncts = [ch for ch in row['Sentence'] if ch in exclude]
    labels = row['labels'].split()
    for word in words:
        if word in data:
            # Seen in an earlier row: only update the tag if this row labels it.
            if word in labels:
                data[word][1] = get_label(word, labels)
        else:
            data[word] = [row['Sentence_id'], get_label(word, labels)]
    for punct in puncts:
        data[punct] = [row['Sentence_id'], 'O']

# Turn the dictionary into columns for the new DataFrame.
ids = []
words = []
labels = []
for key, val in data.items():
    words.append(key)
    ids.append(val[0])
    labels.append(val[1])
new_df = pd.DataFrame({'Sentence_id': ids, 'words': words, 'labels': labels})
new_df
```
```
    Sentence_id        words labels
0             1          Did      O
1             1            .      O
2             1          not      O
3             1        enjoy      O
4             1          the      O
5             1          new      O
6             1      Windows      B
7             1            8      I
8             1          and      O
9             1  touchscreen      B
10            1    functions      I
```
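Since the dictionary above is keyed by the word itself, repeated words collapse into one row and the original token order is not guaranteed to survive. Below is a hedged alternative sketch that groups rows by Sentence_id and tags tokens by position instead. The helper names `tokenize` and `tag_sentence`, the regex-based tokenizer, and the result name `ner_df` are my own assumptions, not part of the answer above; it also assumes every row with the same Sentence_id carries the same sentence text.

```python
import re

import pandas as pd

# Sample input rebuilt from the question.
df = pd.DataFrame({
    "Sentence_id": [1, 1],
    "Sentence": ["Did not enjoy the new Windows 8 and touchscreen functions."] * 2,
    "labels": ["Windows 8", "touchscreen functions"],
})


def tokenize(text):
    # Words stay whole; every punctuation character becomes its own token.
    return re.findall(r"\w+|[^\w\s]", text)


def tag_sentence(tokens, phrases):
    # Start with all 'O', then overwrite each exact phrase match with B, I, I, ...
    tags = ["O"] * len(tokens)
    for phrase in phrases:
        n = len(phrase)
        for start in range(len(tokens) - n + 1):
            if n and tokens[start:start + n] == phrase:
                tags[start:start + n] = ["B"] + ["I"] * (n - 1)
    return tags


rows = []
# sort=False keeps sentences in their original order of appearance.
for sent_id, group in df.groupby("Sentence_id", sort=False):
    tokens = tokenize(group["Sentence"].iloc[0])
    phrases = [tokenize(label) for label in group["labels"]]
    for token, tag in zip(tokens, tag_sentence(tokens, phrases)):
        rows.append({"Sentence_id": sent_id, "words": token, "labels": tag})

ner_df = pd.DataFrame(rows, columns=["Sentence_id", "words", "labels"])
print(ner_df.to_string(index=False))
```

Because the rows are built token by token, the words come out in sentence order with the final period last, and a word that appears twice in a sentence gets two rows.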