I have a pandas DataFrame containing information in the following format:
sentence_num | sent_word | tag | word_char | word_index
---|---|---|---|---
0 | foo | B-foo | f | 1
0 | foo | B-foo | o | 1
0 | foo | B-foo | o | 1
0 | [ ] | B-ws | [ ] | 2
0 | bar | B-bar | b | 3
0 | bar | B-bar | a | 3
0 | bar | B-bar | r | 3
1 | john | B-name | j | 1
1 | john | B-name | o | 1
1 | john | B-name | h | 1
1 | john | B-name | n | 1
1 | [ ] | B-ws | [ ] | 2
1 | doe | B-sur | d | 3
1 | doe | B-sur | o | 3
1 | doe | B-sur | e | 3
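Use boolean indexing: build a mask for the rows whose character is not the first letter of its word, then rewrite the tag prefix on those rows only: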
# True where word_char is not the first letter of sent_word
# and sent_word is not the whitespace token "[ ]"
m = (df['sent_word'].str[0].ne(df['word_char'])
     & df['sent_word'].ne('[ ]')
    )
# for those rows, change the leading B into I
df.loc[m, 'tag'] = 'I' + df.loc[m, 'tag'].str[1:]
Output:
sentence_num sent_word tag word_char word_index
0 0 foo B-foo f 1
1 0 foo I-foo o 1
2 0 foo I-foo o 1
3 0 [ ] B-ws [ ] 2
4 0 bar B-bar b 3
5 0 bar I-bar a 3
6 0 bar I-bar r 3
7 1 john B-name j 1
8 1 john I-name o 1
9 1 john I-name h 1
10 1 john I-name n 1
11 1 [ ] B-ws [ ] 2
12 1 doe B-sur d 3
13 1 doe I-sur o 3
14 1 doe I-sur e 3
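
For anyone who wants to reproduce this, here is a minimal self-contained sketch. It rebuilds the sample frame from the question and applies the mask above to a copy of the `tag` column. One caveat is that comparing against the first letter would leave a later character tagged `B-` whenever it repeats the word's first letter (the second `d` in a word like `dad`, for example). The second variant is a position-based alternative, not part of the original answer: it assumes `word_index` identifies a word uniquely within each `sentence_num`, which holds for the sample data but may not hold for yours. The names `words`, `rows`, `by_letter`, `pos` and `by_position` are illustrative only.

```python
import pandas as pd

# Sample data rebuilt from the question; every tag starts with "B-".
words = [
    (0, "foo", "B-foo", 1),
    (0, "[ ]", "B-ws", 2),
    (0, "bar", "B-bar", 3),
    (1, "john", "B-name", 1),
    (1, "[ ]", "B-ws", 2),
    (1, "doe", "B-sur", 3),
]
rows = []
for sentence_num, sent_word, tag, word_index in words:
    # The whitespace token "[ ]" is a single row; other words get one row per character.
    chars = [sent_word] if sent_word == "[ ]" else list(sent_word)
    for ch in chars:
        rows.append((sentence_num, sent_word, tag, ch, word_index))

df = pd.DataFrame(
    rows,
    columns=["sentence_num", "sent_word", "tag", "word_char", "word_index"],
)

# Variant 1 (the approach above): flag rows whose character differs from the
# first letter of the word, skipping the whitespace token.
m = df["sent_word"].str[0].ne(df["word_char"]) & df["sent_word"].ne("[ ]")
by_letter = df["tag"].copy()
by_letter[m] = "I" + by_letter[m].str[1:]

# Variant 2 (assumption: word_index is unique per word within a sentence):
# number the rows inside each word and keep "B-" only on row 0.
pos = df.groupby(["sentence_num", "word_index"]).cumcount()
by_position = df["tag"].where(pos.eq(0), "I" + df["tag"].str[1:])

df["tag"] = by_position
print(df)
```

On this sample both variants produce the same tags. They only diverge for words in which the first letter occurs again later.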