Using nltk.word_tokenize on a Pandas DataFrame column raises the error "expected string or bytes-like object"



For the following DataFrame:

index      sentences                                            category
1          the side effects are terrible !                         SSRI
2          They are killing me,,, I want to stop                   SNRI
3          I need to contact my physicians ?                        SSRI
4          How to stop it.. I am surprised because of its effect.   SSRI
5                                                                   SSRI
6                    NAN                                            SNRI

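I am trying to tokenize the sentences in the sentences column into words. The column contains some None/null values. This is my code, but it does not work.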

df["sentences"] = df.sentences.replace (r'[^a-zA-Z]', '', regex= True, inplace = True)
df["tokenized_sents"] = df["sentences"].apply(nltk.word_tokenize)

I also tried:

df["sentences"] = df.sentences.replace (r'[^a-zA-Z]', 'null', regex= True, inplace = True)

It produces the following error:

expected string or bytes-like object

Any suggestions?

# I added NaN and None to your data for demonstration; see the first printed df below.
print(df)  
df["tokenized_sents"] = df["sentences"].fillna("").map(nltk.word_tokenize)
print(df)

First print:

   index                                          sentences category
0      1                    the side effects are terrible !     SSRI
1      2              They are killing me,,, I want to stop     SNRI
2      3                  I need to contact my physicians ?     SSRI
3      4  How to stop it.. I am surprised because of its...     SSRI
4      5                                                NaN     SNRI
5      5                                               None     None

Second print:

   index                                          sentences category  
0      1                    the side effects are terrible !     SSRI   
1      2              They are killing me,,, I want to stop     SNRI   
2      3                  I need to contact my physicians ?     SSRI   
3      4  How to stop it.. I am surprised because of its...     SSRI   
4      5                                                NaN     SNRI   
5      5                                               None     None   
                                     tokenized_sents  
0             [the, side, effects, are, terrible, !]  
1  [They, are, killing, me, ,, ,, ,, I, want, to,...  
2          [I, need, to, contact, my, physicians, ?]  
3  [How, to, stop, it.., I, am, surprised, becaus...  
4                                                 []  
5                                                 []  
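
The fillna("") call works because nltk.word_tokenize only accepts strings, and NaN and None are not strings. A slightly more defensive variant is sketched below: it tokenizes only values that are actually str and returns an empty list for everything else. The helper name tokenize_column and its tokenizer parameter are hypothetical, and str.split is used as a stand-in tokenizer only so the sketch runs without NLTK's tokenizer data; it does not split punctuation the way nltk.word_tokenize does.

```python
import pandas as pd


def tokenize_column(series, tokenizer=str.split):
    """Tokenize each string in `series`; non-strings (NaN, None) become []."""
    # tokenize_column is a hypothetical helper, not part of pandas or NLTK.
    return series.map(
        lambda value: tokenizer(value) if isinstance(value, str) else []
    )


df = pd.DataFrame(
    {
        "sentences": ["the side effects are terrible !", float("nan"), None],
        "category": ["SSRI", "SNRI", None],
    }
)

# str.split is only a stand-in; pass tokenizer=nltk.word_tokenize
# once NLTK and its tokenizer data are installed.
df["tokenized_sents"] = tokenize_column(df["sentences"])
print(df)
```

Passing the tokenizer as an argument keeps the missing-value handling separate from the choice of tokenizer, so the same helper can be reused with nltk.word_tokenize unchanged.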

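By the way, if you explicitly pass inplace=True, do not assign the result back to the original column. With inplace=True the call modifies the Series itself and, in the pandas behavior this error points to, returns None, so the assignment in the question replaces the whole sentences column with None. nltk.word_tokenize then receives None instead of a string and fails with "expected string or bytes-like object".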

df.sentences.replace(r'[^a-zA-Z]', '', regex=True, inplace=True)
#  instead of, df["sentences"] = df.sentences.replace(r'[^a-zA-Z]', '', regex=True, inplace=True)
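
An alternative that avoids inplace altogether is to assign the Series that replace returns. The sketch below makes two assumptions that are not in the original post: the pattern is changed to [^a-zA-Z\s] so that spaces survive (the original [^a-zA-Z] also strips whitespace, which would merge every sentence into a single word), and missing values are filled with an empty string before tokenizing.

```python
import pandas as pd

df = pd.DataFrame({"sentences": ["They are killing me,,, I want to stop", None]})

# No inplace=True: replace returns a new Series, which is safe to assign back.
# The \s in the pattern is an assumption about intent; it keeps whitespace
# so that word boundaries are still there for the tokenizer.
cleaned = df["sentences"].replace(r"[^a-zA-Z\s]", "", regex=True)

# Missing values become empty strings, so a string tokenizer can handle them.
df["sentences"] = cleaned.fillna("")
print(df)
```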
