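Remove specific words from a column using Python

The data originally comes from a PDF and is being prepared for further analysis. There is an [identity] column in which some values are malformed, i.e. they contain typos or special characters.

I want to remove the unwanted characters from the column.

Input data: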

identity
UK25463AC
ID:- UN67342OM
#ID!?
USA5673OP

Expected output:

identity
UK25463AC
UN67342OM
NAN
USA5673OP

The script I have tried so far:

stop_words = ['#ID!?','ID:-']
pat = '|'.join(r"\b{}\b".format(x) for x in stop_words)
df['identity'] = df['identity'].str.replace(pat, '')

This does not give the expected output, and I don't know how to handle it.

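To get the expected output, the word boundaries \b have to be removed: the stop words begin or end with non-word characters, so \b never matches around them. Because the stop words contain special regex characters, they are escaped with re.escape. Series.replace then substitutes an empty string for each match, and a second replace turns values that are left as empty strings into missing values: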

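import re
import numpy as np

stop_words = ['#ID!?','ID:-']
# Escape each stop word so characters such as ? and # are matched literally.
pat = '|'.join(re.escape(x) for x in stop_words)
# Remove the stop words, then turn values that are now empty into NaN.
df['identity'] = df['identity'].replace(pat, '', regex=True).replace('', np.nan)
print (df)
identity
0   UK25463AC
1   UN67342OM
2         NaN
3   USA5673OP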
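
For anyone who wants to run this end to end, here is a minimal self-contained sketch. It rebuilds the sample frame from the question and first checks why the \b version cannot match. It also adds a .str.strip() step, which is not part of the answer above. That step is my assumption: removing "ID:-" from "ID:- UN67342OM" appears to leave a leading space, which the expected output does not show.

```python
import re

import numpy as np
import pandas as pd

# Sample data copied from the question.
df = pd.DataFrame(
    {"identity": ["UK25463AC", "ID:- UN67342OM", "#ID!?", "USA5673OP"]}
)

stop_words = ["#ID!?", "ID:-"]

# With \b the first stop word can never match: "#" is not a word character,
# so there is no word boundary in front of it at the start of the string.
assert re.search(r"\b" + re.escape("#ID!?") + r"\b", "#ID!?") is None

# re.escape makes characters such as "?" match literally.
pat = "|".join(re.escape(x) for x in stop_words)

df["identity"] = (
    df["identity"]
    .str.replace(pat, "", regex=True)  # drop the unwanted fragments
    .str.strip()                       # assumption: trim the space left after "ID:-"
    .replace("", np.nan)               # values that are now empty become missing
)
print(df)
```

If the surrounding whitespace matters for your data, leave out the .str.strip() call.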
