How can I replace a word in a DataFrame only when it is preceded by a number?



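I am trying to search a DataFrame for certain words listed in a dictionary. If one of them is present, it should be replaced with the corresponding dictionary key.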

units_dic= {'grams':['g','Grams'],
                'kg'   :['kilogram','kilograms']}

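The problem is that some unit abbreviations are single letters, so the loop also replaces every other occurrence of that letter. I only want to replace a word when it comes right after a number, to make sure it really is a unit.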

DataFrame

    Id | test 
    ---------
    1  |'A small paperclip has a mass of about 111 g'
    2  |'1 kilogram =1000 g'
    3  |'g is the 7th letter in the ISO basic Latin alphabet'

Replacement loop

  x = df.copy()
  for k in units_dic:
      for i in range(len(x['test'])):
          for w in units_dic[k]:
              x['test'][i] = str(x['test'][i]).replace(str(w), str(k))

Output

    Id | test 
    ---------
    1  |'A small paperclip has a mass of about 111 grams'
    2  |'1 kg =1000 grams'
    3  |'grams is the 7th letter in the ISO basic Latin alphabet'

Invert the dictionary and use a carefully built regular expression.

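import re
# Invert the dictionary: map each alias to its canonical unit
d = {i: k for k, v in units_dic.items() for i in v}
u = r'|'.join(d)
# A number, an optional space, then one of the aliases as a whole word
v = fr'(\d+\s?)\b({u})\b'
df.assign(test=[re.sub(v, lambda x: x.group(1) + d[x.group(2)], el) for el in df.test])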

   Id                                               test
0   1    A small paperclip has a mass of about 111 grams
1   2                                   1 kg =1000 grams
2   3  g is the 7th letter in the ISO basic Latin alp...
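
The same idea can be sketched with the standard `re` module alone, which makes it easy to try on single strings. The names `alias_to_unit` and `normalize_units` are made up for this illustration and are not part of the answer above. Sorting the aliases by length and escaping them are extra precautions I am assuming are wanted.

```python
import re

units_dic = {'grams': ['g', 'Grams'],
             'kg': ['kilogram', 'kilograms']}

# Invert the mapping so that each alias points to its canonical unit.
alias_to_unit = {alias: unit
                 for unit, aliases in units_dic.items()
                 for alias in aliases}

# Try longer aliases first and escape them in case one contains regex metacharacters.
alternation = '|'.join(re.escape(a)
                       for a in sorted(alias_to_unit, key=len, reverse=True))

# Group 1: the number plus an optional space; group 2: the alias as a whole word.
pattern = re.compile(rf'(\d+\s?)\b({alternation})\b')

def normalize_units(text):
    """Replace unit aliases that directly follow a number (hypothetical helper)."""
    return pattern.sub(lambda m: m.group(1) + alias_to_unit[m.group(2)], text)

print(normalize_units('1 kilogram =1000 g'))
```

Note that this pattern expects a space between the number and the unit: in a string such as `111g` there is no word boundary between the digit and the letter, so nothing is replaced.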

Try:

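for key, val in units_dic.items():
    # Capture the number (and any spaces) so it is kept in the replacement
    pattern = r"(\d+[ ]*)(?:" + "|".join(val) + r")\b"
    df['test'] = df['test'].replace(pattern, r"\g<1>" + key, regex=True)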

We can use the lookbehind feature of regular expressions here, which lets us require that the match be preceded by a digit and, optionally, whitespace:

for k, v in units_dic.items():
    df['test'] = df['test'].str.replace(fr"(?<=[0-9])\s*({'|'.join(v)})\b", f' {k}', regex=True)

print(df)
   Id                                               test
0   1  'A small paperclip has a mass of about 111 grams'
1   2                                 '1 kg =1000 grams'
2   3  'g is the 7th letter in the ISO basic Latin al...

Explanation
First, we use a raw f-string: fr'sometext'

The regular expression:

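  • (?<=[0-9]) means "preceded by a digit"
  • \s* matches zero or more whitespace characters
  • "|".join(v) gives us the values from your dictionary joined by |, which is the "or" operator in regex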
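
To see the three pieces working together, here is a minimal sketch using plain `re.sub` on single strings rather than a DataFrame column. The helper name `replace_unit` and the hard-coded alias list are assumptions made for this illustration only.

```python
import re

aliases = ['g', 'Grams']  # the dictionary values for the 'grams' key

# Lookbehind for a digit, optional whitespace, then one alias ending at a word boundary.
pattern = fr"(?<=[0-9])\s*({'|'.join(aliases)})\b"

def replace_unit(text, unit='grams'):
    """Replace an alias that follows a digit with ' <unit>' (hypothetical helper)."""
    # The lookbehind does not consume the digit, so only the whitespace and the
    # alias are replaced; the replacement puts back exactly one space.
    return re.sub(pattern, f' {unit}', text)

print(replace_unit('1 kilogram =1000 g'))
```

Because `\s*` also matches an empty string, this version rewrites `111g` as `111 grams`, and it collapses several spaces between the number and the unit into one.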
