Best way to apply a function with multiple iterations to a pandas Series



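I need to apply a conditional text replacement, with many different iterations, to a pandas Series. What is the best way to do this?

My first attempt was to define a function that iterates internally and then apply it, but this clearly doesn't work, because each row returns a value after only the first iteration: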

def numberreplace(x):
    matches = {'FIRST':'1ST',
               'SECOND':'2ND',
               'THIRD':'3RD',
               'FOURTH':'4TH',
               'FIFTH':'5TH',
               'SIXTH':'6TH',
               'SEVENTH':'7TH',
               'EIGTH':'8TH',
               'NINTH':'9TH',
               'TENTH':'10TH'}
    for key in matches.keys():
        if (' '+key+'' in x) or (x.startswith(key)):
            x = x.replace(key, matches[key])
            return x
        else:
            return x

data['STREET REFORMAT'] = data['STREET REFORMAT'].apply(numberreplace)

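My other idea was to define a list outside the apply statement, iterate over that list of dictionary keys, and apply the function to the rows once per list element. However, I'm not sure how to apply a multi-argument function to a Series, or how to specify which argument is the Series "row" argument.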

def numberreplace(row, k):
    matches = {'FIRST':'1ST',
               'SECOND':'2ND',
               'THIRD':'3RD',
               'FOURTH':'4TH',
               'FIFTH':'5TH',
               'SIXTH':'6TH',
               'SEVENTH':'7TH',
               'EIGTH':'8TH',
               'NINTH':'9TH',
               'TENTH':'10TH'}
    if (' '+k+'' in row) or (row.startswith(k)):
        row = row.replace(k, matches[k])
        return row
    return row

nummatches = ['FIRST','SECOND','THIRD','FOURTH','FIFTH','SIXTH','SEVENTH','EIGHTH','NINTH','TENTH']
for match in nummatches:
    data['STREET REFORMAT'] = data['STREET REFORMAT'].apply(numberreplace(match))

What is the most efficient way to run this kind of apply function on a DataFrame with many rows and many replacement strings?

You don't need to return x inside the for loop; return it once, after the loop has finished. Try this:

def numberreplace(x):
    matches = {'FIRST':'1ST',
               'SECOND':'2ND',
               'THIRD':'3RD',
               'FOURTH':'4TH',
               'FIFTH':'5TH',
               'SIXTH':'6TH',
               'SEVENTH':'7TH',
               'EIGTH':'8TH',
               'NINTH':'9TH',
               'TENTH':'10TH'}
    for key in matches.keys():
        if (' '+key+'' in x) or (x.startswith(key)):
            x = x.replace(key, matches[key])
    # Return only after every key has been tried
    return x

data['STREET REFORMAT'] = data['STREET REFORMAT'].apply(numberreplace)
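
As for the second idea in the question (passing an extra argument to the applied function), `Series.apply` takes an `args` tuple whose items are passed after the element itself, so the element is always the first parameter. Here is a minimal sketch, assuming a shortened dictionary `MATCHES` and a helper `replace_one`; both names are made up for illustration and are not from the original post:

```python
import pandas as pd

# Shortened, hypothetical mapping used only for this illustration.
MATCHES = {"FIRST": "1ST", "SECOND": "2ND", "THIRD": "3RD"}

def replace_one(row, k):
    """Replace k in row when k follows a space or starts the string."""
    if (" " + k in row) or row.startswith(k):
        row = row.replace(k, MATCHES[k])
    return row

streets = pd.Series(["FIRST AVE", "NORTH SECOND ST", "THIRD"])
for key in MATCHES:
    # Each element is passed as the first argument (row);
    # args supplies the remaining positional arguments (k).
    streets = streets.apply(replace_one, args=(key,))

result = streets.tolist()
print(result)
```

This still makes one pass over the Series per key, so it mainly answers the "how do I pass a second argument" part rather than the efficiency part.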

Another way to rewrite the function is to use a regex:

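import re

def numberreplace(x):
    matches = {'FIRST':'1ST',
               'SECOND':'2ND',
               'THIRD':'3RD',
               'FOURTH':'4TH',
               'FIFTH':'5TH',
               'SIXTH':'6TH',
               'SEVENTH':'7TH',
               'EIGTH':'8TH',
               'NINTH':'9TH',
               'TENTH':'10TH'}
    for key in matches.keys():
        # (?<!\S) matches only at the start of the string or right after whitespace
        x = re.sub(rf"(?<!\S){key}", matches[key], x)
    return x

data['STREET REFORMAT'] = data['STREET REFORMAT'].apply(numberreplace)

This replaces each key that starts the string or follows whitespace with its value in matches, and returns the original string when no match is found. This solution is about twice as fast as the one using str.replace(), so it may be useful for large DataFrames with many rows and many replacement strings.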
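
Instead of running one substitution per key, the keys can also be joined into a single alternation and looked up in a callback, so each string is scanned once. The sketch below is one possible variant, not the answer's own code; it assumes a shortened `matches` dictionary and sample street names invented for illustration, and relies on `Series.str.replace` accepting a compiled pattern and a callable replacement when `regex=True`:

```python
import re
import pandas as pd

# Shortened mapping for illustration only.
matches = {"FIRST": "1ST", "SECOND": "2ND", "THIRD": "3RD", "TENTH": "10TH"}

# One alternation covering every key. (?<!\S) requires the key to sit at
# the start of the string or directly after whitespace.
pattern = re.compile(r"(?<!\S)(?:" + "|".join(map(re.escape, matches)) + r")")

streets = pd.Series(["FIRST AVE", "NORTH SECOND ST", "IAMTENTH", "THIRDLY"])

# The callback receives each match object and looks up its replacement.
result = streets.str.replace(pattern, lambda m: matches[m.group(0)], regex=True)
print(result.tolist())
```

Like the original condition, this checks only what precedes the key, so "THIRDLY" is still rewritten; adding `\b` after the group would restrict it to whole words.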

Instead of looping, you can build two conditions and use the matches dictionary together with mask:

import pandas as pd

# matches is the same dictionary used in the question
data = pd.DataFrame({"STREET REFORMAT":["FIRST", "THIRD", "IAMNINTH", "EIGTHISME"]})
# cond1: a key appears as a whole word; cond2: the string starts with a key
cond1 = data["STREET REFORMAT"].str.contains("|".join(fr"\b{i}\b" for i in matches))
cond2 = data["STREET REFORMAT"].str.contains("|".join(fr"^{i}" for i in matches))
print(data["STREET REFORMAT"].mask(cond1|cond2, data["STREET REFORMAT"].replace(matches, regex=True)))

0         1ST
1         3RD
2    IAMNINTH
3     8THISME
Name: STREET REFORMAT, dtype: object
