pandas apply() results in UnboundLocalError



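I have a DataFrame (df_cluster) with two columns [Customer Id, cluster]. There are about 13 clusters, and I am trying to assign a name to each cluster using apply() in Python. I have used the same function in the past and it worked fine, but now I get an "UnboundLocalError".

Please let me know if I am doing something wrong. My understanding of apply() is that it passes a function along an axis (in this case, the function cluster_name is called once for each row).

Here is the code: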

def cluster_name(df):
    if df['cluster'] == 1:
        value = 'A'
    elif df['cluster'] == 2:
        value = 'B'
    elif df['cluster'] == 3:
        value = 'C'
    elif df['cluster'] == 4:
        value = 'D'
    elif df['cluster'] == 5:
        value = 'E'
    elif df['cluster'] == 6:
        value = 'F'
    elif df['cluster'] == 7:
        value = 'G'
    return value

df_cluster['cluster_name'] = df_cluster.apply(cluster_name, axis = 1)

The error:

UnboundLocalError                         Traceback (most recent call last)
<ipython-input-16-b64f3fdc1260> in <module>
16     return value
17 
---> 18 df_cluster['cluster_name'] = df_cluster.apply(cluster_name, axis = 1)
19 df_cluster['cluster_name'].value_counts()
/opt/cloudera/parcels/Anaconda/envs/py36/lib/python3.6/site-packages/pandas/core/frame.py in apply(self, func, axis, broadcast, raw, reduce, result_type, args, **kwds)
6926             kwds=kwds,
6927         )
-> 6928         return op.get_result()
6929 
6930     def applymap(self, func):
/opt/cloudera/parcels/Anaconda/envs/py36/lib/python3.6/site-packages/pandas/core/apply.py in get_result(self)
184             return self.apply_raw()
185 
--> 186         return self.apply_standard()
187 
188     def apply_empty_result(self):
/opt/cloudera/parcels/Anaconda/envs/py36/lib/python3.6/site-packages/pandas/core/apply.py in apply_standard(self)
290 
291         # compute the result using the series generator
--> 292         self.apply_series_generator()
293 
294         # wrap results
/opt/cloudera/parcels/Anaconda/envs/py36/lib/python3.6/site-packages/pandas/core/apply.py in apply_series_generator(self)
319             try:
320                 for i, v in enumerate(series_gen):
--> 321                     results[i] = self.f(v)
322                     keys.append(v.name)
323             except Exception as e:
<ipython-input-16-b64f3fdc1260> in cluster_name(df)
14     elif df['cluster'] == 7:
15         value = 'G'
---> 16     return value
17 
18 df_cluster['cluster_name'] = df_cluster.apply(cluster_name, axis = 1)
UnboundLocalError: ("local variable 'value' referenced before assignment", 'occurred at index 0')

Your function is missing an else branch:

def cluster_name(df):
    if df['cluster'] == 1:
        value = 'A'
    elif df['cluster'] == 2:
        value = 'B'
    elif df['cluster'] == 3:
        value = 'C'
    elif df['cluster'] == 4:
        value = 'D'
    elif df['cluster'] == 5:
        value = 'E'
    elif df['cluster'] == 6:
        value = 'F'
    elif df['cluster'] == 7:
        value = 'G'
    else:
        value = ...
    return value

Otherwise, if df['cluster'] is not one of the values {1, 2, ..., 7}, value is never assigned, and the return statement raises the exception.
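To see why the missing else matters, here is a minimal sketch that reproduces the same failure without pandas. The function name label_for and the sample values are made up for illustration; the exact wording of the error message varies between Python versions.

```python
def label_for(cluster):
    # Only clusters 1 and 2 are handled; there is deliberately no else branch.
    if cluster == 1:
        value = 'A'
    elif cluster == 2:
        value = 'B'
    return value  # fails if neither branch ran, because value was never bound


print(label_for(1))  # a handled cluster works

try:
    label_for(13)  # an unhandled cluster, like clusters 8-13 in the question
except UnboundLocalError as exc:
    print(type(exc).__name__, exc)
```

With about 13 clusters in the data and only 1 to 7 handled, the first row whose cluster falls outside that range would trigger the error inside apply().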

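It looks like your question has already been answered in the comments, so I will suggest a more pandas-oriented way to solve your problem. Using apply(axis=1) on a DataFrame is very slow and almost never necessary (it amounts to iterating over the rows of the DataFrame), so a better approach is to use vectorized methods. The simplest is to define a cluster -> cluster_name mapping and use the map method: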

import pandas as pd

df = pd.DataFrame(
    {"cluster": [1, 2, 3, 4, 5, 6, 7]}
)
# repeat this dataframe 10000 times
df = pd.concat([df] * 10000)

The apply approach:

def mapping_func(row):
    if row['cluster'] == 1:
        value = 'A'
    elif row['cluster'] == 2:
        value = 'B'
    elif row['cluster'] == 3:
        value = 'C'
    elif row['cluster'] == 4:
        value = 'D'
    elif row['cluster'] == 5:
        value = 'E'
    elif row['cluster'] == 6:
        value = 'F'
    elif row['cluster'] == 7:
        value = 'G'
    else:
        # This is a "catch-all" in case none of the values in the column are 1-7
        value = "Z"

    return value

%timeit df.apply(mapping_func, axis=1)
# 1.32 s ± 91.3 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

The .map approach:

mapping_dict = {
    1: "A",
    2: "B",
    3: "C",
    4: "D",
    5: "E",
    6: "F",
    7: "G"
}

# the `fillna` is our "catch-all" statement.
#  essentially if `map` encounters a value not in the dictionary
#  it will place a NaN there. So I fill those NaNs with "Z" to
#  be consistent with the above example
%timeit df["cluster"].map(mapping_dict).fillna("Z")
# 4.87 ms ± 195 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

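We can see that the map method with a dictionary is much faster than apply (about 4.87 ms versus 1.32 s on these 70,000 rows), and it also avoids a long chain of if/elif statements.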
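As a quick sanity check, the sketch below compares the two approaches on a tiny frame to confirm they produce the same labels, including the "Z" fallback. The sample values (with cluster 9 deliberately left out of the dictionary) and the names via_apply and via_map are assumptions for illustration only.

```python
import pandas as pd

# Small sample frame; cluster 9 is deliberately missing from the mapping.
df = pd.DataFrame({"cluster": [1, 2, 7, 9]})

mapping_dict = {1: "A", 2: "B", 3: "C", 4: "D", 5: "E", 6: "F", 7: "G"}

# Row-wise version: dict.get with a default plays the role of the else branch.
via_apply = df.apply(lambda row: mapping_dict.get(row["cluster"], "Z"), axis=1)

# Vectorized version: unmapped values become NaN, then fillna supplies "Z".
via_map = df["cluster"].map(mapping_dict).fillna("Z")

print(via_map.tolist())
print(via_apply.tolist() == via_map.tolist())
```

Both versions label the unmapped cluster as "Z", so switching to map should not change the results, only the speed.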

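  • Writing an if-else function by hand is overrated, and it is easy to miss a condition
  • Since you are assigning letters as the 'cluster_name', use string.ascii_uppercase to get all the letters, and zip them with the unique values in 'cluster'
    • Create a dict from the zipped values and .map it to create 'cluster_name'
  • This implementation builds the mapping from the unique values in the column, so "local variable 'value' referenced before assignment" cannot occur.
    • The error occurs because, when the column contains a value that matches none of the if-else conditions, return value still executes even though value was never assigned in the function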
import pandas as pd
import string
# test dataframe
df = pd.DataFrame({'cluster': range(1, 11)})
# unique values from the cluster column
clusters = sorted(df.cluster.unique()) 
# create a dict to map
cluster_map = dict(zip(clusters, string.ascii_uppercase))
# create the cluster_name column
df['cluster_name'] = df.cluster.map(cluster_map)
# df
   cluster cluster_name
0        1            A
1        2            B
2        3            C
3        4            D
4        5            E
5        6            F
6        7            G
7        8            H
8        9            I
9       10            J
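
One caveat worth keeping in mind with the zip approach: zip stops at the shorter of its inputs, so it only covers as many clusters as there are uppercase letters (26). That is fine for the roughly 13 clusters in the question. The sketch below uses a hypothetical frame with 30 clusters to show what happens past that limit.

```python
import string

import pandas as pd

# Hypothetical frame with 30 distinct clusters, more than the 26 letters.
df = pd.DataFrame({"cluster": range(1, 31)})

clusters = sorted(df["cluster"].unique())
# zip stops at the shorter input, so only the first 26 clusters get a letter.
cluster_map = dict(zip(clusters, string.ascii_uppercase))

df["cluster_name"] = df["cluster"].map(cluster_map)

# Clusters 27-30 have no entry in the dict, so map leaves them as NaN.
print(len(cluster_map))
print(df["cluster_name"].isna().sum())
```

If more than 26 names are ever needed, a different naming scheme (or a fillna fallback as in the .map answer above) would be required.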
