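I have a DataFrame (df_cluster) with two columns [Customer Id, cluster]. There are about 13 clusters, and I am trying to assign a name to each one using apply() in Python. I have used the same function before and it worked fine, but now I get an "UnboundLocalError".
Please let me know if I am doing something wrong. My understanding of apply() is that it passes a function along an axis (in this case, cluster_name is called once per row).
Here is the code: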
def cluster_name(df):
    if df['cluster'] == 1:
        value = 'A'
    elif df['cluster'] == 2:
        value = 'B'
    elif df['cluster'] == 3:
        value = 'C'
    elif df['cluster'] == 4:
        value = 'D'
    elif df['cluster'] == 5:
        value = 'E'
    elif df['cluster'] == 6:
        value = 'F'
    elif df['cluster'] == 7:
        value = 'G'
    return value

df_cluster['cluster_name'] = df_cluster.apply(cluster_name, axis = 1)
Error:
UnboundLocalError Traceback (most recent call last)
<ipython-input-16-b64f3fdc1260> in <module>
16 return value
17
---> 18 df_cluster['cluster_name'] = df_cluster.apply(cluster_name, axis = 1)
19 df_cluster['cluster_name'].value_counts()
/opt/cloudera/parcels/Anaconda/envs/py36/lib/python3.6/site-packages/pandas/core/frame.py in apply(self, func, axis, broadcast, raw, reduce, result_type, args, **kwds)
6926 kwds=kwds,
6927 )
-> 6928 return op.get_result()
6929
6930 def applymap(self, func):
/opt/cloudera/parcels/Anaconda/envs/py36/lib/python3.6/site-packages/pandas/core/apply.py in get_result(self)
184 return self.apply_raw()
185
--> 186 return self.apply_standard()
187
188 def apply_empty_result(self):
/opt/cloudera/parcels/Anaconda/envs/py36/lib/python3.6/site-packages/pandas/core/apply.py in apply_standard(self)
290
291 # compute the result using the series generator
--> 292 self.apply_series_generator()
293
294 # wrap results
/opt/cloudera/parcels/Anaconda/envs/py36/lib/python3.6/site-packages/pandas/core/apply.py in apply_series_generator(self)
319 try:
320 for i, v in enumerate(series_gen):
--> 321 results[i] = self.f(v)
322 keys.append(v.name)
323 except Exception as e:
<ipython-input-16-b64f3fdc1260> in cluster_name(df)
14 elif df['cluster'] == 7:
15 value = 'G'
---> 16 return value
17
18 df_cluster['cluster_name'] = df_cluster.apply(cluster_name, axis = 1)
UnboundLocalError: ("local variable 'value' referenced before assignment", 'occurred at index 0')
The function is missing an else branch:
def cluster_name(df):
    if df['cluster'] == 1:
        value = 'A'
    elif df['cluster'] == 2:
        value = 'B'
    elif df['cluster'] == 3:
        value = 'C'
    elif df['cluster'] == 4:
        value = 'D'
    elif df['cluster'] == 5:
        value = 'E'
    elif df['cluster'] == 6:
        value = 'F'
    elif df['cluster'] == 7:
        value = 'G'
    else:
        value = ...
    return value
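Without it, if df['cluster'] is not one of the values {1, 2, ..., 7}, value is never set, and return value raises the exception. Since you have about 13 clusters and the function only handles 7, that is what happens here.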
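To see the failure without pandas, here is a minimal sketch that calls a shortened version of the function on plain dicts standing in for rows. The three-branch function and the try_cluster_name helper are made up for illustration; the exact error message also varies between Python versions, so it is printed rather than assumed.

```python
def cluster_name(row):
    # Only clusters 1-3 are handled; anything else falls through every branch.
    if row["cluster"] == 1:
        value = "A"
    elif row["cluster"] == 2:
        value = "B"
    elif row["cluster"] == 3:
        value = "C"
    return value  # `value` was never bound if no branch matched


def try_cluster_name(row):
    """Hypothetical helper: return the name, or a description of the error."""
    try:
        return cluster_name(row)
    except UnboundLocalError as exc:
        return f"UnboundLocalError: {exc}"


handled = try_cluster_name({"cluster": 2})    # a branch matches
unhandled = try_cluster_name({"cluster": 8})  # no branch matches
print(handled)
print(unhandled)
```

The first call returns a name because a branch assigned value; the second hits return value with nothing assigned, which is the same situation as a row whose cluster is 8 through 13 in your data.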
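Your question seems to have been answered in the comments already, so I'll suggest a more pandas-oriented way to solve it. Using apply(axis=1) on a DataFrame is very slow and almost never necessary (it amounts to iterating over the rows of the DataFrame), so a better approach is to use vectorized methods. The simplest is to define a cluster -> cluster_name mapping and use the map method: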
import pandas as pd

df = pd.DataFrame(
    {"cluster": [1, 2, 3, 4, 5, 6, 7]}
)
# repeat this dataframe 10000 times
df = pd.concat([df] * 10000)
apply approach:
def mapping_func(row):
    if row['cluster'] == 1:
        value = 'A'
    elif row['cluster'] == 2:
        value = 'B'
    elif row['cluster'] == 3:
        value = 'C'
    elif row['cluster'] == 4:
        value = 'D'
    elif row['cluster'] == 5:
        value = 'E'
    elif row['cluster'] == 6:
        value = 'F'
    elif row['cluster'] == 7:
        value = 'G'
    else:
        # This is a "catch-all" in case none of the values in the column are 1-7
        value = "Z"
    return value

%timeit df.apply(mapping_func, axis=1)
# 1.32 s ± 91.3 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
.map approach:
mapping_dict = {
    1: "A",
    2: "B",
    3: "C",
    4: "D",
    5: "E",
    6: "F",
    7: "G"
}

# the `fillna` is our "catch-all" statement.
# essentially if `map` encounters a value not in the dictionary
# it will place a NaN there. So I fill those NaNs with "Z" to
# be consistent with the above example
%timeit df["cluster"].map(mapping_dict).fillna("Z")
# 4.87 ms ± 195 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
We can see that the map approach with a dictionary is much faster than apply, and it also avoids a long chain of if/elif statements.
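Since the original problem was cluster values with no name assigned, it may help to list which values the dictionary does not cover before filling them with a placeholder. This is only a sketch on a small made-up frame; the variable names is_unmapped and unmapped_clusters are my own.

```python
import pandas as pd

mapping_dict = {1: "A", 2: "B", 3: "C", 4: "D", 5: "E", 6: "F", 7: "G"}

# Small made-up frame: clusters 8 and 13 have no entry in mapping_dict.
df = pd.DataFrame({"cluster": [1, 2, 8, 7, 13, 8]})

# Boolean mask of rows whose cluster value is not a key of the mapping.
is_unmapped = ~df["cluster"].isin(list(mapping_dict))
unmapped_clusters = sorted(df.loc[is_unmapped, "cluster"].unique().tolist())
print(unmapped_clusters)  # [8, 13]

# Same catch-all as above: unmapped values become NaN, then "Z".
names = df["cluster"].map(mapping_dict).fillna("Z")
print(names.tolist())  # ['A', 'B', 'Z', 'G', 'Z', 'Z']
```

If unmapped_clusters is not empty, you can decide whether to extend the dictionary or keep the "Z" placeholder.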
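- Manually writing an if-else function is overrated, and it can easily miss a condition.
- Since you are assigning letters as the 'cluster_name', use string.ascii_uppercase to get all the uppercase letters, and zip them to the unique values in 'cluster'.
  - Create a dict from the zipped values and use .map to create the 'cluster_name' column.
- This implementation builds the mapping from the unique values in the column, so "local variable 'value' referenced before assignment" cannot occur.
  - In the case of the error, it happens because when the column contains a value that meets none of the if-else conditions, return value executes even though value was never assigned in the function.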
import pandas as pd
import string
# test dataframe
df = pd.DataFrame({'cluster': range(1, 11)})
# unique values from the cluster column
clusters = sorted(df.cluster.unique())
# create a dict to map
cluster_map = dict(zip(clusters, string.ascii_uppercase))
# create the cluster_name column
df['cluster_name'] = df.cluster.map(cluster_map)
# df
   cluster cluster_name
0        1            A
1        2            B
2        3            C
3        4            D
4        5            E
5        6            F
6        7            G
7        8            H
8        9            I
9       10            J
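One caveat with zipping against string.ascii_uppercase: zip stops at the shorter input, so this only names the first 26 unique clusters. That is fine for about 13 clusters, but the sketch below (with a made-up list of 30 cluster ids) shows what happens beyond that.

```python
import string

# Hypothetical example: 30 cluster ids, more than the 26 available letters.
clusters = list(range(1, 31))
cluster_map = dict(zip(clusters, string.ascii_uppercase))

# zip stops at the shorter input, so ids 27-30 get no name.
missing = [c for c in clusters if c not in cluster_map]
print(len(cluster_map))  # 26
print(missing)           # [27, 28, 29, 30]
```

With .map, those missing ids would come out as NaN in 'cluster_name' rather than raising an error.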