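I don't think this is a new problem, but I find it strange that this message is shown: local variable 'df_ret' referenced before assignment. Here is my function for rebalancing an imbalanced dataset: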
def down_sample(df, target, positive_label, negative_label):
    positives = df.filter(df[target] == positive_label)
    negatives = df.filter(df[target] == negative_label)
    num_positives = positives.count()
    num_negatives = negatives.count()
    if (num_positives > num_negatives):  # down_sample positives
        sampled_df = positives.sample(withReplacement=False,
                                      fraction=num_negatives/num_positives,
                                      seed=SEED)
        df_ret = sampled_df.union(negatives)
    return df_ret
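The error message "local variable 'df_ret' referenced before assignment" is accurate here. The function ran, but the if condition num_positives > num_negatives was not true, so the code inside the if block never executed and df_ret was never assigned (it was never created or initialized) before the return statement tried to read it.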
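To see the same behavior in isolation, here is a minimal sketch that does not need Spark. The function name pick_larger and the variable result are made up for illustration; only the shape of the code (assignment inside an if, return outside it) mirrors the question.

```python
def pick_larger(a, b):
    """Hypothetical helper that mirrors the structure of down_sample."""
    if a > b:
        result = a
    # When a <= b the assignment above is skipped, so `result`
    # has never been bound by the time the return statement reads it.
    return result


# The if branch runs here, so `result` is assigned and returned.
print(pick_larger(2, 1))

# The if branch is skipped here, so reading `result` fails.
try:
    pick_larger(1, 2)
except UnboundLocalError as exc:
    # The exact wording of the message differs between Python versions.
    print(type(exc).__name__, exc)
```

The function only fails for inputs that skip the if branch, which is why the error can seem to appear out of nowhere.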
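There are a few patterns you can use to fix this, depending on what the callers of this function expect:
- If the if condition is not met, raise an exception in the function and let the caller catch it.
- Initialize the df_ret variable before the if block, so that the function returns a default value when the if condition is not met.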
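As a rough sketch, the two patterns above could look like the following. Several things here are assumptions rather than part of the question: the function names down_sample_strict and down_sample_with_default, the value of SEED (the question uses it without showing its definition), the choice of ValueError, and the choice of the unchanged input df as the default return value. Both functions expect a PySpark DataFrame and only use the filter, count, sample and union calls already present in the question.

```python
SEED = 42  # assumed value; the question does not show how SEED is defined


def down_sample_strict(df, target, positive_label, negative_label):
    """Pattern 1 (sketch): raise when there is nothing to down-sample."""
    positives = df.filter(df[target] == positive_label)
    negatives = df.filter(df[target] == negative_label)
    num_positives = positives.count()
    num_negatives = negatives.count()
    if num_positives <= num_negatives:
        # ValueError is an illustrative choice; the caller is expected to catch it.
        raise ValueError(
            f"positives ({num_positives}) do not outnumber "
            f"negatives ({num_negatives}); nothing to down-sample"
        )
    sampled_df = positives.sample(withReplacement=False,
                                  fraction=num_negatives / num_positives,
                                  seed=SEED)
    return sampled_df.union(negatives)


def down_sample_with_default(df, target, positive_label, negative_label):
    """Pattern 2 (sketch): bind df_ret before the if block."""
    df_ret = df  # assumed default: hand back the input unchanged
    positives = df.filter(df[target] == positive_label)
    negatives = df.filter(df[target] == negative_label)
    num_positives = positives.count()
    num_negatives = negatives.count()
    if num_positives > num_negatives:  # down-sample positives
        sampled_df = positives.sample(withReplacement=False,
                                      fraction=num_negatives / num_positives,
                                      seed=SEED)
        df_ret = sampled_df.union(negatives)
    return df_ret
```

Which one fits better depends on the caller: the strict version makes the unexpected case impossible to miss, while the default version keeps a pipeline running when the data is already balanced or skewed the other way.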
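Good answer from gladiesgoodluck. I would also add a quick fix: indent the return statement one level further, so that it only executes when the if condition is met. Your code becomes: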
def down_sample(df, target, positive_label, negative_label):
    positives = df.filter(df[target] == positive_label)
    negatives = df.filter(df[target] == negative_label)
    num_positives = positives.count()
    num_negatives = negatives.count()
    if (num_positives > num_negatives):  # down_sample positives
        sampled_df = positives.sample(withReplacement=False,
                                      fraction=num_negatives/num_positives,
                                      seed=SEED)
        df_ret = sampled_df.union(negatives)
        return df_ret
    return something_else  # OPTIONAL