Tensorflow issue with softmax

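I have a TensorFlow multiclass classifier that produces nan or inf when computing probabilities with tf.nn.softmax. See the snippet below (logits has shape batch_size x 6, because I have 6 classes and the output is one-hot encoded). batch_size is 1024.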

logits = tf.debugging.check_numerics(logits, message='bad logits', name=None)
probabilities = tf.nn.softmax(logits=logits, name='Softmax')
probabilities = tf.debugging.check_numerics(probabilities, message='bad probabilities', name=None)

The classifier fails at the last statement because it finds nan or inf in probabilities. logits is clean; otherwise the first statement would have failed.

From what I have read about tf.nn.softmax, it can handle very large and very small values in logits. I verified this in interactive mode.

>>> with tf.Session() as s:
...   a = tf.constant([[1000, 10], [-100, -200], [3, 4.0]])
...   sm = tf.nn.softmax(logits=a, name='Softmax')
...   print(a.eval())
...   print(sm.eval())
...
[[1000.   10.]
[-100. -200.]
[   3.    4.]]
[[1.         0.        ]
[1.         0.        ]
[0.26894143 0.7310586 ]]
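
For reference, a common way for a softmax to cope with logits such as 1000 is the max-subtraction trick: softmax is unchanged when the same constant is subtracted from every logit in a row, so a stable version subtracts the row maximum before exponentiating. The pure-Python sketch below only illustrates that idea; the function names are hypothetical and this is not TensorFlow's actual kernel.

```python
import math

def naive_softmax(row):
    """Textbook softmax; overflows when a logit is large."""
    exps = [math.exp(v) for v in row]  # math.exp(1000.0) raises OverflowError
    total = sum(exps)
    return [e / total for e in exps]

def stable_softmax(row):
    """Softmax with the row maximum subtracted first (hypothetical helper)."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]  # largest term is exp(0) == 1
    total = sum(exps)                      # so total >= 1 and cannot overflow
    return [e / total for e in exps]

# Same rows as the interactive session above.
for row in ([1000.0, 10.0], [-100.0, -200.0], [3.0, 4.0]):
    print(stable_softmax(row))

try:
    naive_softmax([1000.0, 10.0])
except OverflowError as err:
    print("naive softmax failed:", err)
```

With the maximum subtracted, the forward pass stays finite for any finite logits, which is consistent with the session output above.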

I then tried clipping the values in logits, and now everything works. See the modified snippet below.

logits = tf.debugging.check_numerics(logits, message='logits', name=None)
safe_logits = tf.clip_by_value(logits, -15.0, 15.0)
probabilities = tf.nn.softmax(logits=safe_logits, name='Softmax')
probabilities = tf.debugging.check_numerics(probabilities, message='bad probabilities', name=None)

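In the second statement I clip the values in logits to the range [-15, 15], which somehow prevents nan/inf in the softmax computation. So I was able to fix the problem at hand.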

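However, I still don't understand why this clipping works. (I should mention that clipping to between -20 and 20 does not work; the model still fails with nan or inf in probabilities.)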

Can someone help me understand why this is the case?

I am using tensorflow 1.15.0, running on a 64-bit instance.

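The first place to look is the values themselves, which you have already done. The second place to look is the gradients. Even if a value looks reasonable, a very steep gradient will eventually blow up both the gradients and the values during backpropagation.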

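For example, if the logits are generated by something like log(x), an x of 0.001 produces -6.9. That looks benign. But the gradient is 1000! That will quickly blow up the gradients and values during backprop and the following forward passes.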

# Pretend this is the source value that is fed to a function that generates the logit. 
>>> x = tf.Variable(0.001)
# Let's operate on the source value to generate the logit. 
>>> with tf.GradientTape() as tape:
...   y = tf.math.log(x)
... 
# The logit looks okay... -6.9. 
>>> y
<tf.Tensor: shape=(), dtype=float32, numpy=-6.9077554>
# But the gradient is exploding. 
>>> tape.gradient(y,x)
<tf.Tensor: shape=(), dtype=float32, numpy=999.99994>
>>>
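
To see how a gradient of that size can wreck the values, here is a minimal pure-Python sketch of a single plain gradient-descent step on y = log(x). The learning rate and the helper log_or_nan are assumptions made up for illustration; the helper only mimics how tf.math.log returns -inf at 0 and nan for negative inputs.

```python
import math

def log_or_nan(x):
    """Mimic tf.math.log outside its domain: -inf at 0, nan below 0."""
    if x > 0.0:
        return math.log(x)
    return float("-inf") if x == 0.0 else float("nan")

x = 0.001             # source value, as in the example above
learning_rate = 0.1   # hypothetical step size
grad = 1.0 / x        # d/dx log(x) = 1/x, about 1000 here

# One gradient-descent step that tries to decrease y = log(x).
x_next = x - learning_rate * grad

print(grad)                 # steep gradient
print(x_next)               # the step overshoots far below zero
print(log_or_nan(x_next))   # the next forward pass yields nan
```

One update is enough to push x outside the domain of log, so the next forward pass produces nan even though the original logit looked harmless.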

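Clipping the logits looks like it is about feeding smaller values to softmax, but that may not be why it helps. (In fact, softmax can handle a logit with the value tf.float32.max without trouble, so the value of the logit is unlikely to be the problem.) What may really be happening is that when you clip to 15, you also set the gradient to zero for any logit that would otherwise have been, say, 20 with an exploding gradient. So clipping the value also clips the gradient.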

# This is same source variable as above. 
>>> x = tf.Variable(0.001)
# Now let's operate with clipping. 
>>> with tf.GradientTape() as tape:
...   y = tf.clip_by_value(tf.math.log(x), -1., 1.)
... 
# The clipped logit still looks okay... 
>>> y
<tf.Tensor: shape=(), dtype=float32, numpy=-1.0>
# What may be more important is that the clipping has also zeroed out the gradient
>>> tape.gradient(y,x)
<tf.Tensor: shape=(), dtype=float32, numpy=0.0>
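
The same effect can be checked without TensorFlow using a finite-difference estimate. This is only a sketch: clipped_log and central_difference are hypothetical helpers, with clipped_log standing in for tf.clip_by_value(tf.math.log(x), -1., 1.).

```python
import math

def clipped_log(x, lo=-1.0, hi=1.0):
    """Pure-Python stand-in for tf.clip_by_value(tf.math.log(x), lo, hi)."""
    return min(max(math.log(x), lo), hi)

def central_difference(f, x, h=1e-6):
    """Numerical estimate of df/dx at x."""
    return (f(x + h) - f(x - h)) / (2.0 * h)

x = 0.001
print(central_difference(math.log, x))       # close to 1/x = 1000
print(central_difference(clipped_log, x))    # 0.0: the output is pinned at lo
print(central_difference(clipped_log, 1.5))  # inside the clip range: close to 1/1.5
```

Wherever the clip is active the output is constant, so the derivative is zero and nothing flows back to x; inside the range the gradient passes through unchanged.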
