A bug in apex

The exact error message is:

`if cached_x.grad_fn.next_functions[1][0].variable is not x:`
`IndexError: tuple index out of range`

First, the official issues and the proposed fix:
https://github.com/NVIDIA/apex/issues/1227
https://github.com/NVIDIA/apex/issues/694#issuecomment-918833904

User zwithz already pointed out that the following file should be modified, but a later commenter, classicsong, said that this fix is problematic.

apex/amp/utils.py
# change this line (line 113)
- if cached_x.grad_fn.next_functions[1][0].variable is not x:
# to the following:
+ if cached_x.grad_fn.next_functions[0][0].variable is not x:
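Hard-coding either index ties the check to one particular torch version. As a rough sketch of an alternative, the lookup can scan every entry of `next_functions` and pick the one that exposes `.variable`. Note that `find_source_variable` is a hypothetical helper written for this post; it is not part of apex, and it only assumes that the cast was made from a leaf tensor, as in the line being patched.

```python
import torch


def find_source_variable(cast_tensor):
    """Return the leaf tensor that cast_tensor was cast from, or None.

    Hypothetical helper, not an apex API: it scans all entries of
    grad_fn.next_functions instead of assuming a fixed index.
    """
    grad_fn = cast_tensor.grad_fn
    if grad_fn is None:
        return None
    for node, _ in grad_fn.next_functions:
        # AccumulateGrad nodes expose the leaf tensor as `.variable`;
        # other entries may be None or a non-leaf backward node.
        if node is not None and hasattr(node, "variable"):
            return node.variable
    return None


x = torch.ones(2, 2, requires_grad=True)  # leaf tensor, like a parameter
cached_x = x.half()                       # fp16 copy of the leaf
print(find_source_variable(cached_x) is x)
```

With a helper like this, the patched condition could read `if find_source_variable(cached_x) is not x:`, which does not depend on how many entries the installed torch puts in `next_functions`.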

In fact, zwithz is right: this is a bug in apex.
The length of grad_fn.next_functions after .half() differs between torch versions:

import torch
print(torch.__version__)
x = torch.ones(2, 2, requires_grad=True)
y = x + 2
z = y * y * 3
out = z.mean()
half = out.half()
print(len(half.grad_fn.next_functions))

Version 1.7

1.7.0+cu110
2

Version 1.8

1.8.0.post3
2

Version 1.9

1.9.0+cu111
2

Version 1.10

1.10.2
1
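The script above casts a non-leaf tensor, while the apex check deals with the fp16 copy of a leaf tensor. To see at which index the leaf actually sits on a given installation, something like the following can be run. This is only a diagnostic sketch; the variable names are mine, and the printed node names will differ between torch versions.

```python
import torch

x = torch.ones(2, 2, requires_grad=True)  # leaf tensor, like a model parameter
cached_x = x.half()                       # fp16 copy of the leaf

node = cached_x.grad_fn
print(torch.__version__, type(node).__name__, len(node.next_functions))

for index, (next_node, _) in enumerate(node.next_functions):
    # An entry is None when that input does not require a gradient.
    name = type(next_node).__name__ if next_node is not None else None
    print(index, name)

# Indices whose entry is an AccumulateGrad node pointing back at x.
matches = [
    i
    for i, (n, _) in enumerate(node.next_functions)
    if n is not None and getattr(n, "variable", None) is x
]
print(matches)
```

Whatever index shows up in `matches` is the one the apex line needs for that torch version.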

About grad_fn

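grad_fn holds the graph of chained computations.
If you create a tensor with requires_grad=True, every tensor computed from it carries a grad_fn attribute, which is used to track the computation history.
Autograd is a reverse-mode automatic differentiation system. Conceptually, autograd records a graph of all the operations that produced the data as you execute them, giving you a directed acyclic graph whose leaves are the input tensors and whose roots are the output tensors. By tracing this graph from roots to leaves, you can compute the gradients automatically using the chain rule.
Internally, autograd represents this graph as a graph of Function objects (really expressions). During the forward pass, autograd builds a graph representing the function that computes the gradient (the .grad_fn attribute of each torch.Tensor is an entry point into this graph). Once the forward pass has completed, we evaluate this graph in the backward pass to compute the gradients.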
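A small sketch of the points above, reusing the tensors from the version test earlier in the post: the leaf has no grad_fn, every computed tensor has one, and calling backward() walks the graph to fill in the leaf's gradient.

```python
import torch

x = torch.ones(2, 2, requires_grad=True)
y = x + 2
z = y * y * 3
out = z.mean()

print(x.grad_fn)    # None: x is a leaf created directly by the user
print(y.grad_fn)    # backward node recorded for the addition
print(out.grad_fn)  # backward node recorded for mean()

out.backward()
# d(out)/dx = (1/4) * 6 * (x + 2), which is 4.5 for every element when x = 1
print(x.grad)
```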

Official example

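If you have a model and follow loss in the backward direction, using its .grad_fn attribute, you will see a graph of computations that looks like this: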

input -> conv2d -> relu -> maxpool2d -> conv2d -> relu -> maxpool2d
      -> flatten -> linear -> relu -> linear -> relu -> linear
      -> MSELoss
      -> loss
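So when we call loss.backward(), the whole graph is differentiated, and every tensor in the graph that has requires_grad=True will have the gradient accumulated into its .grad tensor.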
print(loss.grad_fn)  # MSELoss
print(loss.grad_fn.next_functions[0][0])  # Linear
print(loss.grad_fn.next_functions[0][0].next_functions[0][0])  # labelled ReLU in the example, but the output below shows AccumulateGrad
------>
<MseLossBackward0 object at 0x7f09e8c788d0>
<AddmmBackward0 object at 0x7f09e8c78978>
<AccumulateGrad object at 0x7f09e8c78978>
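The snippet above relies on a `loss` defined elsewhere. As a self-contained sketch, the same kind of walk can be done on a tiny network. The network, the `walk` helper, and all sizes here are made up for illustration, and the exact backward node names depend on the torch version.

```python
import torch
from torch import nn

torch.manual_seed(0)
net = nn.Sequential(nn.Linear(4, 3), nn.ReLU(), nn.Linear(3, 2))
criterion = nn.MSELoss()
loss = criterion(net(torch.randn(1, 4)), torch.randn(1, 2))


def walk(node, depth=0, names=None):
    """Print the backward graph depth-first and collect node type names."""
    names = [] if names is None else names
    names.append(type(node).__name__)
    print("  " * depth + names[-1])
    for next_node, _ in node.next_functions:
        # None marks an input that needs no gradient (e.g. the raw data).
        if next_node is not None:
            walk(next_node, depth + 1, names)
    return names


names = walk(loss.grad_fn)
```

The walk starts at the loss node and ends at AccumulateGrad nodes, one per parameter reached, which is where backward() deposits the gradients.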

torch

#apex #fp16

A bug in apex's compatibility with torch 1.10

https://johnson7788.github.io/2022/03/08/apex%E7%9A%84%E4%B8%80%E4%B8%AAbug/

Author

Johnson

Published on

March 8, 2022
