Fixing the DeepSpeed ninja build error

How to fix the ninja error raised when training a model with DeepSpeed

1. The error is as follows:

[1/3] /usr/bin/nvcc -DTORCH_EXTENSION_NAME=fused_adam -DTORCH_API_INCLUDE_EXTENSION_H -DPYBIND11_COMPILER_TYPE=\"_gcc\" -DPYBIND11_STDLIB=\"_libstdcpp\" -DPYBIND11_BUILD_ABI=\"_cxxabi1011\" -I/home/wac/johnson/anaconda3/envs/gpt/lib/python3.10/site-packages/deepspeed/ops/csrc/includes -I/home/wac/johnson/anaconda3/envs/gpt/lib/python3.10/site-packages/deepspeed/ops/csrc/adam -isystem /home/wac/johnson/anaconda3/envs/gpt/lib/python3.10/site-packages/torch/include -isystem /home/wac/johnson/anaconda3/envs/gpt/lib/python3.10/site-packages/torch/include/torch/csrc/api/include -isystem /home/wac/johnson/anaconda3/envs/gpt/lib/python3.10/site-packages/torch/include/TH -isystem /home/wac/johnson/anaconda3/envs/gpt/lib/python3.10/site-packages/torch/include/THC -isystem /home/wac/johnson/anaconda3/envs/gpt/include/python3.10 -D_GLIBCXX_USE_CXX11_ABI=0 -D__CUDA_NO_HALF_OPERATORS__ -D__CUDA_NO_HALF_CONVERSIONS__ -D__CUDA_NO_BFLOAT16_CONVERSIONS__ -D__CUDA_NO_HALF2_OPERATORS__ --expt-relaxed-constexpr -gencode=arch=compute_86,code=compute_86 -gencode=arch=compute_86,code=sm_86 --compiler-options '-fPIC' -O3 -DVERSION_GE_1_1 -DVERSION_GE_1_3 -DVERSION_GE_1_5 -lineinfo --use_fast_math -gencode=arch=compute_86,code=sm_86 -gencode=arch=compute_86,code=compute_86 -std=c++17 -c /home/wac/johnson/anaconda3/envs/gpt/lib/python3.10/site-packages/deepspeed/ops/csrc/adam/multi_tensor_adam.cu -o multi_tensor_adam.cuda.o
FAILED: multi_tensor_adam.cuda.o
/usr/bin/nvcc -DTORCH_EXTENSION_NAME=fused_adam -DTORCH_API_INCLUDE_EXTENSION_H -DPYBIND11_COMPILER_TYPE=\"_gcc\" -DPYBIND11_STDLIB=\"_libstdcpp\" -DPYBIND11_BUILD_ABI=\"_cxxabi1011\" -I/home/wac/johnson/anaconda3/envs/gpt/lib/python3.10/site-packages/deepspeed/ops/csrc/includes -I/home/wac/johnson/anaconda3/envs/gpt/lib/python3.10/site-packages/deepspeed/ops/csrc/adam -isystem /home/wac/johnson/anaconda3/envs/gpt/lib/python3.10/site-packages/torch/include -isystem /home/wac/johnson/anaconda3/envs/gpt/lib/python3.10/site-packages/torch/include/torch/csrc/api/include -isystem /home/wac/johnson/anaconda3/envs/gpt/lib/python3.10/site-packages/torch/include/TH -isystem /home/wac/johnson/anaconda3/envs/gpt/lib/python3.10/site-packages/torch/include/THC -isystem /home/wac/johnson/anaconda3/envs/gpt/include/python3.10 -D_GLIBCXX_USE_CXX11_ABI=0 -D__CUDA_NO_HALF_OPERATORS__ -D__CUDA_NO_HALF_CONVERSIONS__ -D__CUDA_NO_BFLOAT16_CONVERSIONS__ -D__CUDA_NO_HALF2_OPERATORS__ --expt-relaxed-constexpr -gencode=arch=compute_86,code=compute_86 -gencode=arch=compute_86,code=sm_86 --compiler-options '-fPIC' -O3 -DVERSION_GE_1_1 -DVERSION_GE_1_3 -DVERSION_GE_1_5 -lineinfo --use_fast_math -gencode=arch=compute_86,code=sm_86 -gencode=arch=compute_86,code=compute_86 -std=c++17 -c /home/wac/johnson/anaconda3/envs/gpt/lib/python3.10/site-packages/deepspeed/ops/csrc/adam/multi_tensor_adam.cu -o multi_tensor_adam.cuda.o
nvcc fatal : Value 'c++17' is not defined for option 'std'
[2/3] c++ -MMD -MF fused_adam_frontend.o.d -DTORCH_EXTENSION_NAME=fused_adam -DTORCH_API_INCLUDE_EXTENSION_H -DPYBIND11_COMPILER_TYPE=\"_gcc\" -DPYBIND11_STDLIB=\"_libstdcpp\" -DPYBIND11_BUILD_ABI=\"_cxxabi1011\" -I/home/wac/johnson/anaconda3/envs/gpt/lib/python3.10/site-packages/deepspeed/ops/csrc/includes -I/home/wac/johnson/anaconda3/envs/gpt/lib/python3.10/site-packages/deepspeed/ops/csrc/adam -isystem /home/wac/johnson/anaconda3/envs/gpt/lib/python3.10/site-packages/torch/include -isystem /home/wac/johnson/anaconda3/envs/gpt/lib/python3.10/site-packages/torch/include/torch/csrc/api/include -isystem /home/wac/johnson/anaconda3/envs/gpt/lib/python3.10/site-packages/torch/include/TH -isystem /home/wac/johnson/anaconda3/envs/gpt/lib/python3.10/site-packages/torch/include/THC -isystem /home/wac/johnson/anaconda3/envs/gpt/include/python3.10 -D_GLIBCXX_USE_CXX11_ABI=0 -fPIC -std=c++17 -O3 -std=c++17 -g -Wno-reorder -DVERSION_GE_1_1 -DVERSION_GE_1_3 -DVERSION_GE_1_5 -c /home/wac/johnson/anaconda3/envs/gpt/lib/python3.10/site-packages/deepspeed/ops/csrc/adam/fused_adam_frontend.cpp -o fused_adam_frontend.o
ninja: build stopped: subcommand failed.
Traceback (most recent call last):
File "/home/wac/johnson/anaconda3/envs/gpt/lib/python3.10/site-packages/torch/utils/cpp_extension.py", line 1893, in _run_ninja_build
subprocess.run(
File "/home/wac/johnson/anaconda3/envs/gpt/lib/python3.10/subprocess.py", line 526, in run
raise CalledProcessError(retcode, process.args,
subprocess.CalledProcessError: Command '['ninja', '-v']' returned non-zero exit status 1.

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
File "/home/wac/johnson/anaconda3/envs/gpt/lib/python3.10/site-packages/torch/utils/cpp_extension.py", line 1509, in _jit_compile
_write_ninja_file_and_build_library(
File "/home/wac/johnson/anaconda3/envs/gpt/lib/python3.10/site-packages/torch/utils/cpp_extension.py", line 1624, in _write_ninja_file_and_build_library
_run_ninja_build(
File "/home/wac/johnson/anaconda3/envs/gpt/lib/python3.10/site-packages/torch/utils/cpp_extension.py", line 1909, in _run_ninja_build
raise RuntimeError(message) from e
RuntimeError: Error building extension 'fused_adam'
python-BaseException
[2023-08-23 16:36:02,371] [INFO] [launch.py:315:sigkill_handler] Killing subprocess 2598308

Or the error may look like this:

[1/3] /usr/bin/nvcc  -DTORCH_EXTENSION_NAME=fused_adam -DTORCH_API_INCLUDE_EXTENSION_H -DPYBIND11_COMPILER_TYPE=\"_gcc\" -DPYBIND11_STDLIB=\"_libstdcpp\" -DPYBIND11_BUILD_ABI=\"_cxxabi1011\" -I/home/wac/johnson/anaconda3/envs/gpt/lib/python3.10/site-packages/deepspeed/ops/csrc/includes -I/home/wac/johnson/anaconda3/envs/gpt/lib/python3.10/site-packages/deepspeed/ops/csrc/adam -isystem /home/wac/johnson/anaconda3/envs/gpt/lib/python3.10/site-packages/torch/include -isystem /home/wac/johnson/anaconda3/envs/gpt/lib/python3.10/site-packages/torch/include/torch/csrc/api/include -isystem /home/wac/johnson/anaconda3/envs/gpt/lib/python3.10/site-packages/torch/include/TH -isystem /home/wac/johnson/anaconda3/envs/gpt/lib/python3.10/site-packages/torch/include/THC -isystem /home/wac/johnson/anaconda3/envs/gpt/include/python3.10 -D_GLIBCXX_USE_CXX11_ABI=0 -D__CUDA_NO_HALF_OPERATORS__ -D__CUDA_NO_HALF_CONVERSIONS__ -D__CUDA_NO_BFLOAT16_CONVERSIONS__ -D__CUDA_NO_HALF2_OPERATORS__ --expt-relaxed-constexpr -gencode=arch=compute_86,code=compute_86 -gencode=arch=compute_86,code=sm_86 --compiler-options '-fPIC' -O3 -DVERSION_GE_1_1 -DVERSION_GE_1_3 -DVERSION_GE_1_5 -lineinfo --use_fast_math -gencode=arch=compute_86,code=sm_86 -gencode=arch=compute_86,code=compute_86 -DBF16_AVAILABLE -std=c++17 -c /home/wac/johnson/anaconda3/envs/gpt/lib/python3.10/site-packages/deepspeed/ops/csrc/adam/multi_tensor_adam.cu -o multi_tensor_adam.cuda.o 
FAILED: multi_tensor_adam.cuda.o
/usr/bin/nvcc -DTORCH_EXTENSION_NAME=fused_adam -DTORCH_API_INCLUDE_EXTENSION_H -DPYBIND11_COMPILER_TYPE=\"_gcc\" -DPYBIND11_STDLIB=\"_libstdcpp\" -DPYBIND11_BUILD_ABI=\"_cxxabi1011\" -I/home/wac/johnson/anaconda3/envs/gpt/lib/python3.10/site-packages/deepspeed/ops/csrc/includes -I/home/wac/johnson/anaconda3/envs/gpt/lib/python3.10/site-packages/deepspeed/ops/csrc/adam -isystem /home/wac/johnson/anaconda3/envs/gpt/lib/python3.10/site-packages/torch/include -isystem /home/wac/johnson/anaconda3/envs/gpt/lib/python3.10/site-packages/torch/include/torch/csrc/api/include -isystem /home/wac/johnson/anaconda3/envs/gpt/lib/python3.10/site-packages/torch/include/TH -isystem /home/wac/johnson/anaconda3/envs/gpt/lib/python3.10/site-packages/torch/include/THC -isystem /home/wac/johnson/anaconda3/envs/gpt/include/python3.10 -D_GLIBCXX_USE_CXX11_ABI=0 -D__CUDA_NO_HALF_OPERATORS__ -D__CUDA_NO_HALF_CONVERSIONS__ -D__CUDA_NO_BFLOAT16_CONVERSIONS__ -D__CUDA_NO_HALF2_OPERATORS__ --expt-relaxed-constexpr -gencode=arch=compute_86,code=compute_86 -gencode=arch=compute_86,code=sm_86 --compiler-options '-fPIC' -O3 -DVERSION_GE_1_1 -DVERSION_GE_1_3 -DVERSION_GE_1_5 -lineinfo --use_fast_math -gencode=arch=compute_86,code=sm_86 -gencode=arch=compute_86,code=compute_86 -DBF16_AVAILABLE -std=c++17 -c /home/wac/johnson/anaconda3/envs/gpt/lib/python3.10/site-packages/deepspeed/ops/csrc/adam/multi_tensor_adam.cu -o multi_tensor_adam.cuda.o
In file included from /usr/include/cuda_runtime.h:83,
from <command-line>:
/usr/include/crt/host_config.h:138:2: error: #error -- unsupported GNU version! gcc versions later than 8 are not supported!
138 | #error -- unsupported GNU version! gcc versions later than 8 are not supported!
| ^~~~~
In file included from /home/wac/johnson/anaconda3/envs/gpt/lib/python3.10/site-packages/torch/include/c10/core/ScalarType.h:3,
from /home/wac/johnson/anaconda3/envs/gpt/lib/python3.10/site-packages/torch/include/c10/core/StorageImpl.h:4,
from /home/wac/johnson/anaconda3/envs/gpt/lib/python3.10/site-packages/torch/include/c10/core/Storage.h:3,
from /home/wac/johnson/anaconda3/envs/gpt/lib/python3.10/site-packages/torch/include/c10/core/TensorImpl.h:8,
from /home/wac/johnson/anaconda3/envs/gpt/lib/python3.10/site-packages/torch/include/c10/core/GeneratorImpl.h:8,
from /home/wac/johnson/anaconda3/envs/gpt/lib/python3.10/site-packages/torch/include/ATen/core/Generator.h:22,
from /home/wac/johnson/anaconda3/envs/gpt/lib/python3.10/site-packages/torch/include/ATen/CPUGeneratorImpl.h:3,
from /home/wac/johnson/anaconda3/envs/gpt/lib/python3.10/site-packages/torch/include/ATen/Context.h:3,
from /home/wac/johnson/anaconda3/envs/gpt/lib/python3.10/site-packages/torch/include/ATen/ATen.h:7,
from /home/wac/johnson/anaconda3/envs/gpt/lib/python3.10/site-packages/deepspeed/ops/csrc/adam/multi_tensor_adam.cu:11:
/home/wac/johnson/anaconda3/envs/gpt/lib/python3.10/site-packages/torch/include/c10/util/BFloat16.h:11:10: fatal error: cuda_bf16.h: No such file or directory
11 | #include <cuda_bf16.h>
| ^~~~~~~~~~~~~
compilation terminated.
[2/3] c++ -MMD -MF fused_adam_frontend.o.d -DTORCH_EXTENSION_NAME=fused_adam -DTORCH_API_INCLUDE_EXTENSION_H -DPYBIND11_COMPILER_TYPE=\"_gcc\" -DPYBIND11_STDLIB=\"_libstdcpp\" -DPYBIND11_BUILD_ABI=\"_cxxabi1011\" -I/home/wac/johnson/anaconda3/envs/gpt/lib/python3.10/site-packages/deepspeed/ops/csrc/includes -I/home/wac/johnson/anaconda3/envs/gpt/lib/python3.10/site-packages/deepspeed/ops/csrc/adam -isystem /home/wac/johnson/anaconda3/envs/gpt/lib/python3.10/site-packages/torch/include -isystem /home/wac/johnson/anaconda3/envs/gpt/lib/python3.10/site-packages/torch/include/torch/csrc/api/include -isystem /home/wac/johnson/anaconda3/envs/gpt/lib/python3.10/site-packages/torch/include/TH -isystem /home/wac/johnson/anaconda3/envs/gpt/lib/python3.10/site-packages/torch/include/THC -isystem /home/wac/johnson/anaconda3/envs/gpt/include/python3.10 -D_GLIBCXX_USE_CXX11_ABI=0 -fPIC -std=c++17 -O3 -std=c++17 -g -Wno-reorder -DVERSION_GE_1_1 -DVERSION_GE_1_3 -DVERSION_GE_1_5 -DBF16_AVAILABLE -c /home/wac/johnson/anaconda3/envs/gpt/lib/python3.10/site-packages/deepspeed/ops/csrc/adam/fused_adam_frontend.cpp -o fused_adam_frontend.o
ninja: build stopped: subcommand failed.
Traceback (most recent call last):
File "/home/wac/johnson/anaconda3/envs/gpt/lib/python3.10/site-packages/torch/utils/cpp_extension.py", line 1893, in _run_ninja_build
subprocess.run(
File "/home/wac/johnson/anaconda3/envs/gpt/lib/python3.10/subprocess.py", line 526, in run
raise CalledProcessError(retcode, process.args,
subprocess.CalledProcessError: Command '['ninja', '-v']' returned non-zero exit status 1.

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
File "/home/wac/johnson/anaconda3/envs/gpt/lib/python3.10/site-packages/torch/utils/cpp_extension.py", line 1509, in _jit_compile
_write_ninja_file_and_build_library(
File "/home/wac/johnson/anaconda3/envs/gpt/lib/python3.10/site-packages/torch/utils/cpp_extension.py", line 1624, in _write_ninja_file_and_build_library
_run_ninja_build(
File "/home/wac/johnson/anaconda3/envs/gpt/lib/python3.10/site-packages/torch/utils/cpp_extension.py", line 1909, in _run_ninja_build
raise RuntimeError(message) from e
RuntimeError: Error building extension 'fused_adam'
python-BaseException
[2023-08-23 16:56:02,423] [INFO] [launch.py:315:sigkill_handler] Killing subprocess 2621324

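Both failures actually come from nvcc failing while compiling one of the extension's source files. The first error is caused by an invalid C++ standard being passed to the CUDA compiler: the log shows that the .cu file is compiled with -std=c++17, but the nvcc being invoked does not support the C++17 standard. The second error looks like a gcc version problem. However, the CUDA Compilers compatibility table (github.com) shows that this CUDA version supports gcc 8, so the real cause is in fact the nvcc version.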
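To make the version reasoning concrete, here is a minimal Python sketch that is not part of the original troubleshooting. It pulls the release number out of `nvcc --version` output and checks it against the assumption that nvcc only accepts -std=c++17 from CUDA 11.0 onward. The helper names `parse_nvcc_release` and `supports_cpp17` are invented for this example, and the two sample strings are the release lines from the nvcc outputs shown in the analysis section.

```python
import re
from typing import Tuple

# Assumption for this sketch: nvcc accepts -std=c++17 starting with CUDA 11.0.
MIN_CUDA_FOR_CPP17 = (11, 0)


def parse_nvcc_release(version_output: str) -> Tuple[int, int]:
    """Extract (major, minor) from the 'release X.Y' part of `nvcc --version` output."""
    match = re.search(r"release (\d+)\.(\d+)", version_output)
    if match is None:
        raise ValueError("no 'release X.Y' found in nvcc output")
    return int(match.group(1)), int(match.group(2))


def supports_cpp17(release: Tuple[int, int]) -> bool:
    """Hypothetical helper: True if this release should accept -std=c++17."""
    return release >= MIN_CUDA_FOR_CPP17


old = parse_nvcc_release("Cuda compilation tools, release 10.1, V10.1.243")
new = parse_nvcc_release("Cuda compilation tools, release 11.8, V11.8.89")
print(old, supports_cpp17(old))  # (10, 1) False
print(new, supports_cpp17(new))  # (11, 8) True
```

Under that assumption, the 10.1 compiler is expected to reject the flag, which matches the "Value 'c++17' is not defined for option 'std'" message in the first log.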

2. Analysis

Check CUDA:

nvidia-smi
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 520.61.05 Driver Version: 520.61.05 CUDA Version: 11.8 |

Check the nvcc that the build actually uses. It turns out this nvcc was never updated:

/usr/bin/nvcc --version
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2019 NVIDIA Corporation
Built on Sun_Jul_28_19:07:16_PDT_2019
Cuda compilation tools, release 10.1, V10.1.243

The real nvcc reports the following version:

--version
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2022 NVIDIA Corporation
Built on Wed_Sep_21_10:33:58_PDT_2022
Cuda compilation tools, release 11.8, V11.8.89
Build cuda_11.8.r11.8/compiler.31833905_0
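The comparison above was done by hand. As a rough sketch (not something from the original post), the same check can be scripted: walk PATH in order, collect every executable named nvcc, and ask each one for its release. The function names `find_all_nvcc`, `nvcc_release` and `shadowed_by_older` are hypothetical, and the script only assumes that each nvcc prints a "release X.Y" line like the outputs shown above.

```python
import os
import re
import subprocess
from typing import List, Optional, Tuple


def find_all_nvcc(path_env: Optional[str] = None) -> List[str]:
    """Return every executable file named nvcc on PATH, in lookup order."""
    if path_env is None:
        path_env = os.environ.get("PATH", "")
    found: List[str] = []
    for directory in path_env.split(os.pathsep):
        if not directory:
            continue
        candidate = os.path.join(directory, "nvcc")
        if (os.path.isfile(candidate) and os.access(candidate, os.X_OK)
                and candidate not in found):
            found.append(candidate)
    return found


def nvcc_release(nvcc_path: str) -> Optional[str]:
    """Run `<nvcc_path> --version` and return the 'X.Y' release, or None on failure."""
    try:
        result = subprocess.run([nvcc_path, "--version"],
                                capture_output=True, text=True, check=False)
    except OSError:
        return None
    match = re.search(r"release (\d+\.\d+)", result.stdout)
    return match.group(1) if match else None


def release_key(release: str) -> Tuple[int, ...]:
    """Turn '11.8' into (11, 8) so releases compare numerically."""
    return tuple(int(part) for part in release.split("."))


def shadowed_by_older(releases: List[str]) -> bool:
    """True if the first release in PATH order is older than a later one."""
    if len(releases) < 2:
        return False
    return release_key(releases[0]) < max(release_key(r) for r in releases[1:])


if __name__ == "__main__":
    found = [(path, nvcc_release(path)) for path in find_all_nvcc()]
    for path, release in found:
        print(path, release)
    # Entries whose version could not be read are skipped in the comparison.
    known = [release for _, release in found if release is not None]
    if shadowed_by_older(known):
        print("warning: the first nvcc on PATH is older than a later one")
```

On a machine like the one described here, the warning would be expected to fire while /usr/bin/nvcc (release 10.1) comes before the 11.8 toolkit on PATH.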

3. Fix

Running the same compile command with the correct nvcc succeeds and produces the output file multi_tensor_adam.cuda.o:

/usr/local/cuda/bin/nvcc  -DTORCH_EXTENSION_NAME=fused_adam -DTORCH_API_INCLUDE_EXTENSION_H -DPYBIND11_COMPILER_TYPE=\"_gcc\" -DPYBIND11_STDLIB=\"_libstdcpp\" -DPYBIND11_BUILD_ABI=\"_cxxabi1011\" -I/home/wac/johnson/anaconda3/envs/gpt/lib/python3.10/site-packages/deepspeed/ops/csrc/includes -I/home/wac/johnson/anaconda3/envs/gpt/lib/python3.10/site-packages/deepspeed/ops/csrc/adam -isystem /home/wac/johnson/anaconda3/envs/gpt/lib/python3.10/site-packages/torch/include -isystem /home/wac/johnson/anaconda3/envs/gpt/lib/python3.10/site-packages/torch/include/torch/csrc/api/include -isystem /home/wac/johnson/anaconda3/envs/gpt/lib/python3.10/site-packages/torch/include/TH -isystem /home/wac/johnson/anaconda3/envs/gpt/lib/python3.10/site-packages/torch/include/THC -isystem /home/wac/johnson/anaconda3/envs/gpt/include/python3.10 -D_GLIBCXX_USE_CXX11_ABI=0 -D__CUDA_NO_HALF_OPERATORS__ -D__CUDA_NO_HALF_CONVERSIONS__ -D__CUDA_NO_BFLOAT16_CONVERSIONS__ -D__CUDA_NO_HALF2_OPERATORS__ --expt-relaxed-constexpr -gencode=arch=compute_86,code=compute_86 -gencode=arch=compute_86,code=sm_86 --compiler-options '-fPIC' -O3 -DVERSION_GE_1_1 -DVERSION_GE_1_3 -DVERSION_GE_1_5 -lineinfo --use_fast_math -gencode=arch=compute_86,code=sm_86 -gencode=arch=compute_86,code=compute_86 -DBF16_AVAILABLE -std=c++17 -c /home/wac/johnson/anaconda3/envs/gpt/lib/python3.10/site-packages/deepspeed/ops/csrc/adam/multi_tensor_adam.cu -o multi_tensor_adam.cuda.o

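So the fix is simply to remove the stale binary with sudo rm /usr/bin/nvcc. Note: do not create a symlink in its place; just delete it. After deleting it, check that nvcc now resolves to the version we want, then run the DeepSpeed training again. It now works.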

which nvcc
/usr/local/cuda/bin/nvcc
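For a quick re-check from Python, the sketch below guesses which nvcc a JIT extension build would call. The lookup order it encodes (the CUDA_HOME or CUDA_PATH environment variable first, then the first nvcc on PATH, then /usr/local/cuda) is my understanding of how the build locates the toolkit and should be treated as an assumption, not as PyTorch's actual code. `guess_nvcc` is a hypothetical helper.

```python
import os
import shutil
from typing import Mapping, Optional


def guess_nvcc(env: Mapping[str, str]) -> Optional[str]:
    """Guess which nvcc an extension build would use (simplified, assumed order).

    1. $CUDA_HOME/bin/nvcc or $CUDA_PATH/bin/nvcc if either variable is set,
    2. otherwise the first nvcc found on PATH,
    3. otherwise /usr/local/cuda/bin/nvcc if that file exists.
    """
    cuda_home = env.get("CUDA_HOME") or env.get("CUDA_PATH")
    if cuda_home:
        return os.path.join(cuda_home, "bin", "nvcc")
    on_path = shutil.which("nvcc", path=env.get("PATH", ""))
    if on_path:
        return on_path
    fallback = "/usr/local/cuda/bin/nvcc"
    return fallback if os.path.exists(fallback) else None


if __name__ == "__main__":
    print("extension builds would likely use:", guess_nvcc(os.environ))
```

If this still prints /usr/bin/nvcc, the stale binary is still being found first and the build would be expected to fail in the same way.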

Source: https://johnson7788.github.io/2023/08/23/deepspeed-ninja%E6%8A%A5%E9%94%99%E8%A7%A3%E5%86%B3/
Author: Johnson
Published: August 23, 2023