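Building a Machine Learning Server

Purpose

To set up a deep learning system, we assemble a deep learning server from scratch.

Price Comparison

We compare the cost of cloud machines against on-premises servers for the same compute.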

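Vendor | Term | Price | Notes | Other | Data source
AWS | 3 years | 320,000+ CNY (GPU not included) | Cloud vendors do not offer consumer GPUs, so prices are high | |
Alibaba Cloud | 3 years | 620,000 CNY (GPU included) | Cloud vendors do not offer consumer GPUs, so prices are high | |
Self-built | 3 years | 30,000 + 34,000 = 64,000 CNY | 2 × RTX 4090 + server | Electricity and network cost extra |
Purchased from a data-center vendor | 3 years | 260,000 CNY | 2 × RTX 4090 + server | Electricity and network cost extra |
Rented from a data-center vendor | 3 years | 12,000 CNY/month per machine × 36 | 2 × RTX 4090 + server | Electricity and network cost extra |

By this accounting, building the server ourselves offers the best value for money.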
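The comparison can be reproduced with a few lines of arithmetic. This is a minimal sketch that uses only the prices quoted in the table; the dictionary keys and the `rank_by_cost` helper are illustrative names, the AWS figure is a lower bound without GPUs, and electricity and network costs are left out because the table does not quantify them.

```python
# Three-year cost comparison using the prices quoted in the table (CNY).
TERM_MONTHS = 36

three_year_cost = {
    "AWS (no GPU, lower bound)": 320_000,
    "Alibaba Cloud (with GPU)": 620_000,
    "Self-built (2 x 4090 + server)": 30_000 + 34_000,
    "Purchased from data-center vendor": 260_000,
    "Rented from data-center vendor": 12_000 * TERM_MONTHS,
}


def rank_by_cost(costs):
    """Return (option, cost) pairs sorted from cheapest to most expensive."""
    return sorted(costs.items(), key=lambda item: item[1])


ranking = rank_by_cost(three_year_cost)
cheapest_cost = ranking[0][1]
for name, cost in ranking:
    # The ratio shows how many times more expensive each option is
    # than the cheapest one.
    print(f"{name:36s} {cost:>9,d} CNY  ({cost / cheapest_cost:.1f}x)")
```

Renting works out to 432,000 CNY over the term, so it is more expensive than buying the same machine outright from the same kind of vendor.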

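Choosing a Server

Standard server prices

Multi-GPU server prices

DELL R750XA: 4310 ×2 / 32G ×4 / 960G SSD ×2 / 16T ×3 / H755 / dual-port gigabit NIC / GPU not included / 1400W ×2. 51,500 CNY including tax; GPUs are 21,000 CNY each.

Another vendor: a machine that supports 8 GPUs is commonly called an 8-GPU server, and it cannot be bought for a few tens of thousands of CNY. A 2-socket R750 rack server supports at most 2 GPUs.

Tower server T640: 8 × 3.5-inch bays / 4216 ×2 / 128G / 960G SSD + 16T enterprise disk / RTX4090-24G ×2 / GPU kit / H730P-2G / 2000W ×2 / 2 years. 79,500 CNY including tax.

Rack server R750XA: 8 × 2.5-inch bays / 4316 ×2 / 128G / 960G SSD + 2.4T enterprise disk ×7 / RTX4090-24G ×2 / GPU kit / H745-4G / 2400W ×2 / rails / 3 years. 97,300 CNY including tax.

Inspur NF8480M6: 5318H (18C, 150W, 2.5GHz) ×4 / 32G ×4 / 960G SSD ×2 / 16T ×3 / PM8222-8 / dual-port gigabit NIC / GPU not included / 1300W ×2. 71,500 CNY including tax; an RTX 4090 is 22,000 CNY including tax.

DELL R750XA: 5318Y ×2 / 32G ×4 / 960G SSD ×2 / 16T ×3 / H755 / dual-port gigabit NIC / GPU not included / 1400W ×2. 63,500 CNY including tax.

Overall, new servers that support multiple GPUs are all expensive, so we considered a second-hand server.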

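Second-Hand Server

We chose the 4028GR. It supports two Intel E5-26xx v3 or v4 CPUs (LGA 2011-3 socket). The machine has 24 memory slots and accepts DDR4 modules of 4G, 8G, 16G, 32G, 64G, or 128G, up to 24 × 128G on the v3/v4 platform; memory frequency is limited to 2133 or 2400 MHz.
It has four Supermicro 1600W power supplies, 3200W in total with 2+2 redundancy, all hot-swappable. The motherboard integrates a Matrox G200eR graphics chip with 256M of video memory and a VGA output.
It can take full-size GPUs and supports 8 or 10 cards.

No. | Item | Model | Specification | Qty | Unit price (CNY) | Total (CNY)
1 | Server | 4028GR | E5-2696 v4 ×2, 32G ×4 (Hyundai, 2133 MHz), 480G SSD ×1, 1T SAS ×16, dual 1600W power supplies | 1 | 15,820 | 15,820

So we decided to go with the second-hand server.

For GPUs we chose reference-design blower-style RTX 4090D cards: three cards at 13,700 CNY each, about 41,000 CNY in total including tax, or 12,094 CNY each excluding tax.

After tax deductions, the total comes to about 47,000 CNY.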

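Server Testing and Installation

Checking System Information

The sixteen 1T disks are configured as a RAID 5 array with one disk as a hot spare, formatted as ext4, and mounted.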

df -h
Filesystem Size Used Avail Use% Mounted on
tmpfs 13G 2.9M 13G 1% /run
/dev/sda2 439G 28G 389G 7% /
tmpfs 63G 0 63G 0% /dev/shm
tmpfs 5.0M 0 5.0M 0% /run/lock
efivarfs 256K 169K 83K 67% /sys/firmware/efi/efivars
/dev/sdb1 13T 28K 12T 1% /media/wac/backup
/dev/sda1 511M 6.1M 505M 2% /boot/efi
tmpfs 13G 80K 13G 1% /run/user/128
tmpfs 13G 68K 13G 1% /run/user/1002
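The 13T reported for /dev/sdb1 is consistent with the RAID layout described earlier. The sketch below works through the capacity arithmetic; it assumes each "1T" disk holds 10^12 bytes (decimal terabytes) and ignores filesystem overhead, and `raid5_usable_bytes` is a hypothetical helper name.

```python
def raid5_usable_bytes(total_disks, hot_spares, disk_bytes):
    """Usable capacity of a RAID 5 array.

    Hot spares sit idle until a member fails, and RAID 5 spends the
    equivalent of one active disk on parity.
    """
    active = total_disks - hot_spares
    if active < 3:
        raise ValueError("RAID 5 needs at least 3 active disks")
    return (active - 1) * disk_bytes


usable = raid5_usable_bytes(total_disks=16, hot_spares=1, disk_bytes=10**12)

# df -h reports sizes in powers of 1024, so convert to TiB for comparison.
print(f"{usable / 10**12:.0f} TB = {usable / 2**40:.2f} TiB")
```

Fourteen data disks give 14 TB, which is about 12.73 TiB; `df -h` rounds that up to 13T.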

Testing the CPU (E5-2699 v4), memory, and disks

lscpu
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Address sizes: 46 bits physical, 48 bits virtual
Byte Order: Little Endian
CPU(s): 88
On-line CPU(s) list: 0-87
Vendor ID: GenuineIntel
Model name: Intel(R) Xeon(R) CPU E5-2699 v4 @ 2.20GHz
CPU family: 6
Model: 79
Thread(s) per core: 2
Core(s) per socket: 22
Socket(s): 2
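The topology fields reported by lscpu can be cross-checked against the logical CPU count. The sketch below assumes the English field labels that lscpu prints under an English locale (for example when run as `LANG=C lscpu`); the sample text carries only the four relevant values from the output above, and the function names are illustrative.

```python
LSCPU_SAMPLE = """\
CPU(s): 88
Thread(s) per core: 2
Core(s) per socket: 22
Socket(s): 2
"""


def parse_lscpu(text):
    """Turn 'key: value' lines into a dict, splitting on the first colon."""
    fields = {}
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if sep:
            fields[key.strip()] = value.strip()
    return fields


def expected_logical_cpus(fields):
    """Logical CPUs = sockets x cores per socket x threads per core."""
    return (
        int(fields["Socket(s)"])
        * int(fields["Core(s) per socket"])
        * int(fields["Thread(s) per core"])
    )


fields = parse_lscpu(LSCPU_SAMPLE)
print(expected_logical_cpus(fields), "expected,", fields["CPU(s)"], "reported")
```

Two sockets of 22 cores with 2 threads each give 88 logical CPUs, matching the reported count, so both CPUs are detected and hyper-threading is on.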

The memory modules are from Hynix (Hynix Semiconductor).
sudo dmidecode --type 17
# dmidecode 3.3
Getting SMBIOS data from sysfs.
SMBIOS 3.0.0 present.

Handle 0x003E, DMI type 17, 40 bytes
Memory Device
Array Handle: 0x003C
Error Information Handle: Not Provided
Total Width: 72 bits
Data Width: 72 bits
Size: 32 GB
Form Factor: DIMM
Set: None
Locator: P1_DIMMA1


Memory test
sudo apt-get install sysbench
sysbench memory run


sysbench memory run
sysbench 1.0.20 (using system LuaJIT 2.1.0-beta3)

Running the test with following options:
Number of threads: 1
Initializing random number generator from current time


Running memory speed test with the following options:
block size: 1KiB
total size: 102400MiB
operation: write
scope: global

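Installing the GPUs

Pay attention to how the GPUs are powered.

GPU power calculation: the chassis provides sixteen 8-pin power connectors. Each card needs 3, so three cards use 9. The chassis does not have room for 8 of these cards either; it will probably run out of space at 4 to 6 cards, by which point the power supplies would also be at their limit.

NVIDIA recommends a power supply of at least 850W for the RTX 4090. The GeForce RTX 4090 has a rated TGP of 450W, and its 16-pin power connector is specified for up to 600W.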
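The connector and power limits can be combined into a rough upper bound on the number of cards. This is only a sketch under stated assumptions: 450W per card (the RTX 4090's rated TGP), 3200W of usable supply (the 2+2 redundant configuration mentioned earlier), and a guessed 600W for the CPUs, disks, and fans. The 600W base load is a hypothetical figure, not a measurement, and physical slot space is not modeled.

```python
def max_gpus(connectors_available, connectors_per_gpu,
             psu_watts_usable, base_system_watts, gpu_watts):
    """Upper bound on GPU count from power connectors and PSU capacity."""
    by_connectors = connectors_available // connectors_per_gpu
    by_power = (psu_watts_usable - base_system_watts) // gpu_watts
    return min(by_connectors, by_power)


limit = max_gpus(
    connectors_available=16,   # 8-pin connectors in the chassis
    connectors_per_gpu=3,      # each card needs three
    psu_watts_usable=3200,     # 4 x 1600W with 2+2 redundancy
    base_system_watts=600,     # ASSUMPTION: CPUs, disks, fans
    gpu_watts=450,             # ASSUMPTION: rated TGP per card
)
print("connectors used by 3 cards:", 3 * 3)
print("estimated maximum cards:", limit)
```

Under these assumptions both limits land at 5 cards, which agrees with the estimate that the chassis tops out somewhere between 4 and 6.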

CUDA Driver Installation

  1. Confirm the system information

    cat /etc/*release
    DISTRIB_ID=Ubuntu
    DISTRIB_RELEASE=22.04
    DISTRIB_CODENAME=jammy
    DISTRIB_DESCRIPTION="Ubuntu 22.04.2 LTS"
    PRETTY_NAME="Ubuntu 22.04.2 LTS"
    NAME="Ubuntu"
    VERSION_ID="22.04"
    VERSION="22.04.2 LTS (Jammy Jellyfish)"
    VERSION_CODENAME=jammy

  2. Install the dependencies needed before installing the driver

Install common build tools and dependencies on Ubuntu
sudo apt update
sudo apt install build-essential
sudo apt install libssl-dev zlib1g-dev libbz2-dev libreadline-dev libsqlite3-dev
sudo apt install libncurses5-dev libncursesw5-dev xz-utils tk-dev libffi-dev
gcc --version
make --version

  3. Run the installer: sudo sh cuda_12.1.0_530.30.02_linux.run

    cat /var/log/nvidia-installer.log

    The driver installation failed with the following error:
    warning: the compiler differs from the one used to build the kernel
    The kernel was built by: x86_64-linux-gnu-gcc-12 (Ubuntu 12.3.0-1ubuntu1~22.04) 12.3.0
    You are using: cc (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0
    Warning: Compiler version check failed: The major and minor number of the compiler used to compile the kernel: x86_64-linux-gnu-gcc-12 (Ubuntu 12.3.0-1ubuntu1~22.04) 12.3.0, GNU ld (GNU Binutils for Ubuntu) 2.38 does not match the compiler used here:

This happens because the compiler used while installing CUDA does not match the GCC version that built the kernel.
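The check the installer performs can be illustrated with the two compiler banners from the log. This is a sketch, not the installer's actual logic: `compiler_version` and `compilers_match` are hypothetical helpers, and the regular expression assumes the Ubuntu banner layout shown above, where the version follows the parenthesised distribution tag.

```python
import re

# Compiler banners copied from the installer log above.
KERNEL_COMPILER = "x86_64-linux-gnu-gcc-12 (Ubuntu 12.3.0-1ubuntu1~22.04) 12.3.0"
CURRENT_COMPILER = "cc (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0"


def compiler_version(banner):
    """Return (major, minor) from the version after the distro tag."""
    match = re.search(r"\)\s+(\d+)\.(\d+)\.\d+", banner)
    if match is None:
        raise ValueError(f"no compiler version found in: {banner!r}")
    return int(match.group(1)), int(match.group(2))


def compilers_match(kernel_banner, current_banner):
    """The log compares major and minor numbers, so compare both."""
    return compiler_version(kernel_banner) == compiler_version(current_banner)


print("kernel built with:", compiler_version(KERNEL_COMPILER))
print("default compiler: ", compiler_version(CURRENT_COMPILER))
print("match:", compilers_match(KERNEL_COMPILER, CURRENT_COMPILER))
```

The kernel was built with GCC 12.3 while the default compiler is 11.4, so the fix is to make GCC 12 the default before rerunning the installer.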

  4. Upgrade gcc

    sudo apt update
    sudo apt install gcc-12 g++-12

    Switch the default GCC version

    sudo update-alternatives --install /usr/bin/gcc gcc /usr/bin/gcc-11 10
    sudo update-alternatives --install /usr/bin/gcc gcc /usr/bin/gcc-12 20
    sudo update-alternatives --config gcc

    gcc --version
    gcc (Ubuntu 12.3.0-1ubuntu1~22.04) 12.3.0
    Copyright (C) 2022 Free Software Foundation, Inc.
    This is free software; see the source for copying conditions. There is NO
    warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE

Running the installer again produced another error, this time because the running kernel does not match what the driver supports:

make[3]: *** [scripts/Makefile.build:243: /tmp/selfgz11096/NVIDIA-Linux-x86_64-530.30.02/kernel/nvidia/i2c_nvswitch.o] Error 1
make[3]: Target '/tmp/selfgz11096/NVIDIA-Linux-x86_64-530.30.02/kernel/' not remade because of errors.
make[2]: *** [/usr/src/linux-headers-6.8.0-40-generic/Makefile:1926: /tmp/selfgz11096/NVIDIA-Linux-x86_64-530.30.02/kernel] Error 2
make[2]: Target 'modules' not remade because of errors.
make[1]: *** [Makefile:240: __sub-make] Error 2
make[1]: Target 'modules' not remade because of errors.
make[1]: Leaving directory '/usr/src/linux-headers-6.8.0-40-generic'
make: *** [Makefile:82: modules] Error 2
ERROR: The nvidia kernel module was not created.

Check the official documentation for the supported kernel and other dependencies, then install the matching kernel.

Ubuntu 22.04.z (z <= 4) LTS: 6.5.0-27

List the available kernel versions

apt list --all-versions linux-image-generic

Install a specific kernel version

sudo apt update
sudo apt install linux-image-6.5.0-27-generic linux-headers-6.5.0-27-generic

Inspect /boot/grub/grub.cfg and count the submenu and menuentry items to find the position of kernel 6.5.0-27.

cat /boot/grub/grub.cfg
#
# DO NOT EDIT THIS FILE
#
# It is automatically generated by grub-mkconfig using templates
# from /etc/grub.d and settings from /etc/default/grub
#

### BEGIN /etc/grub.d/00_header ###
if [ -s $prefix/grubenv ]; then
set have_grubenv=true
load_env
fi
if [ "${initrdfail}" = 2 ]; then
set initrdfail=
elif [ "${initrdfail}" = 1 ]; then
set next_entry="${prev_entry}"
set prev_entry=
save_env prev_entry
if [ "${next_entry}" ]; then
set initrdfail=2
fi
fi
if [ "${next_entry}" ] ; then
set default="${next_entry}"
set next_entry=
save_env next_entry
set boot_once=true
else
set default="1>2"
fi

if [ x"${feature_menuentry_id}" = xy ]; then
menuentry_id_option="--id"
else
menuentry_id_option=""
fi

export menuentry_id_option

if [ "${prev_saved_entry}" ]; then
set saved_entry="${prev_saved_entry}"
save_env saved_entry
set prev_saved_entry=
save_env prev_saved_entry
set boot_once=true
fi

function savedefault {
if [ -z "${boot_once}" ]; then
saved_entry="${chosen}"
save_env saved_entry
fi
}
function initrdfail {
if [ -n "${have_grubenv}" ]; then if [ -n "${partuuid}" ]; then
if [ -z "${initrdfail}" ]; then
set initrdfail=1
if [ -n "${boot_once}" ]; then
set prev_entry="${default}"
save_env prev_entry
fi
fi
save_env initrdfail
fi; fi
}
function recordfail {
set recordfail=1
if [ -n "${have_grubenv}" ]; then if [ -z "${boot_once}" ]; then save_env recordfail; fi; fi
}
function load_video {
if [ x$feature_all_video_module = xy ]; then
insmod all_video
else
insmod efi_gop
insmod efi_uga
insmod ieee1275_fb
insmod vbe
insmod vga
insmod video_bochs
insmod video_cirrus
fi
}

if [ x$feature_default_font_path = xy ] ; then
font=unicode
else
insmod part_gpt
insmod ext2
set root='hd0,gpt2'
if [ x$feature_platform_search_hint = xy ]; then
search --no-floppy --fs-uuid --set=root --hint-bios=hd0,gpt2 --hint-efi=hd0,gpt2 --hint-baremetal=ahci0,gpt2 670c96ba-e335-4331-914c-c6e7306b2e5f
else
search --no-floppy --fs-uuid --set=root 670c96ba-e335-4331-914c-c6e7306b2e5f
fi
font="/usr/share/grub/unicode.pf2"
fi

if loadfont $font ; then
set gfxmode=auto
load_video
insmod gfxterm
set locale_dir=$prefix/locale
set lang=zh_CN
insmod gettext
fi
terminal_output gfxterm
if [ "${recordfail}" = 1 ] ; then
set timeout=30
else
if [ x$feature_timeout_style = xy ] ; then
set timeout_style=hidden
set timeout=0
# Fallback hidden-timeout code in case the timeout_style feature is
# unavailable.
elif sleep --interruptible 0 ; then
set timeout=0
fi
fi
### END /etc/grub.d/00_header ###

### BEGIN /etc/grub.d/05_debian_theme ###
set menu_color_normal=white/black
set menu_color_highlight=black/light-gray
### END /etc/grub.d/05_debian_theme ###

### BEGIN /etc/grub.d/10_linux ###
function gfxmode {
set gfxpayload="${1}"
if [ "${1}" = "keep" ]; then
set vt_handoff=vt.handoff=7
else
set vt_handoff=
fi
}
if [ "${recordfail}" != 1 ]; then
if [ -e ${prefix}/gfxblacklist.txt ]; then
if [ ${grub_platform} != pc ]; then
set linux_gfx_mode=keep
elif hwmatch ${prefix}/gfxblacklist.txt 3; then
if [ ${match} = 0 ]; then
set linux_gfx_mode=keep
else
set linux_gfx_mode=text
fi
else
set linux_gfx_mode=text
fi
else
set linux_gfx_mode=keep
fi
else
set linux_gfx_mode=text
fi
export linux_gfx_mode
menuentry 'Ubuntu' --class ubuntu --class gnu-linux --class gnu --class os $menuentry_id_option 'gnulinux-simple-670c96ba-e335-4331-914c-c6e7306b2e5f' {
recordfail
load_video
gfxmode $linux_gfx_mode
insmod gzio
if [ x$grub_platform = xxen ]; then insmod xzio; insmod lzopio; fi
insmod part_gpt
insmod ext2
set root='hd0,gpt2'
if [ x$feature_platform_search_hint = xy ]; then
search --no-floppy --fs-uuid --set=root --hint-bios=hd0,gpt2 --hint-efi=hd0,gpt2 --hint-baremetal=ahci0,gpt2 670c96ba-e335-4331-914c-c6e7306b2e5f
else
search --no-floppy --fs-uuid --set=root 670c96ba-e335-4331-914c-c6e7306b2e5f
fi
linux /boot/vmlinuz-6.8.0-40-generic root=UUID=670c96ba-e335-4331-914c-c6e7306b2e5f ro quiet splash $vt_handoff
initrd /boot/initrd.img-6.8.0-40-generic
}
submenu 'Advanced options for Ubuntu' $menuentry_id_option 'gnulinux-advanced-670c96ba-e335-4331-914c-c6e7306b2e5f' {
menuentry 'Ubuntu, with Linux 6.8.0-40-generic' --class ubuntu --class gnu-linux --class gnu --class os $menuentry_id_option 'gnulinux-6.8.0-40-generic-advanced-670c96ba-e335-4331-914c-c6e7306b2e5f' {
recordfail
load_video
gfxmode $linux_gfx_mode
insmod gzio
if [ x$grub_platform = xxen ]; then insmod xzio; insmod lzopio; fi
insmod part_gpt
insmod ext2
set root='hd0,gpt2'
if [ x$feature_platform_search_hint = xy ]; then
search --no-floppy --fs-uuid --set=root --hint-bios=hd0,gpt2 --hint-efi=hd0,gpt2 --hint-baremetal=ahci0,gpt2 670c96ba-e335-4331-914c-c6e7306b2e5f
else
search --no-floppy --fs-uuid --set=root 670c96ba-e335-4331-914c-c6e7306b2e5f
fi
echo 'Loading Linux 6.8.0-40-generic ...'
linux /boot/vmlinuz-6.8.0-40-generic root=UUID=670c96ba-e335-4331-914c-c6e7306b2e5f ro quiet splash $vt_handoff
echo 'Loading initial ramdisk ...'
initrd /boot/initrd.img-6.8.0-40-generic
}
menuentry 'Ubuntu, with Linux 6.8.0-40-generic (recovery mode)' --class ubuntu --class gnu-linux --class gnu --class os $menuentry_id_option 'gnulinux-6.8.0-40-generic-recovery-670c96ba-e335-4331-914c-c6e7306b2e5f' {
recordfail
load_video
insmod gzio
if [ x$grub_platform = xxen ]; then insmod xzio; insmod lzopio; fi
insmod part_gpt
insmod ext2
set root='hd0,gpt2'
if [ x$feature_platform_search_hint = xy ]; then
search --no-floppy --fs-uuid --set=root --hint-bios=hd0,gpt2 --hint-efi=hd0,gpt2 --hint-baremetal=ahci0,gpt2 670c96ba-e335-4331-914c-c6e7306b2e5f
else
search --no-floppy --fs-uuid --set=root 670c96ba-e335-4331-914c-c6e7306b2e5f
fi
echo 'Loading Linux 6.8.0-40-generic ...'
linux /boot/vmlinuz-6.8.0-40-generic root=UUID=670c96ba-e335-4331-914c-c6e7306b2e5f ro recovery nomodeset dis_ucode_ldr
echo 'Loading initial ramdisk ...'
initrd /boot/initrd.img-6.8.0-40-generic
}
menuentry 'Ubuntu, with Linux 6.5.0-27-generic' --class ubuntu --class gnu-linux --class gnu --class os $menuentry_id_option 'gnulinux-6.5.0-27-generic-advanced-670c96ba-e335-4331-914c-c6e7306b2e5f' {
recordfail
load_video
gfxmode $linux_gfx_mode
insmod gzio
if [ x$grub_platform = xxen ]; then insmod xzio; insmod lzopio; fi
insmod part_gpt
insmod ext2
set root='hd0,gpt2'
if [ x$feature_platform_search_hint = xy ]; then
search --no-floppy --fs-uuid --set=root --hint-bios=hd0,gpt2 --hint-efi=hd0,gpt2 --hint-baremetal=ahci0,gpt2 670c96ba-e335-4331-914c-c6e7306b2e5f
else
search --no-floppy --fs-uuid --set=root 670c96ba-e335-4331-914c-c6e7306b2e5f
fi
echo 'Loading Linux 6.5.0-27-generic ...'
linux /boot/vmlinuz-6.5.0-27-generic root=UUID=670c96ba-e335-4331-914c-c6e7306b2e5f ro quiet splash $vt_handoff
echo 'Loading initial ramdisk ...'
initrd /boot/initrd.img-6.5.0-27-generic
}
menuentry 'Ubuntu, with Linux 6.5.0-27-generic (recovery mode)' --class ubuntu --class gnu-linux --class gnu --class os $menuentry_id_option 'gnulinux-6.5.0-27-generic-recovery-670c96ba-e335-4331-914c-c6e7306b2e5f' {
recordfail
load_video
insmod gzio
if [ x$grub_platform = xxen ]; then insmod xzio; insmod lzopio; fi
insmod part_gpt
insmod ext2
set root='hd0,gpt2'
if [ x$feature_platform_search_hint = xy ]; then
search --no-floppy --fs-uuid --set=root --hint-bios=hd0,gpt2 --hint-efi=hd0,gpt2 --hint-baremetal=ahci0,gpt2 670c96ba-e335-4331-914c-c6e7306b2e5f
else
search --no-floppy --fs-uuid --set=root 670c96ba-e335-4331-914c-c6e7306b2e5f
fi
echo 'Loading Linux 6.5.0-27-generic ...'
linux /boot/vmlinuz-6.5.0-27-generic root=UUID=670c96ba-e335-4331-914c-c6e7306b2e5f ro recovery nomodeset dis_ucode_ldr
echo 'Loading initial ramdisk ...'
initrd /boot/initrd.img-6.5.0-27-generic
}
}

### END /etc/grub.d/10_linux ###

### BEGIN /etc/grub.d/10_linux_zfs ###
### END /etc/grub.d/10_linux_zfs ###

### BEGIN /etc/grub.d/20_linux_xen ###

### END /etc/grub.d/20_linux_xen ###

### BEGIN /etc/grub.d/20_memtest86+ ###
### END /etc/grub.d/20_memtest86+ ###

### BEGIN /etc/grub.d/30_os-prober ###
### END /etc/grub.d/30_os-prober ###

### BEGIN /etc/grub.d/30_uefi-firmware ###
menuentry 'UEFI Firmware Settings' $menuentry_id_option 'uefi-firmware' {
fwsetup
}
### END /etc/grub.d/30_uefi-firmware ###

### BEGIN /etc/grub.d/35_fwupd ###
### END /etc/grub.d/35_fwupd ###

### BEGIN /etc/grub.d/40_custom ###
# This file provides an easy way to add custom menu entries. Simply type the
# menu entries you want to add after this comment. Be careful not to change
# the 'exec tail' line above.
### END /etc/grub.d/40_custom ###

### BEGIN /etc/grub.d/41_custom ###
if [ -f ${config_directory}/custom.cfg ]; then
source ${config_directory}/custom.cfg
elif [ -z "${config_directory}" -a -f $prefix/custom.cfg ]; then
source $prefix/custom.cfg
fi
### END /etc/grub.d/41_custom ###

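Edit /etc/default/grub (sudo vim /etc/default/grub). Kernel 6.5.0-27 is the 3rd menuentry inside the submenu, and the submenu is the 2nd top-level entry. GRUB counts from 0, so set:

GRUB_DEFAULT="1>2"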
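Counting entries by hand is easy to get wrong, so the counting rule can be sketched in code. The function below is a minimal illustration, not a full grub.cfg parser: `grub_default_for` is a hypothetical helper, it assumes titles are single-quoted as in the generated file, and it tracks nesting only by counting braces. The sample text is a trimmed stand-in for the real file.

```python
import re

ENTRY_RE = re.compile(r"^\s*(menuentry|submenu)\s+'([^']*)'")

SAMPLE_CFG = """\
menuentry 'Ubuntu' --class ubuntu {
    linux /boot/vmlinuz-6.8.0-40-generic
}
submenu 'Advanced options for Ubuntu' {
    menuentry 'Ubuntu, with Linux 6.8.0-40-generic' {
    }
    menuentry 'Ubuntu, with Linux 6.8.0-40-generic (recovery mode)' {
    }
    menuentry 'Ubuntu, with Linux 6.5.0-27-generic' {
    }
}
"""


def grub_default_for(cfg_text, title):
    """Return "N" or "N>M" (0-based) for the menuentry with this title."""
    depth = 0        # current brace nesting depth
    top_index = -1   # index of the latest top-level entry
    sub_index = -1   # index of the latest entry inside the current submenu
    for line in cfg_text.splitlines():
        match = ENTRY_RE.match(line)
        if match:
            kind, name = match.groups()
            if depth == 0:
                top_index += 1
                sub_index = -1
                if kind == "menuentry" and name == title:
                    return str(top_index)
            elif depth == 1 and kind == "menuentry":
                sub_index += 1
                if name == title:
                    return f"{top_index}>{sub_index}"
        depth += line.count("{") - line.count("}")
    raise LookupError(f"menuentry not found: {title!r}")


print(grub_default_for(SAMPLE_CFG, "Ubuntu, with Linux 6.5.0-27-generic"))
```

GRUB_DEFAULT also accepts entry titles joined with `>`, which avoids depending on position when kernels are added or removed later.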

After making the change, update grub

sudo update-grub

After rebooting, check the running kernel

uname -r
6.5.0-27-generic
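The same check can be scripted before rerunning the installer. This sketch compares the running kernel against the 6.5.0-27 value quoted from the documentation; `kernel_base` and `is_supported` are hypothetical helper names, and the comparison is a plain string match on the version and ABI number.

```python
import platform

SUPPORTED_KERNEL = "6.5.0-27"  # value quoted from the documentation above


def kernel_base(release):
    """Drop the flavour suffix: '6.5.0-27-generic' -> '6.5.0-27'."""
    return "-".join(release.split("-")[:2])


def is_supported(release, supported=SUPPORTED_KERNEL):
    """True when the release matches the supported version and ABI."""
    return kernel_base(release) == supported


running = platform.release()  # same string that `uname -r` prints
print(running, "supported" if is_supported(running) else "not supported")
```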

Reinstalling the driver still failed. This time it was a compatibility problem: the 530.30.02 driver source calls get_user_pages with more arguments than the running kernel's headers accept.

Error when installing CUDA on Ubuntu:
/tmp/selfgz2486/NVIDIA-Linux-x86_64-530.30.02/kernel/common/inc/nv-mm.h:88:60: warning: passing argument 4 of ‘get_user_pages’ makes pointer from integer without a cast [-Wint-conversion]
88 | return get_user_pages(current, current->mm, start, nr_pages, write,
| ^~~~~~~~
| |
| long unsigned int
./include/linux/mm.h:2431:59: note: expected ‘struct page **’ but argument is of type ‘long unsigned int’
2431 | unsigned int gup_flags, struct page **pages);
| ~~~~~~~~~~~~~~^~~~~
/tmp/selfgz2486/NVIDIA-Linux-x86_64-530.30.02/kernel/common/inc/nv-mm.h:88:16: error: too many arguments to function ‘get_user_pages’
88 | return get_user_pages(current, current->mm, start, nr_pages, write,

Download the latest driver installer

wget https://developer.download.nvidia.com/compute/cuda/12.6.0/local_installers/cuda_12.6.0_560.28.03_linux.run

sudo sh cuda_12.6.0_560.28.03_linux.run
The installation succeeded

Will install libglvnd libraries.
Will install libEGL vendor library config file to /usr/share/glvnd/egl_vendor.d
-> Searching for conflicting files:
-> done.
-> Installing 'NVIDIA Accelerated Graphics Driver for Linux-x86_64' (560.28.03):
-> No NVIDIA modules detected in the initramfs.
-> The initramfs will not be rebuild.
executing: '/usr/sbin/ldconfig'...
executing: '/usr/sbin/depmod -a '...
executing: '/usr/bin/systemctl daemon-reload'...
-> done.
-> Driver file installation is complete.
-> Running post-install sanity check:
-> done.
-> Post-install sanity check passed.
-> Would you like to run the nvidia-xconfig utility to automatically update your X configuration file so that the NVIDIA X driver will be used when you restart X? Any pre-existing X configuration file will be backed up. (Answer: No)
-> Installation of the NVIDIA Accelerated Graphics Driver for Linux-x86_64 (version: 560.28.03) is now complete. Please update your xorg.conf file as appropriate; see the file /usr/share/doc/NVIDIA_GLX-1.0/README.txt for details.

Check the driver with nvidia-smi

/usr/local/cuda/bin/nvcc -V
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2024 NVIDIA Corporation
Built on Fri_Jun_14_16:34:21_PDT_2024
Cuda compilation tools, release 12.6, V12.6.20
Build cuda_12.6.r12.6/compiler.34431801_0

Install cuDNN, the deep learning acceleration library

wget https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2204/x86_64/cuda-keyring_1.1-1_all.deb

sudo dpkg -i cuda-keyring_1.1-1_all.deb

sudo apt-get update

sudo apt-get -y install cudnn9-cuda-12

Checking multiple GPUs

GPU test

python -c 'import torch; print(torch.cuda.is_available()); print(torch.cuda.device_count())'
True
3
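The device count can also be confirmed without PyTorch by querying nvidia-smi directly. The sketch below relies on the `--query-gpu` and `--format=csv,noheader` options of nvidia-smi; `parse_gpu_list` and `list_gpus` are illustrative helper names, and the parser assumes GPU names contain no commas.

```python
import shutil
import subprocess

QUERY = [
    "nvidia-smi",
    "--query-gpu=index,name,memory.total",
    "--format=csv,noheader",
]


def parse_gpu_list(csv_text):
    """Parse 'index, name, memory' lines into a list of dicts."""
    gpus = []
    for line in csv_text.strip().splitlines():
        index, name, memory = (field.strip() for field in line.split(","))
        gpus.append({"index": int(index), "name": name, "memory": memory})
    return gpus


def list_gpus():
    """Return the GPUs nvidia-smi reports, or [] if it is not installed."""
    if shutil.which("nvidia-smi") is None:
        return []
    result = subprocess.run(QUERY, capture_output=True, text=True, check=True)
    return parse_gpu_list(result.stdout)


if __name__ == "__main__":
    for gpu in list_gpus():
        print(gpu["index"], gpu["name"], gpu["memory"])
```

On this server the list should have three entries, matching the count that PyTorch prints.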

Training test

cat train.py
import os
import torch
os.environ["HF_ENDPOINT"] = "https://hf-mirror.com"
from transformers import AutoTokenizer, AutoModelForSequenceClassification, Trainer, TrainingArguments
from datasets import load_dataset

# Check whether a GPU is available
device = "cuda" if torch.cuda.is_available() else "cpu"
print(f"Using device: {device}")

# Load the model and tokenizer
model_name = "distilbert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=2).to(device)

# Load the dataset
dataset = load_dataset("imdb", split="train[:1%]")  # load only 1% of the data for a quick test
tokenized_dataset = dataset.map(lambda x: tokenizer(x['text'], padding="max_length", truncation=True, max_length=128), batched=True)

# Set the training arguments
training_args = TrainingArguments(
    output_dir="./results",
    evaluation_strategy="no",
    per_device_train_batch_size=16,
    num_train_epochs=1,  # a single epoch keeps the test fast
    logging_dir="./logs",
    logging_steps=10,
    no_cuda=False,  # CUDA is used by default when it is available
)

# Create the Trainer instance
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_dataset,
)

# Train the model
trainer.train()

# Report which device the training ran on
print(f"Training completed on {device}")

python train.py
Using device: cuda
/home/wac/johnson/anaconda3/envs/gpt/lib/python3.10/site-packages/transformers/tokenization_utils_base.py:1601: FutureWarning: `clean_up_tokenization_spaces` was not set. It will be set to `True` by default. This behavior will be depracted in transformers v4.45, and will be then set to `False` by default. For more details check this issue: https://github.com/huggingface/transformers/issues/31884
warnings.warn(
Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight', 'pre_classifier.bias', 'pre_classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
/home/wac/johnson/anaconda3/envs/gpt/lib/python3.10/site-packages/transformers/training_args.py:1525: FutureWarning: `evaluation_strategy` is deprecated and will be removed in version 4.46 of 🤗 Transformers. Use `eval_strategy` instead
warnings.warn(
0%| | 0/6 [00:00<?, ?it/s]/home/wac/johnson/anaconda3/envs/gpt/lib/python3.10/site-packages/torch/nn/parallel/parallel_apply.py:79: FutureWarning: `torch.cuda.amp.autocast(args...)` is deprecated. Please use `torch.amp.autocast('cuda', args...)` instead.
with torch.cuda.device(device), torch.cuda.stream(stream), autocast(enabled=autocast_enabled):
/home/wac/johnson/anaconda3/envs/gpt/lib/python3.10/site-packages/torch/nn/parallel/_functions.py:68: UserWarning: Was asked to gather along dimension 0, but all input tensors were scalars; will instead unsqueeze and return a vector.
warnings.warn('Was asked to gather along dimension 0, but all '
{'train_runtime': 6.5146, 'train_samples_per_second': 38.375, 'train_steps_per_second': 0.921, 'train_loss': 0.3995473384857178, 'epoch': 1.0}
100%|██████████████████████████████████████████████████████████████████████████████████████████████| 6/6 [00:06<00:00, 1.09s/it]
Training completed on cuda
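The numbers in the log line up with the three GPUs being used together. The sketch below reproduces them, assuming the Trainer split each step across all 3 visible GPUs (the `torch/nn/parallel` warnings in the log suggest DataParallel) and that 1% of the 25,000-example IMDB training split is 250 samples. `steps_per_epoch` is a hypothetical helper.

```python
import math

TRAIN_RUNTIME_S = 6.5146   # 'train_runtime' from the log above


def steps_per_epoch(num_samples, per_device_batch, num_gpus):
    """One optimizer step consumes per_device_batch samples on every GPU."""
    return math.ceil(num_samples / (per_device_batch * num_gpus))


num_samples = 250  # 1% of the 25,000-example IMDB training split
steps = steps_per_epoch(num_samples, per_device_batch=16, num_gpus=3)
samples_per_second = num_samples / TRAIN_RUNTIME_S
steps_per_second = steps / TRAIN_RUNTIME_S

print(f"steps: {steps}")
print(f"samples/s: {samples_per_second:.3f}")
print(f"steps/s: {steps_per_second:.3f}")
```

An effective batch of 48 gives 6 steps, 38.375 samples per second, and 0.921 steps per second, the same values the Trainer reported. A run this short mostly measures startup overhead, so it confirms the setup works but is not a benchmark.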

References

CUDA Installation Guide for Linux (nvidia.com)

CUDA Toolkit 12.6 Downloads | NVIDIA Developer

Installing cuDNN on Linux | NVIDIA cuDNN v9.3.0 documentation


Building a Machine Learning Server
https://johnson7788.github.io/2024/08/26/%E6%90%AD%E5%BB%BA1%E4%B8%AA%E6%9C%BA%E5%99%A8%E5%AD%A6%E4%B9%A0%E6%9C%8D%E5%8A%A1%E5%99%A8/
Author: Johnson
Published: August 26, 2024