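I upgraded the GPU in my work desktop and ended up with a spare card, so I'm going to try attaching it to my server. It's a 1050 Ti, an SFF card that doesn't even need auxiliary power, so it should be good enough (?) for a server.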
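It's not high-end, and its Pascal architecture is getting old, but running internet-dependent tests in my company's work environment isn't easy, so it should make a handy remote environment for lightweight testing. Docker seems like the better route, so instead of cluttering the host with a native CUDA toolkit install, I'll set things up only as far as the NVIDIA container runtime (nvidia-docker).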
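It's been a really long time since I last set up a GPU server, but the documentation is good, and even without an environment tailored to the specific card, the package manager handles most of it.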
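Installing the NVIDIA graphics driver (using dnf)
As long as the machine has internet access, the driver can be installed quickly with dnf. First register the Extra Packages (EPEL) repo, then add NVIDIA's CUDA repo: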
# sudo dnf install epel-release
Installed:
epel-release-8-18.el8.noarch
Complete!
# sudo dnf config-manager --add-repo https://developer.download.nvidia.com/compute/cuda/repos/rhel8/x86_64/cuda-rhel8.repo
Adding repo from: https://developer.download.nvidia.com/compute/cuda/repos/rhel8/x86_64/cuda-rhel8.repo
I tried to install the kernel development packages, but that didn't work right away, so I moved on to the next step for now:
# uname -r
4.18.0-372.26.1.el8_6.x86_64
# sudo dnf install kernel-devel-$(uname -r) kernel-headers-$(uname -r)
cuda-rhel8-x86_64 3.8 MB/s | 2.4 MB 00:00
Extra Packages for Enterprise Linux 8 - x86_64 297 kB/s | 16 MB 00:53
No match for argument: kernel-devel-4.18.0-372.26.1.el8_6.x86_64
No match for argument: kernel-headers-4.18.0-372.26.1.el8_6.x86_64
Error: Unable to find a match: kernel-devel-4.18.0-372.26.1.el8_6.x86_64 kernel-headers-4.18.0-372.26.1.el8_6.x86_64
The next step happened to pull in a newer kernel-core version:
# sudo dnf install nvidia-driver nvidia-settings
Upgraded:
libwayland-client-1.21.0-1.el8.x86_64 llvm-libs-15.0.7-1.module+el8.8.0+1144+0a4e73bd.x86_64 mesa-dri-drivers-22.3.0-2.el8.x86_64
mesa-filesystem-22.3.0-2.el8.x86_64 mesa-libxatracker-22.3.0-2.el8.x86_64
Installed:
dnf-plugin-nvidia-2.0-1.el8.noarch egl-wayland-1.1.9-3.el8.x86_64
kernel-core-4.18.0-477.15.1.el8_8.x86_64 kmod-nvidia-535.54.03-4.18.0-477.15.1-3:535.54.03-3.el8_8.x86_64
libvdpau-1.4-2.el8.x86_64 mesa-vulkan-drivers-22.3.0-2.el8.x86_64
nvidia-driver-3:535.54.03-1.el8.x86_64 nvidia-driver-libs-3:535.54.03-1.el8.x86_64
nvidia-kmod-common-3:535.54.03-1.el8.noarch nvidia-libXNVCtrl-3:535.54.03-1.el8.x86_64
nvidia-settings-3:535.54.03-1.el8.x86_64 vulkan-loader-1.3.239.0-1.el8.x86_64
Complete!
Trying the kernel development packages again, they installed without problems.
# sudo dnf install -y kernel-devel-$(uname -r) kernel-headers-$(uname -r)
Installed:
bison-3.0.4-10.el8.x86_64 flex-2.6.1-9.el8.x86_64 kernel-devel-4.18.0-477.15.1.el8_8.x86_64 kernel-headers-4.18.0-477.15.1.el8_8.x86_64 m4-1.4.18-7.el8.x86_64 make-1:4.2.1-11.el8.x86_64
Complete!
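The first attempt failed because `$(uname -r)` expands to the kernel that is currently running, not the newest one installed. As a rough sketch, a check like the following can tell whether the two differ. The function names `newest_release` and `reboot_pending` are hypothetical helpers of my own; only `uname`, `rpm`, and `sort` are real tools.

```shell
#!/usr/bin/env bash
# Hypothetical helpers; the release strings are passed in as arguments so the
# logic can be tried without touching the real package database.

# Print the highest version from a newline-separated list on stdin.
newest_release() {
    sort -V | tail -n 1
}

# Succeed (exit 0) when the running kernel differs from the newest installed one.
reboot_pending() {
    local running="$1" newest="$2"
    [ "$running" != "$newest" ]
}

# Usage on the host (not run here):
#   running="$(uname -r)"
#   newest="$(rpm -q --qf '%{VERSION}-%{RELEASE}.%{ARCH}\n' kernel-core | newest_release)"
#   reboot_pending "$running" "$newest" && echo "running $running, newest installed $newest"
```

If the check reports a difference, `kernel-devel-$(uname -r)` will keep resolving to the old release until the machine boots into the new kernel.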
Then finish installing the driver:
# sudo dnf install nvidia-driver-cuda
Installed:
cuda-license-10-1-10.1.243-1.x86_64 cuda-nvml-dev-10-1-10.1.243-1.x86_64
nvidia-driver-cuda-3:535.54.03-1.el8.x86_64 nvidia-driver-cuda-libs-3:535.54.03-1.el8.x86_64
nvidia-persistenced-3:535.54.03-1.el8.x86_64 ocl-icd-2.2.12-1.el8.x86_64
opencl-filesystem-1.0-6.el8.noarch
Complete!
# sudo dnf install cuda-drivers
Installed:
cuda-drivers-535.54.03-1.x86_64 libX11-devel-1.6.8-5.el8.x86_64
libXau-devel-1.0.9-3.el8.x86_64 libxcb-devel-1.13.1-1.el8.x86_64
nvidia-driver-NVML-3:535.54.03-1.el8.x86_64 nvidia-driver-NvFBCOpenGL-3:535.54.03-1.el8.x86_64
nvidia-driver-devel-3:535.54.03-1.el8.x86_64 nvidia-libXNVCtrl-devel-3:535.54.03-1.el8.x86_64
nvidia-modprobe-3:535.54.03-1.el8.x86_64 nvidia-xconfig-3:535.54.03-1.el8.x86_64
xorg-x11-proto-devel-2020.1-3.el8.noarch
Complete!
Running nvidia-smi right after installing the driver fails with a communication error:
# nvidia-smi
NVIDIA-SMI has failed because it couldn't communicate with the NVIDIA driver. Make sure that the latest NVIDIA driver is installed and running.
After a reboot it works normally. A small but precious 4 GB of graphics memory...
# sudo shutdown -r now
# nvidia-smi
Sat Jul 22 12:04:23 2023
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.54.03 Driver Version: 535.54.03 CUDA Version: 12.2 |
|-----------------------------------------+----------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+======================+======================|
| 0 NVIDIA GeForce GTX 1050 Ti Off | 00000000:01:00.0 On | N/A |
| 40% 37C P8 N/A / 75W | 246MiB / 4096MiB | 0% Default |
| | | N/A |
+-----------------------------------------+----------------------+----------------------+
+---------------------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=======================================================================================|
| 0 N/A N/A 6190 G /usr/libexec/Xorg 71MiB |
| 0 N/A N/A 6300 G /usr/bin/gnome-shell 171MiB |
+---------------------------------------------------------------------------------------+
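The communication error before the reboot most likely just means the NVIDIA kernel module wasn't loaded yet. A minimal sketch for checking that without going through nvidia-smi is shown here. The function name `nvidia_driver_loaded` is a hypothetical helper; the `/proc/driver/nvidia/version` file is what the loaded driver exposes.

```shell
#!/usr/bin/env bash
# Succeeds when the NVIDIA kernel module exposes its version file.
# The path defaults to the real procfs location; it is a parameter only so the
# check can be exercised against a temporary file.
nvidia_driver_loaded() {
    local version_file="${1:-/proc/driver/nvidia/version}"
    [ -r "$version_file" ]
}

# Usage on the host (not run here):
#   if nvidia_driver_loaded; then
#       cat /proc/driver/nvidia/version
#   else
#       echo "nvidia module not loaded yet; a reboot is probably needed"
#   fi
```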
Installing NVIDIA Docker
Following the friendly guide document, first update Docker, containerd, and so on:
# sudo dnf config-manager --add-repo=https://download.docker.com/linux/centos/docker-ce.repo
Adding repo from: https://download.docker.com/linux/centos/docker-ce.repo
# sudo dnf repolist -v
...
Repo-id : docker-ce-stable
Repo-name : Docker CE Stable - x86_64
Repo-revision : 1688760774
Repo-updated : Sat Jul 8 05:12:54 2023
Repo-pkgs : 174
Repo-available-pkgs: 174
Repo-size : 3.2 G
Repo-baseurl : https://download.docker.com/linux/centos/8/x86_64/stable
Repo-expire : 172800 second(s) (last: Sat Jul 22 11:53:10 2023)
Repo-filename : /etc/yum.repos.d/docker-ce.repo
...
# sudo dnf install -y https://download.docker.com/linux/centos/7/x86_64/stable/Packages/containerd.io-1.4.3-3.1.el7.x86_64.
Upgraded:
containerd.io-1.6.21-3.1.el8.x86_64 docker-ce-3:24.0.4-1.el8.x86_64
Complete!
# sudo systemctl --now enable docker
Created symlink /etc/systemd/system/multi-user.target.wants/docker.service → /usr/lib/systemd/system/docker.service.
# sudo docker run --rm hello-world
Unable to find image 'hello-world:latest' locally
latest: Pulling from library/hello-world
719385e32844: Pull complete
Digest: sha256:926fac19d22aa2d60f1a276b66a20eb765fbeea2db5dbdaafeb456ad8ce81598
Status: Downloaded newer image for hello-world:latest
Hello from Docker!
This message shows that your installation appears to be working correctly.
To generate this message, Docker took the following steps:
1. The Docker client contacted the Docker daemon.
2. The Docker daemon pulled the "hello-world" image from the Docker Hub.
(amd64)
3. The Docker daemon created a new container from that image which runs the
executable that produces the output you are currently reading.
4. The Docker daemon streamed that output to the Docker client, which sent it
to your terminal.
To try something more ambitious, you can run an Ubuntu container with:
$ docker run -it ubuntu bash
Share images, automate workflows, and more with a free Docker ID:
https://hub.docker.com/
For more examples and ideas, visit:
https://docs.docker.com/get-started/
Next comes the container toolkit, but on Rocky, running the script from the guide as-is produces an Unsupported distribution error.
# echo $(. /etc/os-release;echo $ID$VERSION_ID)
rocky8.6
# distribution=$(. /etc/os-release;echo $ID$VERSION_ID) && curl -s -L https://nvidia.github.io/libnvidia-container/$distribution/libnvidia-container.repo | sudo tee /etc/yum.repos.d/nvidia-container-toolkit.repo
# Unsupported distribution!
# Check https://nvidia.github.io/libnvidia-container
Changing the minor version, as in rocky8.x, didn't work either, so it looks like Rocky isn't an officially supported distribution.
Instead, I pinned it to rhel8.4 and continued with the remaining steps.
# distribution=rhel8.4 && curl -s -L https://nvidia.github.io/libnvidia-container/$distribution/libnvidia-container.repo | sudo tee /etc/yum.repos.d/nvidia-container-toolkit.repo
[libnvidia-container]
name=libnvidia-container
baseurl=https://nvidia.github.io/libnvidia-container/stable/centos8/$basearch
repo_gpgcheck=1
gpgcheck=0
enabled=1
gpgkey=https://nvidia.github.io/libnvidia-container/gpgkey
sslverify=1
sslcacert=/etc/pki/tls/certs/ca-bundle.crt
[libnvidia-container-experimental]
name=libnvidia-container-experimental
baseurl=https://nvidia.github.io/libnvidia-container/experimental/centos8/$basearch
repo_gpgcheck=1
gpgcheck=0
enabled=0
gpgkey=https://nvidia.github.io/libnvidia-container/gpgkey
sslverify=1
sslcacert=/etc/pki/tls/certs/ca-bundle.crt
# sudo dnf clean expire-cache --refresh
Cache was expired
0 files removed
# sudo dnf install -y nvidia-container-toolkit
Installed:
libnvidia-container-tools-1.13.5-1.x86_64 libnvidia-container1-1.13.5-1.x86_64 nvidia-container-toolkit-1.13.5-1.x86_64
Complete!
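The manual pin to rhel8.4 could also be folded into the distribution lookup so the same one-liner works on both Rocky 8 and distributions the guide already supports. This is only a sketch: `repo_distribution` is a hypothetical function name, and the Rocky 8 to rhel8.4 mapping simply mirrors what worked in this post rather than any official table.

```shell
#!/usr/bin/env bash
# Map an os-release ID/VERSION_ID pair to a libnvidia-container repo name.
# Assumption: Rocky 8.x can reuse the rhel8.4 repo file, as done by hand above.
repo_distribution() {
    local id="$1" version_id="$2"
    case "${id}${version_id}" in
        rocky8*) echo "rhel8.4" ;;                 # pinned fallback for Rocky 8
        *)       echo "${id}${version_id}" ;;      # same as the guide's $ID$VERSION_ID
    esac
}

# Usage on the host (not run here):
#   distribution="$(. /etc/os-release; repo_distribution "$ID" "$VERSION_ID")"
#   curl -s -L "https://nvidia.github.io/libnvidia-container/$distribution/libnvidia-container.repo" \
#       | sudo tee /etc/yum.repos.d/nvidia-container-toolkit.repo
```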
Finally, wrap up the Docker configuration and run a test container:
# sudo nvidia-ctk runtime configure --runtime=docker
INFO[0000] Loading docker config from /etc/docker/daemon.json
INFO[0000] Config file does not exist, creating new one
INFO[0000] Wrote updated config to /etc/docker/daemon.json
INFO[0000] It is recommended that the docker daemon be restarted.
# sudo systemctl restart docker
# sudo docker run --rm --runtime=nvidia --gpus all nvidia/cuda:11.6.2-base-ubuntu20.04 nvidia-smi
Unable to find image 'nvidia/cuda:11.6.2-base-ubuntu20.04' locally
11.6.2-base-ubuntu20.04: Pulling from nvidia/cuda
56e0351b9876: Pull complete
0e353182dfa4: Pull complete
63add13c711b: Pull complete
1210b79751b0: Pull complete
eb1e2ff09225: Pull complete
Digest: sha256:4b0c83c0f2e66dc97b52f28c7acf94c1461bfa746d56a6f63c0fef5035590429
Status: Downloaded newer image for nvidia/cuda:11.6.2-base-ubuntu20.04
Sat Jul 22 03:26:28 2023
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.54.03 Driver Version: 535.54.03 CUDA Version: 12.2 |
|-----------------------------------------+----------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+======================+======================|
| 0 NVIDIA GeForce GTX 1050 Ti Off | 00000000:01:00.0 Off | N/A |
| 40% 32C P8 N/A / 75W | 145MiB / 4096MiB | 0% Default |
| | | N/A |
+-----------------------------------------+----------------------+----------------------+
+---------------------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=======================================================================================|
+---------------------------------------------------------------------------------------+
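If the test container fails instead, one thing worth checking is whether the runtime actually got registered in the Docker daemon config. A rough sketch, assuming `nvidia-ctk runtime configure` wrote a runtimes entry keyed "nvidia" into /etc/docker/daemon.json (the function name `has_nvidia_runtime` is hypothetical, and a plain grep is not a real JSON parser):

```shell
#!/usr/bin/env bash
# Succeeds when the given Docker daemon config mentions an "nvidia" entry.
# The path defaults to the file nvidia-ctk reported writing; it is a parameter
# only so the check can be tried against a sample file.
has_nvidia_runtime() {
    local config="${1:-/etc/docker/daemon.json}"
    grep -q '"nvidia"' "$config" 2>/dev/null
}

# Usage on the host (not run here):
#   has_nvidia_runtime || echo "nvidia runtime missing from /etc/docker/daemon.json"
```

`sudo docker info` also lists the registered runtimes, which is a more direct check once the daemon has been restarted.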
Done.
References
- https://docs.nvidia.com/datacenter/tesla/tesla-installation-notes/index.html#centos8
- https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/latest/install-guide.html#installing-on-centos-7-8