Fixing Paddle and Onnxruntime Failing to Find CUDA or CUDNN Libraries in Docker Images

Problem Description

When running Paddle or Onnxruntime-gpu inside a pytorch/pytorch or nvidia/cuda Docker image, you may get errors saying the CUDA or CUDNN library files cannot be found.

Paddle's error:

W1120 01:52:39.805847   549 dynamic_loader.cc:314] The third-party dynamic library (libcudnn.so) that Paddle depends on is not configured correctly. (error code is /usr/local/cuda/lib64/libcudnn.so: cannot open shared object file: No such file or directory)
Suggestions:
1. Check if the third-party dynamic library (e.g. CUDA, CUDNN) is installed correctly and its version is matched with paddlepaddle you installed.
2. Configure third-party dynamic library environment variables as follows:
- Linux: set LD_LIBRARY_PATH by `export LD_LIBRARY_PATH=...`
- Windows: set PATH by `set PATH=XXX;
Traceback (most recent call last):
  File "deploy/pipeline/pipeline.py", line 1321, in <module>
    main()
  File "deploy/pipeline/pipeline.py", line 1308, in main
    pipeline.run_multithreads()
  File "deploy/pipeline/pipeline.py", line 179, in run_multithreads
    self.predictor.run(self.input)
  File "deploy/pipeline/pipeline.py", line 533, in run
    self.predict_video(input, thread_idx=thread_idx)
  File "deploy/pipeline/pipeline.py", line 993, in predict_video
    classes, scores = self.video_action_predictor.predict(
  File "deploy/pipeline/pphuman/video_action_infer.py", line 192, in predict
    input_tensor.copy_from_cpu(inputs)
  File "/usr/local/lib/python3.8/dist-packages/paddle/inference/wrapper.py", line 52, in tensor_copy_from_cpu
    self._copy_from_cpu_bind(data)
RuntimeError: (PreconditionNotMet) Cannot load cudnn shared library. Cannot invoke method cudnnGetVersion.
  [Hint: cudnn_dso_handle should not be null.] (at /paddle/paddle/phi/backends/dynload/cudnn.cc:64)

Onnxruntime-gpu's error:

2024-11-15 02:09:12.841326104 [E:onnxruntime:Default, provider_bridge_ort.cc:1480 TryGetProviderInfo_CUDA] /onnxruntime_src/onnxruntime/core/session/provider_bridge_ort.cc:1193 onnxruntime::Provider& onnxruntime::ProviderLibrary::Get() [ONNXRuntimeError] : 1 : FAIL : Failed to load library libonnxruntime_providers_cuda.so with error: libcublasLt.so.11: cannot open shared object file: No such file or directory

2024-11-15 02:09:12.841349327 [W:onnxruntime:Default, onnxruntime_pybind_state.cc:743 CreateExecutionProviderInstance] Failed to create CUDAExecutionProvider. Please reference https://onnxruntime.ai/docs/execution-providers/CUDA-ExecutionProvider.html#requirements to ensure all dependencies are met.
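
The missing file here, libcublasLt.so.11, belongs to CUDA's cuBLAS. To list every shared library the Onnxruntime CUDA provider fails to resolve, you can run ldd on the provider library itself (the dist-packages path below is an assumption; adjust it to wherever onnxruntime is installed in your Python environment):

ldd /usr/local/lib/python3.8/dist-packages/onnxruntime/capi/libonnxruntime_providers_cuda.so | grep "not found"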

Solution:

First, make sure the pytorch/pytorch or nvidia/cuda image you are using actually ships CUDNN, i.e. the image tag should contain the word cudnn.

If the image does contain the CUDA and CUDNN files but Paddle, Onnxruntime, or similar software still cannot find them, you can fix this by pointing them at the CUDA and CUDNN library directories manually.

Taking the nvidia/cuda:11.7.1-cudnn8-runtime-ubuntu20.04 image as an example:

Locating the library files:

First, start a temporary container to locate the CUDA and CUDNN libraries:

sudo docker run -it --gpus all --rm nvidia/cuda:11.7.1-cudnn8-runtime-ubuntu20.04 /bin/bash

The --rm flag tells Docker to delete the container automatically after it exits.
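
If you prefer a one-shot check without keeping an interactive shell open, the same search can be run directly through docker run (a minimal sketch; the 2>/dev/null merely hides permission noise from pseudo-filesystems like /proc):

sudo docker run --rm nvidia/cuda:11.7.1-cudnn8-runtime-ubuntu20.04 sh -c 'find / -name "libcudnn*" 2>/dev/null'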

Run

find / -name "*libcudnn*"

The result:

/usr/share/doc/libcudnn8
/usr/share/lintian/overrides/libcudnn8
/usr/lib/x86_64-linux-gnu/libcudnn_ops_train.so.8.5.0
/usr/lib/x86_64-linux-gnu/libcudnn_ops_train.so.8
/usr/lib/x86_64-linux-gnu/libcudnn_cnn_infer.so.8
/usr/lib/x86_64-linux-gnu/libcudnn_adv_train.so.8.5.0
/usr/lib/x86_64-linux-gnu/libcudnn_cnn_infer.so.8.5.0
/usr/lib/x86_64-linux-gnu/libcudnn_adv_infer.so.8
/usr/lib/x86_64-linux-gnu/libcudnn_ops_infer.so.8.5.0
/usr/lib/x86_64-linux-gnu/libcudnn_ops_infer.so.8
/usr/lib/x86_64-linux-gnu/libcudnn.so.8
/usr/lib/x86_64-linux-gnu/libcudnn_adv_infer.so.8.5.0
/usr/lib/x86_64-linux-gnu/libcudnn_adv_train.so.8
/usr/lib/x86_64-linux-gnu/libcudnn_cnn_train.so.8
/usr/lib/x86_64-linux-gnu/libcudnn.so.8.5.0
/usr/lib/x86_64-linux-gnu/libcudnn_cnn_train.so.8.5.0
/var/lib/dpkg/info/libcudnn8.md5sums
/var/lib/dpkg/info/libcudnn8.list

So the CUDNN libraries in this image live under /usr/lib/x86_64-linux-gnu.
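
The find output also shows that CUDNN was installed as the dpkg package libcudnn8 (note the /var/lib/dpkg/info/libcudnn8.list entry), so an alternative is to ask dpkg directly where that package put its shared objects:

dpkg -L libcudnn8 | grep '\.so'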

Then run

find / -name "*libcublas*"

The result:

/usr/share/doc/libcublas-11-7
/usr/local/cuda-11.7/targets/x86_64-linux/lib/libcublasLt.so.11.10.3.66
/usr/local/cuda-11.7/targets/x86_64-linux/lib/libcublasLt.so.11
/usr/local/cuda-11.7/targets/x86_64-linux/lib/libcublas.so.11.10.3.66
/usr/local/cuda-11.7/targets/x86_64-linux/lib/libcublas.so.11
/var/lib/dpkg/info/libcublas-11-7.md5sums
/var/lib/dpkg/info/libcublas-11-7.list

So the CUDA libraries (cuBLAS included) in this image live under /usr/local/cuda-11.7/targets/x86_64-linux/lib.

Adding the environment variable:

To let Paddle and Onnxruntime find CUDA and CUDNN, the two paths found above must be added to the LD_LIBRARY_PATH environment variable. If the container is only used temporarily, you can set it with the following command (effective only in the current shell):

export LD_LIBRARY_PATH=/usr/lib/x86_64-linux-gnu:/usr/local/cuda-11.7/targets/x86_64-linux/lib:$LD_LIBRARY_PATH

Replace /usr/lib/x86_64-linux-gnu:/usr/local/cuda-11.7/targets/x86_64-linux/lib with the library paths you found in the previous two steps; separate multiple paths with :.
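
Once the variable is exported, a quick way to verify the fix is to import both libraries and check GPU availability; paddle.utils.run_check() and onnxruntime.get_available_providers() are both public APIs (this assumes both packages are installed in the container's Python):

python -c "import paddle; paddle.utils.run_check()"
python -c "import onnxruntime; print(onnxruntime.get_available_providers())"

If the export worked, run_check() reports that PaddlePaddle can use the GPU, and CUDAExecutionProvider shows up in the provider list.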

When building an image of your own, add the following to the Dockerfile:

RUN echo 'export LD_LIBRARY_PATH=/usr/lib/x86_64-linux-gnu:/usr/local/cuda-11.7/targets/x86_64-linux/lib:$LD_LIBRARY_PATH' >> ~/.bashrc

so that the CUDA and CUDNN library locations are set automatically whenever you open a shell in the container. Note the single quotes around the export line: without them, $LD_LIBRARY_PATH would be expanded once at build time instead of each time ~/.bashrc is sourced. A more robust Dockerfile variant is sketched below.
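
Because ~/.bashrc is only sourced by interactive bash shells, a process started directly (e.g. docker run <image> python app.py) will not pick the variable up. A more robust variant of the same idea is a Dockerfile ENV instruction, which applies to every process in the container; here $LD_LIBRARY_PATH refers to the value inherited from the base image at build time:

ENV LD_LIBRARY_PATH=/usr/lib/x86_64-linux-gnu:/usr/local/cuda-11.7/targets/x86_64-linux/lib:$LD_LIBRARY_PATH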