Invited beta: Docker
lxvicvicvic · posted 2020-12 · Views: 3923 · Replies: 3
Last edited 2020-12

Host: CentOS 7. The CUDA toolkit and nvidia-container-runtime are both installed and working.

Docker image: https://public-codelab.bj.bcebos.com/docker-images/codelab_gpu.0.3.0.tar.gz

After setting "PADDLE_USE_GPU": 1 in examples/cls_cnn_ch.json, I ran:

!python3 run_with_json.py --param_path examples/cls_cnn_ch.json

and got the following error:

INFO: 12-02 11:35:35: base_dataset_reader.py:110 * 139892197275456 set data_generator and start.......
W1202 11:35:37.351891 6466 dynamic_loader.cc:167] You may need to install 'nccl2' from NVIDIA official website: https://developer.nvidia.com/nccl/nccl-downloadbefore install PaddlePaddle.
/usr/local/bin/conda/envs/blackhole/lib/python3.7/site-packages/paddle/fluid/executor.py:1070: UserWarning: The following exception is not an EOF exception.
"The following exception is not an EOF exception.")
ERROR: 12-02 11:35:37: custom_trainer.py:116 * 139892197275456 traceback.format_exc():Traceback (most recent call last):
File "../../wenxin/training/custom_trainer.py", line 59, in train_and_eval
return_numpy=self.return_numpy)
File "textone_pro/training/controler.py", line 437, in controler.BaseTrainer.run
File "/usr/local/bin/conda/envs/blackhole/lib/python3.7/site-packages/paddle/fluid/parallel_executor.py", line 303, in run
return_numpy=return_numpy)
File "/usr/local/bin/conda/envs/blackhole/lib/python3.7/site-packages/paddle/fluid/executor.py", line 1071, in run
six.reraise(*sys.exc_info())
File "/usr/local/bin/conda/envs/blackhole/lib/python3.7/site-packages/six.py", line 703, in reraise
raise value
File "/usr/local/bin/conda/envs/blackhole/lib/python3.7/site-packages/paddle/fluid/executor.py", line 1066, in run
return_merged=return_merged)
File "/usr/local/bin/conda/envs/blackhole/lib/python3.7/site-packages/paddle/fluid/executor.py", line 1156, in _run_impl
program._compile(scope, self.place)
File "/usr/local/bin/conda/envs/blackhole/lib/python3.7/site-packages/paddle/fluid/compiler.py", line 443, in _compile
places=self._places)
File "/usr/local/bin/conda/envs/blackhole/lib/python3.7/site-packages/paddle/fluid/compiler.py", line 396, in _compile_data_parallel
self._exec_strategy, self._build_strategy, self._graph)
paddle.fluid.core_avx.EnforceNotMet:

--------------------------------------------
C++ Call Stacks (More useful to developers):
--------------------------------------------
0 std::string paddle::platform::GetTraceBackString(std::string&&, char const*, int)
1 paddle::platform::EnforceNotMet::EnforceNotMet(paddle::platform::ErrorSummary const&, char const*, int)
2 paddle::platform::dynload::GetNCCLDsoHandle()
3 void std::__once_call_impl(ncclComm**, int, int*)::{lambda()#1} ()> >()
4 paddle::platform::NCCLContextMap::NCCLContextMap(std::vector > const&, ncclUniqueId*, unsigned long, unsigned long)
5 paddle::platform::NCCLCommunicator::InitFlatCtxs(std::vector > const&, std::vector > const&, unsigned long, unsigned long)
6 paddle::framework::ParallelExecutorPrivate::InitNCCLCtxs(paddle::framework::Scope*, paddle::framework::details::BuildStrategy const&)
7 paddle::framework::ParallelExecutorPrivate::InitOrGetNCCLCommunicator(paddle::framework::Scope*, paddle::framework::details::BuildStrategy*)
8 paddle::framework::ParallelExecutor::ParallelExecutor(std::vector > const&, std::vector > const&, std::string const&, paddle::framework::Scope*, std::vector > const&, paddle::framework::details::ExecutionStrategy const&, paddle::framework::details::BuildStrategy const&, paddle::framework::ir::Graph*)

----------------------
Error Message Summary:
----------------------
PreconditionNotMetError: The third-party dynamic library (libnccl.so) that Paddle depends on is not configured correctly. (error code is libnccl.so: cannot open shared object file: No such file or directory)
Suggestions:
1. Check if the third-party dynamic library (e.g. CUDA, CUDNN) is installed correctly and its version is matched with paddlepaddle you installed.
2. Configure third-party dynamic library environment variables as follows:
- Linux: set LD_LIBRARY_PATH by `export LD_LIBRARY_PATH=...`
- Windows: set PATH by `set PATH=XXX; at (/paddle/paddle/fluid/platform/dynload/dynamic_loader.cc:194)


INFO: 12-02 11:35:37: params.py:41 * 139892197275456 ./output/cls_cnn_ch/save_checkpoints/checkpoints_step_1/model.meta
INFO: 12-02 11:35:37: params.py:48 * 139892197275456 {
"deploy_type": 4,
"encrypt_type": null,
"framework_version": "bml-code-lab-public-v1.0.0",
"is_encryption": false,
"job_type": "text_classification",
"model_type": "",
"net_type": "CnnClassification",
"pretrain_model_type": "",
"pretrain_model_version": "",
"stat_file_name": "wenxin_stat",
"task_type": "train"
}
INFO: 12-02 11:35:37: params.py:41 * 139892197275456 ./output/cls_cnn_ch/save_inference_model/inference_step_1/infer_data_params.json
INFO: 12-02 11:35:37: params.py:48 * 139892197275456 {
"fields": [
"text_a#src_ids",
"text_a#seq_lens"
]
}
INFO: 12-02 11:35:37: params.py:41 * 139892197275456 ./output/cls_cnn_ch/save_inference_model/inference_step_1/model.meta
INFO: 12-02 11:35:37: params.py:48 * 139892197275456 {
"deploy_type": 4,
"encrypt_type": null,
"framework_version": "bml-code-lab-public-v1.0.0",
"is_encryption": false,
"job_type": "text_classification",
"model_type": "",
"net_type": "CnnClassification",
"pretrain_model_type": "",
"pretrain_model_version": "",
"stat_file_name": "wenxin_stat",
"task_type": "train"
}
Traceback (most recent call last):
File "run_with_json.py", line 115, in <module>
run_trainer(_params)
File "run_with_json.py", line 101, in run_trainer
trainer.train_and_eval()
File "../../wenxin/training/custom_trainer.py", line 119, in train_and_eval
raise e
File "../../wenxin/training/custom_trainer.py", line 59, in train_and_eval
return_numpy=self.return_numpy)
File "textone_pro/training/controler.py", line 437, in controler.BaseTrainer.run
File "/usr/local/bin/conda/envs/blackhole/lib/python3.7/site-packages/paddle/fluid/parallel_executor.py", line 303, in run
return_numpy=return_numpy)
File "/usr/local/bin/conda/envs/blackhole/lib/python3.7/site-packages/paddle/fluid/executor.py", line 1071, in run
six.reraise(*sys.exc_info())
File "/usr/local/bin/conda/envs/blackhole/lib/python3.7/site-packages/six.py", line 703, in reraise
raise value
File "/usr/local/bin/conda/envs/blackhole/lib/python3.7/site-packages/paddle/fluid/executor.py", line 1066, in run
return_merged=return_merged)
File "/usr/local/bin/conda/envs/blackhole/lib/python3.7/site-packages/paddle/fluid/executor.py", line 1156, in _run_impl
program._compile(scope, self.place)
File "/usr/local/bin/conda/envs/blackhole/lib/python3.7/site-packages/paddle/fluid/compiler.py", line 443, in _compile
places=self._places)
File "/usr/local/bin/conda/envs/blackhole/lib/python3.7/site-packages/paddle/fluid/compiler.py", line 396, in _compile_data_parallel
self._exec_strategy, self._build_strategy, self._graph)
paddle.fluid.core_avx.EnforceNotMet:

--------------------------------------------
C++ Call Stacks (More useful to developers):
--------------------------------------------
0 std::string paddle::platform::GetTraceBackString(std::string&&, char const*, int)
1 paddle::platform::EnforceNotMet::EnforceNotMet(paddle::platform::ErrorSummary const&, char const*, int)
2 paddle::platform::dynload::GetNCCLDsoHandle()
3 void std::__once_call_impl(ncclComm**, int, int*)::{lambda()#1} ()> >()
4 paddle::platform::NCCLContextMap::NCCLContextMap(std::vector > const&, ncclUniqueId*, unsigned long, unsigned long)
5 paddle::platform::NCCLCommunicator::InitFlatCtxs(std::vector > const&, std::vector > const&, unsigned long, unsigned long)
6 paddle::framework::ParallelExecutorPrivate::InitNCCLCtxs(paddle::framework::Scope*, paddle::framework::details::BuildStrategy const&)
7 paddle::framework::ParallelExecutorPrivate::InitOrGetNCCLCommunicator(paddle::framework::Scope*, paddle::framework::details::BuildStrategy*)
8 paddle::framework::ParallelExecutor::ParallelExecutor(std::vector > const&, std::vector > const&, std::string const&, paddle::framework::Scope*, std::vector > const&, paddle::framework::details::ExecutionStrategy const&, paddle::framework::details::BuildStrategy const&, paddle::framework::ir::Graph*)

----------------------
Error Message Summary:
----------------------
PreconditionNotMetError: The third-party dynamic library (libnccl.so) that Paddle depends on is not configured correctly. (error code is libnccl.so: cannot open shared object file: No such file or directory)
Suggestions:
1. Check if the third-party dynamic library (e.g. CUDA, CUDNN) is installed correctly and its version is matched with paddlepaddle you installed.
2. Configure third-party dynamic library environment variables as follows:
- Linux: set LD_LIBRARY_PATH by `export LD_LIBRARY_PATH=...`
- Windows: set PATH by `set PATH=XXX; at (/paddle/paddle/fluid/platform/dynload/dynamic_loader.cc:194)

terminate called without an active exception
W1202 11:35:37.705427 6513 init.cc:226] Warning: PaddlePaddle catches a failure signal, it may not work properly
W1202 11:35:37.705468 6513 init.cc:228] You could check whether you killed PaddlePaddle thread/process accidentally or report the case to PaddlePaddle
W1202 11:35:37.705476 6513 init.cc:231] The detail failure signal is:

W1202 11:35:37.705487 6513 init.cc:234] *** Aborted at 1606880137 (unix time) try "date -d @1606880137" if you are using GNU date ***
W1202 11:35:37.709525 6513 init.cc:234] PC: @ 0x0 (unknown)
W1202 11:35:37.709700 6513 init.cc:234] *** SIGABRT (@0x3e800001942) received by PID 6466 (TID 0x7f3abdb98700) from PID 6466; stack trace: ***
W1202 11:35:37.712968 6513 init.cc:234] @ 0x7f3b30774980 (unknown)
W1202 11:35:37.716092 6513 init.cc:234] @ 0x7f3b303affb7 gsignal
W1202 11:35:37.719053 6513 init.cc:234] @ 0x7f3b303b1921 abort
W1202 11:35:37.721267 6513 init.cc:234] @ 0x7f3b063c784a __gnu_cxx::__verbose_terminate_handler()
W1202 11:35:37.722939 6513 init.cc:234] @ 0x7f3b063c5f47 __cxxabiv1::__terminate()
W1202 11:35:37.724933 6513 init.cc:234] @ 0x7f3b063c5f7d std::terminate()
W1202 11:35:37.726735 6513 init.cc:234] @ 0x7f3b063c5c5a __gxx_personality_v0
W1202 11:35:37.729219 6513 init.cc:234] @ 0x7f3b2c672b97 _Unwind_ForcedUnwind_Phase2
W1202 11:35:37.731604 6513 init.cc:234] @ 0x7f3b2c672e7d _Unwind_ForcedUnwind
W1202 11:35:37.734496 6513 init.cc:234] @ 0x7f3b30773000 __GI___pthread_unwind
W1202 11:35:37.737349 6513 init.cc:234] @ 0x7f3b3076aae5 __pthread_exit
W1202 11:35:37.738099 6513 init.cc:234] @ 0x55db59fb1e49 PyThread_exit_thread
W1202 11:35:37.738323 6513 init.cc:234] @ 0x55db59e35b23 PyEval_RestoreThread.cold.796
W1202 11:35:37.741436 6513 init.cc:234] @ 0x7f3aca40ee69 pybind11::gil_scoped_release::~gil_scoped_release()
W1202 11:35:37.741873 6513 init.cc:234] @ 0x7f3aca4f7976 _ZZN8pybind1112cpp_function10initializeIZN6paddle6pybind10BindReaderEPNS_6moduleEEUlRNS2_9operators6reader22LoDTensorBlockingQueueERKSt6vectorINS2_9framework9LoDTensorESaISC_EEE1_bIS9_SG_EINS_4nameENS_9is_methodENS_7siblingENS_10call_guardIINS_18gil_scoped_releaseEEEEEEEvOT_PFT0_DpT1_EDpRKT2_ENUlRNS_6detail13function_callEE1_4_FUNES11_
W1202 11:35:37.744927 6513 init.cc:234] @ 0x7f3aca42c679 pybind11::cpp_function::dispatcher()
W1202 11:35:37.745769 6513 init.cc:234] @ 0x55db59f32914 _PyMethodDef_RawFastCallKeywords
W1202 11:35:37.746541 6513 init.cc:234] @ 0x55db59f32a31 _PyCFunction_FastCallKeywords
W1202 11:35:37.747300 6513 init.cc:234] @ 0x55db59f9f39e _PyEval_EvalFrameDefault
W1202 11:35:37.747999 6513 init.cc:234] @ 0x55db59ee2160 _PyEval_EvalCodeWithName
W1202 11:35:37.748723 6513 init.cc:234] @ 0x55db59ee2925 _PyFunction_FastCallDict
W1202 11:35:37.749497 6513 init.cc:234] @ 0x55db59f9beea _PyEval_EvalFrameDefault
W1202 11:35:37.750239 6513 init.cc:234] @ 0x55db59f31e7b _PyFunction_FastCallKeywords
W1202 11:35:37.751001 6513 init.cc:234] @ 0x55db59f9a740 _PyEval_EvalFrameDefault
W1202 11:35:37.751672 6513 init.cc:234] @ 0x55db59f31e7b _PyFunction_FastCallKeywords
W1202 11:35:37.752458 6513 init.cc:234] @ 0x55db59f9a740 _PyEval_EvalFrameDefault
W1202 11:35:37.753166 6513 init.cc:234] @ 0x55db59ee285b _PyFunction_FastCallDict
W1202 11:35:37.753934 6513 init.cc:234] @ 0x55db59f014d3 _PyObject_Call_Prepend
W1202 11:35:37.754766 6513 init.cc:234] @ 0x55db59ef3ffe PyObject_Call
W1202 11:35:37.755129 6513 init.cc:234] @ 0x55db59ff2f77 t_bootstrap
W1202 11:35:37.755313 6513 init.cc:234] @ 0x55db59fad818 pythread_wrapper
W1202 11:35:37.758723 6513 init.cc:234] @ 0x7f3b307696db start_thread
Aborted

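Does this mean CUDA is not installed properly inside the Docker container? Running this example on CPU inside the container works. Also, in the container's terminal only nvidia-smi is available; nvcc cannot be found.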
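
The error summary above says the dynamic loader could not open libnccl.so. One way to check that directly, independent of Paddle, is to ask the loader to open the library from Python inside the container. This is only a diagnostic sketch: the helper name `can_load` is made up for illustration, and `libcuda.so.1` is included merely as a comparison point on the assumption that the NVIDIA runtime exposes the driver library under that name.

```python
import ctypes
import os


def can_load(lib_name: str) -> bool:
    """Return True if the dynamic loader can open lib_name (hypothetical helper)."""
    try:
        ctypes.CDLL(lib_name)
        return True
    except OSError:
        # Raised when the shared object is missing or not on the search path,
        # which is the same condition the Paddle error message reports.
        return False


# Show the search path the loader will use, then probe the two libraries.
print("LD_LIBRARY_PATH =", os.environ.get("LD_LIBRARY_PATH", "(unset)"))
for name in ("libcuda.so.1", "libnccl.so"):
    print(name, "loadable:", can_load(name))
```

If `libnccl.so` reports as not loadable, the message from Paddle is accurate for this container, and the remaining question is whether the run needs NCCL at all.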

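3 replies in total. Last reply by JavaRoom, 2020-12
#4 JavaRoom replied 2020-12
Replying to #2 lxvicvicvic:
Solved: in the notebook, even running export CUDA_VISIBLE_DEVICES='0' did not fix the problem, but running export CUDA_VISIBLE_DEVICES='0' in a terminal first and then starting the example program worked.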

Hahaha, I got it working too.

Why not try CentOS 8?

#3 春水shine replied 2020-12

Nice!!!

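#2 lxvicvicvic replied 2020-12

Solved:

In the notebook, even running export CUDA_VISIBLE_DEVICES='0' did not fix the problem, but running export CUDA_VISIBLE_DEVICES='0' in a terminal first and then starting the example program worked.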
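
A possible explanation for the notebook behaviour, offered as an assumption rather than something confirmed in this thread: a `!` command in a notebook cell runs in its own short-lived shell, so an `export` issued there is gone by the time the next `!python3 ...` line starts. It may also be that with a single visible GPU the multi-GPU path that loads NCCL is simply not taken, which would explain why the libnccl.so error disappears. The sketch below only demonstrates the environment-variable part, using hypothetical helper names and assuming a POSIX shell.

```python
import os
import subprocess
import sys


def export_in_subshell_persists(var: str = "DEMO_VAR") -> bool:
    """Run `export` in a child shell and report whether this process sees it."""
    subprocess.run(f"export {var}=1", shell=True, check=True)
    return var in os.environ


def child_sees(var: str) -> str:
    """Start a child Python process and return the value it sees for var."""
    code = f"import os; print(os.environ.get({var!r}, ''))"
    result = subprocess.run(
        [sys.executable, "-c", code],
        capture_output=True, text=True, check=True,
    )
    return result.stdout.strip()


# The export happened in a child shell, so it does not reach this process.
print("export persisted:", export_in_subshell_persists())

# Setting the variable on the current process is inherited by its children.
os.environ["CUDA_VISIBLE_DEVICES"] = "0"
print("child sees:", child_sees("CUDA_VISIBLE_DEVICES"))
```

Under that assumption, setting `os.environ["CUDA_VISIBLE_DEVICES"] = "0"` (or using the IPython magic `%env CUDA_VISIBLE_DEVICES=0`) in a cell before the `!python3 run_with_json.py ...` line should have the same effect as exporting the variable in a terminal.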
