Single-machine multi-GPU training fails, but single-GPU training works fine. Help needed
学大哥哥哥 posted 2021-06 Views: 2740 Replies: 5

Error messages shown in the console:

W0619 03:01:49.530189 46720 device_context.cc:404] Please NOTE: device: 0, GPU Compute Capability: 7.0, Driver API Version: 10.1, Runtime API Version: 10.1
W0619 03:01:49.535388 46720 device_context.cc:422] device: 0, cuDNN Version: 7.6.
INFO 2021-06-19 03:01:52,836 launch_utils.py:327] terminate all the procs
ERROR 2021-06-19 03:01:52,836 launch_utils.py:582] ABORT!!! Out of all 4 trainers, the trainer process with rank=[1, 3] was aborted. Please check its log.
INFO 2021-06-19 03:01:55,839 launch_utils.py:327] terminate all the procs
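
The launcher message above names ranks 1 and 3 as the trainers that aborted, so their logs are the ones most likely to hold the root cause. Here is a minimal sketch for pulling the last lines of those logs. It assumes the launcher wrote per-rank files named `workerlog.<rank>` under a `log` directory, which I believe is the default for `paddle.distributed.launch`; adjust the path if `--log_dir` was set. `tail_rank_logs` is a hypothetical helper name, not part of Paddle:

```python
from pathlib import Path


def tail_rank_logs(log_dir, ranks, n=20):
    """Return the last n lines of each rank's log, or None if the file is missing.

    Assumption: per-rank logs are named workerlog.<rank> inside log_dir.
    """
    tails = {}
    for rank in ranks:
        path = Path(log_dir) / f"workerlog.{rank}"
        if path.is_file():
            # errors="replace" keeps going even if the log has odd bytes in it
            lines = path.read_text(errors="replace").splitlines()
            tails[rank] = lines[-n:]
        else:
            tails[rank] = None
    return tails


if __name__ == "__main__":
    # Ranks 1 and 3 are the ones reported as aborted in the console output.
    for rank, lines in tail_rank_logs("log", [1, 3]).items():
        print(f"===== rank {rank} =====")
        print("log file not found" if lines is None else "\n".join(lines))
```

If those tails also show only a `Termination signal` summary, the first real error may be further up in the same file, so increasing `n` is worth a try.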

 

Log output:

--------------------------------------
C++ Traceback (most recent call last):
--------------------------------------
0 paddle::pybind::MultiDeviceFeedReader::CheckNextStatus()
1 paddle::pybind::MultiDeviceFeedReader::WaitFutures(std::__exception_ptr::exception_ptr*)
2 paddle::framework::SignalHandle(char const*, int)
3 paddle::platform::GetCurrentTraceBackString[abi:cxx11]()

----------------------
Error Message Summary:
----------------------
FatalError: `Termination signal` is detected by the operating system.
[TimeInfo: *** Aborted at 1624017534 (unix time) try "date -d @1624017534" if you are using GNU date ***]
[SignalInfo: *** SIGTERM (@0x7687) received by PID 30357 (TID 0x7fe69524d740) from PID 30343 ***]



W0618 12:01:02.782593 30477 device_context.cc:404] Please NOTE: device: 0, GPU Compute Capability: 7.0, Driver API Version: 10.1, Runtime API Version: 10.1
W0618 12:01:02.786732 30477 device_context.cc:422] device: 0, cuDNN Version: 7.6.


--------------------------------------
C++ Traceback (most recent call last):
--------------------------------------
0 paddle::imperative::Tracer::TraceOp(std::string const&, paddle::imperative::NameVarBaseMap const&, paddle::imperative::NameVarBaseMap const&, paddle::framework::AttributeMap, paddle::platform::Place const&, bool, std::map, std::allocator > > const&)
1 paddle::imperative::PreparedOp::Prepare(paddle::imperative::NameVarBaseMap const&, paddle::imperative::NameVarBaseMap const&, paddle::framework::OperatorWithKernel const&, paddle::platform::Place const&, paddle::framework::AttributeMap const&)
2 paddle::imperative::PreparedOp paddle::imperative::PrepareImpl(paddle::imperative::details::NameVarMapTrait::Type const&, paddle::imperative::details::NameVarMapTrait::Type const&, paddle::framework::OperatorWithKernel const&, paddle::platform::Place const&, paddle::framework::AttributeMap const&)
3 paddle::platform::DeviceContextPool::Get(paddle::platform::Place const&)
4 std::__future_base::_Deferred_state(std::map > >, std::less, std::allocator > > > > >*, paddle::platform::Place)::{lambda()#1} ()>, std::unique_ptr > >::_M_complete_async()
5 std::__future_base::_State_baseV2::_M_set_result(std::function ()>, bool)
6 std::__future_base::_State_baseV2::_M_do_set(std::function ()>*, bool*)
7 std::_Function_handler (), std::__future_base::_Task_setter > >, std::__future_base::_Result_base::_Deleter>, std::_Bind_simple(std::map > >, std::less, std::allocator > > > > >*, paddle::platform::Place)::{lambda()#1} ()>, std::unique_ptr > > >::_M_invoke(std::_Any_data const&)
8 paddle::platform::CUDADeviceContext::CUDADeviceContext(paddle::platform::CUDAPlace)
9 paddle::platform::CUDAContext::CUDAContext(paddle::platform::CUDAPlace const&, paddle::platform::stream::Priority const&)
10 paddle::platform::stream::CUDAStream::Init(paddle::platform::Place const&, paddle::platform::stream::Priority const&)
11 paddle::framework::SignalHandle(char const*, int)
12 paddle::platform::GetCurrentTraceBackString[abi:cxx11]()

----------------------
Error Message Summary:
----------------------
FatalError: `Termination signal` is detected by the operating system.
[TimeInfo: *** Aborted at 1624017663 (unix time) try "date -d @1624017663" if you are using GNU date ***]
[SignalInfo: *** SIGTERM (@0x76fd) received by PID 30477 (TID 0x7f349a41b740) from PID 30461 ***]
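
Both summaries above report a SIGTERM sent from another PID rather than a crash inside the process itself. That would be consistent with the launcher's "terminate all the procs" step stopping the remaining trainers. Below is a small sketch for extracting the signal name, the two PIDs, and a readable UTC time from such a summary. `parse_abort` is a hypothetical helper name, and the regular expressions assume the exact line format shown in this post:

```python
import re
from datetime import datetime, timezone

# Patterns written against the TimeInfo / SignalInfo lines quoted in this post.
TIME_RE = re.compile(r"Aborted at (\d+) \(unix time\)")
SIGNAL_RE = re.compile(
    r"\*\*\* (SIG\w+) \(@0x[0-9a-f]+\) received by PID (\d+) .* from PID (\d+)"
)


def parse_abort(summary):
    """Extract signal, receiving PID, sending PID and UTC time from a summary."""
    t = TIME_RE.search(summary)
    s = SIGNAL_RE.search(summary)
    if t is None or s is None:
        raise ValueError("summary does not match the expected format")
    when = datetime.fromtimestamp(int(t.group(1)), tz=timezone.utc)
    return {
        "signal": s.group(1),
        "pid": int(s.group(2)),
        "sender_pid": int(s.group(3)),
        "utc": when.isoformat(),
    }


sample = (
    '[TimeInfo: *** Aborted at 1624017534 (unix time) try "date -d @1624017534" '
    "if you are using GNU date ***]\n"
    "[SignalInfo: *** SIGTERM (@0x7687) received by PID 30357 "
    "(TID 0x7fe69524d740) from PID 30343 ***]"
)
info = parse_abort(sample)
print(info)
```

For the first summary this gives SIGTERM, receiving PID 30357, sending PID 30343, and 2021-06-18T11:58:54+00:00. Checking whether the sending PID is the launch process is a quick way to tell a trainer that was killed by the launcher from one that failed on its own.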

5 replies in total. Last reply by kjysxa, 2022-03
#6 kjysxa replied 2022-03

+1

#5 舍焰 replied 2021-12

How did you solve it?

#4 [muted user] replied 2021-12

You can submit a support ticket~

#3 周冬东东 replied 2021-12

Has this been solved?

#2 来自星星的春哥 replied 2021-12

Has this been solved? I have the same problem.
