Training Job API

Last updated: 2025-01-10

This document describes the API for training jobs under custom jobs. If you are new to the product, refer to the related guides first.

Authentication

Before calling the API, you must complete authentication. See the authentication mechanism documentation for details.

API Reference

The platform exposes five APIs:

Training Job - Create
Training Job - Stop
Training Job - Delete
Training Job - Job List
Training Job - Job Details

【Training Job - Create】

Endpoint: https://aip.baidubce.com/rpc/2.0/easydl/pro/remotetrain/create

Method: POST

Request parameters:

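Parameter	Required	Type	Allowed values	Description
name	Yes	string	-	Job name, 1-50 characters, no emoji
description	Yes	string	-	Job description, 1-500 characters, no emoji
train_config	Yes	map	-	Training configuration
+ env_type	Yes	string	CCR: Baidu CCR image environment	Training environment type
+ output_UrlSource	Yes	string	BOS: stored in BOS; PFS: stored in PFS	Storage source of output_path
+ output_path	Yes	string	-	Output path for training artifacts
+ code_UrlSource	Yes	string	BOS: stored in BOS; PFS: stored in PFS	Storage source of code_path
+ code_path	No	string	-	Path to the startup code. The training code can instead be defined inside a custom image and invoked through the start command
+ start_cmd	Yes	string	-	Start command
+ image	No	string	-	Image; required for custom images, e.g. /ns-project/paddle
+ tag	No	string	-	Image tag; required for custom images
+ repository_id	No	string	-	ID of the repository bound on the BML platform; required for custom images
+ distribute_strategy	No	string	Horovod; PaddleFleet; PytorchDDP	Distributed training strategy; required for custom images
dataset_config	Yes	map	-	Dataset configuration
+ trainset_UrlSource	No	string	BOS: stored in BOS; PFS: stored in PFS	Storage source of trainset_path
+ trainset_path	No	string	-	Training dataset; can be defined in a custom image
+ evalset_UrlSource	No	string	BOS: stored in BOS; PFS: stored in PFS	Storage source of evalset_path
+ evalset_path	No	string	-	Evaluation dataset; can be defined in a custom image
resource_config	Yes	map	-	Resource pool configuration
+ resource_type	Yes	string	PUBLIC: public resource pool; USER: user resource pool	Resource pool type, default PUBLIC
+ node_count	Yes	int	-	Number of nodes; public resource pools support 1-4 nodes
+ resource_name	Yes	string	Public resource pool: GPU_V100, GPU_P4, CPU_4C_16G, CPU_16C_64G. User resource pool: the pool name registered on the BML platform	Resource pool name
+ max_duration	Yes	int	1-168	Maximum run time, in hours
+ gpu_type	No	string	-	Accelerator card type
+ gpu_num	No	int	-	Number of accelerator cards
+ cpu_num	No	int	-	Number of CPU cores
+ memory	No	int	-	Memory, in GB
+ rdma	No	bool	-	Whether to enable RDMA
prjId	Yes	string	-	Project ID (newly added)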

Response parameters:

Parameter	Type	Description
result	int	Job ID
log_id	int	Log ID
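
The request table for the create call can be turned into a small payload builder. The following is a minimal sketch, not an official client: the function name `build_create_payload` and its `custom_image` argument are hypothetical, and it assumes that the rows marked with `+` are nested as keys inside their parent map. It checks only the limits stated in the table (name length, description length, `max_duration`, and the public-pool node count).

```python
from typing import Any, Dict, Optional


def build_create_payload(
    name: str,
    description: str,
    prj_id: str,
    output_path: str,
    start_cmd: str,
    resource_name: str,
    node_count: int = 1,
    max_duration: int = 1,
    resource_type: str = "PUBLIC",
    storage: str = "BOS",
    custom_image: Optional[Dict[str, str]] = None,
) -> Dict[str, Any]:
    """Assemble a body for the create endpoint (nesting of '+' rows is assumed)."""
    # Limits below are taken from the request parameter table.
    if not 1 <= len(name) <= 50:
        raise ValueError("name must be 1-50 characters")
    if not 1 <= len(description) <= 500:
        raise ValueError("description must be 1-500 characters")
    if not 1 <= max_duration <= 168:
        raise ValueError("max_duration must be 1-168 hours")
    if resource_type == "PUBLIC" and not 1 <= node_count <= 4:
        raise ValueError("public resource pools support 1-4 nodes")

    train_config: Dict[str, Any] = {
        "env_type": "CCR",
        "output_UrlSource": storage,
        "output_path": output_path,
        "code_UrlSource": storage,
        "start_cmd": start_cmd,
    }
    # image, tag, repository_id and distribute_strategy are required
    # for custom images; the caller passes them in as a plain dict.
    if custom_image:
        train_config.update(custom_image)

    return {
        "name": name,
        "description": description,
        "prjId": prj_id,
        "train_config": train_config,
        "dataset_config": {},
        "resource_config": {
            "resource_type": resource_type,
            "node_count": node_count,
            "resource_name": resource_name,
            "max_duration": max_duration,
        },
    }
```

Validating locally before sending avoids a round trip for errors such as 406016 (the job name is too long) or 406101 (node over limit). The path and command values passed in are up to the caller; none are prescribed here.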

【Training Job - Stop】

Endpoint: https://aip.baidubce.com/rpc/2.0/easydl/pro/remotetrain/stop

Method: POST

Request parameters:

Parameter	Required	Type	Location	Allowed values	Description
id	Yes	int	RequestBody	-	Job ID
prjId	Yes	string	RequestBody	-	Project ID (newly added)

Response parameters:

Parameter	Type	Description
log_id	int	Log ID
result	bool	Whether the job was stopped successfully

【Training Job - Delete】

Endpoint: https://aip.baidubce.com/rpc/2.0/easydl/pro/remotetrain/delete

Method: POST

Request parameters:

Parameter	Required	Type	Location	Allowed values	Description
id	Yes	int	RequestBody	-	Job ID
prjId	Yes	string	RequestBody	-	Project ID (newly added)

Response parameters:

Parameter	Type	Description
log_id	int	Log ID
result	bool	Whether the job was deleted successfully
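
The stop and delete calls take the same two body fields, so one helper can prepare either request. This is a sketch under assumptions: the helper name `build_job_request` is hypothetical, and passing the credential as an `access_token` query parameter is an assumption, since this page leaves the details to the authentication documentation. The function only builds the request object and does not send it.

```python
import json
import urllib.parse
import urllib.request

BASE_URL = "https://aip.baidubce.com/rpc/2.0/easydl/pro/remotetrain"


def build_job_request(
    action: str, job_id: int, prj_id: str, access_token: str
) -> urllib.request.Request:
    """Prepare (but do not send) a POST request for the stop or delete endpoint."""
    if action not in ("stop", "delete"):
        raise ValueError("action must be 'stop' or 'delete'")
    # Assumption: the token travels as an access_token query parameter.
    query = urllib.parse.urlencode({"access_token": access_token})
    url = f"{BASE_URL}/{action}?{query}"
    # Both fields go in the request body, as the parameter tables state.
    body = json.dumps({"id": job_id, "prjId": prj_id}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )
```

To send the request, pass the returned object to `urllib.request.urlopen` and decode the JSON response. A `result` value of `true` indicates that the operation succeeded.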

【Training Job - Job List】

Endpoint: https://aip.baidubce.com/rpc/2.0/easydl/pro/remotetrain/list

Method: POST

Request parameters:

Parameter	Required	Type	Location	Allowed values	Description
start	Yes	int	RequestBody	-	Pagination offset, default 0
num	Yes	int	RequestBody	-	Pagination limit, default 20
prjId	Yes	string	RequestBody	-	Project ID (newly added)

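Response parameters:

Parameter	Type	Description
total	int	Total number of jobs
id	int	Job ID
name	string	Job name
description	string	Job description
train_status	string	Training status: UNTRAIN (being edited, not yet trained), SUCCESSED (training completed), RUNNING (training in progress), STOPPED (training stopped), FAILED (training failed), PENDING (queued)
run_time	int	Training time, in minutes
resource_type	string	Resource pool type
resource_name	string	Resource pool name
framework	string	Framework
create_time	int	Creation time
log_id	int	Log ID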
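
Because `start` is an offset and `num` is a page size, the request bodies needed to walk the whole list can be derived from the `total` value returned by a first call. The generator below is a minimal sketch of that arithmetic; the name `list_request_bodies` is hypothetical, and the code makes no assumption about how the job entries are wrapped in the response.

```python
from typing import Dict, Iterator, Union


def list_request_bodies(
    total: int, prj_id: str, num: int = 20
) -> Iterator[Dict[str, Union[int, str]]]:
    """Yield one request body per page needed to cover `total` jobs."""
    if num <= 0:
        raise ValueError("num must be positive")
    # start advances by the page size until every job is covered.
    for start in range(0, total, num):
        yield {"start": start, "num": num, "prjId": prj_id}
```

With 45 jobs and the default page size of 20, this yields three bodies with `start` values 0, 20, and 40.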

【Training Job - Job Details】

Endpoint: https://aip.baidubce.com/rpc/2.0/easydl/pro/remotetrain/detail

Method: POST

Request parameters:

Parameter	Required	Type	Location	Allowed values	Description
id	Yes	int	RequestBody	-	Job ID
prjId	Yes	string	RequestBody	-	Project ID (newly added)

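Response parameters:

Parameter	Type	Description
id	int	Job ID
name	string	Job name
description	string	Job description
train_status	string	Training status: UNTRAIN (being edited, not yet trained), SUCCESSED (training completed), RUNNING (training in progress), STOPPED (training stopped), FAILED (training failed), PENDING (queued)
run_time	int	Training time, in minutes
create_time	int	Creation time
train_config	map	Training configuration
+ code_path	string	Path to the startup code
+ start_cmd	string	Start command
+ output_path	string	Output path
+ framework	string	Training configuration: framework
+ env_type	string	Training configuration: training environment type. DEFAULT: platform preset environment; CCR: Baidu CCR image environment
+ image	string	Training configuration: image
+ tag	string	Training configuration: image tag
+ distribute_strategy	string	Training configuration: distributed training strategy, Horovod; PaddleFleet; PytorchDDP
+ repository_id	string	Training configuration: ID of the repository bound on the BML platform for the image
+ trainset_path	string	Dataset configuration: training dataset
+ evalset_path	string	Dataset configuration: evaluation dataset
resource_config	map	Resource pool configuration
+ node_count	int	Number of nodes
+ resource_name	string	Resource pool name
+ max_duration	int	Maximum run time, in hours
+ gpu_type	string	Accelerator card type
+ gpu_num	int	Number of accelerator cards
+ cpu_num	string	Number of CPU cores
+ memory	int	Memory, in GB
+ rdma	bool	Whether to enable RDMA
dataset_config	map	Dataset configuration
+ trainset_path	string	Training dataset path
+ evalset_path	string	Test dataset path
log_id	int	Log ID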
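
A common use of the details call is to poll `train_status` until the job finishes. The sketch below assumes that SUCCESSED, STOPPED, and FAILED are the final states and that the others may still change; this split is an interpretation of the status descriptions, not something the page states. The names `wait_for_job` and `fetch_detail` are hypothetical. `fetch_detail` stands for any callable that performs the request and returns the dictionary holding the detail fields.

```python
import time
from typing import Any, Callable, Dict

# Assumption: these three statuses are final; the rest may still change.
TERMINAL_STATUSES = {"SUCCESSED", "STOPPED", "FAILED"}


def wait_for_job(
    fetch_detail: Callable[[], Dict[str, Any]],
    interval_s: float = 30.0,
    max_polls: int = 120,
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    """Poll until train_status is terminal and return that status."""
    for attempt in range(max_polls):
        status = fetch_detail()["train_status"]
        if status in TERMINAL_STATUSES:
            return status
        # Wait between polls, but not after the final attempt.
        if attempt < max_polls - 1:
            sleep(interval_s)
    raise TimeoutError("job did not reach a terminal status in time")
```

Bounding the number of polls matters here because a job left in UNTRAIN never starts running on its own, so an unbounded loop would not terminate.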

Error Codes

Error code	Error message
406000	internal server error
406001	param[%s] invalid
406008	[%s] quota exceeded
406012	job does not exist
406013	job stop failed
406014	job can not delete
406015	job name cannot be empty
406016	the job name is too long
406017	job description cannot be empty
406018	job description is too long
406019	code file address error
406020	startup command error
406021	output path error
406022	training dataset address error
406023	test dataset address error
406024	illegal language version
406025	illegal language version
406026	job name already exists
406027	job status exception
406028	resource type error
406029	there is no remaining free quota, please activate payment or recharge
406030	missing image parameters
406031	missing tag parameters
406032	missing distributed strategy parameter
406033	image registry is incorrect
406034	training environment error
406035	the %s quantity has reached the upper limit
406036	the emoji character is not supported
406037	no operation permission
406038	parameter too long
406100	resource pool does not exist
406101	node over limit
406102	cpu over limit
406103	memory over limit
406104	gpu over limit
406105
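
Client code can convert the error codes above into exceptions. The following is a sketch only: this page does not show the error response format, so the field names `error_code` and `error_msg` are assumptions, and `TrainJobError` and `raise_for_error` are hypothetical names. The lookup table copies a few messages from the table above as fallbacks.

```python
from typing import Any, Dict

# A few messages copied from the error code table, used as fallbacks.
KNOWN_ERRORS = {
    406000: "internal server error",
    406012: "job does not exist",
    406026: "job name already exists",
    406037: "no operation permission",
    406100: "resource pool does not exist",
}


class TrainJobError(Exception):
    """Raised when a response carries an error code."""

    def __init__(self, code: int, message: str) -> None:
        super().__init__(f"[{code}] {message}")
        self.code = code
        self.message = message


def raise_for_error(response: Dict[str, Any]) -> Dict[str, Any]:
    """Return the response unchanged, or raise if it reports an error."""
    # Assumption: errors are reported through error_code and error_msg.
    code = response.get("error_code")
    if code is None:
        return response
    code = int(code)
    message = response.get("error_msg") or KNOWN_ERRORS.get(code, "unknown error")
    raise TrainJobError(code, message)
```

Callers can then branch on `exc.code`, for example treating 406012 (job does not exist) as success when cleaning up jobs that may already be deleted.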
