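Native Multimodal

Last updated: 2025-04-11

Native multimodal foundation models are trained by jointly modeling multiple modalities so that the modalities are optimized together, which gives them strong multimodal understanding. Their language ability is also stronger: understanding, generation, reasoning, and memory all improve, with notable gains in hallucination reduction, logical reasoning, and coding.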

Supported models

Model name	Model version	model parameter value	max_completion_tokens range
ERNIE 4.5	ERNIE-4.5-8K-Preview	ernie-4.5-8k-preview	[2, 8192]
Llama-4-Maverick	Llama-4-Maverick-17B-128E-Instruct	llama-4-maverick-17b-128e-instruct	[2, 8192]
Llama-4-Scout	Llama-4-Scout-17B-16E-Instruct	llama-4-scout-17b-16e-instruct	[2, 8192]
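
The ranges in the table can be checked on the client before a request is sent. The following is a minimal sketch; the helper name check_max_completion_tokens is hypothetical, and the lookup table simply restates the rows above.

```python
# Allowed max_completion_tokens range per model, copied from the table above.
MAX_COMPLETION_TOKENS_RANGE = {
    "ernie-4.5-8k-preview": (2, 8192),
    "llama-4-maverick-17b-128e-instruct": (2, 8192),
    "llama-4-scout-17b-16e-instruct": (2, 8192),
}


def check_max_completion_tokens(model: str, value: int) -> int:
    """Return value unchanged if it is valid for the model, else raise ValueError."""
    if model not in MAX_COMPLETION_TOKENS_RANGE:
        raise ValueError(f"unknown model: {model}")
    low, high = MAX_COMPLETION_TOKENS_RANGE[model]
    if not low <= value <= high:
        raise ValueError(
            f"max_completion_tokens must be in [{low}, {high}], got {value}"
        )
    return value
```

Validating locally avoids spending a round trip on a request that the service would presumably reject.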

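Usage

Input

Supported input modalities

The following input combinations are supported:

Text input
Image input
Image + text input

How to pass multimodal data

Input images and files can be passed as Base64-encoded data or as a public URL. The examples below all use public URLs; to pass Base64-encoded data instead, see the section "Base64-encoded image input".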

Output

Multimodal models can currently be called in streaming or non-streaming mode.
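
The two modes return text in different shapes. The sketch below shows one way to read each. The non-streaming layout follows the response samples in this document. The streaming layout is an assumption: it presumes OpenAI-style server-sent-event lines ("data: {...}" terminated by "data: [DONE]"), which this document does not spell out. Both function names are hypothetical.

```python
import json
from typing import Dict, Iterable, Iterator


def iter_stream_text(lines: Iterable[str]) -> Iterator[str]:
    """Yield text fragments from OpenAI-style SSE lines (assumed wire format)."""
    for line in lines:
        line = line.strip()
        if not line.startswith("data:"):
            continue  # skip blank separator lines and anything that is not data
        data = line[len("data:"):].strip()
        if data == "[DONE]":
            break  # assumed end-of-stream marker
        chunk = json.loads(data)
        for choice in chunk.get("choices", []):
            text = choice.get("delta", {}).get("content")
            if text:
                yield text


def non_stream_text(body: Dict) -> str:
    """Extract the reply text from a non-streaming response body."""
    return body["choices"][0]["message"]["content"]
```

In streaming mode the fragments can be printed as they arrive; joining them gives the same text that a non-streaming call would return in a single message.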

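Supported output modalities

Important

Only text output is currently supported. Llama 4 supports text output only; ERNIE 4.5 will add audio output later.

Quick start

You can pass an image to the model in two ways: as an image URL or as Base64-encoded data. Like text, image content must be sent under the user role, that is, "role": "user". The following are simple examples of calling a visual understanding model.

Prerequisites

Calling the APIs in this document requires API Key authentication. For the full procedure, see Authentication.

Text input

The model accepts plain text as input.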

curl --location 'https://qianfan.bj.baidubce.com/v2/chat/completions' \
--header 'Content-Type: application/json' \
--header 'Authorization: Bearer your-api-key' \
--data '{
    "model": "ernie-4.5-8k-preview",
    "messages": [
        {
            "role": "user",
            "content": [
                {
                    "type": "text",
                    "text": "Recommend a few famous attractions in Shanghai"
                }
            ]
        }
    ],
    "stream":false
}'
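
The same text-only call can be expressed in Python using only the standard library. This is a sketch rather than an official client: the endpoint and payload mirror the curl example above, while the function names build_text_request and send are hypothetical.

```python
import json
import urllib.request
from typing import Dict

# Endpoint copied from the curl example above.
API_URL = "https://qianfan.bj.baidubce.com/v2/chat/completions"


def build_text_request(prompt: str, api_key: str,
                       model: str = "ernie-4.5-8k-preview") -> urllib.request.Request:
    """Build (but do not send) a non-streaming, text-only chat request."""
    payload = {
        "model": model,
        "messages": [
            {"role": "user", "content": [{"type": "text", "text": prompt}]}
        ],
        "stream": False,
    }
    return urllib.request.Request(
        API_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {api_key}",
        },
        method="POST",
    )


def send(request: urllib.request.Request) -> Dict:
    """Send the request and return the decoded JSON response body."""
    with urllib.request.urlopen(request) as response:
        return json.load(response)
```

Calling send(build_text_request("...", api_key)) performs the network request; keeping construction and sending separate makes the payload easy to inspect or log first.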

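Image input

The model accepts images alone as input.

Multiple images can be passed in one request. Input images must meet these requirements:

Each image file must be no larger than 10 MB;
The number of images is limited by the model's combined image-and-text token limit (its maximum input): the total tokens of all images must be less than the model's maximum input.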

curl --location 'https://qianfan.bj.baidubce.com/v2/chat/completions' \
--header 'Content-Type: application/json' \
--header 'Authorization: Bearer your-api-key' \
--data '{
    "model": "ernie-4.5-8k-preview",
    "messages": [
        {
            "role": "user",
            "content": [
                {
                    "type": "image_url",
                    "image_url": {
                        "url": "https://testimage.bj.bcebos.com/image1.jpg"
                    }
                }
            ]
        }
    ],
    "stream": false
}'

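Image + text input

The model accepts images together with text as input.

Multiple images can be passed in one request. Input images must meet these requirements:

Each image file must be no larger than 10 MB;
The number of images is limited by the model's combined image-and-text token limit (its maximum input): the total tokens of all images must be less than the model's maximum input.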

curl --location 'https://qianfan.bj.baidubce.com/v2/chat/completions' \
--header 'Content-Type: application/json' \
--header 'Authorization: Bearer your-api-key' \
--data '{
    "model": "ernie-4.5-8k-preview",
    "messages": [
        {
            "role": "user",
            "content": [
                {
                    "type": "text",
                    "text": "What are in these images? Is there any difference between them?"
                },
                {
                    "type": "image_url",
                    "image_url": {
                        "url": "https://testimage.bj.bcebos.com/image1.jpg"
                    }
                },
                {
                    "type": "image_url",
                    "image_url": {
                        "url": "https://testimage.bj.bcebos.com/image2.png"
                    }
                }
            ]
        }
    ],
    "stream": false
}'
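
The three input combinations (text, image, image + text) differ only in which parts appear in the content array. The helper below is a small sketch that assembles a user message for any of them; the name build_user_message is hypothetical, and the part layout is taken from the curl examples above.

```python
from typing import Dict, List, Optional


def build_user_message(text: Optional[str] = None,
                       image_urls: Optional[List[str]] = None) -> Dict:
    """Build one user message from optional text and zero or more image URLs."""
    content: List[Dict] = []
    if text:
        content.append({"type": "text", "text": text})
    for url in image_urls or []:
        content.append({"type": "image_url", "image_url": {"url": url}})
    if not content:
        raise ValueError("provide text, at least one image URL, or both")
    return {"role": "user", "content": content}
```

For example, build_user_message("What are in these images?", ["https://testimage.bj.bcebos.com/image1.jpg", "https://testimage.bj.bcebos.com/image2.png"]) produces a message with one text part followed by two image parts, as in the request above.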

Search enhancement

Note: only ERNIE 4.5 supports search enhancement.

curl --location 'https://qianfan.bj.baidubce.com/v2/chat/completions' \
--header 'Content-Type: application/json' \
--header 'Authorization: Bearer your-api-key' \
--data '{
    "model": "ernie-4.5-8k-preview",
    "messages": [
        {
            "role": "user",
            "content": [
                {
                    "type": "text",
                    "text": "Who was the champion of the 2024 Olympic table tennis singles event for men?"
                }
            ]
        }
    ],
    "web_search": {
        "enable": true,
        "enable_trace": true
    },
    "stream": false,
    "max_tokens": 512
}'
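
Because search enhancement is limited to ERNIE 4.5, a client may want to refuse to attach web_search for other models. The sketch below builds the same payload as the curl example; the function name build_search_payload and the SEARCH_CAPABLE_MODELS set are hypothetical, and the set lists only the ERNIE 4.5 model named in this document.

```python
from typing import Dict, List

# Per the note above, only ERNIE 4.5 supports search enhancement.
SEARCH_CAPABLE_MODELS = {"ernie-4.5-8k-preview"}


def build_search_payload(model: str, messages: List[Dict],
                         enable_trace: bool = True,
                         max_tokens: int = 512) -> Dict:
    """Build a non-streaming request body with web search switched on."""
    if model not in SEARCH_CAPABLE_MODELS:
        raise ValueError(f"{model} does not support search enhancement")
    return {
        "model": model,
        "messages": messages,
        "web_search": {"enable": True, "enable_trace": enable_trace},
        "stream": False,
        "max_tokens": max_tokens,
    }
```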

Multi-turn conversation

Round 1

Request

curl --location 'https://qianfan.bj.baidubce.com/v2/chat/completions' \
--header 'Content-Type: application/json' \
--header 'Authorization: Bearer your-api-key' \
--data '{
    "model": "ernie-4.5-8k-preview",
    "messages": [
        {
            "role": "user",
            "content": [
                {
                    "type": "text",
                    "text": "What are in these images? Is there any difference between them?"
                },
                {
                    "type": "image_url",
                    "image_url": {
                        "url": "https://testimage.bj.bcebos.com/image1.jpg"
                    }
                },
                {
                    "type": "image_url",
                    "image_url": {
                        "url": "https://testimage.bj.bcebos.com/image2.png"
                    }
                }
            ]
        }
    ],
    "stream": false
}'

Response

{
    "id": "as-t4ntazzewk",
    "object": "chat.completion",
    "created": 1742198333,
    "model": "ernie-4.5-8k-preview",
    "choices": [
        {
            "index": 0,
            "message": {
                "role": "assistant",
                "content": "The first image depicts a serene natural landscape with a wooden boardwalk leading through a lush green field under a blue sky with scattered clouds. The second image shows a close-up of a hand interacting with a small, intricately designed figurine that resembles a turtle with detailed patterns and textures, surrounded by green foliage. The primary difference between the two images is that the first is a wide-angle photograph of an outdoor scene, while the second is a close-up photograph of a hand and a decorative object."
            },
            "finish_reason": "stop",
            "flag": 0
        }
    ],
    "usage": {
        "prompt_tokens": 3019,
        "completion_tokens": 102,
        "total_tokens": 3121
    }
}

Round 2

Request

curl --location 'https://qianfan.bj.baidubce.com/v2/chat/completions' \
--header 'Content-Type: application/json' \
--header 'Authorization: Bearer your-api-key' \
--data '{
    "model": "ernie-4.5-8k-preview",
    "messages": [
        {
            "role": "user",
            "content": [
                {
                    "type": "text",
                    "text": "What are in these images? Is there any difference between them?"
                },
                {
                    "type": "image_url",
                    "image_url": {
                        "url": "https://testimage.bj.bcebos.com/image1.jpg"
                    }
                },
                {
                    "type": "image_url",
                    "image_url": {
                        "url": "https://testimage.bj.bcebos.com/image2.png"
                    }
                }
            ]
        },
        {
            "role": "assistant",
            "content": "The first image depicts a serene natural landscape with a wooden boardwalk leading through a lush green field under a blue sky with scattered clouds. The second image shows a close-up of a hand interacting with a small, intricately designed figurine that resembles a turtle with detailed patterns and textures, surrounded by green foliage. The primary difference between the two images is that the first is a wide-angle photograph of an outdoor scene, while the second is a close-up photograph of a hand and a decorative object."
        },
        {
            "role": "user",
            "content": "Not detailed enough. Please add a bit more detail to the analysis."
        }
    ],
    "stream": false
}'

Response

{
    "id": "as-14hvmej1ts",
    "object": "chat.completion",
    "created": 1742198435,
    "model": "ernie-4.5-8k-preview",
    "choices": [
        {
            "index": 0,
            "message": {
                "role": "assistant",
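                "content": "Of course. Let's look at the differences between these two images in more detail:\n\n1. **Subject and content**:\n   - **First image**: This is a landscape photo showing a natural setting. A wooden boardwalk runs through a green meadow, with blue sky and some trees in the background. The overall mood is peaceful and the colors are vivid, giving a sense of openness and relaxation.\n   - **Second image**: This is a close-up photo showing a hand interacting with a decorative object. The object looks like a turtle with intricate patterns and textures, surrounded by some green plant decorations. The overall feel is more delicate and handcrafted, with rich colors and fine detail.\n\n2. **Perspective and composition**:\n   - **First image**: Uses a wide-angle view that captures a large expanse of natural scenery. The boardwalk extends from the foreground into the background, drawing the viewer's eye into the frame. The sky and the ground are evenly balanced, giving a sense of space and depth.\n   - **Second image**: Uses a close-up view that focuses on the details of the hand and the turtle. The textures of the fingers and the turtle are clearly visible, and the blurred background makes the subject stand out. The composition is tight and rich in detail, drawing attention to the fine craftsmanship of the object.\n\n3. **Color and light**:\n   - **First image**: Green and blue dominate, and the light is natural, so it was probably taken during the day. Sunlight falls on the grass and the boardwalk, creating contrast between light and shadow that adds depth to the image.\n   - **Second image**: The colors are more varied, with the turtle's vivid greens, blues, and yellows, and the light is soft, so it may have been taken indoors. The light is concentrated on the hand and the turtle, bringing out detail and texture.\n\n4. **Emotion and atmosphere**:\n   - **First image**: Conveys a feeling of calm, nature, and relaxation, and suits scenes that showcase natural scenery or outdoor activities.\n   - **Second image**: Conveys a feeling of delicacy, craftsmanship, and focus, and suits scenes that showcase handicrafts or creative work.\n\nOverall, the two images differ markedly in subject, perspective, color, and emotion, showcasing the beauty of a natural landscape and of a handcrafted object respectively."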
            },
            "finish_reason": "stop",
            "flag": 0
        }
    ],
    "usage": {
        "prompt_tokens": 3130,
        "completion_tokens": 394,
        "total_tokens": 3524
    }
}
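
As the two rounds show, the second request resends the whole first-round exchange and then appends the new user message. A minimal sketch of that bookkeeping follows; the function name next_turn is hypothetical, and the response layout is the one shown in the samples above.

```python
from typing import Dict, List


def next_turn(history: List[Dict], response_body: Dict, user_text: str) -> List[Dict]:
    """Return a new message list: history + assistant reply + follow-up user text."""
    reply = response_body["choices"][0]["message"]
    return history + [
        {"role": reply["role"], "content": reply["content"]},
        {"role": "user", "content": user_text},
    ]
```

The function returns a new list instead of mutating history, so the messages of an earlier round stay available for retries or logging.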

Base64-encoded image input

To upload a local image, convert it to Base64 and pass the encoded data. The following example is compatible with the OpenAI API specification.

from openai import OpenAI
import os
import base64

client = OpenAI(
    api_key=os.getenv("OPENAI_API_KEY"),  # API key read from the environment
    base_url="https://qianfan.baidubce.com/v2",
)

# Function to encode the image
def encode_image(image_path):
    with open(image_path, "rb") as image_file:
        return base64.b64encode(image_file.read()).decode("utf-8")


# Path to your image
image_path = "image1.jpg"

# Getting the Base64 string
base64_image = encode_image(image_path)

response = client.chat.completions.create(
    model="ernie-4.5-8k-preview",
    messages=[
        {
            "role": "user",
            "content": [
                {
                    "type": "text",
                    "text": "What is in this image?",
                },
                {
                    "type": "image_url",
                    "image_url": {
                        "url": f"data:image/jpeg;base64,{base64_image}"
                    },
                },
            ],
        }
    ],
    stream=True
)

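# Print the streamed reply fragment by fragment
for chunk in response:
    # Skip chunks that carry no choices or no text content
    if chunk.choices and chunk.choices[0].delta.content is not None:
        print(chunk.choices[0].delta.content, end="")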
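
The example above hardcodes image/jpeg in the data URL, which is only correct for JPEG files. The sketch below derives the format from the file extension instead. Both function names are hypothetical, and mapping "jpg" to "jpeg" is a choice made here because "jpeg" is the registered MIME subtype; this document does not say whether the service also accepts "jpg".

```python
import base64
import os


def bytes_to_data_url(image_bytes: bytes, image_format: str) -> str:
    """Encode raw image bytes as data:image/<format>;base64,<data>."""
    fmt = image_format.lower().lstrip(".")
    if fmt == "jpg":
        fmt = "jpeg"  # registered MIME subtype for JPEG files
    encoded = base64.b64encode(image_bytes).decode("utf-8")
    return f"data:image/{fmt};base64,{encoded}"


def file_to_data_url(image_path: str) -> str:
    """Read a local image and take the format from its file extension."""
    image_format = os.path.splitext(image_path)[1]
    with open(image_path, "rb") as image_file:
        return bytes_to_data_url(image_file.read(), image_format)
```

The result of file_to_data_url("image1.jpg") can be used directly as the "url" value of an image_url part.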

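Image parameter notes

Every model call is stateless, so you must manage the information passed to the model yourself. If the model needs to interpret the same image more than once, pass that image in every request.
Single and multiple images are supported. Each image must be no larger than 10 MB, and the total tokens of all input images must not exceed the model's context length. For example, with ERNIE-4.5 the image input must not exceed 8K tokens.
Image formats:
a. Base64 images: JPG, JPEG, PNG, and BMP. The value must be formatted as data:image/<image format>;base64,<Base64 data>
b. Public image URLs: JPG, JPEG, PNG, BMP, and WEBP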
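
The size and format rules above can be checked before upload. The following is a rough sketch, not an exhaustive validator: the name check_image is hypothetical, "10 MB" is interpreted as 10 MiB (an assumption), and the format sets list only the types named in this document. Token limits are not checked, because this document does not describe how image tokens are counted.

```python
# Limits restated from the notes above.
MAX_IMAGE_BYTES = 10 * 1024 * 1024  # assumption: "10 MB" is read as 10 MiB
BASE64_FORMATS = {"jpg", "jpeg", "png", "bmp"}
URL_FORMATS = BASE64_FORMATS | {"webp"}


def check_image(image_format: str, size_bytes: int, as_base64: bool) -> None:
    """Raise ValueError if the image breaks the documented format or size limits."""
    allowed = BASE64_FORMATS if as_base64 else URL_FORMATS
    fmt = image_format.lower().lstrip(".")
    if fmt not in allowed:
        raise ValueError(f"unsupported format {fmt!r}; allowed: {sorted(allowed)}")
    if size_bytes > MAX_IMAGE_BYTES:
        raise ValueError(f"image is {size_bytes} bytes; limit is {MAX_IMAGE_BYTES}")
```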
Detailed parameter format

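Parameter	Type	Required	Description	Notes
type	string	Yes	Has only one value: image_url
image_url	object	Yes	Input image information	1. Multiple images are supported with no limit on the count; exceeding 8K tokens returns a token-limit error. 2. A single image must be no larger than 10 MB (the size of the image downloaded from the URL, or of the image saved from the Base64 data)
+url	string	Yes	Public URL or Base64 data of the image	1. Supported formats: Base64: JPG, JPEG, PNG, BMP, and similar types; URL: JPG, JPEG, PNG, BMP, WEBP, and similar types. 2. For Base64, the value must be formatted as data:image/<image format>;base64,<Base64 data>
+detail	string	No	Image resolution/quality	low means low resolution; high means high resolution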
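
The table describes url and detail as fields nested under image_url. A small sketch of a builder that follows this layout is shown below; the name image_part is hypothetical, and the accepted detail values are limited to the two listed in the table.

```python
from typing import Dict, Optional


def image_part(url: str, detail: Optional[str] = None) -> Dict:
    """Build one image_url content part; detail is optional (low or high)."""
    if detail not in (None, "low", "high"):
        raise ValueError("detail must be 'low' or 'high'")
    image_url = {"url": url}
    if detail is not None:
        image_url["detail"] = detail
    return {"type": "image_url", "image_url": image_url}
```

Leaving detail out keeps the field absent from the payload, so the service applies whatever default it uses.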

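Error codes

If a request fails, the JSON text returned by the server contains the following parameters.


code	Error code
msg	Error description, which helps you understand and resolve the error
type	Error type

For more error codes, see the error code reference for inference service V2.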
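
A client can turn these fields into an exception. The sketch below assumes that code, msg, and type appear at the top level of the error body and that successful responses carry no code field; neither point is confirmed by this document, and the names QianfanError and raise_for_error are hypothetical.

```python
from typing import Dict


class QianfanError(Exception):
    """Carries the code, msg, and type fields of an error response."""

    def __init__(self, code: str, msg: str, error_type: str) -> None:
        super().__init__(f"[{code}] {msg} (type={error_type})")
        self.code = code
        self.msg = msg
        self.error_type = error_type


def raise_for_error(body: Dict) -> Dict:
    """Return body unchanged, or raise QianfanError if it looks like an error."""
    if "code" in body:  # assumption: only error bodies contain "code"
        raise QianfanError(str(body["code"]), body.get("msg", ""), body.get("type", ""))
    return body
```

Wrapping each decoded response in raise_for_error keeps error handling in one place instead of scattering field checks across call sites.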