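Token Counting for Open-Source Models
Last updated: 2025-05-14
How to get the token length
Qianfan provides a token calculator. Log in to the token calculator page to get the token length of text or images.
Token counting method for open-source models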
- Taking deepseek-v3 as an example, download the model's two token-counting files from Hugging Face:
- tokenizer.json
- tokenizer_config.json
- Create a file named model_tokenizer.py in the same directory as the two downloaded files.
- The code for model_tokenizer.py is as follows:
# pip3 install transformers
# python3 model_tokenizer.py
import transformers

# Directory containing tokenizer.json and tokenizer_config.json
chat_tokenizer_dir = "./"
tokenizer = transformers.AutoTokenizer.from_pretrained(
    chat_tokenizer_dir, trust_remote_code=True
)

# Sample text to tokenize (kept in Chinese so the ids match the output below)
text = "开源模型token计算说明"
result = tokenizer.encode(text)
print("ids:", result)
count = len(result)
print("token count:", count)
- Run model_tokenizer.py. The output is as follows:
ids: [83649, 8842, 33912, 4339, 6977]
token count: 5
- This shows that "开源模型token计算说明" is 5 tokens long, and the output also gives the id of each token. You can look up each id in tokenizer.json to find the string it corresponds to.
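The id lookup described above can also be scripted with the standard library. The following is a minimal sketch that assumes tokenizer.json follows the common Hugging Face BPE layout (a token-to-id map under model.vocab plus a separate added_tokens list). The helper names build_id_to_token, lookup_ids, and load_tokenizer_json are illustrative and not part of any library.

```python
import json
from typing import Dict, List


def build_id_to_token(tokenizer_json: dict) -> Dict[int, str]:
    """Invert the token -> id vocabulary of a tokenizer.json-style dict."""
    id_to_token: Dict[int, str] = {}
    # Assumption: BPE layout, i.e. {"model": {"vocab": {token: id, ...}}}.
    vocab = tokenizer_json.get("model", {}).get("vocab", {})
    for token, token_id in vocab.items():
        id_to_token[token_id] = token
    # Assumption: special tokens are listed under "added_tokens"
    # as objects carrying "id" and "content" fields.
    for entry in tokenizer_json.get("added_tokens", []):
        id_to_token[entry["id"]] = entry["content"]
    return id_to_token


def lookup_ids(tokenizer_json: dict, ids: List[int]) -> List[str]:
    """Return the stored token string for each id, or "<unknown>" if absent."""
    table = build_id_to_token(tokenizer_json)
    return [table.get(token_id, "<unknown>") for token_id in ids]


def load_tokenizer_json(path: str = "tokenizer.json") -> dict:
    """Read tokenizer.json from disk as a plain dict."""
    with open(path, encoding="utf-8") as f:
        return json.load(f)
```

For instance, lookup_ids(load_tokenizer_json(), [83649, 8842]) would return the strings stored for those two ids. Byte-level BPE vocabularies usually store tokens in a byte-to-character encoding, so non-ASCII tokens may not look like readable text in the file; calling tokenizer.decode([token_id]) on the loaded tokenizer is likely the easier way to see the original characters.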
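Token counting for complex inputs
- When the input contains a multi-turn conversation together with tools definitions, counting tokens requires the chat_template defined in tokenizer_config.json. The request is first rendered into a single prompt string by that template, and the tokens are counted over the rendered string.
- Taking qwen3-8b as an example, the input is as follows: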
{
  "model": "qwen3-8b",
  "messages": [
    {
      "role": "user",
      "content": "查一下上海和北京现在的天气"
    }
  ],
  "tools": [{
    "type": "function",
    "function": {
      "name": "get_current_weather",
      "description": "天气查询工具",
      "parameters": {
        "properties": {
          "location": {
            "description": "地理位置,精确到区县级别",
            "type": "string"
          },
          "time": {
            "description": "时间,格式为YYYY-MM-DD",
            "type": "string"
          }
        },
        "type": "object"
      }
    }
  }],
  "stream": false,
  "enable_thinking": false,
  "tool_choice": "auto",
  "tool_options": {"thoughts_output": true}
}
- After conversion by chat_template, the input becomes the following structure:
<|im_start|>system
# Tools

You may call one or more functions to assist with the user query.

You are provided with function signatures within <tools></tools> XML tags:
<tools>
{"type": "function", "function": {"name": "get_current_weather", "description": "天气查询工具", "parameters": {"properties": {"location": {"description": "地理位置,精确到区县级别", "type": "string"}, "time": {"description": "时间,格式为YYYY-MM-DD", "type": "string"}}, "type": "object"}}}
</tools>

For each function call, return a json object with function name and arguments within <tool_call></tool_call> XML tags:
<tool_call>
{"name": <function-name>, "arguments": <args-json-object>}
</tool_call><|im_end|>
<|im_start|>user
查一下上海和北京现在的天气<|im_end|>
<|im_start|>assistant
<think>

</think>
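- The code for computing the token length of the text above is as follows:
# pip3 install transformers
# python3 model_tokenizer.py
import transformers

# Directory containing the qwen3-8b tokenizer.json and tokenizer_config.json
chat_tokenizer_dir = "./"
tokenizer = transformers.AutoTokenizer.from_pretrained(
    chat_tokenizer_dir, trust_remote_code=True
)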
text = """<|im_start|>system
# Tools

You may call one or more functions to assist with the user query.

You are provided with function signatures within <tools></tools> XML tags:
<tools>
{"type": "function", "function": {"name": "get_current_weather", "description": "天气查询工具", "parameters": {"properties": {"location": {"description": "地理位置,精确到区县级别", "type": "string"}, "time": {"description": "时间,格式为YYYY-MM-DD", "type": "string"}}, "type": "object"}}}
</tools>

For each function call, return a json object with function name and arguments within <tool_call></tool_call> XML tags:
<tool_call>
{"name": <function-name>, "arguments": <args-json-object>}
</tool_call><|im_end|>
<|im_start|>user
查一下上海和北京现在的天气<|im_end|>
<|im_start|>assistant
<think>

</think>"""
result = tokenizer.encode(text)
print("ids:", result)
count = len(result)
print("token count:", count)
- Run model_tokenizer.py. The output is as follows:
ids: [151644, 8948, 198, 2, 13852, 271, 2610, 1231, 1618, 825, 476, 803, 5746, 311, 7789, 448, 279, 1196, 3239, 382, 2610, 525, 3897, 448, 729, 32628, 2878, 366, 15918, 1472, 15918, 29, 11874, 9492, 510, 27, 15918, 397, 4913, 1313, 788, 330, 1688, 497, 330, 1688, 788, 5212, 606, 788, 330, 455, 11080, 69364, 497, 330, 4684, 788, 330, 104307, 51154, 102011, 497, 330, 13786, 788, 5212, 13193, 788, 5212, 2527, 788, 5212, 4684, 788, 330, 111692, 3837, 108639, 26939, 23836, 24342, 105972, 497, 330, 1313, 788, 330, 917, 14345, 330, 1678, 788, 5212, 4684, 788, 330, 20450, 3837, 68805, 17714, 28189, 18506, 40175, 497, 330, 1313, 788, 330, 917, 9207, 2137, 330, 1313, 788, 330, 1700, 30975, 532, 522, 15918, 1339, 2461, 1817, 729, 1618, 11, 470, 264, 2951, 1633, 448, 729, 829, 323, 5977, 2878, 220, 151657, 151658, 11874, 9492, 510, 151657, 198, 4913, 606, 788, 366, 1688, 11494, 8066, 330, 16370, 788, 366, 2116, 56080, 40432, 31296, 151658, 151645, 198, 151644, 872, 198, 32876, 100158, 100633, 33108, 68990, 104718, 104307, 151645, 198, 151644, 77091, 198, 151667, 271, 151668]
token count: 181
- The user input is 181 tokens long, which matches the token length returned with the model's inference result.
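The manual render-then-encode steps above could also be scripted so that the rendered text does not have to be pasted by hand. The sketch below assumes a tokenizer object loaded with transformers.AutoTokenizer.from_pretrained that exposes apply_chat_template and encode; the function name count_chat_tokens is illustrative. Extra keyword arguments such as enable_thinking are forwarded to the template and only take effect if the model's chat_template reads them.

```python
from typing import Any, Dict, List, Optional


def count_chat_tokens(
    tokenizer: Any,
    messages: List[Dict[str, Any]],
    tools: Optional[List[Dict[str, Any]]] = None,
    **template_kwargs: Any,
) -> int:
    """Render a chat request with the tokenizer's chat_template, then count tokens."""
    # Step 1: render messages (and tool definitions) into one prompt string.
    text = tokenizer.apply_chat_template(
        messages,
        tools=tools,
        tokenize=False,
        add_generation_prompt=True,  # append the assistant turn header
        **template_kwargs,           # e.g. enable_thinking=False (template-specific)
    )
    # Step 2: encode the rendered string, as in model_tokenizer.py.
    return len(tokenizer.encode(text))


# Hypothetical usage, with the request body from this section loaded as `request`:
#   import transformers
#   tokenizer = transformers.AutoTokenizer.from_pretrained("./")
#   count_chat_tokens(tokenizer, request["messages"],
#                     tools=request["tools"], enable_thinking=False)
```

The text rendered by the template may differ slightly from a hand-copied version, for example in trailing whitespace, so the count can be off by a token or so from the figure shown above.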