Description
GPT-5 analysis:
Between 04:21 and 04:26 the log keeps showing vLLM errors of the form "maximum context length is 20480 tokens, but the request contains 22k to 61k tokens", and every call returns 400. (See the repeated ValueError blocks in AAAA_8mcp_QwenAgent_8B_1203test1.log.)
These over-long requests come from the current Hydra configuration: data.max_prompt_length=12288 plus data.max_response_length=8192, on top of which the multi-turn hermes template, system prompt, etc. push the actual token count far beyond 20480. At the same time, data.truncation=error means nothing is truncated or dropped, so the over-long samples are sent to vLLM as-is.
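Back-of-the-envelope arithmetic makes the overflow obvious; the template-overhead figure below is purely an illustrative assumption:

```python
# Token budget check: the configured prompt + response budget already
# fills the entire context window, leaving zero headroom.
MAX_MODEL_LEN = 20480        # vLLM's maximum context length
max_prompt_length = 12288    # data.max_prompt_length
max_response_length = 8192   # data.max_response_length

budget = max_prompt_length + max_response_length
print(budget)                # 20480 == MAX_MODEL_LEN exactly

# Any tokens added by the hermes chat template, the system prompt, or
# accumulated multi-turn history (the figure below is made up) push the
# request over the limit, and vLLM rejects it with a 400:
template_overhead = 1500
print(budget + template_overhead > MAX_MODEL_LEN)   # True
```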
After each 400, AgentLightning still marks the rollout as "completed", but its reward/triplet are None (hence the many warnings in the log). The same task then gets re-claimed and run again, and both the old failed result and the new successful one can linger in the server's result cache.
AgentModeDaemon._async_run_until_finished never cleans up stale results on the server side; it exits as soon as len(_completed_rollouts) >= _total_tasks_queued. So when an old failed result is fetched back during the next sampling round, the local counter can reach states like "Completed 33/32 tasks", which eventually trips the assertion in get_train_data_batch. (agentlightning/verl/daemon.py)
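A minimal sketch of the failure mode; the names mirror the report, but the loop is an assumption about the flow, not the daemon's actual code:

```python
# One stale rollout left over from the previous round is never purged.
completed_rollouts = {"stale-rollout-from-previous-round": None}

# The new round queues and finishes 32 tasks normally.
total_tasks_queued = 32
for i in range(total_tasks_queued):
    completed_rollouts[f"rollout-{i}"] = {"reward": 0.0}

# 33 >= 32, so the wait loop exits with one extra entry ...
print(f"Completed {len(completed_rollouts)}/{total_tasks_queued} tasks")

# ... and the strict equality check in get_train_data_batch blows up,
# exactly like daemon.py line 419 in the traceback below.
assert len(completed_rollouts) == total_tasks_queued  # AssertionError
```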
Conclusion
The root cause of the assertion failure: over-long prompts/responses made vLLM return 400 repeatedly, the old failed rollouts were never cleaned up and got mixed in with the next round's tasks, and _completed_rollouts ended up larger than _total_tasks_queued.
Suggestions
Trim at the token level before the request ever reaches vLLM: set data.truncation to truncate (or truncate explicitly in the proxy layer) so that prompt_len + response_len <= 20480. If necessary, lower max_prompt_length/max_response_length or rewrite the hermes template to keep the token count in check; see the sketch below.
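A minimal proxy-layer truncation sketch, assuming a Hugging Face tokenizer is available; clamp_prompt and the model id are illustrative, not AgentLightning or vLLM API:

```python
from transformers import AutoTokenizer

MAX_MODEL_LEN = 20480  # vLLM's maximum context length

# Illustrative model id; use whatever tokenizer matches the served model.
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-7B-Instruct")

def clamp_prompt(prompt: str, max_response_length: int = 8192) -> str:
    """Hypothetical helper: truncate the prompt so that
    prompt_len + max_response_length never exceeds MAX_MODEL_LEN."""
    budget = MAX_MODEL_LEN - max_response_length
    ids = tokenizer.encode(prompt)
    if len(ids) <= budget:
        return prompt
    # Keep the most recent tokens; the oldest multi-turn history at the
    # front of the prompt is usually the safest part to drop.
    return tokenizer.decode(ids[-budget:])
```

In practice it is cleaner to drop whole oldest turns from the message list rather than raw tokens, so the chat template stays well-formed; the sketch only shows where the budget check has to happen.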
In clear_data_and_server, call a server-side cleanup endpoint (or add one) to discard the previous round's leftover _completed_rollouts; alternatively, filter the surplus rollouts before get_train_data_batch and keep only the ids present in this round's _task_id_to_original_sample, so the lengths can no longer diverge. A sketch of the filtering option follows.
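A hedged sketch of that filter; the attribute names come from the report, but the per-rollout task_id field is an assumption about the data layout:

```python
def filter_current_round(completed_rollouts: dict, task_id_to_original_sample: dict) -> dict:
    """Keep only rollouts whose task id was queued in this round,
    dropping anything left over from a previous sampling round."""
    current_ids = set(task_id_to_original_sample)
    return {
        rollout_id: rollout
        for rollout_id, rollout in completed_rollouts.items()
        if rollout["task_id"] in current_ids  # assumes each rollout records its task_id
    }
```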
If the run has to be kept alive in the meantime, the assertion can temporarily be downgraded to a logged warning that drops the surplus rollouts (sketch below), but the real fix is still to cap the context length so these failed tasks never arise in the first place.
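What that temporary guard could look like in place of the assertion at daemon.py line 419; purely illustrative, written as a standalone function rather than the daemon's real method:

```python
import logging

logger = logging.getLogger(__name__)

def check_rollout_count(completed_rollouts: dict, total_tasks_queued: int) -> None:
    """Sketch: warn instead of asserting when stale rollouts inflate the
    count; the surplus entries would then be removed with a filter like
    filter_current_round above."""
    if len(completed_rollouts) != total_tasks_queued:
        logger.warning(
            "Completed %d rollouts but only queued %d tasks; dropping extras.",
            len(completed_rollouts),
            total_tasks_queued,
        )
```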
Log:
(PatchedvLLMServer pid=2403114) ERROR 12-03 04:26:37 [serving_chat.py:222]   File "lib/python3.10/site-packages/vllm/entrypoints/openai/serving_engine.py", line 499, in _normalize_prompt_text_to_input
(PatchedvLLMServer pid=2403114) ERROR 12-03 04:26:37 [serving_chat.py:222]     return self._validate_input(request, input_ids, input_text)
(PatchedvLLMServer pid=2403114) ERROR 12-03 04:26:37 [serving_chat.py:222]   File "lib/python3.10/site-packages/vllm/entrypoints/openai/serving_engine.py", line 563, in _validate_input
(PatchedvLLMServer pid=2403114) ERROR 12-03 04:26:37 [serving_chat.py:222]     raise ValueError(
(PatchedvLLMServer pid=2403114) ERROR 12-03 04:26:37 [serving_chat.py:222] ValueError: This model's maximum context length is 20480 tokens. However, you requested 49088 tokens in the messages, Please reduce the length of the messages.
(TaskRunner pid=2397748) Warning: Reward is None for rollout rollout-ff673b89-b65c-47e6-a77a-125ca8770652, will be auto-set to 0.0.
(TaskRunner pid=2397748) Warning: Triplet is None for rollout rollout-ff673b89-b65c-47e6-a77a-125ca8770652.
(TaskRunner pid=2397748) Completed 33/32 tasks...
(TaskRunner pid=2397748) INFO: 127.0.0.1:47154 - "GET /task HTTP/1.1" 200 OK
(TaskRunner pid=2397748) INFO: 127.0.0.1:47162 - "GET /task HTTP/1.1" 200 OK
(TaskRunner pid=2397748) INFO: 127.0.0.1:47172 - "GET /task HTTP/1.1" 200 OK
(TaskRunner pid=2397748) INFO: 127.0.0.1:47188 - "GET /task HTTP/1.1" 200 OK
(TaskRunner pid=2397748) INFO: 127.0.0.1:47200 - "GET /task HTTP/1.1" 200 OK
(TaskRunner pid=2397748) INFO: 127.0.0.1:47210 - "GET /task HTTP/1.1" 200 OK
(TaskRunner pid=2397748) INFO: 127.0.0.1:47212 - "GET /task HTTP/1.1" 200 OK
(TaskRunner pid=2397748) INFO: 127.0.0.1:47226 - "GET /task HTTP/1.1" 200 OK
(TaskRunner pid=2397748) All tasks finished.
Traceback (most recent call last):
  File "lib/python3.10/runpy.py", line 196, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "lib/python3.10/runpy.py", line 86, in _run_code
    exec(code, run_globals)
  File "lib/python3.10/site-packages/agentlightning/verl/main.py", line 4, in <module>
    main()
  File "lib/python3.10/site-packages/hydra/main.py", line 94, in decorated_main
    _run_hydra(
  File "/lib/python3.10/site-packages/hydra/_internal/utils.py", line 394, in _run_hydra
    _run_app(
  File "/lib/python3.10/site-packages/hydra/_internal/utils.py", line 457, in _run_app
    run_and_report(
  File "/lib/python3.10/site-packages/hydra/_internal/utils.py", line 223, in run_and_report
    raise ex
  File "/lib/python3.10/site-packages/hydra/_internal/utils.py", line 220, in run_and_report
    return func()
  File "/lib/python3.10/site-packages/hydra/_internal/utils.py", line 458, in <lambda>
    lambda: hydra.run(
  File "/lib/python3.10/site-packages/hydra/_internal/hydra.py", line 132, in run
    _ = ret.return_value
  File "/lib/python3.10/site-packages/hydra/core/utils.py", line 260, in return_value
    raise self._return_value
  File "/lib/python3.10/site-packages/hydra/core/utils.py", line 186, in run_job
    ret.return_value = task_function(task_cfg)
  File "/lib/python3.10/site-packages/agentlightning/verl/entrypoint.py", line 12, in main
    run_ppo(config)
  File "/lib/python3.10/site-packages/agentlightning/verl/entrypoint.py", line 26, in run_ppo
    ray.get(runner.run.remote(config))
  File "/lib/python3.10/site-packages/ray/_private/auto_init_hook.py", line 22, in auto_init_wrapper
    return fn(*args, **kwargs)
  File "/lib/python3.10/site-packages/ray/_private/client_mode_hook.py", line 104, in wrapper
    return func(*args, **kwargs)
  File "/lib/python3.10/site-packages/ray/_private/worker.py", line 2849, in get
    values, debugger_breakpoint = worker.get_objects(object_refs, timeout=timeout)
  File "/lib/python3.10/site-packages/ray/_private/worker.py", line 937, in get_objects
    raise value.as_instanceof_cause()
ray.exceptions.RayTaskError(AssertionError): ray::TaskRunner.run() (pid=2397748, ip=178.28.32.225, actor_id=46872e1f62fadf329947f7ec02000000, repr=<agentlightning.verl.entrypoint.TaskRunner object at 0x7f7da5245000>)
  File "/lib/python3.10/site-packages/agentlightning/verl/entrypoint.py", line 152, in run
    trainer.fit()
  File "/lib/python3.10/site-packages/agentlightning/verl/trainer.py", line 353, in fit
    metrics = self._train_step(batch_dict)
  File "/lib/python3.10/site-packages/agentlightning/verl/trainer.py", line 95, in _train_step
    batch, agent_metrics = self.agent_mode_daemon.get_train_data_batch(
  File "/lib/python3.10/site-packages/agentlightning/verl/daemon.py", line 419, in get_train_data_batch
    assert len(self._completed_rollouts) == self._total_tasks_queued
AssertionError