- Doctor / patient simulations.
- Multi-step DeepResearch pipelines.
- Adversarial games (proposer / solver).
Prerequisites
- You’ve completed the Qwen3-30B-A3B recipe (the example uses that model).
- Familiar with Customization.
Files
Quick start
Configuration
Launch script highlights
--rollout-max-context-len— total context budget across all turns. Larger than--rollout-max-response-lenbecause we accumulate.--global-batch-size 256 = 32 × 8— matches the rollout invariant.
Walkthrough — the agent loop
The shipped pipeline is solver → rewriter → selector:num_parallel solver
attempts run in parallel, each rewriter rewrites the previous solutions, and a
SelectorAgent picks one. Sampling params are set on args upstream by the rollout
helper, so run_agent_system only takes (args, sample).
agent_system.py
solver_worker, rewrite_worker, and
SelectorAgent.select all post to the same engine, just with different prompts. So
both agents are the same model updating in lockstep. For architecturally distinct
agents (separate models), see the MrlX repo.
Walkthrough — rollout integration
rollout_with_multi_agents.py exposes generate_with_multi_agents(args, sample, sampling_params, evaluation=False). Internally it:
- Sets
args.sampling_params = sampling_paramsandargs.tokenizer, then loads the custom multi-agent function fromargs.custom_multi_agent_function_path. - Calls
await custom_multi_agent_func(args, sample)to get the list of samples produced by the solver / rewriter / selector pipeline. - Returns the shuffled list of
Samples for the trainer to pack.
solver_worker /
rewrite_worker / SelectorAgent.select (which call batched_async_rm), so the
rollout integration itself is a thin wrapper.
Tuning knobs
| Knob | Effect |
|---|---|
MAX_TURNS | Conversation depth — longer = more context = slower |
incorrect_reward_weight / correct_reward_weight | Asymmetric shaping |
num_parallel | Rollouts per prompt running concurrently |
--rollout-max-context-len | Stops the conversation when budget is hit |
What to watch
loss_mask/role_split is heavily skewed, one role is dominating — typically the
verifier becomes verbose. Tighten its system prompt or reduce its max_tokens.
Troubleshooting
| Symptom | Fix |
|---|---|
| OOM mid-rollout | Reduce MAX_TURNS or --rollout-max-context-len |
| Both agents repeat each other | Verifier prompt is too permissive — make it adversarial |
| Reward never moves | Check that <final_answer> extraction matches the verifier output |
| Rollout much slower than baseline | Per-turn SGLang RTT × MAX_TURNS — consider async rollout |
Variations
VLM multi-turn
Replacecall_role with a VLM-aware caller that includes images in messages. Miles
supports VLM multi-turn natively — same pattern, just multimodal_train_inputs in the
sample dict (see Customization #13).

