Prerequisites
- radixark/miles:latest container.
- Either a serper.dev API key (Google search backend) or ~135 GB free disk for the local Wikipedia retriever (see appendix).
- You completed Customization — this example uses a custom rollout function and reward.
Files
Quick start
1. Set up environment
2. Prepare data
3. Convert the model
4. Run
Configuration
Open generate_with_search.py and edit SEARCH_R1_CONFIGS.
Walkthrough — multi-turn rollout
The custom rollout lives in generate_with_search.py:generate. The loop is
straightforward, but every step matters.
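As a rough sketch of the shape of such a loop (not the actual `generate` implementation), the code below alternates between a model step and a tool step until an answer appears or the turn budget runs out. `fake_policy`, `fake_search`, `MAX_TURNS`, and the `<search>` / `<information>` tag names are hypothetical stand-ins chosen for illustration; only the `<answer>` tag comes from this page.

```python
import re

MAX_TURNS = 3  # hypothetical; the real value is configured in SEARCH_R1_CONFIGS


def fake_policy(context: str) -> str:
    """Hypothetical stand-in for the model: search once, then answer."""
    if "<information>" not in context:
        return "<search>capital of France</search>"
    return "<answer>Paris</answer>"


def fake_search(query: str) -> str:
    """Hypothetical stand-in for the retriever."""
    return f"<information>Result for: {query}</information>"


def rollout(prompt: str):
    """Return a list of (text, trainable) segments for one trajectory."""
    segments = []
    context = prompt
    for _ in range(MAX_TURNS):
        action = fake_policy(context)
        segments.append((action, True))  # model-generated text: trained on
        context += action
        if "<answer>" in action:
            break  # a final answer ends the episode
        match = re.search(r"<search>(.*?)</search>", action, re.DOTALL)
        if match is None:
            break  # neither a tool call nor an answer: stop early
        observation = fake_search(match.group(1).strip())
        segments.append((observation, False))  # tool output: masked out later
        context += observation
    return segments


segments = rollout("Question: What is the capital of France?\n")
for text, trainable in segments:
    print(trainable, text)
```

Keeping the trainable flag next to each segment at generation time makes it harder for the mask and the text to fall out of step later.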
The two crucial details
- Loss masking. Tool/observation tokens get loss_mask=0. Without this, the model learns to predict the search results, which is both wrong and wildly unhelpful.
- Tokenization alignment. The model must see and the trainer must score the exact same tokens. Pre-tokenizing vs. re-tokenizing at training time can drift; that’s where the chat template verifier matters.
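Both details can be seen in a toy setting. The sketch below uses a made-up greedy tokenizer (`toy_tokenize` and `VOCAB` are illustrative, not part of any real library) to build a per-token loss mask segment by segment, then re-tokenizes the concatenated text to show how token boundaries can shift.

```python
VOCAB = ["ab", "a", "b", "c"]  # toy vocabulary, longest entries first


def toy_tokenize(text: str) -> list[str]:
    """Greedy longest-match tokenizer over VOCAB (illustrative only)."""
    tokens, i = [], 0
    while i < len(text):
        for piece in VOCAB:
            if text.startswith(piece, i):
                tokens.append(piece)
                i += len(piece)
                break
        else:
            raise ValueError(f"cannot tokenize {text[i:]!r}")
    return tokens


def build_tokens_and_mask(segments):
    """Tokenize each (text, trainable) segment once; the mask follows the tokens."""
    tokens, loss_mask = [], []
    for text, trainable in segments:
        piece_tokens = toy_tokenize(text)
        tokens.extend(piece_tokens)
        loss_mask.extend([1 if trainable else 0] * len(piece_tokens))
    return tokens, loss_mask


# Model text "ca", tool output "b", model text "c".
segments = [("ca", True), ("b", False), ("c", True)]
tokens, loss_mask = build_tokens_and_mask(segments)

# Re-tokenizing the joined string merges "a" + "b" across the segment boundary.
retokenized = toy_tokenize("".join(text for text, _ in segments))

print(tokens, loss_mask)  # 4 tokens, 4 mask entries
print(retokenized)        # 3 tokens: the mask no longer lines up
```

In this toy case the merged token straddles a trainable and a masked segment, so no mask of the original length can be applied to the re-tokenized sequence. Real subword tokenizers can drift in the same way, which is why the tokens produced at rollout time should be the ones scored at training time.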
Walkthrough — reward
format_score=0.2 gives partial credit for the correct <answer>... shape even if the
content is wrong — keeps gradient flowing during early exploration.
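A minimal sketch of a reward with this shape is shown below. It assumes exact match after a simple normalization and a regex-based answer extractor; the function names and the normalization rules are illustrative assumptions, and the example's actual reward code may differ.

```python
import re
import string

FORMAT_SCORE = 0.2  # partial credit, matching the value quoted above


def normalize(text: str) -> str:
    """Lowercase, drop ASCII punctuation, and collapse whitespace."""
    text = text.lower()
    text = text.translate(str.maketrans("", "", string.punctuation))
    return " ".join(text.split())


def extract_answer(response: str):
    """Return the content of the last <answer>...</answer> block, or None."""
    matches = re.findall(r"<answer>(.*?)</answer>", response, re.DOTALL)
    return matches[-1] if matches else None


def reward(response: str, gold_answers) -> float:
    answer = extract_answer(response)
    if answer is None:
        return 0.0  # no well-formed answer block
    if normalize(answer) in {normalize(g) for g in gold_answers}:
        return 1.0  # exact match after normalization
    return FORMAT_SCORE  # right shape, wrong content


print(reward("<answer> Paris. </answer>", ["paris"]))
print(reward("<answer>Lyon</answer>", ["Paris"]))
print(reward("Paris", ["Paris"]))
```

The three calls exercise the three branches: a correct answer, a well-formed but wrong answer, and a response with no answer block.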
Enabling TIS
The trajectory mixes model tokens (we want gradients) with tool tokens (we don’t). Without correction, the implicit policy ratio in the GRPO objective is off-policy: the search results came from a stochastic environment, not the model. Truncated Importance Sampling (TIS) corrects for this. To enable:
- Set "return_logprob": True in SEARCH_R1_CONFIGS.
- Uncomment the TIS flags in run_qwen2.5_3B.sh.

When return_logprob=True, response post-processing is automatically disabled to keep
token / logp alignment.
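For intuition, truncated importance sampling is commonly written as a per-token weight `min(exp(logp_train - logp_rollout), C)`. The sketch below computes those weights in plain Python; `tis_weights` and `clip_c` are hypothetical names, and the framework's own implementation and flag names may differ.

```python
import math


def tis_weights(train_logps, rollout_logps, loss_mask, clip_c=2.0):
    """Per-token truncated importance weights.

    train_logps:   log-probs of the sampled tokens under the training policy
    rollout_logps: log-probs of the same tokens recorded at rollout time
    loss_mask:     1 for model tokens, 0 for tool/observation tokens
    clip_c:        upper truncation bound (hypothetical default)
    """
    weights = []
    for lp_train, lp_roll, m in zip(train_logps, rollout_logps, loss_mask):
        if m == 0:
            weights.append(0.0)  # masked tokens contribute nothing
            continue
        ratio = math.exp(lp_train - lp_roll)
        weights.append(min(ratio, clip_c))  # truncate large ratios
    return weights


weights = tis_weights(
    train_logps=[-1.0, -1.0, 0.0],
    rollout_logps=[-1.0, -3.0, -0.5],
    loss_mask=[1, 1, 0],
)
print(weights)
```

The weights only make sense if the rollout log-probs index the same tokens as the training log-probs, which is why token / logp alignment has to be preserved.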
What to watch
If tis/effective_sample_size collapses below 0.5, your inference distribution has
drifted too far. Lower --lr or shorten max_turns.
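How the `tis/effective_sample_size` metric is computed is not spelled out here. One common definition is the Kish effective sample size of the importance weights divided by the number of samples, which yields a value between 0 and 1. The sketch below assumes that definition; `normalized_ess` is an illustrative name.

```python
def normalized_ess(weights):
    """Kish effective sample size divided by the sample count, in [0, 1]."""
    n = len(weights)
    total = sum(weights)
    total_sq = sum(w * w for w in weights)
    if n == 0 or total_sq == 0.0:
        return 0.0
    return (total * total) / (n * total_sq)


print(normalized_ess([1.0, 1.0, 1.0, 1.0]))  # uniform weights
print(normalized_ess([1.0, 0.0, 0.0, 0.0]))  # one token carries all the weight
```

Under this definition, uniform weights give the maximum value and the value falls as a few tokens come to dominate the total weight.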
Tuning knobs
| Knob | Effect |
|---|---|
| max_turns | More turns = more retrieval, more drift |
| topk | More retrieved snippets = longer context |
| search_concurrency | Cap on simultaneous tool calls (mind your QPS limit) |
| format_score | Partial credit for correct shape; higher = faster early shaping |
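A concurrency cap like search_concurrency is typically enforced with a semaphore around the tool call. The sketch below illustrates the idea with `asyncio.Semaphore`; `SEARCH_CONCURRENCY`, `fake_search`, and `search_all` are hypothetical, and it is not meant to mirror how the example wires this up.

```python
import asyncio

SEARCH_CONCURRENCY = 2  # hypothetical value for illustration


async def fake_search(query, semaphore, stats):
    """Stand-in for a search call; tracks how many calls run at once."""
    async with semaphore:  # at most SEARCH_CONCURRENCY callers get past here
        stats["in_flight"] += 1
        stats["peak"] = max(stats["peak"], stats["in_flight"])
        await asyncio.sleep(0.01)  # simulated network latency
        stats["in_flight"] -= 1
        return f"results for {query}"


async def search_all(queries):
    semaphore = asyncio.Semaphore(SEARCH_CONCURRENCY)
    stats = {"in_flight": 0, "peak": 0}
    results = await asyncio.gather(
        *(fake_search(q, semaphore, stats) for q in queries)
    )
    return results, stats["peak"]


results, peak = asyncio.run(search_all([f"q{i}" for i in range(5)]))
print(len(results), "results; peak concurrent calls:", peak)
```

The peak count never exceeds the cap, no matter how many rollouts request a search at once, which is what keeps a rate-limited backend within its QPS budget.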
Troubleshooting
| Problem | Fix |
|---|---|
| "Ray process stuck" | rm -rf /root/.cache, then rm -rf /root/.* if still stuck |
| Retriever 502 errors | lsof -i :8000 — make sure your local server is alive |
| Conda activation collisions | Deactivate the retriever env before launching training |
| EM stays at 0 | Check the answer extractor — most often a regex mismatch |
| Loss masks shifted by one token | Tokenizer added a leading space; align with add_special_tokens=False |
Variations
- Use Google instead of local. Set "search_backend": "google" and add an API key.
- Different tool. Replace search_backend with anything else: calculator, code exec, internal API. The pattern is identical.
- Group RM. With multiple trajectories per prompt (GRPO), enable --group-rm so rewards are computed in a batch.
- Longer chains. Bump max_turns to 8+ for deep-reasoning tasks. Watch loss_mask/observation_fraction; if it dominates, the model is barely training.
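To check the same quantity offline, the fraction of masked tokens can be computed directly from a trajectory's loss mask. The sketch below assumes the metric is simply the share of response tokens with loss_mask=0; the logged metric may be defined differently.

```python
def observation_fraction(loss_mask):
    """Share of tokens that are masked out (loss_mask == 0)."""
    if not loss_mask:
        return 0.0
    masked = sum(1 for m in loss_mask if m == 0)
    return masked / len(loss_mask)


print(observation_fraction([1, 1, 0, 1]))  # mostly model tokens
print(observation_fraction([0, 0, 0, 1]))  # mostly tool output
```

A value close to 1 means almost every token in the trajectory is tool output, so very few tokens contribute to the gradient.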

