Aligning precision
The single most common class of bug. Walk through these checks before anything else.First training step
Is the rollout coherent?
If the very first rollout is gibberish:- Parameters didn’t load. Check Megatron’s logs — there should be a clear
“loaded checkpoint from …” line. If absent, fix
--load/--ref-load. - Parameter mapping is wrong. When
pp_size > 1, the second-stage layer IDs are a common offset bug. Save all parameters inload_weightsof the corresponding model in SGLang and verify they match the loaded checkpoint exactly. - SGLang dropped buffers. Some special buffers can be released during the parameter release process. Check if they’re re-loaded correctly after weight sync.
- Pretrained vs. instruct. Try the instruct version of the same architecture. If that works, your pretrained model + chat template combination is wrong.
Are log_probs and ref_log_probs exactly equal in step 1?
They should be: at step 1 the actor and reference are the same weights. If KL > 0:
- Non-deterministic kernels. Some Transformer Engine versions need
--attention-backend flashto force deterministic Flash Attention under context parallel. - Slightly off values (KL < 1e-4). Acceptable — kernel-level numerical jitter.
- Large values (KL > 1). Configuration error. Re-check parallelism and precision.
- Slightly elevated logp on instruct (~0.8 per token). Almost always a chat-template mismatch — your prompts don’t match the format the model was trained on. Run the chat template verifier.
Is grad_norm reasonable?
Step 1 with num_steps_per_rollout=1 should produce a tiny gradient. If it doesn’t:
- Megatron / TE bug. MoE in particular requires
--moe-permute-fusion. Check release notes for known fixes. - Reward signal is broken. Confirm rewards are computed from the right key (often a
--label-keytypo).
Second training step
If step 2 OOMs in colocate mode, Megatron’s offload→reload cycle is hitting the SGLang side. Lower--sglang-mem-fraction-static to 0.7 (or 0.6).
Separate train/rollout debugging
Miles ships flags to isolate one side:| Flag | What |
|---|---|
--debug-rollout-only | Don’t init Megatron — only spin up SGLang. |
--debug-train-only | Don’t init SGLang — only spin up Megatron. |
--save-debug-rollout-data /path/data_{rollout_id}.pt | Pickle every rollout to disk. |
--load-debug-rollout-data /path/data_{rollout_id}.pt | Replay rollouts from disk; auto-sets --debug-train-only. |
- Run with
--debug-rollout-only --save-debug-rollout-datato capture a few well-formed rollouts. - Switch to
--debug-train-only --load-debug-rollout-datato iterate on training changes (parallelism, optimizer, custom loss) with fixed inputs. Removes rollout randomness from your bisect.
Determinism for bisecting
When you need to A/B test a code change, bit-wise reproducibility is your friend. See the Reproducibility recipe for the exact flag set and env vars. The 25% throughput cost is worth it during development.Common kernel pitfalls
| Symptom | Likely culprit |
|---|---|
| Garbled rollout, parameters loaded fine | Chat template mismatch, or buffer drop in SGLang |
| KL ≠ 0 in step 1 | Non-det fused attention; force --attention-backend flash |
| MoE training collapses after ~50 steps | R3 not on, or routing not preserved |
| Gradient NaN/Inf | Bad chat template, or activation overflow in FP8 |
illegal memory access in SGLang | OOM in disguise — lower --sglang-mem-fraction-static |
JSONDecodeError from inductor | Cache corrupt; set TORCHINDUCTOR_FORCE_DISABLE_CACHES=1 |
| NCCL hang during weight sync | NCCL_TIMEOUT=900; check NCCL_DEBUG=INFO |
Reading logs
Where to look:| Component | Path |
|---|---|
| Trainer stdout | wherever you redirected ray job submit |
| SGLang | /tmp/sglang/*.log (or --sglang-log-dir) |
| Ray workers | ~/.ray/session_latest/logs/worker-*.{out,err} |
| NCCL | NCCL_DEBUG=INFO NCCL_DEBUG_FILE=/tmp/nccl_%h_%p.log |
grep for these signal phrases:
Loaded checkpoint from— Megatron load OK.weight_sync— P2P / broadcast events.Continuous async rollout worker started— async worker alive.WorkerCrashed/RaySystemError— ranks died.forward_logp_max_diff— FP8 train/inference alignment.
When all else fails
- Drop to a tiny model (Qwen2.5-0.5B) on a known-good recipe (Reproducibility) to isolate framework vs. model.
git bisectbetween a known-working commit and HEAD. Determinism makes this trustworthy.- Open a GitHub issue with: launch script,
pip freeze, the first 200 lines of trainer stdout, and a short description.

