How it works
Bit-wise reproducibility requires three independent stacks to be deterministic:
- Inference (SGLang): every kernel must be deterministic.
- Training (Megatron-LM): every kernel must be deterministic here as well.
- Communication (NCCL): the collective algorithm choice and the cuBLAS workspace can be non-deterministic by default.
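As general background (not a statement about any specific kernel in these stacks), a common reason a reduction can be non-deterministic is that floating-point addition is not associative: summing the same values in a different order can change the low-order bits. A minimal illustration in plain Python, where the three constants are arbitrary stand-ins for partial results:

```python
# Three ordinary float64 values standing in for partial sums
# produced by three different workers (arbitrary illustrative numbers).
a, b, c = 0.1, 0.2, 0.3

left_first = (a + b) + c   # combine a and b first, then add c
right_first = a + (b + c)  # combine b and c first, then add a

# Same inputs, different grouping, different result.
print(left_first == right_first)  # False
print(left_first.hex(), right_first.hex())
```

Pinning the algorithm, as the settings on this page do, fixes the order in which such partial results are combined.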
Quick start
We use the smallest Miles model (Qwen2.5-0.5B) on GSM8K so the loop fits in 5 minutes and you can reproduce the bit-stability check yourself.
1. Disable FA3
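Flash-Attention 3 currently has non-deterministic backward kernels, so uninstall it from the environment before running.
2. Set the deterministic flags
Enable deterministic mode in both the trainer and the inference engine: --deterministic-mode for Megatron, and --sglang-enable-deterministic-inference together with --sglang-attention-backend flashinfer for SGLang.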
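The determinism flags that appear elsewhere on this page can be kept in one place and appended to whatever launch arguments you already use. A minimal sketch, assuming the flags are passed on a single command line; the helper name `with_determinism` is hypothetical and the launch script itself is deliberately left out:

```python
import shlex

# Flags taken from the tables on this page.
MEGATRON_FLAGS = ["--deterministic-mode"]
SGLANG_FLAGS = [
    "--sglang-enable-deterministic-inference",
    "--sglang-attention-backend", "flashinfer",
]


def with_determinism(base_args):
    """Return a new argument list: base_args followed by the determinism flags."""
    return list(base_args) + MEGATRON_FLAGS + SGLANG_FLAGS


# Your existing launch arguments would go in the list passed here.
print(shlex.join(with_determinism([])))
```

Keeping the flags in one list makes it harder to enable determinism on the trainer while forgetting the inference side.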
3. Set the env vars (Ray env_vars)
| Variable | Why |
|---|---|
| NCCL_ALGO=Ring | Forces a deterministic collective algorithm |
| NVTE_ALLOW_NONDETERMINISTIC_ALGO=0 | Disables non-deterministic Transformer-Engine kernels |
| CUBLAS_WORKSPACE_CONFIG=:4096:8 | cuBLAS’s deterministic workspace allocation |
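The three variables above can be expressed as a Ray runtime environment, whose `env_vars` field takes a mapping of string names to string values. A minimal sketch that only builds and serializes the mapping; how your launcher hands it to Ray (for example through `ray.init(runtime_env=...)` or a JSON string on the job-submission command line) is an assumption about your setup:

```python
import json

# Values copied from the table above; Ray expects every value to be a string.
DETERMINISM_ENV_VARS = {
    "NCCL_ALGO": "Ring",
    "NVTE_ALLOW_NONDETERMINISTIC_ALGO": "0",
    "CUBLAS_WORKSPACE_CONFIG": ":4096:8",
}

# The runtime environment dict, with the variables under the "env_vars" key.
runtime_env = {"env_vars": DETERMINISM_ENV_VARS}

# JSON form, convenient when the runtime environment is passed as a string.
runtime_env_json = json.dumps(runtime_env)
print(runtime_env_json)
```

Setting the variables through the runtime environment rather than in the local shell is what gets them to workers on other nodes.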
4. Download + convert + run
5. Verify
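The check in this step amounts to comparing what two runs wrote, byte for byte. A minimal sketch, assuming each run saves its outputs under its own directory and that the files contain no timestamps or other run-specific metadata; the function names and directory layout are hypothetical:

```python
import hashlib
from pathlib import Path


def hash_tree(root: Path) -> dict:
    """Map each file path (relative to root) to the SHA-256 of its contents."""
    digests = {}
    for path in sorted(p for p in root.rglob("*") if p.is_file()):
        h = hashlib.sha256()
        with path.open("rb") as f:
            # Read in 1 MiB chunks so large checkpoint files do not fill memory.
            for chunk in iter(lambda: f.read(1 << 20), b""):
                h.update(chunk)
        digests[path.relative_to(root).as_posix()] = h.hexdigest()
    return digests


def diff_runs(run_a: Path, run_b: Path) -> list:
    """Return the relative paths that are missing from one run or differ."""
    a, b = hash_tree(run_a), hash_tree(run_b)
    return sorted(name for name in a.keys() | b.keys() if a.get(name) != b.get(name))
```

An empty list from `diff_runs` means the two directories are byte-identical; any entry names a file worth inspecting first.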
Run the job twice, then compare the hashes of the outputs from the two runs; they should match exactly.
What’s deterministic and what isn’t
| Component | Default | Deterministic mode |
|---|---|---|
| Megatron forward | non-det | ✅ via --deterministic-mode |
| Megatron backward | non-det | ✅ |
| SGLang kernels | non-det | ✅ via --sglang-enable-deterministic-inference |
| Flash-Attn 3 | non-det | ❌ — uninstall |
| NCCL collectives | non-det | ✅ via NCCL_ALGO=Ring |
| cuBLAS GEMM | non-det | ✅ via CUBLAS_WORKSPACE_CONFIG |
| TE fused kernels | non-det | ✅ via NVTE_ALLOW_NONDETERMINISTIC_ALGO=0 |
| Python dataloader shuffle | seeded | ✅ already |
Troubleshooting
| Symptom | Likely cause |
|---|---|
| Hashes diverge after iter 1 | Flash-Attn 3 still installed |
| Hashes match for trainer but not SGLang | --sglang-attention-backend flashinfer not set |
| Hashes diverge across nodes | NCCL_ALGO=Ring not propagated to all workers |
| Hashes match locally but not on a different machine | cuDNN version mismatch |
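Several of these symptoms come down to a setting that never reached the process doing the work. A minimal sketch of a check you could call from inside each worker, assuming the three environment variables from the quick start; the function name `misconfigured` is hypothetical:

```python
import os

# Expected values, copied from the environment-variable table on this page.
EXPECTED_ENV = {
    "NCCL_ALGO": "Ring",
    "NVTE_ALLOW_NONDETERMINISTIC_ALGO": "0",
    "CUBLAS_WORKSPACE_CONFIG": ":4096:8",
}


def misconfigured(env=None):
    """Return {name: actual_value} for each variable that is unset or wrong.

    A value of None means the variable is not set in this process at all.
    """
    env = os.environ if env is None else env
    return {
        name: env.get(name)
        for name, expected in EXPECTED_ENV.items()
        if env.get(name) != expected
    }


if __name__ == "__main__":
    problems = misconfigured()
    print("all determinism variables set" if not problems else problems)
```

Running it on every node rather than only on the driver is the point: a variable exported in the launching shell is not necessarily visible to remote workers.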
Cost of determinism
Roughly:
| Component | Throughput cost |
|---|---|
| Megatron deterministic mode | -3% to -8% |
| SGLang deterministic | -10% to -15% |
| NCCL Ring | -2% (vs. Tree) |
| Drop FA3 | -10% to -25% |
When to disable determinism
- Production training runs where the cost is too high.
- When you’ve already nailed the result and want maximum throughput.
- On hardware that doesn’t support deterministic kernels.
References
- SGLang deterministic inference blog
- Megatron-LM deterministic-mode docs

