Choose a precision
| Format | Block layout | Hardware | Models tested | Maturity |
|---|---|---|---|---|
| BF16 | — | All NVIDIA + AMD MI300X / MI325 / MI350 / MI355X | All | Baseline |
| FP8 block-wise (DeepSeek-style) | 128×128, FP32 scales | Hopper (H100 / H200), Blackwell (B200+) | Qwen3-4B, Qwen3-30B-A3B, DeepSeek-V3 / R1 | Generally available |
| MXFP8 | 1×32, UE8M0 scales | Blackwell only (B200, B300, GB200, GB300) | Qwen3-30B-A3B | Beta |
| NVFP4 (E2M1) | 1×16, two-level (FP8 + FP32) scales, MoE experts only | Blackwell, following the TransformerEngine NVFP4 reference | — | Experimental |
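To make the block layouts in the table concrete, the sketch below counts how many per-block scale factors each format would store for one 2-D weight matrix. The helper name `num_scales` and the format keys are illustrative assumptions, not part of Miles. The sketch ignores NVFP4's additional per-tensor FP32 scale and any padding a real kernel may apply.

```python
import math

# Block shapes (rows, cols) taken from the table above.
BLOCK_SHAPES = {
    "fp8_blockwise": (128, 128),  # one FP32 scale per 128x128 tile
    "mxfp8": (1, 32),             # one UE8M0 scale per 32 contiguous values
    "nvfp4": (1, 16),             # one FP8 scale per 16 contiguous values
}


def num_scales(rows: int, cols: int, fmt: str) -> int:
    """Count per-block scale factors for a rows x cols weight (hypothetical helper)."""
    block_rows, block_cols = BLOCK_SHAPES[fmt]
    # Ceil-divide so partially filled edge blocks still get their own scale.
    return math.ceil(rows / block_rows) * math.ceil(cols / block_cols)


if __name__ == "__main__":
    for fmt in BLOCK_SHAPES:
        print(fmt, num_scales(4096, 4096, fmt))
```

Finer blocks track local dynamic range more closely at the cost of storing more scales, which is one reason the finer formats use compact scale types.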
Rollout × training compatibility
Each row is a rollout (inference) precision; each column is the trainer’s forward precision. ✅ = supported; ✗ = not supported.

| Rollout \ Train | BF16 | FP8 block-wise | MXFP8 | NVFP4 |
|---|---|---|---|---|
| BF16 | ✅ baseline | ✗ | ✗ | ✗ |
| FP8 block-wise | ✅ | ✅ Hopper + Blackwell | ✗ | ✗ |
| MXFP8 | ✅ | ✗ | ✅ Blackwell | ✗ |
| NVFP4 | ✗ | ✗ | ✗ | 🚧 coming soon |
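The matrix can also be read as a lookup table. The following is a minimal sketch that transcribes the supported cells into a set of (rollout, train) pairs. The names `SUPPORTED_PAIRS` and `is_supported` and the lowercase format keys are assumptions made for illustration; Miles does not necessarily expose anything like this.

```python
# Supported (rollout, train) pairs, transcribed from the matrix above.
# NVFP4 is omitted because no NVFP4 combination is supported yet.
SUPPORTED_PAIRS = {
    ("bf16", "bf16"),
    ("fp8_blockwise", "bf16"),
    ("fp8_blockwise", "fp8_blockwise"),
    ("mxfp8", "bf16"),
    ("mxfp8", "mxfp8"),
}


def is_supported(rollout: str, train: str) -> bool:
    """Return True if this rollout/train precision pair is marked supported."""
    return (rollout, train) in SUPPORTED_PAIRS


if __name__ == "__main__":
    print(is_supported("fp8_blockwise", "bf16"))  # BF16 train + FP8 rollout
    print(is_supported("bf16", "mxfp8"))          # low-precision train needs matching rollout
```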
Constraints enforced by the reference script (scripts/run_qwen3_30b_a3b.py):
- --rollout-mxfp8 and --rollout-fp8 are mutually exclusive.
- --train-mxfp8 requires --rollout-mxfp8 (no MXFP8-train + FP8-rollout combo).
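The two flag rules above can be expressed in a few lines of standard-library `argparse`. This is a standalone sketch of the rules only: the reference script may use a different CLI library and different internals, and the function names `build_parser` and `parse_precision_args` are hypothetical.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(description="precision flag sketch")
    # Rule 1: --rollout-fp8 and --rollout-mxfp8 cannot be combined.
    rollout = parser.add_mutually_exclusive_group()
    rollout.add_argument("--rollout-fp8", action="store_true")
    rollout.add_argument("--rollout-mxfp8", action="store_true")
    parser.add_argument("--train-mxfp8", action="store_true")
    return parser


def parse_precision_args(argv):
    parser = build_parser()
    args = parser.parse_args(argv)
    # Rule 2: MXFP8 training is only valid together with MXFP8 rollout.
    if args.train_mxfp8 and not args.rollout_mxfp8:
        parser.error("--train-mxfp8 requires --rollout-mxfp8")
    return args


if __name__ == "__main__":
    print(parse_precision_args(["--rollout-mxfp8", "--train-mxfp8"]))
```

Invalid combinations exit through `parser.error`, which prints a usage message and raises `SystemExit`.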
Unified training recipe
| Stage | Typical pipeline | Miles unified low-precision |
|---|---|---|
| Rollout (forward) | FP8 / MXFP8 GEMM | FP8 / MXFP8 GEMM |
| Trainer (forward) | BF16 GEMM | FP8 / MXFP8 GEMM with matching quant config |
| Trainer (backward) | BF16 grads | BF16 backward (master weights in BF16) |
| Optimizer | BF16 master | BF16 master |
Modes
1. BF16 train + FP8 inference
The lowest-friction path. SGLang loads FP8 weights while the trainer keeps a BF16 torch_dist checkpoint. Because rollout and training run at different precisions, their outputs drift apart; on MoE workloads, pair this with R3 (--use-rollout-routing-replay) and optionally TIS (--use-tis).
2. Unified block-wise FP8 (DeepSeek-style)
Rollout and training share the same block-wise FP8 quantization. This is the recipe to use on Hopper, and the recipe DeepSeek-V3 / DeepSeek-R1 ship in. Block layout is 128×128 with FP32 scales.

| Flag | Effect |
|---|---|
| --transformer-impl transformer_engine | Routes Megatron’s forward through TransformerEngine so FP8 GEMM is engaged. |
| --fp8-format e4m3 | Forward FP8 format used by TransformerEngine. |
| --fp8-recipe blockwise | 128×128 block-wise quantization; SGLang must serve weights in the matching layout. |
| --use-tis | Truncated Importance Sampling for residual precision drift. |
Set NVTE_FP8_BLOCK_SCALING_FP32_SCALES=1 in the Ray runtime env to use FP32
scales (miles/ray/actor_group.py already sets this in the actor env).
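As a rough illustration of what "128×128 with FP32 scales" means, the sketch below computes one scale per tile so that the tile's largest magnitude maps onto the E4M3 maximum (448). It is a pure-Python approximation under stated assumptions: it does not round values onto the FP8 grid, the function name `blockwise_scales` is hypothetical, and TransformerEngine's actual kernels may store the inverse scale or otherwise differ.

```python
FP8_E4M3_MAX = 448.0  # largest finite magnitude of float8_e4m3fn


def blockwise_scales(weight, block=128):
    """Return one scale per (block x block) tile of a 2-D list of floats."""
    rows, cols = len(weight), len(weight[0])
    scales = []
    for r0 in range(0, rows, block):
        row_scales = []
        for c0 in range(0, cols, block):
            # Largest magnitude inside this tile (edge tiles may be smaller).
            amax = max(
                abs(weight[r][c])
                for r in range(r0, min(r0 + block, rows))
                for c in range(c0, min(c0 + block, cols))
            )
            # Dividing the tile by this scale maps amax onto the FP8 maximum.
            # An all-zero tile gets scale 1.0 to avoid a zero divisor later.
            row_scales.append(amax / FP8_E4M3_MAX if amax > 0 else 1.0)
        scales.append(row_scales)
    return scales


if __name__ == "__main__":
    # Tiny 4x4 example with 2x2 tiles so the result is easy to inspect.
    w = [
        [448.0, 1.0, 0.0, 0.0],
        [2.0, 3.0, 0.0, 0.0],
        [1.0, 1.0, -896.0, 4.0],
        [1.0, 1.0, 4.0, 4.0],
    ]
    print(blockwise_scales(w, block=2))
```

Because rollout and training must agree on both the tile shape and the scale convention, a mismatch on either side shows up as precision drift rather than as a hard error.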
For models that already ship 128×128 block-wise FP8 weights (DeepSeek-V3,
DeepSeek-R1, Qwen/Qwen3-30B-A3B-FP8), point --hf-checkpoint at the
block-wise FP8 directory and let SGLang autodetect. Otherwise convert with
tools/convert_hf_to_fp8.py.
For MoE workloads, also consider --use-rollout-routing-replay (R3). The
canonical recipe leaves it commented out by default but the flag is available.
Reference recipes:
- examples/low_precision/run-qwen3-4b-fp8.sh: single-node Qwen3-4B.
- examples/low_precision/run-qwen3-30b-a3b-fp8-two-nodes.sh: two-node Qwen3-30B-A3B.
3. Unified MXFP8 (Blackwell)
MXFP8 uses a finer block layout (1×32) with UE8M0 (power-of-two) scales packed as uint8. Weights are stored as float8_e4m3fn. This is the format wired
into the Blackwell path of the Qwen3-30B-A3B reference script.
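To show how a power-of-two scale fits into one uint8, here is a minimal sketch of UE8M0 encoding for a single 32-value block. It assumes the convention from the OCP Microscaling (MX) specification: shared exponent = floor(log2(amax)) minus the largest E4M3 exponent (8), stored with a bias of 127. The function names are hypothetical, and production kernels may round the exponent differently (for example, rounding up to avoid saturation).

```python
import math

E4M3_EMAX = 8     # 448 = 1.75 * 2**8, so the largest E4M3 exponent is 8
UE8M0_BIAS = 127  # UE8M0 stores a biased power-of-two exponent in one byte


def ue8m0_scale_byte(block):
    """Return the uint8 scale code for one block of floats (assumed MX convention)."""
    amax = max(abs(v) for v in block)
    if amax == 0.0:
        return UE8M0_BIAS  # scale 2**0; any scale works for an all-zero block
    # frexp gives amax = m * 2**e with 0.5 <= m < 1, so floor(log2(amax)) == e - 1.
    _, e = math.frexp(amax)
    shared_exp = (e - 1) - E4M3_EMAX
    # Codes 0..254 are valid exponents; 255 is reserved for NaN.
    return max(0, min(254, shared_exp + UE8M0_BIAS))


def ue8m0_decode(code):
    """Turn a stored uint8 code back into its power-of-two scale."""
    return 2.0 ** (code - UE8M0_BIAS)


if __name__ == "__main__":
    code = ue8m0_scale_byte([1.0] * 32)
    print(code, ue8m0_decode(code))
```

With this floor-based rule, a scaled block maximum can land between 448 and 512 and would saturate when cast to E4M3, which is the trade-off that rounding-up variants avoid.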
Hardware: Blackwell only — B200, B300, GB200, GB300. The reference script
asserts the GPU class on enable (scripts/run_qwen3_30b_a3b.py).
Train flags: the same Megatron knobs as the block-wise FP8 recipe, with the mxfp8 recipe selected.
tools/convert_hf_to_mxfp8.py converts every *.weight tensor whose last dim is divisible by
32, except tensors matching layernorm, embed, router, mlp.gate., norm, lm_head,
eh_proj, or weights_proj. The converter also rewrites the HF config.
Limitations:
- No DeepEP / DeepGEMM yet. MoE all-to-all uses the cutlass MoE runner, which does not currently support EP. Plan EP/PP accordingly.
- --train-mxfp8 requires --rollout-mxfp8 (the script enforces this).
Reference recipe: scripts/run_qwen3_30b_a3b.py
with --rollout-mxfp8 --train-mxfp8 --hardware B200. There is no dedicated
shell script under examples/low_precision/ yet.
4. NVFP4 (experimental)
NVFP4 is FP4 E2M1 with 1D block scaling (group size 16) and a two-level scale (per-block FP8 + per-tensor FP32), following the TransformerEngine NVFP4 reference. Today only MoE expert GEMMs are quantized; dense layers stay in their original precision. The full unified NVFP4 recipe is in development.
Hardware support
| GPU | BF16 | FP8 block-wise | MXFP8 | NVFP4 |
|---|---|---|---|---|
| NVIDIA H100 / H200 | ✅ | ✅ | ✗ | ✗ |
| NVIDIA B200 / B300 / GB200 / GB300 | ✅ | ✅ | ✅ | 🚧 in development |
| NVIDIA A100 | ✅ | ✗ | ✗ | ✗ |
| AMD MI300X / MI325 / MI350 / MI355X | ✅ | ✗ | ✗ | ✗ |
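For launch scripts that need to pick a precision automatically, the hardware table can be transcribed into a small lookup. This is only a sketch: the family keys (`hopper`, `blackwell`, `a100`, `amd`) and the function `formats_for` are assumptions made for illustration, not identifiers from Miles.

```python
# Transcribed from the hardware support table above. NVFP4 is left out
# because no GPU in the table supports it yet.
SUPPORTED_FORMATS = {
    "hopper": {"bf16", "fp8_blockwise"},              # H100 / H200
    "blackwell": {"bf16", "fp8_blockwise", "mxfp8"},  # B200 / B300 / GB200 / GB300
    "a100": {"bf16"},
    "amd": {"bf16"},                                  # MI300X / MI325 / MI350 / MI355X
}


def formats_for(gpu_family: str) -> set:
    """Return the precisions marked supported for a GPU family (hypothetical helper)."""
    try:
        return SUPPORTED_FORMATS[gpu_family.lower()]
    except KeyError:
        raise ValueError(f"unknown GPU family: {gpu_family!r}") from None


if __name__ == "__main__":
    print(sorted(formats_for("Blackwell")))
```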
When BF16 is enough
- Dense models below ~30 B.
- A100 hardware (no FP8 GEMM).
- AMD hardware today.
- Bring-up of a new model architecture, where clean BF16 numerics simplify debugging.

