What W4A16 means
| Term | Meaning | Notes |
|---|---|---|
| W4 | 4-bit weights | Group-quantized (typical group size 32–128) |
| A16 | 16-bit activations | BF16 activation pathway |
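To make the table concrete, here is a minimal pure-Python sketch of group-wise 4-bit weight quantization. It assumes symmetric round-to-nearest with one scale per group. That is a simplification: the conversion tool described below uses GPTQ, which also compensates for rounding error, and the real storage layout may differ. The function names are hypothetical.

```python
def quantize_groupwise(weights, group_size=32):
    """Round-to-nearest symmetric INT4: one float scale per group of weights."""
    if len(weights) % group_size != 0:
        raise ValueError("len(weights) must be a multiple of group_size")
    groups = []
    for start in range(0, len(weights), group_size):
        chunk = weights[start:start + group_size]
        max_abs = max(abs(w) for w in chunk)
        # Map the largest magnitude to 7, the top of the signed 4-bit range [-8, 7].
        scale = max_abs / 7 if max_abs > 0 else 1.0
        codes = [max(-8, min(7, round(w / scale))) for w in chunk]
        groups.append((codes, scale))
    return groups


def dequantize_groupwise(groups):
    """Rebuild approximate weights; in W4A16 these feed the 16-bit activation path."""
    return [code * scale for codes, scale in groups for code in codes]


# Synthetic weights, for illustration only.
weights = [((i * 37) % 101 - 50) / 500 for i in range(64)]
groups = quantize_groupwise(weights, group_size=32)
restored = dequantize_groupwise(groups)
max_error = max(abs(a - b) for a, b in zip(weights, restored))
print(f"groups={len(groups)} max_error={max_error:.4f}")
```

Each group carries its own scale, so the rounding error of any weight is bounded by half of its group's scale.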
Calibration
Convert a BF16 HuggingFace checkpoint to INT4 with tools/convert_hf_to_int4.py
(GPTQ via llmcompressor):
| Flag | Default | Notes |
|---|---|---|
| --quant-type | W4A16 | Also accepts W8A16. |
| --num-calibration-samples | 256 | Calibration set size. |
| --quant-group-size | 32 | GPTQ group size; 128 is also common. |
| --max-sequence-length | 2048 | Calibration sequence length. |
| --dampening-frac | 0.01 | GPTQ damping. |
| --trust-remote-code | off | Pass when the HF config requires custom code. |
Point --hf-checkpoint at the converted checkpoint; SGLang autodetects the quantization at load time.
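As a sketch of how the documented flags combine, the helper below assembles an argument list for the conversion script. The flag names and defaults come from the table above. The helper itself is hypothetical, and --trust-remote-code is assumed to be a bare switch. The script's input and output path arguments are not listed here, so they are deliberately left out and the printed command is incomplete as it stands.

```python
import shlex

# Flags and defaults copied from the table above.
DEFAULT_FLAGS = {
    "--quant-type": "W4A16",
    "--num-calibration-samples": 256,
    "--quant-group-size": 32,
    "--max-sequence-length": 2048,
    "--dampening-frac": 0.01,
}


def build_convert_command(overrides=None, trust_remote_code=False):
    """Return an argv list for the converter (hypothetical helper).

    Input/output path arguments are omitted because this page does not
    document them; add them before running the command.
    """
    overrides = overrides or {}
    unknown = set(overrides) - set(DEFAULT_FLAGS)
    if unknown:
        raise ValueError(f"undocumented flags: {sorted(unknown)}")
    flags = {**DEFAULT_FLAGS, **overrides}
    argv = ["python", "tools/convert_hf_to_int4.py"]
    for name, value in flags.items():
        argv += [name, str(value)]
    if trust_remote_code:
        # Assumption: the flag is a bare switch that takes no value.
        argv.append("--trust-remote-code")
    return argv


command = build_convert_command({"--quant-group-size": 128})
print(shlex.join(command))
```

Rejecting unknown flag names keeps the sketch from silently passing options the script may not accept.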
Enabling QAT
QAT is currently driven by environment variables passed through Ray’s runtime env rather than CLI flags. The canonical recipe is examples/low_precision/run-qwen3-30B-A3B-int4.sh.
Pair the INT4 --hf-checkpoint with a BF16 --ref-load torch_dist directory so the KL anchor stays full-precision.
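For orientation, the snippet below sketches what a Ray runtime env carrying environment variables can look like. It is not a copy of the recipe script: only OPEN_TRAINING_INT4_GROUP_SIZE is named on this page, the value 128 is an illustrative assumption, and any other variables the recipe sets are not reproduced. The env_vars key is Ray's standard runtime env field, and Ray expects its values to be strings.

```python
import json


def build_runtime_env(env_vars):
    """Build a Ray runtime_env dict (hypothetical helper); values become strings."""
    return {"env_vars": {name: str(value) for name, value in env_vars.items()}}


# Illustrative value only; consult the recipe script for the real settings.
runtime_env = build_runtime_env({"OPEN_TRAINING_INT4_GROUP_SIZE": 128})

# JSON form, as accepted by `ray job submit --runtime-env-json`.
runtime_env_json = json.dumps(runtime_env)
print(runtime_env_json)
```

The same dict can also be passed as the runtime_env argument of ray.init when launching from Python.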
Tuning
| Symptom | Try |
|---|---|
| Eval reward drops noticeably vs BF16 | Lower OPEN_TRAINING_INT4_GROUP_SIZE (e.g. 64), or recalibrate with more samples. |
| Slower than BF16 | Confirm --sglang-cuda-graph-bs covers your batch sizes. |
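The first row follows from how group scales work: a single outlier inflates the scale of its whole group, so smaller groups limit how many weights share that coarse scale. The sketch below illustrates this on synthetic data. It assumes symmetric round-to-nearest INT4 rather than GPTQ, and the helper name is hypothetical.

```python
def mean_roundtrip_error(weights, group_size):
    """Mean absolute error after symmetric round-to-nearest INT4 with per-group scales."""
    total = 0.0
    for start in range(0, len(weights), group_size):
        chunk = weights[start:start + group_size]
        # Fall back to 1.0 when the whole group is zero.
        scale = max(abs(w) for w in chunk) / 7 or 1.0
        for w in chunk:
            code = max(-8, min(7, round(w / scale)))
            total += abs(w - code * scale)
    return total / len(weights)


# Synthetic data: 256 small weights plus one large outlier at index 0.
weights = [0.01 * ((i * 37) % 19 - 9) / 9 for i in range(256)]
weights[0] = 1.0

errors = {g: mean_roundtrip_error(weights, g) for g in (128, 64, 32)}
for group_size, error in errors.items():
    print(f"group_size={group_size}: mean_abs_error={error:.6f}")
```

On this data the error shrinks as the group size drops, because fewer small weights are rounded to zero alongside the outlier. The cost is more scales to store and transfer.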
Pairs with
- R3. Keeps MoE routing stable across the quantized forward.
- P2P weight transfer. INT4 weights are roughly 4× smaller than BF16, so weight sync transfers less data.
- Speculative decoding. Compounds for end-to-end rollout speedup.
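The weight-transfer saving can be estimated with simple arithmetic. The sketch below assumes one 16-bit scale per group and no zero-points, and it counts quantized weights only; layers that stay in BF16 are ignored. The exact storage format is an assumption, so treat the numbers as approximate.

```python
def bits_per_weight(group_size, weight_bits=4, scale_bits=16):
    """Effective bits per weight once the per-group scale is amortized."""
    return weight_bits + scale_bits / group_size


def compression_vs_bf16(group_size):
    """How many times smaller the quantized weights are than 16-bit weights."""
    return 16 / bits_per_weight(group_size)


for group_size in (32, 64, 128):
    print(
        f"group_size={group_size}: "
        f"{bits_per_weight(group_size):.3f} bits/weight, "
        f"{compression_vs_bf16(group_size):.2f}x smaller than BF16"
    )
# group_size=32: 4.500 bits/weight, 3.56x smaller than BF16
# group_size=64: 4.250 bits/weight, 3.76x smaller than BF16
# group_size=128: 4.125 bits/weight, 3.88x smaller than BF16
```

Under these assumptions the ratio approaches 4× as the group size grows, which is the trade-off against the accuracy benefit of smaller groups.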
When QAT is not appropriate
- The model fits comfortably without it.
- The model architecture is still in development; introduce QAT after a BF16 baseline.
- Tasks that are highly precision-sensitive (some math and safety eval suites).

