radixark/miles#1046.
1. Model Introduction
DeepSeek-V4-Pro is a 49 B-active / 1.6 T-total MoE that scales up the same sparse-MLA + DSA-indexer + KV-compressor + hyper-connection stack as V4-Flash. The architecture family is identical; the deltas are size and a handful of tuned knobs (indexer top-k, output-projection groups, compression schedule). The miles + Megatron-Core integration ships in the same image as Flash and is selected with --model-name DeepSeek-V4-Pro-FP8.
Key highlights (deltas vs V4-Flash):
- Scaled-up V4 architecture: 61 layers (vs 43), hidden-size 7168 (vs 4096), 128 attention heads (vs 64), ffn_hidden_size=3072 and moe_ffn_hidden_size=3072 (vs 2048). All layers are MoE (same --moe-layer-freq pattern). q_lora_rank=1536 (vs 1024); latent KV (kv_lora_rank=512, qk_head_dim=512, v_head_dim=512) is unchanged across V4.
- Hybrid Attention with wider indexer and output projection: index_topk=1024 (vs Flash’s 512). Pro keeps 64 indexer heads × 128 dim but picks twice as many KV per query. Grouped output projection uses o_groups=16 (vs 8), keeping o_lora_rank=1024.
- KV compressors start heavily compressed: 60-element schedule [128, 128, 4, 128, 4, 128, …, 4, 0]. Pro skips Flash’s two leading uncompressed layers and starts at ratio-128 (HCA) from layer 0. Middle layers still alternate 4× (CSA) and 128× (HCA); only the final layer is uncompressed. Compressor RoPE base (compress_rope_theta=160000) is shared with Flash.
- MoE topology: 384 routed experts + 1 shared (vs Flash’s 256 + 1), top-6. --moe-router-topk-scaling-factor 2.5 (vs Flash 1.5) compensates for the larger expert pool. The first 3 layers (num_hash_layers=3) remain dense-routed via hash buckets.
- Identical YaRN RoPE and context: rope_theta=10000, YaRN factor=16, original_max_position_embeddings=65536 → effective context length 1,048,576 tokens (1 M), same as Flash.
- Hyper-connection (HC) routing: hc_mult=4 parallel streams with sinkhorn-normalized mixing, same as Flash (PP buffers stay 4-D).
- FP8 weights with simulated FP8 QAT on indexer and compressor activations; default training is BF16 on the cast checkpoint and default rollout is FP8 in SGLang with --sglang-attention-backend compressed.
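Two of the numbers above can be checked with a few lines of arithmetic. The sketch below is illustrative only and is not taken from the miles code base: the function names are hypothetical, and the way the elided middle of the compression schedule is filled in (strict 128 / 4 alternation) is an assumption based on the prose description, not on the shipped config.

```python
# Values quoted in the highlights above.
YARN_FACTOR = 16
ORIGINAL_MAX_POSITION_EMBEDDINGS = 65536
SCHEDULE_LENGTH = 60


def effective_context_length(factor: int, original_max: int) -> int:
    """Effective context as stated above: YaRN factor times the original range."""
    return factor * original_max


def build_compression_schedule(length: int = SCHEDULE_LENGTH) -> list[int]:
    """Build one schedule that matches the listed head and tail.

    ASSUMPTION: the elided middle strictly alternates 128 (HCA) and 4 (CSA).
    The real schedule may differ; check the shipped model config.
    """
    if length < 4 or length % 2:
        raise ValueError("this sketch expects an even length of at least 4")
    middle = [128, 4] * ((length - 2) // 2)  # (128, 4) pairs after the leading 128
    return [128] + middle + [0]  # final entry 0 = uncompressed last layer


if __name__ == "__main__":
    print(effective_context_length(YARN_FACTOR, ORIGINAL_MAX_POSITION_EMBEDDINGS))  # 1048576
    schedule = build_compression_schedule()
    print(len(schedule), schedule[:6], schedule[-2:])  # 60 [128, 128, 4, 128, 4, 128] [4, 0]
```

Under that assumption the schedule has one leading 128, 29 (128, 4) pairs and a trailing 0, which gives the 60 entries quoted above.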
2. Supported Variants
| Model | Active / Total | HF ID |
|---|---|---|
| DeepSeek-V4-Pro-FP8 | 49 B / 1.6 T | sgl-project/DeepSeek-V4-Pro-FP8 |
3. Quick start
3.1 One-line launch
The full-train subcommand chains prepare-download → prepare-single → prepare-spmd → prepare-cp → train. Each stage has a sentinel-based skip, so you can re-run safely after the first invocation.
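For readers unfamiliar with the pattern, a sentinel-based skip can be sketched as follows. This is a minimal illustration, not the launcher's implementation: the sentinel file naming, the `run_stage` and `run_prepare_stages` helpers, and the choice to cover only the prepare stages are all assumptions made for the example.

```python
from pathlib import Path
from typing import Callable, Dict, List

# Stage names from the text; "train" is left out of this sketch.
PREPARE_STAGES = ["prepare-download", "prepare-single", "prepare-spmd", "prepare-cp"]


def run_stage(name: str, work_dir: Path, action: Callable[[], None]) -> bool:
    """Run `action` unless a sentinel file says the stage already finished.

    ASSUMPTION: the sentinel is a hidden marker file in `work_dir`.
    Returns True if the stage ran, False if it was skipped.
    """
    sentinel = work_dir / f".{name}.done"
    if sentinel.exists():
        return False
    action()
    # Written only after the action succeeds, so a crash leaves no sentinel
    # and the stage is retried on the next invocation.
    sentinel.touch()
    return True


def run_prepare_stages(work_dir: Path, actions: Dict[str, Callable[[], None]]) -> List[str]:
    """Run the stages in order and return the names of those that executed."""
    return [name for name in PREPARE_STAGES if run_stage(name, work_dir, actions[name])]


if __name__ == "__main__":
    import tempfile

    with tempfile.TemporaryDirectory() as tmp:
        noop = {name: (lambda: None) for name in PREPARE_STAGES}
        print(run_prepare_stages(Path(tmp), noop))  # all four stages run
        print(run_prepare_stages(Path(tmp), noop))  # [] on the second pass
```

Writing the sentinel only after success is what makes a re-run safe: finished stages are skipped and a failed stage is repeated.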
3.2 Launcher path defaults
| Flag | Default | Use |
|---|---|---|
| --data-dir | /root/datasets | HF datasets (e.g. dapo-math-17k, …) |
| --model-dir | /root/models | parent directory holding the HF checkpoint and Megatron _torch_dist artifacts |
| --model-local-dir | unset → same as --model-dir | local NVMe path on each node; prepare-cp rsyncs the HF checkpoint and _torch_dist here so the trainer reads from local disk (set it when --model-dir is on shared/remote storage) |
| --save-dir | /root/models | training checkpoints under {save-dir}/{run-id}/checkpoints/ |
Each flag can also be set through a MILES_SCRIPT_<FIELD_NAME_UPPER> env var (precedence: CLI flag > env var > built-in default); see V4-Flash §3.2 for details.
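The precedence rule (CLI flag > env var > built-in default) could be resolved along these lines. This is a hedged sketch rather than the launcher's code: the `resolve` helper is hypothetical, and mapping `--data-dir` to the field name `data_dir`, and so to `MILES_SCRIPT_DATA_DIR`, is an assumption inferred from the `<FIELD_NAME_UPPER>` pattern.

```python
import os
from typing import Mapping, Optional

# Built-in defaults from the table above.
DEFAULTS = {
    "data_dir": "/root/datasets",
    "model_dir": "/root/models",
    "save_dir": "/root/models",
}


def resolve(field: str, cli_value: Optional[str] = None,
            environ: Optional[Mapping[str, str]] = None) -> str:
    """Return the effective value for `field`: CLI flag, then env var, then default.

    ASSUMPTION: the env var name is MILES_SCRIPT_ plus the upper-cased field name.
    """
    environ = os.environ if environ is None else environ
    if cli_value is not None:
        return cli_value
    env_key = f"MILES_SCRIPT_{field.upper()}"
    if env_key in environ:
        return environ[env_key]
    return DEFAULTS[field]


if __name__ == "__main__":
    env = {"MILES_SCRIPT_DATA_DIR": "/mnt/nvme/datasets"}  # hypothetical path
    print(resolve("data_dir", environ={}))                 # /root/datasets
    print(resolve("data_dir", environ=env))                # /mnt/nvme/datasets
    print(resolve("data_dir", "/tmp/data", environ=env))   # /tmp/data
```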
4. Script breakdown
The under-the-hood stages are essentially identical to V4-Flash; see the V4-Flash Script breakdown and substitute the Pro model name and path defaults shown above.
5. Example Recipe Configuration
5.1 Megatron Parallelism
These are the validated layouts shipped with the launcher. All parallelism dimensions are supported, so you can supply any other TP / EP / PP / CP combination that fits your compute.

| Hardware | Nodes × GPUs | TP | PP | CP | EP | expert-TP | Pipeline layout |
|---|---|---|---|---|---|---|---|
| H200 | 32 × 8 = 256 | 8 | 8 | 1 | 32 | 1 | first 7 / last 6 layers |
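The H200 row can be sanity-checked with simple arithmetic. The sketch below is illustrative and makes two assumptions that the table does not state: that "first 7 / last 6 layers" means the first pipeline stage holds 7 layers and the last holds 6, with the remaining layers split evenly over the middle stages, and that EP × expert-TP must divide the GPUs available within one pipeline stage. The helper names are hypothetical.

```python
from typing import List


def data_parallel_size(world: int, tp: int, pp: int, cp: int) -> int:
    """Data-parallel replicas left over after TP, PP and CP are assigned."""
    model_parallel = tp * pp * cp
    if world % model_parallel:
        raise ValueError("world size must be divisible by TP * PP * CP")
    return world // model_parallel


def layers_per_stage(num_layers: int, pp: int, first: int, last: int) -> List[int]:
    """ASSUMPTION: the middle PP - 2 stages share the remaining layers evenly."""
    middle_stages = pp - 2
    remaining = num_layers - first - last
    if middle_stages <= 0 or remaining % middle_stages:
        raise ValueError("layers do not split evenly over the middle stages")
    return [first] + [remaining // middle_stages] * middle_stages + [last]


def expert_layout_fits(world: int, pp: int, ep: int, expert_tp: int) -> bool:
    """ASSUMPTION: EP * expert-TP must divide the GPUs within one pipeline stage."""
    return (world // pp) % (ep * expert_tp) == 0


if __name__ == "__main__":
    world = 32 * 8  # 256 GPUs
    print(data_parallel_size(world, tp=8, pp=8, cp=1))      # 4
    print(layers_per_stage(61, pp=8, first=7, last=6))      # [7, 8, 8, 8, 8, 8, 8, 6]
    print(expert_layout_fits(world, pp=8, ep=32, expert_tp=1))  # True
```

Under these assumptions the 61 layers split as 7 + 6 × 8 + 6, and each pipeline stage spans 32 GPUs, which matches EP = 32 with expert-TP = 1.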
5.2 Algorithm
Same as Flash; see V4-Flash §5.2 Algorithm.
5.3 Rollout & SGLang
SGLANG_SKIP_CHECKPOINT_LOAD_CHECK=1, SGLANG_DSV4_FP4_EXPERTS=0, and the Pro-only pair SGLANG_DEEPEP_NUM_MAX_DISPATCH_TOKENS_PER_RANK=256, SGLANG_JIT_DEEPGEMM_PRECOMPILE=0.
Megatron side: --qkv-format bshd (V4 needs bshd with CP-aware data slicing). The DSA indexer additionally supports replay via --use-rollout-indexer-replay (off by default).
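If you launch SGLang outside the provided launcher, the environment variables listed above could be assembled as in the sketch below. This is an assumption-laden illustration, not the launcher's logic: the `rollout_env` helper and its substring check on the model name are hypothetical, and treating the first two variables as shared with Flash is inferred only from the phrase "Pro-only pair".

```python
import os
from typing import Dict, Mapping, Optional

# Variables listed above; ASSUMED to apply to both V4 recipes.
SHARED_ROLLOUT_ENV = {
    "SGLANG_SKIP_CHECKPOINT_LOAD_CHECK": "1",
    "SGLANG_DSV4_FP4_EXPERTS": "0",
}

# The pair the text marks as Pro-only.
PRO_ONLY_ROLLOUT_ENV = {
    "SGLANG_DEEPEP_NUM_MAX_DISPATCH_TOKENS_PER_RANK": "256",
    "SGLANG_JIT_DEEPGEMM_PRECOMPILE": "0",
}


def rollout_env(model_name: str, base: Optional[Mapping[str, str]] = None) -> Dict[str, str]:
    """Return a copy of `base` (default: os.environ) with the rollout variables set.

    ASSUMPTION: a "V4-Pro" substring in the model name identifies the Pro recipe.
    """
    env = dict(os.environ if base is None else base)
    env.update(SHARED_ROLLOUT_ENV)
    if "V4-Pro" in model_name:
        env.update(PRO_ONLY_ROLLOUT_ENV)
    return env


if __name__ == "__main__":
    env = rollout_env("DeepSeek-V4-Pro-FP8", base={})
    for key in sorted(env):
        print(f"{key}={env[key]}")
```

The returned dictionary can be passed as `env=` to `subprocess.run` when starting the rollout server from a custom script.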
5.4 Optimizer
--model-name DeepSeek-V4-Pro-FP8, which flips optimizer_offload=True in the launcher (scripts/run_deepseek_v4.py) and appends the three CPU-offload flags above. Adam states live on host RAM and are D2H/H2D-overlapped with the backward pass, freeing GPU memory for the 1.6 T weight + KV footprint. The --low-memory-resume flag (off by default) additionally puts optimizer states on CPU during ckpt resume to avoid OOM on the very first iteration.
6. Pairs Well With
- FP8 & Low Precision
- Architecture Support
- DeepSeek V4 Flash — sibling recipe; shares the V4-family architecture.

