radixark/miles#1046.
1. Model Introduction
DeepSeek-V4-Flash is a 13 B-active / 284 B-total MoE model with a substantially different attention stack from V3/R1. It ships in theradixark/miles:latest image. The larger DeepSeek-V4-Pro shares the same V4 architecture family at Pro scale.
Key highlights:
- Hybrid Attention (CSA + HCA): combines Compressed Sparse Attention (light compression) and Heavily Compressed Attention (heavy compression) layers — DeepSeek’s official V4 name (see HF model card §Introduction). Implementation uses low-rank Q (
q_lora_rank=1024), single-head latent KV (head_dim=512), grouped output projection (8 groups, LoRA rank 1024). A learned topk indexer (index_topk=512, 64 heads × 128 dim) picks 512 KV per query at runtime, inheriting V3.2’s DSA-style design. - KV compressors: 44-element compression schedule
compress_ratios = [0, 0, 4, 128, 4, 128, …, 4, 0]— first / last few layers are uncompressed (ratio 0), middle layers alternate 4× (CSA) and 128× (HCA). The compressor RoPE has its own base (compress_rope_theta=160000), separate from the main attention RoPE. - Hyper-connection (HC) routing: each layer expands hidden state into
hc_mult=4parallel streams and recombines via sinkhorn-normalized mixing. Pipeline-parallel buffers are 4-D[s, b, hc_mult, d]instead of 3-D. - YaRN RoPE on main attention:
rope_theta=10000, YaRNfactor=16,original_max_position_embeddings=65536→ effective context length 1,048,576 tokens (1 M). Per-head learnable attention sinks (one scalar per head, added to softmax denominator). - FP8 weights with simulated FP8 QAT on indexer and compressor activations.
2. Supported Variants
| Model | Active / Total | HF ID |
|---|---|---|
| DeepSeek-V4-Flash | 13 B / 284 B | sgl-project/DeepSeek-V4-Flash-FP8 |
3. Quick start
3.1 One-line launch
One command runs the full pipeline — dataset download, FP8 → BF16 cast, distributedtorch_dist conversion, and the training loop:
full-train subcommand chains prepare-download → prepare-single → prepare-spmd → prepare-cp → train. Each stage has a sentinel-based skip so you can re-run safely after the first invocation.
3.2 Launcher path defaults
The Python launcher (scripts/run_deepseek_v4.py) takes its path arguments from CLI flags. The defaults are:
| Flag | Default | Use |
|---|---|---|
--data-dir | /root/datasets | HF datasets (e.g. dapo-math-17k, …) |
--model-dir | /root/models | parent directory holding the HF checkpoint and Megatron _torch_dist artifacts as separate sibling sub-directories |
--model-local-dir | unset → same as --model-dir | local NVMe path on each node; prepare-cp rsyncs the HF checkpoint and _torch_dist here so the trainer reads from local disk instead of shared storage (only worth setting when --model-dir is on shared/remote storage) |
--save-dir | /root/models | training checkpoints under {save-dir}/{run-id}/checkpoints/ |
MILES_SCRIPT_<FIELD_NAME_UPPER> (e.g. MILES_SCRIPT_MODEL_DIR), with precedence CLI flag > env var > built-in default; run train --help to see each option’s [env var: …] name.
3.3 Colocated vs. disaggregated rollout
By default the launcher runs colocated: training and SGLang rollout share all--num-nodes × --num-gpus-per-node GPUs. Pass --rollout-num-nodes N (0 < N < --num-nodes) to run disaggregated: N nodes serve rollout, the rest train. The verified parallelism recipes are keyed on the training nodes, so the 8-node Flash recipe in disaggregated form is --num-nodes 16 --rollout-num-nodes 8 (8 train + 8 rollout — the validated layout).
4. Script breakdown
In this section, we explain whatfull-train does under the hood, and how to drive each stage manually if you need to debug or run outside the one-line launcher.
4.1 Download model + datasets
prepare-download subcommand does the dataset fetch automatically; pass --hf-checkpoint <path> to skip the model download when the FP8 weights are already on a shared filesystem.
4.2 HF → Megatron torch_dist conversion
The conversion happens in two stages — a single-rank FP8 → BF16 cast, followed by a distributed torch_dist shard:
prepare-spmd subcommand drives the same conversion.
4.3 Multi-node fan-out
The Python launcher manages Ray internally — start each pod with theradixark/miles:latest image and a working shared filesystem mounted at the same path on every node, then on the head node:
MILES_SCRIPT_EXTERNAL_RAY=1 and RAY_ADDRESS=… to point the launcher at an existing Ray cluster (for example, one that an orchestration layer has already brought up). When RAY_ADDRESS is unset, the launcher boots a local Ray head.
4.4 Notable quirks
- Custom
transformerspatch. miles shipswith_transformers_patch()(miles/utils/transformers_patch.py) so HF’sAutoConfig.from_pretrainedrecognizesmodel_type=deepseek_v4/deepseek_refuntil support lands upstream.
5. Example Recipe Configuration
5.1 Megatron Parallelism
These are the validated layouts shipped with the launcher; All parallelisms are supported, you can supply any other TP / EP / PP / CP combination that fits your compute.| Hardware | Nodes × GPUs | TP | PP | CP | EP | expert-TP | Pipeline layout |
|---|---|---|---|---|---|---|---|
| H200 | 8 × 8 = 64 | 8 | 8 | 1 | 8 | 1 | first 4 / last 3 layers |
| GB300 | 8 × 4 = 32 | 8 | 4 | 1 | 8 | 1 | first 11 / last 10 layers |
| GB300 | 8 × 4 = 32 | 2 | 8 | 2 | 4 | 1 | first 4 / last 3 layers |
5.2 Algorithm
Using GRPO as an example, you can configure the algorithm with the following flags:--moe-router-freeze-gate and --freeze-e-score-correction-bias are required and asserted on the mcore side — bias-update during RL is forbidden.
5.3 Rollout & SGLang
SGLANG_SKIP_CHECKPOINT_LOAD_CHECK=1, SGLANG_DSV4_FP4_EXPERTS=0, MILES_HACK_TRAIN_TORCH_DETERMINISTIC=1, and NCCL_ALGO=Ring.
On the Megatron side, V4 needs --qkv-format bshd with CP-aware data slicing. The DSA indexer additionally supports replay via --use-rollout-indexer-replay (off by default).
5.4 Optimizer
--low-memory-resume flag (off by default) puts optimizer states on CPU during ckpt resume to avoid OOM on the very first iteration.
6. Pairs Well With
- FP8 & Low Precision
- Architecture Support — the V4 plugin lives under
miles_plugins/models/deepseek_v4/.

