Variants
| Model | Active / Total | HF ID | Recipe |
|---|---|---|---|
| DeepSeek-V4-Pro | 49 B / 1.6 T | TBA | deepseek-v4-pro |
| DeepSeek-V4-Flash | 13 B / 284 B | sgl-project/DeepSeek-V4-Flash-FP8 | deepseek-v4-flash |
| DeepSeek-V3 | 37 B / 671 B | deepseek-ai/DeepSeek-V3 | deepseek |
| DeepSeek-R1 | 37 B / 671 B | deepseek-ai/DeepSeek-R1 | deepseek |
See radixark/miles#1046 for tracking.
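To compare the variants, it can help to turn the "Active / Total" column into a ratio. The sketch below is illustrative only: it copies the parameter counts from the table, converts 1.6 T to 1600 B, and assumes "Active" means the parameters used per token. The names `VARIANTS`, `active_fraction`, and `summarize` are hypothetical helpers, not part of miles.

```python
# Parameter counts in billions, copied from the Variants table above.
# Assumption: "Active" is the per-token active parameter count.
VARIANTS = {
    "DeepSeek-V4-Pro": (49, 1600),   # 1.6 T total = 1600 B
    "DeepSeek-V4-Flash": (13, 284),
    "DeepSeek-V3": (37, 671),
    "DeepSeek-R1": (37, 671),
}


def active_fraction(active_b: float, total_b: float) -> float:
    """Return the share of total parameters that is active."""
    return active_b / total_b


def summarize(variants: dict = VARIANTS) -> dict:
    """Map each model name to its active-parameter fraction."""
    return {name: active_fraction(a, t) for name, (a, t) in variants.items()}


if __name__ == "__main__":
    for name, frac in summarize().items():
        print(f"{name}: {frac:.1%} of parameters active")
```

The ratio says nothing about memory footprint: the full parameter set still has to be held across the cluster, which is why the total count drives the node requirement.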
Fastest path to train
DeepSeek-V4-Flash needs 8 nodes of 8× H200 and the radixark/miles:latest image:
scripts/run_deepseek.py).
Pairs well with
- PD Disaggregation — 671 B is where PD really earns its keep.
- P2P Weight Transfer — amortize weight sync across ranks.
- Fault Tolerance — node failures are inevitable at 16-node scale.

