scripts/run-kimi-k25.sh, which sources the shared model definition in scripts/models/kimi-k2-thinking.sh.
1. Model Introduction
Kimi-K2.5 is an open-source, natively multimodal agentic model from Moonshot AI. It is built by continual pretraining on roughly 15 T mixed vision and text tokens on top of Kimi-K2-Base, and it pairs a Mixture-of-Experts (MoE) language backbone with a MoonViT vision encoder so a single model handles both image and text inputs. K2.5 keeps the 1 T-total / 32 B-active shape of the K2 family and extends the context window to 256K tokens. Architecture at a glance (from the model card):| Specification | Value |
|---|---|
| Architecture | Mixture-of-Experts (MoE) |
| Total / activated parameters | 1 T / 32 B |
| Layers (1 dense + 60 MoE) | 61 |
| Attention hidden dimension | 7168 |
| MoE hidden dimension (per expert) | 2048 |
| Attention heads | 64 |
| Routed experts (selected per token) | 384 (top-8) |
| Shared experts | 1 |
| Attention mechanism | Multi-head Latent Attention (MLA) |
| Activation | SwiGLU |
| Vocabulary size | 160K |
| Context length | 256K |
| Vision encoder | MoonViT (400 M) |
- Native multimodality. K2.5 is pretrained on both vision and language tokens, so it covers visual knowledge, cross-modal reasoning, and tool use grounded in images alongside text.
- Coding with vision. It generates code from visual specifications such as UI designs and video workflows, and drives tools for visual data processing.
- Agent swarm. It decomposes a complex task into parallel sub-tasks run by dynamically instantiated, domain-specific agents, rather than scaling a single agent.
2. Supported Variants
The K2.5 launcher expects two checkpoints under$BASE_DIR: an INT4 actor checkpoint and a BF16 reference checkpoint.
| Role | Checkpoint | Loaded with |
|---|---|---|
| Actor (trained) | $BASE_DIR/Kimi-K2.5-int4 | --hf-checkpoint |
| Reference | $BASE_DIR/Kimi-K2.5-bf16 | --ref-load |
3. Quick start
3.1 Prerequisites
The launcher references two environment variables but never sets them, so you should export them yourself before launch:$BASE_DIR directory must already hold the two K2.5 checkpoints from §2 alongside the DAPO-Math-17k training set (dapo-math-17k/dapo-math-17k.jsonl) and the AIME-2024 eval set (aime-2024.jsonl).
3.2 One-line launch
The script submits to an already-running Ray cluster (ray job submit --address http://127.0.0.1:8265); it does not run ray start --head itself. It also runs a pkill / ray stop cleanup pass at the top so a failed run can be re-launched cleanly.
3.3 Multi-node fan-out
Bring up Ray on every node before launching, the same way as the other Kimi recipes:4. Script breakdown
The launcher groups its flags into the arrays that are passed totrain.py. The model shape comes from MODEL_ARGS, which is sourced from scripts/models/kimi-k2-thinking.sh. That definition sets the MLA latent ranks (q_lora_rank=1536, kv_lora_rank=512, qk_head_dim=128, qk_pos_emb_head_dim=64, v_head_dim=128), the MoE routing (384 experts, top-8, sigmoid pre-softmax scoring, FP32 router, --moe-router-topk-scaling-factor 2.827), and RoPE (--rotary-base 50000, --rotary-scaling-factor 64.0). The K2.5 recipe then layers the following on top:
CKPT_ARGSwires up the dual checkpoint (INT4 actor via--hf-checkpoint, BF16 reference via--ref-load) together with--megatron-to-hf-mode bridgeand--model-name kimi_k25.ROLLOUT_ARGSandEVAL_ARGSconfigure GRPO sampling and periodic AIME evaluation (covered in §5.2).PERF_ARGSsets the parallelism layout and recomputation (§5.1).GRPO_ARGSandOPTIMIZER_ARGSset the algorithm and CPU-offloaded Adam (§5.2, §5.4).SGLANG_ARGSconfigures the colocated rollout engine (§5.3).
--colocate) across 32 nodes (--actor-num-nodes 32 --actor-num-gpus-per-node 8) and uses the miles router (--use-miles-router) with --update-weight-buffer-size $((4*512*1024*1024)).
5. Example Recipe Configuration
5.1 Megatron Parallelism
This is the validated layout shipped with the launcher. All parallelisms are supported, so you can supply any other TP / EP / PP / CP combination that fits your compute.| Hardware | Nodes × GPUs | TP | PP | CP | EP | expert-TP | --decoder-last-pipeline-num-layers | --max-tokens-per-gpu |
|---|---|---|---|---|---|---|---|---|
| H200 | 32 × 8 = 256 | 8 | 8 | 4 | 32 | 1 | 5 | 4096 |
--sequence-parallel) is on, and the trainer uses dynamic batching (--use-dynamic-batch-size) capped at --max-tokens-per-gpu 4096. Recomputation is full and uniform over a single layer:
5.2 Algorithm
The recipe uses GRPO with KL and entropy losses disabled:deepscaler reward:
5.3 Rollout & SGLang
The rollout engine is colocated with training, spanning 8 GPUs per engine with 8-way expert parallelism:--use-rollout-routing-replay flag replays the rollout-time MoE routing decisions during training so the two stages stay consistent. On the Megatron side, attention uses the Flash backend (--attention-backend flash).
The launcher sets the required env vars for you, including the INT4 QAT pair (OPEN_TRAINING_INT4_FAKE_QAT_FLAG=1, OPEN_TRAINING_INT4_GROUP_SIZE=32), a long NCCL timeout (NCCL_TIMEOUT=3600), CUDA_DEVICE_MAX_CONNECTIONS=1, and NVLink-gated NVLS (NCCL_NVLS_ENABLE follows the script’s NVLink autodetection).
5.4 Optimizer
CPU-offloaded Adam is combined with the distributed optimizer:--accumulate-allreduce-grads-in-fp32), and the attention softmax also runs in FP32 (--attention-softmax-in-fp32).
6. Pairs Well With
- INT4 QAT
- PD Disaggregation
- P2P Weight Transfer
- Fault Tolerance
- Kimi K2: sibling recipe; K2.5 reuses the K2-Thinking architecture.

