1. Model Introduction
Kimi-K2 is a state-of-the-art MoE language model from Moonshot AI with 32 B activated parameters and 1 T total parameters. Key highlights:
- Trillion-parameter MoE: 1 T total / 32 B active per token, 61 layers (1 dense + the rest MoE), MLA attention shaped like DeepSeek-V3.
- Instruct and Thinking variants: Instruct is the general-purpose chat / agentic post-train; Thinking adds step-by-step reasoning with a 256 K context and ships in native INT4.
- DeepSeek-V3-shaped architecture: miles loads it through the DeepSeek-V3 path (one sed away), reusing the conversion + parallelism plumbing.
- INT4 QAT target: Kimi-K2-Thinking is the canonical reference recipe for INT4 QAT in miles.
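To put the parameter counts above in perspective, the following is a rough back-of-the-envelope sketch of the raw weight footprint at different storage precisions. It assumes the rounded figures from the highlights (1 T total, 32 B active) and, as a simplification, that every parameter is stored at the same precision; real checkpoints typically mix precisions, so treat the numbers as order-of-magnitude estimates only.

```python
# Rounded figures from the highlights above; not exact parameter counts.
TOTAL_PARAMS = 10**12        # "1 T total"
ACTIVE_PARAMS = 32 * 10**9   # "32 B active per token"


def weight_bytes(num_params: int, bits_per_param: int) -> int:
    """Raw storage for num_params weights, ignoring scales and metadata."""
    return num_params * bits_per_param // 8


# Simplifying assumption: all weights share one precision.
footprints_tb = {
    name: weight_bytes(TOTAL_PARAMS, bits) / 1e12
    for name, bits in (("BF16", 16), ("FP8", 8), ("INT4", 4))
}

# Fraction of the weights touched per token by the MoE routing.
active_fraction = ACTIVE_PARAMS / TOTAL_PARAMS

for name, tb in footprints_tb.items():
    print(f"{name}: ~{tb:.2f} TB of weights")
print(f"active fraction per token: {active_fraction:.1%}")
```

Even under these simplifications, the estimate shows why the weights have to be sharded across many GPUs and why a lower-precision checkpoint matters for rollout.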
2. Supported Variants
| Model | Active / Total | HF ID |
|---|---|---|
| Kimi-K2-Instruct | 32 B / 1 T | moonshotai/Kimi-K2-Instruct |
| Kimi-K2-Thinking | 32 B / 1 T | moonshotai/Kimi-K2-Thinking |
3. Environment Setup
3.1 Required env vars
3.2 Download model + datasets
3.3 HF → Megatron torch_dist conversion
Convert across 4 nodes (mirror the DeepSeek-V3 procedure):
4. Launch
4.1 Quick start
ray job submit ...); neither runs ray start --head itself.
4.2 Multi-node fan-out
Bring up Ray on every node before launching:
5. Recipe Configuration
5.1 Parallelism
Identical for both Instruct and Thinking:
| TP | PP | CP | EP | expert-TP | decoder-last-pipeline-num-layers | max_tokens_per_gpu | GPUs |
|---|---|---|---|---|---|---|---|
| 8 | 8 | 4 | 32 | 1 | 5 | 16384 | 256 (32 × 8) |
--actor-num-nodes 32 --actor-num-gpus-per-node 8 --colocate --update-weight-buffer-size $((4*512*1024*1024)) to train.py.
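The parallelism numbers above can be sanity-checked with a little arithmetic. The sketch below assumes the usual Megatron-style decomposition (world size = TP × PP × CP × DP for dense layers, and expert-TP × EP × PP × expert-DP for MoE layers) and assumes that the layers not assigned to the last pipeline stage are split evenly over the remaining stages. Neither assumption is confirmed by this page; the helper names are illustrative and not part of miles.

```python
# Values copied from the parallelism table and flags above.
TP, PP, CP, EP, EXPERT_TP = 8, 8, 4, 32, 1
NUM_NODES, GPUS_PER_NODE = 32, 8
NUM_LAYERS = 61            # from the model introduction
LAST_STAGE_LAYERS = 5      # decoder-last-pipeline-num-layers


def replica_count(world_size: int, *parallel_dims: int) -> int:
    """Data-parallel replicas left after the given parallel dims (hypothetical helper)."""
    group = 1
    for dim in parallel_dims:
        group *= dim
    if world_size % group != 0:
        raise ValueError(f"{world_size} GPUs not divisible by group size {group}")
    return world_size // group


def layers_per_stage(num_layers: int, pp: int, last_stage_layers: int) -> list[int]:
    """Assumed layout: even split over the first pp-1 stages, fixed last stage."""
    remaining = num_layers - last_stage_layers
    if remaining % (pp - 1) != 0:
        raise ValueError("remaining layers do not split evenly")
    return [remaining // (pp - 1)] * (pp - 1) + [last_stage_layers]


world_size = NUM_NODES * GPUS_PER_NODE
dense_dp = replica_count(world_size, TP, PP, CP)
expert_dp = replica_count(world_size, EXPERT_TP, EP, PP)
stages = layers_per_stage(NUM_LAYERS, PP, LAST_STAGE_LAYERS)

# Same arithmetic as the shell expression $((4*512*1024*1024)).
buffer_bytes = 4 * 512 * 1024 * 1024

print(f"world size: {world_size}, dense DP: {dense_dp}, expert DP: {expert_dp}")
print(f"layers per pipeline stage: {stages}")
print(f"update-weight buffer: {buffer_bytes} bytes ({buffer_bytes / 2**30:.0f} GiB)")
```

Under these assumptions the 256 GPUs are fully consumed by model parallelism (a single data-parallel replica), the seven leading pipeline stages hold 8 layers each, and the weight-update buffer works out to 2 GiB.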
5.2 Algorithm
| Script | Advantage | TIS |
|---|---|---|
| Instruct | GRPO (--eps-clip 0.2 --eps-clip-high 0.28) | – |
| Thinking | GRPO (--eps-clip 0.2 --eps-clip-high 0.28) | --use-tis |
--use-kl-loss --kl-loss-coef 0.00 --kl-loss-type low_var_kl --entropy-coef 0.00.
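The two clip flags in the table describe an asymmetric clipping range. As a minimal sketch, assuming `--eps-clip` sets the lower bound `1 - 0.2` and `--eps-clip-high` sets the upper bound `1 + 0.28` on the policy ratio, the per-token objective and the group-relative advantage could look like the following. The function names are illustrative and this is not the miles implementation, which operates on batched tensors.

```python
import math


def group_relative_advantages(rewards: list[float], eps: float = 1e-6) -> list[float]:
    """Normalize rewards within one prompt's group of samples (GRPO-style)."""
    mean = sum(rewards) / len(rewards)
    variance = sum((r - mean) ** 2 for r in rewards) / len(rewards)
    std = math.sqrt(variance)
    return [(r - mean) / (std + eps) for r in rewards]


def clipped_objective(
    ratio: float,
    advantage: float,
    eps_clip: float = 0.2,        # value passed to --eps-clip
    eps_clip_high: float = 0.28,  # value passed to --eps-clip-high
) -> float:
    """PPO-style pessimistic objective with an asymmetric clip range."""
    clipped_ratio = min(max(ratio, 1.0 - eps_clip), 1.0 + eps_clip_high)
    return min(ratio * advantage, clipped_ratio * advantage)


advantages = group_relative_advantages([1.0, 0.0, 0.0, 1.0])
print(advantages)
print(clipped_objective(1.5, 1.0))   # ratio above the upper bound, positive advantage
print(clipped_objective(0.5, -1.0))  # ratio below the lower bound, negative advantage
```

The wider upper bound lets the probability of positively rewarded tokens grow a little further per update than a symmetric 0.2 range would, while the lower bound stays unchanged.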
Rollout shape (both):
5.3 Rollout & SGLang
Identical for both:

--moe-enable-deepep and --moe-token-dispatcher-type flex are on in the Instruct script but commented out in the Thinking script.
5.4 Optimizer
CPU Adam is enabled in both:
5.5 Notable quirks
- Instruct loads the BF16 HF checkpoint by default (FP8 commented out) and reads eval data from $BASE_DIR/rl_data/; Thinking loads the FP8 HF checkpoint by default and reads eval data from $BASE_DIR/.
- --global-batch-size 1024 is commented out in both scripts.

