The five objects
| Object | Role | Lives in |
|---|---|---|
| Prompt dataset | Source of input examples | JSONL on disk (or --data-source-path) |
| Rollout (SGLang engines) | Generates responses given prompts | One or more SGLang servers behind a router |
| Reward model | Maps (prompt, response, label) → score | Built-in (--rm-type) or custom (--custom-rm-path) |
| Actor (Megatron / FSDP) | The model being trained | Megatron torch_dist checkpoint, or HF directory under FSDP |
| Reference | Frozen copy of the actor for KL anchoring | Loaded from --ref-load, never updated |
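
As a rough illustration of the reward model's contract in the table above, the following Python sketch maps a (prompt, response, label) triple to a score. The function name `exact_match_reward` and its last-line matching rule are hypothetical. The interface that --custom-rm-path actually expects is not described on this page, so treat this as the shape of the mapping rather than a drop-in plugin.

```python
def exact_match_reward(prompt: str, response: str, label: str) -> float:
    """Score 1.0 if the last non-empty line of the response equals the label.

    Hypothetical example: the prompt is accepted only to mirror the
    (prompt, response, label) signature and is not used in the score.
    """
    # Keep only non-empty lines, stripped of surrounding whitespace.
    lines = [line.strip() for line in response.splitlines() if line.strip()]
    # Treat the final non-empty line as the model's answer.
    answer = lines[-1] if lines else ""
    return 1.0 if answer == label.strip() else 0.0
```

A binary score like this is the simplest case. A graded reward would return any float on whatever scale the training recipe assumes.
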
The training loop
The four-knob invariant
Two knobs govern the sampling half of the loop, two govern the training half, and they are locked into a single equation:
Where every flag goes
Use this map when reading any launch script:

| Argument group | Concerns |
|---|---|
| MODEL_ARGS | Architecture constants (layers, hidden size, rotary base, …) |
| CKPT_ARGS | Filesystem paths for the actor / reference / save directory |
| ROLLOUT_ARGS | Prompt dataset, batch knobs, sampling parameters, reward type |
| EVAL_ARGS | Eval dataset, cadence, sampling overrides for evaluation |
| PERF_ARGS | Parallelism (TP/PP/CP/EP/ETP), recomputation, dynamic batching |
| GRPO_ARGS | RL algorithm, KL, clipping, entropy bonus, advantage estimator |
| OPTIMIZER_ARGS | Learning rate, schedule, weight decay, Adam betas |
| SGLANG_ARGS | Engine TP, memory fraction, log level, --sglang-* passthrough |
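
One way to read the map above is as a lookup from flag to group. The sketch below is written in Python rather than the launch script's own shell, and it assumes each group is a flat list of command-line tokens. The group contents, the placeholder values, and the helpers `group_of` and `build_command` are all hypothetical. Only flags named on this page are used.

```python
# Hypothetical, heavily abbreviated groups. Values are placeholders, and only
# flags mentioned on this page appear here.
ARG_GROUPS = {
    "CKPT_ARGS": ["--ref-load", "/path/to/reference"],
    "ROLLOUT_ARGS": ["--rm-type", "<rm-type>"],
    "SGLANG_ARGS": [],
}


def group_of(flag, groups=ARG_GROUPS):
    """Return the name of the group a flag belongs to, or None if unknown."""
    # Per the table, --sglang-* passthrough flags belong to SGLANG_ARGS.
    if flag.startswith("--sglang-"):
        return "SGLANG_ARGS"
    for name, tokens in groups.items():
        if flag in tokens:
            return name
    return None


def build_command(entrypoint, groups=ARG_GROUPS):
    """Concatenate every group, in insertion order, after the entrypoint."""
    argv = ["python", entrypoint]
    for tokens in groups.values():
        argv.extend(tokens)
    return argv
```

Keeping each group as its own list mirrors the launch-script convention of one array per concern, so a flag that looks out of place is easy to spot.
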
Next
- Training Backend — Megatron-LM, parallelism, checkpoints, and hooks.
- Argument Groups — where each launch-script array belongs.
- Training Script Walkthrough — the launch script group by group, plus execution modes (colocation, sync/async, dynamic sampling, …).
- CLI Reference — every flag, grouped and fully cataloged.

