Miles RL Training

Training

v1.0.0

About This Skill

Use case: miles, an enterprise-grade RL framework for large-scale model training. Summary: miles is a high-performance, enterprise-ready RL framework optimized for large-scale model post-training. Covers Claude Code, Cursor, and Windsurf workflows.

Features

miles: Enterprise-Grade RL for Large-Scale Model Training
Choose miles when you need:
Training 1TB+ MoE models (DeepSeek V3, Qwen3-MoE)
FP8 or INT4 quantization-aware training
Bit-wise identical train-inference alignment
Coluding
Updated: 5/5/2026

Skill Overview

Review the use cases, limitations, and installation path below before deciding whether to go deeper.

Core Value

miles-rl-training helps agents apply miles, a high-performance, enterprise-ready RL framework optimized for large-scale model post-training.

Supported Agent Types

Use case: miles, an enterprise-grade RL framework for large-scale model training.

Key Capabilities · Miles RL Training

Task fit: enterprise-grade RL for large-scale model training
Task fit: training 1TB+ MoE models (DeepSeek V3, Qwen3-MoE)

! Limitations and Requirements

  • If you need flexible backend swapping, use verl instead
  • Requires repository-specific context from the skill documentation

! Source Note

This page can still serve as an installation and lookup reference. Before relying on it, weigh the use cases and limitations above together with the upstream repository's documentation.

SKILL.md

miles: Enterprise-Grade RL for Large-Scale Model Training

miles is a high-performance, enterprise-ready RL framework optimized for large-scale model post-training. Built as a production fork of slime, it addresses critical challenges in MoE training stability, low-precision training, and train-inference alignment.

When to Use miles

Choose miles when you need:

  • Training 1TB+ MoE models (DeepSeek V3, Qwen3-MoE)
  • FP8 or INT4 quantization-aware training
  • Bit-wise identical train-inference alignment
  • Speculative RL for maximum throughput
  • Production stability with enterprise support

Consider alternatives when:

  • You want the research-grade original → use slime
  • You need flexible backend swapping → use verl
  • You want PyTorch-native abstractions → use torchforge

Key Features

Low-Precision Training

  • Unified FP8: End-to-end FP8 for both inference and training
  • INT4 QAT: 1TB models on single-machine VRAM (H200)
  • Rollout Routing Replay (R3): Bit-wise expert alignment for MoE

Performance Optimizations

  • Speculative RL: 25%+ rollout speedup with online SFT draft models
  • Zero-Copy Weight Sync: CUDA IPC zero-copy mapping
  • Partial Rollout: Recycle half-finished trajectories

Train-Inference Alignment

  • TIS/MIS: Truncated/Masked Importance Sampling for off-policy correction
  • Kernel-level optimization: FlashAttention-3, DeepGEMM integration
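The TIS idea can be sketched in a few lines (an illustrative toy, not miles's actual implementation; the function name, signature, and clipping rule are my assumptions, loosely mirroring the --tis-threshold flag documented below):

```python
import math

def tis_weights(train_logprobs, rollout_logprobs, threshold=0.9):
    """Per-token truncated importance weights (illustrative).

    The train/rollout probability ratio corrects for the rollout being
    slightly off-policy; truncating it at `threshold` keeps stale
    tokens from dominating the gradient.
    """
    return [
        min(math.exp(lp_t - lp_r), threshold)
        for lp_t, lp_r in zip(train_logprobs, rollout_logprobs)
    ]
```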

Installation

```bash
# Recommended: Docker
docker pull radixark/miles:latest
docker run --rm --gpus all --ipc=host --shm-size=16g \
  -it radixark/miles:latest /bin/bash

# From source
git clone https://github.com/radixark/miles.git
cd miles
pip install -r requirements.txt
pip install -e .
```

Quick Start

miles inherits slime's configuration system. Basic training:

```bash
python train.py \
  --advantage-estimator grpo \
  --model-name qwen3-30b-a3b \
  --hf-checkpoint /path/to/qwen3-30b-a3b-hf \
  --rollout-batch-size 512 \
  --n-samples-per-prompt 8
```

Workflow 1: Large MoE Training

Use this workflow for training large MoE models like DeepSeek V3 or Qwen3-MoE.

Prerequisites Checklist

  • H100/H200 GPUs with FP8 support
  • MoE model (DeepSeek V3, Qwen3-MoE)
  • Docker environment with miles

Step 1: Environment Setup

```bash
# FP8 block scaling (recommended for stability)
export NVTE_FP8_BLOCK_SCALING_FP32_SCALES=1
export CUDA_DEVICE_MAX_CONNECTIONS=1
```

Step 2: Configure Training

```bash
python train.py \
  --actor-num-gpus-per-node 8 \
  --rollout-num-gpus 8 \
  --hf-checkpoint /path/to/deepseek-v3 \
  --advantage-estimator grpo \
  --tensor-model-parallel-size 8 \
  --expert-model-parallel-size 4 \
  --prompt-data /path/to/data.jsonl \
  --num-rollout 3000
```

Verification Checklist

  • Model loads without errors
  • Routing decisions are consistent
  • No NaN/Inf in loss values
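The NaN/Inf item on the checklist can be automated with a small helper (a hypothetical utility for your own scripts, not a miles API):

```python
import math

def check_finite(loss_values):
    """Return indices of NaN/Inf losses; an empty list means healthy."""
    return [i for i, v in enumerate(loss_values) if not math.isfinite(v)]
```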

Workflow 2: Speculative RL Training

Use this workflow for maximum rollout throughput with EAGLE speculative decoding.

How Speculative RL Works

  1. Small draft model generates candidate tokens
  2. Target model verifies in parallel
  3. Draft model updated via online SFT to track policy
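The draft-then-verify loop above can be sketched as follows (a greedy-acceptance toy; `draft_propose` and `target_greedy` are hypothetical callables standing in for real model forward passes):

```python
def speculative_step(draft_propose, target_greedy, prefix, num_draft_tokens):
    """One draft-then-verify step (greedy-acceptance toy).

    draft_propose(prefix, n) -> n candidate tokens from the draft model
    target_greedy(prefix)    -> the token the target model would emit next
    """
    candidates = draft_propose(prefix, num_draft_tokens)
    accepted = []
    for tok in candidates:
        # Accept only while the target agrees with the draft.
        if target_greedy(prefix + accepted) != tok:
            break
        accepted.append(tok)
    # The target's verification always contributes one more token,
    # so at least one token is produced per step.
    accepted.append(target_greedy(prefix + accepted))
    return accepted
```

When the draft tracks the policy well, most candidates are accepted and several tokens are emitted per target forward pass, which is the source of the rollout speedup.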

Step 1: Enable Speculative Decoding

miles supports EAGLE speculative decoding via SGLang:

```bash
python train.py \
  --actor-num-gpus-per-node 8 \
  --hf-checkpoint /path/to/target-model \
  --sglang-speculative-algorithm EAGLE \
  --sglang-speculative-num-steps 3 \
  --sglang-speculative-eagle-topk 1 \
  --sglang-speculative-num-draft-tokens 4 \
  --sglang-speculative-draft-model-path /path/to/draft-model \
  --advantage-estimator grpo \
  --prompt-data /path/to/data.jsonl
```

Step 2: Enable Online MTP Training (Optional)

For online SFT of draft model during training:

```bash
--mtp-num-layers 1 \
--enable-mtp-training \
--mtp-loss-scaling-factor 0.2
```

Note: Online MTP training requires a torch dist checkpoint with MTP weights. Add --mtp-num-layers 1 during checkpoint conversion from HuggingFace.

Expected Speedup

  • Standard rollout: Baseline
  • Speculative RL: 25-40% faster rollout
  • With partial rollout: Additional 10-15% throughput
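These figures can be sanity-checked with a toy acceptance model (my own simplification, not a miles formula): assuming each draft token is accepted independently with a fixed probability, the expected tokens per target verification pass are

```python
def expected_tokens_per_verify(acceptance_rate, num_draft_tokens):
    """Expected tokens emitted per target verification pass, assuming
    each draft token is accepted independently with probability
    `acceptance_rate` (a simplification; real acceptance is sequential
    and position-dependent)."""
    expected, p = 1.0, 1.0  # the target always contributes one token
    for _ in range(num_draft_tokens):
        p *= acceptance_rate
        expected += p
    return expected
```

With --sglang-speculative-num-draft-tokens 4, an acceptance rate around 0.5 already yields roughly 1.9 tokens per verification pass under this model.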

Configuration Reference

miles inherits all slime arguments. See slime API Reference for the complete list.

Cluster Resources (from slime)

```bash
--actor-num-nodes 1
--actor-num-gpus-per-node 8
--rollout-num-gpus 8
--rollout-num-gpus-per-engine 2
--colocate
```

Megatron Parallelism (from slime)

```bash
--tensor-model-parallel-size 8
--pipeline-model-parallel-size 2
--expert-model-parallel-size 4  # MoE expert parallelism
```

Speculative Decoding (miles-specific)

```bash
--sglang-speculative-algorithm EAGLE
--sglang-speculative-num-steps 3
--sglang-speculative-eagle-topk 1
--sglang-speculative-num-draft-tokens 4
--sglang-enable-draft-weights-cpu-backup
--sglang-speculative-draft-model-path /your/draft/model/path
```

Online MTP Training (miles-specific)

```bash
--mtp-num-layers 1
--enable-mtp-training
--mtp-loss-scaling-factor 0.2
```

Key Features (Conceptual)

The following features are documented in miles but specific CLI flags may vary. Consult the miles repository for latest configuration.

Unified FP8 Pipeline

End-to-end FP8 sampling and training that eliminates quantization-induced discrepancy causing RL collapse in MoE models.
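To illustrate why block scaling (enabled above via NVTE_FP8_BLOCK_SCALING_FP32_SCALES) helps, here is a minimal sketch in pure Python with no actual FP8 cast; the function names and structure are my own, not miles's kernels:

```python
def fp8_block_quantize(values, block_size=4, fp8_max=448.0):
    """Per-block scaling sketch (pure Python, no real FP8 cast).

    Each block gets its own FP32 scale chosen so the block's largest
    magnitude maps to the FP8 E4M3 maximum (448); a single outlier
    then only costs precision within its own block.
    """
    scales, quantized = [], []
    for i in range(0, len(values), block_size):
        block = values[i:i + block_size]
        amax = max(abs(v) for v in block) or 1.0
        scale = fp8_max / amax
        scales.append(scale)
        # A real kernel would cast v * scale to FP8 here.
        quantized.append([v * scale for v in block])
    return quantized, scales

def fp8_block_dequantize(quantized, scales):
    """Undo the per-block scaling, flattening back to one list."""
    return [v / s for blk, s in zip(quantized, scales) for v in blk]
```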

Rollout Routing Replay (R3)

Records expert routing decisions during SGLang inference and replays them during Megatron training for bit-wise expert alignment.

How R3 Works:

  1. During SGLang inference, expert routing decisions are recorded
  2. Routing decisions stored in sample.rollout_routed_experts
  3. During Megatron training, routing is replayed instead of recomputed
  4. Ensures identical expert selection between train and inference
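The record/replay steps above can be sketched as a toy (the real implementation lives in miles's SGLang/Megatron integration; these names are illustrative):

```python
def topk_routing(scores, k):
    """Indices of the k highest-scoring experts for one token."""
    return sorted(range(len(scores)), key=lambda i: -scores[i])[:k]

def moe_routing(router_scores, k, routed_experts=None):
    """Compute-or-replay routing, one list of expert ids per token.

    Rollout: routed_experts is None, so routing is computed; the result
    is what would be recorded in sample.rollout_routed_experts.
    Training: the recorded decisions are passed back in and used as-is,
    so expert selection matches the rollout bit-for-bit even when the
    router scores drift slightly between kernels.
    """
    if routed_experts is None:
        routed_experts = [topk_routing(s, k) for s in router_scores]
    return routed_experts
```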

INT4 Quantization-Aware Training

Enables single-machine deployment of 1TB+ models (e.g., on H200).

Memory Savings with INT4:

| Model Size | BF16 VRAM | INT4 VRAM | Reduction |
|------------|-----------|-----------|-----------|
| 70B        | 140GB     | 45GB      | 3.1x      |
| 235B       | 470GB     | 150GB     | 3.1x      |
| 671B       | 1.3TB     | 420GB     | 3.1x      |
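The table's figures are consistent with a weights-only estimate of params × bits / 8, plus overhead for quantization scales and activations (which is why the reduction is ~3.1x rather than a full 4x). A quick back-of-the-envelope check:

```python
def weight_vram_gb(num_params_billion, bits_per_param):
    """Approximate weight-only memory footprint in GB."""
    return num_params_billion * 1e9 * bits_per_param / 8 / 1e9

bf16_671b = weight_vram_gb(671, 16)  # ~1342 GB, i.e. ~1.3 TB
int4_671b = weight_vram_gb(671, 4)   # ~335 GB before scale overhead
```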

Train-Inference Alignment

miles achieves "exactly 0 KL divergence" between training and inference through:

  • Flash Attention 3
  • DeepGEMM
  • Batch-invariant kernels from Thinking Machines Lab
  • torch.compile integration

Sample Data Structure

miles uses the same Sample dataclass as slime with the rollout_routed_experts field for MoE routing replay:

```python
@dataclass
class Sample:
    prompt: str | list[dict]
    tokens: list[int]
    response: str
    reward: float | dict
    loss_mask: list[int]
    status: Status
    metadata: dict
    rollout_log_probs: list[float]
    rollout_routed_experts: list[list[int]]  # MoE routing for R3
```

See slime API Reference for the complete Sample definition.


Common Issues and Solutions

Issue: FP8 Training Collapse

Symptoms: Loss explodes, NaN values

Solutions:

  • Use block scaling: export NVTE_FP8_BLOCK_SCALING_FP32_SCALES=1
  • Reduce learning rate: --lr 5e-7
  • Ensure MoE routing is consistent between train/inference

Issue: Speculative Draft Drift

Symptoms: Low acceptance rate over time

Solutions:

  • Enable online MTP training to keep draft model aligned
  • Reduce speculative steps: --sglang-speculative-num-steps 2
  • Use CPU backup: --sglang-enable-draft-weights-cpu-backup

Issue: Train-Inference Mismatch

Symptoms: Policy divergence, reward collapse

Solutions:

  • Use TIS for off-policy correction: --use-tis --tis-threshold 0.9
  • Verify log probs match between SGLang and Megatron
  • Enable R3 for MoE models

Supported Models

| Family   | Models                    | MoE Support |
|----------|---------------------------|-------------|
| DeepSeek | R1, V3, V3.2              | Full        |
| Qwen     | 2, 2.5, 3 (including MoE) | Full        |
| Llama    | 3, 3.1, 3.3, 4            | Dense only  |
| Gemma    | 2, 3, 3N                  | Dense only  |
| GLM      | 4.5, 4.6, 4.7             | Dense only  |
| MiniMax  | M2, M2.1                  | Full        |

Resources

FAQ and Installation Steps


Installation Steps

  1. Open a terminal

     Open a terminal or command line in your project directory.

  2. Run the install command

     Run: npx killer-skills add Coluding/generative-flow-adapters/miles-rl-training. The CLI detects your IDE or AI agent automatically and completes the configuration.

  3. Start using the skill

     Miles RL Training is now enabled and can be invoked immediately in the current project.

? FAQ

What is Miles RL Training?
A skill wrapping miles, an enterprise-grade RL framework for large-scale model training: a high-performance, enterprise-ready framework optimized for large-scale model post-training. Covers Claude Code, Cursor, and Windsurf workflows.

How do I install Miles RL Training?
Run: npx killer-skills add Coluding/generative-flow-adapters/miles-rl-training. It supports 19+ IDEs/agents, including Cursor, Windsurf, VS Code, and Claude Code.

Which scenarios does Miles RL Training fit?
Typical scenarios include enterprise-grade RL for large-scale model training, such as training 1TB+ MoE models (DeepSeek V3, Qwen3-MoE).

Which IDEs or agents does Miles RL Training support?
The skill is compatible with Cursor, Windsurf, VS Code, Trae, Claude Code, OpenClaw, Aider, Codex, OpenCode, Goose, Cline, Roo Code, Kiro, Augment Code, Continue, GitHub Copilot, Sourcegraph Cody, and Amazon Q Developer. The Killer-Skills CLI installs it anywhere with a single command.

What are Miles RL Training's limitations?
If you need flexible backend swapping, use verl instead; the skill also requires repository-specific context from the skill documentation.
