
Commit aff56ac

copy over
Signed-off-by: Brian Yu <[email protected]>
1 parent f67fa48 commit aff56ac


8 files changed: +608, -2 lines changed


docs/about/concepts/key-terminology.md

Lines changed: 2 additions & 2 deletions
@@ -85,8 +85,8 @@ Online vs Offline Training
 Multi-turn
 Conversations spanning multiple exchanges where context and state persist across turns.

-Multi-step
-Complex tasks requiring models to break problems into sequential steps, often using tools and intermediate reasoning.
+Multi-step
+Complex tasks requiring agents to break problems into sequential steps, often using tools and intermediate reasoning.

 Tool Use / Function Calling
 Models invoking external capabilities (APIs, calculators, databases) to accomplish tasks beyond text generation.

docs/index.md

Lines changed: 10 additions & 0 deletions
@@ -171,6 +171,16 @@ how-to-faq.md
 reference/cli-commands.md
 ```

+```{toctree}
+:caption: Training
+:hidden:
+:maxdepth: 1
+
+training/index
+training/rl-framework-integration/index.md
+```
+
 ```{toctree}
 :caption: Reference
 :hidden:

docs/training/index.md

Lines changed: 20 additions & 0 deletions
@@ -0,0 +1,20 @@
(training-index)=

# Training with NeMo Gym

Conceptual guides for training with NeMo Gym.

---

::::{grid} 1 1 1 1
:gutter: 1 1 1 2

:::{grid-item-card} {octicon}`workflow;1.5em;sd-mr-1` Integrate Gym into RL frameworks
:link: training-framework-integration
:link-type: ref
Implement NeMo Gym integration into a new training framework.
+++
{bdg-primary}`training` {bdg-secondary}`infra`
:::

::::

Lines changed: 94 additions & 0 deletions
@@ -0,0 +1,94 @@
(generation-backend-and-openai-compatible-http-server)=

# Generation Backend

Gym requires an OpenAI-compatible HTTP server to handle model generations during training. This page covers the server requirements and existing implementations across popular RL frameworks.

## OpenAI-Compatible Server Requirements

Gym communicates with generation backends using the OpenAI HTTP API specification. Your generation server must implement endpoints compatible with one of these reference implementations:

```{list-table}
:header-rows: 1
:widths: 30 70

* - Provider
  - Documentation
* - OpenAI API
  - [Responses API Reference](https://platform.openai.com/docs/api-reference/responses/create)
* - Gemini
  - [OpenAI Compatibility](https://ai.google.dev/gemini-api/docs/openai)
* - vLLM
  - [OpenAI-Compatible Server](https://docs.vllm.ai/en/latest/serving/openai_compatible_server/)
* - SGLang
  - [OpenAI-Compatible APIs](https://docs.sglang.io/basic_usage/openai_api.html)
* - TGI
  - [OpenAI Messages API](https://huggingface.co/docs/text-generation-inference/en/reference/api_reference#openai-messages-api)
```
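
To sanity-check that a server speaks this dialect, you can point the standard `openai` Python client at it. The snippet below is a minimal sketch, assuming a locally served vLLM or SGLang endpoint at `http://localhost:8000/v1`; the API key and model name are placeholders.

```python
from openai import OpenAI

# Point the standard OpenAI client at the local generation server.
# The base URL, API key, and model name are placeholders for illustration.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="Qwen/Qwen3-4B-Instruct-2507",  # whatever model the server is serving
    messages=[{"role": "user", "content": "What is 2 + 2?"}],
    max_tokens=64,
)
print(response.choices[0].message.content)
```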

## Generation in RL Training

Most RL frameworks that support policy optimization algorithms (PPO, GRPO) require online on-policy model generations. Integrating generation backends into the RL training loop introduces several challenges:

- **Refit**: Synchronizing model weights between training and generation
- **Off-policyness**: Ensuring generations reflect the current policy state
- **Latency**: Minimizing generation overhead during training iterations

## Existing Framework Implementations

The following table shows how popular RL frameworks implement generation backends.

:::{tip}
If your framework uses vLLM or SGLang, you can reference these implementations when adding OpenAI HTTP server support.
:::

```{list-table}
:header-rows: 1
:widths: 25 25 50

* - Framework
  - Generation Backend
  - Reference Implementation
* - NeMo RL
  - vLLM
  - [vllm_generation.py](https://github.com/NVIDIA-NeMo/RL/blob/a99bc262e5cde92575538c31ccacde27c60c3681/nemo_rl/models/generation/vllm/vllm_generation.py)
* - VeRL
  - HF, vLLM, SGLang
  - [hf_rollout.py](https://github.com/volcengine/verl/blob/fd893c788dbdb967c6eb62845b09a02e38819ac1/verl/workers/rollout/hf_rollout.py), [vLLM rollout](https://github.com/volcengine/verl/tree/fd893c788dbdb967c6eb62845b09a02e38819ac1/verl/workers/rollout/vllm_rollout), [SGLang rollout](https://github.com/volcengine/verl/tree/fd893c788dbdb967c6eb62845b09a02e38819ac1/verl/workers/rollout/sglang_rollout)
* - TRL
  - vLLM, HF
  - [grpo_trainer.py (vLLM)](https://github.com/huggingface/trl/blob/cbd90d4297a877587a07bdcd82f8fc87338efe5b/trl/trainer/grpo_trainer.py#L557), [grpo_trainer.py (HF)](https://github.com/huggingface/trl/blob/cbd90d4297a877587a07bdcd82f8fc87338efe5b/trl/trainer/grpo_trainer.py#L661)
* - Slime
  - SGLang
  - [sglang_engine.py](https://github.com/THUDM/slime/blob/0612652a8e6ed7fd670ecc29101d4ca877490bf6/slime/backends/sglang_utils/sglang_engine.py#L87)
* - OpenPIPE ART
  - vLLM
  - [vLLM module](https://github.com/OpenPipe/ART/tree/6273a6fa5457e87e696b1c3a5820292826684370/src/art/vllm)
```

NeMo RL, VeRL, Slime, and OpenPIPE ART all expose OpenAI-compatible HTTP server endpoints.
## Integration Guidelines
72+
73+
### Frameworks Using vLLM or SGLang
74+
75+
If your training framework already uses vLLM or SGLang but does not expose an OpenAI-compatible HTTP server:
76+
77+
1. Reference the implementations listed above
78+
2. Add server endpoints that follow the OpenAI API specification
79+
3. Test your implementation using the [vLLM HTTP server tests from NeMo RL](https://github.com/NVIDIA-NeMo/RL/blob/a99bc262e5cde92575538c31ccacde27c60c3681/tests/unit/models/generation/test_vllm_generation.py#L1079-L1247)
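
Before running the NeMo RL test suite, a quick smoke test over plain HTTP can catch basic wiring issues. This sketch assumes the server is reachable at `http://localhost:8000` and exposes the standard `/v1/models` and `/v1/chat/completions` routes that vLLM and SGLang provide.

```python
import requests

BASE = "http://localhost:8000"  # placeholder; point at your generation server

# The server should list at least one served model.
models = requests.get(f"{BASE}/v1/models", timeout=10).json()
assert models["data"], "no models reported by /v1/models"
model_id = models["data"][0]["id"]

# A minimal chat completion should return a choice with message content.
payload = {
    "model": model_id,
    "messages": [{"role": "user", "content": "Say hello."}],
    "max_tokens": 16,
}
reply = requests.post(f"{BASE}/v1/chat/completions", json=payload, timeout=60).json()
assert reply["choices"][0]["message"]["content"]
print("OpenAI-compatible endpoints look healthy")
```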

### Frameworks Using Other Backends

If your training framework does not use vLLM or SGLang as a generation backend, you may need significant refactoring to achieve proper Gym integration. Consider:

- Migrating to vLLM or SGLang for generation
- Implementing an adapter layer that exposes OpenAI-compatible endpoints (see the sketch below)
- Evaluating the complexity of maintaining a custom generation backend
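
The adapter-layer option can be fairly small if your backend already exposes a callable generate function. The sketch below is illustrative only: `my_backend.generate(...)` is a hypothetical entry point, FastAPI is one possible HTTP layer, and a production adapter would also need sampling parameters, token IDs, and streaming to be faithful to the specification.

```python
# Hypothetical adapter exposing a custom backend as /v1/chat/completions.
import time
import uuid

from fastapi import FastAPI
from pydantic import BaseModel

import my_backend  # assumption: your framework's existing generation entry point

app = FastAPI()


class Message(BaseModel):
    role: str
    content: str


class ChatRequest(BaseModel):
    model: str
    messages: list[Message]
    max_tokens: int = 256
    temperature: float = 1.0


@app.post("/v1/chat/completions")
def chat_completions(req: ChatRequest):
    # Delegate to the existing backend; prompt templating is elided here.
    text = my_backend.generate(
        [m.model_dump() for m in req.messages],
        max_tokens=req.max_tokens,
        temperature=req.temperature,
    )
    # Return the minimal OpenAI-shaped response a client expects.
    return {
        "id": f"chatcmpl-{uuid.uuid4().hex}",
        "object": "chat.completion",
        "created": int(time.time()),
        "model": req.model,
        "choices": [
            {
                "index": 0,
                "message": {"role": "assistant", "content": text},
                "finish_reason": "stop",
            }
        ],
        "usage": {"prompt_tokens": 0, "completion_tokens": 0, "total_tokens": 0},
    }
```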

## Related Topics

After setting up your generation backend, proceed to:

- {doc}`openai-compatible-http-server-on-policy-correction` - Required fixes for multi-step and multi-turn scenarios
- {doc}`gym-integration-footprint-and-form-factor` - Full integration component breakdown

Lines changed: 106 additions & 0 deletions
@@ -0,0 +1,106 @@
(gym-integration-footprint-and-form-factor)=

# Integration Footprint

This page provides a reference for the components required to integrate Gym into your training framework. Each component includes links to the NeMo RL reference implementation and corresponding tests.

## Integration Components

A complete Gym integration consists of five components, implemented in sequence:

```{list-table}
:header-rows: 1
:widths: 5 25 35 35

* -
  - Component
  - Implementation
  - Tests
* - 1
  - **OpenAI-Compatible HTTP Server**
  - [vllm_worker_async.py:264](https://github.com/NVIDIA-NeMo/RL/blob/64ab08df3edf25131959fc474b44ed5e36a1600b/nemo_rl/models/generation/vllm/vllm_worker_async.py#L264)
  - [test_vllm_generation.py:1107](https://github.com/NVIDIA-NeMo/RL/blob/64ab08df3edf25131959fc474b44ed5e36a1600b/tests/unit/models/generation/test_vllm_generation.py#L1107)
* - 2
  - **On-Policy Token ID Fixes**
  - [vllm_worker_async.py:40](https://github.com/NVIDIA-NeMo/RL/blob/64ab08df3edf25131959fc474b44ed5e36a1600b/nemo_rl/models/generation/vllm/vllm_worker_async.py#L40)
  - [test_vllm_generation.py:1250](https://github.com/NVIDIA-NeMo/RL/blob/64ab08df3edf25131959fc474b44ed5e36a1600b/tests/unit/models/generation/test_vllm_generation.py#L1250)
* - 3
  - **Gym Spinup and Integration**
  - [nemo_gym.py](https://github.com/NVIDIA-NeMo/RL/blob/64ab08df3edf25131959fc474b44ed5e36a1600b/nemo_rl/environments/nemo_gym.py)
  - [test_nemo_gym.py](https://github.com/NVIDIA-NeMo/RL/blob/64ab08df3edf25131959fc474b44ed5e36a1600b/tests/unit/environments/test_nemo_gym.py)
* - 4
  - **Rollout Orchestration**
  - [rollouts.py:975](https://github.com/NVIDIA-NeMo/RL/blob/64ab08df3edf25131959fc474b44ed5e36a1600b/nemo_rl/experience/rollouts.py#L975)
  - [test_rollouts.py:754](https://github.com/NVIDIA-NeMo/RL/blob/64ab08df3edf25131959fc474b44ed5e36a1600b/tests/unit/experience/test_rollouts.py#L754)
* - 5
  - **GRPO Train Loop Integration**
  - [grpo.py:1157](https://github.com/NVIDIA-NeMo/RL/blob/64ab08df3edf25131959fc474b44ed5e36a1600b/nemo_rl/algorithms/grpo.py#L1157)
  - End-to-end tests in progress
```

:::{note}
As of December 8, 2025, end-to-end tests for GRPO train loop integration are still being implemented in the NeMo RL repository.
:::

## Component Details

### 1. OpenAI-Compatible HTTP Server

**Purpose**: Expose your generation backend as an OpenAI-compatible endpoint.

**Prerequisites**: vLLM or SGLang generation backend.

**Reference**: Refer to {doc}`generation-backend-and-openai-compatible-http-server` for implementation guidance.

### 2. On-Policy Token ID Fixes

**Purpose**: Prevent train-generation mismatch in multi-step and multi-turn scenarios.

**Prerequisites**: OpenAI-compatible HTTP server.

**Reference**: Refer to {doc}`openai-compatible-http-server-on-policy-correction` for technical details.
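
To make the mismatch concrete: in a multi-turn or multi-step rollout, the next prompt embeds the assistant's previous reply, and re-tokenizing that text can produce different token IDs than the ones the policy actually sampled. The sketch below is a conceptual illustration only; the `response` object and its fields are hypothetical, and the real fix lives in the `vllm_worker_async.py` reference above.

```python
# Conceptual sketch: prefer the sampled token IDs over re-tokenized text
# when building the training sequence for the next turn.

def extend_training_sequence(history_ids, response):
    """Append one assistant turn to the token-ID history.

    `response` is a hypothetical object holding both the generated text
    and the token IDs the generation server actually sampled.
    """
    # Risky: re-tokenizing the text can silently diverge from what was sampled,
    # because the same string may merge or split into different tokens.
    # new_ids = tokenizer.encode(response.text, add_special_tokens=False)

    # Safe: carry forward the exact sampled IDs so the log-probs computed at
    # training time correspond to the tokens the policy actually generated.
    new_ids = response.sampled_token_ids
    return history_ids + list(new_ids)
```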

### 3. Gym Spinup and Integration

**Purpose**: Initialize and connect to Gym training environments.

**Key responsibilities**:

- Environment configuration loading
- Connection management
- State synchronization

### 4. Rollout Orchestration

**Purpose**: Coordinate rollout collection between the policy and Gym environments.

**Key responsibilities**:

- Batch rollout management
- Multi-step and multi-turn handling
- Token ID tracking for on-policy corrections
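
In shape, the orchestration is an agent loop per rollout. The outline below is a simplified, hypothetical sketch: the `env.reset()`/`env.step()` interface and the OpenAI-style client are stand-ins rather than Gym's or NeMo RL's actual APIs, and it only shows where multi-step handling and token-ID tracking fit.

```python
# Simplified, hypothetical rollout loop for a single episode.
def collect_rollout(client, env, model_id, max_steps=8):
    messages = env.reset()          # initial chat messages from the environment
    trajectory, total_reward = [], 0.0
    for _ in range(max_steps):
        completion = client.chat.completions.create(
            model=model_id,
            messages=messages,
            max_tokens=512,
        )
        assistant = completion.choices[0].message
        # Record what was generated; with the on-policy fixes in place, the
        # sampled token IDs travel alongside the text for training.
        trajectory.append({"prompt": list(messages), "assistant": assistant})

        # The environment handles tool execution and next-turn construction.
        messages, reward, done = env.step(assistant)
        total_reward += reward
        if done:
            break
    return trajectory, total_reward
```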

### 5. GRPO Train Loop Integration

**Purpose**: Integrate Gym rollouts into the policy optimization training loop.

**Key responsibilities**:

- Rollout scheduling within training iterations
- Loss calculation with Gym-generated experiences
- Weight synchronization between training and generation
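
At a high level, the train loop alternates between collecting Gym rollouts through the generation server, updating the policy, and refitting the server with the new weights. The skeleton below is a hedged outline under those assumptions; `collect_rollouts`, `compute_grpo_loss`, and `refit_generation_server` are placeholder names, not NeMo RL APIs (see the `grpo.py` reference above for the real implementation).

```python
# Hypothetical skeleton of a GRPO-style loop driven by Gym rollouts.
for step in range(num_training_steps):
    # 1. Rollout scheduling: gather a batch of grouped rollouts from Gym
    #    environments via the OpenAI-compatible generation server.
    groups = collect_rollouts(gym_envs, generation_client, prompts_per_step)

    # 2. Loss calculation: group-relative advantages over each prompt's
    #    rollouts, then the policy loss on the sampled token IDs.
    loss = compute_grpo_loss(policy, groups)
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()

    # 3. Refit: push updated weights to the generation backend so the next
    #    batch of rollouts stays on-policy.
    refit_generation_server(policy, generation_client)
```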

## Implementation Checklist

Use this checklist to track your integration progress:

- [ ] OpenAI-compatible HTTP server implemented and tested
- [ ] On-policy token ID fixes implemented and tested
- [ ] Gym spinup and environment connection working
- [ ] Rollout orchestration handling multi-step/multi-turn scenarios
- [ ] GRPO (or equivalent) train loop integration complete

## Related Topics

- {doc}`gym-rl-framework-integration-success-criteria` - Validate your integration
- {doc}`generation-backend-and-openai-compatible-http-server` - Generation backend setup

Lines changed: 84 additions & 0 deletions
@@ -0,0 +1,84 @@
(gym-rl-framework-integration-success-criteria)=

# Success Criteria

Use these criteria to validate that your Gym integration is working correctly. A successful integration must pass all validation benchmarks.

:::{tip}
These success criteria may evolve as new integration challenges are discovered. Check this page for updates when troubleshooting integration issues.
:::

## Validation Checklist

### 1. Component Form Factor

Verify that your integration implements all required components as specified in {doc}`gym-integration-footprint-and-form-factor`:

- [ ] OpenAI-compatible HTTP server
- [ ] On-policy token ID fixes
- [ ] Gym spinup and integration
- [ ] Rollout orchestration
- [ ] Training loop integration

### 2. Environment Configuration

Verify that your integration can load and run arbitrary Gym training environments through configuration:

- [ ] Environment configuration loads from YAML
- [ ] Multiple environments can be selected at runtime
- [ ] Environment parameters are configurable without code changes
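
One way to check the last point is that switching environments amounts to swapping a YAML path on the command line rather than editing framework code. The snippet below is a generic, hypothetical illustration; the `--env-config` flag and the way the parsed config is consumed are placeholders, not Gym's actual schema (the DAPO17k YAML linked in the next section is an example of a real environment config).

```python
# Hypothetical illustration: environment choice driven purely by configuration.
import argparse

import yaml

parser = argparse.ArgumentParser()
parser.add_argument("--env-config", required=True, help="path to a Gym environment YAML")
args = parser.parse_args()

with open(args.env_config) as f:
    env_cfg = yaml.safe_load(f)

# The trainer only sees a parsed dict; no environment-specific code paths here.
print(f"Launching training with environment config: {args.env_config}")
print(f"Top-level keys: {sorted(env_cfg)}")
```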

### 3. Math Reasoning Benchmark

Train on the DAPO17k math training environment and verify model improvement on AIME24.

```{list-table}
:header-rows: 1
:widths: 25 75

* - Parameter
  - Value
* - Training environment
  - [DAPO17k math environment](https://github.com/NVIDIA-NeMo/Gym/blob/299e8c04f4a3bbf0f6069139092225f2fe3aa70f/resources_servers/math_with_judge/configs/bytedtsinghua_dapo17k.yaml)
* - Base model
  - [Qwen3-4B-Instruct-2507](https://huggingface.co/Qwen/Qwen3-4B-Instruct-2507)
* - Minimum training steps
  - 1,000
* - Validation set
  - AIME24 (included with training environment)
* - Target accuracy
  - ≥85%
```

### 4. Workplace Assistant Benchmark

Train on the workplace assistant environment and verify validation set improvements.

```{list-table}
:header-rows: 1
:widths: 25 75

* - Parameter
  - Value
* - Training environment
  - [Workplace assistant environment](https://github.com/NVIDIA-NeMo/Gym/tree/299e8c04f4a3bbf0f6069139092225f2fe3aa70f/resources_servers/workplace_assistant)
* - Base model
  - [Qwen3-4B-Instruct-2507](https://huggingface.co/Qwen/Qwen3-4B-Instruct-2507)
* - Minimum training steps
  - 100
* - Success criterion
  - Observable validation set improvement
```

## Troubleshooting

If your integration fails to meet the success criteria:

1. **Training crashes**: Check for off-policy issues. Refer to {doc}`openai-compatible-http-server-on-policy-correction`
2. **No improvement**: Verify rollout orchestration is correctly tracking token IDs
3. **Environment errors**: Verify OpenAI-compatible HTTP server endpoints match the specification

## Related Topics

- {doc}`gym-integration-footprint-and-form-factor` - Required integration components
- {doc}`openai-compatible-http-server-on-policy-correction` - On-policy training fixes
