2 changes: 1 addition & 1 deletion .github/workflows/main.yml
@@ -75,7 +75,7 @@ jobs:
- uses: actions/checkout@v4
- name: Run tests
run: |
pip install -e . --extra-index-url http://pyp.open3dv.site:2345/simple/ --trusted-host pyp.open3dv.site
pip install -e .[lerobot] --extra-index-url http://pyp.open3dv.site:2345/simple/ --trusted-host pyp.open3dv.site
echo "Unit test Start"
export HF_ENDPOINT=https://hf-mirror.com
pip uninstall pymeshlab -y
3 changes: 1 addition & 2 deletions configs/agents/rl/push_cube/gym_config.json
@@ -113,9 +113,8 @@
}
},
"extensions": {
"obs_mode": "state",
"action_type": "delta_qpos",
"episode_length": 100,
"joint_limits": 0.5,
"action_scale": 0.1,
"success_threshold": 0.1
}
6 changes: 3 additions & 3 deletions configs/agents/rl/push_cube/train_config.json
@@ -4,15 +4,15 @@
"gym_config": "configs/agents/rl/push_cube/gym_config.json",
"seed": 42,
"device": "cuda:0",
"headless": false,
"headless": true,
"enable_rt": false,
"gpu_id": 0,
"num_envs": 8,
"num_envs": 64,
"iterations": 1000,
"rollout_steps": 1024,
"eval_freq": 200,
"save_freq": 200,
"use_wandb": true,
"use_wandb": false,
"wandb_project_name": "embodychain-push_cube",
"events": {
"eval": {
104 changes: 80 additions & 24 deletions docs/source/overview/gym/env.md
@@ -5,14 +5,19 @@

The {class}`~envs.EmbodiedEnv` is the core environment class in EmbodiChain designed for complex Embodied AI tasks. It adopts a **configuration-driven** architecture, allowing users to define robots, sensors, objects, lighting, and automated behaviors (events) purely through configuration classes, minimizing the need for boilerplate code.

For **Reinforcement Learning** tasks, EmbodiChain provides {class}`~envs.RLEnv`, a specialized subclass that extends {class}`~envs.EmbodiedEnv` with RL-specific utilities such as flexible action preprocessing, goal management, and standardized info structure.

## Core Architecture

Unlike the standard {class}`~envs.BaseEnv`, the {class}`~envs.EmbodiedEnv` integrates several manager systems to handle the complexity of simulation:
EmbodiChain provides a hierarchy of environment classes for different task types:

* **Scene Management**: Automatically loads and manages robots, sensors, and scene objects defined in the configuration.
* **Event Manager**: Handles automated behaviors such as domain randomization, scene setup, and dynamic asset swapping.
* **Observation Manager**: Allows flexible extension of observation spaces without modifying the environment code.
* **Dataset Manager**: Built-in support for collecting demonstration data during simulation steps.
* **{class}`~envs.BaseEnv`**: Minimal environment for simple tasks with custom simulation logic.
* **{class}`~envs.EmbodiedEnv`**: Feature-rich environment for Embodied AI tasks (IL, custom control). Integrates manager systems:
* **Scene Management**: Automatically loads and manages robots, sensors, and scene objects.
* **Event Manager**: Domain randomization, scene setup, and dynamic asset swapping.
* **Observation Manager**: Flexible observation space extensions.
* **Dataset Manager**: Built-in support for demonstration data collection.
* **{class}`~envs.RLEnv`**: Specialized environment for RL tasks, extending {class}`~envs.EmbodiedEnv` with action preprocessing, goal management, and standardized reward/info structure.

## Configuration System

@@ -77,7 +82,7 @@ The {class}`~envs.EmbodiedEnvCfg` class exposes the following additional paramet
Dataset collection settings. Defaults to None, in which case no dataset collection is performed. Please refer to the {class}`~envs.managers.DatasetManager` class for more details.

* **extensions** (Union[Dict[str, Any], None]):
Task-specific extension parameters that are automatically bound to the environment instance. This allows passing custom parameters (e.g., ``episode_length``, ``obs_mode``, ``action_scale``) without modifying the base configuration class. These parameters are accessible as instance attributes after environment initialization. For example, if ``extensions = {"episode_length": 500}``, you can access it via ``self.episode_length``. Defaults to None.
Task-specific extension parameters that are automatically bound to the environment instance. This allows passing custom parameters (e.g., ``episode_length``, ``action_type``, ``action_scale``) without modifying the base configuration class. These parameters are accessible as instance attributes after environment initialization. For example, if ``extensions = {"episode_length": 500}``, you can access it via ``self.episode_length``. Defaults to None.

* **filter_visual_rand** (bool):
Whether to filter out visual randomization functors. Useful for debugging motion and physics issues when visual randomization interferes with the debugging process. Defaults to ``False``.
@@ -108,7 +113,8 @@ class MyTaskEnvCfg(EmbodiedEnvCfg):
    # 4. Task Extensions
    extensions = {  # Task-specific parameters
        "episode_length": 500,
        "obs_mode": "state",
        "action_type": "delta_qpos",
        "action_scale": 0.1,
    }
```
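
After initialization, each extension key becomes an attribute of the environment instance. A minimal sketch of reading them back, assuming a hypothetical ``MyTaskEnv`` environment class built from the config above:

```python
env = MyTaskEnv(cfg=MyTaskEnvCfg())

# Extension parameters are bound as instance attributes during initialization.
print(env.episode_length)  # 500
print(env.action_type)     # "delta_qpos"
print(env.action_scale)    # 0.1
```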

@@ -165,54 +171,104 @@ The manager operates in a single mode ``"save"`` which handles both recording an

The dataset manager is called automatically during {meth}`~envs.Env.step()`, ensuring all observation-action pairs are recorded without additional user code.

## Reinforcement Learning Environment

For RL tasks, EmbodiChain provides {class}`~envs.RLEnv`, a specialized base class that extends {class}`~envs.EmbodiedEnv` with RL-specific utilities:

* **Action Preprocessing**: Flexible action transformation supporting delta_qpos, absolute qpos, joint velocity, joint force, and end-effector pose (with IK).
* **Goal Management**: Built-in goal pose tracking and visualization with axis markers.
* **Standardized Info Structure**: Template methods for computing task-specific success/failure conditions and metrics.
* **Episode Management**: Configurable episode length and truncation logic.

### Configuration Extensions for RL

RL environments use the ``extensions`` field to pass task-specific parameters:

```python
extensions = {
    "action_type": "delta_qpos",   # Action type: delta_qpos, qpos, qvel, qf, eef_pose
    "action_scale": 0.1,           # Scaling factor applied to all actions
    "episode_length": 100,         # Maximum episode length
    "success_threshold": 0.1,      # Task-specific success threshold (optional)
}
```
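
The configured ``action_type`` also determines the key of the action dictionary passed to {meth}`~envs.Env.step()`; the PPO rollout, for example, wraps the policy output as ``{action_type: actions}`` before stepping the environment. A minimal sketch of stepping an already-constructed ``RLEnv`` with these extensions; the joint dimension below is a placeholder, and ``num_envs`` and ``action_type`` are assumed to be exposed as instance attributes (the latter is bound from ``extensions``):

```python
import torch

num_dofs = 7  # hypothetical joint dimension of the configured robot
actions = torch.zeros(env.num_envs, num_dofs)  # one action row per parallel environment

# Actions are passed as a dict keyed by the configured action type.
obs, reward, terminated, truncated, info = env.step({env.action_type: actions})
```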

## Creating a Custom Task

To create a new task, inherit from {class}`~envs.EmbodiedEnv` and implement the task-specific logic.
### For Reinforcement Learning Tasks

Inherit from {class}`~envs.RLEnv` and implement the task-specific logic:

```python
from embodichain.lab.gym.envs import RLEnv, EmbodiedEnvCfg
from embodichain.lab.gym.utils.registration import register_env
import torch

@register_env("MyRLTask-v0", max_episode_steps=100)
class MyRLTaskEnv(RLEnv):
    def __init__(self, cfg: MyTaskEnvCfg, **kwargs):
        super().__init__(cfg, **kwargs)

    def compute_task_state(self, **kwargs):
        # Required: Compute task-specific success/failure and metrics
        # Returns: Tuple[success, fail, metrics]
        #   - success: torch.Tensor of shape (num_envs,) with boolean values
        #   - fail: torch.Tensor of shape (num_envs,) with boolean values
        #   - metrics: Dict of metric tensors for logging

        is_success = ...  # Compute success condition
        is_fail = torch.zeros_like(is_success)
        metrics = {"distance": ..., "angle_error": ...}

        return is_success, is_fail, metrics

    def check_truncated(self, obs, info):
        # Optional: Override to add custom truncation conditions
        # Default: episode_length timeout
        is_timeout = super().check_truncated(obs, info)
        is_fallen = ...  # Custom condition (e.g., robot fell)
        return is_timeout | is_fallen
```

Configure rewards through the {class}`~envs.managers.RewardManager` in your environment config rather than overriding ``get_reward``.

### For Imitation Learning Tasks

Inherit from {class}`~envs.EmbodiedEnv` for IL tasks:

```python
from embodichain.lab.gym.envs import EmbodiedEnv, EmbodiedEnvCfg
from embodichain.lab.gym.utils.registration import register_env

@register_env("MyTask-v0", max_episode_steps=500)
class MyTaskEnv(EmbodiedEnv):
@register_env("MyILTask-v0", max_episode_steps=500)
class MyILTaskEnv(EmbodiedEnv):
    def __init__(self, cfg: MyTaskEnvCfg, **kwargs):
        super().__init__(cfg, **kwargs)

    def create_demo_action_list(self, *args, **kwargs):
        # Optional: Implement for expert demonstration data generation (for Imitation Learning)
        # This method is used to generate scripted demonstrations for IL data collection.
        # Required: Generate scripted demonstrations for data collection
        # Must set self.action_length = len(action_list) if returning actions
        pass

    def is_task_success(self, **kwargs):
        # Optional: Define success criteria (mainly for IL data collection)
        # Required: Define success criteria for filtering successful episodes
        # Returns: torch.Tensor of shape (num_envs,) with boolean values
        return success_tensor

    def get_reward(self, obs, action, info):
        # Optional: Override for RL tasks
        # Returns: torch.Tensor of shape (num_envs,)
        return super().get_reward(obs, action, info)

    def get_info(self, **kwargs):
        # Optional: Override to add custom info fields
        # Should include "success" and "fail" keys for termination
        info = super().get_info(**kwargs)
        info["custom_metric"] = ...
        return info
```

```{note}
The {meth}`~envs.EmbodiedEnv.create_demo_action_list` method is specifically designed for expert demonstration data generation in Imitation Learning scenarios. For Reinforcement Learning tasks, you should override the {meth}`~envs.EmbodiedEnv.get_reward` method instead.
```

For a complete example of a modular environment setup, please refer to the {ref}`tutorial_modular_env` tutorial.

## See Also

- {ref}`tutorial_create_basic_env` - Creating basic environments
- {ref}`tutorial_modular_env` - Advanced modular environment setup
- {doc}`/api_reference/embodichain/embodichain.lab.gym.envs` - Complete API reference for EmbodiedEnv and EmbodiedEnvCfg
- {ref}`tutorial_rl` - Reinforcement learning training guide
- {doc}`/api_reference/embodichain/embodichain.lab.gym.envs` - Complete API reference for EmbodiedEnv, RLEnv, and configurations

```{toctree}
:maxdepth: 1
70 changes: 56 additions & 14 deletions docs/source/tutorial/rl.rst
@@ -78,6 +78,13 @@ The ``env`` section defines the task environment:
- **id**: Environment registry ID (e.g., "PushCubeRL")
- **cfg**: Environment-specific configuration parameters

For RL environments (inheriting from ``RLEnv``), use the ``extensions`` field for RL-specific parameters:

- **action_type**: Action type - "delta_qpos" (default), "qpos", "qvel", "qf", "eef_pose"
- **action_scale**: Scaling factor applied to all actions (default: 1.0)
- **episode_length**: Maximum episode length (default: 1000)
- **success_threshold**: Task-specific success threshold (optional)

Example:

.. code-block:: json
@@ -86,10 +93,12 @@
"id": "PushCubeRL",
"cfg": {
"num_envs": 4,
"obs_mode": "state",
"episode_length": 100,
"action_scale": 0.1,
"success_threshold": 0.1
"extensions": {
"action_type": "delta_qpos",
"action_scale": 0.1,
"episode_length": 100,
"success_threshold": 0.1
}
}
}

@@ -321,41 +330,74 @@ Adding a New Environment

To add a new RL environment:

1. Create an environment class inheriting from ``EmbodiedEnv``
2. Register it with the Gymnasium registry:
1. Create an environment class inheriting from ``RLEnv`` (which provides action preprocessing, goal management, and standardized info structure):

.. code-block:: python

    from embodichain.lab.gym.envs import RLEnv, EmbodiedEnvCfg
    from embodichain.lab.gym.utils.registration import register_env
    import torch

    @register_env("MyTaskRL", max_episode_steps=100, override=True)
    class MyTaskEnv(EmbodiedEnv):
        cfg: MyTaskEnvCfg
        ...
    class MyTaskEnv(RLEnv):
        def __init__(self, cfg: EmbodiedEnvCfg = None, **kwargs):
            super().__init__(cfg, **kwargs)

        def compute_task_state(self, **kwargs):
            """Compute success/failure conditions and metrics."""
            is_success = ...  # Define success condition
            is_fail = torch.zeros_like(is_success)
            metrics = {"distance": ..., "error": ...}
            return is_success, is_fail, metrics

        def check_truncated(self, obs, info):
            """Optional: Add custom truncation conditions."""
            is_timeout = super().check_truncated(obs, info)
            # Add custom conditions if needed
            return is_timeout

3. Use the environment ID in your JSON config:
2. Configure the environment in your JSON config with RL-specific extensions:

.. code-block:: json

"env": {
"id": "MyTaskRL",
"cfg": {
...
"num_envs": 4,
"extensions": {
"action_type": "delta_qpos",
"action_scale": 0.1,
"episode_length": 100,
"success_threshold": 0.05
}
}
}

The ``RLEnv`` base class provides:

- **Action Preprocessing**: Automatically handles different action types (delta_qpos, qpos, qvel, qf, eef_pose)
- **Action Scaling**: Applies ``action_scale`` to all actions
- **Goal Management**: Built-in goal pose tracking and visualization
- **Standardized Info**: Implements ``get_info()`` using the ``compute_task_state()`` template method (see the sketch below)
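
A minimal sketch of the per-step ``info`` layout implied by the template methods and the best practices below; the concrete metric and reward-component names are hypothetical:

.. code-block:: python

    import torch

    num_envs = 4
    info = {
        "success": torch.zeros(num_envs, dtype=torch.bool),   # from compute_task_state()
        "fail": torch.zeros(num_envs, dtype=torch.bool),      # from compute_task_state()
        "metrics": {"distance": torch.zeros(num_envs)},       # logged densely by the trainer
        "rewards": {"reach": torch.zeros(num_envs)},          # individual reward components
    }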

Best Practices
~~~~~~~~~~~~~~

- **Device Management**: Device is single-sourced from ``runtime.cuda``. All components (trainer/algorithm/policy/env) share the same device.
- **Use RLEnv for RL Tasks**: Always inherit from ``RLEnv`` for reinforcement learning tasks. It provides action preprocessing, goal management, and standardized info structure out of the box.

- **Action Type Configuration**: Configure ``action_type`` in the environment's ``extensions`` field. The default is "delta_qpos" (incremental joint positions). Other options: "qpos" (absolute), "qvel" (velocity), "qf" (force), "eef_pose" (end-effector pose with IK).

- **Action Scaling**: Keep action scaling in the environment, not in the policy.
- **Action Scaling**: Use ``action_scale`` in the environment's ``extensions`` field to scale actions. This is applied in ``RLEnv._preprocess_action()`` before robot control.

- **Device Management**: Device is single-sourced from ``runtime.cuda``. All components (trainer/algorithm/policy/env) share the same device.

- **Observation Format**: Environments should provide consistent observation shape/types (torch.float32) and a single ``done = terminated | truncated``.

- **Algorithm Interface**: Algorithms must implement ``initialize_buffer()``, ``collect_rollout()``, and ``update()`` methods. The algorithm completely controls data collection and buffer management. A minimal skeleton is sketched after this list.

- **Reward Components**: Organize reward components in ``info["rewards"]`` dictionary and metrics in ``info["metrics"]`` dictionary. The trainer performs dense per-step logging directly from environment info.
- **Reward Configuration**: Use the ``RewardManager`` in your environment config to define reward components. Organize reward components in ``info["rewards"]`` dictionary and metrics in ``info["metrics"]`` dictionary. The trainer performs dense per-step logging directly from environment info.

- **Template Methods**: Override ``compute_task_state()`` to define success/failure conditions and metrics. Override ``check_truncated()`` for custom truncation logic.

- **Configuration**: Use JSON for all hyperparameters. This makes experiments reproducible and easy to track.
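
A minimal skeleton of that algorithm interface; the method names follow the list above, while the argument lists and bodies are assumptions:

.. code-block:: python

    class MyAlgorithm:
        """Sketch only: the trainer calls these three methods."""

        def initialize_buffer(self) -> None:
            # Allocate whatever rollout storage the algorithm owns.
            self.buffer = []

        def collect_rollout(self, env, policy) -> dict:
            # The algorithm controls data collection, e.g. stepping the env
            # with {action_type: actions} action dictionaries.
            return {}

        def update(self) -> dict:
            # Consume the buffer and return statistics for logging.
            return {}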

6 changes: 5 additions & 1 deletion embodichain/agents/rl/algo/ppo.py
@@ -94,8 +94,12 @@ def collect_rollout(
current_obs, deterministic=False
)

# Wrap action as dict for env processing
action_type = getattr(env, "action_type", "delta_qpos")
action_dict = {action_type: actions}

# Step environment
result = env.step(actions)
result = env.step(action_dict)
next_obs, reward, terminated, truncated, env_info = result
done = terminated | truncated
# Light dtype normalization
34 changes: 31 additions & 3 deletions embodichain/agents/rl/train.py
@@ -39,12 +39,20 @@
from embodichain.lab.gym.envs.managers.cfg import EventCfg


def main():
def parse_args():
"""Parse command line arguments."""
parser = argparse.ArgumentParser()
parser.add_argument("--config", type=str, required=True, help="Path to JSON config")
args = parser.parse_args()
return parser.parse_args()


def train_from_config(config_path: str):
"""Run training from a config file path.

with open(args.config, "r") as f:
Args:
config_path: Path to the JSON config file
"""
with open(config_path, "r") as f:
cfg_json = json.load(f)

trainer_cfg = cfg_json["trainer"]
@@ -274,8 +282,28 @@ def main():
wandb.finish()
except Exception:
pass

# Clean up environments to prevent resource leaks
try:
if env is not None:
env.close()
except Exception as e:
logger.log_warning(f"Failed to close training environment: {e}")

try:
if eval_env is not None:
eval_env.close()
except Exception as e:
logger.log_warning(f"Failed to close evaluation environment: {e}")

logger.log_info("Training finished")


def main():
"""Main entry point for command-line training."""
args = parse_args()
train_from_config(args.config)


if __name__ == "__main__":
main()
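
With ``main()`` split into ``parse_args()`` and ``train_from_config()``, training can also be launched programmatically. A minimal sketch; the module path is inferred from ``embodichain/agents/rl/train.py`` and the config path matches the example config above:

```python
from embodichain.agents.rl.train import train_from_config

# Equivalent to: python embodichain/agents/rl/train.py --config configs/agents/rl/push_cube/train_config.json
train_from_config("configs/agents/rl/push_cube/train_config.json")
```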