dm-env
dm-env is DeepMind's Python interface for reinforcement learning (RL) environments, providing a foundational API for interacting with them. It is actively maintained and currently at version 1.6. The library centers on a few core components: `Environment`, `TimeStep`, and the `specs` module, which define actions, observations, rewards, and discounts.
Common errors
- AttributeError: 'Array' object has no attribute 'sample'
  cause: Unlike some other RL frameworks (e.g., Gymnasium/OpenAI Gym spaces), `dm_env.specs.Array` and `specs.BoundedArray` do not provide a `.sample()` method for generating random values within their defined bounds. Users often expect this functionality for sampling actions or observations.
  fix: Implement sampling manually from the spec's `dtype`, `shape`, `minimum`, and `maximum` properties. For example, use `np.random.uniform(spec.minimum, spec.maximum, size=spec.shape)` for a `BoundedArray`. For a deterministic conforming placeholder, specs also provide `generate_value()`, or use `np.zeros(spec.shape, dtype=spec.dtype)`.
- MemoryError / Detected OOM-kill event(s)
  cause: Not a `dm-env` bug as such: larger RL setups built on `dm-env` environments, especially those with big replay buffers or high-dimensional observation spaces, can exhaust memory and trigger Out-Of-Memory (OOM) kills. This is particularly common in deep RL agent implementations.
  fix: Monitor the memory usage of your agent and environment. Reduce the replay buffer size, use smaller observation/action dtypes (e.g., `np.uint8` instead of `np.float32` where appropriate), or compress stored experience.
- Incompatibility with Gymnasium/Gym-style environments or agents
  cause: The `dm-env` interface is distinct from the Gymnasium (formerly OpenAI Gym) API, so `dm-env` environments cannot be used directly with Gymnasium-style agents (or vice versa) without conversion.
  fix: Use a compatibility wrapper library such as `shimmy` (e.g., `shimmy.DmControlCompatibilityV0` for `dm-control` environments, which use `dm-env` internally) to expose a `dm-env` environment through the Gymnasium interface. Alternatively, implement a small custom adapter or wrapper.
Warnings
- breaking From `dm-env` version 1.4 onwards, only Python 3.6 and newer are officially supported; older Python versions cannot run recent releases.
- gotcha By the `dm-env` contract, calling `env.step(action)` on a newly created environment instance, or immediately after a `TimeStep` with `StepType.LAST`, implicitly triggers a reset: the provided `action` argument is ignored and the environment returns a `TimeStep` with `StepType.FIRST`.
- gotcha A `discount` value of 0.0 in a `TimeStep` does not necessarily signify the end of a sequence (episode). The `StepType` enum (specifically `StepType.LAST`) should be used to determine if a sequence has terminated.
Install
-
pip install dm-env
Imports
- Environment
from dm_env import Environment
- TimeStep
from dm_env import TimeStep
- StepType
from dm_env import StepType
- specs
from dm_env import specs
- Array
from dm_env.specs import Array
- BoundedArray
from dm_env.specs import BoundedArray
Quickstart
import numpy as np
from dm_env import Environment, TimeStep, specs, StepType


class SimpleCountingEnv(Environment):
    """Counts up to `max_count`, then ends the episode."""

    def __init__(self, max_count=5):
        self._max_count = max_count
        self._current_count = 0
        self._reset_next_step = True

    def discount_spec(self):
        return specs.BoundedArray(shape=(), dtype=float, minimum=0.0, maximum=1.0, name='discount')

    def observation_spec(self):
        return specs.BoundedArray(shape=(), dtype=int, minimum=0, maximum=self._max_count, name='count')

    def action_spec(self):
        # 0: no-op, 1: increment
        return specs.BoundedArray(shape=(), dtype=int, minimum=0, maximum=1, name='action')

    def reward_spec(self):
        return specs.Array(shape=(), dtype=float, name='reward')

    def _reset(self):
        self._current_count = 0
        self._reset_next_step = False
        # The first TimeStep of an episode carries no reward or discount.
        return TimeStep(step_type=StepType.FIRST,
                        reward=None,
                        discount=None,
                        observation=np.asarray(self._current_count, dtype=int))

    def _step(self, action):
        if self._reset_next_step:
            # Stepping after a LAST timestep implicitly resets; `action` is ignored.
            return self._reset()
        if action == 1:
            self._current_count += 1
        if self._current_count >= self._max_count:
            self._reset_next_step = True
            return TimeStep(step_type=StepType.LAST,
                            reward=np.asarray(1.0, dtype=float),
                            discount=np.asarray(0.0, dtype=float),
                            observation=np.asarray(self._current_count, dtype=int))
        else:
            return TimeStep(step_type=StepType.MID,
                            reward=np.asarray(0.0, dtype=float),
                            discount=np.asarray(1.0, dtype=float),
                            observation=np.asarray(self._current_count, dtype=int))

    def reset(self):
        return self._reset()

    def step(self, action):
        return self._step(action)


# --- Example Usage ---
env = SimpleCountingEnv()
timestep = env.reset()
print(f"Initial: {timestep.observation}")
while not timestep.last():
    action = 1  # Always try to increment
    timestep = env.step(action)
    print(f"Step {env._current_count}: Obs={timestep.observation}, Reward={timestep.reward}, Type={timestep.step_type.name}")

# Demonstrating the implicit reset after a LAST timestep
timestep = env.step(0)  # Action is ignored here
print(f"After last, calling step (action ignored): {timestep.observation}")