Tianshou
A library for deep reinforcement learning, providing a modular and flexible framework for implementing and benchmarking RL algorithms. Current version is 2.0.1, released in 2025. The library follows a steady release cadence, with major version 2.0 overhauling the procedural API and separating algorithms from policies.
pip install tianshou
Common errors
error ModuleNotFoundError: No module named 'tianshou.policy' ↓
cause Using an older version of tianshou (<2.0.0) that does not have the 'policy' submodule.
fix
Upgrade to tianshou>=2.0.0:
pip install --upgrade tianshou.
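To confirm which version is actually installed, a quick check (tianshou exposes the standard __version__ attribute):

import tianshou
print(tianshou.__version__)  # expect 2.0.0 or later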
error AttributeError: module 'tianshou' has no attribute 'Collector' ↓
cause Collector was moved to tianshou.data in v2.0.0.
fix
Change import to
from tianshou.data import Collector.
error ValueError: The number of buffers in VectorReplayBuffer must be greater than 0 ↓
cause VectorReplayBuffer requires buffer_num > 0; passing zero (for example, a buffer count derived from an empty environment list) triggers this error.
fix
Ensure you provide a positive integer for buffer_num, typically equal to the number of vector environments.
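A minimal sketch of a correctly sized buffer, mirroring the Quickstart below; buffer_num is tied to the number of vectorized environments:

import gymnasium as gym
from tianshou.data import VectorReplayBuffer
from tianshou.env import DummyVectorEnv

train_envs = DummyVectorEnv([lambda: gym.make('CartPole-v1') for _ in range(4)])
buffer = VectorReplayBuffer(total_size=20000, buffer_num=len(train_envs))  # buffer_num must be > 0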
error gym.error.Error: Cannot instantiate environment. Possibly the environment id is misspelled or the environment is not installed. ↓
cause Tianshou v2.0.0 uses gymnasium as the environment backend. This error appears if gymnasium is not installed, or if a custom environment was registered under the old gym API.
fix
Install gymnasium:
pip install gymnasium. If using custom environments, ensure they are compatible with the gymnasium API; a minimal skeleton follows.
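For custom environments, the skeleton below sketches the gymnasium interface; the class name and spaces are illustrative placeholders, and the essential points are that reset returns (obs, info) and step returns a five-element tuple:

import gymnasium as gym
import numpy as np

class MyEnv(gym.Env):  # hypothetical example environment
    def __init__(self):
        self.observation_space = gym.spaces.Box(low=-1.0, high=1.0, shape=(4,), dtype=np.float32)
        self.action_space = gym.spaces.Discrete(2)

    def reset(self, seed=None, options=None):
        super().reset(seed=seed)
        obs = np.zeros(4, dtype=np.float32)
        return obs, {}  # gymnasium reset returns (obs, info)

    def step(self, action):
        obs = np.zeros(4, dtype=np.float32)
        return obs, 0.0, False, False, {}  # (obs, reward, terminated, truncated, info)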
Warnings
breaking v2.0.0 completely overhauled the API. `Policy` and `Algorithm` abstractions are separated; many imports changed. Old code using `from tianshou import Policy` will break. ↓
fix Update imports to use submodules (e.g., `from tianshou.policy import ...`) and adopt the new `Algorithm`/`Policy` split.
breaking In v2.0.0, `ReplayBuffer` remains in `tianshou.data`, but some internal APIs changed. `VectorReplayBuffer` is now the recommended buffer for vectorized environments. ↓
fix Use `from tianshou.data import VectorReplayBuffer` and ensure buffer API calls are updated.
gotcha Tianshou 2.0.0+ requires Python 3.11+ and drops support for older Python versions. Also, `gymnasium` is used instead of `gym`. ↓
fix Upgrade to Python 3.11+ and install `gymnasium` if you rely on custom environments.
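A simple guard for scripts, based on the requirement above (the assertion message is illustrative):

import sys
assert sys.version_info >= (3, 11), 'Tianshou 2.x requires Python 3.11+'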
deprecated The `trainer` module's API (e.g., `offpolicy_trainer`) is being replaced in future versions by a more streamlined high-level interface. Check documentation for upcoming changes. ↓
fix Consider using the high-level `experiment` package for new projects.
Imports
- Policy
  wrong: from tianshou import Policy
  correct: from tianshou.policy import Policy
- Collector
  wrong: from tianshou import Collector
  correct: from tianshou.data import Collector
- ReplayBuffer
  wrong: from tianshou.replay import ReplayBuffer
  correct: from tianshou.data import ReplayBuffer
Quickstart
import gymnasium as gym
import torch

from tianshou.data import Collector, VectorReplayBuffer
from tianshou.env import DummyVectorEnv
from tianshou.policy import DQNPolicy
from tianshou.trainer import offpolicy_trainer  # classic trainer API; see the deprecation note above
from tianshou.utils.net.common import Net
env = gym.make('CartPole-v1')
train_envs = DummyVectorEnv([lambda: gym.make('CartPole-v1') for _ in range(10)])
test_envs = DummyVectorEnv([lambda: gym.make('CartPole-v1') for _ in range(10)])
state_shape = env.observation_space.shape or env.observation_space.n
action_shape = env.action_space.shape or env.action_space.n
# Net builds an MLP mapping flattened observations to Q-values.
net = Net(state_shape=state_shape, action_shape=action_shape, hidden_sizes=[128, 128])
optim = torch.optim.Adam(net.parameters(), lr=1e-3)
policy = DQNPolicy(
    model=net,
    optim=optim,
    action_space=env.action_space,
    discount_factor=0.99,
    target_update_freq=320,
).to('cpu')
buffer = VectorReplayBuffer(total_size=20000, buffer_num=len(train_envs))
collector = Collector(policy, train_envs, buffer, exploration_noise=True)
test_collector = Collector(policy, test_envs, exploration_noise=True)
def stop_fn(mean_rewards):
    return mean_rewards >= 475  # CartPole-v1's reward threshold (195 was CartPole-v0's)
result = offpolicy_trainer(
    policy=policy,
    train_collector=collector,
    test_collector=test_collector,
    max_epoch=10,
    step_per_epoch=1000,
    step_per_collect=10,
    episode_per_test=5,
    batch_size=64,
    # epsilon-greedy schedule: explore during training, act near-greedily in tests
    train_fn=lambda epoch, env_step: policy.set_eps(0.1),
    test_fn=lambda epoch, env_step: policy.set_eps(0.05),
    stop_fn=stop_fn,
    update_per_step=1,
)
print(f'Finished training: {result}')  # the result object's exact type varies across versions
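After training, the same test collector can run a short near-greedy evaluation; this follows the classic API used above (set_eps and Collector.collect are long-standing tianshou methods):

policy.eval()
policy.set_eps(0.05)  # near-greedy action selection for evaluation
test_collector.reset()
eval_result = test_collector.collect(n_episode=10)
print(eval_result)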