Cache-DiT

1.3.5 · active · verified Thu Apr 16

Cache-DiT is a PyTorch-native inference engine designed for Diffusion Transformers (DiTs). It provides hybrid cache acceleration (DBCache, TaylorSeer, SCM), comprehensive parallelism optimizations (Context, Tensor, 2D/3D), and low-bit quantization (FP8, INT8, INT4). The library integrates seamlessly with Hugging Face Diffusers, SGLang Diffusion, vLLM-Omni, and ComfyUI to deliver significant speedups for image and video generation. Currently at version 1.3.5, it maintains an active release cadence with frequent updates and hotfixes.

Install
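
Cache-DiT is published on PyPI; the distribution name uses a hyphen while the Python module uses an underscore:

pip install cache-dit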

Imports
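
All Cache-DiT entry points used on this page come from the cache_dit module (torch and diffusers are also required by the Quickstart):

import cache_dit
from cache_dit import DBCacheConfig, ParallelismConfig, QuantizeConfig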

Quickstart

To get started quickly, load a DiT-based `DiffusionPipeline` and call `cache_dit.enable_cache()` on it to activate acceleration. Caching, parallelism, and quantization are customized through `DBCacheConfig`, `ParallelismConfig`, and `QuantizeConfig` respectively. The example below enables hybrid cache acceleration together with FP8 quantization and Ulysses parallelism on FLUX.1-dev, a transformer-backbone pipeline; since Cache-DiT targets Diffusion Transformers, UNet pipelines such as Stable Diffusion 1.5 are outside its scope.

import torch
from diffusers import DiffusionPipeline
import cache_dit
from cache_dit import DBCacheConfig, ParallelismConfig, QuantizeConfig

# Load a DiT-based DiffusionPipeline (FLUX.1-dev uses a transformer backbone)
pipe = DiffusionPipeline.from_pretrained(
    "black-forest-labs/FLUX.1-dev", torch_dtype=torch.bfloat16
).to("cuda")

# Enable Cache-DiT acceleration with caching, parallelism, and quantization
cache_dit.enable_cache(
    pipe,
    cache_config=DBCacheConfig(),  # Default DBCache settings
    parallelism_config=ParallelismConfig(ulysses_size=2),  # Ulysses sequence parallelism across 2 GPUs
    quantize_config=QuantizeConfig(quant_type="float8_per_row"),  # FP8 per-row quantization
)

# Run inference as usual
prompt = "a photo of an astronaut riding a horse on mars"
image = pipe(prompt).images[0]
image.save("astronaut_horse_mars.png")

# To disable Cache-DiT acceleration later:
# cache_dit.disable_cache(pipe)
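
Because ulysses_size=2 splits sequence-parallel attention across two GPUs, the script must be started under a distributed launcher (for example, torchrun --nproc_per_node=2 quickstart.py); drop the parallelism_config argument to run on a single GPU. To confirm that caching took effect, a runtime statistics helper can be called after inference. This is a minimal sketch assuming cache_dit.summary as shown in the upstream README; the exact name may differ between versions:

# Print per-step cache statistics gathered during inference
# (assumes cache_dit.summary as documented upstream; API may vary by version)
stats = cache_dit.summary(pipe)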
