Cache-DiT
Cache-DiT is a PyTorch-native inference engine designed for Diffusion Transformers (DiTs). It provides hybrid cache acceleration (DBCache, TaylorSeer, SCM), comprehensive parallelism optimizations (Context, Tensor, 2D/3D), and low-bit quantization (FP8, INT8, INT4). The library integrates seamlessly with Hugging Face Diffusers, SGLang Diffusion, vLLM-Omni, and ComfyUI to deliver significant speedups for image and video generation. Currently at version 1.3.5, it maintains an active release cadence with frequent updates and hotfixes.
Common errors
- Generated images have visible artifacts or lower quality.
  Cause: Overly aggressive caching settings (e.g., a high `residual_diff_threshold`, too few `Fn_compute_blocks`, or too few inference steps) lead to error accumulation or poor feature reuse.
  Fix: Adjust the `cache_config` parameters: decrease `residual_diff_threshold` (e.g., to 0.12-0.15 from the default 0.24), increase `Fn_compute_blocks` (e.g., to 8-12 from the default 1), and ensure `num_inference_steps` is adequate (e.g., >10 steps) to allow proper warmup and effective caching.
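As a sketch, the conservative settings above might look like the following. The parameter names are taken from the troubleshooting note; verify the exact fields and defaults against your installed cache-dit version.

```python
# Hypothetical sketch: conservative DBCache settings to reduce artifacts.
# Parameter names follow the troubleshooting note above; check them
# against your installed cache-dit version before relying on them.
from cache_dit import DBCacheConfig

conservative = DBCacheConfig(
    residual_diff_threshold=0.12,  # lower than the 0.24 default -> fewer cache reuses
    Fn_compute_blocks=8,           # compute more leading blocks exactly
)
# Then pass it in when enabling the cache:
# cache_dit.enable_cache(pipe, cache_config=conservative)
```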
- `CACHE_DIT_BUILD_SVDQUANT=1` was set, but `nvcc` was not found. Activate the CUDA toolchain before building.
  Cause: The CUDA compiler (`nvcc`) is not on the PATH of the active environment; it is required when building Cache-DiT with SVDQuant support from source.
  Fix: Ensure the CUDA toolkit is correctly installed and its `bin` directory is on your PATH. If using `conda`, activate your CUDA-enabled environment (e.g., `conda activate cdit`) before running the pip install command.
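A minimal pre-build check for the fix above (the `cdit` environment name and the `/usr/local/cuda` path are only examples; adjust to your installation):

```shell
# Verify nvcc is visible before building with CACHE_DIT_BUILD_SVDQUANT=1.
if command -v nvcc >/dev/null 2>&1; then
  echo "nvcc found: $(command -v nvcc)"
else
  echo "nvcc missing"
  # Example fixes (uncomment the one that matches your setup):
  # export PATH=/usr/local/cuda/bin:$PATH   # toolkit installed system-wide
  # conda activate cdit                     # CUDA-enabled conda environment
fi
```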
- Performance Dashboard shows 0% cache hit.
  Cause: The model was not properly detected by the caching mechanism, the inference run was too short (e.g., <10 steps, preventing proper warmup), or caching was never successfully enabled.
  Fix: In ComfyUI, try manually selecting the model preset in the CacheDiT node. In Python, ensure `cache_dit.enable_cache()` was called successfully. Increase the number of inference steps in your sampler, since a warmup phase (e.g., 3-4 steps) is usually required before caching becomes effective. Check the console logs for the 'Lightweight cache enabled' message to confirm activation.
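For the Python path, the checks above can be sketched as follows; `pipe` and `prompt` stand for an already-loaded Diffusers pipeline and its input, and the step count is illustrative.

```python
# Sketch: confirm caching was enabled and leave room past the warmup phase.
import cache_dit

cache_dit.enable_cache(pipe)  # watch the console for "Lightweight cache enabled"
# A warmup of roughly 3-4 steps precedes any cache hits, so keep
# num_inference_steps comfortably above 10 for caching to pay off.
image = pipe(prompt, num_inference_steps=28).images[0]
```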
Warnings
- Gotcha: Aggressive caching settings can degrade output quality. Over-caching, particularly with a high `residual_diff_threshold` or insufficient `Fn_compute_blocks`, can introduce artifacts or lower-fidelity images.
- Gotcha: Using `cache-dit` with `torch.compile` and dynamic input shapes may trigger `recompile_limit` errors, causing a fallback to eager mode that negates the performance benefits.
- Gotcha: FP8 quantization with tensor parallelism may hit memory-layout mismatch errors in certain layers. The `per_tensor_fallback` option addresses this.
- Gotcha: When integrating with ComfyUI, specific DiT models like LTX-2 or WAN2.2 14B MoE require dedicated optimizer nodes (e.g., `⚡ LTX2 Cache Optimizer`, `⚡ Wan Cache Optimizer`) instead of the general `⚡ CacheDiT Accelerator` node for optimal performance and quality.
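For the `torch.compile` gotcha, one common workaround is to raise Dynamo's recompile budget or declare shapes dynamic up front. This is a sketch, not Cache-DiT's own API; tune the limit to your workload.

```python
# Sketch: avoid recompile_limit fallbacks when input shapes vary.
import torch

# Raise the recompile budget (default is small; 64 is an arbitrary example).
torch._dynamo.config.cache_size_limit = 64

# Or tell the compiler up front that shapes are dynamic, so it compiles
# one shape-generic graph instead of recompiling per shape:
# pipe.transformer = torch.compile(pipe.transformer, dynamic=True)
```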
Install
- pip install -U cache-dit
Imports
- cache_dit
import cache_dit
- DBCacheConfig
from cache_dit import DBCacheConfig, ParallelismConfig, QuantizeConfig
- ParallelismConfig
from cache_dit import DBCacheConfig, ParallelismConfig, QuantizeConfig
- QuantizeConfig
from cache_dit import DBCacheConfig, ParallelismConfig, QuantizeConfig
- enable_cache
cache_dit.enable_cache(pipeline_instance)
Quickstart
import torch
from diffusers import DiffusionPipeline
import cache_dit
from cache_dit import DBCacheConfig, ParallelismConfig, QuantizeConfig
# Load a DiT-based DiffusionPipeline (Cache-DiT targets Diffusion Transformers;
# FLUX.1-dev is one such model)
pipe = DiffusionPipeline.from_pretrained(
    "black-forest-labs/FLUX.1-dev", torch_dtype=torch.bfloat16
).to("cuda")
# Enable Cache-DiT acceleration with default cache, parallelism, and quantization
cache_dit.enable_cache(
    pipe,
    cache_config=DBCacheConfig(),  # Default cache settings
    parallelism_config=ParallelismConfig(ulysses_size=2),  # Example: Ulysses parallelism (requires a multi-GPU distributed launch)
    quantize_config=QuantizeConfig(quant_type="float8_per_row"),  # Example: enable FP8 quantization
)
# Run inference as usual
prompt = "a photo of an astronaut riding a horse on mars"
image = pipe(prompt).images[0]
image.save("astronaut_horse_mars.png")
# To disable Cache-DiT acceleration later:
# cache_dit.disable_cache(pipe)