{"id":7061,"library":"cache-dit","title":"Cache-DiT","description":"Cache-DiT is a PyTorch-native inference engine designed for Diffusion Transformers (DiTs). It provides hybrid cache acceleration (DBCache, TaylorSeer, SCM), comprehensive parallelism optimizations (Context, Tensor, 2D/3D), and low-bit quantization (FP8, INT8, INT4). The library integrates seamlessly with Hugging Face Diffusers, SGLang Diffusion, vLLM-Omni, and ComfyUI to deliver significant speedups for image and video generation. Currently at version 1.3.5, it maintains an active release cadence with frequent updates and hotfixes.","status":"active","version":"1.3.5","language":"en","source_language":"en","source_url":"https://github.com/vipshop/cache-dit","tags":["pytorch","inference","quantization","dit","diffusion","parallelism","cache","acceleration","gpu","huggingface-diffusers","comfyui"],
"install":[{"cmd":"pip install -U cache-dit","lang":"bash","label":"Install stable release from PyPI"}],
"dependencies":[{"reason":"Requires a supported Python version.","package":"python","optional":false},{"reason":"PyTorch-native inference engine; PyTorch is the fundamental dependency.","package":"torch","optional":false},{"reason":"Built on top of the Diffusers library; integrates with DiffusionPipeline.","package":"diffusers","optional":false},{"reason":"Required for building SVDQuant from source.","package":"setuptools-scm","optional":true},{"reason":"CUDA toolchain required for building CUDA extensions such as SVDQuant from source.","package":"nvcc","optional":true},{"reason":"Recommended for leveraging Context Parallelism features for distributed inference.","package":"para-attn","optional":true}],
"imports":[{"symbol":"cache_dit","correct":"import cache_dit"},{"symbol":"DBCacheConfig","correct":"from cache_dit import DBCacheConfig, ParallelismConfig, QuantizeConfig"},{"symbol":"ParallelismConfig","correct":"from cache_dit import DBCacheConfig, ParallelismConfig, QuantizeConfig"},{"symbol":"QuantizeConfig","correct":"from cache_dit import DBCacheConfig, ParallelismConfig, QuantizeConfig"},{"symbol":"enable_cache","correct":"cache_dit.enable_cache(pipeline_instance)"}],
"quickstart":{"code":"import torch\nfrom diffusers import DiffusionPipeline\nimport cache_dit\nfrom cache_dit import DBCacheConfig, ParallelismConfig, QuantizeConfig\n\n# Load a DiT-based DiffusionPipeline (Cache-DiT targets Diffusion Transformers,\n# not UNet models such as Stable Diffusion v1.5)\npipe = DiffusionPipeline.from_pretrained(\n    \"black-forest-labs/FLUX.1-dev\", torch_dtype=torch.bfloat16\n).to(\"cuda\")\n\n# Enable Cache-DiT acceleration with cache, parallelism, and quantization settings\ncache_dit.enable_cache(\n    pipe,\n    cache_config=DBCacheConfig(),  # Default DBCache settings\n    parallelism_config=ParallelismConfig(ulysses_size=2),  # Ulysses context parallelism (requires 2 GPUs and a distributed launch)\n    quantize_config=QuantizeConfig(quant_type=\"float8_per_row\"),  # FP8 row-wise quantization\n)\n\n# Run inference as usual\nprompt = \"a photo of an astronaut riding a horse on mars\"\nimage = pipe(prompt).images[0]\nimage.save(\"astronaut_horse_mars.png\")\n\n# To disable Cache-DiT acceleration later:\n# cache_dit.disable_cache(pipe)\n","lang":"python","description":"To get started, load a DiT-based `DiffusionPipeline` (the example uses FLUX.1-dev) and apply `cache_dit.enable_cache()` to activate acceleration. You can customize caching, parallelism, and quantization through `DBCacheConfig`, `ParallelismConfig`, and `QuantizeConfig` respectively. The example demonstrates hybrid acceleration with FP8 quantization and Ulysses parallelism; note that `ulysses_size=2` requires a two-GPU distributed launch (e.g., `torchrun --nproc_per_node=2`)."},
"warnings":[{"fix":"Monitor output quality and adjust `cache_config` parameters. Consider lowering `residual_diff_threshold` from the default (0.24) and increasing `Fn_compute_blocks` from the default (1) for better quality. Ensure sufficient inference steps are used.","message":"Aggressive caching settings can degrade output quality. Over-caching, particularly with a high `residual_diff_threshold` or too few `Fn_compute_blocks`, can introduce artifacts or lower-fidelity images.","severity":"gotcha","affected_versions":"All versions"},{"fix":"Increase the `recompile_limit` for `torch._dynamo` if you encounter this issue. For example, `torch._dynamo.config.recompile_limit = 100`.","message":"Using `cache-dit` with `torch.compile` and dynamic input shapes may trigger `recompile_limit` errors, causing a fallback to eager mode that negates the performance benefits.","severity":"gotcha","affected_versions":"All versions when using `torch.compile`"},{"fix":"Set `per_tensor_fallback=True` (often the default) in `QuantizeConfig` when calling `enable_cache` to allow unsupported layers to fall back to FP8 per-tensor quantization, preventing errors.","message":"FP8 quantization with tensor parallelism may encounter memory-layout mismatch errors in certain layers. The `per_tensor_fallback` option addresses this.","severity":"gotcha","affected_versions":"All versions supporting FP8 quantization and tensor parallelism (v1.3.0+)"},{"fix":"Refer to the `ComfyUI-CacheDiT` documentation and use the model-specific optimizer nodes for LTX-2 and WAN2.2 14B models.","message":"When integrating with ComfyUI, specific DiT models such as LTX-2 or WAN2.2 14B MoE require dedicated optimizer nodes (e.g., `⚡ LTX2 Cache Optimizer`, `⚡ Wan Cache Optimizer`) instead of the general `⚡ CacheDiT Accelerator` node for optimal performance and quality.","severity":"gotcha","affected_versions":"All versions when used with ComfyUI-CacheDiT"}],
"env_vars":null,"last_verified":"2026-04-16T00:00:00.000Z","next_check":"2026-07-15T00:00:00.000Z",
"problems":[{"fix":"Adjust `cache_config` parameters: decrease `residual_diff_threshold` (e.g., to 0.12-0.15 from the default 0.24), increase `Fn_compute_blocks` (e.g., to 8-12 from the default 1), and ensure `num_inference_steps` is adequate (e.g., >10 steps) to allow for proper warmup and cache effectiveness.","cause":"Overly aggressive caching settings (e.g., a high `residual_diff_threshold`, too few `Fn_compute_blocks`, or insufficient inference steps) lead to error accumulation or poor feature reuse.","error":"Generated images have visible artifacts or lower quality."},{"fix":"Ensure that your CUDA toolkit is correctly installed and its `bin` directory is in your system's PATH. If using `conda`, activate your CUDA-enabled environment (e.g., `conda activate cdit`) before running the pip install command.","cause":"The CUDA compiler (`nvcc`) is not found in the system's PATH or the active environment; it is required when building Cache-DiT with SVDQuant support from source.","error":"`CACHE_DIT_BUILD_SVDQUANT=1` was set, but `nvcc` was not found. Activate the CUDA toolchain before building."},{"fix":"In ComfyUI, try manual preset selection for the model in the CacheDiT node. For Python, ensure `cache_dit.enable_cache()` was called successfully. Increase the number of inference steps in your sampler, as a warmup phase (e.g., 3-4 steps) is usually required before caching becomes effective. Check the console logs for the 'Lightweight cache enabled' message to confirm activation.","cause":"The model was not properly detected by the caching mechanism, inference steps were too short (e.g., <10 steps, preventing proper warmup), or the caching mechanism was not successfully enabled.","error":"Performance Dashboard shows a 0% cache hit rate."}]}