Arctic Inference
Arctic Inference is an open-source vLLM plugin developed by Snowflake AI Research, designed for high-throughput, low-latency inference of Large Language Models (LLMs) and embeddings. It achieves this through advanced optimizations such as Shift Parallelism, Speculative Decoding, SwiftKV, and Arctic Ulysses. The library integrates seamlessly with and automatically patches vLLM (v0.8.4 and later), so users gain these performance improvements while continuing to use the familiar vLLM APIs and CLI. The current version is 0.1.2, and the project is under active development, with regular releases and accompanying research findings.
Common errors
- Failure to load and run a vLLM server for a model
  Cause: This error often indicates an incompatibility between your Python environment, CUDA toolkit, and the installed vLLM/Arctic Inference versions, or a problem loading the model (e.g., missing model files or an incorrect path).
  Fix: Verify that your CUDA toolkit version is compatible with your PyTorch and vLLM installations. Use Python 3.10, as in the examples. Check vLLM's official documentation for the exact hardware and software requirements. Confirm that the model path is correct and accessible. If necessary, reinstall `arctic-inference[vllm]` in a fresh virtual environment. The version-check sketch after this list can help pinpoint a mismatch.
- RPC timeout error after a 5-minute hang with the speculative config `arctic-suffix=True`
  Cause: This usually points to a performance bottleneck or configuration issue when using suffix decoding, particularly under high concurrency or with specific models/workloads, causing the request to exceed the default RPC timeout.
  Fix: Evaluate your GPU resources and model size. Consider adjusting vLLM's parallelization settings (`--tensor-parallel-size`, `--pipeline-parallel-size`). Temporarily disable suffix decoding to isolate whether it is the root cause, then re-evaluate performance with adjusted parameters. Check the GitHub issues for similar reports and suggested configurations.
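To narrow down the environment-related failures above, a quick version check can be run first. This is a minimal diagnostic sketch using only standard-library metadata lookups plus PyTorch; it assumes `arctic-inference` and `vllm` were installed with pip under the names shown.

import sys
from importlib.metadata import version, PackageNotFoundError

import torch

# Python version (the examples in this document use Python 3.10)
print("Python:", sys.version.split()[0])

# PyTorch and the CUDA toolkit it was built against
print("PyTorch:", torch.__version__, "| CUDA:", torch.version.cuda,
      "| GPU available:", torch.cuda.is_available())

# Installed package versions (Arctic Inference requires vLLM >= 0.8.4)
for pkg in ("vllm", "arctic-inference"):
    try:
        print(pkg + ":", version(pkg))
    except PackageNotFoundError:
        print(pkg + ": not installed")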
Warnings
- breaking Arctic Inference requires vLLM version 0.8.4 or newer. Older versions of vLLM are not compatible and will not benefit from or correctly integrate with Arctic Inference's optimizations.
- gotcha Arctic Inference operates as a plugin that patches vLLM. Its advanced features (e.g., Shift Parallelism, Speculative Decoding) are not automatically enabled just by installation. You must explicitly activate them by setting the `ARCTIC_INFERENCE_ENABLED=1` environment variable or by passing specific CLI flags to `vllm serve` (e.g., `--enable-shift-parallel`, `--speculative-config`).
- gotcha When using `arctic-inference` v0.1.2 with speculative decoding, there's a known issue leading to 'Structured output error in parallel' when sending multiple structured output requests concurrently.
Install
pip install arctic-inference[vllm]
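After installing, it can be worth confirming that vLLM can discover the plugin. The sketch below lists entry points in the group vLLM uses for general plugins; the group name `vllm.general_plugins` is an assumption based on vLLM's plugin mechanism, so adjust it if your vLLM version uses a different one.

from importlib.metadata import entry_points

# The entry-point group name below is an assumption based on vLLM's
# general-plugin discovery mechanism; adjust it for your vLLM version.
plugins = list(entry_points(group="vllm.general_plugins"))
if plugins:
    for ep in plugins:
        print("Discovered vLLM plugin:", ep.name, "->", ep.value)
else:
    print("No vLLM general plugins found; check the installation.")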
Quickstart
import os
from vllm import LLM, SamplingParams
# Enable Arctic Inference (optional, but recommended for full benefits)
# This can also be set as an environment variable before running vLLM:
# os.environ['ARCTIC_INFERENCE_ENABLED'] = '1'
# NOTE: For advanced features like Shift Parallelism or Speculative Decoding,
# you typically run vLLM via CLI with specific arguments, e.g.:
# ARCTIC_INFERENCE_ENABLED=1 vllm serve Snowflake/Llama-3.1-SwiftKV-8B-Instruct \
# --quantization "fp8" \
# --tensor-parallel-size 1 \
# --ulysses-sequence-parallel-size 2 \
# --enable-shift-parallel \
# --speculative-config '{ "method": "arctic", "model":"Snowflake/Arctic-LSTM-Speculator-Llama-3.1-8B-Instruct", "num_speculative_tokens": 3, "enable_suffix_decoding": true, "disable_by_batch_size": 64 }'
# Example for programmatic offline inference (basic usage)
# Ensure a model is available locally or on Hugging Face Hub
model_name = "meta-llama/Llama-2-7b-hf" # Replace with your chosen model
# Initialize LLM with Arctic Inference enabled (implicitly if env var is set)
# You can also pass vLLM arguments directly to LLM constructor
llm = LLM(model=model_name,
          enable_prefix_caching=True,
          # Environment variables are strings, so cast to int for vLLM
          tensor_parallel_size=int(os.environ.get('TP_SIZE', '1')))
sampling_params = SamplingParams(temperature=0.7, top_p=0.95, max_tokens=50)
prompts = [
"Hello, my name is",
"The president of the United States is",
"The capital of France is",
"What is the best way to learn Python programming?"
]
outputs = llm.generate(prompts, sampling_params)
# Print the outputs.
for prompt, output in zip(prompts, outputs):
generated_text = output.outputs[0].text
print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")