Arctic Inference

0.1.2 · active · verified Thu Apr 16

Arctic Inference is an open-source vLLM plugin developed by Snowflake AI Research, designed for high-throughput, low-latency inference of Large Language Models (LLMs) and embeddings. It achieves this through advanced optimizations such as Shift Parallelism, Speculative Decoding, SwiftKV, and Arctic Ulysses. The library integrates seamlessly with vLLM (v0.8.4 and later), automatically patching it so users can benefit from these performance gains while continuing to use the familiar vLLM APIs and CLI. The current version is 0.1.2, and the project is under active development, with regular releases and accompanying research findings.

Common errors

Warnings

Install
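
Installation is typically done with pip. Assuming the package is published on PyPI as `arctic-inference` with a `vllm` extra (pulling in a compatible vLLM), the following should suffice; the extra name is an assumption, so check the project's README if it fails:

```shell
# Install Arctic Inference together with a compatible vLLM
# (quotes protect the [vllm] extra from shell globbing).
pip install "arctic-inference[vllm]"
```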

Quickstart

This quickstart demonstrates programmatic offline inference with `vLLM` after `arctic-inference` has been installed. Arctic Inference applies its optimizations to `vLLM` automatically once enabled, either through an environment variable or through specific `vLLM` CLI flags. To use the full feature set, the primary method is to run `vLLM` from the command line with flags such as `--enable-shift-parallel` and `--speculative-config`.

import os
from vllm import LLM, SamplingParams

# Enable Arctic Inference (optional, but recommended for full benefits).
# Set this in the shell before launching vLLM, or from Python before
# constructing the LLM:
# os.environ['ARCTIC_INFERENCE_ENABLED'] = '1'

# NOTE: For advanced features like Shift Parallelism or Speculative Decoding,
# you typically run vLLM via CLI with specific arguments, e.g.:
# ARCTIC_INFERENCE_ENABLED=1 vllm serve Snowflake/Llama-3.1-SwiftKV-8B-Instruct \
#   --quantization "fp8" \
#   --tensor-parallel-size 1 \
#   --ulysses-sequence-parallel-size 2 \
#   --enable-shift-parallel \
#   --speculative-config '{ "method": "arctic", "model":"Snowflake/Arctic-LSTM-Speculator-Llama-3.1-8B-Instruct", "num_speculative_tokens": 3, "enable_suffix_decoding": true, "disable_by_batch_size": 64 }'

# Example for programmatic offline inference (basic usage)
# Ensure a model is available locally or on Hugging Face Hub
model_name = "meta-llama/Llama-2-7b-hf" # Replace with your chosen model

# Initialize LLM with Arctic Inference enabled (implicitly if env var is set)
# You can also pass vLLM arguments directly to LLM constructor
llm = LLM(model=model_name,
          enable_prefix_caching=True,
          # Environment variables are strings, so cast before passing to vLLM
          tensor_parallel_size=int(os.environ.get('TP_SIZE', 1)))

sampling_params = SamplingParams(temperature=0.7, top_p=0.95, max_tokens=50)

prompts = [
    "Hello, my name is",
    "The president of the United States is",
    "The capital of France is",
    "What is the best way to learn Python programming?"
]

outputs = llm.generate(prompts, sampling_params)

# Print the outputs.
for prompt, output in zip(prompts, outputs):
    generated_text = output.outputs[0].text
    print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")
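
When a server is launched with `vllm serve` instead (as in the CLI example above), it exposes vLLM's standard OpenAI-compatible HTTP API, so any HTTP client can query it. A minimal sketch using only the standard library; the host, port, and model name are assumptions to adjust for your deployment:

```python
import json
import urllib.request


def build_completion_request(model, prompt, max_tokens=50, temperature=0.7):
    """Build an OpenAI-style /v1/completions payload for a vLLM server."""
    return {
        "model": model,
        "prompt": prompt,
        "max_tokens": max_tokens,
        "temperature": temperature,
    }


def complete(prompt,
             base_url="http://localhost:8000",  # assumed local vllm serve
             model="Snowflake/Llama-3.1-SwiftKV-8B-Instruct"):
    """POST a completion request and return the generated text."""
    payload = build_completion_request(model, prompt)
    req = urllib.request.Request(
        f"{base_url}/v1/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["choices"][0]["text"]


if __name__ == "__main__":
    print(complete("The capital of France is"))
```

Because the endpoint is OpenAI-compatible, the official `openai` Python client pointed at the same `base_url` works equally well.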
