IPEX-LLM
IPEX-LLM is a PyTorch-based library developed by Intel for optimizing Large Language Models (LLMs) on Intel CPUs and GPUs (XPUs). It provides tools for efficient inference and fine-tuning that leverage Intel hardware acceleration. The current stable version is 2.2.0, with frequent nightly builds; the project originated as part of the broader BigDL project.
Warnings
- breaking The library was rebranded from `BigDL-LLM` to `ipex-llm`. This changes package names, import paths, and some CLI tools.
- gotcha IPEX-LLM installations are hardware-specific. Users must install the correct extras for their target platform (`[cpu]` for Intel CPUs or `[xpu]` for Intel GPUs). Installing without the correct extra may lead to missing dependencies or suboptimal performance.
- gotcha IPEX-LLM's performance is highly dependent on specific PyTorch and underlying Intel oneAPI library versions. Incompatible versions can lead to errors or degraded performance.
- deprecated While still functional, the `ipex_llm.optimize_model` API is being superseded by the higher-level `ipex_llm.transformers` loading APIs (e.g. `AutoModelForCausalLM.from_pretrained(..., load_in_4bit=True)`) for model loading and quantization.
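The platform-specific extras from the first gotcha can be made explicit in setup scripts. A minimal illustrative helper (not part of ipex-llm; the helper name and `target` values are assumptions mirroring the pip extras):

```python
# Illustrative helper (not part of ipex-llm): map a target platform
# to the corresponding pip install command for ipex-llm.
def ipex_llm_install_cmd(target: str) -> str:
    commands = {
        "cpu": "pip install ipex-llm[cpu]",
        # XPU wheels are updated frequently, hence --pre --upgrade
        "xpu": "pip install --pre --upgrade ipex-llm[xpu]",
    }
    try:
        return commands[target]
    except KeyError:
        raise ValueError(f"unknown target {target!r}; expected 'cpu' or 'xpu'")

print(ipex_llm_install_cmd("cpu"))
```

Installing the wrong extra (or none) is the most common cause of missing-dependency errors, so failing loudly on an unknown target is deliberate here.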
Install
- CPU
pip install ipex-llm[cpu]
- GPU (XPU)
pip install --pre --upgrade ipex-llm[xpu]
Imports
- AutoModelForCausalLM
from ipex_llm.transformers import AutoModelForCausalLM
- optimize_model
from ipex_llm import optimize_model
- AutoModel
from ipex_llm.transformers import AutoModel
- AutoTokenizer (from Hugging Face transformers)
from transformers import AutoTokenizer
Quickstart
# Option 1: load a Hugging Face model, then optimize it with ipex-llm
from transformers import AutoModelForCausalLM
from ipex_llm import optimize_model
model = AutoModelForCausalLM.from_pretrained(
'/path/to/your/model',
torch_dtype='auto',
trust_remote_code=True
)
model = optimize_model(model)  # applies low-bit optimization (INT4 by default)
prompt = "What is the capital of France?"
# Option 2: load directly through IPEX-LLM's transformers-style API
from ipex_llm.transformers import AutoModelForCausalLM
from transformers import AutoTokenizer
model_id = "TinyLlama/TinyLlama-1.1B-Chat-v1.0"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
model_id,
load_in_4bit=True,  # or e.g. load_in_low_bit="sym_int4"
trust_remote_code=True
)
prompt = "What is the capital of France?"
input_ids = tokenizer.encode(prompt, return_tensors="pt")
output = model.generate(input_ids, max_new_tokens=32)
print(tokenizer.decode(output[0], skip_special_tokens=True))
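To see why 4-bit loading matters on memory-constrained hardware, here is a back-of-envelope estimate of weight storage (illustrative arithmetic only; real footprints also include activations, the KV cache, and quantization scales):

```python
def approx_weight_bytes(n_params: float, bits_per_weight: int) -> float:
    """Rough weight-storage estimate: parameters * bits / 8 bytes."""
    return n_params * bits_per_weight / 8

n = 1.1e9  # TinyLlama-1.1B parameter count
fp16_gb = approx_weight_bytes(n, 16) / 1e9  # 2.2 GB
int4_gb = approx_weight_bytes(n, 4) / 1e9   # 0.55 GB
print(f"FP16: {fp16_gb:.1f} GB, INT4: {int4_gb:.2f} GB")
```

The roughly 4x reduction is what lets larger models fit in client CPU/iGPU memory, at some cost in accuracy depending on the quantization scheme chosen.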