Neural Networks Compression Framework
The Neural Networks Compression Framework (NNCF) is a Python library developed by Intel as part of the OpenVINO Toolkit, providing advanced algorithms for optimizing deep learning models for faster and smaller inference. It supports models in PyTorch, TensorFlow (deprecated), ONNX, and OpenVINO IR formats, offering techniques such as Post-Training Quantization, Quantization-Aware Training, Weight Compression, and Pruning. NNCF is actively maintained with frequent releases; the current stable version is 3.1.0.
Warnings
- breaking `NNCFGraph`, a core internal representation, was migrated from `nx.DiGraph` to `nx.MultiDiGraph` in v3.1.0 to support models with parallel/multi-edges. This can break code that directly interacts with NNCF's internal graph structure.
- breaking The `nncf.CompressWeightsMode.CB4_F8E4M3` mode option was renamed to `nncf.CompressWeightsMode.CB4`.
- breaking The `nncf.CompressWeightsMode.E2M1` mode option was renamed to `nncf.CompressWeightsMode.MXFP4`.
- deprecated The TensorFlow backend is deprecated and will be removed in future releases. It is recommended to use PyTorch models for training-aware optimization and OpenVINO IR, PyTorch, or ONNX for post-training methods.
- deprecated Several experimental NNCF methods including NAS, Structural Pruning, AutoML, Knowledge Distillation, Mixed-Precision Quantization, and Movement Sparsity are deprecated and will be removed in future releases.
- gotcha When using Quantization-Aware Training with NNCF, it is generally recommended to turn off Dropout layers (and similar layers like DropConnect) during training to prevent accuracy degradation.
- gotcha Users may encounter 'CUDA out of memory' errors during compression-aware training due to the increased GPU memory footprint of NNCF-compressed models. Additionally, `gcc`, `nvcc`, `ninja`, or `cl.exe` errors can occur if CUDA development tools are not properly installed or configured in the PATH/PYTHONPATH for PyTorch.
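The Dropout gotcha above can be handled by switching Dropout-like layers to eval mode while the rest of the network trains. This is a minimal sketch with a toy model (the `nn.Sequential` model and `disable_dropout` helper are illustrative, not part of the NNCF API):

```python
import torch
import torch.nn as nn

# Toy model containing a Dropout layer (illustrative only)
model = nn.Sequential(
    nn.Linear(16, 32),
    nn.ReLU(),
    nn.Dropout(p=0.3),
    nn.Linear(32, 10),
)

def disable_dropout(model: nn.Module) -> None:
    """Put every Dropout-like layer into eval mode so it acts as identity."""
    for module in model.modules():
        if isinstance(module, (nn.Dropout, nn.Dropout2d, nn.Dropout3d)):
            module.eval()

model.train()           # training mode for the rest of the network
disable_dropout(model)  # but keep Dropout layers inactive during QAT

x = torch.randn(4, 16)
y = model(x)  # deterministic forward pass: Dropout is a no-op in eval mode
```

The same pattern applies to custom stochastic layers such as DropConnect: locate them with `model.modules()` and call `.eval()` on them before each compression-aware training step.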
Install
- pip install nncf
- pip install nncf[openvino]
- pip install nncf[torch]
- pip install nncf[tensorflow]
Imports
- quantize
import nncf
quantized_model = nncf.quantize(model, calibration_dataset)
- compress_weights
import nncf
compressed_model = nncf.compress_weights(model)
- prune (legacy training API; pruning is configured via NNCFConfig rather than a standalone function)
from nncf import NNCFConfig
from nncf.torch import create_compressed_model
compression_ctrl, pruned_model = create_compressed_model(model, nncf_config)
- NNCFConfig
from nncf import NNCFConfig
- ModelType
from nncf import ModelType
- QuantizationPreset
from nncf import QuantizationPreset
- AdvancedQuantizationParameters
from nncf.quantization.advanced_parameters import AdvancedQuantizationParameters
- IgnoredScope
from nncf import IgnoredScope
- get_config
from nncf.torch import get_config
- load_from_config
from nncf.torch import load_from_config
Quickstart
import nncf
import openvino as ov
import torch
from torchvision import datasets, transforms, models
import os
# 1. Load a pre-trained PyTorch model
model = models.resnet18(weights=models.ResNet18_Weights.IMAGENET1K_V1)
model.eval()
# 2. Convert PyTorch model to OpenVINO Model
# Create a dummy input for tracing
dummy_input = torch.randn(1, 3, 224, 224)
ov_model = ov.convert_model(model, example_input=dummy_input)
# 3. Prepare a calibration dataset (example with random data)
# In a real scenario, use representative data from your dataset
class RandomDataset(torch.utils.data.Dataset):
    def __init__(self, size=300):
        self.size = size

    def __len__(self):
        return self.size

    def __getitem__(self, idx):
        return torch.randn(3, 224, 224), 0  # dummy label

calibration_dataset = RandomDataset()
# 4. Define a transformation function for the calibration dataset
def transform_fn(data_item):
    images, _ = data_item
    # Add a batch dimension to match the (1, 3, 224, 224) model input;
    # NNCF feeds NumPy arrays to the OpenVINO model during PTQ
    return images.unsqueeze(0).numpy()
# 5. Apply Post-Training Quantization (PTQ)
print("Applying Post-Training Quantization...")
quantized_ov_model = nncf.quantize(
    ov_model,
    nncf.Dataset(calibration_dataset, transform_fn)
)
# 6. Save the quantized OpenVINO model
output_dir = "./quantized_model"
os.makedirs(output_dir, exist_ok=True)
model_path = os.path.join(output_dir, "resnet18_quantized.xml")
ov.save_model(quantized_ov_model, model_path)
print(f"Quantized model saved to {model_path}")
# To load and use the quantized model:
# core = ov.Core()
# loaded_model = core.read_model(model_path)
# compiled_model = core.compile_model(loaded_model, "CPU")
# # Inference goes here
# print("Model loaded and compiled for inference.")