Intel Neural Compressor
This open source Python* library performs framework-independent AI model optimization.
Vendor
Intel Corporation
Product details
Deploy More Efficient Deep Learning Models
Intel® Neural Compressor performs model optimization to reduce the model size and increase the speed of deep learning inference for deployment on CPUs, GPUs, or Intel® Gaudi® AI accelerators. This open source Python* library automates popular model optimization technologies, such as quantization, pruning, and knowledge distillation across multiple deep learning frameworks. Using this library, you can:
- Converge quickly on quantized models through automatic accuracy-driven tuning strategies.
- Prune the least important parameters from large models.
- Distill knowledge from a larger model to improve the accuracy of a smaller model for deployment.
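To make the quantization idea above concrete, here is a minimal sketch of affine int8 quantization: a float tensor is mapped onto the int8 range with a scale and zero-point, then dequantized for comparison. This is illustrative pure Python, not the Intel Neural Compressor API, which automates this per operator with accuracy-driven tuning.

```python
# Minimal sketch of affine int8 quantization (illustrative only; Intel
# Neural Compressor automates this per-operator with tuning).

def quantize_int8(values):
    """Return (int8 values, scale, zero_point) for a list of floats."""
    lo, hi = min(values), max(values)
    scale = (hi - lo) / 255 or 1.0          # 256 int8 levels; avoid div by zero
    zero_point = round(-128 - lo / scale)   # maps `lo` near -128
    q = [max(-128, min(127, round(v / scale) + zero_point)) for v in values]
    return q, scale, zero_point

def dequantize_int8(q, scale, zero_point):
    """Recover approximate floats from quantized values."""
    return [(x - zero_point) * scale for x in q]

weights = [-1.5, -0.2, 0.0, 0.7, 2.1]
q, scale, zp = quantize_int8(weights)
restored = dequantize_int8(q, scale, zp)
# Round-trip error stays within half a quantization step (scale / 2).
assert all(abs(a - b) <= scale / 2 + 1e-9 for a, b in zip(weights, restored))
```

Each weight now fits in one byte instead of four, which is where the model-size and inference-speed gains come from; the tuning strategies mentioned above decide which operations can tolerate this precision loss.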
Features
- Model Optimization Techniques
  - Quantize activations and weights to int8, FP8, or a mixture of FP32, FP16, FP8, bfloat16, and int8 to reduce model size and speed up inference while minimizing precision loss. Quantize during training, post-training, or dynamically based on the runtime data range.
  - Prune parameters that have minimal effect on accuracy to reduce the size of a model. Configure pruning patterns, criteria, and schedules.
  - Automatically tune quantization and pruning to meet accuracy goals.
  - Distill knowledge from a larger model ("teacher") into a smaller model ("student") to improve the accuracy of the compressed model.
  - Customize quantization with advanced techniques such as SmoothQuant, layer-wise quantization, and weight-only quantization (WOQ) for low-bit inference.
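The pruning technique listed above can be sketched in a few lines. This is a toy illustration of unstructured magnitude pruning (zeroing the smallest-magnitude weights at a target sparsity), not the library's API, which additionally supports structured patterns, criteria, and schedules.

```python
# Toy sketch of unstructured magnitude pruning: zero out the fraction of
# weights with the smallest absolute values. Illustrative only.

def magnitude_prune(weights, sparsity):
    """Zero the `sparsity` fraction of weights with smallest |w|."""
    n_prune = int(len(weights) * sparsity)
    # Indices of the n_prune smallest-magnitude weights.
    order = sorted(range(len(weights)), key=lambda i: abs(weights[i]))
    pruned = list(weights)
    for i in order[:n_prune]:
        pruned[i] = 0.0
    return pruned

w = [0.9, -0.05, 0.4, 0.01, -0.7, 0.002]
# Half the weights (the three smallest in magnitude) are zeroed.
assert magnitude_prune(w, 0.5) == [0.9, 0.0, 0.4, 0.0, -0.7, 0.0]
```

Zeroed weights can then be stored sparsely or skipped at inference time, which is why pruning the least important parameters shrinks large models with little accuracy impact.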
- Automation
  - Achieve objectives with expected accuracy criteria using built-in strategies that automatically apply quantization techniques to operations.
  - Combine multiple model optimization techniques with one-shot optimization orchestration.
- Interoperability
  - Optimize and export PyTorch* or TensorFlow* models.
  - Optimize and export Open Neural Network Exchange (ONNX*) Runtime models with Intel Neural Compressor 2.x. As of version 3.x, Intel Neural Compressor is upstreamed into open source ONNX for built-in cross-platform deployment.
  - Use familiar PyTorch, TensorFlow, or Hugging Face* Transformers-style APIs to configure and autotune model compression.
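The accuracy-driven automation described under Features can be sketched as a simple search loop: try candidate compression configurations and keep the first one whose accuracy drop stays within a tolerance. The names here (`autotune`, `configs`, `evaluate`) are hypothetical, not Intel Neural Compressor API; the library's built-in strategies do this per operation with richer search logic.

```python
# Sketch of accuracy-driven tuning: accept the first configuration whose
# accuracy drop from the FP32 baseline is within `max_drop`. The names
# and configs below are hypothetical, not Intel Neural Compressor API.

def autotune(configs, evaluate, baseline_acc, max_drop=0.01):
    """Return the first config meeting the accuracy criterion, or None."""
    for cfg in configs:  # ordered from most to least aggressive
        acc = evaluate(cfg)
        if baseline_acc - acc <= max_drop:
            return cfg
    return None

# Toy example: pretend accuracies measured for three candidate configs.
accuracies = {"int8-all": 0.72, "int8-partial": 0.755, "bf16": 0.758}
chosen = autotune(list(accuracies), accuracies.get, baseline_acc=0.76)
assert chosen == "int8-partial"  # first config within the 0.01 drop budget
```

Ordering candidates from most to least aggressive means the loop returns the smallest, fastest model that still meets the accuracy criterion, which mirrors the "converge quickly on quantized models" goal stated above.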