Resource Hub
Curated SDKs, models, guides, tools, and papers for NPU-accelerated generative AI development.
ONNX Runtime
Cross-platform inference engine with NPU execution providers for AMD XDNA (Vitis AI), Intel NPU (OpenVINO), and Qualcomm Hexagon (QNN), plus DirectML on Windows. The primary runtime for portable NPU deployment.
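A minimal provider-selection sketch in Python, assuming the `onnxruntime` package is installed; `model.onnx` is a placeholder path, and the execution-provider names are the ones ONNX Runtime registers for each vendor.

```python
def pick_providers(available):
    """Order NPU-backed execution providers ahead of the CPU fallback."""
    npu_eps = [
        "QNNExecutionProvider",       # Qualcomm Hexagon
        "VitisAIExecutionProvider",   # AMD XDNA
        "OpenVINOExecutionProvider",  # Intel NPU
        "DmlExecutionProvider",       # DirectML on Windows
    ]
    return [p for p in npu_eps if p in available] + ["CPUExecutionProvider"]


def load_session(model_path):
    """Create an inference session that prefers whatever NPU EP is installed."""
    import onnxruntime as ort
    providers = pick_providers(ort.get_available_providers())
    return ort.InferenceSession(model_path, providers=providers)
```

`load_session("model.onnx")` silently falls back to CPU when no NPU provider is present, which is usually what you want for portable deployment.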
Intel OpenVINO Toolkit
Intel's open-source inference toolkit optimized for Intel NPU. Supports model conversion from PyTorch, TensorFlow, and ONNX with automatic NPU acceleration via the AUTO device plugin.
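A short sketch of the AUTO device plugin in use, assuming the `openvino` package and a model already converted to a format OpenVINO can read; the priority-ordered device string is the documented way to steer AUTO toward the NPU.

```python
def auto_device(priorities=("NPU", "CPU")):
    """Build an AUTO device string with an explicit priority order."""
    return "AUTO:" + ",".join(priorities)


def compile_for_npu(model_path):
    """Read a converted model and let AUTO place it on the NPU when available."""
    import openvino as ov  # assumes the openvino package is installed
    core = ov.Core()
    model = core.read_model(model_path)  # IR (.xml), ONNX, and other formats
    return core.compile_model(model, auto_device())
```

AUTO falls through the priority list at load time, so the same script runs unchanged on machines without an NPU.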
AMD Ryzen AI SDK
AMD's developer SDK for Ryzen AI NPUs. Provides Vitis AI integration, an ONNX Runtime execution provider, and profiling tools targeting the XDNA and XDNA 2 architectures.
Qualcomm AI Engine (QNN SDK)
Qualcomm's AI Engine Direct SDK (QNN) for the Snapdragon NPU (Hexagon). Supports model conversion, runtime execution, and hardware-specific optimizations for Snapdragon X Elite and X Plus.
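One common way to reach the QNN stack is through ONNX Runtime's QNN execution provider; a sketch, assuming an onnxruntime build with QNN support, with `QnnHtp.dll` as the Windows backend library for the Hexagon HTP.

```python
def qnn_provider(backend="QnnHtp.dll"):
    """Provider entry for ONNX Runtime's QNN EP targeting the Hexagon HTP."""
    return ("QNNExecutionProvider", {"backend_path": backend})


def load_qnn_session(model_path):
    """Open a session that runs on the Snapdragon NPU, with CPU fallback."""
    import onnxruntime as ort  # requires an onnxruntime build with QNN support
    return ort.InferenceSession(
        model_path,
        providers=[qnn_provider(), "CPUExecutionProvider"],
    )
```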
DirectML
Microsoft's low-level ML API for Windows hardware acceleration. DirectML is the backbone of NPU support in ONNX Runtime and Windows AI APIs — essential reading for Windows NPU developers.
Windows AI APIs (WinML + Phi Silica)
Windows' built-in AI APIs for NPU-accelerated inference, including Phi Silica (Microsoft's on-device LLM), OCR, and image-description APIs. Available on Copilot+ PCs running Windows 11 24H2 or later.
LLaMA 3.2 (NPU-Optimized)
Meta's LLaMA 3.2 1B and 3B variants, pre-quantized to INT4 and exported to ONNX, ready for deployment via ONNX Runtime on AMD XDNA and Intel NPU. Available on Hugging Face in GGUF and ONNX formats.
Phi-3.5 Mini INT4
Microsoft's Phi-3.5 Mini language model, pre-optimized with Olive for NPU targets. At 3.8B parameters with INT4 quantization, it achieves excellent quality-speed tradeoffs on all supported NPUs.
Whisper Large v3 (NPU ONNX)
OpenAI Whisper Large v3 exported to ONNX with INT8 quantization for NPU-accelerated real-time speech recognition. Includes a ready-to-run Python demo with DirectML execution provider.
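A sketch of the demo's shape, assuming the `onnxruntime` package; `whisper_large_v3.onnx` and the 16 kHz mono input are placeholders, and feature extraction plus decoding are left to the bundled demo script.

```python
def chunk_audio(samples, sr=16000, window_s=30):
    """Split a 16 kHz mono waveform into the 30 s windows Whisper expects."""
    step = sr * window_s
    return [samples[i:i + step] for i in range(0, len(samples), step)]


def open_whisper_session(model_path="whisper_large_v3.onnx"):
    """Run the INT8 export on the NPU via the DirectML execution provider."""
    import onnxruntime as ort
    return ort.InferenceSession(
        model_path,
        providers=["DmlExecutionProvider", "CPUExecutionProvider"],
    )
```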
Stable Diffusion ONNX (NPU)
Stable Diffusion XL Turbo and SDXL-Lightning pipelines exported to ONNX with DirectML NPU acceleration. Achieves 1-4 step inference on Copilot+ PC hardware with competitive quality.
Getting Started: Your First NPU Inference
Step-by-step guide from environment setup to running your first model on NPU hardware. Covers Windows 11 setup, driver installation, ONNX Runtime configuration, and a working Phi-3.5 demo.
INT4 Quantization with Olive: Full Pipeline
Complete walkthrough for quantizing any HuggingFace LLM to INT4 using Microsoft Olive, with NPU-specific calibration, accuracy evaluation, and ONNX export ready for deployment.
Profiling NPU Workloads on Windows
How to use AMD uProf, Intel VTune, and Windows Performance Analyzer to profile NPU utilization, identify bottlenecks, and optimize your inference pipeline.
NPU Benchmark
Universal NPU performance testing tool that benchmarks AMD XDNA, Intel NPU, and Qualcomm Hexagon with a unified score. Real AI workloads, transparent methodology, global leaderboard.
Microsoft Olive
Hardware-aware model optimization toolchain for NPU targets. Automates quantization, pruning, and ONNX export pipelines with NPU-specific optimizations. Open source on GitHub.
Netron — ONNX Model Visualizer
Browser-based visualizer for ONNX, TensorFlow, and CoreML models. Essential for inspecting model graphs, verifying NPU-compatible operator sets, and debugging export issues.
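Alongside visual inspection in Netron, the operator check can be scripted; `unsupported_ops` and the allow-list below are hypothetical helpers you would fill from your vendor's execution-provider documentation, and extracting the graph's operators with the `onnx` package is shown in a comment.

```python
def unsupported_ops(model_ops, npu_supported):
    """Return graph operators that would fall back to CPU on the target NPU."""
    return sorted(set(model_ops) - set(npu_supported))

# With the onnx package, a graph's operator set is one comprehension away:
#   import onnx
#   model_ops = {node.op_type for node in onnx.load("model.onnx").graph.node}

# Toy check against a hypothetical allow-list:
fallbacks = unsupported_ops(
    {"MatMul", "Softmax", "RotaryEmbedding"},
    {"MatMul", "Softmax", "Gelu"},
)
```

Any operator in `fallbacks` will execute on the CPU, so a non-empty result is an early warning that an export needs different opset choices or graph rewrites.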
LLM Inference on Mobile NPUs: Survey 2024
Comprehensive survey of LLM inference techniques optimized for mobile and edge NPUs, covering quantization strategies, memory bandwidth constraints, and deployment frameworks across leading platforms.
AWQ: Activation-aware Weight Quantization for LLM Compression and Acceleration
The MIT HAN Lab's AWQ paper — the quantization method behind many of the most efficient INT4 NPU deployments. Key reading for understanding why INT4 quantization works well for on-device LLM inference.
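A toy numpy illustration of the activation-aware idea (not the paper's full method, which also searches the scaling exponent and uses group-wise quantization): scale salient input channels by a power of their average activation magnitude before INT4 rounding, then fold the scale back out.

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.normal(size=(64, 64)).astype(np.float32)    # weights [out_ch, in_ch]
X = rng.normal(size=(256, 64)).astype(np.float32)   # calibration activations
X[:, :4] *= 50.0    # a few "salient" input channels with large activations
W[:, :4] *= 0.1     # ...whose weights look unremarkable on their own

def quant_int4(w):
    """Symmetric per-output-channel INT4 round-to-nearest."""
    step = np.abs(w).max(axis=1, keepdims=True) / 7.0
    return np.clip(np.round(w / step), -8, 7) * step

# Plain INT4: salient channels get no special protection.
err_plain = np.abs(X @ W.T - X @ quant_int4(W).T).mean()

# Activation-aware: scale salient input channels up before rounding,
# then fold the inverse scale back into the dequantized weights.
s = np.abs(X).mean(axis=0) ** 0.5   # fixed exponent here; AWQ searches it
W_awq = quant_int4(W * s) / s
err_awq = np.abs(X @ W.T - X @ W_awq.T).mean()
```

With the scaling in place, the output error `err_awq` comes out well below `err_plain`, because the channels that dominate the matmul output are rounded on a finer effective grid.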
Add a Resource
Know a great SDK, model, guide, or tool that belongs here? Submit it to the community resource hub. We review and publish within 48 hours.