NVIDIA TensorRT-LLM
An open-source library and Python API for high-performance, real-time Large Language Model (LLM) inference on NVIDIA GPUs.
TensorRT-LLM is an open-source library for maximizing LLM inference performance on NVIDIA hardware, from data center to desktop. It provides a modular Python runtime, PyTorch-native model authoring, and a stable production API for deployment. Its performance gains—up to 8X in some benchmarks—come from state-of-the-art optimizations such as FP8/NVFP4 quantization, paged attention, and advanced speculative decoding techniques (e.g., EAGLE-3). Developers use the Python and C++ runtimes to build highly optimized TensorRT engines, cutting operational costs while delivering low-latency user experiences.
Recent Talks & Demos