NVIDIA TensorRT-LLM
An open-source library and Python API for high-performance, real-time Large Language Model (LLM) inference on NVIDIA GPUs.
TensorRT-LLM is an open-source library for maximizing LLM inference performance on NVIDIA hardware, from data center to desktop. It provides a modular Python runtime, PyTorch-native model authoring, and a stable production API for deployment. Its performance gains—up to 8X in some benchmarks—come from state-of-the-art optimizations such as FP8/NVFP4 quantization, paged attention, and advanced speculative decoding techniques (e.g., EAGLE-3). Developers use the Python and C++ runtimes to build highly optimized TensorRT engines, cutting operational costs while delivering low-latency user experiences.
Recent Talks & Demos