Nvidia announced today that it is releasing version 8 of TensorRT, its software development kit for high-performance deep learning inference, with enhancements designed to boost performance and accuracy for the growing amount of AI inference happening at the edge and in embedded systems.
Earlier this year, TensorRT 7 helped Nvidia achieve high-scoring benchmarks in the MLPerf testing program. “Now, we are releasing Version 8, which is twice as powerful and accurate as Version 7, and supports sparsity on Nvidia Ampere GPUs, which reduces the overall memory requirement,” said Siddharth Sharma, head of product marketing, AI software, at Nvidia.
The performance enhancement comes via transformer optimizations, while quantization-aware training enabled the accuracy improvements. Sharma explained why these improvements are so important:
“Traditionally, AI training is done in the data center,” he said. “You start with petabytes of data, hundreds of thousands of hours of speech data, and you train the model to the highest level of accuracy that you can. Once you’ve trained it, you throw it over the wall for inference. Once you get to the inference portion you actually have to make some hard choices because inference happens now not only at the data center, but at the edge. You’re doing it in embedded systems, you’re doing it in automotive systems, and so on. During inference your goal is to retain the highest accuracy you can from your training process and run it on hardware devices to achieve the lowest response time and maximized throughput for your customers.”
However, the need to be as accurate as possible can conflict with the memory and throughput available at the edge: a well-trained, highly accurate model may simply be too slow to run. In TensorRT 8, transformer optimizations deliver the performance gains, quantization-aware training improves accuracy, and sparsity allows the less important weights in a deep learning model to be zeroed out, freeing memory and compute for the parts that matter more.
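To make the two techniques above concrete, here is a minimal, illustrative sketch in plain NumPy — not Nvidia's implementation. The first function simulates INT8 quantization during training (the core idea behind quantization-aware training), and the second applies the 2:4 structured sparsity pattern that Ampere GPUs accelerate, keeping the two largest-magnitude weights in every group of four:

```python
import numpy as np

def fake_quantize_int8(x, scale):
    """Simulate INT8 quantization in the forward pass: snap values to the
    INT8 grid so the model learns to tolerate the rounding error."""
    q = np.clip(np.round(x / scale), -128, 127)
    return q * scale

def prune_2_4(w):
    """Ampere-style 2:4 structured sparsity: in every group of 4 weights,
    zero out the 2 with the smallest magnitude."""
    groups = w.reshape(-1, 4).copy()
    smallest = np.argsort(np.abs(groups), axis=1)[:, :2]  # 2 smallest per group
    np.put_along_axis(groups, smallest, 0.0, axis=1)
    return groups.reshape(w.shape)

weights = np.array([0.9, -0.1, 0.05, 0.7, -0.3, 0.8, 0.02, -0.6])
print(prune_2_4(weights))                      # half of each group becomes zero
print(fake_quantize_int8(weights, scale=0.01)) # values snapped to the INT8 grid
```

The zeroed weights follow a fixed pattern (two per group of four), which is what lets sparse tensor hardware skip them predictably rather than chasing arbitrary zeros.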
In addition to announcing TensorRT 8 availability, Nvidia also announced it had made a breakthrough on BERT-Large inference using TensorRT. It was able to perform inference on BERT, one of the world’s most widely used transformer-based models, in just 1.2 milliseconds. That speed of inference “can make conversational AI smarter” and enhance the performance of numerous interactive applications, said Kari Briski, director of product management, AI software, at Nvidia.
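Per-query latency figures like the 1.2 ms number above are typically gathered by timing many repeated inferences after a warmup phase and reporting a summary statistic. A hypothetical sketch of such a harness (the `infer` callable stands in for a real deployed model):

```python
import statistics
import time

def measure_latency_ms(infer, n_warmup=10, n_runs=100):
    """Time infer() over n_runs after a warmup; return median milliseconds."""
    for _ in range(n_warmup):      # warm caches and any lazy initialization
        infer()
    samples = []
    for _ in range(n_runs):
        start = time.perf_counter()
        infer()
        samples.append((time.perf_counter() - start) * 1e3)
    return statistics.median(samples)

# Stand-in workload; a real benchmark would call the deployed model here.
print(f"{measure_latency_ms(lambda: sum(range(10_000))):.3f} ms")
```

Reporting the median (or a high percentile) rather than the mean keeps one-off scheduling hiccups from distorting the result.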