Nvidia touts clients using Triton to serve large language models

Nvidia has revealed more details about how AI firms and other technology companies are using its solutions to work with large language models (LLMs), which are increasingly relied upon to power websites, applications and other AI features.

The company published a blog post on the topic this week, a couple of weeks after it launched new solutions to help developers and companies train LLMs.

Among the clients Nvidia discussed in its blog post is French firm NLP Cloud, which runs an AI-powered software service for text data that feeds applications such as an internet news service for employees of a European airline. NLP Cloud uses Nvidia’s Triton Inference Server to serve about 25 different LLMs for natural language processing. The largest of those LLMs “has 20 billion parameters, a key measure of the sophistication of a model,” the blog post stated. “And now it’s implementing BLOOM, an LLM with a whopping 176 billion parameters.”
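Triton exposes standard HTTP and gRPC endpoints, so serving many models behind one server largely comes down to client calls like the following. This is a minimal sketch using Nvidia’s tritonclient Python package; the model name (gpt_20b) and the input and output tensor names are hypothetical placeholders, not details from NLP Cloud’s deployment.

```python
import numpy as np
import tritonclient.http as httpclient

# Connect to a Triton Inference Server (default HTTP port is 8000).
client = httpclient.InferenceServerClient(url="localhost:8000")

# "gpt_20b", "INPUT_TEXT" and "OUTPUT_TEXT" are hypothetical names;
# the real ones depend on how the model repository is configured.
prompt = np.array([[b"Summarize today's airline news:"]], dtype=object)
inp = httpclient.InferInput("INPUT_TEXT", prompt.shape, "BYTES")
inp.set_data_from_numpy(prompt)

out = httpclient.InferRequestedOutput("OUTPUT_TEXT")

# One Triton server can host many models; the model is chosen per request,
# which is how a single deployment can front dozens of LLMs.
result = client.infer(model_name="gpt_20b", inputs=[inp], outputs=[out])
print(result.as_numpy("OUTPUT_TEXT"))
```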

NLP Cloud chief Julien Salinas stated, “Very quickly the main challenge we faced was server costs. Triton turned out to be a great way to make full use of the GPUs at our disposal.”

The start-up leveraged FasterTransformer, a part of Triton that automates complex jobs such as splitting a model across many GPUs; the software enables Nvidia A100 Tensor Core GPUs to process as many as 10 requests at a time, twice the throughput of alternative software, the company said.
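That throughput figure reflects Triton’s ability to keep several requests in flight and batch them on the server. As a hedged illustration (not NLP Cloud’s actual code), the HTTP client’s async_infer call lets a caller issue requests without blocking; the model and tensor names below are again placeholders, and the multi-GPU model splitting itself is configured on the server side (for example via FasterTransformer’s tensor-parallel settings), not in client code.

```python
import numpy as np
import tritonclient.http as httpclient

# concurrency sizes the client's connection pool so ten requests
# can be outstanding at once, mirroring the "10 requests at a time" claim.
client = httpclient.InferenceServerClient(url="localhost:8000", concurrency=10)

def make_input(text: str) -> httpclient.InferInput:
    arr = np.array([[text.encode()]], dtype=object)
    inp = httpclient.InferInput("INPUT_TEXT", arr.shape, "BYTES")
    inp.set_data_from_numpy(arr)
    return inp

prompts = [f"Translate headline #{i} into French." for i in range(10)]

# Fire off all requests without waiting; Triton's scheduler can
# group them into batches on the server side.
pending = [
    client.async_infer(model_name="gpt_20b", inputs=[make_input(p)])
    for p in prompts
]

# Collect the results as each request completes.
for req in pending:
    result = req.get_result()
    print(result.as_numpy("OUTPUT_TEXT"))
```

How aggressively such requests actually get batched depends on the server’s dynamic-batching configuration, which is set per model in Triton’s model repository rather than by the client.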

NLP Cloud has also adopted NVIDIA NeMo Megatron, an end-to-end framework for training and deploying LLMs with trillions of parameters, which it is using to train custom versions of LLMs that support more languages and run more efficiently.

The blog post also notes that Microsoft’s Translate service employed Triton to run inference on models with up to 5 billion parameters, achieving a massive speedup in the process. Twitter, NLP provider Cohere, Tokyo-based chatbot developer rinna and Tel Aviv-based Tabnine, which runs a service that has automated up to 30% of the code written by a million developers globally, have also used Nvidia GPUs with Triton for their LLM needs.