As enterprise adoption of AI matures, more companies will shift their attention from AI training, which leverages corporate data to train large language models (LLMs), to AI inference, in which trained models draw their own conclusions and act on what they have learned.
While much of the coverage of AI to date has focused on the massive computing power needed to train LLMs, the shift toward AI inferencing will require a new round of innovation to help enable the cost-efficient operation of AI on a large scale.
IBM Research has been developing ways to make that happen with technologies such as speculative decoding and paged attention. Speculative decoding is an optimization technique that helps LLMs generate tokens faster, lowering latency by two to three times, which could give customers using an AI-enabled chatbot a much better experience.
However, reducing latency typically also cuts throughput, the number of people who can use a model at once, which can increase costs for a company hosting an AI model, according to IBM Research. This is where paged attention comes into play: it is a memory optimization technique that uses memory more efficiently to avoid a drop in throughput, and in some cases even increase it. Using these techniques, IBM Research recently cut the latency of serving its open-source Granite 20B code model in half while quadrupling its throughput, researchers claimed.
To learn more about these innovations, Fierce Electronics exchanged e-mails with Raghu Kiran Ganti, Distinguished Engineer, IBM Research, and Mudhakar Srivatsa, Distinguished Engineer, AI Platform, IBM Research.
Fierce Electronics: A lot of the focus so far on AI has been on LLM training, and not as much on AI inference. Why do you think this might be the case, and why is it important to optimize inferencing?
Mudhakar Srivatsa: Media focus has predominantly been on LLM training rather than inferencing because training LLMs involves highly publicized breakthroughs, immense computational resources, and sophisticated algorithms, which often capture the imagination and interest of the public and the media. Training milestones are seen as significant technological advancements that push the boundaries of what AI can achieve.
Optimizing inferencing is essential for enterprise AI deployment at scale because it directly impacts user experience and operational costs. One of the main challenges with LLM inferencing has been the high latency and cost associated with generating responses. Inferencing improvements, like those IBM has made with speculative decoding and paged attention, can significantly enhance performance by reducing latency and increasing throughput. These optimizations make AI more accessible and cost-effective for enterprises, enabling wider adoption and more seamless user interactions. For lack of a better term, inferencing isn’t as flashy as training, but it’s certainly just as important.
FE: What do you think are the most important AI inferencing challenges to address right now?
Srivatsa: The most important AI inferencing challenges to address include reducing latency to improve user experience, enhancing throughput to serve more users simultaneously, optimizing memory usage to manage large models efficiently, and increasing cost efficiency to make AI solutions more accessible. Additionally, ensuring model adaptability for different use cases and improving energy efficiency for sustainability are crucial. Techniques like speculative decoding and paged attention, as demonstrated by IBM, are vital in overcoming these challenges, leading to faster, more scalable, and cost-effective AI deployments.
FE: How does speculative decoding address these challenges?
Raghu Kiran Ganti: Speculative decoding can help LLMs generate tokens faster, lowering latency by two to three times — and giving customers a much better experience. The forward pass is modified so that the LLM evaluates several prospective tokens that come after the one it’s about to generate. If the “speculated” tokens are verified, one forward pass can produce two or more tokens for the price of one. Once the LLM hits an incorrect token, it stops and generates its next token based on all tokens it has validated to that point. This technique allows for the generation of multiple tokens per forward pass, effectively doubling or tripling the speed of AI inferencing. This can lead to reduced costs for companies running the models.
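To make the draft-and-verify loop concrete, the following is a minimal Python sketch of the idea, not IBM's implementation. The `draft_model` and `target_model` callables are hypothetical stand-ins for a small speculator and the full LLM, and the greedy acceptance rule is a simplification; the point is that one verification pass by the expensive model can confirm several speculated tokens at once.

```python
from typing import Callable, List

# Hypothetical interfaces (not a real library API):
#   draft_model(seq)  -> the small speculator's greedy next-token id for seq
#   target_model(seq) -> for every prefix of seq, the big model's greedy
#                        next-token id, computed in a single forward pass
DraftModel = Callable[[List[int]], int]
TargetModel = Callable[[List[int]], List[int]]

def speculative_step(tokens: List[int],
                     draft_model: DraftModel,
                     target_model: TargetModel,
                     k: int = 3) -> List[int]:
    """One round of (greedy) speculative decoding."""
    # 1. Draft phase: cheaply speculate k tokens ahead with the small model.
    draft, proposed = list(tokens), []
    for _ in range(k):
        nxt = draft_model(draft)
        proposed.append(nxt)
        draft.append(nxt)

    # 2. Verify phase: one forward pass of the big model over the context
    #    plus all proposed tokens. predictions[i] is its choice after seq[:i+1].
    predictions = target_model(tokens + proposed)

    # 3. Accept speculated tokens until the first disagreement, then fall
    #    back to the big model's own token, so every round yields at least
    #    one token and up to k + 1 tokens for a single verification pass.
    accepted = list(tokens)
    for i, tok in enumerate(proposed):
        if tok == predictions[len(tokens) + i - 1]:
            accepted.append(tok)
        else:
            accepted.append(predictions[len(tokens) + i - 1])
            break
    else:
        accepted.append(predictions[-1])  # bonus token: all k were verified
    return accepted
```

When speculation frequently succeeds, as it often does for structured output such as code, the expensive model effectively emits several tokens per forward pass.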
FE: Can this technique be used on all different kinds of models?
Ganti: The speculative decoding technique has shown promising results in improving the inferencing speeds of foundation models, particularly when applied to IBM's Granite 20B code model. The technique is especially beneficial in structured scenarios, such as code models, where the predictable nature of syntax enhances its efficiency. In natural language processing, the unpredictability of human language can present more challenges than the structured nature of code. While the technique has been implemented in several open-source models by IBM, its adaptability to different types of models may vary, depending on the specific characteristics and requirements of each model. It would be essential to consider the structure and predictability of the data when evaluating the suitability of speculative decoding for different models.
FE: And what role does paged attention play in optimizing AI inference?
Srivatsa: Paged attention is crucial for optimizing LLM inferencing by improving memory management. Traditional attention mechanisms can lead to memory fragmentation, but paged attention divides memory into smaller blocks, reducing redundancy and freeing up resources. This enhances both latency and throughput by allowing models to handle longer inputs and generate faster responses. Paged attention thus makes LLM inferencing more scalable and cost-effective, essential for practical enterprise applications.
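To illustrate the memory-management idea behind paged attention, here is a small, hypothetical Python sketch of a block-based KV cache pool. The class and parameter names (`PagedKVCache`, `block_size`, and so on) are made up for this example rather than taken from IBM's or any serving engine's actual API; what it shows is that each request's cache grows one fixed-size block at a time from a shared pool instead of reserving a large contiguous buffer up front.

```python
from typing import Dict, List

class PagedKVCache:
    """Toy paged-attention allocator: KV cache memory is split into
    fixed-size blocks handed out on demand, instead of one large
    contiguous buffer reserved per sequence."""

    def __init__(self, num_blocks: int, block_size: int):
        self.block_size = block_size                      # tokens per block
        self.free_blocks: List[int] = list(range(num_blocks))
        # Maps each sequence id to the ordered list of blocks it uses.
        self.block_tables: Dict[str, List[int]] = {}
        self.seq_lengths: Dict[str, int] = {}

    def append_token(self, seq_id: str) -> None:
        """Record one more generated token for a sequence, grabbing a new
        block from the shared pool only when the current block is full."""
        length = self.seq_lengths.get(seq_id, 0)
        if length % self.block_size == 0:                 # block full or new sequence
            if not self.free_blocks:
                raise MemoryError("KV cache pool exhausted")
            self.block_tables.setdefault(seq_id, []).append(self.free_blocks.pop())
        self.seq_lengths[seq_id] = length + 1

    def free_sequence(self, seq_id: str) -> None:
        """Return a finished sequence's blocks to the pool so other
        requests can reuse them, avoiding fragmentation."""
        self.free_blocks.extend(self.block_tables.pop(seq_id, []))
        self.seq_lengths.pop(seq_id, None)

# Example: two concurrent requests sharing one pool of 8 blocks of 16 tokens.
cache = PagedKVCache(num_blocks=8, block_size=16)
for _ in range(40):
    cache.append_token("request-a")   # occupies 3 blocks (40 tokens)
for _ in range(10):
    cache.append_token("request-b")   # occupies 1 block (10 tokens)
cache.free_sequence("request-a")      # its 3 blocks are immediately reusable
```

Because finished requests return their blocks to the pool right away, memory that would otherwise sit fragmented or over-reserved can serve new requests, which is what allows throughput to hold steady or rise even as latency improves.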
FE: How can companies start taking advantage of these optimization techniques?
Ganti: Companies can collaborate with IBM directly, or they can use open-source tools on their own. Direct collaboration with IBM allows for tailored integration of this technology into existing AI models, benefiting from IBM's expertise in AI optimization. Alternatively, companies can access IBM's open-source speculator tool on platforms like Hugging Face, enabling in-house teams to adapt and integrate the technology independently. Both pathways offer opportunities to enhance AI inferencing efficiency, speed up operations, and reduce costs. Resources for those who want to get started include:
- Speculator models for Meta Llama3 8B, IBM Granite 7B lab, Meta Llama2 13B, and
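For readers who want to experiment, the sketch below shows the general draft-plus-verify pattern using the Hugging Face Transformers assisted-generation option (`assistant_model`). It is only an illustration under assumptions: the model names are placeholders, and running IBM's published speculator checkpoints may require IBM's own tooling or a serving engine rather than this exact call.

```python
# Rough sketch of speculative (assisted) generation with Hugging Face
# Transformers. Model names are placeholders; IBM's speculator checkpoints
# may need different tooling than shown here.
from transformers import AutoModelForCausalLM, AutoTokenizer

TARGET = "ibm-granite/granite-20b-code-base"   # placeholder large model
DRAFT = "some-small-draft-model"               # placeholder draft model

tokenizer = AutoTokenizer.from_pretrained(TARGET)
target = AutoModelForCausalLM.from_pretrained(TARGET, device_map="auto")
draft = AutoModelForCausalLM.from_pretrained(DRAFT, device_map="auto")

inputs = tokenizer("def fibonacci(n):", return_tensors="pt").to(target.device)

# The draft model proposes tokens; the target model verifies them in a
# single forward pass per round, cutting end-to-end generation latency.
outputs = target.generate(**inputs, assistant_model=draft, max_new_tokens=64)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```

Note that assisted generation generally expects the draft and target models to share a tokenizer, so the draft model here would need to come from a compatible family.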