Every empire has a weakness, whether it's a small thermal exhaust port into which a proton torpedo can be shot to trigger a chain reaction, or the fact that high-performance GPUs are also massively expensive and power-hungry.
While Nvidia has ruled the AI chip market in the early going, a number of companies are aware of this empire's potential weakness as AI's evolution shifts from training to inference, and hyperscalers and enterprises begin to consider less costly, more power-efficient options that do not rely on GPUs.
The latest of those challengers is Sagence AI, a six-year-old Santa Clara, California, company that believes an analog in-memory computing architecture, one that merges storage and compute inside memory cells, could be the key to better AI inference. (The company was called Analog Inference until just before its emergence from stealth this month.)
Richard Terrill, vice president of strategy and business development at Sagence AI, recently told Fierce Electronics that the AI market remains wide open for GPU alternatives because the biggest AI opportunity–inference–is barely off the starting blocks. “I like to use the analogy that [AI] training is going to university. You go to school for a number of years, you study, you spend a lot of time, effort, and capital, and you become really intelligent, but you haven't created any value yet.”
He added, “It's not until you get a job [or in AI] take that trained network or that trained chip and deploy the inference engine, that you can actually start creating commercial value. Looking at people in stores, keeping kids out of intersections and traffic accidents and the like–that's where the value side is. That's where the real win is going to be.”
Nvidia has acknowledged this point as well. Part of the company's most recent earnings call was spent discussing AI inference, and Nvidia CEO Jensen Huang claimed his firm is well-positioned as more of its customers turn to AI inference to power a new era of AI applications, such as agentic AI offerings.
“We're seeing inference really starting to scale up for our company,” Huang said on the earnings call. “We are the largest inference platform in the world today because our installed base is so large. And everything that was trained on Amperes and Hoppers can inference incredibly on Amperes and Hoppers. As we move to Blackwells [Nvidia’s newest GPUs] for training foundation models, that also will leave a large installed base of extraordinary infrastructure for inference.”
But Sagence AI’s counterpoint is that just because the burly Blackwells of the world are great at training AI does not mean they will be the best option for AI inference. “The challenge that most companies have faced with going into the inference world is they've taken the same machinery that they've been doing the training with, typically digitally scheduled general-purpose computing, although with some very specialist capabilities that are quite impressive,” Terrill said. “But when you go and draft it for inference, you're carrying a lot of legacy and baggage that's not needed, because inference is not a general purpose computing problem. It's actually a very well understood data flow processing problem, and you don't need all that general capability.”
Sagence AI’s solution relies on using a non-volatile flash memory chip for storing AI weights (the numerical values that support AI decision-making) to enable inference with “almost no current consumption required,” Terrill said. He explained, “In digital processing, when you take a neural network, there's tens or hundreds of millions of weights sitting there somewhere, and you've got thousands, maybe millions, of computing units, a finite resource. You've got a really big scheduling problem. You've got to take all these weights and shuffle them in and out of a limited number of arithmetic units, take the intermediate results from them somewhere else [off-chip for AI to make a decision].”
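To make that data-movement problem concrete, here is a minimal Python sketch of the conventional digital pattern Terrill describes: weights stream from memory into a limited pool of arithmetic units, tile by tile. The array sizes and tile width are illustrative assumptions, not Sagence or Nvidia specifics.

```python
import numpy as np

def digital_matvec(weights, activations, num_alus=1024):
    """Illustrative digital inference step: weights must be fetched from
    (off-chip) memory in tiles and scheduled onto a finite pool of
    arithmetic units, so data movement dominates energy and latency."""
    out = np.zeros(weights.shape[0])
    # Process the weight matrix one tile at a time, mimicking the constant
    # shuffle of weights in and out of a limited compute array.
    for start in range(0, weights.shape[0], num_alus):
        tile = weights[start:start + num_alus]             # "fetch" a tile of weights
        out[start:start + num_alus] = tile @ activations   # compute only on that tile
    return out
```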
The technology solution proposed by Sagence AI “takes the weights and applies them to the control gate of [an analog] flash cell that interacts with the floating gate of the flash cell, and out comes a current that's a multiplication,” Terrill said. “We've just done a vector matrix multiplication.”
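A rough model of that analog scheme: each flash cell stores a weight as a conductance, the input activation is applied as a voltage, Ohm's law gives the per-cell multiplication as a current, and the currents on each output line sum automatically. The NumPy sketch below is only an idealized illustration under those assumptions; it ignores noise, quantization, and analog-to-digital conversion.

```python
import numpy as np

def analog_in_memory_matvec(conductances, input_voltages):
    """Idealized analog in-memory vector-matrix multiply.

    conductances:   matrix of cell conductances, one per stored weight (G)
    input_voltages: activation vector applied to the array's input lines (V)

    Each cell contributes a current I = G * V (Ohm's law); currents sharing an
    output line sum automatically (Kirchhoff's current law), so every cell
    computes in parallel and the summed line currents are the result.
    """
    per_cell_currents = conductances * input_voltages   # element-wise multiply in place
    output_currents = per_cell_currents.sum(axis=1)     # accumulation along each output line
    return output_currents

# Example: a 4x3 array of stored weights driven by a 3-element activation vector.
G = np.array([[0.2, 0.5, 0.1],
              [0.7, 0.3, 0.9],
              [0.4, 0.8, 0.6],
              [0.1, 0.2, 0.3]])
V = np.array([1.0, 0.5, 0.25])
print(analog_in_memory_matvec(G, V))   # matches G @ V, with no weight movement
```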
In addition to consuming less power, Sagence AI’s approach also offers “a significant cost savings,” Terrill claimed, “because we've eliminated the off-chip memory... We don't have all the scheduling overhead, and there's also a performance increase, because we can put hundreds of millions of cells on a chip and run them in parallel, and we don't have to clock super fast to get our performance. We just get really efficient… and extraordinarily high performance simply by having so many computational units that they can all run concurrently, basically getting 100% utilization.”
Can Sagence AI point to MLPerf results (Nvidia’s favored comparison) or evidence from customer deployments to back up these claims? Not yet. The company is aiming to announce products based on its technology next year, and for now offers a comparison based on its own simulation of how it thinks its approach can best Nvidia’s. Terrill said that in an implementation of the 70-billion-parameter Llama 2 large language model at 666,000 tokens per second, an Nvidia approach using H100 GPUs would require 944 rack units consuming 625 kilowatts of power, at a projected price of around $39 million. Meanwhile, Sagence, using in-memory analog compute for inference on the same job, would need 23.6 rack units costing about $2 million and drawing 10.6 kilowatts.
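Taking those figures at face value (they come from Sagence AI's own simulation, not an independent benchmark), the implied ratios are easy to work out; the short script below simply divides the numbers Terrill cited.

```python
# Ratios implied by the figures Sagence AI cites for the same Llama 2 70B
# workload at 666,000 tokens/second (company simulation, not measured data).
h100    = {"rack_units": 944.0, "power_kw": 625.0, "cost_usd": 39_000_000}
sagence = {"rack_units": 23.6,  "power_kw": 10.6,  "cost_usd": 2_000_000}

for metric in h100:
    ratio = h100[metric] / sagence[metric]
    print(f"{metric}: {ratio:.1f}x lower for the analog approach, per Sagence's claim")
# Roughly 40x fewer rack units, ~59x less power, ~19.5x lower cost, as claimed.
```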
Notably, that comparison is based on H100s and not Nvidia’s shiny new Blackwell GPUs. H100s are the more widely deployed and used GPUs for now, so that makes sense, but Terrill said Sagence AI knows that Nvidia is not a sitting duck and will keep refining and evolving its technology, putting pressure on would-be competitors to do the same.
Sagence AI also is not the only one looking at the transition to AI inference as an opening for challenging Nvidia, nor is it the only company looking to leverage analog inference to do so, with firms like IBM, Infineon, D-Matrix, and others eyeing old-school analog for inference as well.
As Sagence AI ramps up its approach into products in the year to come, it will target vision and large language model inference applications. Many AI applications will not necessarily generate a lot of revenue directly, and this, too, is another reason Sagence AI believes it is time for a different approach. “How much are people willing to pay for the results of a chat? Not a lot, but if it costs you a whole bunch to deliver it, you don't have a business,” Terrill said. “That's the rub, and that's why we think that what we're doing–and what others certainly will do–is the path forward.”