How supercomputers get super: separated subsystems and liquid cooling

When the computing power that once filled an entire room can now be found in your smartphone, what still makes a supercomputer super?

In the 1995 feature film Apollo 13, astronaut Jim Lovell, portrayed by Tom Hanks, marvels at the computational capabilities supporting the Saturn V rocket’s journey to the moon and back. By comparison, today’s businesses quickly spin up compute, networking, and storage as needed via the cloud. The heyday of Cray and Silicon Graphics has long since passed, and today’s iPhone has more computing power than NASA did in the 1970s.

It’s not that supercomputers no longer exist; it’s that what they are made of and what they are tasked to do have changed. Advances in artificial intelligence (AI) and machine learning (ML), and their subsequent democratization, mean supercomputing is no longer just the domain of universities and research organizations – it’s becoming an integral part of many data-driven businesses.

In the meantime, hardware and architectures have evolved to handle more data, faster. Essentially, the guts of today’s supercomputer are configured differently than they used to be.

Super growth ahead for supercomputers

Recently released research by Fact.MR shows strong demand for the high processing power of supercomputers. The firm forecasts the global supercomputer market will reach US$19 billion by 2033, expanding at a 10.5% CAGR from 2023 to 2033, driven in part by enormous increases in demand for data centers, ML, and AI among businesses and among government and education organizations since the onset of the COVID-19 pandemic.

As defined by Fact.MR, “a supercomputer can execute high-level processing at a quicker rate than a regular computer. It typically contains many processors, which results in faster circuit switching, allowing a user to access and process a huge amount of data in less time.” It noted that one of the most powerful supercomputers launched in the United Kingdom is the Cambridge-1, unveiled by Nvidia in July 2021. It enables health researchers and scientists to combine AI and simulations in support of the country’s life sciences industry.

[Image: computer banks and a sign (Nvidia)]

More recently, Nvidia announced that Hewlett Packard Enterprise (HPE) would be incorporating 384 NVIDIA A100 Tensor Core GPUs as part of a new supercomputer for Mohamed bin Zayed University of Artificial Intelligence in the United Arab Emirates (UAE). It will be used to run complex AI models with extremely large data sets and increase predictability in research analyses in fields including energy, transportation, and the environment.

One thing hasn’t changed about supercomputers: they’re still designed to handle large data sets. However, supercomputing can mean different things to different people, according to Mike Houston, vice president and chief architect for AI systems at Nvidia. His first job as an undergraduate in the 1990s was with the San Diego Supercomputer Center. “This was the heyday of high-performance computing (HPC) systems – an amazing, amazing time,” he recalled. But it was also the beginning of a transition into the next millennium toward more commodity hardware interconnected in clusters.

Houston said the interconnects were key – commodity processors would be connected by something like InfiniBand. Because today’s supercomputing is much more about AI learning from data, I/O becomes a big problem – an optimized AI system is a “pretty wicked” HPC system with robust interconnection capabilities, he said. Nvidia’s approach is to build a system that best matches the capabilities of its GPUs and to base its fabric designs predominantly on InfiniBand, alongside Ethernet-based designs.

Tuning and tweaking just as important as raw computing power

Nvidia’s collaboration with Microsoft is a good example of how supercomputing can take different forms. Microsoft normally builds traditional data centers with pizza-box servers, but Microsoft Research has employed the Nvidia DGX SuperPOD, a blueprint for assembling and scaling AI supercomputing infrastructure, to make that capability available in its cloud. The collaboration also lets Nvidia demonstrate InfiniBand and the many subtleties involved in scaling up supercomputing, Houston said. “Everybody has to make their own little tweaks to the systems to fit into their data centers, but we're trying to make those differences smaller and smaller over time so that everybody can get the benefit of these big iron AI systems.”

Houston said there’s a misconception that designers throw a ton of computing power into a box and mash multiple boxes together, and “boom, supercomputer!” Rather, a great deal of time is spent on the topology. “You're flowing a lot of data and you need to get data from the accelerators – the GPUs – into the network very, very fast.” Houston said there’s also a lot of tweaking of firmware and configuration. “You just don't throw stuff in a machine, and it magically works at high performance.”
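To make that concrete, here is a minimal back-of-envelope sketch in Python – with assumed figures for gradient size, GPU count, and link speeds, not numbers from Nvidia – of why per-link bandwidth dominates once a cluster has to exchange data between accelerators and the network, using a classic ring all-reduce as the example:

```python
# Back-of-envelope sketch: how link bandwidth bounds a ring all-reduce.
# All figures (gradient size, bandwidths, GPU count) are illustrative
# assumptions, not measurements from any real system.

def ring_allreduce_seconds(num_gpus: int, gradient_bytes: float,
                           link_bytes_per_s: float) -> float:
    """A ring all-reduce moves roughly 2*(N-1)/N of the payload over each link."""
    traffic = 2 * (num_gpus - 1) / num_gpus * gradient_bytes
    return traffic / link_bytes_per_s

gradient_bytes = 10e9  # ~10 GB of gradients for a large model (assumed)
for gb_per_s in (12.5, 50, 400):  # assumed per-link bandwidth tiers in GB/s
    t = ring_allreduce_seconds(512, gradient_bytes, gb_per_s * 1e9)
    print(f"{gb_per_s:>6} GB/s per link -> ~{t:.2f} s per gradient exchange")
```

The faster the fabric, the less time the accelerators sit idle waiting on each other – which is exactly the kind of tuning Houston describes.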

Components also matter, Houston said, including processors and memory, both volatile and non-volatile. A lot of the components inside a supercomputer are commodity parts, but they are assembled in a customized way. High Bandwidth Memory (HBM) and the latest and greatest DDR5 DRAM are common in supercomputers today, as is flash memory. “We’re trying to get a hierarchy of memories,” he said. Low-power DRAM is also being employed. “We've been adopting a lot of the commodity components but figuring out how to tweak them and push them.”
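A rough way to picture that hierarchy is as tiers that trade capacity for bandwidth. The sketch below uses made-up capacity and bandwidth figures – not vendor specifications – purely to show why placing a working set in the fastest tier that can hold it matters:

```python
# Minimal sketch of a tiered memory hierarchy (HBM -> DDR5 -> flash).
# Capacities and bandwidths are assumptions chosen for illustration only.

from dataclasses import dataclass

@dataclass
class MemoryTier:
    name: str
    capacity_gb: float
    bandwidth_gb_s: float

HIERARCHY = [
    MemoryTier("HBM",   96,     3000),  # on-package, fastest, smallest (assumed)
    MemoryTier("DDR5",  1024,   400),   # host DRAM (assumed)
    MemoryTier("Flash", 30000,  25),    # NVMe flash, largest, slowest (assumed)
]

def place(working_set_gb: float) -> MemoryTier:
    """Pick the fastest tier that can hold the whole working set."""
    for tier in HIERARCHY:
        if working_set_gb <= tier.capacity_gb:
            return tier
    return HIERARCHY[-1]

for size in (50, 800, 20000):
    tier = place(size)
    print(f"{size:>6} GB working set -> {tier.name}, "
          f"~{size / tier.bandwidth_gb_s:.1f} s to stream once")
```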

At a basic level, a supercomputer started as a CPU connected to other components that you run code against, said Michael McNerney, vice president of marketing and network security at Supermicro, but the architecture has been blown up thanks to advances in storage, fabrics, and GPUs. “Everything is moving away from working through the CPU,” he said, because it’s been identified as a bottleneck. “I would almost call it more of a throughput-centric versus CPU-centric architecture.”

Building a better supercomputer is no longer about just adding more and more cores to the CPU, McNerney said. Instead, performance has moved beyond the CPU into the different subsystems. “You're seeing these other subsystems really stepping up in performance.” Examples include new storage form factors with high capacities, accelerators, and custom ASICs, he said. Smarter networking is freeing the CPU to do other things, and accelerators are being built into the CPU itself.
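A simple way to model that shift is to compare a data path staged through host memory with a direct path from storage to the accelerator. The bandwidth figures in this sketch are illustrative assumptions, not benchmarks of any particular product:

```python
# Toy comparison of a CPU-staged data path versus a direct path to an
# accelerator. All bandwidth figures are assumptions for illustration.

def cpu_staged_seconds(data_gb: float, nvme_gb_s: float = 7.0,
                       host_copy_gb_s: float = 20.0,
                       pcie_gb_s: float = 25.0) -> float:
    """Store-and-forward through host DRAM: three sequential copies."""
    return data_gb / nvme_gb_s + data_gb / host_copy_gb_s + data_gb / pcie_gb_s

def direct_seconds(data_gb: float, nvme_gb_s: float = 7.0,
                   pcie_gb_s: float = 25.0) -> float:
    """Peer-to-peer style transfer limited only by the slowest link."""
    return data_gb / min(nvme_gb_s, pcie_gb_s)

for gb in (100, 1000):
    print(f"{gb:>5} GB: via CPU ~{cpu_staged_seconds(gb):.0f} s, "
          f"direct ~{direct_seconds(gb):.0f} s")
```

The point of the toy model is the one McNerney makes: once the bounce through the CPU is removed, throughput is bounded by the subsystems themselves.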

McNerney said the disaggregated nature of modern supercomputers also means the different subsystems can be swapped out as they become more efficient – CPUs can be changed without having to overhaul the rest of the infrastructure, which means spinning up supercomputing capabilities no longer requires a big capital investment in a single large machine.

Supercomputing gets even cooler

But for all the talk of spreading out the work to different subsystems to improve overall performance, the emergence of liquid cooling could play a key role in democratizing supercomputing. Liquid cooling isn’t new; it has been used in some of the largest supercomputers, as well as in workstations and game boxes, to cool GPUs.

As it becomes more mainstream, McNerney said, liquid cooling will go from being used in only 10% of supercomputers to the vast majority in the relatively near future to offset the heat generated by power-hungry components. The efficiency of liquid cooling will make the overall footprint of a supercomputer cheaper, allowing organizations to get more computing for their power budget. “It’s just like any constrained commodity,” McNerney said.
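The power-budget arithmetic behind that claim is straightforward. Using assumed power usage effectiveness (PUE) values – not figures from Supermicro – a facility with a fixed power budget can devote more of it to compute when cooling overhead drops:

```python
# Illustrative arithmetic: lower cooling overhead (PUE) leaves more of a
# fixed facility power budget for the IT load. PUE values and the budget
# are assumptions, not data from Supermicro or any specific facility.

def it_power_kw(facility_budget_kw: float, pue: float) -> float:
    """Compute power available under a facility power budget at a given PUE."""
    return facility_budget_kw / pue

budget_kw = 2000.0               # fixed facility power budget (assumed)
air_pue, liquid_pue = 1.5, 1.1   # rough air-cooled vs. liquid-cooled PUE (assumed)

air = it_power_kw(budget_kw, air_pue)
liquid = it_power_kw(budget_kw, liquid_pue)
print(f"Air-cooled:    ~{air:.0f} kW available for compute")
print(f"Liquid-cooled: ~{liquid:.0f} kW available for compute "
      f"({(liquid / air - 1) * 100:.0f}% more within the same budget)")
```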

The fierce optimization everyone is pursuing to drive efficiency in data centers is motivated by the desire to deliver a more competitive product in whatever industry segment they operate in. McNerney said that arguably just about any company running a data center with web services is, in a sense, running its own supercomputer to improve its business, differentiate itself, and be more competitive.

While it makes sense to use public cloud and managed service providers for non-strategic workloads, “outsourcing your core competency and your differentiation doesn't typically end very well,” McNerney noted. As supercomputers become more accessible and affordable, they’re also becoming a requirement. He said Supermicro’s customers are looking at how technology that used to be the domain of advanced government labs and organizations with deep pockets can be applied to their businesses to build competitive advantage.

 “If you don't figure it out, someone else probably will, and that could be problematic.”

RELATED: Nvidia builds UK’s fastest supercomputer in ‘record time’