Meta, with metaverse in mind, unveils AI Research SuperCluster

Meta has teamed with Nvidia, Pure Storage and Penguin Computing to build an AI Research SuperCluster (RSC), which Meta claims will be the world's fastest AI supercomputer when it is completed in mid-2022 and will help pave the way toward building the metaverse.

Mark Zuckerberg, CEO of Meta Platforms (formerly Facebook), said in a statement emailed to Fierce Electronics, "Meta has developed what we believe is the world's fastest AI supercomputer. We're calling it RSC for AI Research SuperCluster and it'll be complete later this year. The experiences we're building for the metaverse require enormous compute power (quintillions of operations / second!) and RSC will enable new AI models that can learn from trillions of examples, understand hundreds of languages, and more."

The RSC, which is already operational, succeeds a first-generation infrastructure that was built in 2017 and uses 22,000 Nvidia V100 Tensor Core GPUs in a single cluster, according to a Meta blog post. Meta officials said in the blog post that they wanted to design a new cluster that takes advantage of the most recent GPU advances.

The result is a new cluster that so far comprises “a total of 760 NVIDIA DGX A100 systems as its compute nodes, for a total of 6,080 GPUs — with each A100 GPU being more powerful than the V100 used in our previous system,” the post stated. “The GPUs communicate via an NVIDIA Quantum 200 Gb/s InfiniBand two-level Clos fabric that has no oversubscription. RSC’s storage tier has 175 petabytes of Pure Storage FlashArray, 46 petabytes of cache storage in Penguin Computing Altus systems, and 10 petabytes of Pure Storage FlashBlade.”
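A quick back-of-the-envelope tally, using only the figures quoted above, shows how those numbers fit together (the eight-GPUs-per-node split is implied by 6,080 divided by 760, not stated directly in the post):

```python
# Tally of the RSC phase-one figures quoted above. All numbers come from the
# Meta blog post; the per-node GPU count is derived, not separately stated.
dgx_nodes = 760
gpus_total = 6_080
gpus_per_node = gpus_total // dgx_nodes          # 8 A100s per DGX A100 system

storage_pb = {
    "Pure Storage FlashArray (bulk)": 175,
    "Penguin Computing Altus (cache)": 46,
    "Pure Storage FlashBlade": 10,
}

print(f"{gpus_per_node} GPUs per node x {dgx_nodes} nodes = {gpus_total} GPUs")
print(f"Storage tier total: {sum(storage_pb.values())} PB")   # 231 PB in phase one
```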

Meta researchers already are using the RSC to train large models in natural language processing (NLP) and computer vision for research, and the ultimate goal for the new system is to enable Meta to eventually train models with trillions of parameters. Nvidia has pointed to the same massive goal, a future need for organizations to train AI models with "trillions of parameters," in announcements over the past year tied to the evolution of its AI processors.
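Neither Meta nor Nvidia has published the training code involved, but the basic pattern a cluster like RSC is built to run is data-parallel training across many GPUs, synchronized over NCCL. The sketch below illustrates that pattern with PyTorch's DistributedDataParallel; the model, data and hyperparameters are placeholders for illustration only, not anything Meta has disclosed.

```python
# Minimal sketch of multi-GPU data-parallel training over NCCL.
# Everything here (model, data, sizes) is a stand-in; it is not Meta's code.
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

def main():
    # torchrun sets RANK / LOCAL_RANK / WORLD_SIZE for each GPU process.
    dist.init_process_group(backend="nccl")
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)

    model = torch.nn.Linear(1024, 1024).cuda(local_rank)   # stand-in model
    model = DDP(model, device_ids=[local_rank])            # gradients sync via NCCL all-reduce
    optim = torch.optim.AdamW(model.parameters(), lr=1e-4)

    for step in range(100):                                 # stand-in training loop
        x = torch.randn(32, 1024, device=local_rank)
        loss = model(x).pow(2).mean()
        loss.backward()
        optim.step()
        optim.zero_grad()

    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```

Launched with something like `torchrun --nproc_per_node=8 train.py` on each node, this is the workload shape that the InfiniBand fabric and NCCL accelerate; models in the hundreds of billions or trillions of parameters additionally shard the model itself across GPUs (tensor or pipeline parallelism), which this sketch omits.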

Meta’s first-generation infrastructure performs 35,000 training jobs a day. Of the new RSC, company officials said, “Early benchmarks on RSC, compared with Meta’s legacy production and research infrastructure, have shown that it runs computer vision workflows up to 20 times faster, runs the Nvidia Collective Communication Library (NCCL) more than nine times faster, and trains large-scale NLP models three times faster. That means a model with tens of billions of parameters can finish training in three weeks, compared with nine weeks before.”
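The NCCL figure refers to the collective operations, chiefly all-reduce, that synchronize gradients across GPUs, and that is something a cluster operator can measure directly. A rough, hypothetical micro-benchmark of that operation might look like the following (buffer size and iteration count are arbitrary choices, not figures from Meta):

```python
# Rough micro-benchmark of the NCCL all-reduce that dominates gradient syncing.
# Launch one process per GPU with torchrun so init_process_group() finds its rank.
import time
import torch
import torch.distributed as dist

dist.init_process_group(backend="nccl")
device = dist.get_rank() % torch.cuda.device_count()
torch.cuda.set_device(device)

buf = torch.randn(256 * 1024 * 1024 // 4, device=device)  # 256 MB of float32

torch.cuda.synchronize()
start = time.perf_counter()
for _ in range(50):
    dist.all_reduce(buf)
torch.cuda.synchronize()
elapsed = time.perf_counter() - start

if dist.get_rank() == 0:
    gb_moved = 50 * buf.numel() * 4 / 1e9
    print(f"~{gb_moved / elapsed:.1f} GB/s effective all-reduce throughput")

dist.destroy_process_group()
```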

Nvidia said the Meta deployment is its largest DGX A100 deployment to date, and while not every company will need 760 of them, the semiconductor firm expects more deployments of everything from single A100s to DGX SuperPODs as more companies across a variety of industries embrace AI use cases. 

Charlie Boyle, vice president and general manager of DGX systems at Nvidia, told Fierce Electronics via email, “Nvidia DGX systems make it possible for every organization to use the same technology foundation as Meta for their own AI initiatives… Digital twins and simulation are driving adoption of AI across many industries. Whether delivering better customer experiences with NLP-powered services, performing cutting-edge medical research, optimizing supply chains, or extracting intelligence from mountains of data, researchers using AI clusters like Meta’s Nvidia-powered system will continue to raise the bar for what’s possible as they solve the world’s greatest challenges.”

When it’s completed later this year, the RSC’s InfiniBand network fabric will grow from 6,080 to 16,000 GPU endpoints, making it one of the largest such networks deployed to date. Meta also plans to upgrade the caching and storage system, which can serve 16 TB/s of training data, scaling it up to 1 exabyte, the equivalent of 36,000 years of high-quality video, according to Meta.
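Those storage figures are easier to grasp with a little arithmetic. The sketch below checks the "36,000 years of video" equivalence under an assumed bitrate of roughly 7 Mbit/s for high-quality video (the article does not specify one) and notes how long the quoted 16 TB/s would take to read a full exabyte:

```python
# Sanity-check "1 exabyte = ~36,000 years of video". The ~7 Mbit/s bitrate is
# an assumption for "high-quality video"; Meta's post does not give a bitrate.
EXABYTE_BYTES = 1e18
VIDEO_BITRATE_BPS = 7e6                      # assumed bitrate, bits per second
SECONDS_PER_YEAR = 365.25 * 24 * 3600

video_seconds = EXABYTE_BYTES * 8 / VIDEO_BITRATE_BPS
print(f"~{video_seconds / SECONDS_PER_YEAR:,.0f} years of video")   # roughly 36,000

# At the quoted 16 TB/s, reading the full exabyte would take under a day.
print(f"~{EXABYTE_BYTES / 16e12 / 3600:.1f} hours to read 1 EB at 16 TB/s")
```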

“We expect such a step function change in compute capability to enable us not only to create more accurate AI models for our existing services, but also to enable completely new user experiences, especially in the metaverse,” the Meta blog post stated. “Our long-term investments in self-supervised learning and in building next-generation AI infrastructure with RSC are helping us create the foundational technologies that will power the metaverse and advance the broader AI community as well.”

RELATED: Nvidia's Huang on enterprise AI, getting meta and buying Arm