AI boosts optical connectivity demand

Cloud computing raised expectations for data speeds, but artificial intelligence (AI) workloads are putting even more pressure on bandwidth, demanding that data move faster and more reliably.

While protocols like Compute Express Link (CXL) help optimize where data is stored so it sits closer to where it is needed, connectivity remains crucial to moving it as fast as possible. After a dip in adoption, optical transceiver technology is seeing an uptick as companies like Amazon and Google scale AI in the data center, while connectivity is getting baked into full-stack systems along with hardware and software.

In its July 2023 Mega Datacenter Optics report, optical communications market research firm LightCounting said increased optical transceiver sales correlate with significant spikes in sales of GPUs and GPU-based systems for AI clusters. The firm forecasts that sales of Ethernet optical transceivers for AI cluster applications will add up to $17.6 billion over the next five years, while all other Ethernet transceiver applications combined will generate $28.5 billion over the same period.
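
To put those figures in proportion, a quick back-of-the-envelope calculation using only the numbers above puts AI clusters at roughly 38% of the combined Ethernet transceiver forecast:

```python
# Back-of-the-envelope share calculation using LightCounting's figures above.
ai_clusters = 17.6  # $B, Ethernet optical transceivers for AI clusters, 5-year forecast
all_other = 28.5    # $B, all other Ethernet transceiver applications, same period

total = ai_clusters + all_other
print(f"Combined 5-year Ethernet transceiver forecast: ${total:.1f}B")
print(f"AI cluster share: {ai_clusters / total:.0%}")  # ~38%
```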

Deployment of optics to support AI clusters is offsetting spending cuts made by cloud computing companies in anticipation of a recession that has yet to materialize.

The demand for AI connectivity predates the recent ChatGPT hype by a few years: Google was already deploying more optics in its AI clusters than in the rest of its data center infrastructure back in 2019-2020. LightCounting estimates optical transceivers deployed in AI clusters already accounted for 25% of the total market in 2022.

In an interview with Fierce Electronics, LightCounting CEO Vlad Kozlov said there was a slowdown in demand by the end of 2022, which meant the first quarter of this year saw a downturn in optical transceiver sales. By April, he said, it became clear that Nvidia was doing brisk business in AI, which bumped up LightCounting’s forecast. “It also alerted all the competitors who are building an AI infrastructure.”

Broader interest in AI drives need for streamlined infrastructure deployment

Nvidia’s key advantage in the AI infrastructure market is that it has developed a full-stack system that includes optical connectivity, hardware, and software, Kozlov said. “If you look at what Google and Amazon are doing, they are developing AI hardware and software internally.”

These companies and others like Microsoft have been making money from AI applications for a while now, he said, but along came ChatGPT, which created the perception that AI could be used more widely across different industries to improve worker productivity. “In addition to very large companies, many other companies started paying attention to AI.”

A full-stack solution is appealing because many companies don’t have the expertise to build software and hardware themselves, Kozlov said, so turning to Nvidia makes it easier to start building out AI infrastructure.

When it comes to connectivity, optical transceivers have the advantage of supporting higher data rates over longer distances. While copper has been more resilient than expected, its reach shrinks at higher data rates. “At the same time, the AI clusters are getting larger.” Kozlov said Google is talking about having tens of thousands of GPUs in its arrays. “When the systems get that large, obviously the distances get larger too. You need to use more optics.”
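
The copper-versus-optics tradeoff Kozlov describes can be sketched as a simple reach check. The reach figures below are rough illustrative assumptions, not vendor specs; the point is that as data rates climb, passive copper covers less of the distances a large cluster spans:

```python
# Illustrative sketch: pick a medium per link based on required distance.
# Reach values are rough assumptions for passive copper cabling, not vendor specs.
COPPER_REACH_M = {100: 5.0, 400: 3.0, 800: 2.0}  # data rate (Gbps) -> approx reach (m)

def pick_medium(distance_m: float, rate_gbps: int) -> str:
    """Return 'copper' if a passive cable plausibly covers the span, else 'optics'."""
    return "copper" if distance_m <= COPPER_REACH_M[rate_gbps] else "optics"

# In-rack links can stay copper; row- and hall-scale links need optics.
for distance in (1.5, 10.0, 80.0):
    print(f"{distance:5.1f} m @ 400G -> {pick_medium(distance, 400)}")
```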

In addition to transporting signals at high speeds over longer distances, Kozlov said, the appeal of optical switches is that they make it easier to reconfigure connectivity. That flexibility is useful when GPUs sit in different parts of the data center, and it contributes to scalability and reliability because problematic nodes can be quickly bypassed to keep a model running.
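
The reconfigurability Kozlov points to is, at bottom, a routing problem: when a node misbehaves, traffic is steered around it. A minimal sketch, using a hypothetical topology and a breadth-first search for the detour:

```python
from collections import deque

# Hypothetical fabric: GPUs and optical switches as nodes, links as edges.
links = {
    "gpu-a": ["sw-1"],
    "gpu-b": ["sw-2", "sw-3"],
    "sw-1": ["gpu-a", "sw-2", "sw-3"],
    "sw-2": ["gpu-b", "sw-1"],
    "sw-3": ["gpu-b", "sw-1"],
}

def route(src, dst, failed=frozenset()):
    """Breadth-first search for a path that bypasses any failed nodes."""
    queue, seen = deque([[src]]), {src}
    while queue:
        path = queue.popleft()
        if path[-1] == dst:
            return path
        for nxt in links[path[-1]]:
            if nxt not in seen and nxt not in failed:
                seen.add(nxt)
                queue.append(path + [nxt])
    return None  # no path survives the failures

print(route("gpu-a", "gpu-b"))                   # ['gpu-a', 'sw-1', 'sw-2', 'gpu-b']
print(route("gpu-a", "gpu-b", failed={"sw-2"}))  # detours via sw-3, model keeps running
```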

Optical transceivers are also protocol agnostic, which means they work equally well with Ethernet and InfiniBand. And because the module’s interface to the host equipment is electrical, no fiber connectors are needed on the server or switch itself.

Kozlov said a notable trend is the move from optical cables to pluggable optics. “It's just a small device that you plug in to a server.”

Nvidia is one of the drivers of the latest optics boom, he said, as it moves to pluggable transceivers due to the high density of connections in its solutions. The LightCounting report said Nvidia’s earnings call in April 2023 provided a hint as to how much funding overall is going into AI infrastructure, even though the actual number is unknown. Nvidia’s Q2 2023 revenue was forecast to grow 50% sequentially, driven largely by sales of GPUs and GPU-based systems for AI clusters, which has led some financial analysts to estimate cumulative transceiver revenues for the next five years at $100 billion or more for AI applications alone.

Workloads define data center design

Gilad Shainer, senior vice president of networking at Nvidia, told Fierce Electronics in an interview that the entire data center needs to be designed for purpose – that includes AI workloads and the connectivity they require. “Everything has to work in a balanced way.”

AI workloads are distributed across a data center among different connected devices, he said. “The element that defines what the data center can do is the network. The way that you connect everything together defines what kind of workloads you will be able to run.”

Once those elements are in place, the necessary chips and ASICs that go into the data center can be created, Shainer said.

InfiniBand plays a key role in Nvidia’s solutions for AI data centers. The company’s Quantum InfiniBand in-network computing platform is designed for high-performance computing (HPC), AI, and hyperscale cloud infrastructures. But Nvidia also has NVLink, a wire-based, serial, multi-lane, short-range communications link. NVLink uses a proprietary high-speed signaling interconnect (NVHS), and unlike PCI Express, a device can use multiple NVLink connections, with devices communicating over a mesh network instead of through a central hub.
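
That architectural contrast can be illustrated with a toy bandwidth model. The figures below are illustrative assumptions rather than PCIe or NVLink specs; the point is that multiple point-to-point links per device aggregate bandwidth, while a shared central hub caps it:

```python
# Toy model contrasting a shared central hub with a per-device mesh.
# Bandwidth figures are illustrative assumptions, not PCIe/NVLink specs.
N_GPUS = 4
HUB_BW = 64    # GB/s, total through a single shared hub everyone contends for
LINK_BW = 50   # GB/s per point-to-point link

links_per_gpu = N_GPUS - 1                         # full mesh: one link to each peer
mesh_total = N_GPUS * links_per_gpu * LINK_BW / 2  # each link counted once

print(f"Shared-hub aggregate: {HUB_BW} GB/s")          # fixed, regardless of device count
print(f"Full-mesh aggregate:  {mesh_total:.0f} GB/s")  # 4 * 3 * 50 / 2 = 300 GB/s
```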

Shainer said NVLink comes into play for sharing memory access, and along with InfiniBand it allows a full platform to be scaled out.

The purpose of the platform itself matters, he added. “What kind of workload do you want to run? What kind of things do you want to solve?” Once you determine what the data center is meant to do, you can fit the appropriate networking inside it, including InfiniBand and NVLink, Shainer said. The more difficult part is attaining the necessary data throughput, which means driving a lot of bandwidth quickly. “The reason that you need to do it quickly is because you are dealing with distributed computing.”

What matters is what the network is capable of under full load at scale, Shainer said, with the slowest link determining the performance of the entire system. Data centers built for traditional cloud computing were not designed to deliver the performance required for AI workloads, which can require thousands of GPUs to work at extremely low latency. “That's a very complicated process that you need to synchronize,” he said.
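
Shainer’s slowest-link point reduces to a one-line formula: in a synchronized collective, effective throughput is the minimum over every hop in the path. A minimal sketch with made-up link speeds:

```python
# Minimal sketch of "slowest link wins" in a synchronized collective.
# Link bandwidths (GB/s) are made-up values for illustration.
link_bw = {"gpu0->leaf": 50, "leaf->spine": 50, "spine->leaf2": 12.5, "leaf2->gpu7": 50}

effective_bw = min(link_bw.values())  # one degraded hop caps the whole path
print(f"Effective path bandwidth: {effective_bw} GB/s")

payload_gb = 1.0  # e.g., a gradient exchange between two workers
print(f"Transfer time: {payload_gb / effective_bw * 1e3:.0f} ms")  # 80 ms, set by the slow hop
```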

Steve Carlini, VP of innovation and data centers at Schneider Electric, said AI workloads have turned IT and physical infrastructure on their heads, and not just from a networking perspective. In an interview with Fierce Electronics, he said there has been a mad scramble in recent years to build out AI-capable infrastructure at companies of all sizes. “It's not deploying standard socket x86 servers anymore.”

AI data centers have CPUs, GPUs, accelerators, and DPUs that need an architecture to move data in and out, Carlini said. “Every GPU accelerator has a network port, and they are all operating in sync.”

Another significant difference that separates AI data centers from traditional ones is that they are running workloads all the time, Carlini said, and that changes design parameters. “It's an incredible amount of heat.”

Pulling together 10,000 GPUs and putting them in racks of servers runs into physical limits, not the least of which is heat, Carlini said. Aside from the various cooling options, the solution is to space everything apart. “The barrier to that is with the networking.” Running 400 gig InfiniBand is not cheap, even though optical transceiver costs have dropped dramatically and modules can cover longer distances, he said. “The big hesitancy is actually the cost of the networking, because each GPU has its own connection into the network.”
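
Rough arithmetic shows why that cost concern scales so quickly. All figures below are illustrative assumptions (module prices vary widely); only the one-port-per-GPU premise comes from Carlini:

```python
# Rough, illustrative cost sketch -- prices are assumptions, not quotes.
n_gpus = 10_000
ports_per_gpu = 1        # per Carlini: each GPU has its own network connection
modules_per_link = 2     # one pluggable transceiver at each end of a link
cost_per_module = 800    # $, illustrative price for a 400G optical module

optics_cost = n_gpus * ports_per_gpu * modules_per_link * cost_per_module
print(f"Transceivers alone: ${optics_cost / 1e6:.0f}M")  # $16M before switches and fiber
```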

Slower connectivity has its place

Not every connection needs to be a fast optical one like InfiniBand. Carlini said there are innovations based on copper, like Broadcom’s Jericho 3, which is designed for AI clusters and can run as fast as 800 gigs. “That seems to be kind of the solution that a lot of people are waiting for.”

Carlini said fast, scalable connectivity with the reliability and lower cost of copper is the Holy Grail, but any data center is going to have a hierarchy of solutions that includes optical, copper, and wireless.

Data headed for ingestion could be shipped to the data center over copper, while WiFi could sufficiently serve management functions, Carlini said. Real-time, mission-critical AI would need optical connectivity, especially as models begin to ingest more video and images, not just text, he said.
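
Carlini’s hierarchy can be written down as a simple policy table. The workload-to-medium mapping restates his examples; the structure itself is just a hypothetical illustration:

```python
# Policy table restating Carlini's tiering; the mapping is his example,
# the structure a hypothetical illustration.
CONNECTIVITY_TIERS = {
    "bulk data ingest":     "copper",   # cheap and reliable, latency-tolerant
    "management functions": "wifi",     # modest bandwidth is sufficient
    "real-time AI":         "optical",  # video and image models need the headroom
}

for workload, medium in CONNECTIVITY_TIERS.items():
    print(f"{workload:22s} -> {medium}")
```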

LightCounting’s Kozlov said the next frontier for optical connectivity moves beyond pluggables: optics will be co-packaged with GPUs and ASICs. “You don't have to worry about plugging in a transceiver. There will be optics coming out of electronic chips.”

He said the fundamental advantage of optical isn’t going away. “We don't see any alternatives.”

However, copper continues to find ways to improve despite its limitations, Kozlov said. “Optics and copper will coexist.”