AI Revolutionizes Data Centers Amid Optical Tech Advances

The recent Optical Fiber Communication (OFC) Conference in Los Angeles highlighted the growing role of Artificial Intelligence (AI) in transforming data center infrastructure. At the heart of this transformation is the challenge of adapting networks to AI’s demands, a point made by Kannan Raj, AI infrastructure architect at Oracle, who noted that traditional link specifications are no longer sufficient. “Back when the IEEE specs were formed and written off, they said, we need links to have 2.4e-4 [pre-FEC] error. That by no means is acceptable today,” said Raj during a panel discussion.
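To put the quoted figure in perspective, here is a rough sketch that treats 2.4e-4 as a raw bit error ratio, measured before any error correction. The 800 Gb/s line rate and the function name are illustrative assumptions, not figures from the panel.

```python
QUOTED_ERROR_RATIO = 2.4e-4     # figure quoted by Raj
ASSUMED_LINE_RATE_BPS = 800e9   # assumption: a single 800 Gb/s link


def raw_bit_errors_per_second(error_ratio: float, line_rate_bps: float) -> float:
    """Expected number of errored bits per second on one link."""
    return error_ratio * line_rate_bps


errors = raw_bit_errors_per_second(QUOTED_ERROR_RATIO, ASSUMED_LINE_RATE_BPS)
print(f"{errors:.3g} raw bit errors per second on one link")
```

Error correction is expected to remove most of these raw errors, so the number is best read as a measure of how much work each link leaves for the correction layer rather than as the error rate an application would see.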

The crux of the issue lies in AI’s rapid scale-up, scale-out, and scale-across requirements. These dimensions redefine connectivity needs, compelling the industry to rethink network architecture. The massive interconnected network, consisting of large numbers of components, presents a unique set of challenges. According to Raj, “We are dealing with millions of links, millions of units, components, hardware. I call it the tyranny of large numbers.”
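The "tyranny of large numbers" can be illustrated with a short calculation. This is a minimal sketch that assumes links fail independently and with the same probability over some fixed window; the per-link probability of one in a million is a made-up value for illustration, not a figure from the panel.

```python
import math


def prob_any_failure(per_link_failure_prob: float, num_links: int) -> float:
    """Probability that at least one of num_links independent links fails.

    Computes 1 - (1 - p)^N using log1p/expm1, which stays accurate
    when p is tiny and N is very large.
    """
    return -math.expm1(num_links * math.log1p(-per_link_failure_prob))


ASSUMED_PER_LINK_PROB = 1e-6  # hypothetical per-link failure probability

for num_links in (1_000, 100_000, 1_000_000):
    p_any = prob_any_failure(ASSUMED_PER_LINK_PROB, num_links)
    print(f"{num_links:>9,} links: P(at least one failure) = {p_any:.4f}")
```

Under these assumptions, a failure that is negligible for any single link becomes more likely than not once the link count reaches the millions, which is why resilience has to be designed into the fabric rather than left to individual components.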

Data centers must remain resilient despite these challenges, because a failure can disrupt an AI training run and cause delays and inefficiencies. The spread of AI resembles earlier technological shifts, such as the rise of cloud services, but it demands more fundamental changes to the network.

The panel explored three connectivity types: scale-up, scale-out, and scale-across. Scale-up links more devices within a cluster and offers benefits such as ultra-low latency. Scale-out extends connectivity across multiple racks to overcome the physical limits of individual servers, at the cost of greater dependence on the network’s stability. Scale-across goes further still, connecting thousands of GPUs across data centers worldwide to create expansive “AI factories.”

Raj described how these architectures support AI advancements: “Scale-up is highly localized… It’s synchronous, message passing type, low latency…Scale-out is basically within the pod. So it’s suitable for running inference. Scale-across can vary [depending] on who you talk to. It can be 10 kilometers to thousands of kilometers.”
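The distance range Raj gives for scale-across translates directly into propagation delay. The sketch below estimates one-way latency over fiber, assuming a group index of about 1.468, a typical value for standard single-mode fiber; that constant and the function name are assumptions for illustration, and the estimate ignores switching, queuing, and transceiver delays.

```python
SPEED_OF_LIGHT_KM_PER_S = 299_792.458  # speed of light in vacuum
ASSUMED_FIBER_GROUP_INDEX = 1.468      # assumption: standard single-mode fiber


def one_way_latency_ms(distance_km: float,
                       group_index: float = ASSUMED_FIBER_GROUP_INDEX) -> float:
    """Propagation delay in milliseconds for light travelling distance_km of fiber."""
    seconds = distance_km * group_index / SPEED_OF_LIGHT_KM_PER_S
    return seconds * 1_000


for distance_km in (10, 100, 1_000):
    latency = one_way_latency_ms(distance_km)
    print(f"{distance_km:>5} km: about {latency:.3f} ms one way")
```

At roughly 5 microseconds per kilometer, a 10 km link adds tens of microseconds while a 1,000 km link adds several milliseconds, which helps explain why the tightly synchronized traffic Raj describes for scale-up stays highly localized.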

Optical technologies, such as coherent optics and co-packaged optics, are pivotal in achieving the necessary high-density, low-latency connections. Advances in optical transport at 400G, 800G, and beyond enable greater spectral efficiency and extended reach. Raj explained, “Scale-up today is mostly…copper and it is getting to a point where there will be some hybrid optical-copper essentially, and then transitioning to optical also.”

Despite these innovations, challenges persist. The scale-up, scale-out, and scale-across domains overlap, which makes resilience a crucial consideration. Furthermore, multi-planar network fabrics, which comprise multiple interconnected systems, are essential for large AI clusters because they distribute workloads across a broader network.

In conclusion, the telecommunications industry stands at a threshold. It must adapt and expand network infrastructure to meet the demands of massive AI workloads. Raj underscored how little tolerance those workloads have for network problems: “RDMA [Remote Direct Memory Access] unlike TCP is very very unforgiving when it comes to the performance of the network.” As AI continues to shape network strategies, optical technology and new connectivity approaches will be key pillars of future readiness.
