Meta’s AI Hardware Bonanza: Open Designs, Open Future

Meta’s Open AI Hardware Vision

At the Open Compute Project (OCP) Global Summit 2024, Meta is showcasing its latest open AI hardware designs in collaboration with the OCP community. These innovations include a new AI platform, cutting-edge open rack designs, and advanced network fabrics and components. By sharing these designs, Meta hopes to foster collaboration and innovation, inviting anyone passionate about building the future of AI to engage with and help shape the next generation of open hardware.

Catalina: Open Architecture for AI Infrastructure

Meta announced the upcoming release of Catalina, a high-powered rack designed for AI workloads, to the OCP community. Catalina is a full rack-scale solution based on the NVIDIA Blackwell platform, with a focus on modularity and flexibility. It supports the latest NVIDIA GB200 Grace Blackwell Superchip, meeting the growing demands of modern AI infrastructure.

Catalina is built on the Orv3, a high-power rack (HPR) capable of supporting up to 140kW, addressing the growing power demands of GPUs. The full solution is liquid-cooled and consists of several components: a compute tray, switch tray, fabric switch, management switch, battery backup unit, and a rack management controller.
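To make the 140kW envelope concrete, here is a minimal sketch of a rack power-budget check. The component names and wattages are purely illustrative assumptions, not Meta's actual figures:

```python
# Hypothetical sketch: checking component power draws against the
# 140 kW envelope of an Orv3-style high-power rack (HPR).
# All component names and wattages below are illustrative.

RACK_BUDGET_KW = 140.0

def rack_power_kw(components: dict[str, float]) -> float:
    """Sum per-component power draw (in kW) for a rack layout."""
    return sum(components.values())

def fits_budget(components: dict[str, float], budget_kw: float = RACK_BUDGET_KW) -> bool:
    """Return True if the layout fits within the rack's power envelope."""
    return rack_power_kw(components) <= budget_kw

layout = {
    "compute_trays": 120.0,           # illustrative GPU compute draw
    "switch_trays": 6.0,
    "fabric_switch": 4.0,
    "management_switch": 0.5,
    "rack_management_controller": 0.2,
}

print(f"total: {rack_power_kw(layout):.1f} kW, fits: {fits_budget(layout)}")
```

A check like this is the kind of planning a 140kW HPR forces: at these power densities, a handful of compute trays consumes most of the budget, which is also why the full solution is liquid-cooled.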

The Grand Teton Platform Supports AMD Accelerators

Meta’s next-generation AI platform, Grand Teton, has been expanded to support the AMD Instinct MI300X, in addition to its existing support for other accelerator designs. Grand Teton features a single monolithic system design with integrated power, control, compute, and fabric interfaces, simplifying system deployment and enabling rapid scaling for large-scale AI inference workloads.

With increased compute capacity, expanded memory, and higher network bandwidth, Grand Teton allows faster convergence on larger models and supports scaling up training cluster sizes efficiently.

Open Disaggregated Scheduled Fabric

Meta introduced the Disaggregated Scheduled Fabric (DSF) for its next-generation AI clusters, offering several advantages over its existing switch fabrics. DSF is powered by the open OCP-SAI standard and FBOSS, Meta's own network operating system. It supports an open, standard Ethernet-based RoCE interface across GPUs and NICs from multiple vendors, including NVIDIA, Broadcom, and AMD.
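The value of pairing an abstraction standard like OCP-SAI with a network OS like FBOSS is vendor independence: the NOS programs switches through one common interface regardless of the ASIC underneath. The sketch below illustrates that idea only; it is a hypothetical Python analogy, not the real SAI C API, and the backend classes and port strings are invented for illustration:

```python
# Hypothetical sketch of the idea behind a switch-abstraction layer:
# one common interface, multiple vendor backends. This mirrors the
# spirit of OCP-SAI, not its actual C API.

from abc import ABC, abstractmethod

class SwitchBackend(ABC):
    """Common interface the network OS programs against."""
    @abstractmethod
    def create_port(self, speed_gbps: int) -> str: ...

class BroadcomBackend(SwitchBackend):
    def create_port(self, speed_gbps: int) -> str:
        return f"bcm-port@{speed_gbps}G"      # illustrative identifier

class CiscoBackend(SwitchBackend):
    def create_port(self, speed_gbps: int) -> str:
        return f"cisco-port@{speed_gbps}G"    # illustrative identifier

def bring_up_fabric(backend: SwitchBackend, n_ports: int, speed_gbps: int) -> list[str]:
    # The NOS (FBOSS, in Meta's case) configures ports through the
    # common interface, independent of which ASIC vendor sits underneath.
    return [backend.create_port(speed_gbps) for _ in range(n_ports)]

print(bring_up_fabric(BroadcomBackend(), 2, 400))
```

Swapping `BroadcomBackend` for `CiscoBackend` changes nothing above the abstraction boundary, which is precisely what lets a fabric mix ASICs from different vendors under one operating system.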

In addition to DSF, Meta also developed new 51T fabric switches based on Broadcom and Cisco ASICs, and shared FBNIC, a new NIC module containing Meta's first in-house network ASIC design.

Meta and Microsoft: Driving Open Innovation Together

Meta and Microsoft have a long-standing partnership within OCP, collaborating on initiatives like the Switch Abstraction Interface (SAI), Open Accelerator Module (OAM) standard, and SSD standardization. Their current collaboration focuses on Mount Diablo, a new disaggregated power rack featuring a scalable 400 VDC unit, enhancing efficiency and scalability for AI accelerators.

The Open Future of AI Infrastructure

Meta is committed to open source AI, believing that open software frameworks, standardized models, and open hardware systems are crucial for realizing AI’s full potential. By addressing AI’s infrastructure needs together through the OCP community, Meta aims to unlock the true promise of open AI for everyone.