Vengineer's Daydreams

Daydreaming about life.

Meta's Llama 3 is trained on 2 sets of (H100 x 24,576)

Introduction

I have shelved the post I had planned for today. Meta has published a blog post about the training environment for Llama 3, so I am recording it here for reference.

engineering.fb.com

Llama 3 is trained on 24,576 H100s

The blog post above says:

To lead in developing AI means leading investments in hardware infrastructure. Hardware infrastructure plays an important role in AI’s future. Today, we’re sharing details on two versions of our 24,576-GPU data center scale cluster at Meta. These clusters support our current and next generation AI models, including Llama 3, the successor to Llama 2, our publicly released LLM, as well as AI research and development across GenAI and other areas.

It continues:

Marking a major investment in Meta’s AI future, we are announcing two 24k GPU clusters. We are sharing details on the hardware, network, storage, design, performance, and software that help us extract high throughput and reliability for various AI workloads. We use this cluster design for Llama 3 training.

So there are two sets of 24,576 GPUs.

Node-to-node links are 400 Gbps

  • RoCE fabric: Arista 7800 with Wedge400 and Minipack2 OCP rack switches
  • NVIDIA Quantum2 InfiniBand fabric

Llama 3 is trained on the RoCE cluster

Future H100 count

While we’ve had a long history of building AI infrastructure, we first shared details on our AI Research SuperCluster (RSC), featuring 16,000 NVIDIA A100 GPUs, in 2022.

As of 2022: 16K A100s

These two AI training cluster designs are a part of our larger roadmap for the future of AI. By the end of 2024, we’re aiming to continue to grow our infrastructure build-out that will include 350,000 NVIDIA H100s as part of a portfolio that will feature compute power equivalent to nearly 600,000 H100s.

Ultimately, by the end of 2024: 350K H100s, within a portfolio with compute equivalent to nearly 600K H100s
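
To put the quoted figures side by side, here is a minimal arithmetic sketch. The GPU counts come from the Meta blog quotes above. The 8-GPUs-per-node value is my own assumption (it is not stated in the quotes), and "nearly 600,000" is treated as a round upper bound, so the derived numbers are rough.

```python
# Figures quoted in the Meta engineering blog post.
GPUS_PER_CLUSTER = 24_576       # H100 GPUs in each of the two clusters
NUM_CLUSTERS = 2
PLANNED_H100 = 350_000          # H100s targeted by the end of 2024
H100_EQUIVALENT = 600_000       # "nearly 600,000", treated here as an upper bound

# Assumption (not stated in the quotes): 8 GPUs per server node.
GPUS_PER_NODE = 8


def summarize():
    """Return totals derived from the quoted figures."""
    total_gpus = GPUS_PER_CLUSTER * NUM_CLUSTERS
    return {
        "two_cluster_total": total_gpus,
        "nodes_per_cluster": GPUS_PER_CLUSTER // GPUS_PER_NODE,
        "share_of_planned_h100": total_gpus / PLANNED_H100,
        "non_h100_equivalent": H100_EQUIVALENT - PLANNED_H100,
    }


summary = summarize()
for name, value in summary.items():
    print(f"{name}: {value}")
```

Under these assumptions, the two clusters together hold 49,152 GPUs, which is about 14% of the 350,000 H100s planned for the end of 2024.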

Conclusion

For most companies, this scale is simply out of reach...