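Introduction

I'm setting aside what I had planned to post today. Meta has published a blog post about the training infrastructure for Llama 3, so I'm recording the key points here for reference.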

engineering.fb.com

Llama 3 is trained on 24,576 H100s

The blog post above says:

To lead in developing AI means leading investments in hardware infrastructure. Hardware infrastructure plays an important role in AI’s future. Today, we’re sharing details on two versions of our 24,576-GPU data center scale cluster at Meta. These clusters support our current and next generation AI models, including Llama 3, the successor to Llama 2, our publicly released LLM, as well as AI research and development across GenAI and other areas.

The post continues:

Marking a major investment in Meta’s AI future, we are announcing two 24k GPU clusters. We are sharing details on the hardware, network, storage, design, performance, and software that help us extract high throughput and reliability for various AI workloads. We use this cluster design for Llama 3 training.

So there are two clusters of this size.

Node-to-node links are 400 Gbps

RoCE cluster: Arista 7800 with Wedge400 and Minipack2 OCP rack switches
InfiniBand cluster: NVIDIA Quantum2 InfiniBand fabric

Llama 3 is trained on the RoCE cluster.

Future H100 count
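To get a feel for the 400 Gbps figure, here is a minimal sketch that converts the link rate to bytes per second and estimates an idealized transfer time. The payload size (a 70B-parameter model at 2 bytes per parameter) is a hypothetical example that does not come from Meta's post, and the calculation ignores protocol overhead and congestion.

```python
LINK_GBPS = 400  # per-link rate quoted in the notes above


def gbps_to_gigabytes_per_sec(gbps: float) -> float:
    """Convert gigabits per second to gigabytes per second (8 bits per byte)."""
    return gbps / 8


def transfer_seconds(payload_gigabytes: float, gbps: float) -> float:
    """Idealized time to move a payload over one link, with zero overhead."""
    return payload_gigabytes / gbps_to_gigabytes_per_sec(gbps)


# Hypothetical payload: 70e9 parameters * 2 bytes each = 140 GB.
PAYLOAD_GB = 70e9 * 2 / 1e9

print(f"{LINK_GBPS} Gbps = {gbps_to_gigabytes_per_sec(LINK_GBPS):.0f} GB/s")
print(f"{PAYLOAD_GB:.0f} GB over one link: {transfer_seconds(PAYLOAD_GB, LINK_GBPS):.1f} s")
```

Real throughput will be lower than this upper bound, so treat the result as a rough order of magnitude only.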

While we’ve had a long history of building AI infrastructure, we first shared details on our AI Research SuperCluster (RSC), featuring 16,000 NVIDIA A100 GPUs, in 2022.

As of 2022: 16K A100s

These two AI training cluster designs are a part of our larger roadmap for the future of AI. By the end of 2024, we’re aiming to continue to grow our infrastructure build-out that will include 350,000 NVIDIA H100s as part of a portfolio that will feature compute power equivalent to nearly 600,000 H100s.

Eventually: 350K H100s, within a portfolio whose compute is equivalent to nearly 600K H100s

Conclusion

Meta trained Llama 3 on 24K H100s
Eventually they will deploy 600K H100s
It was 150K H100s by the end of 2023, so that's 4x?
Most companies can't play at this level
$30K x 600K = $18B
That's on par with NVIDIA's Q4 2023 Data Center revenue https://t.co/gmmLCw44ZY
- Vengineer＠ (@Vengineer) March 12, 2024
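The back-of-the-envelope numbers in the tweet can be reproduced with a short sketch. The $30K unit price is the rough assumption used in the tweet, not an official price, and the 150K figure for the end of 2023 is the tweet's own number rather than something stated in Meta's post.

```python
H100_UNIT_PRICE_USD = 30_000         # rough assumption from the tweet
H100_EQUIVALENTS_END_2024 = 600_000  # "nearly 600,000 H100s" of compute
H100_END_2023 = 150_000              # figure cited in the tweet


def fleet_cost_usd(units: int, unit_price_usd: int) -> int:
    """Total cost if every unit were bought at the same price."""
    return units * unit_price_usd


def growth_factor(new_units: int, old_units: int) -> float:
    """How many times larger the new fleet is than the old one."""
    return new_units / old_units


total = fleet_cost_usd(H100_EQUIVALENTS_END_2024, H100_UNIT_PRICE_USD)
growth = growth_factor(H100_EQUIVALENTS_END_2024, H100_END_2023)

print(f"Estimated cost: ${total / 1e9:.0f}B")
print(f"Growth vs. end of 2023: {growth:.0f}x")
```

Note that the 600K figure is H100-equivalent compute, of which 350K are actual H100s, so the $18B estimate is an upper-end simplification.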