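Introduction
I have set aside what I had planned to post today. Meta has published a blog post about the training environment for Llama 3, so I am recording the key points here.
Llama 3 is trained on 24,576 H100s
According to the blog post: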
To lead in developing AI means leading investments in hardware infrastructure. Hardware infrastructure plays an important role in AI’s future. Today, we’re sharing details on two versions of our 24,576-GPU data center scale cluster at Meta. These clusters support our current and next generation AI models, including Llama 3, the successor to Llama 2, our publicly released LLM, as well as AI research and development across GenAI and other areas.
The post continues:
Marking a major investment in Meta’s AI future, we are announcing two 24k GPU clusters. We are sharing details on the hardware, network, storage, design, performance, and software that help us extract high throughput and reliability for various AI workloads. We use this cluster design for Llama 3 training.
So there are two of these clusters.
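As a rough sanity check on the scale, the two cluster figures quoted above can be combined as follows. This is only a sketch: the 8-GPUs-per-node value is an assumption that does not appear in the quotes, and the function name `cluster_totals` is made up for illustration.

```python
GPUS_PER_CLUSTER = 24_576  # from the quoted Meta blog post
NUM_CLUSTERS = 2           # "two 24k GPU clusters"
GPUS_PER_NODE = 8          # ASSUMPTION: typical 8-GPU H100 server, not stated in the quotes


def cluster_totals(gpus_per_cluster: int, num_clusters: int, gpus_per_node: int):
    """Return (total GPUs across all clusters, nodes per cluster)."""
    total_gpus = gpus_per_cluster * num_clusters
    nodes_per_cluster, remainder = divmod(gpus_per_cluster, gpus_per_node)
    if remainder:
        raise ValueError("GPU count is not a multiple of GPUs per node")
    return total_gpus, nodes_per_cluster


total_gpus, nodes_per_cluster = cluster_totals(
    GPUS_PER_CLUSTER, NUM_CLUSTERS, GPUS_PER_NODE
)
print(f"Total GPUs across both clusters: {total_gpus:,}")
print(f"Nodes per cluster (assuming {GPUS_PER_NODE} GPUs/node): {nodes_per_cluster:,}")
```

If the real node size differs, only `GPUS_PER_NODE` needs to change.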
Node-to-node links: 400 Gbps
- Arista 7800 with Wedge400 and Minipack2 OCP rack switches
- NVIDIA Quantum2 InfiniBand fabric
Llama 3 uses RoCE
Future H100 count
While we’ve had a long history of building AI infrastructure, we first shared details on our AI Research SuperCluster (RSC), featuring 16,000 NVIDIA A100 GPUs, in 2022.
As of 2022: A100 x 16K
These two AI training cluster designs are a part of our larger roadmap for the future of AI. By the end of 2024, we’re aiming to continue to grow our infrastructure build-out that will include 350,000 NVIDIA H100s as part of a portfolio that will feature compute power equivalent to nearly 600,000 H100s.
Ultimately: compute equivalent to nearly 600K H100s, of which 350K are actual H100s
Conclusion
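Meta trains Llama 3 on 24K x H100
— Vengineer@ (@Vengineer) March 12, 2024
Ultimately they will deploy the equivalent of 600K H100s
It was 150K x H100 by the end of 2023, so that is 4x?
Most companies cannot touch this
$30K x 600K = $18B
That is equivalent to NVIDIA's Q4 2023 Data Center revenue https://t.co/gmmLCw44ZY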
For most companies, this is simply out of reach now...
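The back-of-the-envelope figures in the tweet can be reproduced with a few lines of Python. This is only a sketch: the $30K unit price is the rough estimate used in the tweet, not a published price, and the function name `fleet_cost_usd` is made up for illustration.

```python
UNIT_PRICE_USD = 30_000   # rough per-H100 price used in the tweet (an estimate)
TARGET_H100_EQUIV = 600_000  # "equivalent to nearly 600,000 H100s"
H100_END_OF_2023 = 150_000   # figure cited in the tweet for end of 2023


def fleet_cost_usd(unit_price_usd: int, gpu_count: int) -> int:
    """Naive hardware cost: unit price times GPU count, ignoring everything else."""
    return unit_price_usd * gpu_count


cost = fleet_cost_usd(UNIT_PRICE_USD, TARGET_H100_EQUIV)
growth = TARGET_H100_EQUIV / H100_END_OF_2023

print(f"Estimated GPU cost: ${cost / 1e9:.0f}B")
print(f"Growth over the end-of-2023 figure: {growth:.0f}x")
```

The estimate covers GPUs only; networking, storage, power, and facilities would come on top of it.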