Introduction

Posts on X keep saying that Groq's LLM inference is blazingly fast.

If you go to

groq.com

you can run

Mixtral 8x7B-32K
Llama 2 70B-4k

The default appears to be Mixtral 8x7B-32K.

To start with, I asked it a question.

Please explain in about 2,000 characters why Groq's TSP is fast at LLM inference.

458.03 T/s

Answer

Groq's Tensor Streaming Processor (TSP) is designed to accelerate machine learning workloads, including large language models (LLMs) like me, by providing a highly parallel and specialized architecture. Here are some reasons why TSP can be particularly fast for LLM inference:

1. **Massive Parallelism**: TSP contains a vast number of processing elements (PEs) that can execute multiple instructions simultaneously. This high degree of parallelism allows TSP to process large numbers of data points concurrently, significantly speeding up computations required for LLM inference.
2. **SIMD (Single Instruction, Multiple Data) Design**: TSP's PEs are designed to execute the same instruction on multiple data points simultaneously. This SIMD approach enables TSP to perform the same matrix and vector operations required for LLM computations on large groups of data points simultaneously, further increasing throughput.
3. **Optimized for Matrix and Vector Operations**: TSP is specifically designed to handle the matrix and vector operations that are prevalent in LLM computations. This focus on specific operations allows TSP to optimize its architecture for these workloads, reducing the overhead and latency associated with more general-purpose architectures.
4. **Efficient Memory Utilization**: TSP incorporates a high-bandwidth memory hierarchy that includes large on-chip SRAMs and off-chip HBM2 (High Bandwidth Memory) to minimize data movement and reduce the impact of memory-bound operations. This efficient memory utilization reduces the time spent on data transfers and allows TSP to focus on computationally intensive tasks.
5. **Reduced Communication Overhead**: TSP's architecture minimizes the communication overhead between processing elements. By reducing the need for data transfers between PEs, TSP can maintain high levels of parallelism without sacrificing performance due to communication latencies.
6. **Efficient Data Reuse**: TSP's design encourages data reuse, reducing the need to fetch data from memory. This data reuse leads to fewer memory accesses, lower power consumption, and higher overall performance.
7. **Support for Sparsity**: TSP can efficiently handle sparse data, which is common in LLM inference. By taking advantage of sparsity, TSP can skip unnecessary computations, further improving performance and reducing energy consumption.
8. **Easy Programmability**: TSP's programming model is designed to be intuitive and easy to use. This ease of programming allows developers to efficiently utilize TSP's capabilities without requiring extensive low-level optimization, leading to faster development cycles and quicker deployment.

In summary, Groq's TSP offers a highly parallel and specialized architecture that is optimized for the matrix and vector operations prevalent in LLM inference. Its efficient memory utilization, reduced communication overhead, and support for sparsity contribute to its ability to deliver fast and efficient LLM inference.

Groq's chip

I have covered Groq's chip many times on this blog.

vengineer.hatenablog.com

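So why is it so fast? Because the chip has no attached DRAM and uses only its internal SRAM, so it is not affected by DRAM access latency and similar overheads.

However, the SRAM capacity of a single chip is limited, so Groq apparently connects a very large number of chips together to run LLM inference.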

Running inference on Llama 2 70B apparently takes 568 chips.
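To get a feel for why hundreds of chips are needed, here is a rough sketch that divides the size of the model weights by the SRAM available per chip. The 230 MB of SRAM per chip and the bytes-per-parameter values are assumptions for illustration, not figures from this post, and the helper name `chips_for_weights` is hypothetical. The estimate ignores activations, the KV cache, and any replication, so it only indicates the order of magnitude and is not expected to reproduce 568.

```python
import math

def chips_for_weights(params: float, bytes_per_param: float, sram_bytes_per_chip: float) -> int:
    """Minimum number of chips needed just to hold the weights in on-chip SRAM."""
    return math.ceil(params * bytes_per_param / sram_bytes_per_chip)

PARAMS = 70e9          # Llama 2 70B
SRAM_PER_CHIP = 230e6  # ASSUMPTION: 230 MB of SRAM per chip (not stated in this post)

# ASSUMPTION: two common weight precisions, chosen only for illustration.
for label, bytes_per_param in [("FP16", 2), ("INT8", 1)]:
    n = chips_for_weights(PARAMS, bytes_per_param, SRAM_PER_CHIP)
    print(f"{label}: at least {n} chips for the weights alone")
```

Either way the answer lands in the hundreds of chips, which matches the scale of the 568 figure.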

Mouser lists the GroqCard at about $20,000, so 568 of them comes to about $11M.

The listing will disappear sooner or later, so I am keeping a screenshot here.

As I have explained on this blog before, eight of these GroqCards go into one node (GroqNode). The chips are connected to each other with the light-blue cables.

vengineer.hatenablog.com

With 568 chips, that is 568/8 = 71 nodes. At 8 nodes per rack, 71/8 ≈ 8.9, so 9 racks at minimum; call it roughly 10 racks.
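The node and rack count can be checked with a few lines of Python. This is a minimal sketch: the function name `nodes_and_racks` is hypothetical, and 8 cards per node and 8 nodes per rack are the figures assumed in the text. Straight ceiling division gives 9 racks, so the figure of roughly 10 racks is a round-up beyond that.

```python
import math

def nodes_and_racks(chips: int, cards_per_node: int = 8, nodes_per_rack: int = 8) -> tuple[int, int]:
    """Return (nodes, racks) needed to host `chips` cards, rounding up at each level."""
    nodes = math.ceil(chips / cards_per_node)
    racks = math.ceil(nodes / nodes_per_rack)
    return nodes, racks

nodes, racks = nodes_and_racks(568)
print(f"{nodes} nodes, {racks} racks")  # 71 nodes, 9 racks
```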

LPU

Something called the LPU (Language Processing Unit) has suddenly appeared.

vengineer.hatenablog.com

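New systems

The current system is GroqNode-based. The number of LPUs in a Single Core Cluster is 264 (4 racks).

The 2nd Gen System is built from boards that each carry 4 TSPs, 2 FPGAs, and 1 CPU, with ports for Ethernet and for TSP-to-TSP connections, stacked eight boards high. With this, a Single Core Cluster has 4,128 LPUs (33 racks), which can handle models more than 10 times larger than the current system can.

Then there is the 3rd Gen System, built on the new LPU (V2-LPU) currently in development on Samsung 4nm. Here 336x V2-LPU form one unit, and it looks like it can scale to 680,064 LPUs (675 racks).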
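To put the three generations side by side, the sketch below computes how each cluster compares with the current one and how many LPUs fit in a rack. The LPU and rack counts are the ones quoted above; the dictionary labels are just hypothetical names for this example.

```python
# LPU counts and rack counts per Single Core Cluster, as quoted in the text.
systems = {
    "current (GroqNode)": {"lpus": 264, "racks": 4},
    "2nd Gen": {"lpus": 4_128, "racks": 33},
    "3rd Gen (V2-LPU)": {"lpus": 680_064, "racks": 675},
}

baseline = systems["current (GroqNode)"]["lpus"]
for name, s in systems.items():
    ratio = s["lpus"] / baseline
    per_rack = s["lpus"] / s["racks"]
    print(f"{name}: {ratio:,.1f}x the current LPU count, {per_rack:,.1f} LPUs per rack")

# How many 336-LPU units make up the 3rd Gen maximum, and whether it divides evenly.
units_3rd_gen, remainder = divmod(680_064, 336)
print(f"3rd Gen: {units_3rd_gen} units of 336 LPUs, remainder {remainder}")
```

The 2nd Gen ratio comes out above 15x, consistent with the "more than 10 times" statement.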

vengineer.hatenablog.com

Graphcore IPU-M2000

Graphcore's chip also has no attached DRAM and is an SRAM-only system, and the IPU-M2000 is very similar in configuration to the 2nd Gen System above.

vengineer.hatenablog.com

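So why is Groq's LPU generating excitement with blazingly fast LLM inference, while Graphcore is the subject of sell-off talk?

The reason is:

Groq built its system for inference only from the very start
Graphcore was focused on training

With LLMs, training without attached DRAM is unrealistic however you look at it. So perhaps nobody wanted to train on Graphcore's systems. (Then again, Graphcore's systems were not selling even before OpenAI's ChatGPT appeared, so that may not be the reason.)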

vengineer.hatenablog.com

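In closing

The GroqCard seems to be $20,000 at Mouser, but I do not think a chip developed on GF 14nm with no attached DRAM costs anywhere near that much. It could probably be made for about the same as an NVIDIA gaming GPU. Oh, and Tenstorrent's Grayskull Dev Kit sells for $599/$799. Grayskull is made on GF 12nm.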

vengineer.hatenablog.com

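Grayskull's die size is in the 600mm² class and Groq's TSP is 800mm², so the TSP should cost more. Still, even if the chip cost doubles, there is no memory on the card, so a PCIe card could probably be built for $1,000.

In that case, even 568 cards come to $568,000, and allowing for the host-side cost of the GroqNodes, I figure the total is around $1M. With a selling price of 2-3x that, it would be $2-3M.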
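The gap between the list price and this back-of-the-envelope estimate is easier to see when the numbers are laid out together. The sketch below only restates the figures from this post ($20,000 list price, $1,000 guessed card cost, about $1M for the whole system, 2-3x markup); the variable names are hypothetical.

```python
CHIPS = 568

list_price_per_card = 20_000  # Mouser listing quoted in the text
est_cost_per_card = 1_000     # the rough build-cost guess from the text

list_total = CHIPS * list_price_per_card     # total at list price
est_cards_total = CHIPS * est_cost_per_card  # total for the cards at the guessed cost
est_system_cost = 1_000_000                  # cards plus host side, rounded as in the text

# Selling price at a 2x and 3x markup over the estimated system cost.
low, high = (est_system_cost * m for m in (2, 3))

print(f"list price total:        ${list_total:,}")
print(f"estimated card cost:     ${est_cards_total:,}")
print(f"estimated selling price: ${low:,} to ${high:,}")
```

At list price the 568 cards alone come to $11,360,000, several times the $2-3M selling price estimated here.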

So, are there users out there who would want to run Llama 2 70B blazingly fast for $2-3M?

Wouldn't that be in roughly the same price range as the Cerebras Systems CS-2 (40GB of internal SRAM)?