Vengineer's Daydreams

Daydreaming about life.

LLM performance of Tenstorrent's workstation, and more

Introduction

Sally-san is one of the few people I follow on X, and she wrote the EE Times article mentioned in the post below.

I called in at @tenstorrent to see the team's Llama3.1-70B demo up and running on one of their 8-chip systems at 15 tokens/second/user, a figure the company considers a work-in-progress as it continues to optimise its software. More in the article:https://t.co/AAGFAv3Gr8
— Sally Ward-Foxton (@sallywf) October 1, 2024

The article covers a number of things that you cannot learn from Tenstorrent's site or its posts on X, so I am recording them here.

The EE Times article

www.eetimes.com

A Japanese translation of the article above is also available.

eetimes.itmedia.co.jp

TT-LoudBox workstation

The TT-LoudBox workstation is an air-cooled workstation with four boards installed, each carrying two Wormhole chips. The actual machine is configured as follows.

CPU : 2x Intel® Xeon® Silver 4309Y (8C/16T ea., up to 2.8GHz, 105W, Ark)
Memory : 512GB (16x32GB) DDR4-3200 ECC RDIMM (12 Slots Free)
Storage : 3.8TB U.2 NVMe PCIe 4.0 x4
4x Tenstorrent Wormhole™ n300s Tensix Processor/4x Warp 100 Interconnects
Cables : 2x QSFP-DD 400GbE Cable, 0.6m/2ft.

The price is $12,000, roughly what a single NVIDIA A100 costs.

LLM performance of the TT-LoudBox workstation

Llama3.1-70B (BF8 precision) at 15 tokens per second per user for 32 concurrent users
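
As a quick back-of-the-envelope check, the per-user figure above can be turned into an aggregate throughput for the whole box. This is only a sketch: it assumes all 32 users really are served at the full 15 tokens/s at the same time, and the function name is made up for illustration.

```python
def aggregate_tokens_per_second(per_user_tps, concurrent_users):
    """Total tokens/s across all users, assuming each user gets the full per-user rate."""
    return per_user_tps * concurrent_users


# Figures quoted in the article: 15 tokens/s/user, 32 concurrent users.
loudbox_total_tps = aggregate_tokens_per_second(15, 32)
print(loudbox_total_tps)  # 480
```

Under that assumption the box as a whole would be generating about 480 tokens per second.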

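H100-based APIs deliver around 20-50 tokens/s.

An Nvidia DGX-H100, with eight H100s, costs $300,000, which is 25x the price of the TT-LoudBox.

On the Llama3.1-70B figures above, the H100 side is a little over 3x faster (presumably taking the upper end, 50 vs. 15 tokens/s), so the TT-LoudBox comes out about 8x better in cost-performance.

That said, this is only a 70B model...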
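
The arithmetic behind that estimate can be written out as a small sketch. The prices and the 15 tokens/s figure come from the text above; using 50 tokens/s for the H100 side is my assumption (the upper end of the quoted 20-50 range), and the function name is hypothetical. Note that this compares per-user token rates only, so it is a rough estimate rather than a benchmark.

```python
LOUDBOX_PRICE_USD = 12_000
DGX_H100_PRICE_USD = 300_000
LOUDBOX_TPS_PER_USER = 15
H100_API_TPS_PER_USER = 50  # assumption: upper end of the quoted 20-50 tokens/s


def cost_performance_advantage(price_a, tps_a, price_b, tps_b):
    """How many times more tokens/s per dollar system A delivers than system B."""
    return (tps_a / price_a) / (tps_b / price_b)


price_ratio = DGX_H100_PRICE_USD / LOUDBOX_PRICE_USD
speed_ratio = H100_API_TPS_PER_USER / LOUDBOX_TPS_PER_USER
advantage = cost_performance_advantage(
    LOUDBOX_PRICE_USD, LOUDBOX_TPS_PER_USER,
    DGX_H100_PRICE_USD, H100_API_TPS_PER_USER,
)

print(f"price ratio: {price_ratio:.1f}x")          # 25.0x
print(f"speed ratio: {speed_ratio:.2f}x")          # 3.33x
print(f"cost-performance advantage: {advantage:.1f}x")  # 7.5x
```

With the lower end of the range (20 tokens/s) the advantage would be larger still, so the "about 8x" figure corresponds to the case that is most favorable to the H100.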

chiplet

Tenstorrent modified Blue Cheetah’s BoW PHY, working closely with the company to make something that works well while being area-efficient.

UCIe is the standardization effort for chiplets, but Tenstorrent's interface is based on a BoW PHY instead.

This may be the one:

www.design-reuse.com

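The generation after Blackhole will be chiplet-based, and Tenstorrent is developing the whole thing itself.

I also wrote about Tenstorrent's chiplets in the blog post below: a memory die, an I/O die, an AI compute die, and a CPU die.

Blackhole-based workstation

The Friendly Box is apparently cheaper than the Wormhole-based workstation and about 2-3x faster...

Making it cheaper will take some kind of mechanism. The TT-LoudBox had eight Wormhole chips, so perhaps this one drops to four or even two chips. Otherwise it looks infeasible in terms of power consumption.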

Conclusion

Tenstorrent-san seems to be taking a different path from these companies:

Groq
SambaNova Systems
Cerebras Systems

I imagine the reason may tie back to how the CEO, Jim-san, feels about all this.

Related blog posts

vengineer.hatenablog.com