Graphcore's BERT code - Vengineerの妄想

@Vengineerの戯言 : Twitter
Welcome to the world of SystemVerilog. It all started with the release of SystemC v0.9.

Graphcore's sample code includes ResNet, but it also includes a much larger model: BERT.

github.com

Training is run on the following two BERT models:

BERT Base – 12 layers (transformer blocks), 110 million parameters
BERT Large – 24 layers (transformer blocks), 340 million parameters

On Graphcore hardware, the IPU allocation is described as:

BERT Large : 14 IPUs for pre-training (Wikipedia) and 13 IPUs for training on the SQuAD dataset. There are 2 layers of the model per IPU and the other IPUs are used for embedding and projection.
BERT Base : 8 IPUs for pre-training (Wikipedia) and 7 IPUs for training on the SQuAD dataset (2 layers of the model per IPU, other IPUs used for embedding and projection).

That appears to be how it worked out. Since BERT Large needs 14 IPUs, a Dell server is enough.

Note that IPUs are used in even numbers, so when the count is odd, some chips go unused.
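The IPU counts quoted above can be checked with a little arithmetic. The following is a minimal sketch, not Graphcore's code: the function names `ipus_for_model` and `reserved_ipus` are hypothetical, and the assumption that the SQuAD configurations use one extra IPU instead of two is inferred only from the quoted totals (13 and 7). The round-up-to-even step simply follows the even-count remark above.

```python
def ipus_for_model(num_layers, layers_per_ipu=2, extra_ipus=2):
    """IPUs needed: transformer stages plus IPUs for embedding/projection.

    extra_ipus=2 matches the quoted pre-training figures; extra_ipus=1 is an
    assumption that reproduces the quoted SQuAD figures.
    """
    transformer_ipus = -(-num_layers // layers_per_ipu)  # ceiling division
    return transformer_ipus + extra_ipus


def reserved_ipus(needed):
    """Round an IPU count up to the next even number (assumed allocation rule)."""
    return needed + (needed % 2)


if __name__ == "__main__":
    for name, layers in [("BERT Base", 12), ("BERT Large", 24)]:
        pretrain = ipus_for_model(layers, extra_ipus=2)
        squad = ipus_for_model(layers, extra_ipus=1)
        print(f"{name}: pre-training uses {pretrain} IPUs "
              f"(reserve {reserved_ipus(pretrain)}), "
              f"SQuAD uses {squad} IPUs (reserve {reserved_ipus(squad)}, "
              f"{reserved_ipus(squad) - squad} idle)")
```

Under these assumptions, the SQuAD runs are the odd cases where one chip is left idle.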

For BERT Large, it appears that the 14 IPUs are chained together with pipeline parallelism, in a pipeline like the following:

　IPU0: Tied Token Embedding/Projection

　IPU1: Position Embedding/Losses

　IPU2: Transformer layer 1 and 2

　IPU3: Transformer layer 3 and 4

　...

　IPU13: Transformer layer 23 and 24
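The stage layout above can be written out in full with a few lines of Python. This is only an illustrative sketch of the mapping as listed, not how Graphcore's BERT code assigns stages. The function name `build_pipeline` and its parameters are hypothetical.

```python
def build_pipeline(num_layers=24, layers_per_ipu=2):
    """Return a dict mapping IPU index to a description of its pipeline stage.

    IPU0 and IPU1 hold the embedding/projection/loss stages as listed in the
    text; the remaining IPUs each hold `layers_per_ipu` transformer layers.
    """
    stages = {
        0: "Tied Token Embedding/Projection",
        1: "Position Embedding/Losses",
    }
    for i, first in enumerate(range(1, num_layers + 1, layers_per_ipu)):
        last = min(first + layers_per_ipu - 1, num_layers)
        layers = " and ".join(str(n) for n in range(first, last + 1))
        stages[2 + i] = f"Transformer layer {layers}"
    return stages


if __name__ == "__main__":
    for ipu, stage in build_pipeline().items():
        print(f"IPU{ipu}: {stage}")
```

Calling `build_pipeline(num_layers=12)` gives the corresponding 8-stage layout for BERT Base, assuming it follows the same pattern.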