Various updates (10th Jan)

* Update BERT inference instructions
* Preserve order of operations in graph when folding batch norm ops in CNN inference
* Various README typo and documentation fixes
* Tidy PopART pipelining example
* In PopART pipelining code example - replace popart.SessionOptionsCore() with popart.SessionOptions()
* Test fixes
* Clearer accuracy assertion error in public example tests
* Fixes to avoid warnings in kernel benchmarks run harness
* Add a more advanced/complete Poplar example
* Add a simple Tensorflow custom op example.
* BERT Inference bug fixes

Specifically, the Advanced Poplar Example and Creating a simple Tensorflow Custom Op appear to have been added.

The Advanced Poplar Example also gives some idea of how the Poplar API maps work onto the tiles inside a Graphcore IPU. It loads a custom op, and that custom op is covered in Creating a simple Tensorflow Custom Op.

The PopART pipeline example comes with no write-up, but the source code (pipeline.py) contains this:

# ------------ Model Definition ---------------------------------------
# <------------ ipu0 --------><---------- ipu1 --------------------->
#
# d0 --|-- Gemm --|-- Relu --|-- Gemm--|-- Relu --><--Softmax --> Out
# w0 --|                w1 --|
# b0 --|                b1 --|

so it looks like an example of implementing a pipeline across two IPUs.

Apparently something called builder.virtualGraph is what decides the mapping onto each IPU.
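To make the idea of "which op lives on which IPU" concrete, here is a minimal plain-Python sketch of the model in the diagram. It does not use PopART at all: the `gemm`, `relu`, `softmax`, `build_model`, `ops_per_ipu` and `run` helpers are hypothetical names of my own, and the split (first Gemm and Relu on ipu0, the rest on ipu1) is only my reading of the comment in pipeline.py.

```python
import math


def gemm(x, w, b):
    """Row vector x (length n) times matrix w (n rows, m columns) plus bias b."""
    return [sum(xi * wi for xi, wi in zip(x, column)) + bias
            for column, bias in zip(zip(*w), b)]


def relu(x):
    return [max(0.0, v) for v in x]


def softmax(x):
    peak = max(x)  # subtract the maximum for numerical stability
    exps = [math.exp(v - peak) for v in x]
    total = sum(exps)
    return [e / total for e in exps]


def build_model(w0, b0, w1, b1):
    """Each entry is (op name, function, virtual graph id).

    The ids are an assumption based on the diagram: 0 stands for ipu0
    and 1 stands for ipu1.
    """
    return [
        ("Gemm0", lambda x: gemm(x, w0, b0), 0),
        ("Relu0", relu, 0),
        ("Gemm1", lambda x: gemm(x, w1, b1), 1),
        ("Relu1", relu, 1),
        ("Softmax", softmax, 1),
    ]


def ops_per_ipu(model):
    """Group op names by the virtual graph id they were assigned to."""
    placement = {}
    for name, _, ipu in model:
        placement.setdefault(ipu, []).append(name)
    return placement


def run(model, d0):
    """Run the ops in order; on real hardware the hand-off between
    virtual graph 0 and 1 would be a transfer between the two IPUs."""
    x = d0
    for _, fn, _ in model:
        x = fn(x)
    return x


if __name__ == "__main__":
    model = build_model([[1.0, -1.0], [0.5, 2.0]], [0.0, 0.0],
                        [[1.0, 0.0], [0.0, 1.0]], [0.0, 0.0])
    print(ops_per_ipu(model))
    print(run(model, [1.0, 2.0]))
```

The point of keeping the placement as plain data next to each op is that the same model definition can be regrouped onto a different number of devices without touching the ops themselves.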

2020-01-16

Graphcore's presentation video from TensorFlow London

@Vengineer's ramblings : Twitter
Welcome to the world of SystemVerilog. It all began with the release of SystemC v0.9.

From this Graphcore tweet I learned that the video of Graphcore's talk at TensorFlow London has been published.

Another chance to listen to @graphcoreai's Dave Lacey talk about IPUs, #TensorFlow and how to use both for #ML innovation! @ThinkRiseLDN @seldon_io Watch the full recording here: https://t.co/bZrarqapdh
- graphcore (@graphcoreai), January 8, 2020

www.youtube.com

It appears to be the video for the tweet I referenced in an earlier post (vengineer.hatenablog.com).

Here are my notes from the video on how it is used together with TensorFlow.

$ cd poplar_sdk_ubuntu_18_004-0.8.8/

$ source gc_drivers-ubuntu_18_04-0.0.32/enable.sh

$ source poplar-ubuntu_18_04-0.8.0/enable.sh

$ virtualenv gc_venv

$ source gc_venv/bin/activate

(gc_venv) $

(gc_venv) $ pip install gc_tensorflow-0.2.5-cp27-cp27mu-cp27mu-linux_86_64.whl

(gc_venv) $ pip install gc_tensorboard-1.0.0-py2-none-any.whl

Huh, they are using Python 2.7...

As for the model:

self.network = ResNet(image_data, num_classes)

self.network.build_model()

From around the 15-minute mark, the talk explains the POPLAR GRAPH FRAMEWORK.

TensorFlow XLA is explained as well.

Around 17:52

A cluster from the core graph goes through

TensorFlow XLA with IPU-specific optimizations => HLO graph, which goes through

the Graphcore XLA backend => Poplar graph

Around 18:23

IPU SPECIFIC HLO OPTIMIZATIONS

1. Fusing - recognize common sub-graphs with existing optimized poplar library functions

2. Graph outlining

3. Analysis of which operations consume inputs and variables (for data layout)

4. Detection of "in-place" operations

Around 19:24

THE IPU BACKEND

MatMul => poplin::matMul

BiasAdd => popops::add

Around 20:32

MULTI-IPU CONSTRUCTS

An explanation of what I covered in an earlier post (vengineer.hatenablog.com).

Around 21:14

PIPELINED EXECUTION

Improve compute efficiency of a multi-IPU model

- Parts of the model are split across different devices

- One IPU is operating on one mini-batch, another is operating on a different mini-batch

- FIFOs used to store activations between FWD and BWD passes

- Scheduler for pipeline stages

- Python API similar to the xla.compile(fn, ...) interface

Around 21:38

IPU1 : FWD BWD

IPU2 : FWD BWD

IPU1 : FWD.1 FWD.2 FWD.3 BWD.1 FWD.1 BWD.2 FWD.2 BWD.3 FWD.3

IPU2 : FWD.1 BWD.1 FWD.2 BWD.2 FWD.3 BWD.4 FWD.1 BWD.1

The PCIe board carries two IPUs, and the two chips train a single model together, which is why the flow looks like the one above.
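The overlap between the two IPUs can be sketched with a toy schedule. This is a simplification of my own and not Graphcore's scheduler: it models only the forward passes, assumes every stage takes exactly one time step, and the function names `forward_schedule` and `utilization` are hypothetical. The BWD interleaving shown on the slide is not modelled.

```python
def forward_schedule(num_stages, num_batches):
    """Return timeline[stage][t]: the 1-based mini-batch being processed
    by that stage at time step t, or None when the stage is idle.

    Stage s can only start mini-batch m after stage s-1 has finished it,
    so mini-batch m reaches stage s at time step s + m.
    """
    total_steps = num_stages + num_batches - 1
    timeline = [[None] * total_steps for _ in range(num_stages)]
    for stage in range(num_stages):
        for batch in range(num_batches):
            timeline[stage][stage + batch] = batch + 1
    return timeline


def utilization(timeline):
    """Fraction of (stage, time step) slots that are doing work."""
    busy = sum(slot is not None for row in timeline for slot in row)
    slots = sum(len(row) for row in timeline)
    return busy / slots


if __name__ == "__main__":
    for stage, row in enumerate(forward_schedule(2, 3), start=1):
        cells = ["FWD.%d" % b if b is not None else "-----" for b in row]
        print("IPU%d : %s" % (stage, " ".join(cells)))
    print("utilization:", utilization(forward_schedule(2, 3)))
```

With only three mini-batches, the fill and drain steps leave a quarter of the slots idle in this toy model. The idle fraction shrinks as more mini-batches flow through before the pipeline drains, which is presumably why a large pipeline depth is attractive.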

Around 22:32

PIPELINED EXECUTION

def fn1...

def fn2 ...

pipe = pipelining_ops.pipeline(

computational_stages=[fn1, fn2],

pipeline_depth=250,

2020-01-15

Groq's cloud inference chip: a few details on the TSP


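Groq was founded by members of the team that developed Google TPU v1.
A report has been posted that reveals quite a lot of detail
about its inference accelerator (TSP).
If you read it, let me know what you think. https://t.co/vjkYUOr1Kx
- Vengineer@binge-watching movies on Amazon Prime (@Vengineer), January 9, 2020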

I read the report I tweeted about there: The Linley Group's Microprocessor Report on Groq's cloud inference chip, the TSP, titled GROQ ROCKS NEURAL NETWORKS.

The target frequency is 1.0 GHz.
It started out at 900 MHz,
and the expectation is that it will eventually reach 1.25 GHz.

On ResNet-50, the latencies are:
TSP (batch=1) : 0.05ms
NVIDIA V100 (batch=1) : 0.87ms
NVIDIA V100 (batch=128) : 16ms
Habana Goya (batch=1) : 0.24 ms

At batch size 1 and 0.05 ms per image, that works out to 20,000 images inferred per second.

The benchmark figure is 20,400 images.
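The latency-to-throughput arithmetic above can be written down as a small sketch. It assumes only one batch is in flight at a time, so throughput is simply batch size divided by latency; a real benchmark can overlap work and report a somewhat different number. The function name `images_per_second` is my own.

```python
def images_per_second(latency_ms, batch_size=1):
    """Throughput implied by a latency figure, assuming batches are
    processed strictly one after another (no overlap)."""
    return batch_size * 1000.0 / latency_ms


# (chip, batch size, latency in ms) as listed in the post
LATENCIES = [
    ("TSP", 1, 0.05),
    ("NVIDIA V100", 1, 0.87),
    ("NVIDIA V100", 128, 16.0),
    ("Habana Goya", 1, 0.24),
]

if __name__ == "__main__":
    for chip, batch, latency in LATENCIES:
        rate = images_per_second(latency, batch)
        print("%-12s batch=%-3d %8.0f images/s" % (chip, batch, rate))
```

Under the same assumption, the V100 at batch 128 and 16 ms comes out at about 8,000 images per second, which shows how a large batch buys throughput at the cost of latency.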

Internally, the chip is built from the following four units.

Vector Unit
Memory Unit
Switch Unit
Matrix Unit

Specifically, the chip consists of 20 blocks (plus 1 spare) of the form

Matrix <=> Switch <=> Memory <=> Vector <=> Memory <=> Switch <=> Matrix

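The only memory is on-chip SRAM.

Each Memory Unit above holds 5.5 MB, and there are 40 of them (two per block across 20 blocks), for a total of 220 MB.

Because the batch size is 1, the input is a single image. The rest is just parameters and the model's intermediate data.

If a model does not fit in 220 MB, I suppose they would use multiple chips.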
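As a back-of-the-envelope check on the memory figures, here is a small sketch. It assumes the whole SRAM is available to the model and that a model could be split evenly across chips, neither of which the report confirms; `chips_needed` and the example sizes are hypothetical.

```python
import math

MEMORY_UNIT_MB = 5.5  # per Memory Unit, from the report
MEMORY_UNITS = 40     # Memory Units per chip, from the report


def total_sram_mb():
    """On-chip SRAM per chip in MB."""
    return MEMORY_UNIT_MB * MEMORY_UNITS


def chips_needed(model_mb):
    """Smallest number of chips whose combined SRAM holds model_mb,
    assuming (hypothetically) an even split with no overhead."""
    return max(1, math.ceil(model_mb / total_sram_mb()))


if __name__ == "__main__":
    print("SRAM per chip: %.0f MB" % total_sram_mb())
    for size_mb in (100, 220, 221, 500):  # made-up footprints
        print("%4d MB -> %d chip(s)" % (size_mb, chips_needed(size_mb)))
```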

The photo below is cropped from a photo found here.

[Photo: the three connectors along the top edge of the Groq board]

There are three connectors at the top of the board. Figure 3 of the report also mentions "PCIe and Other I/O".

Addendum (2020.01.18)

www.hpcwire.com

2020-01-14

A paper on a high-level polyhedral compiler from Cerebras


This came across my Twitter feed:

Generating SIMD Instructions for Cerebras CS-1 using Polyhedral Compilation Techniques - https://t.co/RiuqP9Pxr8 #ScholarAlerts
- koumei tomida (@koumeitomida), January 12, 2020

It is about the Cerebras Systems CS-1. The paper describes itself as follows:

This paper describes a high-level polyhedral compiler that takes a high-level algorithm description that can be written manually or extracted from a TensorFlow computation graph and generates input to the low-level C-based compiler.

　a high-level polyhedral compiler

polyhedral

Ah, Tensor Comprehensions.

That is where I first learned about polyhedral compilers, as written up in an earlier post (vengineer.hatenablog.com).

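Sven Verdoolaege, the first author of the Cerebras paper, is also on the Tensor Comprehensions paper. He appears to have moved to Cerebras Systems in October 2018.

I posted the blog entry below in December 2018, so the timing roughly lines up.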

vengineer.hatenablog.com

For its core polyhedral compilation operations, dtg_codegen relies on isl

the paper also says, so that fits. I had a look at isl, and it was updated in November.

The paper also says that isl

is also used by Tensor Comprehensions (Vasilache et al. 2019)

so, bingo.

2020-01-13

AWS's Arm chip, Graviton 2


Graviton 2 was announced at AWS re:Invent 2019.

www.publickey1.jp

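As this article notes, AWS acquired Annapurna Labs in 2015.

It was explained that the purpose was to develop in-house ASICs for offloading work to hardware, in order to make the server virtualization functions of Amazon EC2 perform better.

This ASIC is now built into every Amazon EC2 server as part of the "Nitro System".

That was the story then. The processor AWS went on to develop in-house is "Graviton", and the newly announced "Graviton 2" is its second generation.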

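The specs can also be found here:

Built on a 7nm manufacturing process and based on 64-bit Arm Neoverse cores, with 64 vCPUs and 25 Gbps networking

So, Arm's latest core developed on a leading-edge process. In other words, the development cost is high. For Amazon, though, it is at a level that is not a problem.

The keynote video covers the chip from around 17:30.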

www.youtube.com

AWS re:Invent 2019 information

dev.classmethod.jp

2020-01-12

I have published my slides analyzing the Cloud TPU Driver API source code.


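At this point I have the following three unpublished summary slide decks.

1) What is TensorFlow Lite Delegate?
2) Cloud TPU Driver API source code analysis
3) Cloud Deep Learning Chips

Is anyone interested? How many of you are there?
Please help spread the word too.
- Vengineer@binge-watching movies on Amazon Prime (@Vengineer), December 8, 2019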

The most popular one, the TensorFlow Lite Delegate slides, was published on 2019.12.15.

vengineer.hatenablog.com

The second most popular one, the Cloud Deep Learning Chips summary slides, was published on 2019.12.31.

vengineer.hatenablog.com

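I have now published the last remaining one, the "Cloud TPU Driver API source code analysis" slides, on Slideshare. Please make use of them.

Note that two months have passed since the source code was released, so parts of it have been updated. I have not tracked those changes.

Cloud TPU Driver API source code analysis from Mr. Vengineer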