Graphcore's presentation video from TensorFlow London

@Vengineer's Ramblings : Twitter
Welcome to the world of SystemVerilog. It all began with the release of SystemC v0.9.

I learned from this Graphcore tweet that the video of Graphcore's talk at TensorFlow London has been published.

Another chance to listen to @graphcoreai's Dave Lacey talk about IPUs, #TensorFlow and how to use both for #ML innovation! @ThinkRiseLDN @seldon_io Watch the full recording here: https://t.co/bZrarqapdh
— graphcore (@graphcoreai) January 8, 2020

www.youtube.com

This appears to be the video for the tweet I referenced in an earlier post (vengineer.hatenablog.com).

Below are my notes from the video on how to use the IPU together with TensorFlow.

$ cd poplar_sdk_ubuntu_18_04-0.8.8/

$ source gc_drivers-ubuntu_18_04-0.0.32/enable.sh

$ source poplar-ubuntu_18_04-0.8.0/enable.sh

$ virtualenv gc_venv

$ source gc_venv/bin/activate

(gc_venv) $

(gc_venv) $ pip install gc_tensorflow-0.2.5-cp27-cp27mu-linux_x86_64.whl

(gc_venv) $ pip install gc_tensorboard-1.0.0-py2-none-any.whl

Huh, they are using Python 2.7...
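The `py2` and `cp27` parts of the wheel names are what give the Python version away. Here is a minimal sketch of reading those tags, assuming the standard wheel naming scheme (name, version, optional build tag, then Python tag, ABI tag, and platform tag). `parse_wheel_tags` is a hypothetical helper of my own, not part of pip or the Poplar SDK.

```python
def parse_wheel_tags(filename):
    """Split a wheel filename into its name, version, and compatibility tags."""
    if not filename.endswith(".whl"):
        raise ValueError("not a wheel file: " + filename)
    parts = filename[:-len(".whl")].split("-")
    # A wheel is named name-version[-build]-python-abi-platform,
    # so the last three fields are always the compatibility tags.
    if len(parts) not in (5, 6):
        raise ValueError("unexpected wheel filename: " + filename)
    python_tag, abi_tag, platform_tag = parts[-3:]
    return {
        "name": parts[0],
        "version": parts[1],
        "python": python_tag,
        "abi": abi_tag,
        "platform": platform_tag,
    }


tags = parse_wheel_tags("gc_tensorboard-1.0.0-py2-none-any.whl")
print(tags["python"], tags["abi"], tags["platform"])
```

For the TensorBoard wheel this reports a Python tag of `py2`, meaning any Python 2 interpreter, with no ABI or platform restriction.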

As for the model:

self.network = ResNet(image_data, num_classes)

self.network.build_model()

From around the 15-minute mark, the POPLAR GRAPH FRAMEWORK is explained.

TensorFlow XLA is explained as well.

Around 17:52

A cluster from the core graph is turned by

TensorFlow XLA with IPU specific optimizations => into an HLO graph, which is turned by

The Graphcore XLA backend => into a Poplar graph

Around 18:23

IPU SPECIFIC HLO OPTIMIZATIONS

1. Fusing - recognize common sub-graphs with existing optimized poplar library functions

2. Graph outlining

3. Analysis of which operations consume inputs and variables (for data layout)

4. Detection of "in-place" operations
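To make item 1 (fusing) a bit more concrete, here is a toy sketch of a fusion pass over a tiny op list. It is my own illustration, not Graphcore's implementation: the dict-based graph format, the function `fuse_matmul_biasadd`, and the op name `FusedMatMulBiasAdd` are all hypothetical. It assumes the pattern of interest is a MatMul whose only consumer is a BiasAdd.

```python
def fuse_matmul_biasadd(ops):
    """Replace MatMul -> BiasAdd pairs with a single fused op (toy example)."""
    # Count how many ops read each value, so that a MatMul whose result
    # is also used elsewhere is left alone.
    consumers = {}
    for op in ops:
        for name in op["inputs"]:
            consumers[name] = consumers.get(name, 0) + 1

    by_name = {op["name"]: op for op in ops}
    fused_away = set()
    result = []
    for op in ops:
        producer = by_name.get(op["inputs"][0]) if op["inputs"] else None
        if (op["type"] == "BiasAdd" and producer is not None
                and producer["type"] == "MatMul"
                and consumers[producer["name"]] == 1):
            # Keep the BiasAdd's name so later ops still find their input.
            fused_away.add(producer["name"])
            result.append({
                "name": op["name"],
                "type": "FusedMatMulBiasAdd",  # hypothetical fused op
                "inputs": producer["inputs"] + op["inputs"][1:],
            })
        else:
            result.append(op)
    # Drop the MatMul ops that were absorbed into a fused op.
    return [op for op in result if op["name"] not in fused_away]


graph = [
    {"name": "mm", "type": "MatMul", "inputs": ["x", "w"]},
    {"name": "out", "type": "BiasAdd", "inputs": ["mm", "b"]},
    {"name": "act", "type": "Relu", "inputs": ["out"]},
]
for op in fuse_matmul_biasadd(graph):
    print(op["name"], op["type"], op["inputs"])
```

The MatMul and BiasAdd collapse into one op that takes `x`, `w`, and `b`, which a backend could then map to a single optimized library call.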

Around 19:24

THE IPU BACKEND

MatMul => poplin::matMul

BiasAdd => popops::add
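The mapping on this slide is essentially a lookup from TensorFlow ops to Poplar library functions. A minimal sketch of that idea follows. The two table entries come from the slide; the `LOWERING` table and the `lower` function are hypothetical names of mine, and the real backend of course emits Poplar graph code rather than strings.

```python
# Mapping taken from the slide; only these two entries are known.
LOWERING = {
    "MatMul": "poplin::matMul",
    "BiasAdd": "popops::add",
}


def lower(op_types):
    """Map TensorFlow op names to Poplar library calls, failing on unknown ops."""
    lowered = []
    for op_type in op_types:
        if op_type not in LOWERING:
            raise NotImplementedError("no lowering known for " + op_type)
        lowered.append(LOWERING[op_type])
    return lowered


print(lower(["MatMul", "BiasAdd"]))
```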

Around 20:32

MULTI-IPU CONSTRUCTS

This part explains what I covered in an earlier post (vengineer.hatenablog.com).

Around 21:14

PIPELINED EXECUTION

Improve compute efficiency of a multi-IPU model

- Parts of the model are split across different devices

- One IPU is operating on one mini-batch, another is operating on a different mini-batch

- FIFOs used to store activations between FWD and BWD passes

- Scheduler for pipeline stages

- Python API similar to the xla.compile(fn, ...) interface

Around 21:38

IPU1 : FWD BWD

IPU2 : FWD BWD

IPU1 : FWD.1 FWD.2 FWD.3 BWD.1 FWD.1 BWD.2 FWD.2 BWD.3 FWD.3

IPU2 : FWD.1 BWD.1 FWD.2 BWD.2 FWD.3 BWD.3 FWD.1 BWD.1

A PCIe board carries two IPUs, and the two chips train a single model together, which is why the flow looks like the one above.
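To get a feel for why pipelining helps, here is a small simulation I sketched. It assumes a simpler fill-then-drain schedule (all forward passes, then all backward passes), not the interleaved schedule on the slide, and it assumes every FWD or BWD step takes one time slot. The function `pipeline_schedule` is hypothetical. Each stage keeps a FIFO of the mini-batches whose activations it is holding between FWD and BWD.

```python
from collections import deque


def pipeline_schedule(num_stages, num_batches):
    """Return one row of slot labels per stage for a fill-then-drain pipeline."""
    steps = num_batches + num_stages - 1
    fifos = [deque() for _ in range(num_stages)]
    rows = [[] for _ in range(num_stages)]

    # Forward phase: stage s works on mini-batch t - s at time step t.
    for t in range(steps):
        for s in range(num_stages):
            b = t - s
            if 0 <= b < num_batches:
                fifos[s].append(b)  # stash activations for the BWD pass
                rows[s].append("FWD.%d" % (b + 1))
            else:
                rows[s].append("idle")

    # Backward phase: gradients flow from the last stage back to the first.
    for t in range(steps):
        for s in range(num_stages):
            b = t - (num_stages - 1 - s)
            if 0 <= b < num_batches:
                # The oldest stored activations are consumed first.
                assert fifos[s].popleft() == b
                rows[s].append("BWD.%d" % (b + 1))
            else:
                rows[s].append("idle")
    return rows


for stage, row in enumerate(pipeline_schedule(2, 3), start=1):
    print("IPU%d : %s" % (stage, " ".join(row)))
```

With two stages and three mini-batches this takes 8 time slots per IPU, against 12 if every FWD and BWD step ran strictly one after another. The interleaved schedule in the video goes further by mixing FWD and BWD steps so that the FIFOs stay shallow.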

Around 22:32

PIPELINED EXECUTION

def fn1...

def fn2 ...

pipe = pipelining_ops.pipeline(

computational_stages=[fn1, fn2],

pipeline_depth=250,