2020-03-12

AWS Neuron SDK : TensorFlow - Neuron (neuron_op)

@Vengineer's Ramblings : Twitter
Welcome to the world of SystemVerilog. It all began with the release of SystemC v0.9.

Continuing from yesterday. Inside

tensorflow.neuron.saved_model.compile

the TensorFlow model is compiled with neuron-cc, the result is read back, and the data that was read is stored in the Node's 'executable' attribute.

workdir_path = subgraph_compilers[node.name].workdir_path
executable_path = os.path.join(workdir_path, _neuron_executable_name)
with open(executable_path, 'rb') as f:
    node.attr['executable'].s = f.read()
if node.name in num_cores_tuple_map:
    # (the excerpt is cut off here)

The Node that receives this data in its attribute has the Op name neuron_op.

neuron_op looks like this:

@tf_export('neuron_op')
def neuron_op(input_tensors, graph_def, input_names, input_shapes, output_names, output_dtypes, output_shapes, executable="", input_batch_axis=[], output_batch_axis=[], model_config=[], name=None):
  r"""TODO: add doc.

  Args:
    input_tensors: A list of `Tensor` objects.
    graph_def: A `string`.
    input_names: A list of `strings`.
    input_shapes: A list of shapes (each a `tf.TensorShape` or list of `ints`).
    output_names: A list of `strings`.
    output_dtypes: A list of `tf.DTypes`.
    output_shapes: A list of shapes (each a `tf.TensorShape` or list of `ints`).
    executable: An optional `string`. Defaults to `""`.
    input_batch_axis: An optional list of `ints`. Defaults to `[]`.
    output_batch_axis: An optional list of `ints`. Defaults to `[]`.
    model_config: An optional list of `ints`. Defaults to `[]`.
    name: A name for the operation (optional).

  Returns:
    A list of `Tensor` objects of type `output_dtypes`.
  """

Note the executable argument. This is where the compiled code goes.

In practice, NeuronOp is executed through _pywrap_tensorflow.TFE_Py_FastPathExecute.

_result = _pywrap_tensorflow.TFE_Py_FastPathExecute(
_ctx._context_handle, _ctx._thread_local_data.device_name, "NeuronOp",
name, _ctx.post_execution_callbacks, input_tensors, "graph_def",
graph_def, "input_names", input_names, "input_shapes", input_shapes,
"output_names", output_names, "output_dtypes", output_dtypes,
"output_shapes", output_shapes, "executable", executable,
"input_batch_axis", input_batch_axis, "output_batch_axis",
output_batch_axis, "model_config", model_config)
return _result

TFE_Py_FastPathExecute simply forwards the call to TFE_Py_FastPathExecute_C:

m.def("TFE_Py_FastPathExecute", [](const py::args args) {
// TFE_Py_FastPathExecute requires error checking prior to returning.
return tensorflow::pyo_or_throw(TFE_Py_FastPathExecute_C(args.ptr()));
});

TFE_Py_FastPathExecute_C runs the op with TFE_Execute.

Py_BEGIN_ALLOW_THREADS;
TFE_Execute(op, retvals.data(), &num_retvals, status);
Py_END_ALLOW_THREADS;

TFE_Execute looks like this:

void TFE_Execute(TFE_Op* op, TFE_TensorHandle** retvals, int* num_retvals,
TF_Status* status) {
absl::FixedArray<std::unique_ptr<AbstractTensorHandleInterface>> handles(
*num_retvals);
status->status = op->operation->Execute(&handles, num_retvals);
if (!status->status.ok()) {
return;
}
for (int i = 0; i < *num_retvals; ++i) {
retvals[i] = new TFE_TensorHandle{std::move(handles[i])};
}
}

It calls Execute on op->operation, and Execute in turn calls EagerExecute.

Status OperationInterface::Execute(
absl::FixedArray<std::unique_ptr<AbstractTensorHandleInterface>>* retvals,
int* num_retvals) {
absl::FixedArray<tensorflow::TensorHandle*> handle_retvals(*num_retvals);
TF_RETURN_IF_ERROR(
EagerExecute(&operation_, handle_retvals.data(), num_retvals));
for (int i = 0; i < *num_retvals; ++i) {
retvals->at(i).reset(
new tensorflow::TensorHandleInterface(handle_retvals[i]));
}
return Status::OK();
}

Inside EagerExecute:

return EagerLocalExecute(op, retvals, num_retvals);

For NeuronOp, something happens inside tensorflow_core/python/neuron/ops/gen_neuron_op.py. NeuronOp itself lives in tensorflow_core/python/neuron/python/ops/_neuron_op.so, so I cannot tell what it actually does. It is a TensorFlow OpKernel, so I expect the work happens in its Compute method, but that is as far as I can see.

The same approach as TensorFlow XLA!

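Bundling the Ops that run on the accelerator into a single Op is the same approach TensorFlow XLA takes. In other words, the result of compiling the TensorFlow model with neuron-cc is carried by a NeuronOp, and the work is handed off to Inferentia at the point where that NeuronOp executes.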
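As a rough illustration of the "bundle accelerator Ops into one Op" idea, here is a minimal sketch. It treats the graph as a flat, ordered list of op names and collapses each contiguous run of supported ops into one fused node. The function name, the op list, and the supported set are all made up for illustration; the real tf.neuron and XLA partitioners work on a dataflow graph and are not shown in this post.

```python
from typing import List, Set, Tuple, Union

Node = Union[str, Tuple[str, List[str]]]


def fuse_supported_runs(ops: List[str], supported: Set[str],
                        fused_name: str = "NeuronOp") -> List[Node]:
    """Collapse each contiguous run of supported ops into one fused node.

    Hypothetical helper: a flat op list stands in for a real graph.
    """
    result: List[Node] = []
    run: List[str] = []
    for op in ops:
        if op in supported:
            run.append(op)          # keep growing the current run
        else:
            if run:                 # close the run before an unsupported op
                result.append((fused_name, run))
                run = []
            result.append(op)
    if run:                         # close a run that reaches the end
        result.append((fused_name, run))
    return result


# Illustrative op names only; not taken from a real model.
graph = ["Placeholder", "Conv2D", "Relu", "MatMul", "Softmax", "Identity"]
supported_ops = {"Conv2D", "Relu", "MatMul", "Softmax"}
fused = fuse_supported_runs(graph, supported_ops)
print(fused)
```

The four supported ops end up inside a single fused node, while Placeholder and Identity stay as ordinary ops on the host side.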

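With a typical inference chip, a model trained in a framework is compiled by a dedicated compiler, converted to a proprietary format, and written to a file. Inference then runs from that file through a proprietary API. Vendors stay flexible by providing C/C++/Python APIs, but compiling and then executing the model inside the framework is something only TensorFlow XLA does. In that sense, AWS Neuron SDK + TensorFlow - Neuron is very interesting.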

2020-03-11

AWS Neuron SDK : TensorFlow - Neuron (neuron.saved_model.compile)


The documentation says the Neuron Compiler can compile TensorFlow models. For a SavedModel, as described in the TensorFlow-Neuron Compilation API, the API

tensorflow.neuron.saved_model.compile

compiles the SavedModel with the Neuron Compiler from inside TensorFlow. Running this method prints messages like these:

INFO:tensorflow:Number of operations in TensorFlow session: 3978
INFO:tensorflow:Number of operations after tf.neuron optimizations: 555
INFO:tensorflow:Number of operations placed on Neuron runtime: 554

In this example, the 3978 TensorFlow Ops appear to have been reduced to 555 by the tf.neuron optimizations, and 554 of those were placed on the Neuron runtime.
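To make those numbers concrete, here is a small sketch that parses the three log lines quoted above and works out what share of the remaining ops ended up on the Neuron runtime. The parsing function is my own helper, not part of tensorflow.neuron.

```python
import re

# The three log lines quoted above.
log_text = """\
INFO:tensorflow:Number of operations in TensorFlow session: 3978
INFO:tensorflow:Number of operations after tf.neuron optimizations: 555
INFO:tensorflow:Number of operations placed on Neuron runtime: 554
"""


def parse_op_counts(text):
    """Map the descriptive part of each log line to its integer count."""
    pattern = re.compile(r"Number of operations (.+?): (\d+)")
    return {m.group(1): int(m.group(2)) for m in pattern.finditer(text)}


counts = parse_op_counts(log_text)
in_session = counts["in TensorFlow session"]
after_opt = counts["after tf.neuron optimizations"]
on_neuron = counts["placed on Neuron runtime"]

print(f"ops removed by optimization: {in_session - after_opt}")
print(f"{on_neuron}/{after_opt} ops on Neuron ({on_neuron / after_opt:.1%})")
```

So only one op out of 555 stays outside the Neuron runtime in this example.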

So what code do you need to write?

import shutil
import tensorflow.neuron as tfn
saved_model_path = "<saved model path>"
compiled_saved_model_path = "<compiled saved model path>"
shutil.rmtree(compiled_saved_model_path, ignore_errors=True)
tfn.saved_model.compile(saved_model_path, compiled_saved_model_path)

Here, saved_model.compile in tensorflow.neuron converts the TensorFlow SavedModel into a Neuron SavedModel.

The currently supported Ops are listed here. 75 kinds of Op are supported.

What happens when the Neuron Compiler compiles a TensorFlow SavedModel into the compiled_saved_model above? I will dig into that starting from

Tutorial: Getting Started with TensorFlow-Neuron (ResNet-50 Tutorial)

To pip.conf (I use pyenv on Ubuntu under WSL on Windows, so this went into ~/.pyenv/versions/3.6.6/pip.conf) I added

[global]
extra-index-url = https://pip.repos.neuron.amazonaws.com

and then ran

pip install tensorflow-neuron

to install TensorFlow - Neuron.

Under the ~/.pyenv/versions/3.6.6/lib/python3.6/site-packages directory,

tensorflow

tensorflow_core

and other packages were installed.

tensorflow_core/python/neuron/__init__.py contains:

from tensorflow.python.neuron.python import graph_util
from tensorflow.python.neuron.python import saved_model
from tensorflow.python.neuron.python import predictor
from tensorflow.python.neuron.python.fuse import fuse
from tensorflow.python.neuron.ops.gen_neuron_op import neuron_op

so

tensorflow_core/python/neuron/python/saved_model.py looks like the actual implementation.

In this file I found:

@tf_export('neuron.saved_model.compile')
def convert_to_inference_model(model_dir, new_model_dir, batch_size=1,
                               model_shape_feed_dict=None, model_feed_dict=None,
                               tags=None, signature_def_key=None, strip_default_attrs=False,
                               **kwargs):
    """Convert a `SavedModel` to a Neuron-optimized `SavedModel`.

Bingo. Digging further down from convert_to_inference_model, compile_subgraph (tensorflow_core/python/neuron/python/graph_util.py) launches the neuron-cc command as a subprocess and compiles the TensorFlow model (graph).

That's all for today.

2020-03-10

AWS Neuron SDK : What are Neuron Compiler / Neuron Runtime?


The AWS Neuron SDK documentation is on GitHub.

There are documents for TensorFlow, MXNet, and PyTorch.

Following each document, the flow below should take you all the way to deployment.

Choose ML Framework
Build Model
Train Model
Compile Model using AWS Neuron
Deploy Model on Inf1 and AWS Inferentia

Today I will look at the step common to all of them, "Compile Model using AWS Neuron".

The compiler is the Neuron Compiler (neuron-cc). According to this document, the Neuron Compiler reads TensorFlow, MXNet, PyTorch, and ONNX models and works as an Ahead-of-Time (AOT) compiler. It converts the FP32 values in the model to BF16. The final output is a NEFF file (Neuron Executable File Format), which is then used by the Neuron Runtime.
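For readers unfamiliar with BF16: it keeps the sign bit and the 8 exponent bits of FP32 and only 7 of the 23 mantissa bits, so a BF16 value is just the upper 16 bits of the FP32 encoding. The sketch below shows that relationship using plain truncation. This is an assumption made for simplicity; the documentation quoted here does not say which rounding mode neuron-cc uses, and real converters often round to nearest even.

```python
import struct


def fp32_to_bf16_bits(x: float) -> int:
    """Return the 16-bit BF16 pattern obtained by truncating the FP32 encoding.

    Truncation is an assumption for this sketch, not neuron-cc's documented behavior.
    """
    (bits,) = struct.unpack("<I", struct.pack("<f", x))
    return bits >> 16               # drop the low 16 mantissa bits


def bf16_bits_to_fp32(bits: int) -> float:
    """Expand a 16-bit BF16 pattern back to a float by zero-filling the low bits."""
    (value,) = struct.unpack("<f", struct.pack("<I", (bits & 0xFFFF) << 16))
    return value


for x in (1.0, 3.14159265, -0.1):
    b = fp32_to_bf16_bits(x)
    print(f"{x!r} -> 0x{b:04X} -> {bf16_bits_to_fp32(b)!r}")
```

The dynamic range stays the same as FP32 because the exponent field is untouched; only precision is lost, which is why this conversion can be applied to a trained FP32 model without changing its structure.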

Looking through the Neuron Compiler's Command line reference page, however, the supported frameworks are listed as TensorFlow, MXNet, and ONNX. For TensorFlow, Frozen GraphDef and SavedModel appear to be supported.

The Neuron Runtime executes the code in the NEFF file on the Inferentia chips. It can also assign different models to different NeuronCore Groups and run them.

A NeuronCore Group is described as follows:

A NeuronCore Group is a set of NeuronCores that are used to load and run a compiled model. At any point in time, only one model will be running in a NeuronCore Group. Within a NeuronCore Group, loaded models can be dynamically started and stopped, allowing for dynamic context switching from one model to another.

As this says, it appears to be a mechanism for running one model on multiple NeuronCores. Connecting multiple NeuronCores so that they work together is NeuronCore Pipeline.
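The quoted definition can be pictured with a toy model: a group owns a set of cores, several models may be loaded into it, and at most one of them runs at any time. The class and method names below are hypothetical and exist only to illustrate the description; they are not the Neuron Runtime API.

```python
class NeuronCoreGroup:
    """Toy model of the quoted description (hypothetical, not the Neuron API)."""

    def __init__(self, core_ids):
        self.core_ids = tuple(core_ids)   # NeuronCores that belong to this group
        self.loaded = set()               # models loaded into the group
        self.running = None               # at most one model runs at a time

    def load(self, model_name):
        self.loaded.add(model_name)

    def start(self, model_name):
        if model_name not in self.loaded:
            raise ValueError(f"{model_name} is not loaded in this group")
        # Starting another loaded model replaces the running one,
        # which models the "dynamic context switching" in the quote.
        self.running = model_name

    def stop(self):
        self.running = None


group = NeuronCoreGroup(core_ids=[0, 1])
group.load("model_a")
group.load("model_b")
group.start("model_a")
group.start("model_b")    # context switch from model_a to model_b
print(group.running)
```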

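A chip has four cores, so the idea is to connect the cores and process in a pipeline. For reasons I don't understand, NeuronCore Pipeline Mode does not appear to be supported on inf1.xlarge and inf1.2xlarge. That is what this slide in the video says.

To enable NeuronCore Pipeline, you need to pass --num-neuroncores with the number of cores as a parameter when running the Neuron Compiler. The default is 1, so no pipelining happens. Even a single chip has four cores, so specifying --num-neuroncores 4 should allow a 4-core pipeline, and yet for some reason...

Pipeline Mode is supposedly not supported on inf1.xlarge and inf1.2xlarge. Sigh. But wait, the Japanese site says something different. The slide says NeuronCore Pipeline Mode, whereas the site says Inferentia chip-to-chip interconnect. So Pipeline Mode itself is supported, but with only one chip there is no chip-to-chip interconnect. It therefore seems that Pipeline Mode can be used on inf1.xlarge and inf1.2xlarge after all.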
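As a sketch of what passing the core count looks like, the helper below only builds a neuron-cc command line and prints it; it does not run the compiler. Only --num-neuroncores comes from the text above. The compile subcommand and the --framework and --output options are written from my memory of the command line reference and should be treated as assumptions, so check neuron-cc's own help before relying on them. The file names are placeholders.

```python
import shlex


def build_neuron_cc_command(model_path, output_path, num_neuroncores=1,
                            framework="TENSORFLOW"):
    """Build (not run) a neuron-cc command line.

    Hypothetical helper: only --num-neuroncores is confirmed by the post;
    'compile', --framework and --output are assumptions.
    """
    if num_neuroncores < 1:
        raise ValueError("num_neuroncores must be at least 1")
    return [
        "neuron-cc", "compile", model_path,
        "--framework", framework,
        "--num-neuroncores", str(num_neuroncores),
        "--output", output_path,
    ]


# Four cores on a single Inferentia chip, as discussed above.
cmd = build_neuron_cc_command("model.pb", "model.neff", num_neuroncores=4)
print(" ".join(shlex.quote(part) for part in cmd))
```

Keeping the command as a list makes it straightforward to hand to subprocess.run later without shell quoting problems.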

2020-03-09

What is AWS EC2 inf1?


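AWS Inferentia came across my Twitter timeline.

These are the slides from the announcement. They prompted me to look into a number of things.

This week I plan to write about AWS Inferentia and AWS EC2 inf1, the service built on it.

This video explains the services that use AWS Inferentia.

First, what are AWS Inferentia and AWS EC2 inf1?

Inferentia is a chip with four Neuron cores, each with a cache, plus external DRAM. Because of the cache, the parameters a model needs do not have to be read from external DRAM.

AWS EC2 inf1 gives you access to a PCIe board carrying four of these Inferentia chips. The Neuron cores support FP16, BF16, and INT8.

The list of services appears at this point in the video.

There are four instance types: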

inf1.xlarge : vCPUs (4), Memory (8GB), Inferentia Chips (1)
inf1.2xlarge : vCPUs (8), Memory (16GB), Inferentia Chips (1)
inf1.6xlarge : vCPUs (24), Memory (48GB), Inferentia Chips (4)
inf1.24xlarge : vCPUs (96), Memory (192GB), Inferentia Chips (16)

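With 1, 4, and 16 chips, it looks like the first two types use just one chip of a board. Four chips make one board, and 16 chips make four boards.

With four boards, the vCPU count is roughly vCPU (24) * 4. Memory is likewise 48GB x 4 = 192GB.

With a single chip, NeuronCore Pipeline Mode apparently cannot be used.

As for price, one board costs less than $2 per hour. That is about 5,000 yen per day.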

inf1.xlarge : $ 0.368/Hr
inf1.2xlarge : $ 0.584/Hr
inf1.6xlarge : $ 1.905/Hr
inf1.24xlarge : $ 7.619/Hr
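The arithmetic behind these figures can be checked with a few lines. The chip counts and hourly prices are the ones listed above, and four NeuronCores per chip comes from the chip description. The exchange rate is my own rough assumption for illustration and is not from the original post.

```python
CORES_PER_CHIP = 4     # from the post: four Neuron cores per Inferentia chip
JPY_PER_USD = 109.0    # assumption: rough exchange rate, not from the post

INSTANCES = {
    "inf1.xlarge":   {"chips": 1,  "usd_per_hour": 0.368},
    "inf1.2xlarge":  {"chips": 1,  "usd_per_hour": 0.584},
    "inf1.6xlarge":  {"chips": 4,  "usd_per_hour": 1.905},
    "inf1.24xlarge": {"chips": 16, "usd_per_hour": 7.619},
}


def summarize(name):
    """Derive core count and daily / per-chip cost for one instance type."""
    spec = INSTANCES[name]
    usd_per_day = spec["usd_per_hour"] * 24
    return {
        "neuron_cores": spec["chips"] * CORES_PER_CHIP,
        "usd_per_day": usd_per_day,
        "jpy_per_day": usd_per_day * JPY_PER_USD,
        "usd_per_chip_hour": spec["usd_per_hour"] / spec["chips"],
    }


for name in INSTANCES:
    s = summarize(name)
    print(f"{name:14s} cores={s['neuron_cores']:3d} "
          f"${s['usd_per_day']:.2f}/day "
          f"${s['usd_per_chip_hour']:.3f}/chip-hour")
```

The one-board type (inf1.6xlarge) works out to $45.72 per day, which is where the "about 5,000 yen per day" figure comes from at an exchange rate near the assumed one. The per-chip hourly price of the two multi-chip types is nearly identical.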

A case study of the Alexa service using inf1 starts at this point in the video.

To use inf1, you use the AWS Neuron SDK and follow the flow below to convert a TensorFlow, MXNet, or PyTorch model into code for the Inferentia chip.

Choose ML Framework
Build Model
Train Model
Compile Model using AWS Neuron
Deploy Model on Inf1 and AWS Inferentia

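At the moment, PyTorch does not seem to be fully supported yet (it can be called from PyTorch, but internally it looks like the model is converted to TensorFlow).

The "Compile Model using AWS Neuron" step appears to be where FP32 is converted to BF16.

Tomorrow I would like to look into the AWS Neuron SDK.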

2020-03-08

Cerebras CS-1 presentation material from ScaledML2020


For the record.

Cerebras Systems' presentation material from ScaledML2020

For training on GPUs:

Traditional: Layer-Sequential (Single GPU : Modest Data Parallel)

　　　Run one layer at once.

　　　Medium batch size

Traditional: Layer-Sequential (GPU Cluster : Extreme Data Parallel)

　　　Run layer replicas on parallel devices.

　　　Extremely large batch size

　　　Weight sync overhead

WSE: Layer-Sequential (Replicas: Modest Data Parallel)

　　　Run layer replicas on fabric sections

　　　Medium batch size

　　　Low weight sync overhead

WSE: Layer-Sequential (Non Replicas: Low Data Parallel)

　　　Replicas not constrained to device

　　　Run one layer at once on entire wafer

　　　Small batch size

　　　No weight sync overhead

WSE: Layer-Pipelined (Model Parallel, Modest Data Parallel)

　　　Run all

2020-03-08

Xilinx FINN v0.2b beta


Xilinx FINN, which I also covered on this blog on February 23:

vengineer.hatenablog.com

Xilinx FINN v0.2b (beta) appears to have been released.

xilinx.github.io

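Looking through the Function Verification page, I found the explanation below.

I quote it here. It mentions PyVerilator, which seems to control Verilator from Python. There is also a fork by Yaman Umuroglu (who works on FINN), and it seems that fork of PyVerilator is the one to use.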

Simulation using Python

This simulation can be used right after the Brevitas Export or when the network does not contain any HLS custom nodes, so right after the streamlining transformations and before the nodes are converted into HLS layers.

Simulation using C++

This simulation can be used for a model containing several HLS custom operations. Because they are based on finn-hlslib function, C++ code can be generated from this single nodes and they can be executed by compiling the code and running the resulting executables.

Emulation using PyVerilator

The emulation using PyVerilator can be used when IP blocks were generated, either node by node or of a whole design. For that purpose PyVerilator gets the generated verilog files.

Looking at the PyVerilator example, simple_example.py:

sim = pyverilator.PyVerilator.build('counter.v')

reads the Verilog HDL code and builds it with Verilator, and then

# start gtkwave to view the waveforms as they are made
sim.start_gtkwave()

# add all the io and internal signals to gtkwave
sim.send_to_gtkwave(sim.io)
sim.send_to_gtkwave(sim.internals)

selects the signals whose waveforms are dumped to GTKWave. After that, the example drives the input signals (CLK, Reset).